Fulltext search infrastructure
Fred Schaettgen
kde.sch at ttgen.net
Tue Mar 29 01:52:56 CEST 2005
Hi,
Last week I talked to Roberto Cappuccio, who started writing a search tool
called KAT (http://kat.sf.net). It's not quite ready, but at least it
promises to fulfill a rather simple wish I have had for a long time: being able
to do fulltext searches over various file formats, including PDF, DOC and
SXW.
To be honest, I still don't know what exactly is in the scope of klink and
what not, but my guess is that extracting text information from different
file formats will at least be a tiny part of the whole thing.
So if it is, then I would like to suggest that we agree on a common interface
to extract - possibly lengthy - fulltext data from documents. Writing a good
search engine is hard, but having to maintain various plugins for all kinds
of formats doesn't make things easier.
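To make the idea of a common interface a bit more concrete, here is a minimal sketch in Python. It is only an illustration and not an existing KDE or KAT API: the names FulltextExtractor, PlainTextExtractor, register and extract_fulltext are all made up for this example. It assumes that each format plugin declares the file extensions it handles and yields the text in chunks, so that lengthy documents never have to be held in memory at once.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Tuple


class FulltextExtractor(ABC):
    """Hypothetical plugin interface: one subclass per file format."""

    # File extensions (lowercase, with leading dot) this plugin handles.
    extensions: Tuple[str, ...] = ()

    @abstractmethod
    def extract(self, path: str) -> Iterator[str]:
        """Yield the document's text chunk by chunk."""


class PlainTextExtractor(FulltextExtractor):
    """Trivial example plugin: plain text files, one chunk per line."""

    extensions = (".txt",)

    def extract(self, path: str) -> Iterator[str]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")


# Maps a file extension to the plugin responsible for it.
REGISTRY: Dict[str, FulltextExtractor] = {}


def register(extractor: FulltextExtractor) -> None:
    """Make a plugin available for all extensions it declares."""
    for ext in extractor.extensions:
        REGISTRY[ext] = extractor


def extract_fulltext(path: str) -> Iterator[str]:
    """Pick the plugin by file extension and return its text chunks."""
    ext = os.path.splitext(path)[1].lower()
    extractor = REGISTRY.get(ext)
    if extractor is None:
        raise LookupError(f"no fulltext plugin registered for {ext!r}")
    return extractor.extract(path)


register(PlainTextExtractor())
```

A search engine, kfind or any other caller would then only ever call extract_fulltext() and would not need to know which plugin did the work.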
In another post on this list someone - Scott IIRC - suggested extending the
kfile plugins to return fulltext data just like other metadata. Another
option would be to introduce a new fulltext kioslave, which uses its own
plugins to extract the data from the files. There are pros and cons for each
approach, but because it was easier to create a new plugin type (that way I
can keep using my stable non-CVS KDE), I chose the second alternative.
My idea was to let the kioslave emit an XML file with the fulltext data, with
additional markup for structural entities like pages, lines, timestamps,
whatever. With such an interface it could easily be used by other programs,
including non-indexing search tools like kfind, other 3rd-party tools or
maybe even text-to-speech applications.
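As a rough illustration of what such XML output could look like, here is a small Python sketch. The element and attribute names (fulltext, page, line, source, number) are assumptions made up for this example, not an agreed format, and the functions build_fulltext_xml and plain_text are hypothetical helpers. The first one wraps already extracted text in page and line markup; the second shows how a consumer that does not care about the structure can get the plain text back.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_fulltext_xml(source: str, pages: List[List[str]]) -> str:
    """Wrap extracted text in hypothetical <page>/<line> markup.

    `pages` is a list of pages, each page being a list of text lines.
    """
    root = ET.Element("fulltext", {"source": source})
    for page_no, lines in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", {"number": str(page_no)})
        for line_no, text in enumerate(lines, start=1):
            line = ET.SubElement(page, "line", {"number": str(line_no)})
            line.text = text  # ElementTree escapes <, > and & on output
    return ET.tostring(root, encoding="unicode")


def plain_text(xml_string: str) -> str:
    """Consumer side: drop the structure and keep only the text."""
    root = ET.fromstring(xml_string)
    return "\n".join(line.text or "" for line in root.iter("line"))


if __name__ == "__main__":
    doc = build_fulltext_xml("report.pdf", [["Hello", "world"], ["a < b"]])
    print(doc)
    print(plain_text(doc))
```

An indexer could use the page and line numbers to point the user at the exact hit, while a tool like kfind or a text-to-speech program would simply ignore the markup.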
Could this be of any use for klink, too, or is it completely off-topic?
regards
Fred
Btw, please CC me on replies; I'm not subscribed.
--
Fred Schaettgen
kde.sch at ttgen.net