[Nepomuk] [RFC] Better Full text search
Christian Mollekopf
chrigi_1 at fastmail.fm
Sat May 4 14:59:25 UTC 2013
On Saturday 04 May 2013 20.11:34 Vishesh Handa wrote:
> > I think that's a good idea. We're also already using it that way to be
> > able to search through emails with markup in the email feeder, and I see
> > no reason why we can't extend that to other resource types (after all the
> > property is exactly for this purpose).
> > So that means, in the future all feeders should push all information which
> > should be matched by full text searching to nie:plainTextContent, right?
>
> I was actually thinking of adding a separate API through which the text is
> streamed, instead of the current approach of loading everything into memory
> and pushing it. The File Indexers already have a function like that.
>
That would certainly be better for larger amounts of data. In Akonadi we
have to hold a copy of the item's content in memory anyway (no streaming
support), so the best we can achieve is to avoid a full second copy, and I'm
not sure that helps a lot (assuming that Akonadi items are generally
rather small). We would probably gain the most if the data was pushed
over a socket instead of D-Bus (although this statement isn't based on any
facts, just hearsay).
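
To make the streaming idea concrete, here is a minimal sketch in Python. All
names (iter_chunks, TermCounter, feed, finish) are hypothetical and are not
part of any Nepomuk or Akonadi API; the point is only that the consumer sees
one chunk at a time instead of the whole text.

```python
import io


def iter_chunks(stream, chunk_size=4096):
    """Yield successive chunks from a text stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


class TermCounter:
    """Hypothetical stand-in for an index that accepts text incrementally."""

    def __init__(self):
        self.terms = {}
        self._tail = ""  # possibly incomplete word carried over between chunks

    def _count(self, word):
        key = word.lower()
        self.terms[key] = self.terms.get(key, 0) + 1

    def feed(self, chunk):
        text = self._tail + chunk
        words = text.split()
        # If the text does not end in whitespace, the last word may continue
        # in the next chunk, so hold it back instead of counting it now.
        if words and not text[-1].isspace():
            self._tail = words.pop()
        else:
            self._tail = ""
        for word in words:
            self._count(word)

    def finish(self):
        """Flush the held-back word and return the term counts."""
        if self._tail:
            self._count(self._tail)
            self._tail = ""
        return self.terms


counter = TermCounter()
for chunk in iter_chunks(io.StringIO("hello world hello"), chunk_size=4):
    counter.feed(chunk)
result = counter.finish()
```

Only one chunk plus one held-back word is in memory on the consumer side at
any time. In the Akonadi case the producer still holds the full item, so the
saving is limited to the second copy.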
> > The alternative would of course be to use a separate dedicated fulltext
> > index, which may have better performance and some more features
> > (tokenizer, stemming etc.), but would obviously complicate the setup again
> > (i.e. fulltext query => filter by type in nepomuk => retrieve akonadi
> > item). So not necessarily the way to go, but I wanted to bring it to the
> > table anyway as it's IMO not conflicting with what nepomuk provides (the
> > semantic analysis), and could give better results (performance and
> > feature wise) than letting virtuoso do all the work.
>
> I have been thinking about the same thing - we have no support for stemming
> or any other advanced feature we want. I'll talk more about this later. I
> have an idea which might be very controversial.
>
I'm all ears =)
After all, for me the primary value of nepomuk is the network of data: the
ontologies for categorizing the chunks, and the relations between them.
The fulltext index is IMO only in nepomuk "because we can", and it seemed like
a simple and straightforward solution. I wouldn't see any fundamental problem
with separating the fulltext index from the semantic database, apart from the
potential additional complexity (not necessarily much) and the potential
performance loss due to the need to query two databases (the potential gains
are probably larger though ;-).
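
As a rough illustration of what querying two databases would look like, here
is a small Python sketch. The FullTextIndex class, the types dictionary and
search_by_type are all hypothetical in-memory stand-ins: the first for a
dedicated fulltext index, the second for the type information kept in the
semantic database. Nothing here reflects the actual Nepomuk or Virtuoso APIs.

```python
import re
from collections import defaultdict


class FullTextIndex:
    """Hypothetical dedicated fulltext index: a plain inverted index."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of resource ids

    @staticmethod
    def tokenize(text):
        # Very simple tokenizer: lowercase runs of word characters.
        return re.findall(r"\w+", text.lower())

    def add(self, resource_id, text):
        for term in self.tokenize(text):
            self._postings[term].add(resource_id)

    def search(self, query):
        """Return the ids of resources containing every term of the query."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def search_by_type(index, types, query, wanted_type):
    """Step 1: fulltext query. Step 2: filter the hits by resource type."""
    hits = index.search(query)
    return sorted(r for r in hits if types.get(r) == wanted_type)


index = FullTextIndex()
index.add("mail:1", "Meeting notes for Friday")
index.add("file:2", "Friday meeting agenda")

# Stand-in for the type information held in the semantic database.
types = {"mail:1": "Email", "file:2": "File"}

emails = search_by_type(index, types, "friday meeting", "Email")
```

The third step from the quoted text, retrieving the Akonadi item for each
remaining id, would follow after the filter. The extra cost is the second
lookup per query; the gain is that the tokenizer (and stemming, if added)
lives entirely in the fulltext side.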