[Nepomuk] [RFC] Better Full text search

Sat May 4 14:59:25 UTC 2013

On Saturday 04 May 2013 20.11:34 Vishesh Handa wrote:

> > I think that's a good idea. We're also already using it that way to be
> > able to search through emails with markup in the email feeder, and I see
> > no reason why we can't extend that to other resource types (after all the
> > property is exactly for this purpose).
> > So that means, in the future all feeders should push all information which
> > should be matched by full text searching to nie:plainTextContent, right?
> 
> I was actually thinking of adding a separate API that streams the text,
> instead of the current approach of loading everything into memory and
> pushing it. The File Indexers already have a function like that.
> 

That would certainly be better for larger amounts of data. In Akonadi we
have to hold a copy of the item's content in memory anyway (no streaming
support), so the best we can achieve is to avoid a full second copy, and I'm
not sure that helps a lot (assuming that Akonadi items are generally rather
small). We would probably gain the most if the data was pushed over a socket
instead of D-Bus (although this statement isn't based on any facts, just
hearsay).
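
To make the "avoid a full second copy" point concrete, here is a minimal
sketch of a streamed push. It is not an existing Nepomuk or Akonadi API: the
names push_text_streamed, iter_chunks and CHUNK_SIZE are hypothetical, and
the sink callable merely stands in for whatever transport (socket, D-Bus)
would carry the text.

```python
import io

# Hypothetical chunk size; a real value would have to be tuned.
CHUNK_SIZE = 64 * 1024


def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive pieces of a text stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def push_text_streamed(stream, sink, chunk_size=CHUNK_SIZE):
    """Feed the stream to `sink` one chunk at a time.

    Only one chunk is held here at any moment, so no second full copy of
    the text is built up. Returns the number of characters pushed.
    """
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        sink(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    received = []
    pushed = push_text_streamed(io.StringIO("some item body"), received.append)
    print(pushed, len(received))
```

For a feeder that already holds the whole item in memory, wrapping that
buffer in a stream like this only avoids the extra copy on the way out; the
first copy stays.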

> > The alternative would of course be to use a separate dedicated fulltext
> > index, which may have better performance and some more features
> > (tokenizer, stemming etc.), but would obviously complicate the setup
> > again (i.e. fulltext query => filter by type in Nepomuk => retrieve
> > Akonadi item). So not necessarily the way to go, but I wanted to bring it
> > to the table anyway as it's IMO not conflicting with what Nepomuk
> > provides (the semantic analysis), and could give better results
> > (performance and feature wise) than letting Virtuoso do all the work.
> 
> I have been thinking about the same thing - we have no support for stemming
> or any other advanced feature we want. I'll talk more about this later. I
> have an idea which might be very controversial.
> 

I'm all ears =)

After all, the primary value of Nepomuk for me is the network of data: the
ontologies for categorizing the chunks, and the relations between them.
The fulltext index is IMO only in Nepomuk "because we can", and because it
seemed like a simple and straightforward solution. I wouldn't see any
fundamental problem with separating the fulltext index from the semantic
database, apart from the potential additional complexity (not necessarily
much) and the potential performance loss due to the need to query two
databases (the potential gains are probably larger though ;-)
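
As a rough illustration of the two-database flow (fulltext query, then
filter by type in the semantic store), here is a toy sketch. Everything in
it is an assumption made for illustration: FullTextIndex, stem and
search_by_type are hypothetical names, the suffix stripper is only a crude
stand-in for a real stemmer, and a plain dict plays the role of the
semantic database that knows each resource's type.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


class FullTextIndex:
    """Toy inverted index: stemmed term -> set of resource ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, resource_id, text):
        for token in tokenize(text):
            self._postings[stem(token)].add(resource_id)

    def search(self, query):
        """Return the ids of resources containing every query term."""
        terms = [stem(t) for t in tokenize(query)]
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def search_by_type(index, types_by_resource, query, wanted_type):
    """Stage 1: fulltext lookup. Stage 2: keep only hits of the wanted type.

    `types_by_resource` stands in for a query against the semantic store.
    """
    hits = index.search(query)
    return sorted(r for r in hits if types_by_resource.get(r) == wanted_type)
```

The second stage is where the extra round trip comes from: the fulltext
index only returns ids, and the type check has to go to the other store.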