[Nepomuk] [RFC] Better Full text search

Vishesh Handa me at vhanda.in
Sat May 4 14:41:34 UTC 2013


On Sat, May 4, 2013 at 7:10 PM, Christian Mollekopf <chrigi_1 at fastmail.fm> wrote:

> On Saturday 04 May 2013 18.49:05 Vishesh Handa wrote:
> > Hey guys
> >
> > I was thinking of moving all the plain text related to a file into the
> > nie:plainTextContent of the resource. So in the case of music we would
> > have:
> >
> > <res> nie:plainTextContent "title artist album whateverElse" .
> >
> > For files, we would append the file name and any other plain text that
> > we want searched to nie:plainTextContent. A search for any combination
> > of text then only has to look through the plain text content.
> >
> > Opinions?
>
> Hey Vishesh,
>
> I think that's a good idea. We're already using it that way in the email
> feeder, to be able to search through emails with markup, and I see no
> reason why we can't extend that to other resource types (after all, the
> property exists for exactly this purpose).
> So in the future, all feeders should push everything that full text
> search should match to nie:plainTextContent, right?
>

I was actually thinking of adding a separate API through which the text is
streamed, instead of the current approach of loading everything into memory
and then pushing it. The file indexers already have a function like that.
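
To illustrate the difference between streaming and load-then-push, here is a
minimal Python sketch. It is not the Nepomuk API: PlainTextSink,
iter_text_chunks and stream_plain_text are hypothetical names invented for
this example, and the sink only counts what it receives instead of storing
anything.

```python
import io
from typing import Iterator, TextIO


def iter_text_chunks(stream: TextIO, chunk_size: int = 4096) -> Iterator[str]:
    """Yield successive chunks so the whole text is never held in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


class PlainTextSink:
    """Hypothetical receiver for the text destined for nie:plainTextContent."""

    def __init__(self) -> None:
        self.chunks_received = 0
        self.chars_received = 0

    def append(self, chunk: str) -> None:
        # A real sink would forward the chunk to storage; this one only counts.
        self.chunks_received += 1
        self.chars_received += len(chunk)


def stream_plain_text(stream: TextIO, sink: PlainTextSink,
                      chunk_size: int = 4096) -> None:
    """Push the text to the sink one chunk at a time."""
    for chunk in iter_text_chunks(stream, chunk_size):
        sink.append(chunk)


if __name__ == "__main__":
    sink = PlainTextSink()
    text = io.StringIO("title artist album " * 1000)
    stream_plain_text(text, sink, chunk_size=1024)
    print(sink.chunks_received, sink.chars_received)
```

With this shape, peak memory use depends on chunk_size rather than on the
size of the file being indexed.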


>
> The alternative would of course be to use a separate, dedicated full text
> index, which may offer better performance and more features (tokenizer,
> stemming, etc.), but would obviously complicate the setup again (full text
> query => e.g. filter by type in Nepomuk => retrieve Akonadi item). So
> not necessarily the way to go, but I wanted to bring it to the table
> anyway, as IMO it doesn't conflict with what Nepomuk provides (the
> semantic analysis) and could give better results (performance- and
> feature-wise) than letting Virtuoso do all the work.
>

I have been thinking about the same thing: we have no support for stemming
or any of the other advanced features we want. I'll talk more about this
later. I have an idea which might be very controversial.
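
To make the tokenizer and stemming point concrete, here is a toy Python
sketch of a dedicated full text index. Everything in it is an assumption made
for illustration: FullTextIndex, tokenize and stem are invented names, and the
suffix stripping is a deliberately naive stand-in for a real stemmer. It is
not what Virtuoso or any existing indexer does.

```python
import re
from collections import defaultdict

# Toy suffix list; a real stemmer (Porter, Snowball, ...) is far more careful.
_SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in _SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


class FullTextIndex:
    """Hypothetical inverted index mapping stemmed terms to resource ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, resource_id, text):
        for token in tokenize(text):
            self._postings[stem(token)].add(resource_id)

    def search(self, query):
        """Return the ids of resources that contain every query term."""
        terms = [stem(t) for t in tokenize(query)]
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


if __name__ == "__main__":
    index = FullTextIndex()
    index.add("res1", "Indexing emails with markup")
    index.add("res2", "Music title artist album")
    print(index.search("email index"))
```

Because both the indexed text and the query go through the same stem()
function, a query for "email index" can match a resource containing
"Indexing emails".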


>
> >
> > We can easily do this for the 4.11 release, because everyone already
> > needs to re-index everything due to the migration.
>
> Cool.
>
> Cheers,
> Christian
>



-- 
Vishesh Handa