[Nepomuk] [RFC] Better Full text search

Thu May 9 10:50:13 UTC 2013

On Thu, May 9, 2013 at 3:21 PM, Ignacio Serantes <kde at aynoa.net> wrote:

> Hi,
>
> I see several problems with this implementation: performance, reliability
> and maintenance.
>
> Variable-length string fields are not designed for performance, only to
> store variable-size text, so I'm a little worried about performance when
> a large amount of data is stored in them. If we are trying to create
> something like an index, an oversized index is not a good solution. Someone
> suggested asking the Virtuoso developers, and this is a good idea because
> they can confirm, or not, this point.
>

Alright. I'll bring it up on the Virtuoso mailing list and cc the Nepomuk
mailing list.

> Reliability is another problem: what information are you planning to
> store? In my case I use nao:altLabel extensively to locate resources:
> Hyo Jung Kim, Hyolin, Hyolyn, Hyorin, 김효정, 효린 are all valid
> names for nco:fullname 효린 (Hyorin). With music there is a lot of
> information available that could be interesting when you are searching:
> album artist, director, lyricist or composer, so you must add this
> information or full text search will be limited to the indexed values.
> Another example is comments. I sometimes use them to search photographs,
> because they are easy to handle with Dolphin, so others are probably doing
> the same with photos or other stuff. By the way, you need to add tag
> labels too or full text search will fail. In brief, if you add a lot of
> data performance will be affected, but if you add little data reliability
> will be affected.
>

You're correct about the data reliability. I'm not too sure about the
performance, especially if we decide to go with another full text search
engine instead of Virtuoso in the future. I have been thinking about it;
maybe something nice like Xapian[1], which supports stemming in different
languages.

> Maintenance is the third problem, and it could be a big one if a lot of
> information is stored in this field: because the field is unformatted,
> every time you update a resource you must rebuild it for all the
> resources involved, so you can't use this ontology to store any manual
> information. Currently Bangarang and Nepoogle allow updating resources, a
> common feature for a database, so if you are storing the album artist and I
> update the nco:Contact, you must update all music pieces related to this
> album artist or full text search will fail.
>

Yeah. I was just thinking about the file indexer. If you incorporate the
web-miner, then it would be the web-miner's responsibility to update the
full text index (or nie:plainTextIndex) for every resource that it attaches
other resources to.

>
> I read some comments saying that Nepomuk is not a data store, which
> concerns me. I use Nepomuk extensively as a data store for tags, comments,
> ratings and other stuff through Nepoogle, because without that Nepomuk is
> much less useful to me and, honestly, I can't understand it without this
> functionality. Some time ago some effort was spent explaining that Nepomuk
> is not a file search, so don't transform it into a resource search tool.
>

I was hoping it would be more of a full text + structured data storage
tool. It is not a place to store plain text such as the file's contents and
expect to get them back exactly as they were.

> Finally, about the lyrics: nie:plainTextContent seems like the right
> place to store a text representation of audio, and will probably be the
> right place to store video subtitles, both as a cache and as a means of
> searching, so it seems this ontology will end up like the Marx Brothers'
> cabin scene in A Night at the Opera :).
>
> And, as usual, sorry for my Engrish :).
>

Your English is fine.

I'm worried that if we do not put all the plain text in one place we cannot
reliably solve the searching problem. Users mostly just provide plain text
when searching. They just provide words "blah blah blah". Most users would
not know about higher semantics such as "hasTag:" or "performer:". That's
only for more technical users.

Searching for x words leads to 2x unions, which is very slow. In the case I
highlighted in the first email, it takes a good 26 seconds on my system.
That's just too slow. The user expects feedback within a second at most,
and generally in even less.

Do you have any suggestions on how to fix that?
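
As a rough illustration of how the query grows, here is a Python sketch that
builds a SPARQL-like string with two UNION branches per search word: one for
literals on the resource itself and one for literals on a directly related
resource. The query shape is an assumption made for illustration (the query
from the first email is not reproduced in this message), and `build_query`
is a hypothetical helper. `bif:contains` is Virtuoso's full text predicate.

```python
def build_query(words):
    """Return a SPARQL-like query string and its UNION branch count."""
    groups = []
    branches = 0
    for i, word in enumerate(words):
        # Branch 1: the word appears in a literal attached to ?r itself.
        direct = f'{{ ?r ?p{i} ?o{i} . ?o{i} bif:contains "{word}" . }}'
        # Branch 2: the word appears in a literal one hop away from ?r.
        related = (
            f'{{ ?r ?p{i} ?x{i} . ?x{i} ?q{i} ?o{i} . '
            f'?o{i} bif:contains "{word}" . }}'
        )
        groups.append(f"{direct} UNION {related}")
        branches += 2
    body = "\n  ".join(f"{{ {g} }}" for g in groups)
    return f"SELECT DISTINCT ?r WHERE {{\n  {body}\n}}", branches


query, branches = build_query(["blah", "foo", "bar"])
print(branches)
print(query)
```

Under this assumed shape every extra word adds two more branches, each with
its own full text lookup, and the results of all the groups still have to be
joined on ?r.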

Additionally, with the query I showed above you also have a problem with
data like this:

res a nmm:MusicPiece .
nmm:MusicPiece rdfs:comment "Used to assign music-specific properties such
a BPM to video and audio"

Searching for 'assign music' can return music pieces that have nothing to
do with the search terms, because the words match the comment on the class
rather than anything on the resource itself. I'm not sure how to solve this.
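
The false positive can be reproduced with a toy example. This is a Python
sketch over a made-up in-memory triple list, not a real Nepomuk or Virtuoso
query; the song resources, their titles and the helper names are
hypothetical, and plain substring matching stands in for full text search.

```python
# Toy triple store; all data here is made up for illustration.
triples = [
    ("song:1", "rdf:type", "nmm:MusicPiece"),
    ("song:1", "nie:title", "Blue Monday"),
    ("song:2", "rdf:type", "nmm:MusicPiece"),
    ("song:2", "nie:title", "Assign Music To Me"),
    ("nmm:MusicPiece", "rdfs:comment",
     "Used to assign music-specific properties such a BPM to video and audio"),
]


def text_within_one_hop(res):
    """Yield the objects of `res` and the objects of those objects."""
    for s, _, o in triples:
        if s == res:
            yield o
            for s2, _, o2 in triples:
                if s2 == o:
                    yield o2


def search(words):
    """Return music pieces whose reachable text contains every word."""
    hits = []
    for s, p, o in triples:
        if p == "rdf:type" and o == "nmm:MusicPiece":
            text = " ".join(text_within_one_hop(s)).lower()
            if all(w.lower() in text for w in words):
                hits.append(s)
    return hits


hits = search(["assign", "music"])
print(hits)
```

Both songs come back, although only song:2 mentions the words in its own
data; song:1 matches purely through the rdfs:comment on its class.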

If we want to change how plain text is stored, now is the time to do it,
because with the 4.11 release most users will already have to re-index all
their files and PIM data.

[1] http://xapian.org/

>
>
> On Sat, May 4, 2013 at 7:09 PM, Vishesh Handa <me at vhanda.in> wrote:
>
>>
>>
>>
>> On Sat, May 4, 2013 at 9:18 PM, <phreedom at yandex.ru> wrote:
>>
>>> On Saturday 04 May 2013 20:14:37 Vishesh Handa wrote:
>>> > On Sat, May 4, 2013 at 7:47 PM, Ivan Čukić <ivan.cukic at kde.org> wrote:
>>> > > > <res> nie:plainTextContent "title artist album whatevereElse" .
>>> > >
>>> > > For me, the plainTextContent of a song would be the lyrics. This seems
>>> > > like a misuse of the property. With a very good reason behind it, but
>>> > > still a misuse.
>>> > >
>>> > > I remember when I wanted to keep all activities in one string property
>>> > > as a \n terminated list to make it speedy :D
>>> > >
>>> > > I'd say go for it, but only as a last resort.
>>> >
>>> > I would not like Nepomuk to be a data store. It's not the place to store
>>> > your lyrics to fetch them later, same for emails and files. It is a place
>>> > to store structured data.
>>> >
>>> > In the case of lyrics, the main reason we are storing them is to be able
>>> > to search through them, not to display them to the user. So we can
>>> > potentially append other data.
>>>
>>> Yes and no. Until discardable graphs were introduced, there was not even
>>> a distinction between primary storage and cached stuff. Real life is even
>>> more complicated: you can have local data indexed, you can have remote
>>> data indexed (and it would be very nice to have it cached), and for some
>>> stuff Nepomuk is used as the primary storage.
>>>
>>> The reason people are trying to stuff Nepomuk with their blobs is very
>>> simple: there's a very real demand for this functionality, and the Nepomuk
>>> ontologies as-is already allow you to store your whole filesystem,
>>> including all byte streams/file contents, so it looks like a very
>>> reasonable approach, especially since nobody actually offers an
>>> alternative. OK, Akonadi is the only exception, which provides caching of
>>> remote data, but it's domain-specific.
>>>
>>> Imagine a user finding a music video by its lyrics, opening the video only
>>> to discover that (s)he can't see any lyrics, because Nepomuk got its
>>> lyrics from some web extractor. Thus the motivation to use Nepomuk at
>>> least as a cache of data, not only for search purposes.
>>>
>>
>> You do have a point. In this case they should be able to access the
>> lyrics.
>>
>>>
>>> There's no primary storage for user-generated rdf at all, so the data is
>>> stored in nepomuk and users are disappointed when something breaks or
>>> disappears.
>>>
>>
>> If we treat Nepomuk as a data store, then you have to deal with keeping
>> the store up to date. Specifically in the case of Akonadi - what are
>> applications supposed to use? Nepomuk or Akonadi? And then we also need a 2
>> way sync to keep both the databases up to date.
>>
>> So I prefer treating Nepomuk as a cache just for searching, but I get
>> that it isn't in the case of tags, and ratings, and other specific rdf. So
>> it's weird.
>>
>>
>>> I'm currently experimenting with solutions to some of these issues, but I
>>> can't do it fast due to time constraints. I don't expect anything worth
>>> going public with in the next couple of months at least, and that's if
>>> I'm lucky :(
>>>
>>
>> Could you elaborate?
>>
>>
>>> _______________________________________________
>>> Nepomuk mailing list
>>> Nepomuk at kde.org
>>> https://mail.kde.org/mailman/listinfo/nepomuk
>>>
>>
>>
>>
>> --
>> Vishesh Handa
>>
>>
>>
>
>
> --
> Best wishes,
> Ignacio
>
>
>
>

-- 
Vishesh Handa