[Nepomuk] Strigi Feeder

Tue Jul 13 22:21:08 CEST 2010

I've cleaned up the code, and added some comments. It works perfectly. It
would be really nice if somebody (hint Trueg) could review the code.

I'm posting a short summary of what the code does -

*PROBLEM:*
When the Strigi analyzer analyzes a file, it creates additional metadata
which is linked to the file's metadata. For example, when indexing an audio
file, say "Coldplay - Yellow" from the album 'X&Y', it creates 2 additional
resources, one of type nco:Contact and the other of type nmm:MusicAlbum. It
does that for every indexed song that has the artist 'Coldplay' and album
'X&Y'. Nepomuk simply adds all the data to the database without checking
whether similar contacts or albums already exist. This leads to multiple
contacts, albums, etc. with the same names, and makes queries harder to
write ( and slower to run ).

Additionally, some files may not contain totally accurate metadata. For
example, I have a song whose metadata says that it has 2 artists, both of
whom are called "Coldplay" (exact same spelling). The Strigi analyzer creates
2 different resources for these identical contacts. They should be merged.

Additionally, all the metadata created ( even the contacts, albums, etc )
was contained in the same discardable graph. So when the file was deleted,
the additional metadata was deleted as well.

*SOLUTION:*
The Nepomuk Indexer ( kdebase/runtime/nepomuk/strigibackend/indexerwriter.*
) now contains an additional thread, which takes all the statements from the
IndexWriter, resolves duplicates and merges them. It has been done in a
separate thread so that the indexing speed does not suffer.
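
To illustrate the idea (this is not the code from the patch, which is C++
inside the Strigi backend), here is a minimal Python sketch of a worker
thread that takes the extra resources produced for each file and reuses an
existing resource when one with the same type and name is already known.
ResourceStore, feeder_loop, the (type, name) key and the nepomuk:/res/N URIs
are all hypothetical names made up for this example.

```python
import queue
import threading


class ResourceStore:
    """Hypothetical stand-in for the database: (type, name) -> resource URI."""

    def __init__(self):
        self._by_key = {}

    def find_or_create(self, rdf_type, name):
        key = (rdf_type, name)
        if key not in self._by_key:
            # Made-up URI scheme, for illustration only.
            self._by_key[key] = f"nepomuk:/res/{len(self._by_key) + 1}"
        return self._by_key[key]


def feeder_loop(jobs, store, results):
    """Consume (file_url, extra_resources) jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        file_url, extras = job
        # The set removes duplicates within one file (e.g. two "Coldplay" artists).
        results[file_url] = sorted({store.find_or_create(t, n) for t, n in extras})


jobs = queue.Queue()
store = ResourceStore()
results = {}
worker = threading.Thread(target=feeder_loop, args=(jobs, store, results))
worker.start()

# The "indexer" only queues work and carries on; merging happens in the worker.
jobs.put(("file:///music/yellow.mp3",
          [("nco:Contact", "Coldplay"), ("nco:Contact", "Coldplay"),
           ("nmm:MusicAlbum", "X&Y")]))
jobs.put(("file:///music/fix_you.mp3",
          [("nco:Contact", "Coldplay"), ("nmm:MusicAlbum", "X&Y")]))
jobs.put(None)
worker.join()
```

Both files end up linked to the same contact and the same album, and the
producer never waits for the duplicate lookup.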

The current patch checks for blank nodes in the object / subject of the
file's metadata, and tries to find matching resources, or creates them if
they are not present. The patch reverts to the original behavior if any of
the additionally generated metadata ( contacts, albums ) itself contains
blank nodes. In order to fix this, a full-blown dependency resolution
algorithm would be required. I don't think that it is currently required.
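
The blank node handling described above can be sketched roughly as follows.
This is a simplified illustration, not the patch itself: statements are
plain (subject, predicate, object) tuples, blank nodes are strings starting
with "_:", and resolve_blank_nodes, known and the nepomuk:/res/N URIs are
assumptions made for the example. Returning None stands for "fall back to
the original behavior".

```python
def is_blank(node):
    return isinstance(node, str) and node.startswith("_:")


def resolve_blank_nodes(statements, known):
    """Replace blank nodes with shared URIs, or return None to signal fallback.

    `known` maps a frozenset of (predicate, object) pairs to an existing URI.
    """
    # Group the statements that describe each blank node.
    descriptions = {}
    for s, p, o in statements:
        if is_blank(s):
            descriptions.setdefault(s, set()).add((p, o))

    # A description that itself points at a blank node would need
    # dependency ordering, so give up and let the caller fall back.
    for props in descriptions.values():
        if any(is_blank(o) for _, o in props):
            return None

    # Find an existing resource with the same description, or create one.
    mapping = {}
    for node, props in descriptions.items():
        key = frozenset(props)
        if key not in known:
            known[key] = f"nepomuk:/res/{len(known) + 1}"
        mapping[node] = known[key]

    resolved = []
    for s, p, o in statements:
        if is_blank(s):
            continue  # the description now lives on the shared resource
        resolved.append((s, p, mapping.get(o, o)))
    # Identical blank nodes collapse into one statement; keep the order.
    return list(dict.fromkeys(resolved))


known = {}
song = [
    ("file:///yellow.mp3", "nmm:performer", "_:a"),
    ("file:///yellow.mp3", "nmm:performer", "_:b"),
    ("_:a", "rdf:type", "nco:Contact"),
    ("_:a", "nco:fullname", "Coldplay"),
    ("_:b", "rdf:type", "nco:Contact"),
    ("_:b", "nco:fullname", "Coldplay"),
]
merged = resolve_blank_nodes(song, known)

nested = [
    ("file:///other.mp3", "nmm:performer", "_:c"),
    ("_:c", "nco:hasEmailAddress", "_:d"),
]
fallback = resolve_blank_nodes(nested, known)
```

Here the two identical "Coldplay" contacts collapse into a single performer
statement, while the nested case returns None because "_:c" depends on
another blank node.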

The patch also creates a different graph ( discardable ) for each individual
resource.

*Problem not fixed :*
This will only work on newly indexed files and does not affect the files
which have already been indexed. We'll need some kind of merger to do that.
It's a lot simpler to just re-index the files, but I don't think the end
users would like that.

*A New Problem :*
Since the additional metadata now has its own graph, it will not be
deleted if the file is deleted. We need some kind of cleaner which cleans
up resources which are no longer in use.
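
One possible shape for such a cleaner, purely as a sketch: treat a shared
resource as unused when no remaining statement has it as its object, and
drop everything that describes it. The function names, the tuple
representation and the example URIs are all hypothetical; a real cleaner
would have to query the store instead of scanning a list.

```python
def find_orphans(statements, shared_resources):
    """Shared resources that no statement points to any more."""
    referenced = {o for _, _, o in statements}
    return {r for r in shared_resources if r not in referenced}


def remove_orphans(statements, shared_resources):
    """Drop every statement whose subject is an unused shared resource."""
    orphans = find_orphans(statements, shared_resources)
    return [st for st in statements if st[0] not in orphans]


contact = "nepomuk:/res/contact-1"  # made-up URIs
album = "nepomuk:/res/album-1"
statements = [
    ("file:///yellow.mp3", "nmm:performer", contact),
    ("file:///yellow.mp3", "nmm:musicAlbum", album),
    ("file:///clocks.mp3", "nmm:performer", contact),
    (contact, "nco:fullname", "Coldplay"),
    (album, "nie:title", "X&Y"),
]

# Deleting yellow.mp3 removes only the file's own statements...
after_delete = [st for st in statements if st[0] != "file:///yellow.mp3"]
# ...so the cleaner has to remove the album, which nothing references now.
cleaned = remove_orphans(after_delete, {contact, album})
```

The contact survives because clocks.mp3 still uses it. This is a single
pass; if removing one resource orphans another, it would have to be run
again.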

And, that's about it.

- Vishesh Handa

On Tue, Jul 13, 2010 at 7:42 PM, Vishesh Handa <handa.vish at gmail.com> wrote:

> Yes, I finally implemented it. :-D
>
> Please note that this is just the initial design. If you don't like the API
> design, or anything in particular, please tell me!
>
> I've debugged it, and it seems to be running okay, but I'll test it more
> thoroughly, and benchmark it later. For what it's worth, it seems to be
> somewhat faster.
>
> There is one obvious bug in the implementation which I've highlighted.
> There are ways to fix it, but that would make the code messier than it
> already is, and AFAIK it currently isn't a problem, but it could be in the
> future.
>
> - Vishesh Handa
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strigifeeder_2.diff
Type: text/x-patch
Size: 26137 bytes
Desc: not available
Url : http://mail.kde.org/pipermail/nepomuk/attachments/20100714/df7e1407/attachment-0001.diff