[Nepomuk] Strigi Feeder

Vishesh Handa handa.vish at gmail.com
Tue Jul 13 23:08:54 CEST 2010


Small bug fix. The previous version would add the metadata even if the
resource was found. Oops!

Will test more.

- Vishesh Handa

On Wed, Jul 14, 2010 at 1:51 AM, Vishesh Handa <handa.vish at gmail.com> wrote:

> I've cleaned up the code, and added some comments. It works perfectly. It
> would be really nice if somebody (hint Trueg) could review the code.
>
> I'm posting a short summary of what the code does.
>
> *PROBLEM:*
> The Strigi analyzer, on analyzing a file, creates additional metadata which
> is linked to the file's metadata. For example, when indexing an audio file,
> say "Coldplay - Yellow" from the album 'X&Y', it will create 2 additional
> resources, one of type nco:Contact and the other of type nmm:MusicAlbum. It
> will do that for every indexed song that has the artist 'Coldplay' and album
> 'X&Y'. Nepomuk simply adds all the data to the database without checking if
> similar contacts or albums exist. This leads to multiple contacts, albums,
> etc. with the same names, and makes queries harder to perform (and longer).
>
> Additionally, some files may not contain totally accurate metadata. For
> example, I have a song whose metadata says that it has 2 artists, both of
> whom are called "Coldplay" (exact same spelling). The Strigi analyzer creates
> 2 different resources for these identical contacts. They should be
> merged.
>
> Additionally, all the metadata created (even the contacts, albums, etc.)
> was contained in the same discardable graph. So when the file was deleted
> the additional metadata was deleted as well.
>
> *SOLUTION:*
> The Nepomuk Indexer ( kdebase/runtime/nepomuk/strigibackend/indexerwriter.*
> ) now contains an additional thread, which takes all the statements from the
> IndexWriter, resolves duplicates and merges them. It has been done in a
> separate thread so that the indexing speed does not suffer.
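
To illustrate the hand-off described above, here is a minimal sketch. It is
not the code from the patch (that is C++ in the strigibackend); it is plain
Python, and the names ResolverThread, submit, finish and index_songs, the
"res:N" ids and the property names passed in are all made up for the
example. It only shows the shape of the idea: the indexing side queues work
and returns immediately, while one worker thread does the duplicate lookup.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


class ResolverThread(threading.Thread):
    """Hypothetical worker that merges duplicates off the indexing thread."""

    def __init__(self):
        super().__init__(daemon=True)
        self.pending = queue.Queue()  # work handed over by the indexer
        self.known = {}               # (type, name) -> resource id
        self.links = []               # (file, property, resource id)

    def submit(self, file_uri, prop, rtype, name):
        # Called from the indexing side; only enqueues, so it returns at once.
        self.pending.put((file_uri, prop, rtype, name))

    def run(self):
        while True:
            item = self.pending.get()
            if item is _STOP:
                break
            file_uri, prop, rtype, name = item
            key = (rtype, name)
            if key not in self.known:
                # First time this contact/album is seen: create it once.
                self.known[key] = "res:%d" % len(self.known)
            # Every file links to the single shared resource.
            self.links.append((file_uri, prop, self.known[key]))

    def finish(self):
        # Drain the queue, then stop the worker.
        self.pending.put(_STOP)
        self.join()


def index_songs(songs):
    """Feed (file, artist, album) tuples through the resolver thread."""
    resolver = ResolverThread()
    resolver.start()
    for file_uri, artist, album in songs:
        resolver.submit(file_uri, "nmm:performer", "nco:Contact", artist)
        resolver.submit(file_uri, "nmm:musicAlbum", "nmm:MusicAlbum", album)
    resolver.finish()
    return resolver
```

Calling index_songs with two songs that share the artist 'Coldplay' and the
album 'X&Y' leaves two entries in known (one contact, one album) and four
entries in links, instead of four separate resources.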
>
> The current patch checks for blank nodes in the object / subject of the
> file's metadata, and tries to find them or creates them if not present. The
> patch reverts to the original behavior if any of the additionally
> generated metadata (contacts, albums) contains any blank nodes. In order to
> fix this, a full-blown dependency resolution algorithm would be required. I
> don't think that it is currently required.
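
The find-or-create step with its fallback could look roughly like the
sketch below. Again, this is an illustration in Python and not the patch
itself: statements are modelled as (subject, predicate, object) tuples,
blank nodes as strings starting with "_:", the store as a plain dict, and
the "urn:example:res:N" URIs are invented. Returning None stands for
"fall back to the original behavior".

```python
def is_blank(node):
    """Blank nodes are modelled as strings like '_:a1' (an assumption)."""
    return isinstance(node, str) and node.startswith("_:")


def resolve(statements, store):
    """Replace blank nodes by existing or newly created resources.

    statements: list of (subject, predicate, object) tuples for one file.
    store: dict mapping a frozenset of (predicate, object) pairs to a URI;
           it stands in for a lookup in the real database.
    Returns the rewritten statements, or None if a described blank node
    itself refers to another blank node (no dependency resolution here).
    """
    # Collect the description of every blank node.
    described = {}
    for s, p, o in statements:
        if is_blank(s):
            described.setdefault(s, []).append((p, o))

    # Nested blank nodes: give up and let the caller use the old code path.
    if any(is_blank(o) for props in described.values() for _, o in props):
        return None

    resolved = {}
    created = set()
    for node, props in described.items():
        key = frozenset(props)
        if key not in store:
            store[key] = "urn:example:res:%d" % len(store)
            created.add(node)
        resolved[node] = store[key]

    out = []
    for s, p, o in statements:
        if is_blank(s) and s not in created:
            continue  # resource already exists: do not add its metadata again
        out.append((resolved.get(s, s), p, resolved.get(o, o)))
    # Two identical blank nodes collapse into one link; drop the repeat.
    return list(dict.fromkeys(out))
```

With this model, a song that lists "Coldplay" twice ends up with one
contact and one link to it, and a second song by the same artist only adds
its own link.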
>
> The patch also creates a different graph ( discardable ) for each
> individual resource.
>
> *Problem not fixed :*
> This will only work on newly indexed files and does not affect the files
> which have already been indexed. We'll need some kind of merger to do that.
> It's a lot simpler to just re-index the files, but I don't think the end
> users would like that.
>
> *A New Problem :*
> Since the additional metadata now has its own graph, it will not be
> deleted if the file is deleted. We need some kind of cleaner which cleans
> up resources which are no longer in use.
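
One possible shape for such a cleaner, purely as a sketch: no cleaner
exists in the patch, and the function name clean_unused, the tuple model
and the 'shared' set are assumptions made for this example. A shared
resource counts as unused when no remaining statement has it as its object;
the loop repeats because removing an unused album may in turn leave its
artist unused.

```python
def clean_unused(statements, shared):
    """Drop shared resources that nothing points to any more.

    statements: iterable of (subject, predicate, object) tuples.
    shared: set of resource ids created for contacts, albums, etc.
    Returns (remaining statements, set of removed resource ids).
    """
    statements = list(statements)
    removed = set()
    while True:
        referenced = {o for _, _, o in statements}
        orphans = {r for r in shared - removed if r not in referenced}
        if not orphans:
            break
        # Delete everything the orphaned resources say about themselves.
        statements = [t for t in statements if t[0] not in orphans]
        removed |= orphans
    return statements, removed
```

Run after a file's own statements have been deleted, this would remove the
contact and album only once the last song referring to them is gone.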
>
> And, that's about it.
>
> - Vishesh Handa
>
>
> On Tue, Jul 13, 2010 at 7:42 PM, Vishesh Handa <handa.vish at gmail.com> wrote:
>
>> Yes, I finally implemented it. :-D
>>
>> Please note that this is just the initial design. If you don't like the
>> API design, or anything in particular, please tell me!
>>
>> I've debugged it, and it seems to be running okay, but I'll test it more
>> thoroughly, and benchmark it later. For what it's worth, it seems to be
>> somewhat faster.
>>
>> There is one obvious bug in the implementation which I've highlighted.
>> There are ways to fix it, but that would make the code messier than it
>> already is, and AFAIK it currently isn't a problem, but it could be in the
>> future.
>>
>> - Vishesh Handa
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strigifeeder_3.diff
Type: text/x-patch
Size: 26149 bytes
Desc: not available
Url : http://mail.kde.org/pipermail/nepomuk/attachments/20100714/da32952b/attachment-0001.diff 