Small bug fix. The previous version would add the metadata even if the resource was found. Oops! <br><br>Will test more.<br><br>- Vishesh Handa<br><br><div class="gmail_quote">On Wed, Jul 14, 2010 at 1:51 AM, Vishesh Handa <span dir="ltr">&lt;<a href="mailto:handa.vish@gmail.com">handa.vish@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">I&#39;ve cleaned up the code, and added some comments. It works perfectly. It would be really nice if somebody (hint Trueg) could review the code.<br>

<br>I&#39;m posting a short summary of what the code does -<b><br><br>PROBLEM:</b><br>

The Strigi analyzer, on analyzing a file, creates additional metadata which is linked to the file&#39;s metadata. Example - when indexing an audio file, say &quot;Coldplay - Yellow&quot; from the album &#39;X&amp;Y&#39;, it will create 2 additional resources, one of type nco:Contact and the other of type nmm:MusicAlbum. It will do that for every indexed song that has the artist &#39;Coldplay&#39; and album &#39;X&amp;Y&#39;. Nepomuk simply adds all the data to the database without checking if similar contacts or albums exist. This leads to multiple contacts, albums, etc. with the same names, and makes queries harder to perform (and longer).<br>


<br>Additionally, some files may not contain totally accurate metadata. For example - I have a song whose metadata says that it has 2 artists, both of whom are called &quot;Coldplay&quot; (exact same spelling). The Strigi analyzer creates 2 different resources for these identical contacts. They should be merged.<br>


<br>Additionally, all the metadata created (even the contacts, albums, etc.) was contained in the same discardable graph. So when the file was deleted, the additional metadata was deleted as well.<br><b><br>SOLUTION:</b><br>


The Nepomuk Indexer ( kdebase/runtime/nepomuk/strigibackend/indexerwriter.* ) now contains an additional thread, which takes all the statements from the IndexWriter, resolves duplicates and merges them. It has been done in a separate thread so that the indexing speed does not suffer.<br>


<br>The current patch checks for blank nodes in the object / subject of the file&#39;s metadata, and tries to find them, or creates them if not present. The patch reverts to the original behavior if any of the additionally generated metadata (contacts, albums) contains any blank nodes. In order to fix this, a full-blown dependency resolution algorithm would be required. I don&#39;t think that it is currently required.<br>
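<br>To illustrate the find-or-create step described above, here is a minimal sketch. It is written in Python rather than the patch&#39;s C++, and it is not the patch code: the Store class, the resolve function and the &quot;res:N&quot; URIs are hypothetical stand-ins for the Nepomuk store, and matching on (type, name) is an assumption about what counts as a duplicate.<br>
<pre>
```python
class Store:
    """Hypothetical in-memory stand-in for the Nepomuk store."""

    def __init__(self):
        self.by_key = {}  # maps (rdf_type, name) to a resource URI

    def find_or_create(self, rdf_type, name):
        # Reuse an existing resource with the same type and name,
        # otherwise create a new one (assumed duplicate criterion).
        key = (rdf_type, name)
        if key not in self.by_key:
            self.by_key[key] = "res:%d" % len(self.by_key)
        return self.by_key[key]


def resolve(statements, blank_nodes, store):
    """Replace blank-node ids in (subject, predicate, object) statements.

    blank_nodes maps a blank-node id to its (rdf_type, name) description.
    Statements that become identical after resolution are merged.
    """
    mapping = {}
    for blank_id, (rdf_type, name) in blank_nodes.items():
        mapping[blank_id] = store.find_or_create(rdf_type, name)
    resolved = [
        (mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in statements
    ]
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(resolved))
```
</pre>
With two blank nodes that both describe the contact &quot;Coldplay&quot;, both resolve to the same resource and the two performer statements collapse into one. A later file by the same artist reuses that resource instead of creating a new contact.<br>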


<br>The patch also creates a different graph ( discardable ) for each individual resource.  <br><br><b>Problem not fixed :</b><br>This will only work on newly indexed files and does not affect the files which have already been indexed. We&#39;ll need some kind of merger to do that. It&#39;s a lot simpler to just re-index the files, but I don&#39;t think the end users would like that.<br>


<br><b>A New Problem :<br></b>Since the additional metadata now has its own graph, it will not be deleted if the file is deleted. We need some kind of cleaner which cleans up resources which are no longer in use.<br><br>


And, that&#39;s about it.<br><font color="#888888"><br>- Vishesh Handa</font><div><div></div><div class="h5"><br><br><div class="gmail_quote">On Tue, Jul 13, 2010 at 7:42 PM, Vishesh Handa <span dir="ltr">&lt;<a href="mailto:handa.vish@gmail.com" target="_blank">handa.vish@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Yes, I finally implemented it. :-D<br><br>Please note that this is just the initial design. If you don&#39;t like the API design, or anything in particular, please tell me!<br>


<br>I&#39;ve debugged it, and it seems to be running okay, but I&#39;ll test it more thoroughly, and benchmark it later. For what it&#39;s worth, it seems to be somewhat faster. <br>

<br>There is one obvious bug in the implementation which I&#39;ve highlighted. There are ways to fix it, but that would make the code messier than it already is. AFAIK it currently isn&#39;t a problem, but it could be in the future.<br>


<font color="#888888">

<br>- Vishesh Handa<br><br>

</font></blockquote></div><br>

</div></div></blockquote></div><br>