<br><br><div class="gmail_quote">On Tue, Sep 11, 2012 at 8:34 PM, Dean Perry <span dir="ltr">&lt;<a href="mailto:happy.heyoka@gmail.com" target="_blank">happy.heyoka@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<u></u>

<div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">OK, so I'm no expert yet on Nepomuk or Strigi, but I am investing time in coming up to speed with them.</p><div class="im">


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Vishesh Handa wrote:</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">>  I don't think this entire port should take me more than a week. <br></p>

</div><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">I'll bet you a beer this is still being discussed a year from now :-)</p><div class="im">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"><br>> This month I'm focusing on the file indexing part of Nepomuk, and right now it takes forever for Strigi to index all</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">> my files.</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

</div><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Well, I feel and share your pain, but I wonder... the file indexer has been banging away on my machine for at least 14 hours now (I'm on Kubuntu 4.9, so no patch for the reindexing thing... anyway).  I have been mostly away from my machine or doing light browsing/email for that time, so other than me writing this mail, Firefox and the usual system/session stuff, there have been no other demands on the CPU.</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Most of the 70% CPU utilization is Virtuoso, with blips of 3% or so every few seconds for nepomukindexer process instances.</p></div>

</blockquote><div><br>Do you also have email indexing enabled? That is handled separately by kdepim, though pushing the data does make Virtuoso act up.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"><span style="text-decoration:underline">There is practically no disk I/O at all</span> (500ms every 50-70s) - all my indexable folders are on a physically distinct drive so it's easy to notice.</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">So my complaint is: why isn't the indexer using <span style="font-style:italic">more</span> resources?</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">(i.e. it appears not to use resources when it could, and to use too many when it shouldn't, which is kind of the <span style="font-style:italic">reverse</span> of how you want it).</p>

<div class="im">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">>  I'm not the only one with this problem. We already have another project called the nepomuk-metadata-extractor [1] which implements the following indexers -<br>

* PDF ( Poppler Based )</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

</div><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">yeah, the Poppler pdfinfo already extracts more data than the current PDF indexer, I had been thinking about this personally.  Go Jörg!</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">>  I would like to move these indexers into nepomuk-core [...] It would then call the appropriate indexing class (if one exists) which would populate the SimpleResourceGraph or it would just add the appropriate rdf types.<br>

</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">I think you have it "inside out": it needs to be *more pluggable*. Instead, make it easier to write a replacement indexer for a given MIME type, and perhaps find a clever way to separate Nepomuk domain-specific knowledge from file-type expertise.</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">For example, off the top of my head, I can think of at least ten different types of file I would want indexed;  I'm sure that everyone here could name ten <span style="font-style:italic">different</span> types.  It is an endless and thankless task.</p>

</div></blockquote><div><br>Of course. I understand that eventually it has to be pluggable. This email was more of a first step - something which I could easily do in a week.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal"><div class="im">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">As evidence - Jörg wrote:</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">> This will help a lot to make indexing better and easier to contribute.</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">> Strigi seems to be a very powerful solution. But writing the</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">> streamanalyzers or fixing in them isn't very intuitive.</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

</div><p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">So, four suggestions (not sure how much of this is already done now):</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"><span style="font-weight:600">(1) Indexer framework is data agnostic, only finds files/resources for indexing; two jobs only</span></p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">  - {a} wrangling which process to launch for MIME type, resource allocation and preemptive termination of that process. </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">  - {b} handling triplets supplied by process; simple validation and transaction support in case of crash or other preemptive termination.</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Why? Language-agnostic indexer code: C++, bash, assembler, Python, Erlang or JavaScript.  Whatever works for the resource type in question.  The indexer only has to know about being a regular process.</p>
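<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">A rough sketch of what jobs {a} and {b} could look like, in Python purely for illustration. Everything named here is an assumption: the <code>INDEXERS</code> registry, the tab-separated output format, the <code>ex:lineCount</code> property and the inline stand-in indexer are all hypothetical, and none of this is existing Nepomuk or Strigi API.</p>
<pre>
```python
import mimetypes
import subprocess
import sys

# Hypothetical registry: MIME type -> command line of an external indexer.
# The inline Python child stands in for a real indexer written in any language;
# it never opens the file, it only echoes one made-up triple per path.
FAKE_INDEXER = [
    sys.executable, "-c",
    "import sys\n"
    "for path in sys.argv[1:]:\n"
    "    print(path, 'ex:lineCount', 1, sep='\\t')\n",
]
INDEXERS = {"text/plain": FAKE_INDEXER}


def parse_triples(output):
    """Simple validation: every line must be subject TAB property TAB value."""
    triples = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not all(parts):
            raise ValueError("malformed triple: %r" % line)
        triples.append(tuple(parts))
    return triples


def index_file(path, store, timeout=10.0):
    """Job {a}: pick and run the process. Job {b}: validate, then commit."""
    mime, _ = mimetypes.guess_type(path)
    command = INDEXERS.get(mime)
    if command is None:
        return False  # no indexer registered for this MIME type
    try:
        # subprocess.run kills the child if the timeout expires.
        result = subprocess.run(command + [path], capture_output=True,
                                text=True, timeout=timeout, check=True)
        triples = parse_triples(result.stdout)
    except (subprocess.SubprocessError, ValueError):
        return False  # crash, timeout or bad output: nothing is stored
    store.extend(triples)  # all-or-nothing stand-in for a real transaction
    return True
```
</pre>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">The point of the design is that the child only writes lines to stdout; the framework alone decides whether anything reaches the store.</p>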

</div></blockquote><div><br>Currently most of this is done by the nepomukindexer process. It works as follows -<br><br>1. Call the Strigi plugins to analyze the file and give the metadata back to us<br>2. Store it in Nepomuk - this is done as one transaction and performs the validation as well.<br>

<br>If the nepomukindexer process crashes, then that file is ignored, and we continue on the next file. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"><span style="font-weight:600">(2) Support multiple resources (of same type) per process (for launch efficiency)</span></p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">The framework can keep a table of discovered resources of a given MIME type and, when it has enough (10? 20?), launch the right process.  Maybe in the future we grade each indexer as lightweight or piggy and decide to launch several sets of processes for several MIME types in parallel.</p>
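<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">The table of discovered resources could be as small as the sketch below. The class name, the <code>launch</code> callback and the batch size are assumptions made up for illustration; the threshold of 10 is just the lower of the two guesses above.</p>
<pre>
```python
from collections import defaultdict

BATCH_SIZE = 10  # the mail suggests 10 or 20; purely a tunable assumption


class BatchingQueue:
    """Collect discovered files per MIME type and release them in batches."""

    def __init__(self, launch, batch_size=BATCH_SIZE):
        self.launch = launch          # callback: launch(mime, list_of_paths)
        self.batch_size = batch_size
        self.pending = defaultdict(list)

    def add(self, mime, path):
        batch = self.pending[mime]
        batch.append(path)
        if len(batch) == self.batch_size:
            self.flush(mime)

    def flush(self, mime):
        # Remove the batch before launching so a failing launch cannot
        # cause the same files to be handed out twice.
        batch = self.pending.pop(mime, [])
        if batch:
            self.launch(mime, batch)

    def flush_all(self):
        """Hand out partial batches, e.g. when the directory scan finishes."""
        for mime in list(self.pending):
            self.flush(mime)
```
</pre>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">A scanner would call <code>add()</code> for every file it finds and <code>flush_all()</code> once at the end, so that leftover files of rare types still get indexed.</p>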

</div></blockquote><div><br>I take it you mean a separate process for each analyzer. We currently use a different approach - a different process for each file. Though, this approach seems interesting as well.<br> <br></div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"><span style="font-weight:600">(3) Support chains of processing per resource.</span></p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Why? So as not to rely on having to re-implement features of previous indexer.  Say I write an mpeg 4 parser to extract closed caption text; I do not have to reimplement Trueg's TV Show stuff. </p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Order of operation might be important - post-processing seems like something that several people have asked about, and I'm certainly interested in "hooking" onto the indexer to capture each freshly completed file.</p>
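<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">One possible shape for such a chain, sketched in Python. The stage functions, the <code>on_complete</code> hook and the <code>ex:</code> property names are all hypothetical; the only idea taken from the text above is that a later stage can read what an earlier stage produced, and that something can hook in once a file is done.</p>
<pre>
```python
def run_chain(path, stages, on_complete=None):
    """Run stages in order; each one sees the triples produced so far."""
    triples = []
    for stage in stages:
        # Pass a tuple so a stage can read earlier results but not edit them.
        triples.extend(stage(path, tuple(triples)))
    if on_complete is not None:
        on_complete(path, triples)  # post-processing hook
    return triples


# Two hypothetical stages; property names are made up for illustration.
def tv_show_stage(path, earlier):
    return [(path, "ex:series", "Some Show")]


def caption_stage(path, earlier):
    # Reuses the earlier result instead of re-deriving the series name.
    series = [value for _, prop, value in earlier if prop == "ex:series"]
    return [(path, "ex:captionFor", series[0])] if series else []
```
</pre>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Swapping the two stages makes <code>caption_stage</code> return nothing, which is the ordering problem in miniature.</p>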

</div></blockquote><div><br>Of course. This is something that goes without saying.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"><span style="font-weight:600">(4) Perhaps hand each process a handle (socket? dbus?) to write to</span></p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Yeah, I've been reading about 'systemd' :-)</p></div></blockquote><div><br>I like the concept of systemd too. Currently half of the Nepomuk communication happens over a local socket, and the other half over dbus. Eventually, I would like to move completely to the local socket, but that's for later, and only if profiling shows that dbus actually is a limiting factor.<br>

<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Imagine the simplest indexer that adds only resource/tag/value triplets - it becomes just two nested loops:</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> -  iterate over resources</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> -- iterate over meta data items.</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> --- Test if resource contains item 1 (eg: jpeg/exif exposure), output triple for item 1</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> --- Test if resource contains item 2 (eg: jpeg/exif iso), output triple for item 2</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> - exit.</p></div></blockquote><div><br>I'm not sure I understand what you mean over here.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">What I'm trying to get at here is that if I have some document type that I am expert in, or for which good library support already exists (JPEG, PDF and mp3 are good examples), then all I need to do is take a list of files and spit out triples, rather than understand how to plug into the framework.</p>


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">The <span style="font-style:italic">only Nepomuk domain specific knowledge</span> I need is the correct property URI and the appropriate format for the values of such properties.</p>
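<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">To make the two nested loops concrete, here is a minimal sketch in Python. The <code>PROPERTIES</code> table, the <code>ex:</code> URIs, the tab-separated output and the <code>read_metadata</code> stub are assumptions for illustration only; a real indexer would call an actual EXIF library and use the proper Nepomuk property URIs.</p>
<pre>
```python
import sys

# Hypothetical mapping: metadata key -> property URI. Both sides are placeholders.
PROPERTIES = {
    "exposure": "ex:exposureTime",
    "iso": "ex:isoSpeed",
}


def read_metadata(path):
    """Stand-in for a real file-format library (e.g. an EXIF reader)."""
    return {"exposure": "1/250", "iso": "400"} if path.endswith(".jpg") else {}


def emit_triples(paths, out=sys.stdout):
    for path in paths:                           # iterate over resources
        metadata = read_metadata(path)
        for key, prop in PROPERTIES.items():     # iterate over metadata items
            if key in metadata:                  # test, then output a triple
                print(path, prop, metadata[key], sep="\t", file=out)
```
</pre>
<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Calling <code>emit_triples(sys.argv[1:])</code> from a script turns this into a regular process that takes a list of files and prints one triple per line; the property table is the only part that needs Nepomuk knowledge.</p>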

</div></blockquote><div><br>That's exactly what I want :)<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="font-family:'Arial';font-size:14pt;font-weight:400;font-style:normal">


<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">Anyway, enough already :-)</p><span class="HOEnZb"><font color="#888888">

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px">dean</p>

<p style="margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;text-indent:0px"> </p></font></span></div><br>_______________________________________________<br>

Nepomuk mailing list<br>

<a href="mailto:Nepomuk@kde.org">Nepomuk@kde.org</a><br>

<a href="https://mail.kde.org/mailman/listinfo/nepomuk" target="_blank">https://mail.kde.org/mailman/listinfo/nepomuk</a><br>

<br></blockquote></div><br><br clear="all"><br>-- <br><span style="color:rgb(192,192,192)">Vishesh Handa</span><br><br>