Hey Simeon<br><br><div class="gmail_quote">On Sat, Sep 22, 2012 at 9:17 PM, Simeon Bird <span dir="ltr"><<a href="mailto:bladud@gmail.com" target="_blank">bladud@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi,<br>

<br>

Congratulations! This is already a great improvement on the old strigi indexer.<br>

I haven't looked at the code (not sure I'm qualified to, really), but I have a<br>

couple of comments just from testing it.<br>

<br>

1. It still calls itself 'strigi service' in the debug output<br></blockquote><div><br>Yup. I should fix that.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


2. symlink handling seems to be gone - if a file is symlinked to two<br>

places it now gets indexed twice. (Maybe you knew about this?)<br></blockquote><div><br>I remember Sebastian fixing symlink handling, but I've rewritten a lot of the code from scratch, so I'll have to check it again. I'll add it to my to-do list.<br>
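<br>One way to avoid the double indexing would be to resolve every path to its canonical target before it is queued. A rough Python sketch of the idea, not the actual indexer code (the function name unique_targets is made up for illustration):<br>
<pre>
```python
import os


def unique_targets(paths):
    """Return one canonical path per real file, dropping symlink duplicates."""
    seen = set()
    result = []
    for path in paths:
        # realpath follows symlinks, so two links to one file share one key
        canonical = os.path.realpath(path)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```
</pre>
The catch is that the link paths themselves are no longer indexed, so whether this is the right behaviour still needs some thought.<br>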

<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

3. How are you planning to handle files which are almost text? For<br>

example, latex or html;<br>

currently they are not indexed. Do you intend that someone write<br>

specialized plugins<br>

for them, or extend the coverage of the text plugin? (qt's text<br>

widgets already handle some html)<br></blockquote><div><br>Either will do. I haven't really thought about it. I just don't want to put in too much effort. If some library (like Qt) can extract the data for us, I'd rather use it than write the parsing code ourselves.<br>
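<br>To make the "extend the text plugin" option concrete: if plugins register per mimetype, covering latex or html could be as small as registering the plain text extractor for a few more types. A Python sketch under that assumption; every name here (REGISTRY, register, extract_plain_text, index_file) and the mimetype list are hypothetical, not the real interface, and the markup would be indexed verbatim:<br>
<pre>
```python
REGISTRY = {}  # hypothetical: mimetype -> list of extractor functions


def register(mimetypes, extractor):
    """Register one extractor for several mimetypes."""
    for mimetype in mimetypes:
        REGISTRY.setdefault(mimetype, []).append(extractor)


def extract_plain_text(path):
    """Naive extractor: return the file content as one text property."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {"plainTextContent": handle.read()}


def index_file(path, mimetype):
    """Run every extractor registered for the mimetype and merge the results."""
    merged = {}
    for extractor in REGISTRY.get(mimetype, []):
        merged.update(extractor(path))
    return merged


# Assumption: treat "almost text" types like plain text, markup included.
register(["text/plain", "text/html", "text/x-tex"], extract_plain_text)
```
</pre>
A specialized plugin that strips the markup could later be registered for the same mimetypes without touching the rest.<br>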

<br>Do you want to start writing some plugins?<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

4. A number of times while indexing, I got the error message:<br>

'nepomukindexer(13152)/nepomuk (strigi service): SimpleIndexerError:<br>

"<a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" target="_blank">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</a> has a rdfs:range of<br>

<a href="http://www.w3.org/2000/01/rdf-schema#Class" target="_blank">http://www.w3.org/2000/01/rdf-schema#Class</a>" ' (not sure what it<br>

means).<br></blockquote><div><br>It means that, in my hurry, I have not written good plugins. They are pushing incorrect data, and Nepomuk won't accept it. In this case it seems I have added a property (rdf:type, something) where the something should be a class, but it is not. Probably a typo somewhere.<br>
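<br>Roughly, the check that fails looks like the following. This is a Python sketch of the idea only, not Nepomuk's actual code; KNOWN_CLASSES, check_type_statement and the example class names are assumptions for illustration:<br>
<pre>
```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical stand-in for the classes the ontologies define.
KNOWN_CLASSES = {"nfo:Video", "nfo:Audio", "nfo:PlainTextDocument"}


def check_type_statement(subject, predicate, obj):
    """Reject an rdf:type statement whose object is not a known class."""
    if predicate == RDF_TYPE and obj not in KNOWN_CLASSES:
        raise ValueError(
            f"{RDF_TYPE} has a rdfs:range of rdfs:Class, got {obj!r}"
        )
    return (subject, predicate, obj)
```
</pre>
A typo in a class name inside a plugin would trip exactly this kind of check.<br>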

<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Hope that's helpful,<br>

Simeon<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

On 20 September 2012 18:25, Vishesh Handa <<a href="mailto:me@vhanda.in">me@vhanda.in</a>> wrote:<br>

> Another update<br>

><br>

> I've pushed my changes into the feature/newIndexer branch. If someone could<br>

> review it, it would be nice.<br>

><br>

> The current architecture consists of 2 queues - BasicIndexingQueue and<br>

> FileIndexingQueue.<br>

><br>

> The BasicIndexingQueue just extracts the mimetype, stat results and url.<br>

> On my system, with the latest Soprano, I manage around 10 files per second.<br>

> This queue is NOT throttled in any way, and makes virtuoso peak at around 70% of<br>

> one cpu. I'm still working on reducing this. I would ideally like this part<br>

> to not be noticeable, even when it is working at full speed.<br>

><br>

> The FileIndexingQueue calls the 'nepomukindexer' process which extracts the<br>

> actual metadata from the file. It only works when the system is IDLE. This<br>

> is monitored using the KIdleTime, which is not that great, since I could<br>

> have left a compiling job and during that time I don't want the file<br>

> indexing to start. Ditto when watching an HD movie.<br>

><br>

> Here is what is left -<br>

><br>

> 1. The Nepomuk Controller widget needs to be updated properly. I'm not sure<br>

> if I should inform the controller about the basic indexing. Any opinions?<br>

><br>

> 2. Event Monitoring - Pausing on battery and all. For now the old approach<br>

> is being used that nothing gets indexed when on battery, but I'm not sure if<br>

> that is a good idea. I think I'm going to change it to only pause the file<br>

> indexing queue when on battery.<br>

><br>

> 3. Separate Process - It is not required at all. I would however like to<br>

> keep it for debugging purposes. If no one has any problems, I'll stop the new<br>

> process approach, but still keep the nepomukindexer executable.<br>

><br>

> 4. Plugin Interface - They are currently called Extractors which is a lousy<br>

> name, but I couldn't come up with anything better. We need a better name and<br>

> a proper interface. I've just hacked together a plugin system without<br>

> thinking about the future design too much. This can be a good thing and a<br>

> bad thing.<br>

><br>

> We will have to release a public interface for 4.10. Especially if we want<br>

> other people to write plugins.<br>

><br>

> 5. Plugins - There are only 5 plugins so far, and I have no plans of writing<br>

> any more. They are extremely simple to write, and my time is better spent<br>

> doing other things. I think this is an amazing place to get people<br>

> interested. So, we need to finalize (4) so that I can blog about it and<br>

> start talking about it.<br>

><br>

> 6. Packagers - I talked to Will (Open Suse) about the new approach, and they<br>

> would like the plugins to be in a separate tarball / repo. It's a lot easier<br>

> for them to ship it that way. I have no problem with that. Does anyone have<br>

> any opinions?<br>

><br>

> 7. Needs a proper review - Someone (not just Sebastian) needs to review the<br>

> code. The Nepomuk related part isn't that much, and it's not scary. So<br>

> please review it. I'd like a proper "Ship it" before I merge it into master,<br>

> and I would like to get it into master this month.<br>

><br>

> That's about it :)<br>

><br>

> On Wed, Sep 12, 2012 at 9:18 PM, Vishesh Handa <<a href="mailto:me@vhanda.in">me@vhanda.in</a>> wrote:<br>

>><br>

>> Hey everyone<br>

>><br>

>> Quick update. We have analyzers for -<br>

>><br>

>> * taglib<br>

>> * exiv2<br>

>> * ffmpeg<br>

>> * pdf<br>

>> * plain text files<br>

>><br>

>> Documents are still a problem. I've contacted the Calligra team. I'll let<br>

>> you know what they say.<br>

>><br>

>> The analyzers work pretty well. I might just code an epub based analyzer<br>

>> today.<br>

>><br>

>> Tomorrow, I'll start working on a plugin based architecture, and adding<br>

>> two queues in the index scheduler. One which will immediately call the<br>

>> SimpleIndexer to just save the basic metadata, and the other one will only<br>

>> work when on idle. It'll do the proper indexing for the file.<br>

>><br>

>> The obvious problem to this approach is that we need a way of saying that<br>

>> this file has passed the first indexing level, and needs to go through the<br>

>> second level. Maybe a new property for that?<br>

>><br>

>><br>

>> On Tue, Sep 11, 2012 at 8:18 PM, Sebastian Trüg <<a href="mailto:sebastian@trueg.de">sebastian@trueg.de</a>><br>

>> wrote:<br>

>>><br>

>>> I like this.<br>

>>> But I would vote for a plugin system nonetheless. A simple one though. A<br>

>>> plugin can register for one or more mimetypes and then it simply gets the<br>

>>> file path and returns a SimpleResourceGraph. You merge all and are done.<br>

>>> Plugins should never deal with file size, mimetype, or any of those basic<br>

>>> things the framework can handle.<br>

>>><br>

>>> This means that the first sweep is done without plugins, the second one<br>

>>> would call the plugins and the third one, well, that could be yet another<br>

>>> plugin system which does use RDF types instead of mimetypes. For example:<br>

>>> the TV show plugin handles nfo:Video. The framework thus calls the plugin,<br>

>>> provides the path and a handle to the existing metadata. The plugin can then<br>

>>> simply run its filename analysis and continue from there.<br>

>>><br>

>>> OK, one issue we have here is the following: the tv show extractor for<br>

>>> example works better when run on sets of video files, preferably a whole<br>

>>> season. Then it only needs to get feedback from the user once or can even do<br>

>>> its job automatically. This, however, means that third-sweep plugins would<br>

>>> need an option "can-handle-more-than-one-file-at-a-time".<br>

>>><br>

>>> My 2cents.<br>

>>><br>

>>><br>

>>> On 09/11/2012 04:06 PM, Alex Fiestas wrote:<br>

>>>><br>

>>>> I think we've discussed this somewhere but I don't remember the outcome<br>

>>>> of the<br>

>>>> discussion xD<br>

>>>><br>

>>>> I think that would be really interesting to have an indexer that does a<br>

>>>> 2pass<br>

>>>> strategy.<br>

>>>><br>

>>>> First pass will index only basic data such as name, dates, mimetype.<br>

>>>><br>

>>>> Second pass will index specific stuff, previews, texts, tags...<br>

>>>><br>

>>>> Doing this, we can even add third party "information fetchers" as a 3<br>

>>>> pass,<br>

>>>> for example to get information about tv shows and such.<br>

>>>><br>

>>>> Let's put an example:<br>

>>>><br>

>>>> -New file in my Download folder detected<br>

>>>> -Quick super fast indexer indexes data, name, mimetype<br>

>>>>           From this point, this file is already usable in Nepomuk<br>

>>>> -Second pass, indexing tags, previews<br>

>>>> -Third pass (this can be onDemand via GUI) information from the<br>

>>>> internetz is<br>

>>>> fetched.<br>

>>>><br>

>>>> I got this idea from spotlight (osx indexer metadata thing), the most<br>

>>>> obvious<br>

>>>> way of seeing this in osx is when a new external storage is plugged,<br>

>>>> files<br>

>>>> will get indexed super fast but all you will get if you perform a search<br>

>>>> is<br>

>>>> going to be filenames, not even mimetypes!<br>

>>>><br>

>>>> Cheerz.<br>

>>>> _______________________________________________<br>

>>>> Nepomuk mailing list<br>

>>>> <a href="mailto:Nepomuk@kde.org">Nepomuk@kde.org</a><br>

>>>> <a href="https://mail.kde.org/mailman/listinfo/nepomuk" target="_blank">https://mail.kde.org/mailman/listinfo/nepomuk</a><br>

>>>><br>


>><br>

>><br>

>><br>

>><br>

>> --<br>

>> Vishesh Handa<br>

>><br>

><br>

><br>

><br>

> --<br>

> Vishesh Handa<br>

><br>

><br>


><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br><span style="color:rgb(192,192,192)">Vishesh Handa</span><br><br>