[Nepomuk] [RFC] New File Indexer

Vishesh Handa me at vhanda.in
Sat Sep 22 15:52:38 UTC 2012


Hey Simeon

On Sat, Sep 22, 2012 at 9:17 PM, Simeon Bird <bladud at gmail.com> wrote:

> Hi,
>
> Congratulations! This is already a great improvement on the old strigi
> indexer.
> I haven't looked at the code (not sure I'm qualified to, really), but I
> have a
> couple of comments just from testing it.
>
> 1. It still calls itself 'strigi service' in the debug output
>

Yup. I should fix that.


> 2. symlink handling seems to be gone - if a file is symlinked to two
> places it now gets indexed twice. (Maybe you knew about this?)
>

I remember Sebastian fixing symlink handling, but I've written a lot of
code from scratch, so I'll have to check it again. I'll add it to my list
of things to do.

> 3. How are you planning to handle files which are almost text? For
> example, latex or html;
> currently they are not indexed. Do you intend that someone write
> specialized plugins
> for them, or extend the coverage of the text plugin? (qt's text
> widgets already handle some html)
>

Either will do. I haven't really thought about it. I just don't want to put
in too much effort. If some library (like Qt) can extract the data for us,
I'd rather use it than write the parsing code ourselves.
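
For html, for instance, an extractor could stay really thin by letting Qt do
the parsing. Roughly something like this (just a sketch; the helper name is
made up):

    // Sketch of an HTML extractor: let QTextDocument parse the markup and
    // hand back plain text, instead of writing our own HTML parser.
    #include <QFile>
    #include <QString>
    #include <QTextDocument>

    QString extractHtmlText(const QString& path)   // hypothetical helper
    {
        QFile file(path);
        if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
            return QString();

        QTextDocument doc;
        doc.setHtml(QString::fromUtf8(file.readAll()));
        return doc.toPlainText();
    }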

Do you want to start writing some plugins?
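
The interface isn't final yet (see point 4 in the mail below), but an
extractor would presumably boil down to something like this (all names here
are made up; include paths approximate):

    // Very rough idea of what an extractor plugin could look like - the
    // real interface still has to be designed.
    #include <Nepomuk2/SimpleResourceGraph>
    #include <QString>
    #include <QStringList>
    #include <QUrl>

    class ExtractorPlugin                      // hypothetical interface
    {
    public:
        virtual ~ExtractorPlugin() {}

        // Mimetypes this extractor wants to handle, e.g. "text/html".
        virtual QStringList mimetypes() const = 0;

        // Extract the file's metadata and return it as a graph; the
        // framework merges the graphs from all plugins and saves them.
        virtual Nepomuk2::SimpleResourceGraph extract(const QUrl& fileUrl,
                                                      const QString& mimetype) = 0;
    };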


> 4. A number of times while indexing, I got the error message:
> 'nepomukindexer(13152)/nepomuk (strigi service): SimpleIndexerError:
> "http://www.w3.org/1999/02/22-rdf-syntax-ns#type has a rdfs:range of
> http://www.w3.org/2000/01/rdf-schema#Class" ' (not sure what it
> means).
>

It means that I, in my hurry, have not written good plugins. They are
pushing incorrect data, and Nepomuk won't accept it. In this case it seems
I have added a property (rdf:type, something) where the something should be
a class, but it is not. Probably a typo somewhere.
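
In code it's the difference between the two lines below (only a sketch from
memory, not the actual plugin code; include paths approximate):

    #include <Nepomuk2/SimpleResource>
    #include <Soprano/Vocabulary/RDF>
    #include <QUrl>

    void setType(Nepomuk2::SimpleResource& res)   // hypothetical example
    {
        // Wrong - a string literal is not a class, so the graph fails the
        // rdfs:range check on rdf:type:
        // res.addProperty(Soprano::Vocabulary::RDF::type(), QLatin1String("Audio"));

        // Right - pass the URI of an actual ontology class (nfo:Audio here):
        res.addType(QUrl::fromEncoded(
            "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Audio"));
    }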


> Hope that's helpful,
> Simeon
>
>
> On 20 September 2012 18:25, Vishesh Handa <me at vhanda.in> wrote:
> > Another update
> >
> > I've pushed my changes into the feature/newIndexer branch. If someone
> > could review it, it would be nice.
> >
> > The current architecture consists of 2 queues - the BasicIndexingQueue
> > and the FileIndexingQueue.
> >
> > The BasicIndexingQueue just extracts the mimetype, stat results and url.
> > On my system, with the latest Soprano, I manage around 10 files per
> > second. This queue is NOT throttled in any way, and makes virtuoso peak
> > around 70% of one cpu. I'm still working on reducing this. I would
> > ideally like this part to not be noticeable, even when it is working at
> > full speed.
> >
> > The FileIndexingQueue calls the 'nepomukindexer' process which extracts
> > the actual metadata from the file. It only works when the system is
> > IDLE. This is monitored using KIdleTime, which is not that great, since
> > I could have left a compile job running, and during that time I don't
> > want the file indexing to start. Ditto when watching an HD movie.
> >
> > Here is what is left -
> >
> > 1. The Nepomuk Controller widget needs to be updated properly. I'm not
> > sure if I should inform the controller about the basic indexing. Any
> > opinions?
> >
> > 2. Event Monitoring - Pausing on battery and all. For now the old
> > approach is used, where nothing gets indexed when on battery, but I'm
> > not sure that is a good idea. I think I'm going to change it to only
> > pause the file indexing queue when on battery.
> >
> > 3. Separate Process - It is not required at all. I would however like
> > to keep it for debugging purposes. If no one has any problems, I'll stop
> > the new process approach, but still keep the nepomukindexer executable.
> >
> > 4. Plugin Interface - They are currently called Extractors, which is a
> > lousy name, but I couldn't come up with anything better. We need a
> > better name and a proper interface. I've just hacked together a plugin
> > system without thinking about the future design too much. This can be a
> > good thing and a bad thing.
> >
> > We will have to release a public interface for 4.10. Especially if we
> > want other people to write plugins.
> >
> > 5. Plugins - There are only 5 plugins so far, and I have no plans of
> > writing any more. They are extremely simple to write, and my time is
> > better spent doing other things. I think this is an amazing place to get
> > people interested. So, we need to finalize (4) so that I can blog about
> > it and start talking about it.
> >
> > 6. Packagers - I talked to Will (Open Suse) about the new approach, and
> > they would like the plugins to be in a separate tarball / repo. It's a
> > lot easier for them to ship it that way. I have no problem with that.
> > Does anyone have any opinions?
> >
> > 7. Needs a proper review - Someone (not just Sebastian) needs to review
> > the code. The Nepomuk related part isn't that much, and it's not scary.
> > So please review it. I'd like a proper "Ship it" before I merge it into
> > master, and I would like to get it into master this month.
> >
> > That's about it :)
> >
> > On Wed, Sep 12, 2012 at 9:18 PM, Vishesh Handa <me at vhanda.in> wrote:
> >>
> >> Hey everyone
> >>
> >> Quick update. We have analyzers for -
> >>
> >> * taglib
> >> * exiv2
> >> * ffmpeg
> >> * pdf
> >> * plain text files
> >>
> >> Documents are still a problem. I've contacted the Calligra team. I'll
> >> let you know what they say.
> >>
> >> The analyzers work pretty well. I might just code an epub based analyzer
> >> today.
> >>
> >> Tomorrow, I'll start working on a plugin-based architecture, and adding
> >> two queues in the index scheduler. One which will immediately call the
> >> SimpleIndexer to just save the basic metadata, and the other one will
> >> only work when on idle. It'll do the proper indexing for the file.
> >>
> >> The obvious problem with this approach is that we need a way of saying
> >> that this file has passed the first indexing level, and needs to go
> >> through the second level. Maybe a new property for that?
> >>
> >>
> >> On Tue, Sep 11, 2012 at 8:18 PM, Sebastian Trüg <sebastian at trueg.de>
> >> wrote:
> >>>
> >>> I like this.
> >>> But I would vote for a plugin system nonetheless. A simple one though.
> >>> A plugin can register for one or more mimetypes and then it simply
> >>> gets the file path and returns a SimpleResourceGraph. You merge all
> >>> and are done. Plugins should never deal with file size, mimetype, or
> >>> any of those basic things the framework can handle.
> >>>
> >>> This means that the first sweep is done without plugins, the second
> >>> one would call the plugins and the third one, well, that could be yet
> >>> another plugin system which does use RDF types instead of mimetypes.
> >>> For example: the TV show plugin handles nfo:Video. The framework thus
> >>> calls the plugin, provides the path and a handle to the existing
> >>> metadata. The plugin can then simply run its filename analysis and
> >>> continue from there.
> >>>
> >>> OK, one issue we have here is the following: the tv show extractor for
> >>> example works better when run on sets of video files, preferably a
> >>> whole season. Then it only needs to get feedback from the user once or
> >>> can even do its job automatically. This, however, means that
> >>> third-sweep plugins would need an option
> >>> "can-handle-more-than-one-file-at-a-time".
> >>>
> >>> My 2cents.
> >>>
> >>>
> >>> On 09/11/2012 04:06 PM, Alex Fiestas wrote:
> >>>>
> >>>> I think we've discussed this somewhere but I don't remember the
> >>>> outcome of the discussion xD
> >>>>
> >>>> I think it would be really interesting to have an indexer that does a
> >>>> 2-pass strategy.
> >>>>
> >>>> First pass will index only basic data such as name, dates, mimetype.
> >>>>
> >>>> Second pass will index specific stuff, previews, texts, tags...
> >>>>
> >>>> Doing this, we can even add third party "information fetchers" as a
> >>>> 3rd pass, for example to get information about tv shows and such.
> >>>>
> >>>> Let's put an example:
> >>>>
> >>>> -New file in my Download folder detected
> >>>> -Quick super fast indexer indexes data, name, mimetype
> >>>>           From this point, this file is already usable in Nepomuk
> >>>> -Second pass, indexing tags, previews
> >>>> -Third pass (this can be onDemand via GUI) information from the
> >>>> internetz is fetched.
> >>>>
> >>>> I got this idea from spotlight (osx indexer metadata thing); the most
> >>>> obvious way of seeing this in osx is when a new external storage is
> >>>> plugged in: files will get indexed super fast but all you will get if
> >>>> you perform a search is going to be filenames, not even mimetypes!
> >>>>
> >>>> Cheerz.
> >>>> _______________________________________________
> >>>> Nepomuk mailing list
> >>>> Nepomuk at kde.org
> >>>> https://mail.kde.org/mailman/listinfo/nepomuk
> >>>>
> >>> _______________________________________________
> >>> Nepomuk mailing list
> >>> Nepomuk at kde.org
> >>> https://mail.kde.org/mailman/listinfo/nepomuk
> >>
> >>
> >>
> >>
> >> --
> >> Vishesh Handa
> >>
> >
> >
> >
> > --
> > Vishesh Handa
> >
> >
> > _______________________________________________
> > Nepomuk mailing list
> > Nepomuk at kde.org
> > https://mail.kde.org/mailman/listinfo/nepomuk
> >
>



-- 
Vishesh Handa