[Nepomuk] [RFC] New File Indexer

Sat Sep 22 16:35:23 UTC 2012

Hi Vishesh,

>> 2. symlink handling seems to be gone - if a file is symlinked to two
>> places it now gets indexed twice. (Maybe you knew about this?)
>
> I remember Sebastian fixing symlink handling, but I've written a lot of
> code from scratch so I'll have to check it out again. I'll add it to my list
> of things to do.

Cool.
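
For what it's worth, one way to avoid the double indexing would be to
resolve each path before queueing it and skip anything already seen. A
rough Python sketch, just to illustrate the idea (the function name is
made up, and this is not how the indexer currently does it):

```python
import os


def unique_real_files(paths):
    """Return each underlying file once, keyed by its resolved path.

    Hypothetical helper: symlinks are followed with os.path.realpath, so a
    file that is linked into two places is only reported a single time.
    """
    seen = set()
    unique = []
    for path in paths:
        real = os.path.realpath(path)  # follows symlinks, normalizes the path
        if real in seen:
            continue  # already queued via another link or the original
        seen.add(real)
        unique.append(real)
    return unique
```

The trade-off is that the link locations themselves are not recorded, so
searching by a symlink's own path would need separate handling.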

> Either will do. I haven't really thought about it. I just don't want to put
> in too much effort. If some library (like Qt) can extract the data for us,
> I'd rather use it than write the parsing code on our own.
>
> Do you want to start writing some plugins?

Are you ready for me to? For HTML I would probably use WebKit or KHTML
(if Qt can't handle it), and for LaTeX I would probably copy-paste detex
(http://code.google.com/p/opendetex/), since it doesn't really have a
library implementation. I can wait until the interfaces are more mature
if you like, though.
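
To make sure we mean the same thing by a plugin, here is roughly the shape
I have in mind for the HTML one, sketched in Python with only the standard
library rather than WebKit. All names here (register, run_extractors, the
plain_text key) are made up for illustration, and a plain dict stands in
for whatever the real interface ends up returning:

```python
from html.parser import HTMLParser

# Hypothetical registry: mimetype -> extractor functions registered for it.
EXTRACTORS = {}


def register(*mimetypes):
    """Decorator that registers an extractor for one or more mimetypes."""
    def decorator(func):
        for mimetype in mimetypes:
            EXTRACTORS.setdefault(mimetype, []).append(func)
        return func
    return decorator


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


@register("text/html", "application/xhtml+xml")
def extract_html(path):
    """Return a plain dict standing in for the real metadata graph."""
    collector = _TextCollector()
    with open(path, encoding="utf-8", errors="replace") as handle:
        collector.feed(handle.read())
    collector.close()
    return {"plain_text": " ".join(collector.parts)}


def run_extractors(path, mimetype):
    """Call every extractor registered for the mimetype and merge results."""
    merged = {}
    for extractor in EXTRACTORS.get(mimetype, []):
        merged.update(extractor(path))
    return merged
```

The plugin only ever sees a file path; picking the plugin by mimetype and
merging the results stays on the framework side.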

>> 4. A number of times while indexing, I got the error message:
>> 'nepomukindexer(13152)/nepomuk (strigi service): SimpleIndexerError:
>> "http://www.w3.org/1999/02/22-rdf-syntax-ns#type has a rdfs:range of
>> http://www.w3.org/2000/01/rdf-schema#Class" ' (not sure what it
>> means).
>
> It means that I, in my hurry, have not written good plugins. They are
> pushing incorrect data, and Nepomuk won't let them. In this case it seems I
> have added a property (rdf:type, something) where the something should be a
> class, but it is not. Probably a typo somewhere.

Ok. Quite understandable :)
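
If I understand it correctly, the check behind that error boils down to
something like the following. This is only a Python sketch of the idea,
not Nepomuk's actual validation code; the function name and the set of
known classes are assumptions for the example:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
NFO = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"


def check_type_statements(statements, known_classes):
    """Return one message per rdf:type statement whose object is not a class.

    statements is an iterable of (subject, property, object) URI strings;
    known_classes is an assumed set of class URIs taken from the ontologies.
    """
    errors = []
    for subject, prop, obj in statements:
        if prop == RDF_TYPE and obj not in known_classes:
            errors.append(
                "%s has a rdfs:range of %s, but %s is not a known class"
                % (RDF_TYPE, RDFS_CLASS, obj)
            )
    return errors
```

So a typo such as NFO + "Vidoe" in a plugin would be rejected, while
NFO + "Video" would pass.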

Simeon

>> On 20 September 2012 18:25, Vishesh Handa <me at vhanda.in> wrote:
>> > Another update
>> >
>> > I've pushed my changes into the feature/newIndexer branch. If someone
>> > could review it, it would be nice.
>> >
>> > The current architecture consists of 2 queues - BasicIndexingQueue and
>> > FileIndexingQueue.
>> >
>> > The BasicIndexingQueue just extracts the mimetype, stat results and url.
>> > On my system, with the latest Soprano, I manage around 10 files per
>> > second. This queue is NOT throttled in any way, and makes Virtuoso peak
>> > at around 70% of one CPU. I'm still working on reducing this. I would
>> > ideally like this part to not be noticeable, even when it is working at
>> > full speed.
>> >
>> > The FileIndexingQueue calls the 'nepomukindexer' process, which extracts
>> > the actual metadata from the file. It only works when the system is
>> > idle. This is monitored using KIdleTime, which is not that great, since
>> > I could have left a compile job running, and I don't want the file
>> > indexing to start during that time. Ditto when watching an HD movie.
>> >
>> > Here is what is left -
>> >
>> > 1. The Nepomuk Controller widget needs to be updated properly. I'm not
>> > sure if I should inform the controller about the basic indexing. Any
>> > opinions?
>> >
>> > 2. Event Monitoring - Pausing on battery and all. For now the old
>> > approach is being used, where nothing gets indexed when on battery, but
>> > I'm not sure if that is a good idea. I think I'm going to change it to
>> > only pause the file indexing queue when on battery.
>> >
>> > 3. Separate Process - It is not required at all. I would however like to
>> > keep it for debugging purposes. If no one has any problems, I'll stop
>> > the new process approach, but still keep the nepomukindexer executable.
>> >
>> > 4. Plugin Interface - They are currently called Extractors, which is a
>> > lousy name, but I couldn't come up with anything better. We need a
>> > better name and a proper interface. I've just hacked together a plugin
>> > system without thinking about the future design too much. This can be a
>> > good thing and a bad thing.
>> >
>> > We will have to release a public interface for 4.10, especially if we
>> > want other people to write plugins.
>> >
>> > 5. Plugins - There are only 5 plugins so far, and I have no plans of
>> > writing any more. They are extremely simple to write, and my time is
>> > better spent doing other things. I think this is an amazing place to get
>> > people interested. So, we need to finalize (4) so that I can blog about
>> > it and start talking about it.
>> >
>> > 6. Packagers - I talked to Will (openSUSE) about the new approach, and
>> > they would like the plugins to be in a separate tarball / repo. It's a
>> > lot easier for them to ship it that way. I have no problem with that.
>> > Does anyone have any opinions?
>> >
>> > 7. Needs a proper review - Someone (not just Sebastian) needs to review
>> > the code. The Nepomuk related part isn't that much, and it's not scary.
>> > So please review it. I'd like a proper "Ship it" before I merge it into
>> > master, and I would like to get it into master this month.
>> >
>> > That's about it :)
>> >
>> > On Wed, Sep 12, 2012 at 9:18 PM, Vishesh Handa <me at vhanda.in> wrote:
>> >>
>> >> Hey everyone
>> >>
>> >> Quick update. We have analyzers for -
>> >>
>> >> * taglib
>> >> * exiv2
>> >> * ffmpeg
>> >> * pdf
>> >> * plain text files
>> >>
>> >> Documents are still a problem. I've contacted the Calligra team. I'll
>> >> let you know what they say.
>> >>
>> >> The analyzers work pretty well. I might just code an epub based
>> >> analyzer today.
>> >>
>> >> Tomorrow, I'll start working on a plugin based architecture, and adding
>> >> two queues in the index scheduler. One will immediately call the
>> >> SimpleIndexer to just save the basic metadata, and the other one will
>> >> only work when idle. It'll do the proper indexing for the file.
>> >>
>> >> The obvious problem with this approach is that we need a way of saying
>> >> that this file has passed the first indexing level, and needs to go
>> >> through the second level. Maybe a new property for that?
>> >>
>> >>
>> >> On Tue, Sep 11, 2012 at 8:18 PM, Sebastian Trüg <sebastian at trueg.de> wrote:
>> >>>
>> >>> I like this.
>> >>> But I would vote for a plugin system nonetheless. A simple one though.
>> >>> A plugin can register for one or more mimetypes and then it simply
>> >>> gets the file path and returns a SimpleResourceGraph. You merge all
>> >>> and are done. Plugins should never deal with file size, mimetype, or
>> >>> any of those basic things the framework can handle.
>> >>>
>> >>> This means that the first sweep is done without plugins, the second
>> >>> one would call the plugins and the third one, well, that could be yet
>> >>> another plugin system which uses RDF types instead of mimetypes. For
>> >>> example: the TV show plugin handles nfo:Video. The framework thus
>> >>> calls the plugin, provides the path and a handle to the existing
>> >>> metadata. The plugin can then simply run its filename analysis and
>> >>> continue from there.
>> >>>
>> >>> OK, one issue we have here is the following: the TV show extractor,
>> >>> for example, works better when run on sets of video files, preferably
>> >>> a whole season. Then it only needs to get feedback from the user once
>> >>> or can even do its job automatically. This, however, means that
>> >>> third-sweep plugins would need an option
>> >>> "can-handle-more-than-one-file-at-a-time".
>> >>>
>> >>> My 2cents.
>> >>>
>> >>>
>> >>> On 09/11/2012 04:06 PM, Alex Fiestas wrote:
>> >>>>
>> >>>> I think we've discussed this somewhere but I don't remember the
>> >>>> outcome of the discussion xD
>> >>>>
>> >>>> I think it would be really interesting to have an indexer that uses
>> >>>> a 2-pass strategy.
>> >>>>
>> >>>> First pass will index only basic data such as name, dates, mimetype.
>> >>>>
>> >>>> Second pass will index specific stuff, previews, texts, tags...
>> >>>>
>> >>>> Doing this, we can even add third party "information fetchers" as a
>> >>>> third pass, for example to get information about TV shows and such.
>> >>>>
>> >>>> Let's put an example:
>> >>>>
>> >>>> -New file in my Download folder detected
>> >>>> -Quick super fast indexer indexes data, name, mimetype
>> >>>>           From this point, this file is already usable in Nepomuk
>> >>>> -Second pass, indexing tags, previews
>> >>>> -Third pass (this can be onDemand via GUI) information from the
>> >>>> internetz is
>> >>>> fetched.
>> >>>>
>> >>>> I got this idea from Spotlight (the OS X metadata indexer). The most
>> >>>> obvious way of seeing this in OS X is when a new external storage
>> >>>> device is plugged in: files will get indexed super fast, but all you
>> >>>> will get if you perform a search is going to be filenames, not even
>> >>>> mimetypes!
>> >>>>
>> >>>> Cheerz.
>> >>>> _______________________________________________
>> >>>> Nepomuk mailing list
>> >>>> Nepomuk at kde.org
>> >>>> https://mail.kde.org/mailman/listinfo/nepomuk
>> >>>>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Vishesh Handa
>> >>
>> >
>> >
>> >
>> > --
>> > Vishesh Handa
>> >
>> >
>> >
>
>
>
>
> --
> Vishesh Handa
>