[Nepomuk] [RFC] New File Indexer

Vishesh Handa me at vhanda.in
Mon Sep 10 19:54:25 UTC 2012


Hey everyone

This month I'm focusing on the file indexing part of Nepomuk. Right now it
takes forever for Strigi to index all my files, and it doesn't do a very
good job of it either. I have tons of mp3 files whose metadata Strigi does
not extract correctly, so Nepomuk ends up not indexing those files.

I realize this is a big change, but I would like to stop using Strigi. Here
is why -

* Doesn't always handle PDFs or Microsoft document formats
* Doesn't always handle ID3 tags properly
* Seeks into video files, thereby slowing down the extraction
* Implements its own parsers for archives and UTF handling
* Goes berserk handling some large video files
* Large code base
* Difficult to contribute to
* Very little documentation
* Unmaintained
* We have hacks on the Nepomuk side to get the correct types
* We already use KDE's mimetype detection instead of Strigi's


I'm not the only one with this problem. We already have another project
called the nepomuk-metadata-extractor [1] which implements the following
indexers -
* PDF ( Poppler Based )
* Audio Files ( Uses Taglib )
* Videos ( Only based on the file name )

I would like to move these indexers into nepomuk-core, and create light
wrappers to handle whatever file types are missing. Just to be clear, I am
not proposing a fancy plugin-based architecture like Strigi's. We would just
detect the mimetype using KMimeType and then call the appropriate indexing
class (if one exists), which would populate the SimpleResourceGraph. If no
such class exists, we would just add the appropriate RDF types.
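
To make that flow concrete, here is a rough sketch of the dispatch in plain
C++. It is not the actual nepomuk-core code. Everything in it is a
hypothetical stand-in: Graph replaces SimpleResourceGraph, detectMimeType()
replaces the KMimeType lookup with a naive extension table, the two
extractors are placeholders for the Poppler and TagLib based indexers, and
the choice of NFO types in the fallback is only an assumption. It also
assumes the fallback applies only when no indexing class matches.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for SimpleResourceGraph: (property, value) pairs.
using Graph = std::vector<std::pair<std::string, std::string>>;
using Extractor = std::function<void(const std::string&, Graph&)>;

// Stand-in for the KMimeType lookup: guesses from the file extension only.
std::string detectMimeType(const std::string& path) {
    static const std::map<std::string, std::string> byExtension = {
        {".pdf", "application/pdf"},
        {".mp3", "audio/mpeg"},
        {".avi", "video/x-msvideo"},
    };
    const std::string::size_type dot = path.rfind('.');
    if (dot != std::string::npos) {
        const auto it = byExtension.find(path.substr(dot));
        if (it != byExtension.end()) {
            return it->second;
        }
    }
    return "application/octet-stream";
}

// Fallback when no extractor exists: derive a coarse type from the mimetype.
// The mapping to NFO types here is an assumption for illustration.
std::string basicType(const std::string& mime) {
    if (mime.compare(0, 6, "audio/") == 0) return "nfo:Audio";
    if (mime.compare(0, 6, "video/") == 0) return "nfo:Video";
    return "nfo:FileDataObject";
}

// Placeholder extractors. Real ones would call Poppler or TagLib and add
// the extracted metadata; these only record a type.
std::map<std::string, Extractor> placeholderExtractors() {
    return {
        {"application/pdf", [](const std::string&, Graph& g) {
            g.emplace_back("rdf:type", "nfo:PaginatedTextDocument");
        }},
        {"audio/mpeg", [](const std::string&, Graph& g) {
            g.emplace_back("rdf:type", "nfo:Audio");
        }},
    };
}

// Detect the mimetype, then either run the matching extractor or fall back
// to adding just the basic type.
Graph indexFile(const std::string& path,
                const std::map<std::string, Extractor>& extractors) {
    Graph graph;
    const std::string mime = detectMimeType(path);
    const auto it = extractors.find(mime);
    if (it != extractors.end()) {
        it->second(path, graph);
    } else {
        graph.emplace_back("rdf:type", basicType(mime));
    }
    return graph;
}
```

Calling indexFile("clip.avi", placeholderExtractors()) takes the fallback
branch, since no video extractor is registered in this sketch. Supporting a
new file type is then one more entry in the map, with no plugin loading.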

I've created a simple page listing some of the common file formats [2] and
how we would handle them. I obviously still need to figure out how we would
handle document files. I would love to reuse the code in Calligra + Okular
instead of rolling our own. Apart from that it seems fairly
straightforward.

What do you guys think?

I don't think this entire port should take me more than a week.

[1] https://projects.kde.org/projects/playground/base/nepomuk-metadata-extractor
[2] http://community.kde.org/Projects/Nepomuk/FileIndexing

-- 
Vishesh Handa