[Nepomuk] [RFC] New File Indexer

Tue Sep 11 10:44:51 UTC 2012

Sounds good on paper, but it will require a lot of work, and a logic layer
will be needed to avoid bad values (the all-or-nothing Nepomuk issue).
I'm on vacation right now, but I will write up a more detailed opinion when
I get back to Spain.

On Monday, 10 September 2012, Vishesh Handa wrote:

> Hey everyone
>
> This month I'm focusing on the file indexing part of Nepomuk, and right
> now it takes forever for Strigi to index all my files. Additionally, it
> doesn't do a very good job of it. I have tons of MP3 files whose metadata
> Strigi does not output correctly, which obviously means Nepomuk does not
> index those files.
>
> I realize this is a big change, but I would like to stop using Strigi.
> Here is why -
>
> * Doesn't always handle PDFs or Microsoft document formats
> * Doesn't always handle ID3 tags properly
> * Seeks into video files, thereby slowing down the extraction
> * Implements its own parsers for archives and UTF handling
> * Goes berserk handling some large video files
> * Large code base
> * Difficult to contribute to
> * Very little documentation
> * Unmaintained
> * We have hacks on the Nepomuk side to get the correct types
> * We use KDE's mimetype detection instead of Strigi's
>
>
> I'm not the only one with this problem. We already have another project
> called the nepomuk-metadata-extractor [1] which implements the following
> indexers:
> * PDF (Poppler-based)
> * Audio files (uses TagLib)
> * Videos (based only on the file name)
>
> I would like to move these indexers into nepomuk-core, and create light
> wrappers to handle whatever file types are missing. Just to be clear, I am
> not proposing a fancy plugin-based architecture like Strigi. We would just
> detect the mimetype using KMimeType and then call the appropriate indexing
> class (if one exists), which would populate the SimpleResourceGraph, or we
> would just add the appropriate RDF types.
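
To make the dispatch idea above concrete, here is a rough sketch in plain
C++. Everything in it is hypothetical: IndexerRegistry, IndexResult and the
fallback mapping are illustrative names, not nepomuk-core API. The mimetype
is passed in as a string where the real code would presumably obtain it from
KMimeType, IndexResult stands in for the SimpleResourceGraph, and the sketch
assumes that the "just add the RDF types" case applies when no indexer is
registered for the mimetype.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the data an indexer would push into the graph.
struct IndexResult {
    std::vector<std::string> types;                 // RDF types, e.g. "nfo:Audio"
    std::map<std::string, std::string> properties;  // extracted metadata
};

// An indexer receives the file path and fills in the result.
using Indexer = std::function<void(const std::string& path, IndexResult& out)>;

// Hypothetical registry: a plain map from mimetype to indexer, no plugins.
class IndexerRegistry {
public:
    void add(const std::string& mimeType, Indexer indexer) {
        assert(indexer && "indexer must be callable");
        m_indexers[mimeType] = std::move(indexer);
    }

    // mimeType is assumed to have been detected by the caller.
    IndexResult index(const std::string& path, const std::string& mimeType) const {
        IndexResult result;
        const auto it = m_indexers.find(mimeType);
        if (it != m_indexers.end()) {
            it->second(path, result);  // a dedicated indexer exists
        } else {
            // Assumed fallback: no indexer, so only record a type.
            result.types.push_back(fallbackType(mimeType));
        }
        return result;
    }

private:
    // Illustrative mapping from the top-level mimetype to an NFO type.
    static std::string fallbackType(const std::string& mimeType) {
        const std::string top = mimeType.substr(0, mimeType.find('/'));
        if (top == "audio") return "nfo:Audio";
        if (top == "video") return "nfo:Video";
        if (top == "image") return "nfo:Image";
        return "nfo:FileDataObject";
    }

    std::map<std::string, Indexer> m_indexers;
};
```

With something like this, a TagLib-backed function could be registered under
"audio/mpeg", while a file type without an indexer would still end up with a
basic type and no further metadata.
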
>
> I've created a simple page listing some of the common file formats [2] and
> how we would handle them. I obviously still need to figure out how we would
> handle document files. I would love to reuse the code in Calligra + Okular
> instead of rolling our own. Apart from that, it seems fairly
> straightforward.
>
> What do you guys think?
>
> I don't think this entire port should take me more than a week.
>
> [1]
> https://projects.kde.org/projects/playground/base/nepomuk-metadata-extractor
> [2] http://community.kde.org/Projects/Nepomuk/FileIndexing
>
> --
> Vishesh Handa
>
>

-- 
Best wishes,
Ignacio