[Nepomuk] [RFC] File Indexers

Wed Mar 20 15:22:13 UTC 2013

On Wednesday, 20 March 2013 19:57:45 Vishesh Handa wrote:
> On Wed, Mar 20, 2013 at 7:39 PM, <phreedom at yandex.ru> wrote:
> > On Tuesday, 19 March 2013 23:35:42 Vishesh Handa wrote:
> > > As you guys might remember, we moved away from Strigi for the 4.10
> > > release. Our solution, however, still does not support any document
> > > formats apart from PDF. We need to change that and support other
> > > formats. There are 2 possible ways to go about this -
> > > 
> > > 1. We use Okular which supports a number of popular formats
> > > 2. We write our own indexers by using the relevant library.
> > 
> > I know I risk starting a flamewar, or more likely, there's no risk, and
> > instead a 100% guarantee, but:
> 
> Not really. It was mostly just a decision taken by me.
> 
> >   3. Use libStreamAnalyzer.
> > 
> > Take a look back at how many tiny issues and corner cases had to be
> > fixed so far, how many lib quirks had to be accounted for? This was also
> > the most significant source of troubles for libstreamanalyzer.
> 
> The main reason I'm against this is that Strigi does not have a
> maintainer. Bugs keep cropping up: it doesn't handle all kinds of odf
> files, docs files, etc. I do not want to have to fix them.

But now the Nepomuk file indexer needs a maintainer.

> Also, we're fundamentally
> duplicating work. Libraries already exist to parse those file formats, and
> they are actively being used all across kde. We can just reuse those
> libraries instead of having our own parsers, and maintaining them.

Which was never a problem for LSA, e.g. the ffmpeg plugin. No one volunteered
to write an Okular plugin or to persuade the TagLib people to make public
their stream-based API, which is used internally and wrapped by the
file-based public API. In fact, the plugin architecture was intended to allow
KDE apps and libs to ship analyzers based on their format-specific libs.
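
To make the "stream-based API wrapped by a file-based API" point concrete,
here is a minimal sketch. It is not Strigi's or TagLib's actual API: the
registry, the decorator, and all function names are hypothetical, chosen only
to show a plugin registering a stream analyzer and a thin file wrapper on top.

```python
import io
from typing import BinaryIO, Callable, Dict

# Hypothetical registry: MIME type -> stream analyzer. Not a real Strigi API.
ANALYZERS: Dict[str, Callable[[BinaryIO], dict]] = {}


def analyzer(mime_type: str):
    """Decorator an app or library could use to ship its own analyzer."""
    def register(func: Callable[[BinaryIO], dict]):
        ANALYZERS[mime_type] = func
        return func
    return register


@analyzer("text/plain")
def analyze_text(stream: BinaryIO) -> dict:
    # Works on any readable binary stream, not only on files on disk.
    data = stream.read()
    return {"lines": len(data.splitlines()), "bytes": len(data)}


def analyze_stream(mime_type: str, stream: BinaryIO) -> dict:
    """Stream-based entry point: the core API."""
    handler = ANALYZERS.get(mime_type)
    return handler(stream) if handler else {}


def analyze_file(mime_type: str, path: str) -> dict:
    """File-based convenience API: a thin wrapper over the stream API."""
    with open(path, "rb") as f:
        return analyze_stream(mime_type, f)


# The same analyzer serves an in-memory buffer with no file involved.
print(analyze_stream("text/plain", io.BytesIO(b"a\nb\n")))
```

With this shape, the file-based call costs nothing extra to provide, while
callers that only have a stream are not forced through the filesystem.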

Oh, and of course libs have bugs too. You either report them and patiently
wait for a fix, or fix them yourself. E.g. ffmpeg may crash on some malformed
or exotic file, and that isn't a big problem for the majority of its user
base (redownload the file, delete it, open it with another tool). A crashing
analyzer is very bad for Nepomuk.

> > What has this duplication of effort accomplished so far? And what happens
> > if or hopefully when Nepomuk outgrows this file-based sandbox?
> 
> The duplication of effort has been quite small.
> 
> Currently all of the indexing code in Nepomuk, which is doing 80% of
> Strigi's job, is about 1400 lines of code. In comparison the code required
> to just interface with Strigi in Nepomuk was a good 700 lines. Also, now
> with our 2 tier approach, Strigi would be giving us data which has already
> been pushed. One could remove that data and all, but it's just not
> something I want to do.

LSA indexers can be selectively enabled, so a 2- or X-tier approach has been
supported for ages but apparently not used.
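
A rough sketch of what selective enabling buys you for a tiered setup. This
is not LSA's configuration interface: the analyzer names, the tier lists, and
`run_tier` are all made up for illustration. The point is only that a later
tier runs a different subset and so does not re-emit data already pushed.

```python
from typing import Callable, Dict, Iterable


# Hypothetical analyzers; each returns metadata for a path.
def basic_info(path: str) -> dict:
    # Cheap tier: only what can be derived without opening the file.
    return {"name": path.rsplit("/", 1)[-1]}


def full_text(path: str) -> dict:
    # Expensive tier: placeholder for real content extraction.
    return {"text": "<expensive extraction would go here>"}


ALL_ANALYZERS: Dict[str, Callable[[str], dict]] = {
    "basic": basic_info,
    "fulltext": full_text,
}

# Each tier enables only a subset of the available analyzers.
TIER_1 = ["basic"]
TIER_2 = ["fulltext"]


def run_tier(path: str, enabled: Iterable[str]) -> dict:
    """Run only the enabled analyzers and merge their results."""
    result: dict = {}
    for name in enabled:
        result.update(ALL_ANALYZERS[name](path))
    return result


print(run_tier("/home/user/report.pdf", TIER_1))
print(run_tier("/home/user/report.pdf", TIER_2))
```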

As to interface code, the rdfindexer util from Strigi is definitely smaller
than 700 lines of code.

> I'm not sure when we will outgrow this file-based sandbox, but based on our
> current requirements, we do not need anything more than file handling. The
> other additional stuff that Strigi used to provide was just discarded.

I can definitely see at least one use case: Akonadi and providing metadata
for attachments. Yes, you can always download and store that 30 MB attachment
in a temp location and then do the file analysis, but IMAP4 was specifically
intended to avoid this.
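
As an illustration of why stream analysis matters here, consider a sketch
that pulls the dimensions of a PNG out of a chunked source and stops reading
as soon as it has enough bytes. The function name and the chunk iterator are
assumptions for the example and have nothing to do with Akonadi's or Strigi's
real interfaces. The only format fact relied on is that a PNG starts with an
8-byte signature followed by the IHDR chunk, so width and height sit within
the first 24 bytes.

```python
import struct
from typing import Iterable, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# signature (8) + IHDR length and type (8) + width and height (8)
HEADER_BYTES = 24


def png_size(chunks: Iterable[bytes]) -> Optional[Tuple[int, int]]:
    """Return (width, height), reading only as many chunks as needed."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        if len(buf) >= HEADER_BYTES:
            break  # enough data: stop pulling from the source
    if len(buf) < HEADER_BYTES:
        return None
    if not buf.startswith(PNG_SIGNATURE) or buf[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", buf[16:24])
    return width, height


# Simulate an attachment arriving in pieces; the bulk is never requested.
header = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">II", 640, 480) + b"\x08\x02\x00\x00\x00")
source = iter([header[:10], header[10:], b"x" * 100])
print(png_size(source))
```

A file-only analyzer would need the whole attachment on disk before it could
answer the same question.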

It's a rather bad idea to design frameworks based on immediate requirements. 
It's an ok approach for a quick and dirty hack or a tool, but a strategic 
mistake for a framework.

-- Evgeny