[Nepomuk] [RFC] File Indexers

Wed Mar 20 18:28:19 UTC 2013

On Wednesday 20 March 2013 21:41:12 Vishesh Handa wrote:
> On Wed, Mar 20, 2013 at 8:52 PM, <phreedom at yandex.ru> wrote:
> > On Wednesday 20 March 2013 19:57:45 Vishesh Handa wrote:
> > > On Wed, Mar 20, 2013 at 7:39 PM, <phreedom at yandex.ru> wrote:
> > > > On Tuesday 19 March 2013 23:35:42 Vishesh Handa wrote:
> > > > > As you guys might remember, we moved away from Strigi for the
> > > > > 4.10 release. Our solution, however, still does not support any
> > > > > document formats apart from PDF. We need to change that and
> > > > > support other formats. There are 2 possible ways to go about
> > > > > this -
> > > > > 
> > > > > 1. We use Okular which supports a number of popular formats
> > > > > 2. We write our own indexers by using the relevant library.
> > > > 
> > > > I know I risk starting a flamewar, or more likely, there's no
> > > > risk, and instead a 100% guarantee, but:
> > > 
> > > Not really. It was mostly just a decision taken by me.
> > > 
> > > >   3. Use libStreamAnalyzer.
> > > > 
> > > > Take a look back at how many tiny issues and corner cases had to
> > > > be fixed so far, how many lib quirks had to be accounted for?
> > > > This was also the most significant source of troubles for
> > > > libstreamanalyzer.
> > > 
> > > The main reason I'm against this is that Strigi does not have a
> > > maintainer. Bugs keep cropping up - it doesn't handle all kinds of
> > > odf files, docs files, etc. I do not want to have to fix them.
> > 
> > But now Nepomuk file indexer needs a maintainer.
> 
> I'm willing to maintain them. In fact I'm even willing to do the Okular
> code splitting, it'll just take time, and it might be better to focus on
> other things. Hence this thread asking for opinions.
> 
> > > Also, we're fundamentally duplicating work. Libraries already exist
> > > to parse those file formats, and they are actively being used all
> > > across kde. We can just reuse those libraries instead of having our
> > > own parsers, and maintaining them.
> > 
> > Which was never a problem for lsa, eg ffmpeg plugin. No one
> > volunteered to write an Okular plugin or massage TagLib people into
> > making public their stream-based api, which is used internally and
> > wrapped by the file-based public api.
> > In fact, the plugin architecture was intended to allow kde apps and
> > libs to ship analyzers based on their format-specific libs.
> 
> Making taglib work with streams = a lot more work.

In fact, all it takes is installing one of their .h files and developers 
promising to not break it too much.

> Similarly, making Okular
> work with streams would have also been quite hard. The only thing happening
> right now is that the UI parts from Okular are being split.
> 
> Writing plugins in the case of lsa was never simple. There is virtually no
> documentation, you have to register all these fields and what not.

The documentation might be a bit outdated, but it still describes the overall 
architecture rather well. For specific examples, there are many analyzers, 
including trivial ones. Registering fields etc. is an api that predates nepomuk. 
Analyzers can output triples directly. This is also a rather old api.

> It is not a simple job in comparison to writing a Nepomuk file-indexing
> one.

For EndAnalyzers, which is what you usually end up writing, the job is to 
write 2 functions: one does a quick mimetype-like detection, and another does 
everything else (read data, parse, emit triples or return an error code). 
Everything else is boilerplate that is copied and pasted between analyzers 
pretty much unchanged.
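
To make the two-function shape concrete, here is a rough sketch in Python rather than C++. It is not the real lsa API: the class, the method names `check_header` and `analyze`, the `emit` callback, the 0/-1 return codes and the `width`/`height` predicate names are all made up for illustration. Only the PNG header layout is real.

```python
import io

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


class PngDimensionsAnalyzer:
    """Hypothetical end analyzer; names are illustrative, not Strigi's API."""

    name = "PngDimensionsAnalyzer"

    def check_header(self, header: bytes) -> bool:
        # Function 1: cheap mimetype-like detection on the first bytes only.
        return header.startswith(PNG_MAGIC)

    def analyze(self, subject: str, stream, emit) -> int:
        # Function 2: read data, parse it, emit triples.
        # Returns 0 on success and -1 on error (assumed convention).
        data = stream.read(24)
        if len(data) < 24 or data[12:16] != b"IHDR":
            return -1
        width = int.from_bytes(data[16:20], "big")
        height = int.from_bytes(data[20:24], "big")
        emit((subject, "width", width))
        emit((subject, "height", height))
        return 0


def run_analyzers(subject: str, data: bytes, analyzers) -> list:
    """Offer the header to each analyzer, run the first one that succeeds."""
    triples = []
    header = data[:1024]
    for analyzer in analyzers:
        if analyzer.check_header(header):
            if analyzer.analyze(subject, io.BytesIO(data), triples.append) == 0:
                break
    return triples
```

Feeding `run_analyzers` the first 24 bytes of a PNG yields two triples for the subject, while anything that fails `check_header` is never parsed at all.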

> Also, it's not just about how Strigi was designed or how many plugins it
> has. Maintaining about 1500 lines of well written Qt based code is a lot
> simpler for me. And considering that I'm dealing with the bug reports,
> unhappy users, and constant stream of "Nepomuk sucks", I think it is
> reasonable for me to want to fix that. My options are 1. fixing strigi or
> 2. building my own. I chose building my own, as it is a lot simpler and I
> can reuse other libraries.

It is still simpler because:
 * you are yet to match lsa coverage
 * lsa has functionality that you don't plan to use, but you don't have to fix 
bugs in it either. It lives in separate files you simply have no reason to look 
at. More about this below.

> > Oh, and of course libs have bugs too. You either report them and
> > patiently wait for a fix, or fix it yourself. Eg ffmpeg may crash on
> > some malformed or exotic file, and it isn't a big problem for the
> > majority of its user base (redownload the file, delete it, open with
> > another tool). A crashing analyzer is very bad for Nepomuk.
> 
> Yes. Libraries have bugs, but if the library is well used, the bugs will be
> prevalent in other applications as well, and will have to be fixed. Taglib
> is heavily used, if there is a bug, it will be noticed by many people.

I was specifically talking about priorities. It's not that the bug can't be 
noticed. It's that its severity for pretty much all ffmpeg users can be quite 
different from Nepomuk, thus you can often end up in a position of having to 
either fix it yourself or have users complain for months.

> > > > What has this duplication of effort accomplished so far? And what
> > > > happens if or hopefully when Nepomuk outgrows this file-based
> > > > sandbox?
> > > 
> > > The duplication of effort has been quite small.
> > > 
> > > Currently all of the indexing code in Nepomuk which is doing 80% of
> > > Strigi's job is about 1400 lines of code. In comparison the code
> > > required to just interface with Strigi in Nepomuk was a good 700
> > > lines. Also, now with our 2 tier approach, Strigi would be giving
> > > us data which has already been pushed. One could remove that data
> > > and all, but it's just not something I want to do.
> > 
> > LSA indexers can be selectively enabled, so a 2 or X tier approach
> > has been supported for ages but apparently not used.
> 
> I know. It's just a lot more effort. I've always said that Strigi is a lot
> more powerful than our solution. Our solution is just more maintainable for
> me.
> 
> > As to interface code, rdfindexer util from strigi is definitely
> > smaller than 700 lines of code
> 
> You're missing the point. Even if it just took us some 300 lines of code to
> interface with Strigi, when fixing bugs one has to deal with the additional
> Strigi code base which is by no means small. The entire libstreams +
> libstreamanalyzer is a good 30k. That's almost as big as nepomuk-core.

wc -l says libstreams is 7562 and libstreamanalyzer is 8142.
During the year or so that I spent actively maintaining lsa, I found and fixed 
just 1 sneaky bug in libstreams. It's very old and mature code, as close to 
bug-free as possible in practice. Your concern is lsa only and, if you decide 
that you are fine with ffmpeg handling mp3, you don't care about strigi's 
builtin mp3 analyzer, same for flac, same for pdf, etc.

In fact, if you subtract the size of the analyzers which are overridden by 
ffmpeg and (hypothetical) okular, you'll easily end up with 4k lines, and what 
remains contains a lot of c++ boilerplate and copyright/license notices for a 
bunch of trivial analyzers, which simply doesn't compare in complexity to 
nepomuk-core.

When your implementation is finished, it's quite likely to get to about the 
same size, which isn't at all surprising, because the only real difference is 
that your analyzers take a file name, and lsa ones take a stream object.
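
A toy Python sketch of that difference, not code from either project (all function names here are invented for illustration). A stream-based analyzer is trivially usable on a file, while a file-based one needs in-memory data spilled to a temporary file first:

```python
import io
import os
import tempfile


def count_lines_from_stream(stream) -> int:
    """Stream-based analyzer: consumes any binary file-like object."""
    return sum(1 for _ in stream)


def count_lines_from_path(path: str) -> int:
    """File-based analyzer: same logic, but it needs a name on disk."""
    with open(path, "rb") as f:
        return count_lines_from_stream(f)


def analyze_bytes_with_path_analyzer(data: bytes, path_analyzer) -> int:
    """Run a file-based analyzer on in-memory data via a temporary file."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return path_analyzer(path)
    finally:
        os.remove(path)
```

Going from stream to file costs one `open()` call. Going from file to stream costs a full copy of the data to disk, which is the trade-off that matters for things like mail attachments.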

> > > I'm not sure when we will outgrow this file-based sandbox, but based
> > > on our current requirements, we do not need anything more than file
> > > handling. The other additional stuff that Strigi used to provide was
> > > just discarded.
> > 
> > I can definitely see at least 1 use case: akonadi and providing
> > metadata for attachments. Yes, you can always download and store that
> > 30 MB attachment to a temp location, do the file analysis, but imap4
> > was specifically intended to avoid this.
> 
> When Strigi was being used - The entire attachment was being streamed into
> the nepomukindexer which would stream it into strigi and then it would be
> indexed. This is no different than storing it in /tmp/ and calling this
> file based indexer.
> 
> If there is a better way of doing this - I'm willing to listen.

I definitely saw 700M videos being indexed far faster than they could be read 
from disk. Either way it was an analyzer for that specific format not 
respecting the data transfer limit or not having the limit set properly. TLDR: 
a bug or a feature.
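
For what a transfer limit could look like in principle, here is a minimal Python sketch. `LimitedStream` and `sniff_container` are hypothetical names, not lsa classes, and the 12-byte RIFF/AVI check is only a stand-in for a real analyzer:

```python
import io


class LimitedStream:
    """Hypothetical wrapper that stops serving data after `limit` bytes."""

    def __init__(self, stream, limit: int):
        self._stream = stream
        self._remaining = limit
        self.bytes_read = 0

    def read(self, size: int = -1) -> bytes:
        if self._remaining <= 0:
            return b""
        if size < 0 or size > self._remaining:
            size = self._remaining
        chunk = self._stream.read(size)
        self._remaining -= len(chunk)
        self.bytes_read += len(chunk)
        return chunk


def sniff_container(stream) -> str:
    """Toy analyzer that only needs the first 12 bytes of the stream."""
    head = stream.read(12)
    if head[:4] == b"RIFF" and head[8:12] == b"AVI ":
        return "video/x-msvideo"
    return "application/octet-stream"
```

An analyzer that behaves well pulls only the bytes it needs, so wrapping a 700M source in `LimitedStream(source, 4096)` costs at most 4096 bytes of transfer regardless of the file size.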

> > It's a rather bad idea to design frameworks based on immediate
> > requirements.
> > It's an ok approach for a quick and dirty hack or a tool, but a
> > strategic
> > mistake for a framework.
> 
> In my roadmap there are no requirements which need the stream based
> analyzer. The deal with imap4 isn't perfect, but it will still work
> reasonably well. I'm not aiming for perfection over here.
> 
> Being able to index files in archives is nice, but not something I'm
> willing to put in that much effort for.

Considering that Nepomuk-KDE doesn't have any viable competition right now, 
not that many features seem critical.

However, back in the days of Nokia's FOSS dive, IMAP4 handling would make or 
break the implementation, because the priorities on a mobile platform are 
quite different. Yes, of course, Nokia is dead so for now it doesn't matter 
that much, but one day a "real" linux distro is going to have another stab at 
the mobile market, and imap4 (and other network protocols which are typical 
for a storage-constrained device) will again be relevant, along with other 
things like the performance of lsa's builtin analyzers.