[Nepomuk] [RFC] File Indexers

Wed Mar 20 19:20:30 UTC 2013

On Wed, Mar 20, 2013 at 11:58 PM, <phreedom at yandex.ru> wrote:

> On Wednesday 20 March 2013 21:41:12 Vishesh Handa wrote:
> > On Wed, Mar 20, 2013 at 8:52 PM, <phreedom at yandex.ru> wrote:
> > > On Wednesday 20 March 2013 19:57:45 Vishesh Handa wrote:
> > > > On Wed, Mar 20, 2013 at 7:39 PM, <phreedom at yandex.ru> wrote:
> > > > > On Tuesday 19 March 2013 23:35:42 Vishesh Handa wrote:
> > > > > > As you guys might remember, we moved away from Strigi for the
> > > > > > 4.10 release. Our solution, however, still does not support
> > > > > > any document formats apart from PDF. We need to change that
> > > > > > and support other formats. There are 2 possible ways to go
> > > > > > about this -
> > > > > >
> > > > > > 1. We use Okular which supports a number of popular formats
> > > > > > 2. We write our own indexers by using the relevant library.
> > > > >
> > > > > I know I risk starting a flamewar, or more likely, there's no
> > > > > risk, and instead a 100% guarantee, but:
> > > >
> > > > Not really. It was mostly just a decision taken by me.
> > > >
> > > > >   3. Use libStreamAnalyzer.
> > > > >
> > > > > Take a look back at how many tiny issues and corner cases had to
> > > > > be fixed so far, how many lib quirks had to be accounted for?
> > > > > This was also the most significant source of troubles for
> > > > > libstreamanalyzer.
> > > >
> > > > The main reason I'm against this is Strigi does not have a
> > > > maintainer. Bugs keep cropping up - It doesn't handle all kinds of
> > > > odf files, docs files, etc. I do not want to have to fix them.
> > >
> > > But now Nepomuk file indexer needs a maintainer.
> >
> > I'm willing to maintain them. In fact I'm even willing to do the Okular
> > code splitting, it'll just take time, and it might be better to focus on
> > other things. Hence this thread asking for opinions.
> >
> > > > Also, we're fundamentally duplicating work. Libraries already exist
> > > > to parse those file formats, and they are actively being used all
> > > > across kde. We can just reuse those libraries instead of having our
> > > > own parsers, and maintaining them.
> > >
> > > Which was never a problem for lsa, eg ffmpeg plugin. No one
> > > volunteered to write an Okular plugin or massage TagLib people into
> > > making public their stream-based api, which is used internally and
> > > wrapped by the file-based public api.
> > > In fact, the plugin architecture was intended to allow kde apps and
> > > libs to ship analyzers based on their format-specific libs.
> >
> > Making taglib work with streams = a lot more work.
>
> In fact, all it takes is installing one of their .h files and developers
> promising to not break it too much.
>
> > Similarly, making Okular work with streams would have also been quite
> > hard. The only thing happening right now is that the UI parts from
> > Okular are being split.
> >
> > Writing plugins in the case of lsa was never simple. There is virtually
> > no documentation, you have to register all these fields and what not.
>
> The documentation might be a bit outdated, but it still describes the
> overall architecture rather well. For specific examples, there are many
> analyzers including trivial ones. Registering fields etc is an api that
> predates nepomuk. Analyzers can output triples directly. This is also a
> rather old api.
>
> > It is not a simple job in comparison to writing a Nepomuk file-indexing
> > one.
>
> For EndAnalyzers, which is what you usually end up writing, the job is to
> write 2 functions: one does a quick mimetype-like detection, and another
> does everything else (read data, parse, emit triples or return an error
> code). Everything else is boilerplate that is copied and pasted between
> analyzers pretty much unchanged.
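
The two-function shape described above can be sketched roughly as follows. This is a minimal illustration in Python, not Strigi's actual C++ API: the names `check_header`, `analyze`, `Result`, and the predicate strings `width` and `height` are all hypothetical, and PNG is used only because its header is easy to parse.

```python
from dataclasses import dataclass, field
from typing import BinaryIO, List, Tuple

Triple = Tuple[str, str, str]

# The 8-byte signature every PNG file starts with.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


@dataclass
class Result:
    """Hypothetical sink that collects (subject, predicate, object) triples."""
    subject: str
    triples: List[Triple] = field(default_factory=list)

    def emit(self, predicate: str, obj: str) -> None:
        self.triples.append((self.subject, predicate, obj))


def check_header(header: bytes) -> bool:
    """Function 1: quick mimetype-like detection on the first few bytes."""
    return header.startswith(PNG_MAGIC)


def analyze(stream: BinaryIO, result: Result) -> int:
    """Function 2: read data, parse, emit triples.

    Returns 0 on success and -1 on error (hypothetical error convention).
    """
    # Signature (8 bytes) + chunk length (4) + chunk type (4) + width (4) + height (4).
    data = stream.read(24)
    if len(data) < 24 or not data.startswith(PNG_MAGIC) or data[12:16] != b"IHDR":
        return -1
    result.emit("width", str(int.from_bytes(data[16:20], "big")))
    result.emit("height", str(int.from_bytes(data[20:24], "big")))
    return 0
```

Note that `analyze` takes a stream object rather than a file name, which is the distinction the thread keeps coming back to.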
>
> > Also, it's not just about how Strigi was designed or how many plugins it
> > has. Maintaining about 1500 lines of well written Qt based code is a lot
> > simpler for me. And considering that I'm dealing with the bug reports,
> > unhappy users, and constant stream of "Nepomuk sucks", I think it is
> > reasonable for me to want to fix that. My options are 1. fixing strigi or
> > 2. building my own. I chose building my own, as it is a lot simpler and I
> > can reuse other libraries.
>
> It is still simpler because:
>  * you are yet to match lsa coverage
>  * lsa has functionality that you don't plan to use, but you don't have to
>    fix bugs in it either. It lies in a separate file you simply have no
>    reason to look at. More about this below.
>
> > > Oh, and of course libs have bugs too. You either report them and
> > > patiently wait for a fix, or fix it yourself. Eg ffmpeg may crash on
> > > some malformed or exotic file, and it isn't a big problem for the
> > > majority of its user base (redownload the file, delete it, open with
> > > another tool). A crashing analyzer is very bad for Nepomuk.
> >
> > Yes. Libraries have bugs, but if the library is well used, the bugs will
> > be prevalent in other applications as well, and will have to be fixed.
> > Taglib is heavily used, if there is a bug, it will be noticed by many
> > people.
>
> I was specifically talking about priorities. It's not that the bug can't be
> noticed. It's that its severity for pretty much all ffmpeg users can be
> quite different from Nepomuk, thus you can often end up in a position of
> having to either fix it yourself or have users complain for months.
>
> > > > > What has this duplication of effort accomplished so far? And what
> > > > > happens if or hopefully when Nepomuk outgrows this file-based
> > > > > sandbox?
> > > >
> > > > The duplication of effort has been quite small.
> > > >
> > > > Currently all of the indexing code in Nepomuk which is doing 80% of
> > > > Strigi's job is about 1400 lines of code. In comparison the code
> > > > required to just interface with Strigi in Nepomuk was a good 700
> > > > lines. Also, now with our 2 tier approach, Strigi would be giving
> > > > us data which has already been pushed. One could remove that data
> > > > and all, but it's just not something I want to do.
> > >
> > > LSA indexers can be selectively enabled, so 2 or X tier approach has
> > > been supported for ages but apparently not used.
> >
> > I know. It's just a lot more effort. I've always said that Strigi is a
> > lot more powerful than our solution. Our solution is just more
> > maintainable for me.
> >
> > > As to interface code, rdfindexer util from strigi is definitely smaller
> > > than 700 lines of code
> >
> > You're missing the point. Even if it took us just some 300 lines of code
> > to interface with Strigi, when fixing bugs one has to deal with the
> > additional Strigi code base which is by no means small. The entire
> > libstreams + libstreamanalyzer is a good 30k. That's almost as big as
> > nepomuk-core.
>
> wc -l says libstreams is 7562 and libstreamanalyzer is 8142.
>

vlap:~/kde/src/strigi/libstreamanalyzer $ cloc *
     235 text files.
     231 unique files.
      31 files ignored.

http://cloc.sourceforge.net v 1.56  T=1.0 s (201.0 files/s, 28933.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                             95           1547           3432          15967
C/C++ Header                    91            706           3249           3499
CMake                           15             68             48            417
-------------------------------------------------------------------------------
SUM:                           201           2321           6729          19883
-------------------------------------------------------------------------------

vlap:~/kde/src/strigi/libstreams $ cloc *
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                             62            451           1729           5519
HTML                            18            197             44           2851
C/C++ Header                    45            415           1920           1800
C                                1             95             28            884
CMake                            5             39             27            173
XML                              2             13              0              5
-------------------------------------------------------------------------------
SUM:                           133           1210           3748          11232
-------------------------------------------------------------------------------

> During a year or so that I spent actively maintaining lsa, I found and
> fixed just 1 sneaky bug in libstreams.

Which year was that? Around 4.7 we introduced the data-management service,
which checked the data before pushing it. That revealed a number of bugs in
Strigi. The only fixes I saw were from Sebastian and me.

> It's very old and mature code, as close to bugfree as possible in
> practice. Your concern is lsa only and, if you decide that you are fine
> with ffmpeg handling mp3, you don't care about strigi's builtin mp3
> analyzer, same for flac, same for pdf etc etc.
>

> In fact, if you subtract the size of the analyzers which are overridden by
> ffmpeg and (hypothetical) okular, you'll easily end up with 4k lines and
> what remains contains a lot of c++ boilerplate and copyright/license
> notices for a bunch of trivial analyzers, which simply doesn't compare in
> complexity to nepomuk-core.
>

It would still be a lot more complex than the nepomuk-core solution.

> When your implementation is finished, it's quite likely to get to about the
> same size, which isn't at all surprising, because the only real difference
> is that your analyzers take a file name, and lsa ones take a stream object.
>

Really? If I implement the okular analyzer it will increase the code by
about 200 lines (the analyzer was implemented and then later reverted). I
don't see this scaling up to that size.

> > > > I'm not sure when we will outgrow this file-based sandbox, but based
> > > > on our current requirements, we do not need anything more than file
> > > > handling. The other additional stuff that Strigi used to provide was
> > > > just discarded.
> > >
> > > I can definitely see at least 1 use case: akonadi and providing
> > > metadata for attachments. Yes, you can always download and store that
> > > 30 MB attachment to a temp location, do the file analysis, but imap4
> > > was specifically intended to avoid this.
> >
> > When Strigi was being used - The entire attachment was being streamed
> > into the nepomukindexer which would stream it into strigi and then it
> > would be indexed. This is no different than storing it in /tmp/ and
> > calling this file based indexer.
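
The "store it in /tmp/ and call the file based indexer" route can be sketched as below. It is a minimal illustration under stated assumptions, not Nepomuk code: `index_stream_via_tempfile` and the `index_file` callback are hypothetical names, with `index_file` standing in for whatever file-based indexer is available.

```python
import os
import tempfile
from typing import Any, BinaryIO, Callable


def index_stream_via_tempfile(
    stream: BinaryIO,
    index_file: Callable[[str], Any],
    chunk_size: int = 64 * 1024,
) -> Any:
    """Spool a stream (e.g. a mail attachment) to a temp file, then index it.

    `index_file` is a hypothetical file-based indexer taking a path.
    """
    # Copy in chunks so the whole attachment never has to sit in memory.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            tmp.write(chunk)
        path = tmp.name
    try:
        return index_file(path)
    finally:
        # Remove the temp copy whether or not indexing succeeded.
        os.remove(path)
```

Either way the full attachment has to be transferred before indexing can start, which is the cost the stream-based argument in this thread is about.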
> >
> > If there is a better way of doing this - I'm willing to listen.
>
> I definitely saw 700M videos being indexed far faster than they could be
> read from disk.
>
> Either way it was an analyzer for that specific format not respecting the
> data transfer limit or not having the limit set properly. TLDR: a bug or a
> feature.
>
> > > It's a rather bad idea to design frameworks based on immediate
> > > requirements. It's an ok approach for a quick and dirty hack or a
> > > tool, but a strategic mistake for a framework.
> >
> > In my roadmap there are no requirements which need the stream based
> > analyzer. The deal with imap4 isn't perfect, but it will still work
> > reasonably well. I'm not aiming for perfection over here.
> >
> > Being able to index files in archives is nice, but not something I'm
> > willing to put in that much effort for.
>
> Considering that Nepomuk-KDE doesn't have any viable competition right now,
> not that many features seem critical.
>

It was at risk of being thrown away. After 4 years of being shipped we
currently just support file index + rating + tagging. If you ask me - that
is quite sad. Nepomuk has the capability of doing many great things, and
yet over the last 4 years nothing has happened.

> However, back in the days of Nokia's FOSS dive, IMAP4 handling would make
> or break the implementation, because the priorities on a mobile platform
> are quite different. Yes, of course, Nokia is dead so for now it doesn't
> matter that much, but one day a "real" linux distro is going to have
> another stab at the mobile market and imap4 (and other network protocols
> which are typical for a storage-constrained device) will again be relevant
> among the other things like performance of lsa's builtin analyzers.
>

Look, it's simple. If someone is willing to fix Strigi and all the use
cases, I might consider it. But that clearly hasn't been the case. I am not
willing to fix Strigi bugs. It's too much effort for me.

Can we end this discussion? I am not moving back to Strigi even if there
are some technical advantages. It's unmaintained, non-Qt, doesn't follow
any of KDE coding styles. It's too much of a burden on me.

-- 
Vishesh Handa