[Nepomuk] [RFC] File Indexers

Wed Mar 20 16:11:12 UTC 2013

On Wed, Mar 20, 2013 at 8:52 PM, <phreedom at yandex.ru> wrote:

> On Wednesday 20 March 2013 19:57:45 Vishesh Handa wrote:
> > On Wed, Mar 20, 2013 at 7:39 PM, <phreedom at yandex.ru> wrote:
> > > On Tuesday 19 March 2013 23:35:42 Vishesh Handa wrote:
> > > > As you guys might remember, we moved away from Strigi for the 4.10
> > > > release. Our solution, however, still does not support any document
> > > > formats apart from PDF. We need to change that and support other
> > > > formats. There are 2 possible ways to go about this -
> > > >
> > > > 1. We use Okular which supports a number of popular formats
> > > > 2. We write our own indexers by using the relevant library.
> > >
> > > I know I risk starting a flamewar, or more likely, there's no risk, and
> > > instead a 100% guarantee, but:
> >
> > Not really. It was mostly just a decision taken by me.
> >
> > >   3. Use libStreamAnalyzer.
> > >
> > > Take a look back at how many tiny issues and corner cases had to be
> > > fixed so far, how many lib quirks had to be accounted for? This was
> > > also the most significant source of troubles for libstreamanalyzer.
> >
> > The main reason I'm against this is that Strigi does not have a
> > maintainer. Bugs keep cropping up - it doesn't handle all kinds of odf
> > files, docs files, etc. I do not want to have to fix them.
>
> But now the Nepomuk file indexer needs a maintainer.
>

I'm willing to maintain them. In fact I'm even willing to do the Okular
code splitting, it'll just take time, and it might be better to focus on
other things. Hence this thread asking for opinions.

>
> > Also, we're fundamentally duplicating work. Libraries already exist to
> > parse those file formats, and they are actively being used all across
> > KDE. We can just reuse those libraries instead of having our own
> > parsers, and maintaining them.
>
> Which was never a problem for lsa, e.g. the ffmpeg plugin. No one
> volunteered to write an Okular plugin or massage the TagLib people into
> making public their stream-based api, which is used internally and wrapped
> by the file-based public api.
> In fact, the plugin architecture was intended to allow KDE apps and libs
> to ship analyzers based on their format-specific libs.
>

Making TagLib work with streams would be a lot more work. Similarly, making
Okular work with streams would have also been quite hard. The only thing
happening right now is that the UI parts of Okular are being split.

Writing plugins in the case of lsa was never simple. There is virtually no
documentation, you have to register all these fields and what not. It is
not a simple job in comparison to writing a Nepomuk file-indexing one.

Also, it's not just about how Strigi was designed or how many plugins it
has. Maintaining about 1500 lines of well written Qt based code is a lot
simpler for me. And considering that I'm dealing with the bug reports,
unhappy users, and constant stream of "Nepomuk sucks", I think it is
reasonable for me to want to fix that. My options are 1. fixing strigi or
2. building my own. I chose building my own, as it is a lot simpler and I
can reuse other libraries.

> Oh, and of course libs have bugs too. You either report them and patiently
> wait for a fix, or fix them yourself. E.g. ffmpeg may crash on some
> malformed or exotic file, and it isn't a big problem for the majority of
> its user base (redownload the file, delete it, open with another tool).
> A crashing analyzer is very bad for Nepomuk.
>

Yes. Libraries have bugs, but if the library is well used, the bugs will
show up in other applications as well, and will have to be fixed. TagLib
is heavily used; if there is a bug, it will be noticed by many people.

>
> > > What has this duplication of effort accomplished so far? And what
> > > happens if, or hopefully when, Nepomuk outgrows this file-based sandbox?
> >
> > The duplication of effort has been quite small.
> >
> > Currently all of the indexing code in Nepomuk, which is doing 80% of
> > Strigi's job, is about 1400 lines of code. In comparison, the code
> > required to just interface with Strigi in Nepomuk was a good 700 lines.
> > Also, now with our 2 tier approach, Strigi would be giving us data which
> > has already been pushed. One could remove that data and all, but it's
> > just not something I want to do.
>
> LSA indexers can be selectively enabled, so 2 or X tier approach has been
> supported for ages but apparently not used.
>

I know. It's just a lot more effort. I've always said that Strigi is a lot
more powerful than our solution. Our solution is just more maintainable for
me.

> As to interface code, the rdfindexer util from Strigi is definitely
> smaller than 700 lines of code.
>

You're missing the point. Even if it took us just some 300 lines of code to
interface with Strigi, when fixing bugs one has to deal with the additional
Strigi code base, which is by no means small. The entire libstreams +
libstreamanalyzer is a good 30k lines. That's almost as big as nepomuk-core.

>
> > I'm not sure when we will outgrow this file-based sandbox, but based on
> > our current requirements, we do not need anything more than file
> > handling. The other additional stuff that Strigi used to provide was
> > just discarded.
>
> I can definitely see at least 1 use case: akonadi and providing metadata
> for attachments. Yes, you can always download and store that 30 MB
> attachment to a temp location, do the file analysis, but imap4 was
> specifically intended to avoid this.
>

When Strigi was being used, the entire attachment was streamed into the
nepomukindexer, which would stream it into Strigi, and then it would be
indexed. This is no different from storing it in /tmp/ and calling this
file-based indexer.

If there is a better way of doing this - I'm willing to listen.
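
For illustration, the temp-file route described above (store the attachment
in a temporary location, then call a file-based indexer on it) could look
roughly like the sketch below. It is a minimal sketch in Python, not Nepomuk
code: `index_stream` and the `index_file` callback are hypothetical names,
and the callback merely stands in for whatever file-based indexer is used.

```python
import io
import os
import shutil
import tempfile


def index_stream(stream, index_file, suffix=""):
    """Spool a readable binary stream to a temp file and index it by path.

    `index_file` is a hypothetical callback that takes a file path; it
    stands in for a file-based indexer and is not a Nepomuk API.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            # Copy in chunks so the attachment is never held fully in memory.
            shutil.copyfileobj(stream, tmp)
        return index_file(path)
    finally:
        # Remove the spooled copy whether or not indexing succeeded.
        os.remove(path)


if __name__ == "__main__":
    # Stand-in attachment and stand-in "indexer" (just reports the size).
    attachment = io.BytesIO(b"example attachment bytes")
    print(index_stream(attachment, os.path.getsize, suffix=".pdf"))
```

The cost is the same as in the old setup: the whole attachment still has to
be fetched once, it just lands on disk instead of in a pipe.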

> It's a rather bad idea to design frameworks based on immediate
> requirements. It's an ok approach for a quick and dirty hack or a tool,
> but a strategic mistake for a framework.
>

In my roadmap there are no requirements which need the stream based
analyzer. The deal with imap4 isn't perfect, but it will still work
reasonably well. I'm not aiming for perfection over here.

Being able to index files in archives is nice, but not something I'm
willing to put in that much effort for.

> -- Evgeny
>

-- 
Vishesh Handa