<br><br><div class="gmail_quote">On Wed, Mar 20, 2013 at 11:58 PM,  <span dir="ltr"><<a href="mailto:phreedom@yandex.ru" target="_blank">phreedom@yandex.ru</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Wednesday 20 March 2013 21:41:12 Vishesh Handa wrote:<br>

> On Wed, Mar 20, 2013 at 8:52 PM, <<a href="mailto:phreedom@yandex.ru">phreedom@yandex.ru</a>> wrote:<br>

> > On Wednesday 20 March 2013 19:57:45 Vishesh Handa wrote:<br>

> > > On Wed, Mar 20, 2013 at 7:39 PM, <<a href="mailto:phreedom@yandex.ru">phreedom@yandex.ru</a>> wrote:<br>

> > > > On Tuesday 19 March 2013 23:35:42 Vishesh Handa wrote:<br>

> > > > > As you guys might remember, we moved away from Strigi for the 4.10<br>

> > > > > release. Our solution, however, still does not support any document<br>

> > > > > formats apart from PDF. We need to change that and support other<br>

> > > > > formats. There are 2 possible ways to go about this -<br>

> > > > ><br>

> > > > > 1. We use Okular which supports a number of popular formats<br>

> > > > > 2. We write our own indexers by using the relevant library.<br>

> > > ><br>

> > > > I know I risk starting a flamewar, or more likely, there's no<br>

> > > > risk, and instead a 100% guarantee, but:<br>

> > > Not really. It was mostly just a decision taken by me.<br>

> > ><br>

> > > >   3. Use libStreamAnalyzer.<br>

> > > ><br>

> > > > Take a look back at how many tiny issues and corner cases had to be<br>

> > > > fixed so far, how many lib quirks had to be accounted for? This was<br>

> > > > also the most significant source of troubles for libstreamanalyzer.<br>

> > ><br>

> > > The main reason I'm against this is Strigi does not have a<br>

> > > maintainer. Bugs keep cropping up - It doesn't handle all kinds of odf<br>

> > > files, docs files, etc. I do not want to have to fix them.<br>

> ><br>

> > But now Nepomuk file indexer needs a maintainer.<br>

><br>

> I'm willing to maintain them. In fact I'm even willing to do the Okular<br>

> code splitting, it'll just take time, and it might be better to focus on<br>

> other things. Hence this thread asking for opinions.<br>

><br>

> > > Also, we're fundamentally duplicating work. Libraries already exist to<br>

> > > parse those file formats, and they are actively being used all across<br>

> > > kde. We can just reuse those libraries instead of having our own<br>

> > > parsers, and maintaining them.<br>

> ><br>

> > Which was never a problem for lsa, eg ffmpeg plugin. No one volunteered<br>

> > to write an Okular plugin or massage TagLib people into making public<br>

> > their stream-based api, which is used internally and wrapped by the<br>

> > file-based public api.<br>

> > In fact, the plugin architecture was intended to allow kde apps and<br>

> > libs to ship analyzers based on their format-specific libs.<br>

><br>

> Making taglib work with streams = a lot more work.<br>

<br>

</div></div>In fact, all it takes is installing one of their .h files and developers<br>

promising to not break it too much.<br>

<div class="im"><br>

> Similarly, making Okular<br>

> work with streams would have also been quite hard. The only thing happening<br>

> right now is that the UI parts from Okular are being split.<br>

><br>

> Writing plugins in the case of lsa was never simple. There is virtually no<br>

> documentation, you have to register all these fields and what not.<br>

<br>

</div>The documentation might be a bit outdated, but it still described the overall<br>

architecture rather well. For specific examples, there are many analyzers<br>

including trivial ones. Registering fields etc is an api that predates nepomuk.<br>

Analyzers can output triples directly. This is also a rather old api.<br>

<div class="im"><br>

> It is not a simple job in comparison to writing a Nepomuk File-indexing one.<br>

<br>

</div>For EndAnalyzers, which are what you usually end up writing, the job is to write 2<br>

functions: one does a quick mimetype-like detection, and another does<br>

everything else (read data, parse, emit triples or return an error code).<br>

Everything else is boilerplate that is copied and pasted between analyzers<br>

pretty much unchanged.<br>
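<br>
As a rough illustration of that two-function shape, here is a minimal sketch in Python rather than C++. The class name, method names and the emit callback are all made up for the example and are not Strigi's actual API; the PNG header check is only a stand-in for "some format".<br>

```python
import io


class PngSizeAnalyzer:
    """Hypothetical analyzer; names and signatures are illustrative only."""

    SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def check_header(self, header):
        # Function 1: quick mimetype-like detection on the first few bytes.
        return header.startswith(self.SIGNATURE)

    def analyze(self, subject, stream, emit):
        # Function 2: everything else (read data, parse, emit triples,
        # or return an error code).
        data = stream.read(24)
        if len(data) != 24 or data[12:16] != b"IHDR":
            return -1
        emit(subject, "width", int.from_bytes(data[16:20], "big"))
        emit(subject, "height", int.from_bytes(data[20:24], "big"))
        return 0


# Usage: a fake 24-byte PNG prefix declaring a 640x480 image.
header = (PngSizeAnalyzer.SIGNATURE + (13).to_bytes(4, "big") + b"IHDR"
          + (640).to_bytes(4, "big") + (480).to_bytes(4, "big"))
triples = []
analyzer = PngSizeAnalyzer()
status = None
if analyzer.check_header(header):
    status = analyzer.analyze("file:///tmp/example.png", io.BytesIO(header),
                              lambda s, p, o: triples.append((s, p, o)))
```

Note that the analyzer only ever sees a stream object and never a file name, which is the difference being argued about in this thread.<br>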

<div class="im"><br>

> Also, it's not just about how Strigi was designed or how many plugins it<br>

> has. Maintaining about 1500 lines of well written Qt based code is a lot<br>

> simpler for me. And considering that I'm dealing with the bug reports,<br>

> unhappy users, and constant stream of "Nepomuk sucks", I think it is<br>

> reasonable for me to want to fix that. My options are 1. fixing strigi or<br>

> 2. building my own. I chose building my own, as it is a lot simpler and I<br>

> can reuse other libraries.<br>

<br>

</div>It is still simpler because:<br>

 * you are yet to match lsa coverage<br>

 * lsa has functionality that you don't plan to use, but you don't have to fix<br>

bugs in it either. It lives in a separate file that you simply have no reason to look<br>

at. More about this below.<br>

<div class="im"><br>

> > Oh, and of course libs have bugs too. You either report them and<br>

> > patiently wait for a fix, or fix it yourself. Eg ffmpeg may crash on<br>

> > some malformed or exotic file, and it isn't a big problem for the<br>

> > majority of its user base (redownload the file, delete it, open with<br>

> > another tool). A crashing analyzer is very bad for Nepomuk.<br>

><br>

> Yes. Libraries have bugs, but if the library is well used, the bugs will be<br>

> prevalent in other applications as well, and will have to be fixed. Taglib<br>

> is heavily used, if there is a bug, it will be noticed by many people.<br>

<br>

</div>I was specifically talking about priorities. It's not that the bug can't be<br>

noticed. It's that its severity for pretty much all ffmpeg users can be quite<br>

different from Nepomuk, thus you can often end up in a position of having to<br>

either fix it yourself or have users complain for months.<br>

<div><div class="h5"><br>

> > > > What has this duplication of effort accomplished so far? And what<br>

> > > > happens if or hopefully when Nepomuk outgrows this file-based<br>

> > > > sandbox?<br>

> > ><br>

> > > The duplication of effort has been quite small.<br>

> > ><br>

> > > Currently all of the indexing code in Nepomuk which is doing 80% of<br>

> > > Strigi's job is about 1400 lines of code. In comparison the code<br>

> > > required to just interface with Strigi in Nepomuk was a good 700<br>

> > > lines. Also, now with our 2 tier approach, Strigi would be giving<br>

> > > us data which has already been pushed. One could remove that data<br>

> > > and all, but it's just not something I want to do.<br>

> ><br>

> > LSA indexers can be selectively enabled, so a 2 or X tier approach has<br>

> > been supported for ages but apparently not used.<br>

><br>

> I know. It's just a lot more effort. I've always said that Strigi is a lot<br>

> more powerful than our solution. Our solution is just more maintainable for<br>

> me.<br>

><br>

> > As to interface code, rdfindexer util from strigi is definitely smaller<br>

> > than 700 lines of code<br>

><br>

> You're missing the point. Even if it just took us some 300 lines of code to<br>

> interface with Strigi. When fixing bugs one has to deal with the additional<br>

> Strigi code base which is by no means small. The entire libstreams +<br>

> libstreamanalyzer is a good 30k. That's almost as big as nepomuk-core.<br>

<br>

</div></div>wc -l says libstreams is 7562 and libstreamanalyzer is 8142.<br></blockquote><div><br>vlap:~/kde/src/strigi/libstreamanalyzer $ cloc *<br>     235 text files.<br>     231 unique files.                                          <br>

      31 files ignored.<br><br><a href="http://cloc.sourceforge.net">http://cloc.sourceforge.net</a> v 1.56  T=1.0 s (201.0 files/s, 28933.0 lines/s)<br>-------------------------------------------------------------------------------<br>

Language                     files          blank        comment           code<br>-------------------------------------------------------------------------------<br>C++                             95           1547           3432          15967<br>

C/C++ Header                    91            706           3249           3499<br>CMake                           15             68             48            417<br>-------------------------------------------------------------------------------<br>

SUM:                           201           2321           6729          19883<br>-------------------------------------------------------------------------------<br><br>vlap:~/kde/src/strigi/libstreams $ cloc *<br>-------------------------------------------------------------------------------<br>

Language                     files          blank        comment           code<br>-------------------------------------------------------------------------------<br>C++                             62            451           1729           5519<br>

HTML                            18            197             44           2851<br>C/C++ Header                    45            415           1920           1800<br>C                                1             95             28            884<br>

CMake                            5             39             27            173<br>XML                              2             13              0              5<br>-------------------------------------------------------------------------------<br>

SUM:                           133           1210           3748          11232<br>-------------------------------------------------------------------------------<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


During a year or so that I spent actively maintaining lsa, I found and fixed<br>

just 1 sneaky bug in libstreams.</blockquote><div><br>Which year was that? Around 4.7 we introduced the data-management service, which checked the data before pushing it. That revealed a number of bugs in Strigi. The only fixes I saw were from Sebastian and me.<br>

<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> It's very old and mature code, as close to<br>

bugfree as possible in practice. Your concern is lsa only and, if you decide<br>

that you are fine with ffmpeg handling mp3, you don't care about strigi's<br>

builtin mp3 analyzer, same for flac, same for pdf etc etc.<br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

In fact, if you subtract the size of the analyzers which are overridden by<br>

ffmpeg and (hypothetical) okular, you'll easily end up with 4k lines and what<br>

remains contains a lot of c++ boilerplate and copyright/license notices for a<br>

bunch of trivial analyzers, which simply doesn't compare in complexity to<br>

nepomuk-core.<br></blockquote><div><br>It would still be a lot more complex than the nepomuk-core solution.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


When your implementation is finished, it's quite likely to get to about the<br>

same size, which isn't at all surprising, because the only real difference is<br>

that your analyzers take a file name, and lsa ones take a stream object.<br></blockquote><div><br>Really? If I implement the okular analyzer it will increase the code by about 200 lines (the analyzer was implemented and then later reverted). I don't see this scaling up to that size.<br>

 <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">> > > I'm not sure when we will outgrow this file-based sandbox, but based<br>

> > > on our current requirements, we do not need anything more than file<br>

> > > handling. The other additional stuff that Strigi used to provide was<br>

> > > just discarded.<br>

> ><br>

> > I can definitely see at least 1 use case: akonadi and providing metadata<br>

> > for attachments. Yes, you can always download and store that 30 MB<br>

> > attachment to a temp location, do the file analysis, but imap4 was<br>

> > specifically intended to avoid this.<br>

><br>

> When Strigi was being used - The entire attachment was being streamed into<br>

> the nepomukindexer which would stream it into strigi and then it would be<br>

> indexed. This is no different than storing it in /tmp/ and calling this<br>

> file based indexer.<br>

><br>

> If there is a better way of doing this - I'm willing to listen.<br>

<br>

</div>I definitely saw 700M videos being indexed far faster than they could be read<br>

from disk. <br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Either way it was an analyzer for that specific format not<br>

respecting the data transfer limit or not having the limit set properly. TLDR:<br>

a bug or a feature.<br>
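<br>
One way to picture the data transfer limit mentioned here is a wrapper that caps how many bytes an analyzer may pull from its source, so that only the first part of a large attachment ever has to be fetched. A minimal Python sketch, assuming a hypothetical LimitedStream class (this is not the libstreams API, which is C++):<br>

```python
import io


class LimitedStream:
    """Hypothetical wrapper: caps the bytes an analyzer can read from a source."""

    def __init__(self, source, limit):
        self._source = source
        self._remaining = limit
        self.bytes_read = 0

    def read(self, size=-1):
        if size is not None and size >= 0:
            want = min(size, self._remaining)
        else:
            # A "read everything" request is clamped to the remaining budget.
            want = self._remaining
        chunk = self._source.read(want)
        self._remaining -= len(chunk)
        self.bytes_read += len(chunk)
        return chunk


# Usage: a 1000-byte "attachment" of which an analyzer may see only 64 bytes.
attachment = io.BytesIO(b"x" * 1000)
limited = LimitedStream(attachment, 64)
first = limited.read()     # asks for everything, gets the 64-byte budget
second = limited.read(10)  # budget is spent, so this returns b""
```

An analyzer that ignores such a cap, or is given one that is set too high, ends up pulling the whole file, which matches the "bug or feature" situation described above.<br>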

<div class="im"><br>

> > It's a rather bad idea to design frameworks based on immediate<br>

> > requirements. It's an ok approach for a quick and dirty hack or a tool,<br>

> > but a strategic mistake for a framework.<br>

><br>

> In my roadmap there are no requirements which need the stream based<br>

> analyzer. The deal with imap4 isn't perfect, but it will still work<br>

> reasonably well. I'm not aiming for perfection over here.<br>

><br>

> Being able to index files in archives is nice, but not something I'm<br>

> willing to put in that much effort for.<br>

<br>

</div>Considering that Nepomuk-KDE doesn't have any viable competition right now,<br>

not that many features seem critical.<br></blockquote><div><br>It was at risk of being thrown away. After 4 years of being shipped we currently just support file index + rating + tagging. If you ask me - that is quite sad. Nepomuk has the capability of doing many great things, and yet over the last 4 years nothing has happened.<br>

<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

However, back in the days of Nokia's FOSS dive, IMAP4 handling would make or<br>

break the implementation, because the priorities on a mobile platform are<br>

quite different. Yes, of course, Nokia is dead so for now it doesn't matter<br>

that much, but one day a "real" linux distro is going to have another stab at<br>

mobile market and imap4(and other network protocols which are typical for a<br>

storage-constrained device) will be again relevant among the other things like<br>

performance of lsa's builtin analyzers.<br>

</blockquote></div><br>Look, it's simple. If someone is willing to fix Strigi and all the use cases, I might consider it. But that clearly hasn't been the case. I am not willing to fix Strigi bugs. It's too much effort for me.<br>

<br>Can we end this discussion? I am not moving back to Strigi even if there are some technical advantages. It's unmaintained, non-Qt, doesn't follow any of KDE coding styles. It's too much of a burden on me.<br>

<br clear="all"><br>-- <br><span style="color:rgb(192,192,192)">Vishesh Handa</span><br>