<br><br><div class="gmail_quote">On Wed, Mar 20, 2013 at 8:52 PM, <span dir="ltr"><<a href="mailto:phreedom@yandex.ru" target="_blank">phreedom@yandex.ru</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5">On Wednesday 20 March 2013 19:57:45 Vishesh Handa wrote:<br>
> On Wed, Mar 20, 2013 at 7:39 PM, <<a href="mailto:phreedom@yandex.ru">phreedom@yandex.ru</a>> wrote:<br>
> > On Tuesday 19 March 2013 23:35:42 Vishesh Handa wrote:<br>
> > > As you guys might remember, we moved away from Strigi for the 4.10<br>
> > > release. Our solution, however, still does not support any document<br>
> > > formats apart from PDF. We need to change that and support other formats.<br>
> > > There are 2 possible ways to go about this -<br>
> > ><br>
> > > 1. We use Okular which supports a number of popular formats<br>
> > > 2. We write our own indexers by using the relevant library.<br>
> ><br>
> > I know I risk starting a flamewar, or more likely, there's no risk, and<br>
> > instead a 100% guarantee, but:<br>
><br>
> Not really. It was mostly just a decision taken by me.<br>
><br>
> > 3. Use libStreamAnalyzer.<br>
> ><br>
> > Take a look back at how many tiny issues and corner cases had to be<br>
> > fixed so far, how many lib quirks had to be accounted for? This was also<br>
> > the most significant source of troubles for libstreamanalyzer.<br>
><br>
> The main reason I'm against this is that Strigi does not have a maintainer. Bugs<br>
> keep cropping up: it doesn't handle all kinds of odf files, docs files,<br>
> etc. I do not want to have to fix them.<br>
<br>
</div></div>But now Nepomuk file indexer needs a maintainer.<br></blockquote><div><br>I'm willing to maintain them. In fact I'm even willing to do the Okular code splitting, it'll just take time, and it might be better to focus on other things. Hence this thread asking for opinions.<br>
<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> Also, we're fundamentally<br>
> duplicating work. Libraries already exist to parse those file formats, and<br>
> they are actively being used all across kde. We can just reuse those<br>
> libraries instead of having our own parsers, and maintaining them.<br>
<br>
</div>Which was never a problem for lsa, e.g. the ffmpeg plugin. No one volunteered to write<br>
an Okular plugin or massage the TagLib people into making public their stream-based<br>
api, which is used internally and wrapped by the file-based public api.<br>
In fact, the plugin architecture was intended to allow kde apps and libs to<br>
ship analyzers based on their format-specific libs.<br></blockquote><div><br>Making taglib work with streams = a lot more work. Similarly, making Okular work with streams would have also been quite hard. The only thing happening right now is that the UI parts of Okular are being split out.<br>
<br>Writing plugins in the case of lsa was never simple. There is virtually no documentation, you have to register all these fields and whatnot. It is not a simple job compared to writing a Nepomuk file-indexing plugin.<br>
<br>Also, it's not just about how Strigi was designed or how many plugins it has. Maintaining about 1500 lines of well-written Qt-based code is a lot simpler for me. And considering that I'm dealing with the bug reports, unhappy users, and the constant stream of "Nepomuk sucks", I think it is reasonable for me to want to fix that. My options are 1. fixing Strigi or 2. building my own. I chose building my own, as it is a lot simpler and I can reuse other libraries.<br>
<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Oh, and of course libs have bugs too. You either report them and patiently<br>
wait for a fix, or fix it yourself. E.g. ffmpeg may crash on some malformed or<br>
exotic file, and it isn't a big problem for the majority of its user<br>
base (redownload the file, delete it, open with another tool). A crashing analyzer<br>
is very bad for Nepomuk.<br></blockquote><div><br>Yes. Libraries have bugs, but if the library is widely used, the bugs will show up in other applications as well, and will have to be fixed. Taglib is heavily used; if there is a bug, it will be noticed by many people.<br>
<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> > What has this duplication of effort accomplished so far? And what happens<br>
> > if or hopefully when Nepomuk outgrows this file-based sandbox?<br>
><br>
> The duplication of effort has been quite small.<br>
><br>
> Currently all of the indexing code in Nepomuk, which is doing 80% of<br>
> Strigi's job, is about 1400 lines of code. In comparison the code required<br>
> to just interface with Strigi in Nepomuk was a good 700 lines. Also, now<br>
> with our 2 tier approach, Strigi would be giving us data which has already<br>
> been pushed. One could remove that data and all, but it's just not<br>
> something I want to do.<br>
<br>
</div>LSA indexers can be selectively enabled, so a 2 or X tier approach has been<br>
supported for ages but apparently not used.<br></blockquote><div><br>I know. It's just a lot more effort. I've always said that Strigi is a lot more powerful than our solution. Our solution is just more maintainable for me.<br>
<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
As to interface code, rdfindexer util from strigi is definitely smaller than 700<br>
lines of code.<br></blockquote><div><br>You're missing the point. Even if it took us just some 300 lines of code to interface with Strigi, when fixing bugs one has to deal with the additional Strigi code base, which is by no means small. The entire libstreams + libstreamanalyzer is a good 30k lines. That's almost as big as nepomuk-core.<br>
<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> I'm not sure when we will outgrow this file-based sandbox, but based on our<br>
> current requirements, we do not need anything more than file handling. The<br>
> other additional stuff that Strigi used to provide was just discarded.<br>
<br>
</div>I can definitely see at least 1 use case: akonadi and providing metadata for<br>
attachments. Yes, you can always download and store that 30 MB attachment to a<br>
temp location, do the file analysis, but imap4 was specifically intended to<br>
avoid this.<br></blockquote><div><br>When Strigi was being used, the entire attachment was streamed into the nepomukindexer, which would stream it into Strigi, and then it would be indexed. This is no different from storing it in /tmp/ and calling this file-based indexer.<br>
<br>If there is a better way of doing this, I'm willing to listen.<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
It's a rather bad idea to design frameworks based on immediate requirements.<br>
It's an ok approach for a quick and dirty hack or a tool, but a strategic<br>
mistake for a framework.<br></blockquote><div><br>In my roadmap there are no requirements which need the stream-based analyzer. The deal with imap4 isn't perfect, but it will still work reasonably well. I'm not aiming for perfection here.<br>
<br>Being able to index files in archives is nice, but not something I'm willing to put in that much effort for.<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
-- Evgeny<br>
</blockquote></div><br><br clear="all"><br>-- <br><span style="color:rgb(192,192,192)">Vishesh Handa</span><br>