Review Request 130013: Make PlainTextExtractor match "text/plain" mimetype

Igor Poboiko igor.poboiko at gmail.com
Thu Mar 16 21:23:30 UTC 2017



> On March 15, 2017, 4:13 a.m., Anthony Fieroni wrote:
> > Ship It!
> 
> Anthony Fieroni wrote:
>     Can you verify whether https://git.reviewboard.kde.org/r/129703/ is still needed to limit CPU usage, or whether it can be discarded?
> 
> Igor Poboiko wrote:
>     I didn't see any significant performance issues, and as far as I can see, the DB size didn't change much after reindexing, so there are no redundant extractors.
>     Concerning performance: I don't believe there is much overhead. I think we should instead use profilers to find bottlenecks, and I don't think this is one of them.
> 
> Anthony Fieroni wrote:
>     One more test, if I'm not being too cheeky: please consider files like epub or svg, i.e. complex mime types. For these types we surely have more than one extractor, which can affect DB size and CPU time.

Funny thing: I didn't manage to find any type with more than one extractor. SVG is matched only by PlainTextExtractor (apparently, the only extractor working with images is Exiv2Extractor, which doesn't support SVG), and EPUB is matched only by EPubExtractor (apparently, internally it is a zip archive, and there are no extractors working with archives).

Anyway, if two extractors produce the same DocTerm for some file, Baloo won't save it twice; it saves only unique terms.
And if they extract different terms, well, that gives more chances to match the user's search, which is even better!
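
To illustrate the point about unique terms, here is a minimal sketch. It is not Baloo's actual code: `mergeTerms` is a hypothetical helper, and terms are modelled as plain strings purely for illustration.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of Baloo or KFileMetaData: merge the terms
// produced by several extractors for one file, keeping each term only once.
std::set<std::string> mergeTerms(const std::vector<std::vector<std::string>>& perExtractor)
{
    std::set<std::string> unique;
    for (const auto& terms : perExtractor) {
        // std::set silently ignores terms that are already present.
        unique.insert(terms.begin(), terms.end());
    }
    return unique;
}
```

With this model, a second extractor only grows the stored data by the terms the first one did not already produce.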

I also tried indexing the whole /usr/share/icons directory, with lots of SVG icons, and I didn't see a change (again, subjectively: I didn't use profilers). Apparently, extraction takes considerably more time than iterating over a bunch of mimetypes per file (~100?).
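
For reference, the kind of inheritance check being discussed can be sketched roughly as follows. This is only an illustration under stated assumptions: `ParentMap` and `inheritsFrom` are hypothetical names, each type is given a single parent for simplicity, and a real implementation would query the shared MIME database rather than a hand-written table.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical parent table (child mimetype -> parent mimetype).
// Assumption: one parent per type; real MIME data may list several.
using ParentMap = std::map<std::string, std::string>;

// Returns true if `mime` is `wanted` or inherits from it through `parents`.
bool inheritsFrom(const ParentMap& parents, std::string mime, const std::string& wanted)
{
    std::set<std::string> seen; // guards against cycles in the table
    while (seen.insert(mime).second) {
        if (mime == wanted) {
            return true;
        }
        auto it = parents.find(mime);
        if (it == parents.end()) {
            return false; // reached a root type without finding `wanted`
        }
        mime = it->second;
    }
    return false; // revisited a type, so the table contains a cycle
}
```

With an illustrative table in which "image/svg+xml" has parent "application/xml" and "application/xml" has parent "text/plain", the check succeeds for SVG and fails for a type that is not in the table. The walk is a handful of map lookups per file, which is consistent with it being cheap next to the extraction itself.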


- Igor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/130013/#review102846
-----------------------------------------------------------


On March 16, 2017, 8:33 p.m., Igor Poboiko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/130013/
> -----------------------------------------------------------
> 
> (Updated March 16, 2017, 8:33 p.m.)
> 
> 
> Review request for KDE Frameworks and Anthony Fieroni.
> 
> 
> Repository: kfilemetadata
> 
> 
> Description
> -------
> 
> After commit 7c7e985a4678fef5f5d0dd8faa9b9cb42e3844b4 (see https://git.reviewboard.kde.org/r/129720/), PlainTextExtractor no longer matches ANY of the text/ mimetypes.
> This completely broke Baloo indexing of e.g. simple plain text files.
> The check introduced there, however, allows us to declare "text/plain" as the supported mimetype for the extractor and hope that every mimetype containing plain text inherits from it.
> 
> 
> Diffs
> -----
> 
>   autotests/CMakeLists.txt 5ab742b 
>   autotests/extractorcollectiontest.cpp PRE-CREATION 
>   src/externalextractor.cpp 05f0645 
>   src/extractors/plaintextextractor.cpp 26e1247 
> 
> Diff: https://git.reviewboard.kde.org/r/130013/diff/
> 
> 
> Testing
> -------
> 
> KFileMetaData compiles.
> Baloo indexes plain text files.
> Everybody is happy.
> 
> 
> Thanks,
> 
> Igor Poboiko
> 
>
