D23787: [baloo_file_extractor] Improve handling of large plain-text files
Igor Poboiko
noreply at phabricator.kde.org
Fri Oct 4 13:27:53 BST 2019
poboiko added a comment.
In D23787#537891 <https://phabricator.kde.org/D23787#537891>, @bruns wrote:
> Can you please provide an example which:
>
> - is currently indexed though it should be skipped due to size
> - is skipped after this change
Sure. Any mimetype that inherits from "text/plain" but does not start with "text/" counts. I've made an actual list:
F7515259: list.txt <https://phabricator.kde.org/F7515259>
(using a simple Python script, which iterates over `QMimeDatabase().allMimeTypes()` and keeps every type for which `type.inherits("text/plain")` is true and which is not already excluded by the default Baloo config from `file/fileexcludefilters.cpp`)
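For reference, a script along those lines could look roughly like the sketch below. It is not the exact script used to produce the list: the function names are made up for illustration, the `text/` prefix filter is an assumption matching the criterion discussed here, and the `excluded` set is a placeholder that would have to be filled in by hand from `file/fileexcludefilters.cpp`. The Qt part assumes PyQt5 is available.

```python
def plain_text_candidates(mime_types, already_excluded=frozenset()):
    """Return sorted names of types that inherit text/plain but live outside text/.

    `mime_types` is any iterable of objects offering name() and inherits(),
    which is the interface QMimeType provides.
    """
    names = set()
    for mime in mime_types:
        name = mime.name()
        if name.startswith("text/"):
            continue  # assumption: text/* types are already covered by a prefix check
        if name in already_excluded:
            continue  # already excluded by the default config
        if mime.inherits("text/plain"):
            names.add(name)
    return sorted(names)


def main():
    # Imported lazily so the filter above can be exercised without Qt installed.
    from PyQt5.QtCore import QMimeDatabase

    # Placeholder (assumption): copy the default excluded mimetypes
    # from file/fileexcludefilters.cpp into this set.
    excluded = set()

    for name in plain_text_candidates(QMimeDatabase().allMimeTypes(), excluded):
        print(name)
```

Calling `main()` prints one mimetype per line; the output depends on the shared-mime-info database installed on the machine, so it will differ between systems.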
Looking at the list, I see that some of them might be pretty heavy (and useless to index). For example, `application/x-valgrind-massif` or `application/sql` (I know, SQL dumps are excluded by the `*.sql` extension, but someone might simply use another extension like `.dump`). It's also pretty easy to imagine a large Wolfram Mathematica file, e.g. one containing pictures (that corresponds to `application/mathematica` from the list; on my computer, though, those are detected as `application/vnd.wolfram.nb`, which for some reason does not inherit `text/plain` even though it's plaintext-based).
We can do our best to exclude undesired types, but I'm not sure we will be able to cover all of them. And some files might be of a desirable type but simply too large (RSS feeds `application/rss+xml`, LyX files for some books `application/x-lyx`, mailboxes `message/rfc822` or `application/mbox`).
> and another example which:
>
> - is currently skipped though it should be indexed
> - is indexed after this change
There shouldn't be any. I mean, "PlaintextExtractor" should be inside `exList` for anything that starts with `text/`...
REPOSITORY
R293 Baloo
REVISION DETAIL
https://phabricator.kde.org/D23787
To: poboiko, #baloo, bruns, ngraham
Cc: broulik, kde-frameworks-devel, #baloo, lots0logs, LeGast00n, fbampaloukas, GB_2, domson, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, abrahams