D23787: [baloo_file_extractor] Improve handling of large plain-text files
Igor Poboiko
noreply at phabricator.kde.org
Sun Sep 8 13:05:02 BST 2019
poboiko created this revision.
poboiko added reviewers: Baloo, bruns, ngraham.
Herald added projects: Frameworks, Baloo.
poboiko requested review of this revision.
REVISION SUMMARY
First of all, not all plain-text-based mimetypes start with `text/`:
e.g. `application/sql` for SQL dumps (already handled in FileExcludeFilters),
or `application/postscript` for PS images. There are most likely more.
An alternative solution would be to use `QMimeType::inherits` instead.
Secondly, not all extractors handle large files badly: for example, if it is
a PS image, then PostScriptDSExtractor might still extract useful information.
Issues are mostly caused by PlainTextExtractor, which generates far too many
terms.
This patch aims at tackling both issues: it just skips PlainTextExtractor for
large files, utilizing extractor metadata introduced in D19109: [Extractor] Add metadata to extractors <https://phabricator.kde.org/D19109>.
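The selection logic described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code from the patch: `ExtractorInfo`, `selectExtractors`, and `kMaxPlainTextSize` are hypothetical names, the extractor is identified by a plain name string rather than by the actual metadata from D19109, and the 10 MiB threshold is an assumption based on the test plan.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an extractor plugin and its metadata.
struct ExtractorInfo {
    std::string name; // e.g. "PlainTextExtractor"
};

// Assumed size limit (10 MiB); the real threshold may differ.
constexpr std::uint64_t kMaxPlainTextSize = 10ull * 1024 * 1024;

// Return the extractors that should run on a file of the given size.
// Only the plain-text extractor is dropped for large files; all other
// extractors (e.g. a PostScript one) are kept regardless of size.
std::vector<ExtractorInfo> selectExtractors(const std::vector<ExtractorInfo>& candidates,
                                            std::uint64_t fileSize)
{
    std::vector<ExtractorInfo> selected;
    for (const ExtractorInfo& extractor : candidates) {
        const bool isPlainText = (extractor.name == "PlainTextExtractor");
        if (isPlainText && fileSize > kMaxPlainTextSize) {
            continue; // skip: would generate too many terms
        }
        selected.push_back(extractor);
    }
    return selected;
}
```

The point of filtering per extractor rather than per mimetype is that a large `application/postscript` file can still be handled by PostScriptDSExtractor, while the plain-text pass is skipped.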
TEST PLAN
1. Create a large `.txt` file (>10 MB).
2. Verify that `baloo_file_extractor` still skips it.
REPOSITORY
R293 Baloo
BRANCH
improve-large-text-files (branched from master)
REVISION DETAIL
https://phabricator.kde.org/D23787
AFFECTED FILES
src/file/extractor/app.cpp
To: poboiko, #baloo, bruns, ngraham
Cc: kde-frameworks-devel, #baloo, lots0logs, LeGast00n, fbampaloukas, GB_2, domson, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, abrahams