D23787: [baloo_file_extractor] Improve handling of large plain-text files

Igor Poboiko noreply at phabricator.kde.org
Sun Sep 8 13:05:02 BST 2019


poboiko created this revision.
poboiko added reviewers: Baloo, bruns, ngraham.
Herald added projects: Frameworks, Baloo.
poboiko requested review of this revision.

REVISION SUMMARY
  First, not all plain-text-based mimetypes start with `text/`:
  e.g. `application/sql` for SQL dumps (already handled in FileExcludeFilters),
  or `application/postscript` for PS images. There are most likely more.
  An alternative solution would be to use `QMimeType::inherits` instead.
  
  Secondly, not all extractors handle large files badly: for example, if the
  file is a PS image, PostScriptDSExtractor might still extract useful
  information. The issues are mostly caused by PlainTextExtractor, which
  generates far too many terms.
  
  This patch aims to tackle both issues: it simply skips PlainTextExtractor for
  large files, using the extractor metadata introduced in D19109: [Extractor] Add metadata to extractors <https://phabricator.kde.org/D19109>.
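  
  The idea can be illustrated with a minimal standalone sketch. It is not the
  code from `app.cpp`: the `Extractor` struct, the `selectExtractors` helper
  and the `kMaxPlainTextSize` constant are hypothetical names, the extractor
  name is assumed to be available from the extractor metadata, and the 10 MiB
  threshold is only assumed from the test plan.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an extractor plugin; NOT the KFileMetaData API.
struct Extractor {
    std::string name;  // assumed to be read from the extractor's metadata
};

// Assumed size limit (10 MiB); the real threshold lives in the patch itself.
constexpr std::uint64_t kMaxPlainTextSize = 10ULL * 1024 * 1024;

// Returns the extractors that should run on a file of the given size.
// Only the plain-text extractor is dropped for large files; every other
// extractor matching the mimetype is kept, regardless of file size.
std::vector<Extractor> selectExtractors(const std::vector<Extractor>& candidates,
                                        std::uint64_t fileSize)
{
    std::vector<Extractor> selected;
    for (const Extractor& extractor : candidates) {
        const bool isPlainText = (extractor.name == "PlainTextExtractor");
        if (isPlainText && fileSize >= kMaxPlainTextSize) {
            continue;  // skip only this extractor, not the whole file
        }
        selected.push_back(extractor);
    }
    return selected;
}
```

  With this shape, a large PostScript file that matches both
  PlainTextExtractor and PostScriptDSExtractor would still be passed to the
  latter, while a large `.txt` file would end up with no extractors at all.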

TEST PLAN
  1. Create a large `.txt` file (>10 MB)
  2. Verify that `baloo_file_extractor` still skips it.

REPOSITORY
  R293 Baloo

BRANCH
  improve-large-text-files (branched from master)

REVISION DETAIL
  https://phabricator.kde.org/D23787

AFFECTED FILES
  src/file/extractor/app.cpp

To: poboiko, #baloo, bruns, ngraham
Cc: kde-frameworks-devel, #baloo, lots0logs, LeGast00n, fbampaloukas, GB_2, domson, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, abrahams