D23787: [baloo_file_extractor] Improve handling of large plain-text files

Igor Poboiko noreply at phabricator.kde.org
Fri Oct 4 13:27:53 BST 2019


poboiko added a comment.


  In D23787#537891 <https://phabricator.kde.org/D23787#537891>, @bruns wrote:
  
  > Can you please provide an example which:
  >
  > - is currently indexed though it should be skipped due to size
  > - is skipped after this change
  
  
  Sure. Any mimetype that inherits from "text/plain" but does not start with "text/" counts. I've made an actual list:
  F7515259: list.txt <https://phabricator.kde.org/F7515259>
  (using a simple Python script, which iterates over `QMimeDatabase().allMimeTypes()`, checks whether `type.inherits("text/plain")` holds, and skips types that are already excluded by the default Baloo config in `file/fileexcludefilters.cpp`)
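
  For reference, a minimal sketch of what such a script could look like. This is not the actual script used to produce the list: the function name, the `FakeMimeType` stand-in and the `excluded_patterns` argument are all hypothetical, and the exclusion list is only a placeholder for whatever `file/fileexcludefilters.cpp` excludes by default. It only assumes objects with `name()` and `inherits()` methods, which is what `QMimeType` provides.

```python
from fnmatch import fnmatch


def plaintext_like_types(mime_types, excluded_patterns=()):
    """Return sorted names of types that inherit text/plain
    but whose name does not start with "text/".

    mime_types: iterable of objects exposing name() and inherits(str),
                e.g. the result of QMimeDatabase().allMimeTypes().
    excluded_patterns: hypothetical placeholder for the mimetypes that
                Baloo already excludes by default (glob patterns).
    """
    result = []
    for mime in mime_types:
        name = mime.name()
        if name.startswith("text/"):
            continue  # already covered by the "text/" prefix check
        if not mime.inherits("text/plain"):
            continue  # not plain-text based
        if any(fnmatch(name, pattern) for pattern in excluded_patterns):
            continue  # already excluded by default config
        result.append(name)
    return sorted(result)


class FakeMimeType:
    """Tiny stand-in for QMimeType so the sketch runs without Qt."""

    def __init__(self, name, parents=()):
        self._name = name
        self._parents = set(parents)

    def name(self):
        return self._name

    def inherits(self, other):
        # Like QMimeType.inherits(): true for the type itself or an ancestor.
        return other == self._name or other in self._parents


if __name__ == "__main__":
    sample = [
        FakeMimeType("text/plain"),
        FakeMimeType("text/x-csrc", ["text/plain"]),
        FakeMimeType("application/sql", ["text/plain"]),
        FakeMimeType("image/png"),
    ]
    for type_name in plaintext_like_types(sample):
        print(type_name)
```

  With PyQt5 installed, the real database can be passed in instead of the sample: `plaintext_like_types(QMimeDatabase().allMimeTypes())`, after `from PyQt5.QtCore import QMimeDatabase`.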
  
  Looking at the list, I see that some of them might be pretty heavy (and useless to index). For example, `application/x-valgrind-massif` or `application/sql` (I know, SQL dumps are excluded by the extension `*.sql`, but someone might simply use another extension like `.dump`). It's also pretty easy to imagine a large Wolfram Mathematica file, e.g. one containing pictures (that corresponds to `application/mathematica` from the list; on my computer, though, those are detected as `application/vnd.wolfram.nb`, which for some reason does not inherit `text/plain`, although it's plaintext-based).
  
  We can do our best to exclude undesired types, but I'm not sure we will be able to cover all of them. And some files might be of a desirable type but simply too large (RSS feeds `application/rss+xml`, LyX files for some books `application/x-lyx`, mailboxes `message/rfc822` or `application/mbox`).
  
  > and another example which:
  > 
  > - is currently skipped though it should be indexed
  > - is indexed after this change
  
  There shouldn't be any. I mean, "PlaintextExtractor" should be inside `exList` for anything that starts with `text/`...
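
  To illustrate why nothing should get lost, here is a rough sketch of the two rules as I understand them from this discussion. It is not the actual Baloo code (which is C++): the function names and the `MAX_SIZE` value are hypothetical, and the only assumption is that the old rule keys on the `text/` name prefix while the new one keys on whether "PlaintextExtractor" is among the extractors chosen for the file.

```python
# Hypothetical size limit in bytes; the real value is not stated here.
MAX_SIZE = 10 * 1024 * 1024


def skipped_before(mime_name, size):
    """Old rule (assumed): skip large files whose type name starts with "text/"."""
    return mime_name.startswith("text/") and size > MAX_SIZE


def skipped_after(extractor_names, size):
    """New rule (assumed): skip large files handled by the plain-text extractor."""
    return "PlaintextExtractor" in extractor_names and size > MAX_SIZE


if __name__ == "__main__":
    big = MAX_SIZE + 1
    # A large SQL dump: not caught by the prefix rule, caught by the new rule.
    print(skipped_before("application/sql", big))
    print(skipped_after(["PlaintextExtractor"], big))
```

  Since every `text/` type is expected to have "PlaintextExtractor" in its extractor list, anything skipped by the first rule is also skipped by the second, and small files are indexed under both.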

REPOSITORY
  R293 Baloo

REVISION DETAIL
  https://phabricator.kde.org/D23787

To: poboiko, #baloo, bruns, ngraham
Cc: broulik, kde-frameworks-devel, #baloo, lots0logs, LeGast00n, fbampaloukas, GB_2, domson, ashaposhnikov, michaelh, astippich, spoorun, ngraham, bruns, abrahams