[Kde-pim] Review Request 116692: Lower memory usage of akonadi_baloo_indexer with frequent commits

Aaron J. Seigo aseigo at kde.org
Thu Jul 10 17:14:58 BST 2014


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/116692/
-----------------------------------------------------------

(Updated July 10, 2014, 4:14 p.m.)


Status
------

This change has been discarded.


Review request for Akonadi and Baloo.


Repository: baloo


Description
-------

Baloo uses Xapian to store the processed results of data fed to it by akonadi. It processes all the data it is sent to index, and only once this is complete is the data committed to the Xapian database. From http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#acbea2163142de795024880a7123bc693 we see: "For efficiency reasons, when performing multiple updates to a database it is best (indeed, almost essential) to make as many modifications as memory will permit in a single pass through the database. To ensure this, Xapian batches up modifications." This means that *all* the data to be stored in the Xapian database first ends up in RAM.

When indexing large mailboxes (or any other large chunk of data) this results in a very large amount of memory allocation. In one test of 100k mails in a maildir folder this resulted in 1.5GB of RAM used. In normal daily usage with maildir I find that memory usage easily balloons to several hundred megabytes within days. This makes the Baloo indexer unusable on systems with smaller amounts of memory (e.g. mobile devices, which typically have only 512MB-2GB of RAM).

Making this even worse, the indexer is long-lived *and* the default glibc allocator is unable to return the used memory to the OS (probably due to memory fragmentation, though I have not confirmed this). With other allocators, memory still balloons temporarily during processing, but once processing is done the memory is released and returned to the OS. As such, this is not a memory leak, but it behaves like one on systems with the default glibc allocator: akonadi_baloo_indexer takes increasingly large amounts of memory that never get returned to the OS. (This is actually how I noticed the problem in the first place.)

The approach used to address this problem is to periodically commit data to the Xapian database. This happens uniformly and transparently to the AbstractIndexer subclasses. The exact behavior is controlled by the s_maxUncommittedItems constant which is set arbitrarily to 100: after an indexer hits 100 uncommitted changes, the results are committed immediately. Caveats:

* This is not a guaranteed fix for the memory fragmentation issue experienced with glibc: it is still possible for memory usage to grow slowly over time, as each smaller commit leaves some fraction of the memory unreleasable due to fragmentation. It has helped with day-to-day usage here, but in the "100k mails in a maildir structure" test memory did still balloon upwards.

* It makes indexing non-atomic from akonadi's perspective: data fed to akonadi_baloo_indexer to be indexed may show up in chunks and, if the indexer crashes, may be only partially added to the database.
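
As an illustration of the periodic-commit idea only (this is not the code from the patch), here is a minimal self-contained sketch. FakeDatabase, BatchingIndexer and indexMany are hypothetical names invented for this example; FakeDatabase merely counts buffered changes and stands in for the Xapian database, where the real commit would presumably be Xapian::WritableDatabase::commit(). Only the s_maxUncommittedItems name and its value of 100 come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the writable Xapian database. It only counts
// buffered changes, so the batching logic can be shown without Xapian.
struct FakeDatabase {
    std::size_t pending = 0;      // changes currently buffered in memory
    std::size_t committed = 0;    // changes flushed so far
    std::size_t commits = 0;      // number of commit() calls
    std::size_t peakPending = 0;  // largest number of buffered changes seen

    void add(const std::string &) {
        ++pending;
        if (pending > peakPending) {
            peakPending = pending;
        }
    }

    void commit() {
        committed += pending;
        pending = 0;
        ++commits;
    }
};

// Hypothetical indexer base class: commits as soon as the number of
// uncommitted changes reaches s_maxUncommittedItems.
class BatchingIndexer {
public:
    explicit BatchingIndexer(FakeDatabase &db) : m_db(db) {}

    void index(const std::string &item) {
        m_db.add(item);
        if (++m_uncommitted >= s_maxUncommittedItems) {
            commit();
        }
    }

    // Flushes whatever is still uncommitted; does nothing if there is nothing to flush.
    void commit() {
        if (m_uncommitted == 0) {
            return;
        }
        m_db.commit();
        m_uncommitted = 0;
    }

private:
    static constexpr std::size_t s_maxUncommittedItems = 100;  // arbitrary, as in the description
    FakeDatabase &m_db;
    std::size_t m_uncommitted = 0;
};

// Hypothetical helper: index `count` dummy items, then do a final commit.
FakeDatabase indexMany(std::size_t count) {
    FakeDatabase db;
    BatchingIndexer indexer(db);
    for (std::size_t i = 0; i < count; ++i) {
        indexer.index("item " + std::to_string(i));
    }
    indexer.commit();
    return db;
}
```

In this sketch, indexMany(250) commits three times (after items 100 and 200, plus the final flush of the remaining 50) and never holds more than 100 uncommitted changes at once. That bounds the batch size, but it is also exactly where the non-atomicity caveat comes from: a crash between two of those commits leaves only part of the data in the database.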

Alternative approaches (not necessarily mutually exclusive to this patch or each other):

* send smaller data sets from akonadi to akonadi_baloo_indexer for processing. This would allow akonadi_baloo_indexer to retain the atomic commit approach while avoiding the worst of the Xapian memory usage; it would not address the issue of memory fragmentation
* restart akonadi_baloo_indexer process from time to time; this would resolve the fragmentation-over-time issue but not the massive memory usage due to atomically indexing large datasets
* improve Xapian's chert backend (to become default in 1.4) to not fragment memory so much; this would not address the issue of massive memory usage due to atomically indexing large datasets
* use an allocator other than glibc's; this would not address the issue of massive memory usage due to atomically indexing large datasets


Diffs
-----

  src/pim/agent/emailindexer.cpp 05f80cf 
  src/pim/agent/abstractindexer.h 8ae6f5c 
  src/pim/agent/abstractindexer.cpp fa9e96f 
  src/pim/agent/akonotesindexer.h 83f36b7 
  src/pim/agent/akonotesindexer.cpp ac3e66c 
  src/pim/agent/contactindexer.h 49dfdeb 
  src/pim/agent/contactindexer.cpp a5a6865 
  src/pim/agent/emailindexer.h 9a5e5cf 

Diff: https://git.reviewboard.kde.org/r/116692/diff/


Testing
-------

I have been running with the patch for a couple of days, and one other person on IRC has tested an earlier (but functionally equivalent) version. Rather than reaching the common 250MB+ during regular usage, the indexer now idles at ~20MB (up from ~7MB when first started, so some fragmentation remains as noted in the description, but with far better long-term results).


Thanks,

Aaron J. Seigo

_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/


