[Kde-pim] Review Request 116692: Lower memory usage of akonadi_baloo_indexer with frequent commits

Mon Mar 10 13:13:00 GMT 2014

> On March 10, 2014, 11:54 a.m., Sergio Luis Martins wrote:
> > Memory usage seems stable now, but it's taking forever to index. 10 minutes have passed and it's still indexing a 70k mail folder.
> > CPU is at 1%. IO is at 100%.
> > 
> > On an SSD... This is worse than the memory problem IMHO. I can send you this maildir; it might help in triggering this.
> 
> Aaron J. Seigo wrote:
>     The reason for the time difference is obvious: before it was making one commit after processing those 70k mails. Now it makes 700 commits in that same time. The commits are the expensive part, time-wise.
>     
>     "This is worse than the memory problem IMHO"
>     
>     Given that indexing 70k emails at once is not a typical use case, and that it does terminate at some point, that is better imo than memory that is *never* released and which can easily lead to OOM conditions.
> 
> Sergio Luis Martins wrote:
>     Fair enough. Then could you add an env variable, as suggested by Pablo? I also have lots of memory to spare.

I'll leave that up to the maintainer of the code, as that's more of a design decision.
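
For whoever picks this up, here is a rough sketch of what such an override could look like. It is not part of the patch. The variable name BALOO_MAX_UNCOMMITTED_ITEMS and both function names are made up for illustration; only the default of 100 comes from the patch (s_maxUncommittedItems).

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Default mirrors the value the patch under review uses.
static const int s_defaultMaxUncommittedItems = 100;

// Parses an override string. Falls back when the text is missing, empty,
// not a plain decimal number, out of range, or not strictly positive.
// (Hypothetical helper, not in the patch.)
int parseMaxUncommittedItems(const char *text, int fallback)
{
    assert(fallback > 0);
    if (!text || *text == '\0') {
        return fallback;
    }
    errno = 0;
    char *end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX) {
        return fallback;
    }
    return static_cast<int>(value);
}

// Reads the threshold from the environment.
// BALOO_MAX_UNCOMMITTED_ITEMS is an assumed name, not an existing Baloo setting.
int maxUncommittedItems()
{
    return parseMaxUncommittedItems(std::getenv("BALOO_MAX_UNCOMMITTED_ITEMS"),
                                    s_defaultMaxUncommittedItems);
}
```

Falling back to the default on any malformed value keeps a typo in the environment from turning into a threshold of 0, which would mean a commit per item.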

Some more numbers: after increasing the threshold to 200 items and copying 700 emails, memory remains in check and it spent 45s in commit(). Moving the items to the trash incurs another 25s in commit(). Oddly, moving items to the trash results in them being *re-indexed entirely* rather than a more (one hopes) economical move() call.

Laurent has already done some very nice work in the last couple days to prevent duplicate indexing (things were being indexed as many as 4 times!), so this has gotten better since the weekend, but there is still room for improvement it seems .. and Xapian is just not very fast when it comes to committing things to the database.

Also, looking at the Xapian codebase, Xapian itself was automatically committing after every 10k items.
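
To make the commit-count trade-off concrete, here is a minimal, self-contained sketch of threshold-based committing. It is not the code from the patch: the class name BatchedCommitter and its methods are hypothetical, and the commit callback stands in for the real database commit (in the indexer that would be a call to Xapian::WritableDatabase::commit()).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Counts uncommitted changes and triggers a commit every maxUncommitted items.
// (Illustrative stand-in; the patch does this inside AbstractIndexer.)
class BatchedCommitter
{
public:
    BatchedCommitter(std::size_t maxUncommitted, std::function<void()> commit)
        : m_maxUncommitted(maxUncommitted), m_commit(std::move(commit))
    {
        assert(m_maxUncommitted > 0);
        assert(static_cast<bool>(m_commit));
    }

    // Call once per item added to, changed in, or removed from the index.
    void itemChanged()
    {
        if (++m_uncommitted >= m_maxUncommitted) {
            flush();
        }
    }

    // Commits whatever is pending; does nothing if there are no changes.
    void flush()
    {
        if (m_uncommitted == 0) {
            return;
        }
        m_commit();
        m_uncommitted = 0;
        ++m_commitCount;
    }

    std::size_t commitCount() const { return m_commitCount; }
    std::size_t uncommitted() const { return m_uncommitted; }

private:
    std::size_t m_maxUncommitted;
    std::function<void()> m_commit;
    std::size_t m_uncommitted = 0;
    std::size_t m_commitCount = 0;
};
```

With a threshold of 100, feeding 70,000 items through itemChanged() yields 700 commits, which is the figure discussed above. With a threshold of 200, 700 items yield 3 commits plus one more on the final flush() for the remaining 100.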

- Aaron J.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/116692/#review52518
-----------------------------------------------------------

On March 10, 2014, 11:12 a.m., Aaron J. Seigo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/116692/
> -----------------------------------------------------------
> 
> (Updated March 10, 2014, 11:12 a.m.)
> 
> 
> Review request for Akonadi and Baloo.
> 
> 
> Repository: baloo
> 
> 
> Description
> -------
> 
> Baloo is using Xapian for storing processed results from data fed to it by akonadi; in doing so it processes all the data it is sent to index and only once this is complete is the data committed to the Xapian database. From http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#acbea2163142de795024880a7123bc693 we see: "For efficiency reasons, when performing multiple updates to a database it is best (indeed, almost essential) to make as many modifications as memory will permit in a single pass through the database. To ensure this, Xapian batches up modifications." This means that *all* the data to be stored in the Xapian database first ends up in RAM. When indexing large mailboxes (or any other large chunk of data) this results in a very large amount of memory allocation. On one test of 100k mails in a maildir folder this resulted in 1.5GB of RAM used. In normal daily usage with maildir I find that it easily balloons to several hundred megabytes within days. This makes the Baloo indexer unusable on systems with smaller amounts of memory (e.g. mobile devices, which typically have only 512MB-2GB of RAM).
> 
> Making this even worse, the indexer is long-lived *and* the default glibc allocator is unable to return the used memory back to the OS (probably due to memory fragmentation, though I have not confirmed this). Use of other allocators shows the temporary ballooning of memory during processing, but once that is done the memory is released and returned back to the OS. As such, this is not a memory leak .. but it behaves like one on systems with the default glibc allocator, with akonadi_baloo_indexer taking increasingly large amounts of memory on the system that never get returned to the OS. (This is actually how I noticed the problem in the first place.)
> 
> The approach used to address this problem is to periodically commit data to the Xapian database. This happens uniformly and transparently to the AbstractIndexer subclasses. The exact behavior is controlled by the s_maxUncommittedItems constant which is set arbitrarily to 100: after an indexer hits 100 uncommitted changes, the results are committed immediately. Caveats:
> 
> * This is not a guaranteed fix for the memory fragmentation issue experienced with glibc: it is still possible for the memory to grow slowly over time as each smaller commit leaves some % of un-releasable memory due to fragmentation. It has helped with day to day usage here, but in the "100k mails in a maildir structure" test memory did still balloon upwards. 
> 
> * It makes indexing non-atomic from akonadi's perspective: data fed to akonadi_baloo_indexer to be indexed may show up in chunks and even, in the case of a crash of the indexer, be only partially added to the database.
> 
> Alternative approaches (not necessarily mutually exclusive to this patch or each other):
> 
> * send smaller data sets from akonadi to akonadi_baloo_indexer for processing. This would allow akonadi_baloo_indexer to retain the atomic commit approach while avoiding the worst of the Xapian memory usage; it would not address the issue of memory fragmentation
> * restart akonadi_baloo_indexer process from time to time; this would resolve the fragmentation-over-time issue but not the massive memory usage due to atomically indexing large datasets
> * improve Xapian's chert backend (to become default in 1.4) to not fragment memory so much; this would not address the issue of massive memory usage due to atomically indexing large datasets
> * use an allocator other than glibc's; this would not address the issue of massive memory usage due to atomically indexing large datasets
> 
> 
> Diffs
> -----
> 
>   src/pim/agent/emailindexer.cpp 05f80cf 
>   src/pim/agent/abstractindexer.h 8ae6f5c 
>   src/pim/agent/abstractindexer.cpp fa9e96f 
>   src/pim/agent/akonotesindexer.h 83f36b7 
>   src/pim/agent/akonotesindexer.cpp ac3e66c 
>   src/pim/agent/contactindexer.h 49dfdeb 
>   src/pim/agent/contactindexer.cpp a5a6865 
>   src/pim/agent/emailindexer.h 9a5e5cf 
> 
> Diff: https://git.reviewboard.kde.org/r/116692/diff/
> 
> 
> Testing
> -------
> 
> I have been running with the patch for a couple of days and one other person on irc has tested an earlier (but functionally equivalent) version. Rather than reaching the common 250MB+ during regular usage it now idles at ~20MB (up from ~7MB when first started; so some fragmentation remains as noted in the description, but with far better long-term results)
> 
> 
> Thanks,
> 
> Aaron J. Seigo
> 
>
