[Kde-pim] akonadinext update: entity processing pipelines in resources

Christian Mollekopf chrigi_1 at fastmail.fm
Thu Dec 18 12:23:33 GMT 2014


On Thursday 18 December 2014 12.52:28 Martin Steigerwald wrote:
> Am Donnerstag, 18. Dezember 2014, 09:55:30 schrieb Aaron J. Seigo:
> > On Wednesday, December 17, 2014 14.49:53 you wrote:
> > > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > > currently, pipelines are just a simple one-after-the-other processing
> > > > afair. It is set up already for asynchronous processing, however.
> > > > Eventually I would like to allow filters to note that they can be
> > > > parallelized, should be run in a separate thread, ??? ... mostly so that
> > > > we can increase throughput.
> 
> […]
> 
> > > What other _common_ use case can you think of that would benefit from the
> > > additional design overhead?
> > 
> > The point of having pipelines is to ensure all post-delivery processing is
> > done before clients start showing (wrong) data. Filters that move an email
> > between folders, for instance, should be run *before* showing the email in
> > the wrong folder in the client.
> > 
> > So, real world use cases:
> > 
> > 1. a mail filter that moves an email to a folder
> > 2. a scam detector (currently this lives in libmessageviewer!)
> > 3. full-text indexer
> > 4. threading agent (relies on knowing which folder it is in)
> > 5. a mail filter that flags mails from your boss as important
> > 6. an event checker that flags conflicts between incoming events and
> > existing ones
> > 
> > 1, 2, 3, 4 and 6 do not modify the entity itself. They touch indexes, but
> > not the entity itself. Number 5 does.
> > 
> > Number 1 needs to be run before numbers 3 and 4, but can be run in
> > parallel
> 
> Why?
> 
> Doesn't the full-text indexer reference the mail by some index in the
> database? If so, it wouldn't care about the folder it's stored in and can
> look that up on demand.
> 

We don't have stable IDs across resources by default; they are only unique
and stable per resource. It could certainly be done, but I see no obvious
advantages that would justify this.

> > as for why to parallelize, that's simple: throughput.
> > 
> > as you note, we should be able to parallelize processing of individual
> > emails, but even then only to an extent. the threading agent is much
> > simpler if it is only ever processing one email at a time, so maybe we
> > never want it to be running in parallel, while the scam detector perhaps
> > ought to be running in as many individual pipelines as possible at once.
> 
> I wonder whether it's possible to parallelize in another way:
> 
> When I download new mail to a POP3 account, say about 1000 mails – I easily
> have this after a day's absence – and I want them filtered into folders:
> How about using multiple filtering threads to sort the mail in order to
> utilize all available CPU cores and have it finished quickly?
> 
> Also spam filtering could be done multi-threaded.
> 
> Or just checking all folders of an IMAP account. The client could open 2-4
> folders at once to synchronize them. It might be good if that could be
> configured to what the IMAP client can handle. Or can Akonadi do this
> already? I usually see it checking one folder after another with Dovecot on
> the server side idling around.
> 

The aim is to be able to saturate all the resources we have. In order to do
this we have to keep various possibilities open:
* parallel network connections
* parallel processing of new entities
* parallel processing of filters on a single entity

How this is split up exactly will need some experimentation and will possibly 
have to be adjusted over time.

* With a slow network connection, the network is likely to be the bottleneck
and CPU and disk are not a big problem. => Parallel processing doesn't help,
but multiple connections might.
* With multiple cores and a fast disk, parallel processing could make a large
difference, because we can saturate the CPU cores.
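
To illustrate the "parallel processing of new entities" option, here is a
minimal sketch in Python. It is not akonadinext code: run_pipeline,
process_entities and the two filters are made-up names, and the worker count
is just the knob that would need the experimentation mentioned above. Each
entity still passes through its filters strictly in order; only independent
entities run side by side.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(entity, filters):
    """Run every filter on one entity, strictly one after the other."""
    for apply_filter in filters:
        entity = apply_filter(entity)
    return entity


def process_entities(entities, filters, workers):
    """Process independent entities in parallel, at most `workers` at a time.

    Executor.map() yields results in input order, regardless of the order
    in which the individual pipelines finish.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: run_pipeline(e, filters), entities))


# Hypothetical filters, standing in for real ones.
def flag_boss_mail(mail):
    return {**mail, "important": mail["sender"] == "boss@example.org"}


def choose_folder(mail):
    # Depends on the flag set by the previous filter, so order matters.
    return {**mail, "folder": "work" if mail["important"] else "inbox"}


mails = [
    {"id": i, "sender": "boss@example.org" if i % 2 == 0 else "list@example.org"}
    for i in range(6)
]
processed = process_entities(mails, [flag_boss_mail, choose_folder], workers=4)
```

With a slow network the same structure would wrap the fetching rather than
the filtering, which is why the split has to stay adjustable.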

> In any case, I would like to see an important design goal for Akonadi Next:
> 
> *Never* block the client. Never ever block the client gui.
> 
> Current Akonadi still has issues with it. If Akonadi is busy with itself,
> KMail can still become quite unusable. I think Akonadi should instead
> postpone background jobs to serve current user requests *quickly*. If I
> click on a mail, I want to see it. *Now*. The only excuse would be that the
> IMAP server doesn't respond in time. But with a POP3 account with locally
> stored maildirs on an SSD-based BTRFS RAID 1 there is zero excuse for not
> serving the mail *now*. The same goes for switching folders and so on.
> 
> So throughput is one thing, but I think from a user's point of view there is
> something even more important: latency!
> 
> From my work as a trainer and consultant for performance analysis and
> tuning on Linux I know that it can be challenging to have both. But for the
> best user experience, if need be, I would reduce throughput (of
> background jobs) to decrease latency.
> 
> Please keep this in mind. In my eyes latency is key. And as I understood it,
> this was one of the promises of Akonadi that it never quite fulfilled: the
> client will never have to block because processing work happens in the
> background. Yet current Akonadi can block. It can block badly. For half a
> minute and more. Up to the point that KMail seems to lose its connection to
> Akonadi and then I have KMail sitting there, doing nothing anymore, and
> Akonadi also sitting there, doing nothing.
> 
> So please keep latency in mind. And robustness. The client shall never lose
> its connection to the background store.

I think this will be much more predictable with the new system. Each client
has direct, non-blocking access to the database and indexes, which should make
read times (apart from, e.g., having to spin up the disk) rather predictable.
The query processing will take up most of the time, but that should be pretty
much the same amount every time.

To not block the client the query is processed in a thread, so that shouldn't 
be a problem either.

The only thing that we still have to do is make the synchronizer prioritize
on-demand client requests over, e.g., a large background sync. This is also
something where parallelization inside the synchronizer could help, by
answering requests while a sync is ongoing.
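
One possible shape for that prioritization, as a rough Python sketch. The
class and job names are invented for illustration and this is not how the
synchronizer is implemented; it also assumes a background sync can be split
into chunks, so that an on-demand request can jump ahead between two chunks.

```python
import heapq
import itertools

ON_DEMAND = 0   # a client is waiting for this result
BACKGROUND = 1  # e.g. one chunk of a large sync


class SynchronizerQueue:
    """Job queue in which on-demand requests always outrank background work."""

    def __init__(self):
        self._heap = []
        # Monotonic counter as tie-breaker: jobs of equal priority stay FIFO.
        self._counter = itertools.count()

    def submit(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self):
        """Return the most urgent job name, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = SynchronizerQueue()
queue.submit(BACKGROUND, "sync chunk 1")
queue.submit(BACKGROUND, "sync chunk 2")
queue.submit(ON_DEMAND, "fetch mail body")  # arrives last, served first
order = [queue.next_job() for _ in range(3)]
```

The smaller the background chunks, the shorter a client request has to wait
behind a chunk that is already running, at some cost in sync throughput.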

> 
> I will happily test this with my "monster" account (one million mails and
> counting).

_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/

