[Kde-pim] akonadinext update: entity processing pipelines in resources

Milian Wolff mail at milianw.de
Wed Dec 17 13:49:53 GMT 2014


On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> hey :)
> 
> hows that for a subject line full of jargon? ;)
> 
> in the akonadinext repo, synchronizer processes are approaching some sort of
> early-stage completeness. they currently:
> 
> * accept connections from clients
> * inform clients when the revision in the store changes[1]
> * take commands from clients and respond with a task completion message[2]
> * load Resource plugins to deal with storing data and synchronizing with the
> source
> * manage processing pipelines for entities
> 
> it's that last part that i'm writing about here, actually. a pipeline is
> zero or more processing units (soon to be plugins) that (currently) sit in
> a chain and do some processing on entities[3] whenever they are created,
> modified or deleted. we will be using this to populate indexes, trigger
> full text indexing, apply client-side filters, run spam/scam detection,
> and so on.
> 
> anything that can be / should be done to a given entity when it appears,
> changes or is removed will happen in these pipelines.
> 
> things left to do:
> 
> * make PipelineFilter pluggable so that it is easy for people to add new
> filters (including ones we don't ship ourselves)
> * generate a configuration scheme which Pipeline can use to populate
> pipelines at runtime according to the user's wishes
> * write a few PipelineFilter plugins that do some actually useful things
> that we can use in testing
> 
> currently, pipelines are just a simple one-after-the-other processing affair.
> It is set up already for asynchronous processing, however. Eventually I
> would like to allow filters to note that they can be parallelized, should
> be run in a separate thread, ??? ... mostly so that we can increase
> throughput.

This, imo, will kill user configuration. You do not want to burden the user 
with a GUI where they have to define dependencies between filters and so on.

Also, I cannot think of any common use case of mail filtering that could be 
parallelized for a single mail:

a) first, "move" filters are checked, such as spam filters, which either 
discard the mail or put it into a sub folder, or mailing list filters, which 
put the mails into a folder for the given list. as soon as any filter matches, 
the chain stops, so this stage cannot be parallelized
b) once the final place is found, and it is a folder that we want to have 
indexed, we feed the mail over to e.g. baloo. again, not something you can do 
in parallel, as you don't want to index spam mails, and you also want to know 
the final place of the mail.

What other, _common_ use case do you think of that would benefit from the 
additional design overhead?
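[The sequential, short-circuiting behaviour described in a) could be sketched like this. A hypothetical illustration in plain C++: the names MoveFilter and targetFolder, and the "inbox" default, are invented here, not taken from KMail or akonadinext.]

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct Mail {
    std::string from;
    std::string subject;
};

// A "move" filter either claims the mail for a folder or passes.
using MoveFilter = std::function<std::optional<std::string>(const Mail &)>;

// Filters are checked in order; the first match wins and the chain
// stops, which is why this stage is inherently sequential.
std::string targetFolder(const std::vector<MoveFilter> &filters,
                         const Mail &mail)
{
    for (const auto &filter : filters) {
        if (auto folder = filter(mail))
            return *folder;
    }
    return "inbox"; // no filter matched: leave the mail in the inbox
}
```

Only after this chain has produced the final folder would step b) (feeding the mail to an indexer) make sense, which is the ordering constraint argued above.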

> as for safety: before a pipeline is started, the entity will be stored on
> disk in storage, so if something Goes Wrong(tm) and the synchronizer
> process crashes, data will not be lost and the pipeline can pick up again.
> to make this even more robust, i need to implement pipeline checkpointing
> so it can automatically skip over problematic filters (self-healing).
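[The checkpointing idea quoted above might look roughly like this. A sketch under stated assumptions: the entity is assumed to already be durably stored before the pipeline starts, the "checkpoint" here is just an in-memory map standing in for persisted progress, and the Checkpoint/runWithCheckpoints names are invented for illustration.]

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Records, per entity, the index of the next filter to run. In a real
// implementation this would be persisted so a restarted synchronizer
// resumes where it left off instead of re-running (or re-crashing on)
// earlier filters.
struct Checkpoint {
    std::map<std::string, size_t> nextFilter;
};

using Filter = void (*)(const std::string &entityId);

// Resumes at the recorded position; a failing filter is recorded as
// done and skipped over, so one bad filter cannot wedge the pipeline
// (the "self-healing" behaviour described above).
void runWithCheckpoints(const std::vector<Filter> &filters,
                        const std::string &entityId, Checkpoint &cp)
{
    for (size_t &i = cp.nextFilter[entityId]; i < filters.size(); ++i) {
        try {
            filters[i](entityId);
        } catch (const std::exception &) {
            // skip the problematic filter and carry on
        }
    }
}
```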
> 
> at this point, i'm very much open to ideas and suggestions for the pipeline
> feature. this is the easiest point in time to adjust the design and
> implementation. so speak up if this interests you! :) also: if you want to
> hack on these things, you are very much welcome to get your hands dirty with
> me. there is a lot to do, it isn't (yet ;) hard to make useful progress in
> the code base, ... come in, the water's nice!

Just want to note that a proper pipeline was long planned for Akonadi-now but 
never implemented, afaik. This should hopefully also resolve some issues we 
currently have with filters duplicating mails and so on.

> [1] this is actually faked right now, waiting on further development of the
> Storage class and how we use it; but the synchronizer side of this is all
> done
> 
> [2] in case you're wondering about efficiency there, on my laptop a client
> and server can exchange in excess of 800,000 messages of this sort per
> second. that includes the buffer creation overhead, transmission,
> reception...
> 
> [3] emails, calendar items, whatever.

Bye

-- 
Milian Wolff
mail at milianw.de
http://milianw.de
_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/


