[Kde-pim] akonadinext update: entity processing pipelines in resources
Daniel Vrátil
dvratil at redhat.com
Thu Dec 18 13:34:58 GMT 2014
On Thursday, December 18, 2014 09:55:30 AM Aaron J. Seigo wrote:
> On Wednesday, December 17, 2014 14.49:53 you wrote:
> > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > currently, pipelines are just a simple one-after-the-other processing
> > > afair. It is set up already for asynchronous processing, however.
> > > Eventually I would like to allow filters to note that they can be
> > > parallelized, should be run in a separate thread, ??? ... mostly so that
> > > we can increase throughput.
> >
> > This, imo, will kill user-configuration. You do not want to burden the
> > user with a GUI where he can define dependencies etc. pp.
>
> Yes, that would make no sense. So that's probably not what I'm proposing :)
>
> > Also, I cannot think of any common use-case of mail filtering that could
> > be parallelized for a single mail:
>
> Christian already pointed to it, but:
>
> https://community.kde.org/KDE_PIM/Akonadi_Next/Terminology
>
> Mail filtering is a specific use case, but the abstract concept is
> "processing an entity for content". Evidently the word "filter" is causing
> confusion, and that's perhaps understandable since the word has meaning in
> the scope of email. (.. and of course, Akonadi is not, strictly speaking,
> even an email system; it's a system that can be used to manage email stores
> ..)
>
> Better suggestions for the word "filter" are welcome. We are early enough
> that we can still change these terms.
>
> > What other, _common_ usecase do you think of that would benefit from the
> > additional design overhead?
>
> The point of having pipelines is to ensure all post-delivery processing is
> done before clients start showing (wrong) data. Filters that move an email
> between folders, for instance, should be run *before* showing the email in
> the wrong folder in the client.
>
> So, real world use cases:
>
> 1. a mail filter that moves an email to a folder
> 2. a scam detector (currently this lives in libmessageviewer!)
> 3. a full text indexer
> 4. threading agent (relies on knowing which folder it is in)
> 5. a mail filter that flags mails from your boss as important
> 6. an event checker that flags conflicts between incoming events and
> existing ones
>
> 1, 2, 3, 4 and 6 do not modify the entity itself. They touch indexes, but
> not the entity itself. Number 5 does.
>
> Number 1 needs to be run before numbers 3 and 4, but can be run in parallel
> with 2 and 5 (which also needs to be run before 3 and 4). 3 and 4 can be run
> in parallel. 6 may run on emails and on calendar events, does not touch the
> entities, and nothing depends on its output.
>
> the graph that comes from that is self-evident once all the information is
> known .. but that's the trick: making sure each element can provide enough
> useful, machine-processable information to know what the graph should be.
>
> as for user configuration, they may wish to not have scam detection on
> (e.g.). with that off, then the set of filters that are run change (in this
> case #2 is just not run at all) and the graph changes as a result as well.
>
> additionally, 1 and 5 are obviously generated from user configuration. the
> user won't know that, but that is what will be happening: their filters will
> be creating nodes in the pipeline.
>
> as for why to parallelize, that's simple: throughput.
I think we should consider what the scope of parallelization should be: do we
want to run a single email through multiple filters in parallel, or do we want
to process multiple emails at once in parallel pipelines?

I think that trying to run multiple filters on one email in parallel does not
make much sense, and unless you have hard numbers to back this up, the
performance gain simply does not outweigh the complexity of the code needed to
manage the filter graph (to detect which filters can be executed in parallel,
and when). This will not improve the throughput.
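
To make "managing the filter graph" concrete, here is a minimal sketch in
Python of how the ordering constraints from the quoted list (1 before 3 and 4,
5 before 3 and 4, the rest independent) could be turned into stages of filters
that may run in parallel. The filter names, the DEPENDS_ON table and the
stages() helper are all hypothetical and not part of Akonadi. Reading the
quoted text as "5 must run before 3 and 4" (rather than "2 and 5") is also an
assumption.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical encoding: filter name -> filters that must finish first.
DEPENDS_ON = {
    "move_to_folder": set(),                                   # 1
    "scam_detector": set(),                                    # 2
    "fulltext_indexer": {"move_to_folder", "flag_important"},  # 3
    "threading": {"move_to_folder", "flag_important"},         # 4
    "flag_important": set(),                                   # 5
    "conflict_checker": set(),                                 # 6
}


def stages(depends_on, disabled=frozenset()):
    """Group filters into stages; filters within a stage are independent."""
    # Drop disabled filters and any dependencies on them.
    graph = {
        name: deps - disabled
        for name, deps in depends_on.items()
        if name not in disabled
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        result.append(ready)
        sorter.done(*ready)                 # unlocks the next stage
    return result


if __name__ == "__main__":
    for number, stage in enumerate(stages(DEPENDS_ON), start=1):
        print(number, stage)
```

Switching a filter off (e.g. stages(DEPENDS_ON, disabled={"scam_detector"}))
just removes that node and recomputes the stages. The sketch only covers the
ordering part; scheduling, error handling and entity modification are left
out entirely.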
On the other hand, we really want to be able to process multiple emails in
parallel, for instance during sync. Having 4 or so identical pipelines running
in threads and distributing incoming emails evenly between them would be a
massive performance boost IMO. It would also reduce the complexity of the
filter-management code, as you would have only 3 types of filters:

* Pre-pipeline filters - filters that each entity has to pass before entering
  the pipeline. There is only one instance of each filter, and it is not
  parallelized. This has to be a super-fast filter. I list it mostly so that
  the list is complete. The only case I can think of is balancing the
  incoming entities between the parallelized pipelines.
* Pipeline filters - the filters are simply chained (= pipeline). There are
  multiple instances of the pipeline, and each instance has its own thread.
  This handles indexing, mail filtering, etc.
* Post-pipeline filters - same as pre-pipeline filters, just executed after
  the entity leaves the pipeline. Could be the threading filter, for example.

All you need to specify for each filter is its type (Pre, Pipeline, Post) and
its weight to enforce the order of the filters in the chain (e.g. the mail
filter filter (see why I prefer "preprocessor" to "filter" here? :D) should
come before the indexer, etc.).
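
A rough sketch of that scheme in Python, under stated assumptions: Kind,
Filter, ordered() and process_all() are hypothetical names, entities are plain
dicts, and a thread pool with a shared work queue stands in for the explicit
balancing pre-filter. It illustrates the structure only and says nothing
about the actual performance.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class Kind(IntEnum):
    PRE = 0       # single instance, runs before the pipeline
    PIPELINE = 1  # chained, one chain per worker thread
    POST = 2      # single instance, runs after the pipeline


@dataclass(frozen=True)
class Filter:
    name: str
    kind: Kind
    weight: int                   # lower weight runs earlier within a kind
    run: Callable[[dict], None]   # processes one entity in place


def ordered(filters, kind):
    """Filters of one kind, sorted by weight."""
    return sorted((f for f in filters if f.kind == kind),
                  key=lambda f: f.weight)


def process_all(entities, filters, workers=4):
    entities = list(entities)
    pre, chain, post = (ordered(filters, kind) for kind in Kind)

    for entity in entities:       # pre-pipeline: serial
        for f in pre:
            f.run(entity)

    def run_chain(entity):        # the whole chain runs on one worker thread
        for f in chain:
            f.run(entity)
        return entity

    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(run_chain, entities))  # keeps input order

    for entity in processed:      # post-pipeline: serial again
        for f in post:
            f.run(entity)
    return processed
```

Each entity passes through the whole chain on a single thread, so the filters
inside the chain never have to coordinate with each other; only the pre and
post filters see entities one at a time.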
Dan
>
> as you note, we should be able to parallelize processing of individual
> emails, but even then only to an extent. the threading agent is much
> simpler if it is only ever processing one email at a time, so maybe we
> never want it to be running in parallel, while the scam detector perhaps
> ought to be running in as many individual pipelines as possible at once.
>
> additionally, some processes take more time than others and block yet
> others. running 1, 2, 3 and 6 in parallel will get us to 4, 5 that much
> faster. throughput, plain and simple.
>
> we are thinking about all of these issues with datasets of 100s of 1000s of
> folders / emails in a single collection in mind. Kolab Systems has clients
> with exactly such data sets, in fact.
>
> hope that helps clear up some things. if not, keep asking :)
--
Daniel Vrátil | dvratil at redhat.com | dvratil on #kde-devel, #kontact, #akonadi
Software Engineer - KDE Desktop Team, Red Hat Inc.
GPG Key: 0xC59D614F6F4AE348
Fingerprint: 4EC1 86E3 C54E 0B39 5FDD B5FB C59D 614F 6F4A E348