[Kde-pim] akonadinext update: entity processing pipelines in resources

Aaron J. Seigo aseigo at kde.org
Thu Dec 18 17:02:30 GMT 2014


On Thursday, December 18, 2014 14.34:58 Daniel Vrátil wrote:
> I think we should think here about what the scope of parallelization should
> be: do we want to run a single email instance through multiple filters in
> parallel, or do we want to process multiple emails at once in parallel
> pipelines?

We don't know yet, and I don't think we *can know* until we profile various 
workloads.

I would not be surprised at all if the strategy employed during a large remote 
sync (e.g. pulling 100k emails from a Kolab server) differs from the one used 
when adding a new entity locally (e.g. KOrganizer adding a new event to a 
calendar).
 
> I think that trying to run multiple filters on one email in parallel does
> not make much sense, and unless you have real hard numbers to back this up,

Exactly; we need real numbers. So the current implementation is entirely 
"naive" (non-parallel) but I want the design to allow for parallel processing 
to be added wherever it ends up making sense.
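
To make "naive" concrete, this is roughly the shape of it; a minimal sketch, 
with Entity, Processor and Pipeline as illustrative stand-ins rather than the 
actual akonadinext classes:

  #include <memory>
  #include <string>
  #include <vector>

  struct Entity {
      std::string payload;  // stand-in for the stored entity data
  };

  class Processor {
  public:
      virtual ~Processor() = default;
      virtual void process(Entity &entity) = 0;
  };

  // The "naive" strategy: every processor runs in order, in a single thread.
  class Pipeline {
  public:
      void addProcessor(std::unique_ptr<Processor> processor) {
          m_processors.push_back(std::move(processor));
      }
      void process(Entity &entity) {
          for (auto &processor : m_processors) {
              processor->process(entity);  // strictly serial, no parallelism yet
          }
      }
  private:
      std::vector<std::unique_ptr<Processor>> m_processors;
  };

The interface is deliberately small so that a parallel scheduler could later 
be swapped in behind process() without the processors themselves changing.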

That said, full text indexing is slow, and I can imagine some kinds of 
processing that would require consulting remote databases, which will be 
inherently slow. Being able to finish up all the other jobs while these slower 
processes run will allow entities to be displayed in clients faster.

Then there are issues like calculating thread groups and their leaders. This 
may be faster, not to mention simpler, to do serially than in parallel, since 
serial processing rules out the race conditions that arise when emails from 
the same thread are processed simultaneously in different pipelines.

I don't think we'll end up with massive parallel processing where every job 
that is possible to run in parallel is. Rather, I sort of expect we'll end up 
with a single channel that runs fast processors with no out-of-process 
dependencies and other channels for processors that are either slow or have 
out-of-process requirements.
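
To sketch what I mean by channels (purely illustrative, not actual 
akonadinext code): a channel is little more than a worker thread draining a 
job queue, and a resource would own one for the fast processors and one or 
more for the slow ones:

  #include <condition_variable>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>

  class Channel {
  public:
      Channel() : m_worker([this] { run(); }) {}
      ~Channel() {
          {
              std::lock_guard<std::mutex> lock(m_mutex);
              m_done = true;
          }
          m_cv.notify_one();
          m_worker.join();
      }
      void enqueue(std::function<void()> job) {
          {
              std::lock_guard<std::mutex> lock(m_mutex);
              m_jobs.push(std::move(job));
          }
          m_cv.notify_one();
      }
  private:
      void run() {
          for (;;) {
              std::function<void()> job;
              {
                  std::unique_lock<std::mutex> lock(m_mutex);
                  m_cv.wait(lock, [this] { return m_done || !m_jobs.empty(); });
                  if (m_jobs.empty())
                      return;  // done and queue drained: exit the thread
                  job = std::move(m_jobs.front());
                  m_jobs.pop();
              }
              job();  // run outside the lock so enqueue() never blocks on a job
          }
      }
      std::mutex m_mutex;
      std::condition_variable m_cv;
      std::queue<std::function<void()>> m_jobs;
      bool m_done = false;
      std::thread m_worker;
  };

A single slow channel could also be shared by every pipeline, which is the 
"one FTI thread" arrangement discussed below.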

> the performance gain does not simply outweigh the complexity of the code
> to manage the filters graph (to detect which can be executed in parallel,
> and when). This will not improve the throughput.

I don't think it will be very complex at all. That processing would also be 
done *once*, so if the graph took 100ms to calculate (which would be absurdly 
long) but saved 1ms per run, it would only require 100 entities to pass 
through it to break even, and it would be all profit afterwards. 1ms is a full 
second for 1000 entities.
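
Spelled out with those (purely illustrative) numbers:

  break-even = graph cost / saving per entity = 100ms / 1ms = 100 entities
  net saving on a 100k email sync = 100,000 x 1ms - 100ms =~ 99.9 seconds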

I expect the savings could be significantly higher than that and the cost of 
creating the graph significantly lower than that ... but we'll find out with 
numbers in hand.

> On the other hand, we really want to be able to process multiple emails in
> parallel - for instance during sync. Having 4 or so identical pipelines
> running in threads and distributing incoming emails between them evenly
> would be a massive performance boost IMO. 

We don't actually know that, either. 

For example, full text indexing (FTI). It may turn out that a single thread 
for FTI is preferable due to database contention or CPU usage. So perhaps we 
want to have all pipelines (even if there are multiple of them) use a single 
FTI thread.

Or consider a pipeline with 4 processors, three of which finish reliably in 
1ms each and one which takes 3ms. If we run 4 threads each running a pipeline 
in series, it will take 6ms to process 4 mails, assuming each thread runs at 
full throttle on its own CPU. (Optimistic, but for simplicity's sake let's be 
optimistic :) If those pipelines instead run the three fast processors in one 
thread and the slow one in another, one instance of that split pipeline will 
take 12ms ... two would take 6ms, and four would take (again, optimistically) 
3ms.
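
To tabulate that (same optimistic assumptions: one CPU per thread, no 
scheduling overhead, 3x1ms of fast work plus 3ms of slow work per mail):

  4 identical serial pipelines, 4 threads:        6ms for 4 mails
  1 split pipeline (fast thread + slow thread):  12ms
  2 split pipelines:                              6ms
  4 split pipelines:                              3ms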

This also assumes that the processing is CPU bound. If there is contention on 
disk storage, service access, etc., faster processing of individual entities 
may win out over slower processing in more threads.

We just don't know, and to find out we'll want a system that is flexible 
enough to support various approaches and then run it against actual data sets 
on typical "end user" type systems.

> It would also reduce the
> complexity of the filter-management code, as you would have only 3 types of
> filters:
> 
> * Pre-pipeline filters - filters that each entity has to pass before
> entering the pipeline. 

All filters are in the pipeline. There is no pre-pipeline filtering, because 
we need to be able to checkpoint data into durable local storage before doing 
anything with it, since that "anything" may incur a crash or otherwise cause 
data to be unavailable.

So we can scratch this one :)
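
In code terms the ordering is roughly this; a sketch reusing the illustrative 
Pipeline and Entity from above, with Storage equally made up:

  struct Storage {
      void store(const Entity &entity) { /* append to durable local storage */ }
  };

  void ingest(Storage &storage, Pipeline &pipeline, Entity entity)
  {
      // The durable checkpoint always comes first ...
      storage.store(entity);
      // ... so that a crash anywhere in here can be recovered by replaying
      // the stored entity through the pipeline on the next start.
      pipeline.process(entity);
  }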

> * Pipeline filters - the filters are simply chained (= pipeline) -

This is what the current implementation does, but I expect it won't stay that 
way.

> there are
> multiple instances of the pipeline, each instance has its thread. This
> handles indexing, mail filtering, etc.

Again, we don't know.

> * Post-pipeline filters - same as pre-pipeline filters, just executed after
> the entity leaves the pipeline. Could be the threading filter for example.

This is no different from creating a graph with which to parallelize a single 
pipeline. It requires knowing about the filters and the order in which they 
must run.

> All you need to specify for each filter is its type (Pre, Pipeline, Post)
> and its weight to enforce order of the filters in the chain (e.g. mail
> filter filter (see why I prefer "preprocessor" to "filter" here? :D) should
> be before indexer, etc.).

That will probably not be enough. What happens when two "pre-pipeline" 
filters / processors must run before the mail filter, but one of those 
"pre-pipeline" filters must also run before the other? It really will need to 
know the relationships between all the filters, based on some generic set of 
hints, just to order them properly in the pipeline in the first place.
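
For what it's worth, deriving an order from generic pairwise hints is not 
much code; here is a sketch (illustrative names, a plain topological sort 
over "A runs before B" constraints):

  #include <map>
  #include <queue>
  #include <stdexcept>
  #include <string>
  #include <utility>
  #include <vector>

  // Orders processors from pairwise "a must run before b" hints; throws if
  // the hints contradict each other (i.e. form a cycle).
  std::vector<std::string> orderProcessors(
      const std::vector<std::string> &names,
      const std::vector<std::pair<std::string, std::string>> &runsBefore)
  {
      std::map<std::string, std::vector<std::string>> edges;
      std::map<std::string, int> indegree;
      for (const auto &name : names)
          indegree[name] = 0;
      for (const auto &[before, after] : runsBefore) {
          edges[before].push_back(after);
          ++indegree[after];
      }
      std::queue<std::string> ready;
      for (const auto &[name, degree] : indegree)
          if (degree == 0)
              ready.push(name);
      std::vector<std::string> order;
      while (!ready.empty()) {
          const std::string name = ready.front();
          ready.pop();
          order.push_back(name);
          for (const auto &next : edges[name])
              if (--indegree[next] == 0)
                  ready.push(next);
      }
      if (order.size() != names.size())
          throw std::runtime_error("cycle in processor ordering hints");
      return order;
  }

For example, orderProcessors({"spamcheck", "mailfilter", "indexer"}, 
{{"spamcheck", "mailfilter"}, {"mailfilter", "indexer"}}) yields spamcheck, 
mailfilter, indexer, matching the "mail filter before indexer" case above.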

Either that or we hardcode the layout of every pipeline for every resource and 
there is no extensibility. That is a possibility, if a brittle one, as it then 
relies on people getting it right.

-- 
Aaron J. Seigo