[Kde-pim] akonadinext update: entity processing pipelines in resources

Wed Dec 17 16:04:22 GMT 2014

On Wednesday 17 December 2014 16:50:41 Christian Mollekopf wrote:
> On Wednesday 17 December 2014 16.33:12 Milian Wolff wrote:
> > On Wednesday 17 December 2014 15:06:35 Christian Mollekopf wrote:
> > > On Wednesday 17 December 2014 14.49:53 Milian Wolff wrote:
> > > > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > > > hey :)
> > > > > 
> > > > > hows that for a subject line full of jargon? ;)
> > > > > 
> > > > > in the akonadinext repo, synchronizer processes are approaching some
> > > > > sort of early-stage completeness. they currently:
> > > > > 
> > > > > * accept connections from clients
> > > > > * inform clients when the revision in the store changes[1]
> > > > > * take commands from clients and respond with a task completion
> > > > >   message[2]
> > > > > * load Resource plugins to deal with storing data and synchronizing
> > > > >   with the source
> > > > > * manage processing pipelines for entities
> > > > > 
> > > > > it's that last part that i'm writing about here, actually. a pipeline
> > > > > is zero or more processing units (soon to be plugins) that (currently)
> > > > > sit in a chain and do some processing on entities[3] whenever they are
> > > > > created, modified or deleted. we will be using this to populate
> > > > > indexes, trigger full text indexing, apply client-side filters, run
> > > > > spam/scam detection, etc. etc.
> > > > > 
> > > > > anything that can be / should be done to a given entity when it
> > > > > appears, changes or is removed will happen in these pipelines.
> > > > > 
> > > > > things left to do:
> > > > > 
> > > > > * make PipelineFilter pluggable so that it is easy for people to add
> > > > >   new filters (including ones we don't ship ourselves)
> > > > > * generate a configuration scheme which Pipeline can use to populate
> > > > >   pipelines at runtime according to the user's wishes
> > > > > * write a few PipelineFilter plugins that do some actually useful
> > > > >   things that we can use in testing
> > > > > 
> > > > > currently, pipelines are just a simple one-after-the-other processing
> > > > > affair. It is set up already for asynchronous processing, however.
> > > > > Eventually I would like to allow filters to note that they can be
> > > > > parallelized, should be run in a separate thread, ??? ... mostly so
> > > > > that we can increase throughput.
> > > > 
> > > > This, imo, will kill user-configuration. You do not want to burden the
> > > > user with a GUI where he can define dependencies etc. pp.
> > > > 
> > > > Also, I cannot think of any common use-case of mail filtering that
> > > > could be parallelized for a single mail:
> > > > 
> > > > a) first, "move" filters are checked, such as spam filters which either
> > > >    discard the mail or put it into a sub folder, or mailing list
> > > >    filters, which put the mails into a folder for the given list. when
> > > >    any filter matches here, the chain is stopped, thus it cannot be
> > > >    parallelized
> > > > b) if the final place is found, and it is a folder that we want to have
> > > >    indexed, we feed it over to e.g. baloo. again, not something you can
> > > >    do in parallel, as you don't want to index spam mails, and also want
> > > >    to know the final place of the mail.
> > > > 
> > > > What other, _common_ use case can you think of that would benefit from
> > > > the additional design overhead?
> > > 
> > > "Filters" in this context are not what we currently have as client-side
> > > filtering. A filter is rather a "processor", if you will, that gets run
> > > as new or modified entities are processed.
> > > 
> > > See also:
> > > https://community.kde.org/KDE_PIM/Akonadi_Next/Terminology
> > > 
> > > Filters could be used for:
> > > * indexing
> > > * detecting spam
> > > * client-side filtering (which is what you meant I think)
> > > * ....
> > > 
> > > It basically allows us to plug in pieces of functionality that we can
> > > guarantee get processed before an entity officially enters the system.
> > > 
> > > So most of these filters will be fixed by the configuration shipped with
> > > the resource, not something the user can adjust. Some filters may be
> > > optional or react to user configuration, such as client-side filtering
> > > or optional full-text indexing.
> > 
> > The rest of my mail still stands as-is though. What are you thinking of
> > that can potentially be parallelized?
> 
> Everything read-only is easy to parallelize:

What is there that is read-only? And I do _not_ mean hypothetical stuff here.
The items you list below only make sense when you have several of each, which
is, imo, rarely - if ever - the case.

> * various indexers (full text, indexes for efficient lookups of hierarchy
> etc.)

There is only one indexer. What other indexer do you seriously think will be
added that is time-consuming enough to justify parallelism?

> * a hypothetical filter that extracts invitations or attachments from mails

I can also conjure up a hypothetical scenario, but that is uninteresting here.

> * spam-detection etc. could be parallelized if we don't modify the email
> message (which I think would be a good idea).

You'd run multiple spam detection tools in parallel? I only run SpamAssassin,
so this does not justify the parallelism either.

> * various notifications could be implemented in a parallelized filter
> (although that should perhaps rather be implemented as daemon listening for
> updates)
> 
> the client-side email filtering is one of the few filters that can't be
> parallelized as it acts as a gatekeeper and actually moves mails elsewhere.

I think you two are overdesigning this. Parallelize by running filters for
different mails in parallel. Don't complicate the code for minuscule gains on
single mails.
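
To make that concrete, here is a rough sketch of per-mail parallelism. It is
Python for brevity, and every name in it (the filter functions, the mail
dicts, `process_all`) is made up for illustration; none of it is akonadinext
API. Each mail runs through a strictly sequential chain, and only the mails
themselves are spread over workers:

```python
from concurrent.futures import ThreadPoolExecutor

# All names below are hypothetical and only illustrate the idea.

def spam_filter(mail):
    # "Move" filter: returns a target folder on a match, else None.
    return "spam" if "lottery" in mail["subject"].lower() else None

def mailing_list_filter(mail):
    # "Move" filter: mails from a list go to that list's folder.
    return mail.get("list_id")

MOVE_FILTERS = [spam_filter, mailing_list_filter]

def run_pipeline(mail):
    # Sequential for a single mail: the first matching move filter wins,
    # and indexing only happens once the final folder is known.
    result = dict(mail, folder="inbox", indexed=False)
    for move in MOVE_FILTERS:
        target = move(result)
        if target:
            result["folder"] = target
            break
    if result["folder"] != "spam":
        result["indexed"] = True  # stand-in for handing the mail to an indexer
    return result

def process_all(mails, workers=4):
    # Parallelism across mails; map() keeps the input order in the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, mails))
```

The threads are only a stand-in for whatever worker mechanism would really be
used. The point is that the per-mail code stays a plain loop.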

> > What, besides client-side filtering, would be
> > configurable by the user?
> 
> I'd imagine that e.g. full-text indexing would be user-configurable per
> folder, as it's expensive. If we move some exotic functionality such as
> notifications, attachment extraction, .... into the filters, then those
> would likely be optional and perhaps have further configuration items
> attached.

But all of these things are on/off switches for hardcoded features, which is
trivial to implement, no?
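
As a rough sketch of what I mean by switches (Python for brevity; the feature
names and the config layout are assumptions for illustration, not anything
that exists in akonadinext): the set of steps and their order are hardcoded,
and the user configuration only turns individual steps on or off.

```python
# Hypothetical feature names and defaults, for illustration only.
DEFAULTS = {
    "fulltext_index": True,
    "notifications": False,
    "attachment_extraction": False,
}

def index_fulltext(entity):
    entity.setdefault("done", []).append("fulltext_index")

def notify(entity):
    entity.setdefault("done", []).append("notifications")

def extract_attachments(entity):
    entity.setdefault("done", []).append("attachment_extraction")

# Hardcoded features in a fixed order; the user cannot rearrange them.
FEATURES = [
    ("fulltext_index", index_fulltext),
    ("notifications", notify),
    ("attachment_extraction", extract_attachments),
]

def build_pipeline(user_config):
    # User settings override the defaults; unknown keys are simply ignored.
    settings = {**DEFAULTS, **user_config}
    return [step for name, step in FEATURES if settings[name]]

def process(entity, pipeline):
    for step in pipeline:
        step(entity)
    return entity
```

With `build_pipeline({"notifications": True})` the pipeline holds the indexing
and notification steps, in that order. No dependency graph and no GUI for
ordering is needed.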

Bye

-- 
Milian Wolff
mail at milianw.de
http://milianw.de