[Kde-pim] akonadinext update: entity processing pipelines in resources

Wed Dec 17 15:50:41 GMT 2014

On Wednesday 17 December 2014 16.33:12 Milian Wolff wrote:
> On Wednesday 17 December 2014 15:06:35 Christian Mollekopf wrote:
> > On Wednesday 17 December 2014 14.49:53 Milian Wolff wrote:
> > > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > > hey :)
> > > > 
> > > > how's that for a subject line full of jargon? ;)
> > > > 
> > > > in the akonadinext repo, synchronizer processes are approaching some sort
> > > > of early-stage completeness. they currently:
> > > > 
> > > > * accept connections from clients
> > > > * inform clients when the revision in the store changes[1]
> > > > * take commands from clients and respond with a task completion message[2]
> > > > * load Resource plugins to deal with storing data and synchronizing with
> > > >   the source
> > > > * manage processing pipelines for entities
> > > > 
> > > > it's that last part that i'm writing about here, actually. a pipeline is
> > > > zero or more processing units (soon to be plugins) that (currently) sit in
> > > > a chain and do some processing on entities[3] whenever they are created,
> > > > modified or deleted. we will be using this to populate indexes, trigger
> > > > full text indexing, apply client-side filters, do spam/scam detection,
> > > > etc.
> > > > 
> > > > anything that can be / should be done to a given entity when it appears,
> > > > changes or is removed will happen in these pipelines.
> > > > 
> > > > things left to do:
> > > > 
> > > > * make PipelineFilter pluggable so that it is easy for people to add new
> > > >   filters (including ones we don't ship ourselves)
> > > > * generate a configuration scheme which Pipeline can use to populate
> > > >   pipelines at runtime according to the user's wishes
> > > > * write a few PipelineFilter plugins that do some actually useful things
> > > >   that we can use in testing
> > > > 
> > > > currently, pipelines are just a simple one-after-the-other processing
> > > > affair. It is set up already for asynchronous processing, however.
> > > > Eventually I would like to allow filters to note that they can be
> > > > parallelized, should be run in a separate thread, ??? ... mostly so that
> > > > we can increase throughput.
> > > 
> > > This, imo, will kill user-configuration. You do not want to burden the user
> > > with a GUI where he can define dependencies etc. pp.
> > > 
> > > Also, I cannot think of any common use-case of mail filtering that could be
> > > parallelized for a single mail:
> > > 
> > > a) first, "move" filters are checked, such as spam filters which either
> > > discard the mail or put it into a sub folder, or mailing list filters,
> > > which put the mails into a folder for the given list. when any filter
> > > matches here, the chain is stopped, thus it cannot be parallelized
> > > b) if the final place is found, and it is a folder that we want to have
> > > indexed, we feed it over to e.g. baloo. again, not something you can do in
> > > parallel, as you don't want to index spam mails, and also want to know the
> > > final place of the mail.
> > > 
> > > What other, _common_ usecase do you think of that would benefit from the
> > > additional design overhead?
> > 
> > "filter" in this context are not what we currently have as client-side
> > filtering. It's rather a "processor" if you will, that gets run as
> > new or modified entities are processed.
> > 
> > See also:
> > https://community.kde.org/KDE_PIM/Akonadi_Next/Terminology
> > 
> > Filter could be used for:
> > * indexing
> > * detecting spam
> > * client-side filtering (which is what you meant I think)
> > * ....
> > 
> > It basically allows us to plug in pieces of functionality that we can
> > guarantee get processed before an entity officially enters the system.
> > 
> > So most of these filters will be fixed by the configuration shipped with
> > the resource, not something the user can adjust. Some filters may be
> > optional or react to user configuration, such as client-side filtering or
> > optional full-text indexing.
> 
> The rest of my mail still stands as-is though. What do you think can
> potentially be parallelized?
Everything read-only is easy to parallelize:
* various indexers (full text, indexes for efficient hierarchy lookups, etc.)
* a hypothetical filter that extracts invitations or attachments from mails
* spam detection etc. could be parallelized if we don't modify the email
  message (and not modifying it would be a good idea, I think)
* various notifications could be implemented in a parallelized filter
  (although that should perhaps rather be implemented as a daemon listening
  for updates)

Client-side email filtering is one of the few filters that can't be
parallelized, as it acts as a gatekeeper and actually moves mails elsewhere.
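
To make that split concrete, here is a minimal Python sketch. It is not
akonadinext code (the real thing would be C++), and every name in it
(Entity, run_pipeline, the example filters, the folder names) is made up
for illustration. It assumes gatekeeper filters run one after the other and
stop at the first match, and that read-only filters can then run
concurrently because none of them touches the entity:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Entity:
    """Hypothetical stand-in for a mail entity."""
    subject: str
    body: str
    folder: str = "inbox"


# A gatekeeper returns a target folder if it matches, otherwise None.
Gatekeeper = Callable[[Entity], Optional[str]]
# A read-only filter returns some result and must not modify the entity.
ReadOnlyFilter = Callable[[Entity], object]


def run_pipeline(entity: Entity,
                 gatekeepers: List[Gatekeeper],
                 read_only_filters: Dict[str, ReadOnlyFilter]) -> Dict[str, object]:
    # Sequential stage: the first matching gatekeeper moves the mail
    # and ends the chain, so this part cannot be parallelized.
    for gate in gatekeepers:
        target = gate(entity)
        if target is not None:
            entity.folder = target
            break

    # Parallel stage: read-only filters only inspect the entity,
    # so they can be submitted to a thread pool together.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(f, entity)
                   for name, f in read_only_filters.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Example filters, all hypothetical.
def mailing_list_filter(entity: Entity) -> Optional[str]:
    return "lists/kde-pim" if "[Kde-pim]" in entity.subject else None


def word_count(entity: Entity) -> int:
    return len(entity.body.split())


def has_invitation(entity: Entity) -> bool:
    return "BEGIN:VCALENDAR" in entity.body
```

Whether the read-only stage should be skipped for mails that end up in a
spam folder is a policy question the sketch leaves out.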

> What, besides client-side filtering, would be
> configurable by the user?

I'd imagine e.g. full-text indexing would be user-configurable per folder,
as it's expensive. If we move some exotic functionality such as
notifications or attachment extraction into the filters, then those would
likely be optional and perhaps have further configuration items attached.
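
As a rough illustration of what such configuration could look like, here is
a small Python sketch. No configuration scheme exists yet, so FilterConfig,
active_filters and the filter and folder names are all assumptions made up
for this example. It assumes the resource ships a fixed list of filters and
the user can only toggle the optional ones, optionally restricted to
certain folders:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class FilterConfig:
    """Hypothetical user-facing settings for one optional filter."""
    enabled: bool = True
    # None means "all folders"; otherwise only the listed folders.
    folders: Optional[Set[str]] = None


def active_filters(fixed: List[str],
                   optional: Dict[str, FilterConfig],
                   folder: str) -> List[str]:
    """Return the filter names that apply to an entity in `folder`."""
    # Filters shipped with the resource always run.
    names = list(fixed)
    # Optional filters run only if enabled and the folder is selected.
    for name, cfg in optional.items():
        if cfg.enabled and (cfg.folders is None or folder in cfg.folders):
            names.append(name)
    return names


fixed = ["hierarchy_index"]
optional = {
    "fulltext_index": FilterConfig(folders={"inbox"}),
    "attachment_extraction": FilterConfig(enabled=False),
}
```

With these settings, a mail in "inbox" gets the hierarchy index and the
full-text index, while a mail in any other folder only gets the hierarchy
index.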

Cheers,
Christian

_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/