[Kde-pim] akonadinext update: entity processing pipelines in resources

Wed Dec 17 16:34:22 GMT 2014

On Wednesday 17 December 2014 17.04:22 Milian Wolff wrote:
> On Wednesday 17 December 2014 16:50:41 Christian Mollekopf wrote:
> > On Wednesday 17 December 2014 16.33:12 Milian Wolff wrote:
> > > On Wednesday 17 December 2014 15:06:35 Christian Mollekopf wrote:
> > > > On Wednesday 17 December 2014 14.49:53 Milian Wolff wrote:
> > > > > On Wednesday 17 December 2014 11:39:10 Aaron J. Seigo wrote:
> > > > > > hey :)
> > > > > > 
> > > > > > hows that for a subject line full of jargon? ;)
> > > > > > 
> > > > > > > in the akonadinext repo, synchronizer processes are approaching some
> > > > > > > sort of early-stage completeness. they currently:
> > > > > > 
> > > > > > * accept connections from clients
> > > > > > * inform clients when the revision in the store changes[1]
> > > > > > * take commands from clients and respond with a task completion
> > > > > > message[2]
> > > > > > * load Resource plugins to deal with storing data and
> > > > > > synchronizing
> > > > > > with
> > > > > > the source
> > > > > > * manage processing pipelines for entities
> > > > > > 
> > > > > > it's that last part that i'm writing about here, actually. a
> > > > > > pipeline
> > > > > > is
> > > > > > zero or more processing units (soon to be plugins) that
> > > > > > (currently)
> > > > > > sit
> > > > > > in
> > > > > > a chain and do some processing on entities[3] whenever they are
> > > > > > created,
> > > > > > modified or deleted. we will be using this to populate indexes,
> > > > > > trigger
> > > > > > full text indexing, applying client-side filters, spam/scam
> > > > > > detection,
> > > > > > etc.
> > > > > > etc.
> > > > > > 
> > > > > > anything that can be / should be done to a given entity when it
> > > > > > appears,
> > > > > > changes or is removed will happen in these pipelines.
> > > > > > 
> > > > > > things left to do:
> > > > > > 
> > > > > > * make PipelineFilter pluggable so that it is easy for people to
> > > > > > add
> > > > > > new
> > > > > > filters (including ones we don't ship ourselves)
> > > > > > * generate a configuration scheme which Pipeline can use to
> > > > > > populate
> > > > > > pipelines at runtime according to the user's wishes
> > > > > > * write a few PipelineFilter plugins that do some actually useful
> > > > > > things
> > > > > > that we can use in testing
> > > > > > 
> > > > > > > currently, pipelines are just a simple one-after-the-other processing
> > > > > > > affair. It is set up already for asynchronous processing, however.
> > > > > > > Eventually I would like to allow filters to note that they can be
> > > > > > > parallelized, should be run in a separate thread, ??? ... mostly so that
> > > > > > > we can increase throughput.
> > > > > 
> > > > > This, imo, will kill user-configuration. You do not want to burden
> > > > > the
> > > > > user
> > > > > with a GUI where he can define dependencies etc. pp.
> > > > > 
> > > > > Also, I cannot think of any common use-case of mail filtering that
> > > > > could
> > > > > be
> > > > > parallelized for a single mail:
> > > > > 
> > > > > a) first, "move" filters are checked, such as spam filters which
> > > > > either
> > > > > discard the mail or put it into a sub folder, or mailing list
> > > > > filters,
> > > > > which put the mails into a folder for the given list. when any
> > > > > filter
> > > > > is
> > > > > met here, the chain is stopped, thus it cannot be parallelized
> > > > > b) if the final place is found, and it is a folder that we want to
> > > > > have
> > > > > indexed, we feed it over to e.g. baloo. again, not something you can
> > > > > do
> > > > > in
> > > > > parallel, as you don't want to index spam mails, and also want to
> > > > > know
> > > > > the
> > > > > final place of the mail.
> > > > > 
> > > > > What other, _common_ usecase do you think of that would benefit from
> > > > > the
> > > > > additional design overhead?
> > > > 
> > > > "Filters" in this context are not what we currently have as client-side
> > > > filtering. A filter is rather a "processor", if you will, that gets run as
> > > > new or modified entities are processed.
> > > > 
> > > > See also:
> > > > https://community.kde.org/KDE_PIM/Akonadi_Next/Terminology
> > > > 
> > > > Filter could be used for:
> > > > * indexing
> > > > * detecting spam
> > > > * client-side filtering (which is what you meant I think)
> > > > * ....
> > > > 
> > > > It basically allows us to plug in pieces of functionality that we can
> > > > guarantee get processed before an entity officially enters the system.
> > > > 
> > > > So most of these filters will be fixed by the configuration shipped with
> > > > the resource, not something the user can adjust. Some filters may be
> > > > optional or react to user configuration such as client-side filtering, or
> > > > optional full-text indexing.
> > > 
> > > The rest of my mail still stands as-is though. What do you think of that
> > > can potentially be parallelized?
> > 
> > Everything read-only is easy to parallelize:
> What is there, that is read-only? And I do _not_ mean hypothetical stuff
> here. The items you list below only make sense when you have multiple of
> each, which is, imo, rarely - if ever - the case.
> 

We are very much in a hypothetical phase. We don't know if we need to 
parallelize yet. It's possible that it makes sense for some things; it may 
also not be all that useful. Something we need to figure out once we have the 
prototype. I do think it makes sense to keep the option open and not rule it 
out by design.

> > * various indexers (full text, indexes for efficient lookups of hierarchy
> > etc.)
> 
> There is only one indexer, what other indexer do you seriously think will be
> added that is time consuming and justifies parallelism?
> 

There are multiple indexes in the new design. We use indexes, e.g., to query 
tree hierarchies, to look up events by date range, to filter mail by flags...

All of this will require indexes and I expect that we use several filters for 
that. To what extent this should be parallelized I can't tell right now.
Indexing attachments would be something that we would probably parallelize, if 
not even delegate to another process.
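
To make the "read-only filters" idea a bit more concrete, here is a rough 
sketch of running several independent indexers over the same entity at once. 
This is not akonadinext code: Entity, flag_index, hierarchy_index and 
run_read_only are hypothetical names, and it only assumes that such filters 
inspect the entity without modifying it.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical, immutable stand-in for a mail, event, etc."""
    uid: str
    folder: str
    flags: frozenset = frozenset()


def flag_index(entity):
    # Hypothetical indexer: value to store for "filter mail by flags".
    return sorted(entity.flags)


def hierarchy_index(entity):
    # Hypothetical indexer: value to store for tree hierarchy lookups.
    return entity.folder


def run_read_only(entity, filters):
    """Run every read-only filter on the same entity concurrently.

    Safe only because the filters never modify the entity and do not
    depend on each other's results.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(filters))) as pool:
        futures = {name: pool.submit(f, entity) for name, f in filters.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Whether this buys anything over running the indexers one after the other is 
exactly the open question; the sketch only shows that nothing in the shape of 
read-only filters rules it out.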

> > * spam-detection etc. could be parallelized if we don't modify the email
> > message (which I think would be a good idea).
> 
> You'd run multiple spam detection tools in parallel? I only run
> SpamAssassin, and thus this also does not justify the parallelism.
> 

Obviously we'd run that in parallel to other filters.

> > * various notifications could be implemented in a parallelized filter
> > (although that should perhaps rather be implemented as daemon listening
> > for
> > updates)
> > 
> > the client-side email filtering is one of the few filters that can't be
> > parallelized as it acts as a gatekeeper and actually moves mails
> > elsewhere.
> 
> I think you two are overdesigning this. Parallelize by running filters for
> different mails in parallel. Don't complicate the code for minuscule gains
> on single mails.
> 

I think we'll see when we get there.
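
For comparison, the alternative suggested in the quoted text (keep the chain 
for a single mail strictly sequential, and parallelize across mails) could 
look roughly like the sketch below. The filters, the dict-based mail 
representation and the "spam_score"/"list_id" keys are made up for 
illustration; the gatekeeper behaviour (first matching filter decides and 
stops the chain) follows the description earlier in the thread.

```python
from concurrent.futures import ThreadPoolExecutor


def spam_filter(mail):
    # Hypothetical gatekeeper: returns a target folder, or None if no match.
    return "spam" if mail.get("spam_score", 0) > 5 else None


def list_filter(mail):
    # Hypothetical gatekeeper for mailing list traffic.
    list_id = mail.get("list_id")
    return "lists/" + list_id if list_id else None


CHAIN = [spam_filter, list_filter]


def run_chain(mail):
    """Sequential per-mail chain: the first filter that matches wins."""
    for gatekeeper in CHAIN:
        target = gatekeeper(mail)
        if target is not None:
            return target
    return "inbox"


def process_all(mails, workers=4):
    """Run the sequential chain for different mails in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input mails.
        return list(pool.map(run_chain, mails))
```

The two approaches are not mutually exclusive: the per-mail chain stays 
simple here, and read-only work could still be fanned out later if it turns 
out to matter.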

> > > What, besides client-side filtering, would be
> > > configurable by the user?
> > 
> > I'd imagine e.g. full-text indexing would be user configurable for which
> > folders should be indexed, as it's expensive. If we move some exotic
> > functionality such as notifications, attachment extraction, .... into the
> > filters then those would likely be optional and perhaps have further
> > attached configuration items.
> 
> But all of these things are on/off switches of hardcoded features, which is
> trivial to implement, no?

For the immediate goals, yes I think so. I currently don't see a need for 
user-pluggable filters, but simply for filters that get some configuration 
from somewhere that they can adapt to.
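
A minimal sketch of what "a filter that gets some configuration from 
somewhere" might mean in practice, using optional full-text indexing limited 
to selected folders as the example. FullTextIndexFilter and the "enabled" / 
"folders" configuration keys are hypothetical and not part of any existing 
API.

```python
class FullTextIndexFilter:
    """Hypothetical hardcoded filter that adapts to an on/off switch
    and a list of folders taken from its configuration."""

    def __init__(self, config):
        # Indexing is expensive, so this sketch defaults to off.
        self.enabled = config.get("enabled", False)
        self.folders = set(config.get("folders", []))
        self.indexed = []  # stand-in for handing the entity to an indexer

    def process(self, uid, folder):
        """Return True if the entity was passed on for indexing."""
        if not self.enabled or folder not in self.folders:
            return False
        self.indexed.append(uid)
        return True
```

The filter itself ships with the resource; only the switch and the folder 
list come from the user.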

I don't think the design will turn out as complex as you seem to imagine it.

But thanks for your input in any case =)

Cheers,
Christian

_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/