[Nepomuk] [Kde-pim] Support for efficient searching in akonadi (e.g. by UID)

Christian Mollekopf chrigi_1 at fastmail.fm
Mon May 20 12:25:10 UTC 2013


On Saturday 18 May 2013 11.01:00 Volker Krause wrote:

> > I think there are two possible options where we could add this to the db:
> > * a dedicated table
> > * a special part
> 
> There's a third: an extra (indexed) column for the item, and methods for
> setting/getting this in Akonadi::Item.
> 

Right, I dismissed that initially because I thought of the UID as a cache of 
a part of the payload, and figured the cache would be easier to extend to 
further properties if it were not just another column.

I would find the design with a separate table cleaner (no fields that can be 
empty), but I don't know whether the extra join would have a noticeable 
negative performance impact.

I'm not entirely sure we need the accessors in Akonadi::Item at all; the 
serializers would always overwrite the value anyway, which isn't very 
intuitive. Put differently, I don't see a case in which we wouldn't want the 
GID to be extracted by the serializer, but set by the user instead.

Maybe an ItemFetchJob and ItemModify/Create jobs which take the GID as a 
string?
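
For illustration, something like this could work from the client side (the 
GID accessor and the GID-based fetch are invented here for the sake of 
discussion; none of this exists yet):

    #include <QString>
    #include <Akonadi/Item>
    #include <Akonadi/ItemFetchJob>

    void fetchByGid()
    {
        // Hypothetical: Item::setGid() is made up for this sketch.
        Akonadi::Item item;
        item.setGid(QLatin1String("some-ical-uid")); // illustrative value

        // Assumption: ItemFetchJob would resolve items by GID when only a
        // GID is set, analogous to the existing RID-based fetching.
        Akonadi::ItemFetchJob *job = new Akonadi::ItemFetchJob(item);
        job->fetchScope().fetchFullPayload();
        // Once the job has finished, job->items() would contain zero, one
        // or multiple results, since a GID is not guaranteed to be unique.
    }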

> > The latter would be less efficient, because an index would also contain
> > all the full payloads, and there are of course many more parts in the db
> > than there are items, but it would add less complexity to the design.
> 
> we can't index parts without adding a real full-text index, and that would
> be largely pointless since most parts are encoded in formats that aren't
> necessarily human readable.
> 

My point was that we want an index of the UIDs for fast lookup, and if the 
UIDs are mixed in with the other parts, the index would be needlessly 
cluttered with payload data, which, as you mentioned, is not useful. If the 
UID is in a separate column though, lookups should be a lot more efficient.

> > To ensure that the UID is always up to date if it is being used, I think
> > the right place to update this cache would be the ItemSerializerPlugin,
> > either by using a part or by extending the serializer interface.
> 
> The serializer is indeed a good place to handle this, also for the third
> option mentioned above.
> 

Would adding an extractUID(const Akonadi::Item &) virtual call to the 
serializer plugin interface make sense?
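
Roughly like this (purely a sketch; the name, placement and the default 
implementation are all up for discussion):

    #include <QString>

    namespace Akonadi {
    class Item;
    }

    // Illustrative extension of the serializer plugin interface; a default
    // implementation returning a null QString would keep existing plugins
    // working unchanged.
    class ItemSerializerPluginV2
    {
    public:
        virtual ~ItemSerializerPluginV2() {}

        // Return the payload's globally unique identifier (e.g. the iCal
        // or vCard UID); a null QString means "no GID available".
        virtual QString extractUID(const Akonadi::Item &item) const
        {
            Q_UNUSED(item);
            return QString();
        }
    };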

> This would also include a new version of the FETCH command ("GID", next to
> the existing UID, RID and HRID ones). It can return zero, one or multiple
> results, which would be up to the user to handle.
> 

Since there are already versions for RID/HRID that result in similar behavior 
(not guaranteed to succeed in normal operation, identifier not guaranteed to 
be unique), I suppose that makes sense.

Otherwise I would have suggested an ItemSearchJob.
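
Handling the zero/one/multiple cases would then be up to the caller, along 
these lines (again just a sketch, building on the hypothetical GID fetch 
above):

    #include <QDebug>
    #include <KJob>
    #include <Akonadi/Item>
    #include <Akonadi/ItemFetchJob>

    // Illustrative result handler for the hypothetical GID-based fetch,
    // spelling out the zero/one/multiple cases explicitly.
    void handleGidFetchResult(KJob *job)
    {
        if (job->error()) {
            qWarning() << "GID fetch failed:" << job->errorString();
            return;
        }
        const Akonadi::Item::List items =
            static_cast<Akonadi::ItemFetchJob*>(job)->items();
        if (items.isEmpty()) {
            // No item with this GID exists (yet).
        } else if (items.size() == 1) {
            // The common case: exactly one match.
        } else {
            // GIDs are not guaranteed to be unique, so the caller has to
            // disambiguate, e.g. by collection or remote id.
        }
    }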

> All this shouldn't even be too much work to add. And it's generic enough to
> be useful for all kinds of types. As mentioned above, we discussed this
> before and agreed it's a good idea to have this, just needs a volunteer to
> do the work ;)
> 

Yep, I'll add this sooner rather than later.

> > Note that while I talked only about UIDs so far, we may want to cache
> > other parts of the payload in the future in order to make them
> > searchable. For instance, to be able to load only calendar objects which
> > occur within a certain timeframe, we could cache start and end date (I
> > know in this specific case that we have a performance problem for large
> > calendars, and this would be one way to reduce it). So the UID table
> > could also be a CACHE table containing an additional TYPE column (same
> > if cached in special parts).
> 
> Here I have to disagree. This is getting way too close to Nepomuk. Object
> identification and id mapping are in scope for Akonadi, understanding
> content/payload semantics is not.
> 
> The calendar performance use-case is very valid of course, but also not
> entirely new (there should be code for an unfinished calendar search agent
> somewhere). The rationale back then was that the ical time range query
> problem is actually complex enough (and also not easily covered by
> Nepomuk) to warrant a specialized solution (consider timezones,
> recurrences, etc.). Also, lack of indexing leads "only" to poor
> performance, not to the complete lack of a feature (as is the case with
> the use-cases listed above for GID FETCH).
> 

Well, I think this feature doesn't fit either problem domain particularly 
well, but it could be solved by both systems. Obviously caching isn't the 
primary concern of Akonadi, but IMO the GID is also just a special case of a 
cache, one that happens to be very generally applicable.

While Nepomuk already holds structured data, it is IMO more about being able 
to store and retrieve structured data in general, not about maintaining a 
cache of a very specific subset for pure performance reasons (and it is 
therefore not particularly efficient for that). In other words, it's about 
being able to run complex queries against the data structures, not about 
being particularly efficient for a very specific subset.

What we need is a persistent caching solution that scales to large data sets, 
and while this would generally belong in a separate component, we could 
implement it in Akonadi, both to avoid yet another complex component in the 
system and to be able to reuse large parts of the existing Akonadi 
infrastructure.
As I said, the GID case falls into the same category IMO (it's not strictly 
required for synchronization).

I'm fine with treating the GID case specially though, as it is probably the 
most widely applicable form of such a cache. We can decide later where to 
solve the other problems we have.

Cheers,
Christian

