[Kde-pim] PIM indexing and search - taking stock

Sun Mar 18 10:35:48 GMT 2012

See [srh]...

2012/3/18 Volker Krause <vkrause at kde.org>:
> Thanks for the overview Will, great progress indeed, I'm already getting used
> to the quick search doing full text search as well :)
>
> On Wednesday 14 March 2012 18:05:05 Will Stephenson wrote:
>> I'd like to share this list of the issues we face and have dealt with in
>> indexing and searching PIM data (mostly mails as their volume creates the
>> most difficulty).  You'll see that we have made quite a lot of progress
>> since the meeting.  Notice that I've written this list from a high-level,
>> product-usefulness viewpoint: it's useless having working indexing if
>> it's not possible to use the index, and equally there is no point
>> indexing data that will never be searched.
>>
>> My priority is to get kdepim 4.8 indexing usable, or at least not a
>> liability, so I'll be backporting any fixes below that are not yet in the
>> branch.
>>
>> I'll put this on the wiki, but please let me know here if you spot any
>> inconsistencies or omissions. I'll also make bug reports for the searching
>> problems I've identified.
>>
>> Will
>>
>> 1 Faults in indexing
>> 1.1 Performance faults while indexing
>> 1.1.1 FIXED Excessive work per item
>> * Excessive queries per item kde#289932#c58 [1],
>> kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in kde-runtime/4.8)
>> * Attachments fetched but not effectively indexed (Volker WIP?)
>
> The problem with attachments is that they are indexed by a helper process
> (nepomukindexer), which needs the final URI of the attachment object. However,
> what we pass in is the temporary _:xxxx URIs that still need to be resolved by
> DMS. StoreResourceJob contains the mapping AFAICT, so it's probably just a
> matter of deferring the indexData() calls until we have the result of that
> job.
>
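
The deferral described above could look roughly like the sketch below. It is
only an illustration in Python, not the feeder's actual code: the class, the
index_data callback and the shape of the temp-to-final URI mapping are all
hypothetical stand-ins, and the "nepomuk:/res/1" URI is made up.

```python
from typing import Callable, Dict, List, Tuple


class DeferredAttachmentIndexer:
    """Queue attachment indexing until temporary URIs have been resolved."""

    def __init__(self, index_data: Callable[[str, bytes], None]) -> None:
        # index_data is a hypothetical stand-in for handing data to the helper.
        self._index_data = index_data
        self._pending: List[Tuple[str, bytes]] = []

    def queue(self, temp_uri: str, payload: bytes) -> None:
        """Remember an attachment that only has a temporary (_:xxxx) URI so far."""
        self._pending.append((temp_uri, payload))

    def on_store_finished(self, mappings: Dict[str, str]) -> List[str]:
        """Index everything whose final URI is now known.

        Returns the temporary URIs that had no mapping; they stay queued.
        """
        unresolved: List[str] = []
        still_pending: List[Tuple[str, bytes]] = []
        for temp_uri, payload in self._pending:
            final_uri = mappings.get(temp_uri)
            if final_uri is None:
                still_pending.append((temp_uri, payload))
                unresolved.append(temp_uri)
            else:
                self._index_data(final_uri, payload)
        self._pending = still_pending
        return unresolved


if __name__ == "__main__":
    indexer = DeferredAttachmentIndexer(lambda uri, data: print("index", uri))
    indexer.queue("_:a", b"attachment bytes")
    indexer.on_store_finished({"_:a": "nepomuk:/res/1"})
```

Nothing is sent to the helper until the mapping arrives, so the helper only
ever sees final URIs.
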
>> * Setting the same icons on mails, attachments and their tags while indexing:
>> is this necessary? (No, commented in other feeders - CM) Can they be
>> de-duplicated before storing the mail resource?
>
> From what I understand, the (expensive) resource identification only happens
> when creating new SimpleResource objects, not when setting existing URIs as
> properties. So, simply caching the icons should fix this.
>
> The reason to have those btw is to see nicer results in the KRunner search.
>
>> 1.1.2 Repeated indexing per item
>> 1.1.2.1 Failures to index items
>> * FIXED Cardinality fault on messageHeader
>> http://oscaf.git.sourceforge.net/git/gitweb.cgi?p=oscaf/shared-desktop-ontologies;a=commitdiff;h=4697389c39b7112aaf0f6ac1a36b216e78ab5e14
>> * FIXED Cardinality fault on PIMO:Persons' properties
>> d732592b in kde-runtime/master
>> 1.1.2.2 FIXED Redundant reindexing
>> * kde#289932#c58?
>> 1.1.3 Repeated indexing per collection
>> * FIXED Attempted indexing of collections we cannot index
>> ec4f19eb781514ce0dfc09fe4e9ea4591ecc31e9 in kdepim-runtime/4.8
>> * FIXED Mark each collection on completion with indexing level
>> 2729771b765d0bd6e0e03d0a5b055e36bc48944c in kdepim-runtime/master
>> (does this prevent discovery of items changed while feeder was not running?)
>> 1.1.4 Indexing interferes with other work
>> * FIXED Hide indexing until user is idle kde#289932#c58
>> 1.1.5 Low nominal performance
>> * Eg. 5700 (42MB mbox) kde-core-devel mails in 20 minutes (4.8 items/sec) on
>> Core i7-2620M (4x2.7GHz, HT), idle detection disabled. Not clear what is
>> the bottleneck.  Virtuoso using 80-90% of one core during this.
>> * Akonadi->feeder->dbus->nepomukstorage->virtuoso of all mail negates
>> performance advantage of fast Akonadi protocol
>
> Seeing the huge improvement after Sebastian's changes on the resource
> identification in DMS, I'd guess that this is where most of the time is spent.
> But that's just gut feeling.
>
> If that turns out to be true though, we can probably apply some more clever
> caching for e.g. email addresses (in a typical folder I'd assume some of
> them repeat quite often) to avoid running identification on them over and over
> again. List-Id is another good candidate for that.
>
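
A minimal sketch of that kind of caching, assuming identification can be
modelled as a function from a key (an email address or a List-Id) to a
resource URI. The identify callback and the "res:" URIs are hypothetical, and
normalising keys by lower-casing is an assumption made for the example.

```python
from typing import Callable, Dict


class IdentificationCache:
    """Remember the resource URI for each key so identification runs once."""

    def __init__(self, identify: Callable[[str], str]) -> None:
        # identify is a hypothetical stand-in for the expensive lookup.
        self._identify = identify
        self._cache: Dict[str, str] = {}
        self.misses = 0

    def resource_for(self, key: str) -> str:
        # Assumption: keys differing only in case or padding are the same entity.
        normalised = key.strip().lower()
        uri = self._cache.get(normalised)
        if uri is None:
            self.misses += 1
            uri = self._identify(normalised)
            self._cache[normalised] = uri
        return uri


if __name__ == "__main__":
    cache = IdentificationCache(lambda key: "res:" + key)
    for sender in ["kde-pim@kde.org", "KDE-PIM@kde.org", "someone@example.org"]:
        cache.resource_for(sender)
    print(cache.misses, "identification calls for 3 lookups")
```

Whether this helps in practice depends on how often addresses repeat within a
folder, which would need measuring.
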
>> 2 Ability to utilise indexing work (working search)
>
> New in 4.9: Quick search also does full-text search.
>
> WIP by Till: Composer address auto-completion based on all available Nepomuk
> data.
>
>> 2.1 Search features that fully use indexed data
>> * Indexed: Date, Subject, From, Sender, To, Cc, Bcc, List-Id, Organization,
>> some X-headers, Status flags, Tags, Important, Todo, Watched, Plain text
>> body
>> * Searchable: Age(days), Subject, From, To, Cc, Reply-To, List-Id,
>> Organization, some X-headers, Status flags, Tags, all headers (can this
>> work?), message body
>
> Indexing all headers is possible, but considering they are about 30% of the
> entire mail volume (and would map to many Nepomuk resources), I'm wondering if
> the extra cost is really worth it. For List-Id it would probably be
> interesting to see if NMO knows about the concept of mailing lists (which is
> what you actually want to search for).
>
>> * No way to search by the actual PIMO Persons/Contacts
>> created by indexing, user must input part of name.
>> * No way to search attachments or whether something has an attachment
>>
>> 2.2 Faults in search
>> 2.2.1 Server side
>> * FIXED Truncated query strings cause broken search folders
>
> And broken again, someone apparently managed to go beyond 1024 chars as well.
> Let's remove the limit entirely if possible.
>
>> 2.2.2 Client side
>> * Dialog allows modifying existing search folder by name but fails (modifies
>> remote id)
>> * Possible to create search in search folders; doesn't work
>> 2.2.3 Viewing search results changes search results
>> * search on unread message status, messages disappear from search as message
>> preview makes them read
>> * Just viewing search results causes some messages to disappear from search
>> collection (according to akonadiconsole db browser, itemChanged + reindex at
>> fault?)
>
> Yes, itemChanged currently is handled in the feeder as add/remove. For emails
> this case can be optimized for the common case of flag/tag changes I guess,
> they rarely change content.

[srh] On a related topic, could we please have support for an
ItemCreateOrUpdateJob? The use case is as per an earlier email thread
where I am refreshing a very large collection. Since there is no
reliable way to find "what's changed?" from the backend, for each batch
of items I fetch from the backend, I delete all the Akonadi contents
(based on remoteId) before creating them again. This not only seems
silly, but is presumably significantly more expensive than it need be
too. Payload-based optimisation would be a further gain.
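
To make the request concrete, here is a rough sketch of the semantics I have
in mind, with a plain dictionary keyed by remoteId standing in for the
collection. None of this is existing Akonadi API; create_or_update and the
counters are purely illustrative.

```python
from typing import Dict, Iterable, Tuple


def create_or_update(
    store: Dict[str, bytes],
    batch: Iterable[Tuple[str, bytes]],
) -> Dict[str, int]:
    """Apply one backend batch to the store, keyed by remote id.

    Items with an unknown remote id are created, items whose payload differs
    are updated, and identical items are left alone (the payload-based gain).
    """
    counts = {"created": 0, "updated": 0, "unchanged": 0}
    for remote_id, payload in batch:
        if remote_id not in store:
            store[remote_id] = payload
            counts["created"] += 1
        elif store[remote_id] != payload:
            store[remote_id] = payload
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
    return counts


if __name__ == "__main__":
    collection = {"r1": b"old", "r2": b"same"}
    batch = [("r1", b"new"), ("r2", b"same"), ("r3", b"fresh")]
    print(create_or_update(collection, batch))
```

Compared with delete-then-create, unchanged items cost one comparison and
trigger no change notification at all.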

>> 2.3 Minimising indexing work (assuming there is no/low demand for search, do
>> less expensive indexing)
>> * Change default set of indexed folders
>> * Make it easy to change per folder indexing attribute
>> * Show indexing status, allow attr change directly in folder selector in
>> search dialog.
>> * Indexing all except full text a useful compromise?
>
> Would require some measurement, I'm not sure if the full text part is what
> actually makes it so expensive.
>
>> [1] https://bugs.kde.org/show_bug.cgi?id=289932
>
> regards,
> Volker
>
> _______________________________________________
> KDE PIM mailing list kde-pim at kde.org
> https://mail.kde.org/mailman/listinfo/kde-pim
> KDE PIM home page at http://pim.kde.org/