[Kde-pim] PIM indexing and search - taking stock

Volker Krause vkrause at kde.org
Sun Mar 18 10:12:27 GMT 2012


Thanks for the overview Will, great progress indeed, I'm already getting used 
to the quick search doing full text search as well :)

On Wednesday 14 March 2012 18:05:05 Will Stephenson wrote:
> I'd like to share this list of the issues we face and have dealt with in
> indexing and searching PIM data (mostly mails as their volume creates the
> most difficulty).  You'll see that we have made quite a lot of progress
> since the meeting.  Notice that I've written this list taking a high level
> usefulness of product viewpoint; it's useless having working indexing if
> it's not possible to use the index, additionally, there is no point
> indexing data that will never be searched.
> 
> My priority is to get kdepim 4.8 indexing usable, or at least not a
> liability, so I'll be backporting any fixes below that are not yet in the
> branch.
> 
> I'll put this on the wiki, but please let me know here if you spot any
> inconsistencies or omissions. I'll also make bug reports for the searching
> problems I've identified.
> 
> Will
> 
> 1 Faults in indexing
> 1.1 Performance faults while indexing
> 1.1.1 FIXED Excessive work per item
> * Excessive queries per item kde#289932#c58 [1],
> kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in kde-runtime/4.8)
> * Attachments fetched but not effectively indexed (Volker WIP?)

The problem with attachments is that they are indexed by a helper process 
(nepomukindexer), which needs the final URI of the attachment object. However, 
what we pass in is the temporary _:xxxx URIs that still need to be resolved by 
DMS. StoreResourceJob contains the mapping AFAICT, so it's probably just a 
matter of deferring the indexData() calls until we have the result of that 
job.
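
A rough sketch of that deferral, in Python rather than the feeder's C++ and
with every name made up for illustration (this is not the real feeder or
Nepomuk API): attachments are queued under their temporary URI and only handed
to the indexer once the store job has reported the temporary-to-final mapping.

```python
# Hypothetical sketch: none of these names are the real feeder/Nepomuk API.
class DeferredAttachmentIndexer:
    def __init__(self, index_data):
        # index_data(final_uri, payload) stands in for the indexData() call.
        self._index_data = index_data
        self._pending = []  # (temporary_uri, payload) pairs awaiting resolution

    def queue(self, temporary_uri, payload):
        """Remember an attachment instead of indexing it under a _:xxxx URI."""
        self._pending.append((temporary_uri, payload))

    def on_store_finished(self, mappings):
        """Index everything whose final URI is now known.

        mappings is assumed to be a dict from temporary to final URIs, as
        delivered by the finished store job. Returns the number of
        attachments that are still unresolved.
        """
        unresolved = []
        for temporary_uri, payload in self._pending:
            final_uri = mappings.get(temporary_uri)
            if final_uri is None:
                unresolved.append((temporary_uri, payload))  # keep for later
            else:
                self._index_data(final_uri, payload)
        self._pending = unresolved
        return len(unresolved)
```

The point of the design is only that nothing reaches the helper process with a
_:xxxx URI; how the mapping is actually exposed by the job would need checking.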

> * Setting the same icons on mails, attachments their tags while indexing, is
> this necessary? (No, commented in other feeders - CM) Can they be de-
> duplicated before storing the mail resource?

From what I understand, the (expensive) resource identification only happens 
when creating new SimpleResource objects, not when setting existing URIs as 
properties. So, simply caching the icons should fix this.

The reason to have those btw is to see nicer results in the KRunner search.
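
The icon caching could look roughly like this. It is a Python sketch with
hypothetical names, assuming (as described above) that creating the icon
resource is the expensive step and that reusing its URI afterwards is cheap.

```python
# Hypothetical sketch: create_icon_resource is a stand-in, not a real API.
class IconCache:
    """Create each icon resource once and hand out its URI afterwards."""

    def __init__(self, create_icon_resource):
        # create_icon_resource(icon_name) -> uri represents the expensive
        # path, i.e. creating a new resource that has to be identified.
        self._create = create_icon_resource
        self._uris = {}

    def uri_for(self, icon_name):
        uri = self._uris.get(icon_name)
        if uri is None:
            uri = self._create(icon_name)
            self._uris[icon_name] = uri
        return uri
```

Each mail, attachment or tag would then set uri_for("some-icon-name") as a
property value instead of creating a fresh icon resource every time.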

> 1.1.2 Repeated indexing per item
> 1.1.2.1 Failures to index items
> * FIXED Cardinality fault on messageHeader
> http://oscaf.git.sourceforge.net/git/gitweb.cgi?p=oscaf/shared-desktop-ontologies;a=commitdiff;h=4697389c39b7112aaf0f6ac1a36b216e78ab5e14
> * FIXED Cardinality fault on PIMO:Persons' properties
> d732592b in kde-runtime/master
> 1.1.2.2 FIXED Redundant reindexing
> * kde#289932#c58?
> 1.1.3 Repeated indexing per collection
> * FIXED Attempted indexing of collections we cannot index
> ec4f19eb781514ce0dfc09fe4e9ea4591ecc31e9 in kdepim-runtime/4.8
> * FIXED Mark each collection on completion with indexing level
> 2729771b765d0bd6e0e03d0a5b055e36bc48944c in kdepim-runtime/master
> (does this prevent discovery of items changed while feeder was not running?)
> 1.1.4 Indexing interferes with other work
> * FIXED Hide indexing until user is idle kde#289932#c58
> 1.1.5 Low nominal performance
> * Eg. 5700 (42MB mbox) kde-core-devel mails in 20 minutes (4.8 items/sec) on
> Core i7-2620M (4x2.7GHz, HT), idle detection disabled. Not clear what is
> the bottleneck.  Virtuoso using 80-90% of one core during this.
> * Akonadi->feeder->dbus->nepomukstorage->virtuoso of all mail negates
> performance advantage of fast Akonadi protocol

Seeing the huge improvement after Sebastian's changes on the resource 
identification in DMS, I'd guess that this is where most of the time is spent. 
But that's just gut feeling.

If that turns out to be true though, we can probably apply some more clever 
caching for e.g. email addresses (in a typical folder I'd assume some of 
them repeat quite often) to avoid running identification on them over and over 
again. List-Id is another good candidate for that.
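
To illustrate the caching idea, here is a Python sketch under the assumption
that identification can be modelled as one expensive function call. The
identify callable and the "email"/"list-id" kinds are hypothetical; only
functools.lru_cache is a real library call.

```python
from functools import lru_cache


def make_cached_identifier(identify, maxsize=4096):
    """Wrap an expensive identify(kind, value) -> uri call in a bounded cache.

    identify is a stand-in for whatever runs the resource identification;
    repeated (kind, value) pairs are answered from memory instead.
    """
    @lru_cache(maxsize=maxsize)
    def cached(kind, value):
        return identify(kind, value)

    return cached
```

Used per folder or per feeder run, something like cached("email", address) or
cached("list-id", list_id) would hit the expensive path only once per distinct
value, and cached.cache_info() shows how often values actually repeat, which
would also tell us whether the guess about repetition holds.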

> 2 Ability to utilise indexing work (working search)

New in 4.9: Quick search also does full-text search.

WIP by Till: Composer address auto-completion based on all available Nepomuk 
data.

> 2.1 Search features that fully use indexed data
> * Indexed: Date, Subject, From, Sender, To, Cc, Bcc, List-Id, Organization,
> some X-headers, Status flags, Tags, Important, Todo, Watched, Plain text
> body
> Searchable: Age(days), Subject, From, To, Cc, Reply-To, List-Id,
> Organization, some X-headers, Status flags, Tags, all headers (can this
> work?), message body

Indexing all headers is possible, but considering they are about 30% of the 
entire mail volume (and would map to many Nepomuk resources), I'm wondering if 
the extra cost is really worth it. For List-Id it would probably be 
interesting to see if NMO knows about the concept of mailing lists (which is 
what you actually want to search for).

> * No way to search by the actual PIMO Persons/Contacts
> created by indexing, user must input part of name.
> * No way to search attachments or whether something has an attachment
> 
> 2.2 Faults in search
> 2.2.1 Server side
> * FIXED Truncated query strings cause broken search folders

And broken again, someone apparently managed to go beyond 1024 chars as well. 
Let's remove the limit entirely if possible.

> 2.2.2 Client side
> * Dialog allows modifying existing search folder by name but fails (modifies
> remote id)
> * Possible to create search in search folders; doesn't work
> 2.2.3 Viewing search results changes search results
> * search on unread message status, messages disappear from search as message
> preview makes them read
> * Just viewing search results causes some messages to disappear from search
> collection (according to akonadiconsole db browser, itemChanged + reindex at
> fault?)

Yes, itemChanged is currently handled in the feeder as add/remove. For emails 
this can be optimized for the common case of flag/tag changes I guess, since 
they rarely change content.
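
A sketch of what that optimization might look like, again in Python with
invented names (the part names "FLAGS" and "TAGS" and the index interface are
assumptions, not the real Akonadi or feeder API): if only flag-like parts
changed, update them in place, otherwise fall back to the current remove/add.

```python
# Hypothetical sketch; class, method and part names are illustrative only.
class RecordingIndex:
    """Stand-in for the real index that just records what was asked of it."""

    def __init__(self):
        self.operations = []

    def add(self, item_id):
        self.operations.append(("add", item_id))

    def remove(self, item_id):
        self.operations.append(("remove", item_id))

    def update_flags(self, item_id, flags):
        self.operations.append(("update_flags", item_id, tuple(sorted(flags))))


FLAG_LIKE_PARTS = frozenset({"FLAGS", "TAGS"})  # assumed part names


def handle_item_changed(index, item_id, changed_parts, flags):
    """Cheap in-place update for flag/tag changes, full reindex otherwise."""
    changed = set(changed_parts)
    if changed and changed <= FLAG_LIKE_PARTS:
        index.update_flags(item_id, flags)
    else:
        # Content changed or we do not know what changed: current behaviour.
        index.remove(item_id)
        index.add(item_id)
```

With the in-place path the item is never removed from the index, which
presumably would also avoid mails dropping out of a search collection just
because viewing them marked them as read.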

> 2.3 Minimising indexing work (assuming there is no/low demand for search, do
> less expensive indexing)
> * Change default set of indexed folders
> * Make it easy to change per folder indexing attribute
> * Show indexing status, allow attr change directly in folder selector in
> search dialog.
> * Indexing all except full text a useful compromise?

Would require some measurement, I'm not sure if the full text part is what 
actually makes it so expensive.

> [1] https://bugs.kde.org/show_bug.cgi?id=289932

regards,
Volker