[Kde-pim] PIM indexing and search - taking stock

Wed Mar 14 21:00:04 GMT 2012

On Wednesday 14 Mar 2012 21:28:32 Ingo Klöcker wrote:
> On Wednesday 14 March 2012, Will Stephenson wrote:
> > 1 Faults in indexing
> > 1.1 Performance faults while indexing
> > 1.1.1 FIXED Excessive work per item
> > * Excessive queries per item kde#289932#c58 [1],
> > kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in
> > kde-runtime/4.8)
> > * Attachments fetched but not effectively indexed
> > (Volker WIP?)
> > * Setting the same icons on mails, attachments their
> > tags while indexing, is this necessary? (No, commented in other
> > feeders - CM) Can they be de-duplicated before storing the mail
> > resource?
> 
> I do not understand what you are trying to say.

About the icons? It's an example of doing too much indexing work per mail.
The Nepomuk feeder creates an RDF graph representing the mail it is indexing,
and adds several icon names to the graph ("internet-mail", ...). These are
resources in their own right and, like any other resource, have to be merged
Nepomuk-side: any existing equivalent resource must be identified and the new
one replaced with a reference to it.  Since many of Sebastian's optimisations
target resource identification, it seems sensible not to load that system
unnecessarily.
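
To make the de-duplication idea concrete, here is a minimal sketch in Python.
It is not the feeder's real code or the Nepomuk API: the function name, the
URI scheme, the "hasIcon"/"hasAttachment" predicates and the
"mail-attachment" icon name are all assumptions made for illustration. Only
"internet-mail" comes from the description above. The point is that an icon
seen before is reused by reference, so it never has to go through resource
identification again.

```python
# Hypothetical sketch: a toy graph builder, not the Nepomuk feeder's API.

def build_mail_graph(mail_id, attachments, icon_cache):
    """Return (triples, new_icons) for one mail.

    icon_cache maps an icon name to the URI of an icon resource that has
    already been created, so repeated icons are referenced, not re-created.
    """
    triples = []
    new_icons = []  # icon resources that still need identification/merging

    def icon_ref(name):
        uri = icon_cache.get(name)
        if uri is None:
            uri = f"icon:{name}"      # assumed URI scheme
            icon_cache[name] = uri
            new_icons.append(uri)
        return uri

    mail_uri = f"mail:{mail_id}"
    triples.append((mail_uri, "hasIcon", icon_ref("internet-mail")))
    for index, _filename in enumerate(attachments):
        att_uri = f"{mail_uri}/attachment/{index}"
        triples.append((mail_uri, "hasAttachment", att_uri))
        triples.append((att_uri, "hasIcon", icon_ref("mail-attachment")))
    return triples, new_icons


cache = {}
_, first = build_mail_graph(1, ["a.pdf", "b.png"], cache)
_, second = build_mail_graph(2, ["c.txt"], cache)
print(len(first), len(second))  # prints "2 0"
```

With a cache shared across the indexing run, only the first mail contributes
new icon resources. Every later mail just points at them.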

If it's not just the icons that you don't get, then I have a problem.

<snip>

> > 1.1.5 Low nominal performance
> > * Eg. 5700 (42MB mbox) kde-core-devel mails in 20 minutes (4.8
> > items/sec) on Core i7-2620M (4x2.7GHz, HT), idle detection disabled.
> > Not clear what is the bottleneck.  Virtuoso using 80-90% of one core
> > during this.
> 
> Sounds like Virtuoso is doing stuff it probably shouldn't (need to) do.
> Do you have any idea what it is doing? Anything in the logs?

I honestly don't have enough insight into it yet to know whether this is a
real problem (it just feels very slow) and, if so, whether it lies in the
work we are asking Virtuoso to do, the three IPC hops, or the feeder (which
currently processes items serially).  One problem with figuring out exactly
what Virtuoso is doing, to check that it is correct and minimal, is that the
knowledge of what it should do is spread across Sebastian's and Vishesh's
heads but not the PIM team's.

> At work we did have serious problems with Postgres. In the end it turned
> out that it was our own fault. We were using random UUIDs instead of
> sequential UUIDs. Obviously, there is an index on the UUIDs. Since the
> UUIDs were random inserting lots of items made Postgres re-order the
> complete index B-tree all of the time. A colleague finally found the
> root cause of our problems by simulating our usage of the database.

Ignacio Serantes on the nepomuk-kde list has written a mail suggesting that
the Nepomuk Query API generates query strings that are excessively complex and
create overhead for the Virtuoso query compiler (SPARQL queries are compiled
to SQL internally, apparently).  But I'm not in a position to verify this.

> Of course, the (non-)sequential UUIDs are just a shot in the dark
> because Virtuoso is a completely different beast than Postgres. But
> unless we can get any useful logging/debugging information from Virtuoso
> we should try to get down to the actual problem with simulations.

We do need to improve our Virtuoso debugging skills - at the moment the state 
of the art is to ask it what its status is on the command line, which only 
helps to trap slow queries, but not to profile the work we are sending its 
way.
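
As an illustration of the kind of simulation Ingo describes for the Postgres
case, here is a small Python sketch. It uses a sorted list as a crude
stand-in for an ordered index and counts how many existing entries have to
move for each insert. This is an assumption-laden toy: a real B-tree splits
pages rather than shifting entries, and nothing here says anything about how
Virtuoso behaves. It only shows why key order can matter for an ordered
index.

```python
import bisect
import random


def elements_shifted(keys):
    """Insert keys one by one into a sorted list.

    Returns the total number of existing entries that sit after each
    insertion point, i.e. the entries that must move to make room.
    """
    index = []
    shifted = 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        shifted += len(index) - pos
        index.insert(pos, key)
    return shifted


n = 2000
sequential = list(range(n))          # stands in for sequential IDs
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)  # stands in for random IDs

print(elements_shifted(sequential))  # prints 0: every insert is an append
print(elements_shifted(shuffled))    # much larger: inserts land everywhere
```

A comparable experiment against the real store would replay a recorded
indexing workload and vary one factor at a time.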

> > * Akonadi->feeder->dbus->nepomukstorage->virtuoso of all mail negates
> > performance advantage of fast Akonadi protocol

<snip>
> > 
> > 2.3 Minimising indexing work (assuming there is no/low demand for
> > search, do less expensive indexing)
> > * Change default set of indexed folders
> > * Make it easy to change per folder indexing attribute
> > * Show indexing status, allow attr change directly in folder selector
> > in search dialog.
> > * Indexing all except full text a useful compromise?
> 
> My day-to-day experience with Thunderbird shows that I'm mostly
> searching by
> - Subject
> - Sender
> - Recipient
> - Date
> 
> A couple of times I used full text search when I was unable to guess
> Subject/Sender/Recipient of the message I was looking for.
> 
> I should note that by far most of the time I'm locating messages by
> looking into the folder where I think I put the message and then using
> the quick filter. So, I'm probably not representative for Joe Average,
> but I am probably representative for Jim "I have 200 folders and sort my
> 100+ K mails (mostly) by hand" Poweruser.

Yes, I work this way too, and our current indexing strategy bites this kind of 
user hard. I suggest only indexing inbox and sent-mail folders, or all except 
mailing list folders, but making it easy to select a folder for indexing (with 
nice indexing status display on the folder selector, and a notification when a 
manually-added folder is indexed and ready to search).
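
The two default policies suggested above, plus a per-folder override, could
be sketched roughly as follows. This is Python for illustration only: the
Folder class, the policy names and the folder names are hypothetical and do
not correspond to any existing Akonadi or KMail attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folder:
    name: str
    is_mailing_list: bool = False
    # Explicit user choice for this folder; None means "use the default".
    index_override: Optional[bool] = None


DEFAULT_INDEXED = {"inbox", "sent-mail"}  # assumed folder names


def should_index(folder, policy="inbox-and-sent"):
    """Decide whether a folder gets indexed under the given default policy."""
    if folder.index_override is not None:
        return folder.index_override          # the user's choice always wins
    if policy == "inbox-and-sent":
        return folder.name in DEFAULT_INDEXED
    if policy == "all-but-lists":
        return not folder.is_mailing_list
    raise ValueError(f"unknown policy: {policy}")


folders = [
    Folder("inbox"),
    Folder("kde-core-devel", is_mailing_list=True),
    Folder("projects", index_override=True),  # manually added by the user
]
print([f.name for f in folders if should_index(f)])
# prints ['inbox', 'projects']
```

Keeping the override separate from the default policy means the default can
change later without touching folders the user has already chosen.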

Will

_______________________________________________
KDE PIM mailing list kde-pim at kde.org
https://mail.kde.org/mailman/listinfo/kde-pim
KDE PIM home page at http://pim.kde.org/