[Kde-pim] Nepomuk Feeder Improvements

Vishesh Handa me at vhanda.in
Fri Oct 26 13:20:11 BST 2012


Hey everyone

There are generally two major causes of virtuoso going mad because of PIM:
the Nepomuk Feeders and the Queries. I've been looking at both of them.
This email concerns the feeders.

Here is the list of the major problems with the feeder. Fixing them should
result in significant performance boosts --

*1. Cache data in Nepomuk Feeder Utils*

The NepomukFeederUtils class is used to easily add data to the
SimpleResourceGraph. It does so by creating a new SimpleResource with the
required properties. This SimpleResource when sent to Nepomuk is mapped to
the existing resource based on the properties. This process of mapping
(also called Resource Identification) is not particularly cheap. Its
results should be cached.

The StoreResources job, which is primarily used for pushing data into
Nepomuk, returns the results of these mappings in a hash. Those results
*should* be cached. This is especially important for contacts: the resource
identification process for contacts is slow because of the large number of
contacts.

The caching code will be something like this -

Nepomuk2::SimpleResource addContact( .. ) {
      // Check cache, otherwise -

      // Create the SimpleResource

      // Insert into cache
      // QHash< KMime::Types::MailBox, QUrl > m_contactCache;  (final cache)
      // QHash< KMime::Types::MailBox, QUrl > m_tempCache;     (not yet identified)
      m_tempCache.insert( mbox, simpleResource.uri() );

      // The uri here is temporary; it is a random uri of the form _:dafsdf.
      // When the resource identification process ends, a QHash mapping each
      // temporary uri to its final value is returned.
      // That hash should be used to convert the tempCache data into the
      // final cache.
      // If this is not clear, please talk to me.
}

Something similar should be done for tags and icons as well.
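
A rough sketch of the two-stage cache described above, using only standard
C++ so that it compiles on its own. Everything in it is an assumption made
for illustration: ContactCache, remember, resolve and lookup are made-up
names, std::string stands in for both KMime::Types::MailBox and QUrl, and
the mappings argument plays the role of the hash returned by the
StoreResources job.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-ins: std::string instead of KMime::Types::MailBox
// (the key) and instead of QUrl (the uris).
class ContactCache {
public:
    // Returns the final uri for this mailbox, or an empty string if the
    // contact has not been identified yet.
    std::string lookup(const std::string& mbox) const {
        auto it = m_contactCache.find(mbox);
        return it == m_contactCache.end() ? std::string() : it->second;
    }

    // Remember the temporary (_:xyz) uri of a freshly created resource.
    void remember(const std::string& mbox, const std::string& tempUri) {
        m_tempCache[mbox] = tempUri;
    }

    // Call once the store job has finished. 'mappings' maps each temporary
    // uri to the final uri chosen by resource identification.
    void resolve(const std::unordered_map<std::string, std::string>& mappings) {
        for (auto it = m_tempCache.begin(); it != m_tempCache.end();) {
            auto found = mappings.find(it->second);
            if (found != mappings.end()) {
                m_contactCache[it->first] = found->second;
                it = m_tempCache.erase(it);
            } else {
                ++it;  // not identified by this job, keep it for the next one
            }
        }
    }

private:
    std::unordered_map<std::string, std::string> m_contactCache;  // mbox -> final uri
    std::unordered_map<std::string, std::string> m_tempCache;     // mbox -> temporary uri
};
```

In this sketch addContact would call lookup first and only create a new
SimpleResource (followed by remember) on a miss; resolve runs when the job
reports its mappings.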

*Email Feeder* - Cache the email headers. I know this sounds slightly
stupid, but each email header is stored as a separate resource (think
object), with its header name and value. Since a large number of headers
are shared between emails (mailing list, list id, etc.), their uris could
be cached.
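
As an illustration of the header cache, here is a minimal sketch keyed on
the (header name, value) pair. HeaderCache and uriFor are hypothetical
names, the uris are plain strings, and the create callback stands in for
whatever actually produces the resource. It also assumes the callback
returns a final uri; with temporary uris the same two-stage handling as for
contacts would be needed.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch: one cached uri per distinct (header name, value) pair.
class HeaderCache {
public:
    // Returns the cached uri, or calls 'create' once on a miss and caches
    // whatever it returns.
    template <typename CreateFn>
    std::string uriFor(const std::string& name, const std::string& value,
                       CreateFn create) {
        const std::pair<std::string, std::string> key(name, value);
        auto it = m_cache.find(key);
        if (it != m_cache.end())
            return it->second;
        std::string uri = create(name, value);
        m_cache.emplace(key, uri);
        return uri;
    }

private:
    std::map<std::pair<std::string, std::string>, std::string> m_cache;
};
```

With a cache like this, a folder full of mail from one mailing list creates
each shared header resource once instead of once per email.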

Based on the code this should yield a good 50-70% increase in speed, and
virtuoso's cpu consumption should go down drastically.

*2. Avoid using StoreResources flags*

The StoreResources flags were created specifically for the bugs in Strigi.
In general they double the number of queries required in order to push data
into Nepomuk. In case it wasn't obvious, the number of queries is already a
lot more than I would like.

Considering that you're never overwriting data in Nepomuk, I don't think
you really need the OverwriteProperties flag. See (3).

*3. Properly reindex the data*

When some data in Akonadi changes, the proper way of reindexing it is to
first remove the old data via Nepomuk2::removeDataByApplication( .. ), and
then push the new data via storeResources. People often use the
OverwriteProperties flag to just overwrite parts of the data. That is not
correct.

You should be removing the invalid data before pushing the new data.
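
To show why remove-then-push matters, here is a toy model in plain C++.
None of it is Nepomuk API: FakeStore, removeData, store and reindex are
made-up names, and the store is just a map of property names to values per
item. It only demonstrates that writing the new properties on top of the
old ones leaves behind properties that no longer exist in the new data.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory stand-in: property name -> value for one item.
using Properties = std::map<std::string, std::string>;

struct FakeStore {
    std::map<std::string, Properties> data;  // item uri -> its properties

    // Stand-in for Nepomuk2::removeDataByApplication( .. ).
    void removeData(const std::string& uri) { data.erase(uri); }

    // Stand-in for pushing data: sets the given properties and leaves
    // every other existing property untouched.
    void store(const std::string& uri, const Properties& props) {
        for (const auto& p : props)
            data[uri][p.first] = p.second;
    }
};

// Reindex = drop everything previously stored for the item, then push the
// fresh data, so nothing stale can survive.
inline void reindex(FakeStore& db, const std::string& uri,
                    const Properties& fresh) {
    db.removeData(uri);
    db.store(uri, fresh);
}
```

If an item first had a subject and a cc property and the new version only
has a subject, calling store alone keeps the old cc around, while reindex
removes it.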

*4. Email Updates -*

Looking at the code, as far as I understand, every time an Akonadi::Item is
changed, it is re-indexed. While this makes a lot of sense, I think it
also means that when an email's state changes from unread to read, it will
be reindexed. I hope that is not the case and that I've overlooked
something. But if that is the case, removing all existing data and then
pushing it again seems like major overkill.

It would be better to know exactly what can change when an email is
updated (I can't imagine it would be a lot) and to update only that
information. If it is just a couple of properties, using a setProperty call
might be better, and even faster. I'll need more information about what
can change in an email.

I think it can just be status - read/unread, and tags?
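
If that guess is right, the feeder could decide per change whether a full
reindex is needed at all. The sketch below is only an illustration under
that assumption: EmailSnapshot, UpdateKind and classifyChange are made-up
names, the flag strings are arbitrary, and which fields can really change
is exactly the open question above.

```cpp
#include <set>
#include <string>

// Hypothetical, simplified view of an email item before and after a change.
struct EmailSnapshot {
    std::string payload;          // headers + body
    std::set<std::string> flags;  // e.g. "seen"
    std::set<std::string> tags;
};

enum class UpdateKind { Nothing, PropertiesOnly, FullReindex };

// Decide how much work a change needs: a changed payload means remove and
// push everything again, a changed flag or tag only needs a property update.
inline UpdateKind classifyChange(const EmailSnapshot& before,
                                 const EmailSnapshot& after) {
    if (before.payload != after.payload)
        return UpdateKind::FullReindex;
    if (before.flags != after.flags || before.tags != after.tags)
        return UpdateKind::PropertiesOnly;
    return UpdateKind::Nothing;
}
```

Marking a mail as read would then come out as PropertiesOnly and could be
handled with a setProperty call instead of removing and re-pushing the
whole email.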

--------

I don't mind reviewing the patches as they come in. So if anyone
(Christian?) is actually going to make all these changes, please add me as
the reviewer. I have also joined the pim mailing list, so I shouldn't miss
them.

-- 
Vishesh Handa