[Kde-pim] [Discussion] Search in Akonadi

Tobias Koenig tokoe at kde.org
Thu Aug 20 09:33:52 BST 2009


Hej Pimsters,

I'd like to start a discussion about search in Akonadi. Currently, searching for
items with specific criteria in Akonadi can be done in two ways:

  1) Loading everything into a model, iterating over the model and filtering out
     everything you don't need -> performance and memory problems

  2) Having the search implemented as a separate engine, which returns only the
     Akonadi UIDs of the items that match -> sounds perfect

I guess we all support the second approach when it comes down to implementing the
search functionality, and with Nepomuk as search engine we have already taken the first
step in this direction. For those who haven't looked into the search infrastructure yet,
here is a short explanation:

We have so-called feeder agents for every data type (contact, event, mail), which listen for
changes to the items in Akonadi; whenever an item is added, changed or removed, they update the
search engine's information about the changed data. If a client application wants
to search for an item by a criterion (e.g. a contact by email address), it sends a SPARQL
query to the Akonadi server (via Akonadi::ItemSearchJob), which passes the query on
to the search engine (currently to Nepomuk via a DBus interface) and then returns the
matching UIDs.

As already mentioned, we use Nepomuk for the search in the current trunk version, and while
I was porting some of the old address book code to native Akonadi code, I had to replace
old search code with new search code as well, which made me realize that search is not
only a 'nice to have' but an essential part of our PIM infrastructure that is needed for
stuff as simple as sending a mail. So we have to make sure that search is as rock stable
as the rest of the Akonadi server. And here we come back to Nepomuk... Don't get me wrong,
I really like Nepomuk and I see big potential in using it in the PIM applications; however,
relying on it as part of the Akonadi infrastructure has some serious drawbacks:

  1) If users don't have a working Nepomuk installation with the Sesame2 backend, they will
     have either a damn slow search (if the Redland backend is used) or no working
     search at all -> half of Akonadi doesn't work

  2) Some users refuse to install Java (needed by Sesame2) for economic reasons
     (disk space on embedded devices) or political ones (why should I install Java on a C++ desktop?)

  3) The data stored by the Nepomuk engine can easily get out of sync with the Akonadi data,
     since they are held in two places (Akonadi's MySQL database and the Nepomuk repository).
     That happens quite easily if you start a KDE PIM application outside a KDE desktop and
     therefore the Nepomuk services are not running -> the search engine data can't be updated via DBus

  4) The SPARQL language and the infrastructure behind it are too powerful for the basic search
     that we need to ensure Akonadi is working. The essential queries are not
        'Return all emails that have been sent by a manager who has a secretary aged <= 50 years'
     but rather
        'Give me the Akonadi uid of the contact item that has the email address foo@bar.org'
     or
        'Give me the Akonadi uid of the distribution list with the name my-family-contacts'
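
To make the contrast concrete: with the AkonadiSearch table I propose further below, those two
essential lookups would boil down to plain SQL statements like these (the 'distribution_list'
type identifier is just an example name, like the rest of this sketch):

  SELECT item_id FROM AkonadiSearch WHERE type_identifier = 'contact'
                                      AND field_identifier = 'email'
                                      AND value = 'foo@bar.org'

  SELECT item_id FROM AkonadiSearch WHERE type_identifier = 'distribution_list'
                                      AND field_identifier = 'name'
                                      AND value = 'my-family-contacts'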

So to make Akonadi rock stable we don't need a semantic search but a _fast_ and _reliable_
internal search!

Of course I wouldn't have written this mail without having made up my mind about how to solve
this problem ;) To make one thing clear, I do _not_ want to get rid of Nepomuk! It should still
be available and fed with the PIM data as it is currently. However, we need an additional,
reliable search that is used to ensure the basic working of Akonadi!

Here comes a first rough idea of how we could implement such a search:

  1) We do not want to depend on external search engines, to keep dependencies
     small and to ensure a working system without additional configuration by the user
      => let's use the search engine that is already available anyway: MySQL

  2) The search data should be duplicated as little as possible, and where it has to be
     duplicated, it should be kept in sync with the real data automatically

  3) The search language should be easy and only as powerful as necessary

=> The search service should be an integral part of the Akonadi server, or a separate process
   that is _always_ started and controlled by the Akonadi server.

We could have an additional table in the MySQL database:

  CREATE TABLE AkonadiSearch (
    type_identifier       VARCHAR(100),   -- a data type identifier, e.g. 'contact'
    item_id               BIGINT,         -- the item id the search entry belongs to
    field_identifier      VARCHAR(100),   -- an identifier of the type-specific search field, e.g. 'email'
    value                 VARCHAR(100)    -- the value of the search field, e.g. 'foo@bar.org'
  )

The feeder agents would now feed the data that should be searchable into this table.
But be aware, only the basic data should be fed in here, no full-text index etc... only the
basic stuff!!!
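
To sketch what that means in practice, assuming the schema above, the contact feeder might
insert rows like these for a single contact item (the item id and the values are made up):

  INSERT INTO AkonadiSearch (type_identifier, item_id, field_identifier, value)
    VALUES ('contact', 42, 'name', 'Tobias Koenig')

  INSERT INTO AkonadiSearch (type_identifier, item_id, field_identifier, value)
    VALUES ('contact', 42, 'email', 'foo@bar.org')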

If an item is removed from the system, the Akonadi server can directly delete the entries in
the AkonadiSearch table together with the entries from the PimItems table, which will ensure
that no stale entries exist. A client application can send SQL queries to the server
(I guess there are more developers who understand SQL than SPARQL ;)) to retrieve the item
uid for a matching search field.

  SELECT item_id FROM AkonadiSearch WHERE type_identifier = 'contact'
                                      AND field_identifier = 'name'
                                      AND value LIKE 'To%'

As you can see, with the '%' wildcard we can even express simple startsWith, endsWith or
contains matches.
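
Just to spell out the pattern variants, all of the following would match a 'name' value
like 'Tobias Koenig':

  value LIKE 'To%'     -- startsWith 'To'
  value LIKE '%nig'    -- endsWith 'nig'
  value LIKE '%bia%'   -- contains 'bia'
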
It might also be worth having additional columns 'value_num' and 'value_date' that store
numeric or date values, so we could use the following:

  SELECT item_id FROM AkonadiSearch WHERE type_identifier = 'event'
                                      AND field_identifier = 'start_date'
                                      AND value_date >= '2009-08-20'
                                      AND value_date <= '2009-08-25'
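
With those columns, the table sketch from above could be extended roughly like this (again
just a sketch; the typed columns would only be filled where they make sense for the field):

  CREATE TABLE AkonadiSearch (
    type_identifier       VARCHAR(100),   -- a data type identifier, e.g. 'event'
    item_id               BIGINT,         -- the item id the search entry belongs to
    field_identifier      VARCHAR(100),   -- an identifier of the type-specific search field
    value                 VARCHAR(100),   -- the string value, if any
    value_num             BIGINT,         -- the numeric value, if any
    value_date            DATETIME        -- the date value, if any
  )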

A further advantage: we can fine-tune the indexes of that table to make the most common searches
as fast as possible.
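
Since all the lookups shown above filter on the type and field identifiers first, a combined
index along these lines (the index name is made up, and the exact column lengths or prefix
lengths would need tuning) should cover the common queries:

  CREATE INDEX AkonadiSearchLookupIndex
    ON AkonadiSearch (type_identifier, field_identifier, value)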

So what are your ideas? Comments?

Ciao,
Tobias
-- 
Separate politics from religion and economy!