[Kde-pim] [Discussion] Search in Akonadi

Volker Krause vkrause at kde.org
Thu Aug 20 10:59:45 BST 2009


Hi,

thanks for taking this topic on Tobias, I agree that it is essential for 
Akonadi.

On Thursday 20 August 2009 10:33:52 Tobias Koenig wrote:
> I'd like to start a discussion about search in Akonadi. Currently searching
> for items with specific criteria in Akonadi can be done in two ways:
>
>   1) Loading everything into a model, iterating over the model, filtering
> out everything you don't need -> performance and memory problems

Sure, but it's not worse than before, which means it could be an acceptable 
intermediate solution until the search problem has been solved for real.

>   2) Having the search implemented as a separate engine, which returns only
> the Akonadi UIDs of the items that match -> sounds perfect

Yep, although I don't really like the UID-list interface. As a developer you 
don't need UIDs, you need Items. So, I'd rather suggest an interface similar 
to ItemFetchJob which can be configured using ItemFetchScope and returns as 
much payload data as you need for your current task. This also saves you some 
additional roundtrips to the Akonadi server.
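
Something roughly like this is what I have in mind. A minimal sketch only: the
fetchScope() on the search job is hypothetical at this point (it would behave
like the one on ItemFetchJob), and class names like SearchClient are made up:

  #include <Akonadi/ItemSearchJob>
  #include <Akonadi/ItemFetchScope>
  #include <Akonadi/Item>
  #include <KABC/Addressee>

  void SearchClient::startSearch( const QString &query )
  {
    Akonadi::ItemSearchJob *job = new Akonadi::ItemSearchJob( query, this );
    job->fetchScope().fetchFullPayload();  // only request what you really need
    connect( job, SIGNAL( result( KJob* ) ), SLOT( searchDone( KJob* ) ) );
  }

  void SearchClient::searchDone( KJob *job )
  {
    if ( job->error() )
      return;
    Akonadi::ItemSearchJob *search = static_cast<Akonadi::ItemSearchJob*>( job );
    foreach ( const Akonadi::Item &item, search->items() ) {
      // the payload is already there, no additional fetch/roundtrip needed
      const KABC::Addressee contact = item.payload<KABC::Addressee>();
    }
  }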

> I guess we all support the second approach when it comes down to
> implementing the search functionality, and with Nepomuk as the search engine
> we already took the first step in this direction. For those who haven't
> looked into the search infrastructure yet, here is a short explanation:
>
> We have so-called feeder agents for every data type (contact, event, mail),
> which listen for changes to the items in Akonadi, and whenever an item is
> added, changed or removed, they update the search engine's information about
> the changed data. If a client application wants to search for an item by
> a criterion (e.g. a contact by email address), it sends a SPARQL query to
> the Akonadi server (via Akonadi::ItemSearchJob), which passes the query
> on to the search engine (currently to Nepomuk via a DBus interface) and
> then returns the matching UIDs.

Just for completeness, there is a second way, namely creating a virtual 
collection based on a SPARQL query. The interface to access the results is 
identical to "normal" retrieval in this case, that is ItemFetchJob + Monitor.
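
A rough sketch of that second way, assuming the current SearchCreateJob
interface (details from memory; SearchClient is again a made-up client class):

  #include <Akonadi/SearchCreateJob>
  #include <Akonadi/Collection>
  #include <Akonadi/ItemFetchJob>
  #include <Akonadi/Monitor>

  void SearchClient::createSearch( const QString &sparqlQuery )
  {
    // create a persistent virtual collection backed by the query ...
    Akonadi::SearchCreateJob *create =
        new Akonadi::SearchCreateJob( "my search", sparqlQuery, this );
    connect( create, SIGNAL( result( KJob* ) ), SLOT( searchCreated( KJob* ) ) );
  }

  // ... and once it exists, treat it like any other collection
  void SearchClient::searchCreated( KJob *job )
  {
    if ( job->error() )
      return;
    const Akonadi::Collection col =
        static_cast<Akonadi::SearchCreateJob*>( job )->createdCollection();

    new Akonadi::ItemFetchJob( col, this );      // normal retrieval ...

    Akonadi::Monitor *monitor = new Akonadi::Monitor( this );
    monitor->setCollectionMonitored( col );      // ... plus change notifications
  }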

> As already mentioned we use Nepomuk for the search in the current trunk
> version, and while I was porting some of the old address book code to
> native Akonadi code, I had to replace old search code with new search code
> as well, which made me realize that search is not only a 'nice to have' but
> an essential part of our PIM infrastructure that is needed for such simple
> stuff as sending a mail. So we have to make sure that search is as rock
> stable as the rest of the Akonadi server. And here we come back to
> Nepomuk... Don't get me wrong, I really like Nepomuk and I see big
> potential in using it in the PIM applications, however relying on it as
> part of the Akonadi infrastructure has some serious drawbacks:
>
>   1) If users don't have a working Nepomuk installation with the Sesame2
> backend, they will either have a damn slow search (if the Redland backend is
> used) or no working search at all -> half of Akonadi doesn't work

I agree that the setup problems have to be taken seriously, we still have some 
for the database part. OTOH nothing will change there if we don't push it. 
Sure, that's painful for everyone involved, but it will eventually get us 
there. At least this time we are not alone with this problem; as you could 
see in a recent k-c-d thread, Plasma is considering making Nepomuk mandatory 
as well.

>   2) Some users refuse to install Java (needed by Sesame2) for economic
> reasons (disk space on embedded devices)

If your embedded device has a problem with Java, you'd likely not want 
Akonadi/MySQL either.

> or political ones (Why should I install Java on a C++ Desktop?)

Same as with MySQL, the answer is very simple: it's currently the best (or 
only) available option that actually works.

>   3) The data stored by the Nepomuk engine can easily get out of sync with
> the Akonadi data since they are held in two places (Akonadi's MySQL
> database and the Nepomuk repository). That happens quite easily if you start
> a KDEPIM application outside a KDE desktop and therefore the Nepomuk
> services are not running -> search engine data can't be updated via DBus

That's something that is fixable I think and has to be fixed anyway.

>   4) The SPARQL language and the infrastructure behind it are too powerful
> for the basic search that we need to ensure Akonadi is working. The
> essential queries are not 'Return all emails that have been sent by a
> manager who has a secretary with an age <= 50 years' but rather
>         'Give me the Akonadi uid of the contact item that has the email
> address foo at bar.org' or
>         'Give me the Akonadi uid of the distribution list with the name
> my-family-contacts'

SPARQL alone will certainly not cover all our needs, I agree. Mostly because 
we eventually want query translation for searches on backends (IMAP, LDAP), 
which is extremely difficult with SPARQL, if possible at all. So a simplified 
query language (or a corresponding subset of SPARQL) is needed anyway.
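
To illustrate the mismatch: even the trivial "contact by email address" lookup
already looks roughly like the following when expressed in SPARQL against the
Nepomuk ontologies (NCO property names quoted from memory, so treat them as
approximate), whereas the translated IMAP or LDAP filter would be a one-liner:

  #include <Akonadi/ItemSearchJob>

  // roughly what Akonadi::ItemSearchJob has to be fed for "contact with
  // email address foo@bar.org"; the intent is a single key/value lookup
  const QString query = QString::fromLatin1(
    "PREFIX nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> "
    "SELECT ?r WHERE { "
    "  ?r nco:hasEmailAddress ?email . "
    "  ?email nco:emailAddress 'foo@bar.org' . "
    "}" );
  new Akonadi::ItemSearchJob( query );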

> So to make Akonadi rock stable we don't need a semantic search but a
> _fast_ and _reliable_ internal search!

Well, if we could get a fast and reliable Nepomuk version in the near future, 
that looks preferable to me over doing our own stuff there. If we cannot 
count on that, however, it probably is the only option.

> Of course I wouldn't have written this mail without making up my mind about
> how to solve this problem ;) To make one thing clear, I do _not_ want to get
> rid of Nepomuk! It should still be available and fed with the PIM data as it
> is done currently. However, we need an additional, reliable search that is
> used to ensure the basic working of Akonadi!
>
> Here comes a first rough idea of how we could implement such a search:
>
>   1) We do not want to depend on external search engines to keep
> dependencies small and ensure a working system without additional
> configuration from the user => let's use the available search engine: MySQL

Keep in mind that nowadays Akonadi works with PostgreSQL as well, and SQLite 
support is underway. So relying on specific DB features can be problematic.

>   2) The search data should be duplicated as little as possible, and where
> it has to be duplicated, it should be kept in sync with the real data
> automatically

Note that the feeder agents use the same infrastructure for keeping the index 
in sync as the resources use to keep their backends in sync. So, if that's 
not reliable enough we have a much bigger problem anyway.
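
For reference, the feeders are ordinary agents, i.e. they receive the change
notifications through the same AgentBase::Observer/Monitor machinery the
resources are built on. A stripped-down sketch (class name simplified,
observer registration and change recorder setup omitted):

  #include <Akonadi/AgentBase>
  #include <Akonadi/Collection>
  #include <Akonadi/Item>
  #include <QtCore/QByteArray>
  #include <QtCore/QSet>

  class SearchFeederAgent : public Akonadi::AgentBase,
                            public Akonadi::AgentBase::Observer
  {
    public:
      SearchFeederAgent( const QString &id ) : Akonadi::AgentBase( id ) {}

    protected:
      // the server pushes every change to us, we only mirror it into the index
      void itemAdded( const Akonadi::Item &item, const Akonadi::Collection &col )
      { /* add the item to the index */ }
      void itemChanged( const Akonadi::Item &item, const QSet<QByteArray> &parts )
      { /* update the index entry */ }
      void itemRemoved( const Akonadi::Item &item )
      { /* drop the index entry */ }
  };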

>   3) The search language should be easy and only as powerful as necessary
>
> => The search service should be an integral part of the Akonadi server, or
> alternatively a separate process that is _always_ started and controlled by
> the Akonadi server.

We already have autostart support; what the agent manager is missing is 
deletion prevention for the mandatory agents. That's needed no matter what 
backend we use for the search.

> We could have an additional table in the MySQL database:
>
>   CREATE TABLE AkonadiSearch (
>     type_identifier  VARCHAR(100),  -- data type identifier, e.g. 'contact'
>     item_id          BIGINT,        -- item id the search entry belongs to
>     field_identifier VARCHAR(100),  -- type specific search field, e.g. 'email'
>     value            VARCHAR(100)   -- value of the search field, e.g. 'foo at bar.org'
>   )
>
> The feeder agents would now feed the data that should be searchable into
> this table. But be aware, only the basic data should be fed in here, no
> full-text index etc... only the basic stuff!!!

How does the feeder agent access this table? It is a separate process and 
therefore has no direct access to the database.

> If an item is removed from the system, the Akonadi server can directly
> delete the entries in the AkonadiSearch table together with the entries
> from the PimItems table, which will ensure that no stale entries exist. A
> client application can send SQL queries to the server (I guess there are
> more developers who understand SQL than SPARQL ;)) to retrieve the item uid
> for a matching search field.
>
>   SELECT item_id FROM AkonadiSearch WHERE type_identifier = 'contact'
>                                       AND field_identifier = 'name'
>                                       AND value LIKE 'To%'
>
> As you can see, with the '%' operator we can even do simple startsWith,
> endsWith or contains statements. It might be worth having additional
> columns 'value_num' and 'value_date' that store numeric values or date
> values, so we could use the following:
>
>   SELECT item_id FROM AkonadiSearch WHERE type_identifier = 'event'
>                                       AND field_identifier = 'start_date'
>                                       AND value_date >= '2009-08-20'
>                                       AND value_date <= '2009-08-25'
>
> A further advantage: we can fine-tune the indexes of that table to make the
> most common searches as fast as possible.

Using SQL and exposing various internal implementation details that way sounds 
like a really bad idea to me. Sure, it's probably easy to implement, but it 
will make it impossible to change anything there later on. Also, SQL doesn't 
really look like an easy language for query translation on the resource side 
to me.

Even if you reduce the query that is sent to just the WHERE part, to prevent 
people from messing around with the other tables or returning the wrong result 
column, I'd still expect people to use database-specific features sooner or 
later.

Regarding developer knowledge, we probably don't want to write those queries 
manually anyway, but use an API for that. Then you only need to know the query 
language for more advanced uses.
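
A purely hypothetical sketch of what such an API could look like (all class
and method names are made up for illustration), independent of whatever query
language ends up on the wire:

  // client code never writes SQL/SPARQL/XESAM by hand, the job serializes
  // the query into whatever the server side understands
  Akonadi::SearchQuery query;                              // hypothetical class
  query.setMimeType( QLatin1String( "text/directory" ) );  // contacts only
  query.addTerm( QLatin1String( "email" ),
                 QLatin1String( "foo@bar.org" ),
                 Akonadi::SearchQuery::Equals );

  // assumes an ItemSearchJob overload taking such a query object
  Akonadi::ItemSearchJob *job = new Akonadi::ItemSearchJob( query, this );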

During the last Akonadi meeting we thought about using a XESAM subset, which 
is XML and therefore (hopefully) much easier to automatically translate into 
other query languages (IMAP, LDAP, SQL, ...).

regards
Volker