[Nepomuk] [RFC] Avoid communicating through the Nepomuk Storage
Christian Mollekopf
chrigi_1 at fastmail.fm
Sun May 26 11:23:14 UTC 2013
On Sunday 26 May 2013 04.28:01 Vishesh Handa wrote:
> Hey guys
>
> I have made a very important discovery - The Storage service is a big
> bottleneck!
>
> Running a query such as - 'select * where { graph ?g { ?r ?p ?o. } } LIMIT
> 50000' by directly connecting to virtuoso via ODBC takes about 2.65
> seconds. In contrast running the same query by using the Nepomuk
> ResourceManager's main model takes about 19.5 seconds.
>
> Nepomuk internally uses the Soprano::LocalSocketClient to connect to the
> storage service which runs a Soprano::LocalServer.
>
> I've been trying to optimize this Soprano code for some time now and from
> 4.9 we have a good 200% performance increase. But we can increase it a LOT
> more by just directly communicating with virtuoso.
>
> Pros -
> * 6-8x performance upgrade
> * The storage service isn't using such high cpu when reading
> * Accurate reporting - Suppose app 'x' does a costly query which requires a
> large number of results, then 'x' will have high cpu consumption. Currently
> both NepomukStorage and 'x' have very high cpu consumption.
>
> Cons -
> * Less Control - By having all queries go through the Nepomuk Storage we
> could theoretically build amazing tools to tell us which query is executing
> and how long it is taking. However, no such tool has ever been written - so
> we won't be losing anything.
>
> Before 4.10 this could never have been done because we used to have a lot
> of code in the storage service which handled removable media and other
> devices. This code would often modify the sparql queries and modify the
> results. With 4.10, I threw away all that code.
>
> Comments?
Hey Vishesh,
Akonadi has a similar design (database <-> akonadi server <-> akonadi
session, where the akonadi session sits in the user process), and I've been
pondering the same thing there as well.
The server process through which all queries go is a design decision I don't
fully understand in either system.
It would seem to me much more efficient to move all the work to the user
process, to avoid the extra communication between application and server
process (which in akonadi is an IMAP-like protocol, and in nepomuk even
results in an extra serialization/deserialization). By moving all the required
logic for the data binding etc. into a library, which then does its work in the
user process, each application could talk to the database directly, which I
would expect to always be more efficient than the extra process, as databases
typically handle concurrent access well (except SQLite, AFAIK).
I therefore don't really see how the cons apply, or how this shouldn't have
been possible before 4.10. All the necessary work can just as well be done in
the process of each application (by using a library).
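To illustrate the kind of overhead the extra hop adds, here is a small
self-contained model (plain Python; this is not Nepomuk or Soprano code, and
all names and data are made up): the same result set is fetched once directly
in-process and once through a local-socket proxy that has to serialize and
deserialize it, roughly the way the storage service does.

```python
import json
import socket
import threading

# Toy result set standing in for SPARQL bindings (hypothetical data).
DATA = [{"s": "res%d" % i, "p": "prop", "o": i} for i in range(50000)]

def query_direct():
    """Client library talks to the store in-process: no copy over a socket."""
    return list(DATA)

def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf

def proxy_server(listener):
    """Stand-in for the storage service: serialize all rows onto a socket."""
    conn, _ = listener.accept()
    with conn:
        payload = json.dumps(DATA).encode()
        conn.sendall(len(payload).to_bytes(8, "big") + payload)

def query_via_proxy():
    """Same query, but every row crosses a local socket and is re-parsed."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    t = threading.Thread(target=proxy_server, args=(listener,))
    t.start()
    with socket.create_connection(listener.getsockname()) as s:
        size = int.from_bytes(recv_exact(s, 8), "big")
        rows = json.loads(recv_exact(s, size))
    t.join()
    listener.close()
    return rows

direct = query_direct()
proxied = query_via_proxy()
assert direct == proxied  # identical answers, very different cost
```

Both paths return identical rows; the proxied one additionally pays for the
serialization, the socket copy, and the parse - which is exactly the work that
disappears when the query logic lives in a library inside the client process.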
Even for write access, any DB already has the required serialization
mechanisms for concurrent access built in. If you have a server process, the
server just ends up implementing that serialization a second time.
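As a concrete illustration of that point (a generic sketch using SQLite via
Python's sqlite3 module, not anything from nepomuk or akonadi): several writer
threads can share one database file, and the engine itself serializes their
transactions. Even SQLite, the weakest example above, arbitrates the writes
with its own locking - no extra server process is needed.

```python
import os
import sqlite3
import tempfile
import threading

# One on-disk database shared by several writer processes/threads.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

init = sqlite3.connect(path)
init.execute("CREATE TABLE counter (n INTEGER)")
init.execute("INSERT INTO counter VALUES (0)")
init.commit()
init.close()

def writer(rounds):
    # Each writer has its own connection; timeout makes it wait on locks
    # held by other writers instead of failing with "database is locked".
    conn = sqlite3.connect(path, timeout=30)
    for _ in range(rounds):
        with conn:  # each iteration is one transaction, serialized by SQLite
            conn.execute("UPDATE counter SET n = n + 1")
    conn.close()

threads = [threading.Thread(target=writer, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

conn = sqlite3.connect(path)
print(conn.execute("SELECT n FROM counter").fetchone()[0])  # -> 400
```

No increment is lost, because the database's own locking serializes the
transactions - the same arbitration a hand-rolled server process would have to
reimplement on top.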
To me getting rid of those server processes seems to only have advantages:
* Uses the strengths of databases in concurrency and their ACID properties
* Fewer context switches and other overhead
* Simpler design
* Less mutual interference between processes
If you have all the necessary abstraction layers, you can do this fully
transparently for the user of the library. So I'd say go for it ;-)
CC'ing Volker, because he might know what he's been doing in akonadi ;-)
Cheers,
Christian
>
> PS: This is only for read only operations. All writes should still go
> through the storage service. Though maybe we want to change that as well?