[Nepomuk] [RFC] Avoid communicating through the Nepomuk Storage

Sebastian Trüg sebastian at trueg.de
Mon May 27 08:58:39 UTC 2013


On 05/26/2013 12:58 AM, Vishesh Handa wrote:
> Hey guys
>
> I have made a very important discovery - The Storage service is a big
> bottleneck!
>
> Running a query such as - 'select * where { graph ?g { ?r ?p ?o. } }
> LIMIT 50000' by directly connecting to virtuoso via ODBC takes about
> 2.65 seconds. In contrast running the same query by using the Nepomuk
> ResourceManager's main model takes about 19.5 seconds.
>
> Nepomuk internally uses the Soprano::LocalSocketClient to connect to the
> storage service which runs a Soprano::LocalServer.
>
> I've been trying to optimize this Soprano code for some time now, and
> since 4.9 we have a good 200% performance increase. But we can increase
> it a LOT more by communicating with virtuoso directly.
>
> Pros -
> * 6-8x query performance improvement
> * The storage service no longer burns CPU relaying read results
> * Accurate reporting - Suppose app 'x' runs a costly query that
> returns a large number of results; then 'x' alone will show high cpu
> consumption. Currently both NepomukStorage and 'x' show very high cpu
> consumption.
>
> Cons -
> * Less Control - With all queries going through the Nepomuk Storage we
> could theoretically build amazing tools to tell us which query is
> executing and how long it is taking. However, no such tool has ever been
> written - so we won't be losing anything.
>
> Before 4.10 this could never have been done because we used to have a
> lot of code in the storage service which handled removable media and
> other devices. This code would often modify the sparql queries and
> modify the results. With 4.10, I threw away all that code.
>
> Comments?
>
> PS: This is only for read only operations. All writes should still go
> through the storage service. Though maybe we want to change that as well?

My 2 cents:

You could even do this for write operations, but then clients would 
always need to use a client library which performs all the checks and 
notifications. I suppose this is fine, but it would require writing, 
for example, a Python lib. Alternatively you could support both: direct 
ODBC writes via C++, and slower writes via the server (internally using 
the C++ client lib) for everyone else (for example scripts).
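For illustration, the direct-ODBC read path discussed above could look 
roughly like this in Python. This is only a sketch: it assumes the 
pyodbc module and a configured Virtuoso ODBC DSN (the DSN name 
"Virtuoso" here is hypothetical). The leading SPARQL keyword is how 
Virtuoso's SQL interface accepts SPARQL queries.

```python
def build_sparql(limit):
    """Return the benchmark query from this thread, with a configurable LIMIT."""
    return "SPARQL select * where { graph ?g { ?r ?p ?o . } } LIMIT %d" % limit

def fetch_triples(dsn="Virtuoso", limit=50000):
    """Run the query directly over ODBC, bypassing the storage service."""
    import pyodbc  # imported lazily; only needed when actually querying
    with pyodbc.connect("DSN=%s" % dsn) as conn:
        cur = conn.cursor()
        cur.execute(build_sparql(limit))
        return cur.fetchall()
```

A real client lib would of course add the checks and change 
notifications mentioned above; this only shows the raw query path.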

All in all it seems like a good idea. I always liked the modular system 
with the storage service, but let's face it: it's a performance drain 
and in the end does not give us much besides a nice design.

Cheers,
Sebastian
