[Nepomuk] [RFC] Avoid communicating through the Nepomuk Storage

Mon May 27 12:48:37 UTC 2013

On Mon, May 27, 2013 at 2:28 PM, Sebastian Trüg <sebastian at trueg.de> wrote:

> On 05/26/2013 12:58 AM, Vishesh Handa wrote:
>
>> Hey guys
>>
>> I have made a very important discovery - The Storage service is a big
>> bottleneck!
>>
>> Running a query such as - 'select * where { graph ?g { ?r ?p ?o. } }
>> LIMIT 50000' by directly connecting to virtuoso via ODBC takes about
>> 2.65 seconds. In contrast running the same query by using the Nepomuk
>> ResourceManager's main model takes about 19.5 seconds.
>>
>> Nepomuk internally uses the Soprano::LocalSocketClient to connect to the
>> storage service which runs a Soprano::LocalServer.
>>
>> I've been trying to optimize this Soprano code for some time now and
>> from 4.9 we have a good 200% performance increase. But we can increase
>> it a LOT more by just directly communicating with virtuoso.
>>
>> Pros -
>> * 6-8x performance upgrade
>> * The storage service isn't using such high cpu when reading
>> * Accurate reporting - Suppose app 'x' does a costly query which
>> requires a large number of results, then 'x' will have high cpu
>> consumption. Currently both NepomukStorage and 'x' have very high cpu
>> consumption.
>>
>> Cons -
>> * Less Control - By having all queries go through the Nepomuk Storage we
>> could theoretically build amazing tools to tell us which query is
>> executing and how long it is taking. However, no such tool has ever been
>> written - so we won't be losing anything.
>>
>> Before 4.10 this could never have been done because we used to have a
>> lot of code in the storage service which handled removable media and
>> other devices. This code would often modify the sparql queries and
>> modify the results. With 4.10, I threw away all that code.
>>
>> Comments?
>>
>> PS: This is only for read only operations. All writes should still go
>> through the storage service. Though maybe we want to change that as well?
>>
>
> My 2 cents:
>
> You could even do this for write operations but then you would need
> clients to always use a client library which does all the checks and
> notifications. I suppose this is fine but of course requires, for example,
> writing a Python lib. Alternatively you could support both: direct ODBC
> writes via C++, slower writes via the server (internally using the C++
> client lib) for everyone else (for example scripts).
>
> All in all it seems like a good idea. I always liked the modular system
> with the storage service, but let's face it: it's a performance drain and
> in the end does not give us much besides a nice design.
>

Doing it for writes seems a little messy right now. Integrating the
ResourceWatcher is going to be hard.

I'm going to push my changes to remove the LocalServer and LocalClient.
This is only for reads, since clients shouldn't be writing raw sparql to
insert stuff.
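
The read/write split described above can be sketched as follows. This is only an illustration, written in Python for brevity rather than the C++ that Nepomuk uses; QueryRouter, is_read_only and the two backend callables are hypothetical names, not part of Nepomuk or Soprano. It assumes a query can be classified by the first keyword after any PREFIX/BASE declarations, and it does not handle SPARQL comments or Virtuoso-specific pragmas.

```python
import re
from typing import Callable, List

# SPARQL query forms that only read data. Anything else (INSERT, DELETE, ...)
# is treated as a write and sent through the storage service.
READ_ONLY_FORMS = {"select", "ask", "construct", "describe"}

# Leading whitespace plus any PREFIX / BASE declarations before the query form.
_PROLOGUE = re.compile(
    r"^\s*(?:(?:prefix\s+[^\s:]*:\s*<[^>]*>|base\s+<[^>]*>)\s*)*",
    re.IGNORECASE,
)

# A backend is any callable that takes a query string and returns result rows.
Backend = Callable[[str], List[tuple]]


def is_read_only(query: str) -> bool:
    """Return True if the first keyword after the prologue is a read-only form."""
    body = _PROLOGUE.sub("", query, count=1)
    match = re.match(r"[A-Za-z]+", body)
    return bool(match) and match.group(0).lower() in READ_ONLY_FORMS


class QueryRouter:
    """Hypothetical dispatcher: reads go direct, everything else via the service."""

    def __init__(self, direct_backend: Backend, storage_service: Backend) -> None:
        self._direct = direct_backend    # stand-in for a direct ODBC connection
        self._service = storage_service  # stand-in for the storage service path

    def execute(self, query: str) -> List[tuple]:
        backend = self._direct if is_read_only(query) else self._service
        return backend(query)
```

With two stand-in backends, QueryRouter(direct, service).execute(q) calls direct for the 50000-row select quoted earlier in this thread and service for an INSERT. Anything the classifier does not recognise falls back to the storage service, which is the conservative choice.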

>
> Cheers,
> Sebastian
>
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
>

-- 
Vishesh Handa