[Nepomuk] Re: Presentation, queries and other stuff

Ignacio Serantes kde at aynoa.net
Sun Mar 6 14:33:23 CET 2011


On Wed, Mar 2, 2011 at 12:24 PM, Sebastian Trüg <trueg at kde.org> wrote:

> Hi!
>
> Sorry for the lag but I'm really busy.
>
> On 03/02/2011 11:57 AM, Ignacio Serantes wrote:
> > On Wed, Mar 2, 2011 at 9:55 AM, Sebastian Trüg <trueg at kde.org> wrote:
>


> > This is worrying because many problems can't be solved with brute-force
> > queries full of inner joins. This support must be added if you want the
> > Nepomuk db to work with a limited amount of RAM and CPU.
>
> Please make suggestions.
>

> > The story of my life, I'm always one step away from a working Nepomuk :).
> > Can you explain to me a little how these final optimizations work?
>
> A better visibility check directly on the resource. This also requires a
> nao:userVisible property on all resources which is done by a new class.
>

Great! This is precisely the big problem I found with your queries, and you
resolved it by adding a flag. Please check the following simple queries, each
testing a single search string. The strings were selected to obtain result
sets of different sizes.

Test #01 [    API_H1,       (dorama)]:    93 results in   0.06073117 seconds
Test #02 [    API_H2,       (dorama)]:    93 results in   1.84762883 seconds
Test #03 [    API_H3,       (dorama)]:    93 results in   2.01492691 seconds
Test #04 [    API_H4,       (dorama)]:    93 results in   1.91813993 seconds
Test #05 [    API_H5,       (dorama)]:    93 results in   1.60597801 seconds
Test #06 [       API,       (dorama)]:    93 results in   2.23234606 seconds
Test #07 [    API_H1,    (ha ji won)]:     2 results in   0.01688004 seconds
Test #08 [    API_H2,    (ha ji won)]:     2 results in   5.00310612 seconds
Test #09 [    API_H3,    (ha ji won)]:     2 results in   4.11244702 seconds
Test #10 [    API_H4,    (ha ji won)]:     2 results in   3.95332003 seconds
Test #11 [    API_H5,    (ha ji won)]:     2 results in   3.90592694 seconds
Test #12 [       API,    (ha ji won)]:     2 results in  11.18835807 seconds
Test #13 [    API_H1,        (music)]:  5434 results in   1.39107108 seconds
Test #14 [    API_H2,        (music)]:  5420 results in   4.74685478 seconds
Test #15 [    API_H3,        (music)]:  5420 results in   5.27211499 seconds
Test #16 [    API_H4,        (music)]:  5420 results in   5.14297509 seconds
Test #17 [    API_H5,        (music)]:  5420 results in   4.99993420 seconds
Test #18 [       API,        (music)]:  5420 results in   9.88414216 seconds

The query is the same in each case, with minor changes applied:
API_H1: the same query without the visibility inner join, so it is easy to
see where the main performance problem lies.
API_H2: the query using a subquery instead of an inner join (sketched below).
The query is equivalent because joins distribute over unions: (A U B) X C =
(A X C) U (B X C). Because the subquery is identical in every branch it can
be reused by the query optimizer, and because the intermediate join result
sets are small, the union is fast. This is easy to verify by running a
simple SELECT DISTINCT * in both queries and comparing the result sets.
API_H3: like API_H2 but without using OPTIONAL to obtain columns.
API_H4: API_H3 with a different filter construction method.
API_H5: API_H3 with another filter construction method.
API: the query created by the API, so no additional explanation is needed.
It runs last, giving it a cache advantage over the other queries.
Note: the API_H1 result count differs in the "music" case because that query
is not equivalent to the other five.
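To make the H1/H2 contrast concrete, here are minimal SPARQL sketches of the
two shapes. These are simplified stand-ins, not the exact queries from the
attachment, and they assume nao:userVisible carries a plain boolean value:

  PREFIX nao: <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#>

  # H1 shape: no visibility join at all, hence the fast times.
  SELECT DISTINCT ?r WHERE {
      ?r ?p ?o .
      FILTER(REGEX(STR(?o), "dorama", "i"))
  }

  # H2 shape: the visibility check pushed into a subquery, so the small
  # visible-resources set is computed once and joined against the string
  # matches, instead of being re-joined inside every branch of the query.
  SELECT DISTINCT ?r WHERE {
      ?r ?p ?o .
      FILTER(REGEX(STR(?o), "dorama", "i"))
      { SELECT ?r WHERE { ?r nao:userVisible true . } }
  }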

This is an initial test, and with my current db knowledge caution is
mandatory, but here are some initial conclusions:
1) As in SQL, subqueries improve performance over indiscriminate inner
joins on large result sets. I'm not sure whether this is general to all
triplestore DBMSs or specific to Virtuoso, which is an RDBMS with
triplestore functionality added on top. It seems the search API should
build queries using fewer joins and more subqueries.
2) OPTIONAL seems to have no performance impact when extracting column
values, but the queries become less easy to read and you must write more
characters :) (see the sketch after this list).
3) API_H3, API_H4 and API_H5 have similar times, so without a profiling
tool and more tests it is difficult to tell which is best. Probably the
query optimizer is doing its job and, in fact, there is no difference at
all.
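A hedged sketch of the OPTIONAL difference from point 2, using an assumed
nie:title property rather than the exact columns from the attachment:

  PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>

  # With OPTIONAL: ?title simply stays unbound when nie:title is missing.
  SELECT ?r ?title WHERE {
      ?r a nie:InformationElement .
      OPTIONAL { ?r nie:title ?title . }
  }

  # Without OPTIONAL (the H3 shape): resources lacking nie:title drop out
  # of the result set, so the two forms only return the same rows when the
  # property is always present, as it was in these tests.
  SELECT ?r ?title WHERE {
      ?r a nie:InformationElement .
      ?r nie:title ?title .
  }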

In "sparql_test.spql" attach there are the queries.

I come back to stored procedures because this problem could be solved inside
Virtuoso. You always need to filter on a value, so you construct a relation
every time you need to filter on that value. The problem with this approach
is that every time you run a query you must build this relation again, which
is time consuming. And the problem becomes more and more serious as the data
grows.

In my db there are 23,766 resources with userVisible = true, and running
every query with a join against a relation like this is not a good idea; we
can't trust the query optimizer to do all the work without any help. There
are some solutions to this problem at the db level:

1) The (probably) ideal one: create a view in the DBMS from your query and
always use it to filter your data. If the query is very simple and the DBMS
is good, this is self-maintaining and fast, because the view doesn't
physically exist; it is only a different representation of the data already
in your db.

2) If the first approach doesn't work because the view is slow and general
db performance degrades, you must create a table to store this data and use
stored procedures and triggers to maintain it. A view from the first case
and a physical table look the same at the query language level, so if we
have already tried the first case, most of the work is done and your queries
need only minor changes, or even none.

3) The easy one: add a property with an index to do the filtering (sketched
below). You said that this is your approach and that it works, but you must
be cautious, because today it is visibility, later it is user restrictions,
then device restrictions, and so on; you can't solve every filtering problem
with properties, just as you can't solve every query problem with joins.
Growing the record size and adding that many indexes can carry a performance
penalty as your data grows, although this differs from one DBMS to another,
so there are no general rules.
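To make point 3 concrete, here is a sketch of how flag properties stack up.
nao:userVisible is real, but the user and device restrictions are
hypothetical illustrations using an invented ex: vocabulary:

  PREFIX nao: <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#>
  PREFIX ex:  <http://example.org/hypothetical#>

  # Every additional restriction becomes one more indexed property per
  # resource and one more lookup per candidate, widening the record as
  # point 3 warns.
  SELECT ?r WHERE {
      ?r nao:userVisible true .
      ?r ex:visibleToUser   <http://example.org/users/me> .       # hypothetical
      ?r ex:visibleOnDevice <http://example.org/devices/laptop> . # hypothetical
      ?r ?p ?o .
      FILTER(REGEX(STR(?o), "dorama", "i"))
  }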

You can also solve this problem in the application layer but, if the problem
is db related, it should be solved at the db level if possible. Obviously
this is a relaxed rule.

Because, sadly, there is no perfect solution, you must try different
approaches and use the one that best suits each case, and here profiling
tools are your friends. On that subject, I tried to activate the Virtuoso
web interface to enable profiling, but I couldn't. I downloaded and compiled
Virtuoso, and with my own instance the web interface is available, but I
can't figure out how to activate it in Nepomuk's Virtuoso instance. I could
copy the db and use it in my own Virtuoso instance, but that is
uncomfortable if I need to make changes to the data through Nepomuk.

I found many errors in the soprano-virtuoso db log. I created a db from
scratch and these errors persist: no method of name existsNode, getClobVal,
getNumVal, getSchemaURL, etc. I'm reporting them because I don't know
whether these errors are relevant or not.

I'm sorry if I'm not much help, but I'm doing my best. I had never worked
with SPARQL and triplestores before, but I'm learning.


> Cheers,
> Sebastian
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
>



-- 
Cheers,
Ignacio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparql_tests.spql
Type: application/octet-stream
Size: 9860 bytes
Desc: not available
Url : http://mail.kde.org/pipermail/nepomuk/attachments/20110306/ff8a7e2a/attachment-0001.dll 

