[Nepomuk] Re: Presentation, queries and other stuff
Sebastian Trüg
trueg at kde.org
Mon Mar 7 10:09:39 CET 2011
I forgot one thing:
about the web interface. Nepomuk is running Virtuoso in lite mode. To
disable that you need to hack Soprano and comment line 328 in
soprano/backends/virtuoso/virtuosocontroller.cpp.
Sadly there is no configuration option for that at the moment.
Cheers,
Sebastian
On 03/06/2011 02:33 PM, Ignacio Serantes wrote:
> On Wed, Mar 2, 2011 at 12:24 PM, Sebastian Trüg <trueg at kde.org
> <mailto:trueg at kde.org>> wrote:
>
> Hi!
>
> Sorry for the lag but I'm really busy.
>
> On 03/02/2011 11:57 AM, Ignacio Serantes wrote:
> > On Wed, Mar 2, 2011 at 9:55 AM, Sebastian Trüg <trueg at kde.org
> <mailto:trueg at kde.org>
> > <mailto:trueg at kde.org <mailto:trueg at kde.org>>> wrote:
>
>
>
> > This is worrying because many problems can't be solved with brutal
> > queries with brutal inner joins. This stuff must be added if you want
> > that nepomuk db works with a limited amount of RAM and CPU.
>
> Please make suggestions.
>
>
> > The story of my life, I'm always be an step before a working
> nepomuk :).
> > Can you explain me a little bit how are this final optimizations?
>
> A better visibility check directly on the resource. This also requires a
> nao:userVisible property on all resources which is done by a new class.
>
>
> Great!, precisely this is the big problem I found with your queries and
> you resolved adding a flag. Please, check the following simple queries
> with one string tests. Strings are selected to obtain result sets with
> different size.
>
> Test #01 [ API_H1, (dorama)]: 93 results in 0.06073117
> seconds
> Test #02 [ API_H2, (dorama)]: 93 results in 1.84762883
> seconds
> Test #03 [ API_H3, (dorama)]: 93 results in 2.01492691 seconds
> Test #04 [ API_H4, (dorama)]: 93 results in 1.91813993 seconds
> Test #05 [ API_H5, (dorama)]: 93 results in 1.60597801 seconds
> Test #06 [ API, (dorama)]: 93 results in 2.23234606 seconds
> Test #07 [ API_H1, (ha ji won)]: 2 results in 0.01688004
> seconds
> Test #08 [ API_H2, (ha ji won)]: 2 results in 5.00310612 seconds
> Test #09 [ API_H3, (ha ji won)]: 2 results in 4.11244702 seconds
> Test #10 [ API_H4, (ha ji won)]: 2 results in 3.95332003 seconds
> Test #11 [ API_H5, (ha ji won)]: 2 results in 3.90592694 seconds
> Test #12 [ API, (ha ji won)]: 2 results in 11.18835807 seconds
> Test #13 [ API_H1, (music)]: 5434 results in 1.39107108 seconds
> Test #14 [ API_H2, (music)]: 5420 results in 4.74685478 seconds
> Test #15 [ API_H3, (music)]: 5420 results in 5.27211499 seconds
> Test #16 [ API_H4, (music)]: 5420 results in 5.14297509 seconds
> Test #17 [ API_H5, (music)]: 5420 results in 4.99993420 seconds
> Test #18 [ API, (music)]: 5420 results in 9.88414216 seconds
>
> The query is the same but I applied minor changes:
> API_H1: is the same query without visibility inner join so easy to
> wonder where is the main performance problem.
> API_H2: is the query using a subquery and not an inner join. Query is
> equivalent (A U B) X C = (A X C) U (B X C). Because subquery is the same
> could be optimized by query optimizer and, because intermediate joins
> result sets are small union is fast. This is easy to see with a simple
> SELECT DISTINCT * in both queries and comparing result sets.
> API_H3: is like H1 without using optional to obtain columns.
> API_H4: is API_H3 with a different filter construction method.
> API_H5: is API_H3 with another different filter construction method.
> API: is the query created by the API so no additional explanation is
> needed. It is the last one to give it cache advantage over other queries.
> Note: API_H1 results number differs in "music" case because this query
> is not equivalent to the other five.
>
> This is an initial test and with my current db knowledge caution is
> mandatory but here are some initial conclusions:
> 1) As in SQL, subqueries increment performance
> over indiscriminated inner joins in large results sets. I'm not sure if
> this is general to all tripletstore dbms or only to Virtuoso that is a
> rdbms with added tripletstore functionally. Seems like search API must
> build queries use less joins and more subqueries.
> 2) Optional seems haven't performance impact extracting column values
> but queries are less easy to read and you must write more characters :).
> 3) API_H3, API_H4 and API_H5 has similar times so without a profiling
> tool and more test is difficult to wonder what is the best. Probably
> query optimizer is doing it's job and, in fact, there is no differences
> at all.
>
> In "sparql_test.spql" attach there are the queries.
>
> I go back to stored procedures because this problem could be solved
> using Virtuoso. You have the need to filter for an value always so you
> construct a relation any time you need to filter for this value. The
> problem with this approach is that any time you do a query you must
> build again this relation and this is time consuming. And the problem is
> more and more serious when data grows.
>
> In my db there are 23.766 results to userVisible = True and doing all
> queries with a join with a table like this is not a good idea and we
> can't trust that query optimizers do all your job without any
> help. There are some solutions to this problem at db level:
>
> 1) The (probably) ideal one, create a view in the dbms using your query
> and use always this to filter your data. If query is very simple and
> dbms is good this is automaintained and fast because view don't exists
> and is only a different representation of data in your db.
>
> 2) If the first approach don't works, view is slow and general db
> performance is degrading, you must create a table to store this data and
> use stored procedures and triggers to maintain this table. A view from
> first case and a physical table is the same at query language level so,
> if we tried the first case, most of the work is done and your queries
> need only minor changes or even none.
>
> 3) The easy one, add a property with and index to do the filter. You
> told that this is your approach and works but you must be cautious
> because now is visibility, later is user restriction, later devices
> restriction and so on, and you can't solve all your filter problems with
> properties as you can't solve all your query problems with joins.
> Increase your size register and add so many indexes could have a
> performance penalty when your data grows but note that this differs from
> one dbms to other so there is no general rules.
>
> You can solve this problem in application layer too but, if the problem
> is db related must be solved at db level if if possible. Obviously this
> is a relaxed rule.
>
> Because, sadly, there is no perfect solution you must try
> different approaches and use the solution that better suits to any case
> and, here, profiling tools are your friends. About this, I will try to
> activate Virtuoso web interface to activate profiling but I can't. I
> download and compile Virtuoso and with my instance web interface is
> available but I can't imagine how activate it in Nepomuk's Virtuoso
> instance. I can copying the db and use it in my Virtuoso instance but
> this is uncomfortably if I need doing changes in data using Nepomuk.
>
> I found many errors in soprano-virtuoso db log. I created a db from
> scratch and this errors persists: no method of name existsNode,
> getClobVal, getNumVal, getSchemaURL etc... I'm informing you because I
> don't know if this errors are relevant or not.
>
> I'm sorry if I'm not much help but I'm doing my best. I never worked
> before with Sparql and tripletstore but I'm learning.
>
>
> Cheers,
> Sebastian
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org <mailto:Nepomuk at kde.org>
> https://mail.kde.org/mailman/listinfo/nepomuk
>
>
>
>
> --
> Cheers,
> Ignacio
>
>
>
>
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
More information about the Nepomuk
mailing list