[Nepomuk] [RFC] Simplify Nepomuk Graph handling

Thu Jan 3 09:52:30 UTC 2013

I totally agree. In fact, have a look at 
DataManagementModel::Private::m_ignoreCreationDate.

As for the duplicate statements: that is a good point also. It greatly 
simplifies the removal of data per app.
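
To make the removal point concrete, here is a minimal sketch in Python rather than the real C++ code. It assumes quads are plain (subject, predicate, object, graph) tuples and that each graph has exactly one maintaining application; the function and variable names are made up for illustration and are not the DataManagementModel API.

```python
# Quads are (subject, predicate, object, graph) tuples; all names are illustrative.
def remove_data_by_application(quads, maintained_by, app):
    """Return the quads left after dropping every graph maintained by `app`."""
    doomed = {g for g, owner in maintained_by.items() if owner == app}
    return {q for q in quads if q[3] not in doomed}


quads = {
    ("resA", "rdf:type", "nco:Contact", "G1"),   # added by App1
    ("resA", "rdf:type", "nco:Contact", "G2"),   # duplicate added by App2
    ("resA", "nco:fullName", "Alice", "G1"),     # only App1 added this
}
maintained_by = {"G1": "App1", "G2": "App2"}

# Removing App1's data is a plain "drop the graph"; App2's copy is untouched.
remaining = remove_data_by_application(quads, maintained_by, "App1")
```

With duplicates, removal never has to check whether another application still maintains a statement: the other application's copy simply stays in its own graph.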

On 01/03/2013 10:21 AM, Vishesh Handa wrote:
> Ping?
>
>
> On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa <me at vhanda.in> wrote:
>
>     Hey everyone
>
>     This is another one of those big changes that I have been thinking
>     about for quite some time. This email has a number of different
>     proposals, all of which add up to create this really simple system,
>     with the same functionality.
>
>     Graph Introduction
>     ---------------------------
>
>     For those of you who don't know about graphs in Nepomuk, please read
>     [1]. It serves as a decent introduction to where Graphs are used.
>     Currently, we create a new graph for each data-management command.
>
>     What does this provide?
>     ----------------------------------
>
>     We currently use graphs for 2 features -
>
>     1. Remove Data By Application
>     2. Backup
>
>     What all information do we store?
>     ------------------------------------------------
>
>     1. Creation date of each graph
>     2. Modification date of each graph (always the same as the creation date)
>     3. Type of the graph - Normal or Discardable
>     4. Maintained by which application
>
>     (1) and (2) currently serve us no purpose. They never have. They are
>     just things that are nice to have. I cannot name a single use case
>     for them, except that they let us see when a statement was added.
>
>     (3) is what powers Nepomuk Backup. We do not back up everything, only
>     the data that is not discardable. So, stuff like indexing
>     information is not saved. Currently this system is slightly broken,
>     as one cannot simply filter on non-discardable data: that also
>     includes stuff like the ontologies. So the queries get quite
>     complicated. Plus, one still needs to save certain information from
>     the discardable data, such as the rdf:type, nao:created, and
>     nao:lastModified. Hence, the query becomes even more complex. On my
>     machine, with some 10 million triples, creating a backup takes a
>     sizeable amount of time (over 5 minutes) and a lot of CPU.
>
>     Current query -
>
>     select distinct ?r ?p ?o ?g where {
>     graph ?g { ?r ?p ?o. }
>     ?g a nrl:InstanceBase .
>     FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
>     FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
>     } ORDER BY ?r ?p
>
>     + Requires additional queries to back up the type, nao:lastModified,
>     and nao:created.
>
>     Maybe it would be simpler if we did not make this distinction?
>     Instead we back up everything (really fast), and during restoration
>     just discard the data for files that no longer exist? It would
>     save users the trouble of re-indexing their files as well. More
>     importantly, it (might) save them the trouble of re-indexing their
>     email, which is a very slow process.
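
The "back up everything, filter on restore" idea could look roughly like the sketch below. It is Python over plain (subject, predicate, object, graph) tuples, not the actual backup code. It assumes a file resource can be recognised by a `nie:url` statement holding a `file://` URL; the `restore` function and its `file_exists` parameter are hypothetical names.

```python
import os
from urllib.parse import unquote, urlparse


def restore(quads, file_exists=os.path.exists):
    """Keep every quad except those describing files that are now missing."""
    # Map each file resource to its local path (assumption: nie:url + file://).
    paths = {}
    for s, p, o, _g in quads:
        if p == "nie:url" and o.startswith("file://"):
            paths[s] = unquote(urlparse(o).path)
    # Resources whose file no longer exists are dropped wholesale.
    stale = {s for s, path in paths.items() if not file_exists(path)}
    return [q for q in quads if q[0] not in stale]


backup = [
    ("res1", "nie:url", "file:///home/u/kept.txt", "G1"),
    ("res1", "nao:numericRating", "8", "G1"),
    ("res2", "nie:url", "file:///home/u/gone.txt", "G1"),
    ("res2", "nao:numericRating", "3", "G1"),
    ("res3", "rdf:type", "nco:Contact", "G2"),
]

# Pretend only kept.txt survived; the real default checks the file system.
restored = restore(backup, file_exists=lambda path: path == "/home/u/kept.txt")
```

The backup side then needs no discardable filter at all, and the existence check is paid once at restore time. Non-file resources such as `res3` pass through untouched.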
>
>     Also, right now one can only set the graph via StoreResources, and
>     not via any other Data Management command.
>
>     ----
>
>     (4) is the most important reason for graphs. It allows us to know
>     which application added the data. Stuff starts to get a little
>     messy when two applications add the same data. In that case those
>     statements need to be split out of their existing graph and a new
>     graph needs to be created which will be maintained by both
>     applications. This is expensive.
>
>     I'm proposing that instead of splitting the statement out of the
>     existing graph, we just create a duplicate of the statement with a
>     new graph, containing the other application.
>
>     Eg -
>
>     Before -
>
>     graph <G1> { <resA> a nco:Contact . }
>     <G1> nao:maintainedBy <App1> .
>     <G1> nao:maintainedBy <App2> .
>
>     After -
>
>     graph <G1> { <resA> a nco:Contact . }
>     graph <G2> { <resA> a nco:Contact . }
>     <G1> nao:maintainedBy <App1> .
>     <G2> nao:maintainedBy <App2> .
>
>     The advantage of this approach is that it would simplify some of the
>     extremely complex queries in the DataManagementModel. That would
>     result in a direct performance upgrade. It would also solve some of
>     the ugly transaction problems we have when 2 commands are accessing
>     the same statement, and one command removes the data in order to
>     move it to another graph. This has happened to me a couple of times.
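
A small Python model of the duplicate-statement approach, reusing the G1/G2 example above. Quads are (subject, predicate, object, graph) tuples, and `add_statement`, `distinct_triples` and `maintainers` are illustrative helpers, not existing Nepomuk calls. The point is that adding a statement only inserts and never removes, so there is nothing for a concurrent command to trip over.

```python
def add_statement(quads, graph_of, app, triple):
    """Insert `triple` into the app's own graph; other apps' quads are untouched."""
    quads.add((*triple, graph_of[app]))


def distinct_triples(quads):
    """What a reader sees: duplicates across graphs collapse to one triple."""
    return {(s, p, o) for s, p, o, _g in quads}


def maintainers(quads, graph_of, triple):
    """Every application that currently holds a copy of `triple`."""
    app_of = {graph: app for app, graph in graph_of.items()}
    return {app_of[g] for s, p, o, g in quads if (s, p, o) == triple}


quads = set()
graph_of = {"App1": "G1", "App2": "G2"}
contact = ("resA", "rdf:type", "nco:Contact")

add_statement(quads, graph_of, "App1", contact)
add_statement(quads, graph_of, "App2", contact)   # duplicate, no graph split
add_statement(quads, graph_of, "App2", contact)   # repeating it changes nothing
```

The cost is one extra quad per additional application, and readers need a distinct over (subject, predicate, object), which the existing backup query already does with `select distinct`.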
>
>     ---
>
>     My third proposal: considering that the modification and creation
>     dates of a graph do not serve any purpose, perhaps we shouldn't
>     store them at all? Unless there is a proper use case, why
>     go through the added effort? Normally, storing a couple of extra
>     properties isn't a big deal, but if we do not store them, then we
>     can effectively kill the need to create a new graph for each data
>     management command.
>
>     With this one would just need 1 graph per application, in which all
>     of its data would reside. We wouldn't need to check for empty graphs
>     or anything. It would also reduce the number of triples in a
>     database, which can get alarmingly high.
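
A sketch of what "one graph per application" means for bookkeeping, in Python. `GraphRegistry`, the `graph:` URI prefix and the choice of exactly two metadata triples per graph (type and maintainer) are assumptions made for this example, not the real storage layout.

```python
class GraphRegistry:
    """Hands out a single, reusable graph per application (illustrative only)."""

    def __init__(self):
        self._graph_of = {}
        self.metadata = set()   # triples describing the graphs themselves

    def graph_for(self, app):
        # Create the graph on first use; every later command reuses it.
        if app not in self._graph_of:
            graph = f"graph:{app}"   # placeholder URI scheme
            self._graph_of[app] = graph
            self.metadata.add((graph, "rdf:type", "nrl:InstanceBase"))
            self.metadata.add((graph, "nao:maintainedBy", app))
        return self._graph_of[app]


registry = GraphRegistry()

# 1000 data-management commands issued by two applications.
graphs_used = [registry.graph_for(f"App{i % 2 + 1}") for i in range(1000)]
```

In this model the number of graphs and metadata triples depends on the number of applications, not on the number of commands, and since a graph is never created per command there are no leftover empty graphs to clean up.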
>
>     This seems like a pretty good system to me, which provides all the
>     benefits and none of the losses.
>
>     What do you guys think?
>
>     [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts
>
>     --
>     Vishesh Handa
>
>
>
>
> --
> Vishesh Handa