[Nepomuk] [RFC] Simplify Nepomuk Graph handling

Thu Jan 3 09:21:49 UTC 2013

Ping?

On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa <me at vhanda.in> wrote:

> Hey everyone
>
> This is another one of those big changes that I have been thinking about
> for quite some time. This email has a number of different proposals, all of
> which add up to create this really simple system, with the same
> functionality.
>
> Graph Introduction
> ---------------------------
>
> For those of you who don't know about graphs in Nepomuk, please read [1].
> It serves as a decent introduction to where graphs are used. Currently, we
> create a new graph for each data-management command.
>
> What does this provide?
> ----------------------------------
>
> We currently use graphs for 2 features -
>
> 1. Remove Data By Application
> 2. Backup
>
> What all information do we store?
> ------------------------------------------------
>
> 1. Creation date of each graph
> 2. Modification date of each graph ( Always the same as creation date )
> 3. Type of the graph - Normal or Discardable
> 4. Maintained by which application
>
> (1) and (2) currently serve us no purpose. They never have. They are just
> things that are nice to have. I cannot name a single use case for them,
> except that they let us see when a statement was added.
>
> (3) is what powers Nepomuk Backup. We do not back up everything, only the
> data that is not discardable. So, stuff like indexing information is not
> saved. Currently this system is slightly broken, as one cannot just filter
> for data that is not discardable, because that includes stuff like the
> ontologies. So the queries get quite complicated. Plus, one still needs to
> save certain information from the discardable data, such as the rdf:type,
> nao:created, and nao:lastModified. Hence, the query becomes even more
> complex. On my machine with some 10 million triples, creating a backup
> takes a sizeable amount of time (over 5 minutes), with a lot of CPU usage.
>
> Current query -
>
> select distinct ?r ?p ?o ?g where {
> graph ?g { ?r ?p ?o. }
> ?g a nrl:InstanceBase .
> FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
> FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
> } ORDER BY ?r ?p
>
> + Requires additional queries to back up the rdf:type, nao:lastModified,
> and nao:created.
>
> Maybe it would be simpler if we did not make this distinction? Instead, we
> back up everything (really fast), and during restoration just discard the
> data for files that no longer exist? It would save users the trouble of
> re-indexing their files as well. More importantly, it might save them the
> trouble of re-indexing their email, which is a very slow process.
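A rough sketch of the "back up everything, filter on restore" idea, in plain Python rather than the real Nepomuk API. It assumes statements are simple (subject, predicate, object, graph) tuples and that file resources carry a nie:url pointing at a file:// URL. The function name restore_filter and the injectable exists check are made up for illustration.

```python
import os
from urllib.parse import urlparse, unquote

NIE_URL = "nie:url"  # assumed predicate linking a resource to its file URL


def restore_filter(statements, exists=os.path.exists):
    """Return the statements to restore, dropping vanished files.

    Pass 1 collects every resource whose local file no longer exists.
    Pass 2 keeps only statements whose subject is not in that set.
    """
    missing = set()
    for subj, pred, obj, _graph in statements:
        if pred == NIE_URL and obj.startswith("file://"):
            path = unquote(urlparse(obj).path)
            if not exists(path):
                missing.add(subj)
    return [s for s in statements if s[0] not in missing]


# Hypothetical backup contents: two file resources and one contact.
backup = [
    ("nepomuk:/res/a", "nie:url", "file:///home/u/kept.txt", "G1"),
    ("nepomuk:/res/a", "rdf:type", "nfo:FileDataObject", "G1"),
    ("nepomuk:/res/b", "nie:url", "file:///home/u/gone.txt", "G1"),
    ("nepomuk:/res/b", "rdf:type", "nfo:FileDataObject", "G1"),
    ("nepomuk:/res/c", "rdf:type", "nco:Contact", "G2"),
]

# Pretend only kept.txt is still on disk.
restored = restore_filter(backup, exists=lambda p: p == "/home/u/kept.txt")
```

The backup itself then needs no graph-type filtering at all; the only per-resource work happens once, at restore time.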
>
> Also, right now one can only set the graph via StoreResources, and not via
> any other Data Management command.
>
> ----
>
> (4) is the most important reason for graphs. It allows us to know which
> application added the data. Stuff starts to get a little messy when two
> applications add the same data. In that case those statements need to be
> split out of their existing graph and a new graph needs to be created which
> will be maintained by both applications. This is expensive.
>
> I'm proposing that instead of splitting the statement out of the existing
> graph, we just create a duplicate of the statement with a new graph,
> containing the other application.
>
> Eg -
>
> Before -
>
> graph <G1> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G1> nao:maintainedBy <App2> .
>
> After -
>
> graph <G1> { <resA> a nco:Contact . }
> graph <G2> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G2> nao:maintainedBy <App2> .
>
> The advantage of this approach is that it would simplify some of the
> extremely complex queries in the DataManagementModel. That would result in
> a direct performance upgrade. It would also solve some of the ugly
> transaction problems we have when two commands are accessing the same
> statement, and one command removes the data in order to move it to another
> graph. This has happened to me a couple of times.
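To make the duplicate-instead-of-split idea concrete, here is a minimal in-memory sketch. It is not the DataManagementModel code; the quads set, the maintained_by dict, and both function names are assumptions made for illustration only.

```python
def add_statement(quads, maintained_by, triple, app, new_graph):
    """Record that `app` asserted `triple`, in its own fresh graph.

    Existing quads are never moved or deleted, so a concurrent command
    reading the same statement never sees it disappear.
    """
    quads.add((*triple, new_graph))
    maintained_by[new_graph] = app


def remove_data_by_application(quads, maintained_by, app):
    """Delete every graph maintained by `app`; other copies survive."""
    doomed = {g for g, owner in maintained_by.items() if owner == app}
    quads.difference_update({q for q in quads if q[3] in doomed})
    for g in doomed:
        del maintained_by[g]


quads, maintained_by = set(), {}
contact = ("resA", "rdf:type", "nco:Contact")

# Two applications add the same statement: it simply exists twice.
add_statement(quads, maintained_by, contact, "App1", "G1")
add_statement(quads, maintained_by, contact, "App2", "G2")

# Removing App1's data only drops G1; App2's copy in G2 remains.
remove_data_by_application(quads, maintained_by, "App1")
still_there = any(q[:3] == contact for q in quads)
```

Adding never has to look at who else maintains the statement, and removing by application becomes a plain "drop these graphs" operation.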
>
> ---
>
> My third proposal: considering that the modification and creation dates
> of a graph do not serve any purpose, perhaps we shouldn't store them at
> all? Unless there is a proper use case, why go through the added effort?
> Normally, storing a couple of extra properties isn't a big deal, but if we
> do not store them, then we can effectively kill the need to create a new
> graph for each data-management command.
>
> With this one would just need 1 graph per application, in which all of its
> data would reside. We wouldn't need to check for empty graphs or anything.
> It would also reduce the number of triples in a database, which can get
> alarmingly high.
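As a sketch of what "1 graph per application" could look like (again plain Python with made-up names, not actual Nepomuk code): the graph is looked up or created from the application alone, so the number of graphs is bounded by the number of applications, no matter how many commands run.

```python
class AppGraphs:
    """Keeps one graph per application, created lazily on first write."""

    def __init__(self):
        self.graph_of = {}   # application -> graph URI (hypothetical scheme)
        self.quads = set()   # (subject, predicate, object, graph)

    def store(self, app, triple):
        # Reuse the application's graph if it exists, else create it once.
        graph = self.graph_of.setdefault(app, f"graph:{app}")
        self.quads.add((*triple, graph))

    def graph_count(self):
        return len(self.graph_of)


store = AppGraphs()
# Three separate commands from App1, one from App2.
store.store("App1", ("resA", "rdf:type", "nco:Contact"))
store.store("App1", ("resA", "nco:fullname", "Alice"))
store.store("App1", ("resB", "rdf:type", "nco:Contact"))
store.store("App2", ("resA", "rdf:type", "nco:Contact"))
```

Since no per-command graphs are created, there are no per-graph metadata triples to write and no empty graphs to clean up afterwards.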
>
> This seems like a pretty good system to me, which provides all the
> benefits and none of the losses.
>
> What do you guys think?
>
> [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts
>
> --
> Vishesh Handa
>
>

-- 
Vishesh Handa