[Nepomuk] [RFC] Simplify Nepomuk Graph handling
Sebastian Trüg
trueg at kde.org
Thu Jan 3 09:52:30 UTC 2013
I totally agree. In fact, have a look at
DataManagementModel::Private::m_ignoreCreationDate.
As for the duplicate statements: that is a good point also. It greatly
simplifies the removal of data per app.
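
To illustrate why, here is a minimal sketch in Python. It is a toy in-memory model, not Nepomuk code: the quad set, the maintained_by mapping and the remove_data_by_app() helper are all hypothetical names made up for this example. It assumes each graph is maintained by exactly one application, as in the "After" layout quoted below.

```python
# Toy model, not Nepomuk code. A store is a set of
# (subject, predicate, object, graph) quads; maintained_by maps each graph
# to the single application that maintains it (an assumption of this sketch).

def remove_data_by_app(store, maintained_by, app):
    """Return the store and graph ownership left after dropping app's graphs."""
    doomed = {g for g, owner in maintained_by.items() if owner == app}
    remaining = {quad for quad in store if quad[3] not in doomed}
    owners = {g: o for g, o in maintained_by.items() if g not in doomed}
    return remaining, owners


# Both applications stated that resA is a contact, so the statement is
# duplicated: once in App1's graph and once in App2's graph.
store = {
    ("resA", "rdf:type", "nco:Contact", "G1"),
    ("resA", "rdf:type", "nco:Contact", "G2"),
    ("resA", "nco:fullName", "Alice", "G2"),
}
maintained_by = {"G1": "App1", "G2": "App2"}

store_after, owners_after = remove_data_by_app(store, maintained_by, "App2")

# Statements with the graph stripped off, to see what is still known.
statements_after = {(s, p, o) for s, p, o, _ in store_after}
```

Removing App2's data is then just a matter of dropping its graphs. The shared statement survives through App1's copy, and no statement has to be moved between graphs.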
On 01/03/2013 10:21 AM, Vishesh Handa wrote:
> Ping?
>
>
> On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa <me at vhanda.in
> <mailto:me at vhanda.in>> wrote:
>
> Hey everyone
>
> This is another one of those big changes that I have been thinking
> about for quite some time. This email has a number of different
> proposals, all of which add up to create this really simple system,
> with the same functionality.
>
> Graph Introduction
> ---------------------------
>
> For those of you who don't know about graphs in Nepomuk, please read
> [1]. It serves as a decent introduction to where Graphs are used.
> Currently, we create a new graph for each data-management command.
>
> What does this provide?
> ----------------------------------
>
> We currently use graphs for 2 features -
>
> 1. Remove Data By Application
> 2. Backup
>
> What all information do we store?
> ------------------------------------------------
>
> 1. Creation date of each graph
> 2. Modification date of each graph ( Always the same as creation date )
> 3. Type of the graph - Normal or Discardable
> 4. Maintained by which application
>
> (1) and (2) currently serve us no purpose. They never have. They are
> just things that are nice to have. I cannot even name a single use
> case for them, except that they let us see when a statement was added.
>
> (3) is what powers Nepomuk Backup. We do not back up everything, only
> the data that is not discardable. So, stuff like indexing
> information is not saved. Currently this system is slightly broken,
> as one cannot simply filter on non-discardable data: that also
> includes stuff like the ontologies. So the queries get quite
> complicated. Plus, one still needs to save certain information from
> the discardable data, such as the rdf:type, nao:created, and
> nao:lastModified. Hence, the query becomes even more complex. On my
> machine, with some 10 million triples, creating a backup takes a
> sizeable amount of time (over 5 minutes) and a lot of CPU.
>
> Current query -
>
> select distinct ?r ?p ?o ?g where {
> graph ?g { ?r ?p ?o. }
> ?g a nrl:InstanceBase .
> FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
> FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
> } ORDER BY ?r ?p
>
> + Requires additional queries to backup the type, nao:lastModified,
> and nao:created.
>
> Maybe it would be simpler if we did not make this distinction?
> Instead we back up everything (really fast), and just discard the
> data for files that no longer exist during restoration? It would
> save users the trouble of re-indexing their files as well. More
> importantly, it (might) save them the trouble of re-indexing their
> email, which is a very slow process.
>
> Also, right now one can only set the graph via StoreResources, and
> not via any other Data Management command.
>
> ----
>
> (4) is the most important reason for graphs. It allows us to know
> which application added the data. Stuff starts to get a little
> messy when two applications add the same data. In that case those
> statements need to be split out of their existing graph and a new
> graph needs to be created which will be maintained by both
> applications. This is expensive.
>
> I'm proposing that instead of splitting the statement out of the
> existing graph, we just create a duplicate of the statement with a
> new graph, containing the other application.
>
> Eg -
>
> Before -
>
> graph <G1> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G1> nao:maintainedBy <App2> .
>
> After -
>
> graph <G1> { <resA> a nco:Contact . }
> graph <G2> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G2> nao:maintainedBy <App2> .
>
> The advantage of this approach is that it would simplify some of the
> extremely complex queries in the DataManagementModel. That would
> result in a direct performance upgrade. It would also solve some of
> the ugly transaction problems we have when two commands access the
> same statement and one command removes the data in order to move it
> to another graph. This has happened to me a couple of times.
>
> ---
>
> My third proposal: considering that the modification and creation
> dates of a graph do not serve any purpose, perhaps we shouldn't
> store them at all? Unless there is a proper use case, why
> go through the added effort? Normally, storing a couple of extra
> properties isn't a big deal, but if we do not store them, then we
> can effectively kill the need to create a new graph for each data
> management command.
>
> With this one would just need 1 graph per application, in which all
> of its data would reside. We wouldn't need to check for empty graphs
> or anything. It would also reduce the number of triples in a
> database, which can get alarmingly high.
>
> This seems like a pretty good system to me, which provides all the
> benefits and none of the losses.
>
> What do you guys think?
>
> [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts
>
> --
> Vishesh Handa
>
>
>
>
> --
> Vishesh Handa
>
>