[Nepomuk] [RFC] Simplify Nepomuk Graph handling

Christian Mollekopf chrigi_1 at fastmail.fm
Thu Jan 3 14:02:21 UTC 2013


Hey,

I agree about creation date and modification date; I can't think of any use-
case for them. I'm not sure about the other two though:

== Type of graph ==
Isn't this the primary distinction between user-created content and machine-
generated content? We're dealing with massive amounts of data and should in 
theory be able to maintain the db forever, but must at the same time 
guarantee to never delete any user-generated data, so this distinction seems 
crucial to me.

Maybe there are other indicators we could use, such as PIMO data in theory 
always being user-generated content, or all resources indexed by the 
akonadi feeder having a special type, but I think we should at least have a 
concept for that problem. We could also add a special property to each 
discardable resource (maybe in a separate graph); since this information 
wouldn't be required in normal operation it wouldn't complicate the queries, 
but we'd still have it available for cleanup tasks.

== Maintained by, duplicating instead of shared graph ==
Normally I'd say this sounds like a very bad idea: it duplicates the data 
and IMO defeats one of the primary purposes of the semantic desktop, namely 
tight integration between applications through the shared underlying 
datastore. So it probably depends on where you want to go with it:

* If we want nepomuk as a read/write storage layer (which would imply having 
writeback services to various storage backends such as akonadi), it would IMO 
be a step in the wrong direction.

* If we keep nepomuk as a mostly read only cache of data, the duplication 
might not be much of a problem.

There's one thing I don't really understand yet though:

A feeder creates a resource <res>, and then two applications add annotations 
<a1> and <a2>: in which graph will which statement end up?

Like this?

graph <Gfeeder> { <res> a rdfs:Resource }
graph <Gapp1> { <res> pimo:related <a1> }
graph <Gapp2> { <res> pimo:related <a2> }

I don't really get how collaborative editing of annotations is supposed to 
work if the graphs are separated; what happens if both applications operate on 
the same annotation? It seems to me that it would no longer be possible.
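To make the concern concrete (graph, resource, and annotation names taken 
from the example above): reading stays easy, but any edit would first have 
to discover which graph holds the statement.

```sparql
# Reading works regardless of which graph a statement lives in:
SELECT ?a WHERE { <res> pimo:related ?a . }

# But before modifying or removing an annotation, an application would
# first have to find out which graph(s) contain the statement:
SELECT ?g WHERE { GRAPH ?g { <res> pimo:related <a1> . } }
```

So every write path grows an extra graph-lookup step, and it's unclear to me 
which graph a modification by a third application should land in.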

Some further comments inline.

On Sunday 16 December 2012 03.03:51 Vishesh Handa wrote:
> Hey everyone
> 
> This is another one of those big changes that I have been thinking about
> for quite some time. This email has a number of different proposals, all of
> which add up to create this really simple system, with the same
> functionality.
> 
> Graph Introduction
> ---------------------------
> 
> For those of you who don't know about graphs in Nepomuk. Please read [1].
> It serves as a decent introduction to where Graphs are used. Currently, we
> create a new graph for each data-management command.
> 
> What does this provide?
> ----------------------------------
> 
> We currently use graphs for 2 features -
> 
> 1. Remove Data By Application
> 2. Backup
> 
> What all information do we store?
> ------------------------------------------------
> 
> 1. Creation date of each graph
> 2. Modification date of each graph ( Always the same as creation date )
> 3. Type of the graph - Normal or Discardable
> 4. Maintained by which application
> 
> (1) and (2) currently serve us no purpose. They never have. They are just
> things that are nice to have. I cannot even name a single use case for it.
> Except that they let us see when a statement was added.
> 
> (3) is what powers Nepomuk Backup. We do not backup everything but only
> backup the data that is not discardable. So, stuff like indexing
> information is not saved. Currently this system is slightly broken as one
> cannot just filter on the basis of non-Discardable data, as that includes
> stuff like the Ontologies. So the queries get quite complicated. Plus, one
> still needs to save certain information from the Discardable Data such as
> the rdf:type, nao:creation, and nao:lastModified. 

One could think about just not backing up this data; it doesn't seem all 
that important to me, no? At least as long as the backup is really just used 
as that (it might be more important to keep this data if the backup is also 
used for synchronization).

> Hence, the query becomes
> even more complex. For my machine with some 10 million triples, creating a
> backup takes a sizeable amount of time ( Over 5 minutes ), with a lot of
> cpu execution.
> 
> Current query -
> 
> select distinct ?r ?p ?o ?g where {
> graph ?g { ?r ?p ?o. }
> ?g a nrl:InstanceBase .
> FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
> FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
> } ORDER BY ?r ?p
> 
> + Requires additional queries to backup the type, nao:lastModified, and
> nao:created.
> 
> Maybe it would be simpler if we did not make this distinction? Instead we
> backup everything (really fast), and just discard the data for files that
> no longer exist during restoration? It would save users the trouble of
> re-indexing their files as well. More importantly, it (might) save them the
> trouble of re-indexing their email, which is a very slow process.
> 

That might make sense indeed. However, I still think it's important to keep 
the distinction available for cleanup maintenance tasks, which we can run 
less often than the backup (maybe even just manually), so we wouldn't have 
to worry too much about performance there.
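For such a maintenance task, the existing nrl:DiscardableInstanceBase 
distinction would already be enough; a sketch of the selection (not tuned 
for performance, and the check whether the underlying file still exists 
would happen in the maintenance service, not in SPARQL):

```sparql
# Enumerate resources that occur only in discardable graphs.
SELECT DISTINCT ?r WHERE {
  GRAPH ?g { ?r ?p ?o . }
  ?g a nrl:DiscardableInstanceBase .
  # Skip resources that also have statements in a non-discardable graph:
  FILTER NOT EXISTS {
    GRAPH ?g2 { ?r ?p2 ?o2 . }
    FILTER NOT EXISTS { ?g2 a nrl:DiscardableInstanceBase . }
  }
}
```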

> Also, right now one can only set the graph via StoreResources, and not via
> any other Data Management command.
> 
> ----
> 
> (4) is the most important reason for graphs. It allows us to know which
> application added the data. Stuff starts to get a little messy when two
> applications add the same data. In that case those statements need to be
> split out of their existing graph and a new graph needs to be created which
> will be maintained by both applications. This is expensive.
> 
> I'm proposing that instead of splitting the statement out of the existing
> graph, we just create a duplicate of the statement with a new graph,
> containing the other application.
> 
> Eg -
> 
> Before -
> 
> graph <G1> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G1> nao:maintainedBy <App2> .
> 

Does this effectively lead to one graph per resource that is maintained by 
more than one application? That seems like an unnecessary indirection; 
wouldn't it be simpler and more efficient to just have:

<resA> a nco:Contact
<resA> nao:maintainedBy <App1>
<resA> nao:maintainedBy <App2>

The data is anyway stored as Subject-Predicate-Object-Graph quadruples, so 
statement-count-wise that should come out the same, but it wouldn't be 
necessary to "move" resources to a shared graph.
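With nao:maintainedBy attached to the resource itself, "remove data by 
application" would also reduce to a single update; a sketch (assuming, as a 
safety measure, that we only delete resources no other application maintains):

```sparql
# Hypothetical: nao:maintainedBy on the resource instead of on the graph.
DELETE { ?r ?p ?o . }
WHERE {
  ?r nao:maintainedBy <App1> .
  # Only delete if App1 is the sole maintainer of the resource:
  FILTER NOT EXISTS {
    ?r nao:maintainedBy ?other .
    FILTER ( ?other != <App1> )
  }
  ?r ?p ?o .
}
```

If other maintainers exist, the command would instead just drop the 
resource's nao:maintainedBy link to <App1>.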

> After -
> 
> graph <G1> { <resA> a nco:Contact . }
> graph <G2> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1>
> <G2> nao:maintainedBy <App2> .
> 

See my questions and concerns about this at the beginning.

> The advantage of this approach is that it would simplify some of the
> extremely complex queries in the DataManagementModel. That would result in
> a direct performance upgrade. It would also solve some of the ugly
> transaction problems we have when 2 commands are accessing the same statement,
> and one command removes the data in order to move it to another graph. This
> has happened to me a couple of times.
> 

Isn't this solvable by placing locks at the right places, or by using 
transactions? There shouldn't be race conditions in a db if the isolation 
level is high enough. 

> ---
> 
> My third proposal: considering that the modification and creation date of
> a graph do not serve any benefit, perhaps we shouldn't store them at all?
> Unless there is a proper use case, why go through the added effort?
> Normally, storing a couple of extra properties isn't a big deal, but if we
> do not store them, then we can effectively kill the need to create new
> graph for each data management command.

I lost you here. Why would we need a new graph per command? I thought there 
is one per application?

> 
> With this one would just need 1 graph per application, in which all of its
> data would reside. We wouldn't need to check for empty graphs or anything.
> It would also reduce the number of triples in a database, which can get
> alarmingly high.
> 
> This seems like a pretty good system to me, which provides all the benefits
> and none of the losses.
> 
> What do you guys think?
> 
> [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts

