[Nepomuk] [Soprano-devel] Refactoring for Soprano 3

Tue Oct 20 13:29:15 CEST 2009

On Friday 16 October 2009 00:32:01 Greg Beauchesne wrote:
> Sebastian Trüg wrote:
> > On Thursday 15 October 2009 01:23:18 Greg Beauchesne wrote:
> >> I will say as a side note, though, that I think the
> >> statementAdded()/statementRemoved() calling pattern should be revised a
> >> bit. One thing that sort of bugged me about the existing Soprano 2
> >> models is that statementsAdded()/statementsRemoved() is called when any
> >> statement is added or removed, and statementAdded()/statementRemoved()
> >> is only called when known statements are added or removed. This makes it
> >> difficult to write code that intelligently performs a small update for
> >> statementAdded() and a large update for statementsAdded(), because
> >> statementsAdded() is always emitted.
> >
> > I fully agree. This always bugged me as the signals are pretty useless as
> > is. I only use them cached, i.e. through Soprano::SignalCacheModel which
> > is not pretty in itself but at least does not bring down the whole system
> > performance-wise.
> > That is why I thought about having a system where clients can register
> > for changes based on statement patterns. That would be convenient but
> > probably much harder to implement.
> 
> Hmm... I can see how that would be useful if you have a lot of
> listeners, although I think most of my listeners can reject statements
> they don't care about pretty quickly. Maybe something taking your
> statements pattern thing a little further, wherein the system allows
> models to report partial update notifications (e.g. "A statement or
> statements with the obj:blah subject were added, but that's all the
> information I have"), as opposed to the all-or-nothing reporting that
> currently exists.

So you mean keep the current signal-based system but make it simpler for the 
backends?
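
A minimal sketch of the partial-notification idea from the quoted text, to make 
the question concrete. All names here (Pattern, ChangeNotifier, subscribe, 
reportAdded) are hypothetical and not part of Soprano; an empty string is 
assumed to mean "unknown / matches anything". A backend reports only the fields 
it knows, and clients are notified only if their registered pattern could 
overlap with the report.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration; none of these names exist in Soprano.
// An empty field stands for "unknown" (in a report) or "any" (in an interest).
struct Pattern {
    std::string subject;
    std::string predicate;
    std::string object;
};

// True if both patterns could describe the same statement: every field
// either agrees or is left open on at least one side.
inline bool mayOverlap(const Pattern& a, const Pattern& b) {
    auto fieldOk = [](const std::string& x, const std::string& y) {
        return x.empty() || y.empty() || x == y;
    };
    return fieldOk(a.subject, b.subject)
        && fieldOk(a.predicate, b.predicate)
        && fieldOk(a.object, b.object);
}

class ChangeNotifier {
public:
    typedef std::function<void(const Pattern&)> Callback;

    // A client registers interest in additions that may match `interest`.
    void subscribe(const Pattern& interest, Callback cb) {
        assert(cb);  // an empty callback could not be invoked later
        m_listeners.push_back(Listener{interest, cb});
    }

    // A backend reports whatever it knows about an addition, e.g. only the
    // subject. Returns the number of listeners that were notified.
    int reportAdded(const Pattern& known) const {
        int notified = 0;
        for (const Listener& l : m_listeners) {
            if (mayOverlap(l.interest, known)) {
                l.callback(known);
                ++notified;
            }
        }
        return notified;
    }

private:
    struct Listener {
        Pattern interest;
        Callback callback;
    };
    std::vector<Listener> m_listeners;
};
```

With this shape, a backend that knows nothing at all would report an empty 
Pattern, which reaches every listener and so degrades to today's 
statementsAdded() behaviour.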

> >> How do you handle multiple distinct (i.e. structured; not in a list or
> >> set) pieces of data for the same predicate? Force a generated URI?
> >> Concatenate string literals? Or just split them into multiple predicates
> >> on the same subject?
> >
> > The ontologies are simply designed in a way to prevent this. And
> > everything that normally would get a blank node gets a random URI, yes.
> > This decision was taken in the course of the Nepomuk ontology design, a
> > phase where I was not even part of the project.
> 
> From what I gather, Nepomuk deals with a lot of persistent/cached data,
> right? I guess that would make sense that you then have a persistent URI
> to refer back to.

There are still some instances that would make sense to represent as 
blank nodes. One example is address instances related to contacts.
Maybe if we improve the blank node situation in Soprano 3 we can also think 
about dropping that restriction.

> > I see.
> > So two nodes can even be equal if their private/native data is not.
> 
> Exactly -- with the exception of blank nodes. But for all other node
> types, the private data is for optimization purposes only, and should
> have zero effect on the actual functionality.

OK.
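
A small sketch of that equality rule, assuming a hypothetical RdfNode struct 
with a backend-private nativeHandle (neither is a Soprano name). Only URI-like 
nodes are modelled; blank nodes, the exception mentioned above, are left out. 
The handle is used purely as a shortcut and never makes two nodes unequal.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration only; not Soprano's Node class.
struct RdfNode {
    std::string uri;           // the RDF data that defines the node
    const void* nativeHandle;  // backend-private, optimization only; may be null
};

inline bool operator==(const RdfNode& a, const RdfNode& b) {
    // Shortcut: the same non-null handle means the same backend object,
    // so the RDF data does not need to be compared (assumption of this sketch).
    if (a.nativeHandle && a.nativeHandle == b.nativeHandle) {
        return true;
    }
    // Otherwise the private data is ignored and only the RDF data counts.
    return a.uri == b.uri;
}
```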

> > Right, I certainly did not think of this case. I always think in terms of
> > SPARQL queries and almost never use listStatements. But for model
> > federations that is another story.
> >
> > So to recap: we would have the same Node interface as we have now +
> > something like NativeNodeData which can be set or not. If a node is
> > created from a NativeNodeData its internal RDF data
> > (QUrl/QString/LiteralValue) is not created until requested.
> >
> > In addition the internal RDF data *can* be part of the pool, but need
> > not be? And if it is then it is done via a hash of the internal RDF data?
> 
> A node in the pool is assigned a unique (machine-word sized) identifier.
> The pointer to the node data seems to be the simplest way to come up
> with this, but it doesn't really matter. But yes, the lookup would
> probably be a hash table.
>
> > But at what point does a Node become part of the pool? Does it need to be
> > set explicitly or do we simply cache all nodes by default? The latter
> > would mean calling NodePool::instance()->nodeData( uri/literal/id ) in
> > the Node constructor.
> 
> It would be explicit, but it would probably be mostly done by Models or
> by client code that needed to store constant Node data. Since there
> would only be one pool, the method to pool a Node could be on Node
> itself (e.g. Node.pool(), much like Java's String.intern()). The idea is
> that getting a pooled version of a Node from non-pooled data is a little
> more expensive than just creating a Node, but that it can save you time
> with comparisons and lookups in the long run. (Getting a pooled version
> of an already-pooled node would be quick because those nodes would be
> flagged as such, eliminating the need for a second look-up.)

Sounds good.
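
For reference, a rough sketch of the explicit pooling described above, in the 
spirit of Java's String.intern(). The class layout, the single string value and 
the function-local hash table are assumptions made for illustration, not a 
proposed Soprano API. The table is not protected against concurrent access, and 
pooled data is never released, matching the "pooled status is persistent" idea.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch; only URI-like nodes identified by one string.
struct NodeData {
    std::string value;
    bool pooled;
};

class Node {
public:
    explicit Node(const std::string& value)
        : d(std::make_shared<NodeData>(NodeData{value, false})) {}

    // Returns the pooled version of this node.
    Node pool() const;

    bool isPooled() const { return d->pooled; }
    const std::string& value() const { return d->value; }

    // Machine-word sized identifier; unique per value only for pooled nodes.
    const void* id() const { return d.get(); }

    bool operator==(const Node& other) const {
        if (d == other.d) return true;                   // shared data
        if (d->pooled && other.d->pooled) return false;  // distinct pool entries
        return d->value == other.d->value;               // fall back to the data
    }

private:
    explicit Node(std::shared_ptr<NodeData> data) : d(std::move(data)) {}
    std::shared_ptr<NodeData> d;
};

inline Node Node::pool() const {
    if (d->pooled) return *this;  // already flagged, no second look-up
    // The single pool: a hash table keyed by the node's RDF data.
    static std::unordered_map<std::string, std::shared_ptr<NodeData>> table;
    auto it = table.find(d->value);
    if (it == table.end()) {
        auto data = std::make_shared<NodeData>(NodeData{d->value, true});
        it = table.emplace(d->value, data).first;
    }
    return Node(it->second);
}
```

Two pooled nodes compare by pointer only, which is where the saving in 
comparisons and lookups would come from; a pooled and a non-pooled node still 
compare by value.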

> So memory models, for example, would do it on addStatement(), and then
> every Node they returned from listStatements() would already be pooled.
> For other Model types, it might be just something they can fall back on
> if their private data is not present, or they might rely on it
> exclusively as a key into any internal caches they might have.
>
> The way I see it, the use of each of these boils down to this:
> 
> 1. Pooled: NO, private data: NO - Fastest of all types to create, but no
> optimization in lookups. The default for client code.
> 
> 2. Pooled: YES, private data: NO - Some speed-up when used with any Model
> that takes advantage of pooled data. Pooled status is persistent, so
> once pooled, there is never a need to un-pool a node. Client code can
> create this for Nodes that are going to be reused often.
> 
> 3. Pooled: NO, private data: YES - Node data may or may not be
> immediately realized. This type is the fastest when used with single
> Models. When Model boundaries are crossed, the optimization data is
> lost/ignored, and the Node is just treated like #1. Client code does not
> create this directly, but if it knows the target Model in advance, it
> could ask the Model to attach its private data.

I don't really get the last sentence. Why and how would a client ask a Model 
to attach its internal data? I thought that a backend like redland would 
simply always use its internal node representation for lazy conversion of the 
node data.
The way I see it there is only one way to get internal data into a node: using 
a dedicated constructor or operator=.
The only disadvantage of the model storing its internal data is a little more 
memory being used.
Some backends might need to keep track of all the nodes they created, though, 
since in theory the model could be deleted before all of its nodes, leaving 
dangling pointers. That would be additional overhead that is not necessary in 
many situations, for example in Nepomuk, where pretty much everything is done 
via queries and internal data thus has no advantage.
But then there would have to be a configuration parameter or something similar 
that a client could use to enable/disable the use of internal node data.
And this would make sense on a generic Model/Repository level.

> 4. Pooled: YES, private data: YES - Fastest when used with the Model
> that created it, but the pooled data survives across Model boundaries
> and thus can still be faster than #1 when passed to arbitrary Models.

Cheers,
Sebastian