[Nepomuk] Metadata syncing

Artem Serebriyskiy v.for.vandal at gmail.com
Wed Jul 14 14:43:14 CEST 2010


This is part of the log of a conversation about metadata syncing between me
and Vishesh. The discussion is about two types of primary keys for syncing
resources between different sources.

All comments are welcome

<--- some more messages>
You have 2 models m1 & m2. You create a primary key for a resource r1 from
m1. The primary key consists of all the identifying properties.
  Then you try to find a similar resource in m2.
3:33 PM you call a function like FindMatch( primary key, m2 )
3:34 PM when creating the query it discovers that one of the objects is a
resource uri whose identifying properties it requires.
  So it needs to ask m1 for that resource's identifying properties as well,
but it has no knowledge of m1.
3:35 PM The solution is for the primary key to contain the identifying
properties of the resource in question and the identifying properties of all
other resources it is connected to.
  I've been a little reluctant to do that.
  But I think there is no other solution
3:36 PM Do you get what I mean?
3:38 PM me: I think yes. But I am not sure. Why not add the source model as a
parameter to FindMatch?
3:39 PM Then you can use a simple recursive algorithm.
 Vishesh: Yea. That's the other solution. But that doesn't work with
backupsync because the other model is on another system.
3:40 PM Plus the whole idea of the Primary Key was that once it has been
created, it becomes model independent
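A rough sketch of the recursive idea just described (the toy Model type and all
names here are my own illustration, not the real Nepomuk/Soprano API): when an
identifying property's value is itself a resource URI, the match recurses into
the source model, which is why FindMatch needs that model as a parameter.

```cpp
#include <map>
#include <string>

// Toy model: resource URI -> (property -> literal value or resource URI).
using Model = std::map<std::string, std::map<std::string, std::string>>;

bool isResourceUri(const std::string& v) { return v.rfind("res:", 0) == 0; }

// Find a resource in 'target' whose identifying properties match those of
// 'uri' in 'source'. Resource-valued properties are matched recursively.
// (Cycles between resources are not handled in this sketch.)
std::string findMatch(const std::string& uri, const Model& source,
                      const Model& target)
{
    auto it = source.find(uri);
    if (it == source.end()) return {};
    for (const auto& [candUri, candProps] : target) {
        bool ok = true;
        for (const auto& [prop, value] : it->second) {
            auto c = candProps.find(prop);
            if (c == candProps.end()) { ok = false; break; }
            if (isResourceUri(value)) {
                // Recurse: the linked resource must match the candidate's link.
                if (findMatch(value, source, target) != c->second) { ok = false; break; }
            } else if (c->second != value) { ok = false; break; }
        }
        if (ok) return candUri;
    }
    return {};
}
```

This is exactly the variant that breaks for backups: the recursion keeps asking
the source model, which must therefore still be reachable.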
3:44 PM me: 2 types of primary keys? BoundPrimaryKey - a key that requires a
pointer to a model, and UnboundPrimaryKey - a totally independent one that is
a serialization of the main resource's identifying properties, the resources
the main one is connected to, etc.?
  The unbound one looks more like a serialization of a subset of the RDF model.
 Vishesh: yea it is.
3:45 PM It is a serialization represented in a compact form.
  If we have 2 kinds of keys, that would mean additional functions for matching
both kinds of keys
  = More code
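For concreteness, the two kinds of keys could be shaped roughly like this
(hypothetical C++ sketch; these are not the actual class names or layouts in
nepomuk/backupsync):

```cpp
#include <map>
#include <string>

// A BoundPrimaryKey records only the identifying properties of one resource.
// Resource-valued properties stay as bare URIs, so resolving them requires
// the source model the key is bound to.
struct BoundPrimaryKey {
    std::string resourceUri;
    std::map<std::string, std::string> identifyingProperties; // value may be another URI
};

// An UnboundPrimaryKey is model-independent: it also carries the identifying
// properties of every resource reachable from the root, i.e. it serializes a
// subset of the RDF model. That is what can make it huge.
struct UnboundPrimaryKey {
    std::string rootUri;
    std::map<std::string, std::map<std::string, std::string>> closure; // uri -> properties
};
```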
 me: Yes. So if you have access to the model, you use BoundPrimaryKey. If you
don't, you use UnboundPrimaryKey.
3:46 PM Not exactly.
 Vishesh: In theory yes. But then we have to maintain separate functions for
each key, which don't have much in common.
3:47 PM me: I was going to ask trueg about an in-memory Soprano::Backend. If
there is one, then you can just deserialize the UnboundPK to a model, convert
it to a BoundPK, and call FindMatch with (BoundPK, temporary model).
3:48 PM Vishesh: There is an in-memory Soprano::Backend
  we use it while loading the ontologies.
  Check out the Ontology loader class, if you're interested
  It has been moved to services/nepomukstorage
3:49 PM me: Oh, thanks!
 Vishesh: I'm still not convinced that having 2 kinds of keys is the right
approach
  Only unbound keys might be better, but then they would be huge.
3:50 PM me: I am not convinced either.
  We are just discussing and trying to find a good solution.
 Vishesh: Yes. It's good we're discussing it.
3:51 PM me: Yes. In some cases it will be equal to the size of the whole RDF
storage. That's why I think that BoundKeys are better.
3:53 PM Vishesh: But when we are trying to sync it ( or identify it ) we
would need all that data
  So, it's just a question of getting it in one go or slowly by querying
multiple times
3:54 PM me: querying multiple times will be faster.
3:55 PM Caches will start working
  Squid (maybe), in case of syncing with an Internet-accessible database.
 Vishesh: Yes. But I need to have all the data for BackupSync; otherwise we
will end up duplicating code from backupsync that can't really be merged.
 me: etc.
3:56 PM Yes. I see.
3:57 PM wait pls.
  I am no so sure.
3:58 PM M1 and M2 are 2 models
  and we have synced them at 00:00:00 on Wed the 14th.
3:59 PM Now I add a new resource to M1. This resource is connected to some
other resources, and so on. Let this resource be complicated enough that its
UnboundPK is big.
4:00 PM Now we start syncing.
  1) Syncing with UnboundPK
  * Create UnboundPK - it is big
  * Send this UnboundPK through network
4:01 PM ** network is bluetooth and we are in the outer space. So connection
is slow.
  * Receive this UnboundPK
4:02 PM * Unpack it to the local model [optional; maybe there is some other
way of syncing with the help of the UnboundPK]
  * Sync
  * Profit
  Properties:
4:03 PM A lot of data to send, and a lot of memory to store the unpacked one
 Vishesh: yea
 me: 2) Syncing with BoundPK
 Vishesh: I see what you mean
 me: * use a recursive algorithm
 Vishesh: Okay. Stop
 me: sorry
 Vishesh: sorry?
  I get what you're saying, but what if the user doesn't have access to the
other model once the key has been created
4:04 PM which is the case with backups
4:05 PM me: Then he should use the UnboundPK? Maybe I understood your
question wrong, didn't I?
4:06 PM Vishesh: Uhh. A little bit. I get that in some cases unbound is
better than bound and vice versa
  but if we support both we have a large amount of code duplication.
4:07 PM which is something I'm not too fond of.
4:08 PM me: I don't know the internals of your service well enough. But why
is converting the UnboundPK to a pair <in-memory model, BoundPK> a bad idea?
4:10 PM Vishesh: Hmm
  I might be able to simply convert it..
  yea. I don't have to do it in process.
  I didn't think of that
  You're right
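The conversion just agreed on could be as simple as this sketch (Model is a toy
stand-in for an in-memory Soprano model, and every name here is hypothetical):
unpack the unbound key's serialized sub-model into a temporary model, then run
the single bound-key matching path against it.

```cpp
#include <map>
#include <string>
#include <utility>

// Toy stand-in for an in-memory model: resource URI -> (property -> value).
using Model = std::map<std::string, std::map<std::string, std::string>>;

struct UnboundPrimaryKey {
    std::string rootUri;
    Model closure; // identifying properties of the root and all linked resources
};

struct BoundPrimaryKey {
    std::string resourceUri; // only meaningful together with a model
};

// Deserialize the key's closure into a temporary model and hand back a bound
// key pointing into it; from here only the bound-key code path is needed.
std::pair<Model, BoundPrimaryKey> unpack(const UnboundPrimaryKey& key)
{
    return {key.closure, BoundPrimaryKey{key.rootUri}};
}
```

In Nepomuk the temporary model could be backed by the in-memory
Soprano::Backend mentioned above, the one used while loading the ontologies.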
4:12 PM me: I think that you (or I, as you prefer) should send a copy of this
discussion to trueg, or maybe to the mailing list. Maybe both of us are
missing something important.


-- 
Sincerely yours,
Artem