[Nepomuk] Duplicates merging in DataManagementModel::storeResources

Christian Mollekopf chrigi_1 at fastmail.fm
Mon Oct 31 16:04:54 UTC 2011


Hey Vishesh/Sebastian,

As for the algorithm:

The simplest approach is just to run the current merging code multiple times.
Currently you're doing only one run, where you check for duplicates and then
replace the corresponding pseudo uris ("_:gqe" style). Now all you need to do
is to run the same algorithm again, so that the resources which had a pseudo
uri replaced are evaluated again by the algorithm. This should be done until
no uris are replaced anymore. I think that would serve our needs, at least for
now, and should be fairly easy to implement if you understand your code =)
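
To make the "run it until nothing changes" idea concrete, here is a minimal,
self-contained sketch. It does not use the real storeResources code: Res,
mergeOnce, mergeUntilStable and exampleGraph are hypothetical names, resources
are reduced to a pseudo uri plus (property, value) string pairs, and "duplicate"
simply means "identical property set", which is only my reading of the current
hashing approach.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Props = std::set<std::pair<std::string, std::string>>;

// Hypothetical stand-in for SimpleResource: a pseudo uri plus its properties.
struct Res {
    std::string uri;
    Props props;
};

// One pass: drop resources whose properties equal those of an earlier one and
// rewrite references to the dropped uris. Returns true if any uri was replaced.
bool mergeOnce(std::vector<Res>& graph)
{
    std::map<Props, std::string> seen;            // properties -> uri that is kept
    std::map<std::string, std::string> replaced;  // dropped uri -> kept uri
    std::vector<Res> kept;
    for (const Res& r : graph) {
        auto it = seen.find(r.props);
        if (it == seen.end()) {
            seen.emplace(r.props, r.uri);
            kept.push_back(r);
        } else {
            replaced[r.uri] = it->second;
        }
    }
    // Replace every value that names a dropped uri by the uri that was kept.
    for (Res& r : kept) {
        Props rewritten;
        for (const auto& p : r.props) {
            auto hit = replaced.find(p.second);
            rewritten.emplace(p.first, hit == replaced.end() ? p.second : hit->second);
        }
        r.props = rewritten;
    }
    graph = kept;
    return !replaced.empty();
}

// Run the pass until no uri is replaced anymore; returns the number of passes
// that actually merged something.
int mergeUntilStable(std::vector<Res>& graph)
{
    int passes = 0;
    while (mergeOnce(graph))
        ++passes;
    return passes;
}

// Same shape as the failing test: two contacts, each with its own email copy.
std::vector<Res> exampleGraph()
{
    const Props email = {{"rdf:type", "nco:EmailAddress"},
                         {"nco:emailAddress", "email@foo.com"}};
    return {
        {"_:e1", email},
        {"_:c1", {{"rdf:type", "nco:Contact"}, {"nco:fullname", "Spiderman"},
                  {"nco:hasEmailAddress", "_:e1"}}},
        {"_:e2", email},
        {"_:c2", {{"rdf:type", "nco:Contact"}, {"nco:fullname", "Spiderman"},
                  {"nco:hasEmailAddress", "_:e2"}}},
    };
}
```

With exampleGraph() the first pass merges the two emails and the second pass
merges the two contacts, because only then do their hasEmailAddress values
agree. A single run stops after the first step, which is the behaviour I'm
seeing in the feeders.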

Of course it would be nice to have the merging done by the resource merger, 
but I don't need that for now.

If you still want to go the resource merger route, imo you only have to
implement a post-order traversal of the resource graph as described in my first
mail; the rest would be straightforward.
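
A rough sketch of what I mean by the post-order traversal, again with made-up
types rather than the real ResourceMerger: Graph, Store, storeOrMerge and
storePostOrder are hypothetical, the "merger" just merges resources with
identical resolved properties, and the graph is assumed to be free of cycles
(a cycle would recurse forever in this version).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Props = std::set<std::pair<std::string, std::string>>;
using Graph = std::map<std::string, Props>;  // pseudo uri -> properties

// Hypothetical stand-in for the resource merger: a resource whose resolved
// properties match an already stored one is merged into it.
struct Store {
    std::map<Props, std::string> byProps;  // resolved properties -> stored uri
    std::vector<std::string> stored;       // uris that were actually stored

    std::string storeOrMerge(const std::string& uri, const Props& props)
    {
        auto res = byProps.emplace(props, uri);
        if (res.second)
            stored.push_back(uri);
        return res.first->second;  // uri the resource ended up under
    }
};

// Store everything `uri` refers to first (the leaves), then `uri` itself, so
// that its references already point to the merged uris when it is identified.
std::string storePostOrder(const Graph& graph, const std::string& uri,
                           Store& store, std::map<std::string, std::string>& done)
{
    auto known = done.find(uri);
    if (known != done.end())
        return known->second;

    Props resolved;
    for (const auto& p : graph.at(uri)) {
        if (graph.count(p.second))  // value refers to another resource in the graph
            resolved.emplace(p.first, storePostOrder(graph, p.second, store, done));
        else
            resolved.emplace(p.first, p.second);
    }
    const std::string finalUri = store.storeOrMerge(uri, resolved);
    done[uri] = finalUri;
    return finalUri;
}

// Same shape as the failing test: two contacts, each with its own email copy.
Graph exampleContacts()
{
    const Props email = {{"rdf:type", "nco:EmailAddress"},
                         {"nco:emailAddress", "email@foo.com"}};
    return {
        {"_:e1", email},
        {"_:e2", email},
        {"_:c1", {{"rdf:type", "nco:Contact"}, {"nco:fullname", "Spiderman"},
                  {"nco:hasEmailAddress", "_:e1"}}},
        {"_:c2", {{"rdf:type", "nco:Contact"}, {"nco:fullname", "Spiderman"},
                  {"nco:hasEmailAddress", "_:e2"}}},
    };
}
```

Calling storePostOrder for every uri in exampleContacts() stores one email and
one contact: the second email resolves to the first before the second contact
is looked at, so no second run is needed.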

I'll be available tomorrow the whole day, so we can discuss it then if you 
like.

Thanks for your efforts and cheers,

Christian


On Monday, October 31, 2011 04:52:29 PM Vishesh Handa wrote:
> Hey Christian
> 
> Sorry for not getting to it. I'm actually not exactly sure on how to go
> about it. My approach of using hashes is blatantly wrong and only covers my
> simplistic test case. The identification should obviously be done in the
> ResourceIdentifier and then allow the resource merger to merge them. But
> I'm not sure how I could do the identification.
> 
> I think I'll think about it and discuss it with Sebastian. The moment we
> figure out how to implement it, it shouldn't take more than a couple of
> hours to implement it (with tests). It would be a nice change from all this
> studying :/
> 
> Sorry about the delay.
> 
> On Mon, Oct 31, 2011 at 4:42 PM, Sebastian Trüg <trueg at kde.org> wrote:
> > Hi Christian,
> > 
> > let's meet up this week to discuss the problem and hopefully fix it. So
> > far I stayed clear of the storeResources code but with Vishesh not
> > having much time I will dive into it.
> > 
> > Cheers,
> > Sebastian
> > 
> > On 10/31/2011 12:42 PM, Christian Mollekopf wrote:
> > > Hey,
> > > 
> > > This issue starts to get pressing, a solution is needed for 4.8.
> > > Currently the feeders are broken because of that issue.
> > > 
> > > The code in storeResources is beyond me and my attempts to fix it failed
> > > so far. So if no one fixes it there I'll have to work around the issue in
> > > the feeder code.
> > > 
> > > I don't mean to push anyone, I'd just like to know if somebody from the
> > > nepomuk team (yes vishesh I'm looking at you ;-) is going to fix this,
> > > or if I'm on my own. As said, I do understand if you currently lack the
> > > time to make this happen, just tell me.
> > > 
> > > Thanks,
> > > Christian
> > > 
> > > PS: I added the pastes before they are deleted from pastie
> > > 
> > > On Saturday, October 08, 2011 03:12:51 PM Christian Mollekopf wrote:
> > >> Hi Vishesh,
> > >> 
> > >> The duplicates merging code doesn't cut it for the feeders yet.
> > >> As far as I could track it down the problem is that I have hierarchies
> > >> of resources which need to be merged together.
> > >> I.e. I add a contact with its email address several times to the
> > >> graph. The email addresses are now correctly merged, but because the
> > >> contacts had different email uris in the first hashing run (before they
> > >> have been merged), the contacts remain duplicated.
> > >> 
> > >> Here is the test which currently fails:
> > >> http://paste.kde.org/131371/
> > > 
> > > void DataManagementModelTest::testStoreResources_duplicates2()
> > > {
> > >     // Two identical contacts, each pointing to its own copy of the
> > >     // same email address.
> > >     SimpleResource contact1;
> > >     contact1.addType( NCO::Contact() );
> > >     contact1.addProperty( NCO::fullname(), QLatin1String("Spiderman") );
> > >     contact1.addProperty( NAO::prefLabel(), QLatin1String("test") );
> > > 
> > >     SimpleResource email1;
> > >     email1.addType(NCO::EmailAddress());
> > >     email1.addProperty(NCO::emailAddress(), QLatin1String("email@foo.com"));
> > >     contact1.addProperty(NCO::hasEmailAddress(), email1.uri());
> > > 
> > >     SimpleResource contact2;
> > >     contact2.addType( NCO::Contact() );
> > >     contact2.addProperty( NCO::fullname(), QLatin1String("Spiderman") );
> > >     contact2.addProperty( NAO::prefLabel(), QLatin1String("test") );
> > > 
> > >     SimpleResource email2;
> > >     email2.addType(NCO::EmailAddress());
> > >     email2.addProperty(NCO::emailAddress(), QLatin1String("email@foo.com"));
> > >     contact2.addProperty(NCO::hasEmailAddress(), email2.uri());
> > > 
> > >     SimpleResourceGraph graph;
> > >     graph << email1 << contact1 << email2 << contact2;
> > > 
> > >     m_dmModel->storeResources( graph, "appA" );
> > >     QVERIFY(!m_dmModel->lastError());
> > > 
> > >     // Both the emails and the contacts must end up merged into one each.
> > >     int contactCount = m_model->listStatements( Node(), RDF::type(),
> > >         NCO::Contact() ).allStatements().size();
> > >     QCOMPARE( contactCount, 1 );
> > > 
> > >     int emailCount = m_model->listStatements( Node(), RDF::type(),
> > >         NCO::EmailAddress() ).allStatements().size();
> > >     QCOMPARE( emailCount, 1 );
> > > 
> > >     QCOMPARE( m_model->listStatements( Node(), NCO::fullname(), Node()
> > >         ).allStatements().size(), 1 );
> > >     QCOMPARE( m_model->listStatements( Node(), NAO::prefLabel(), Node()
> > >         ).allStatements().size(), 1 );
> > > 
> > >     QVERIFY(!haveTrailingGraphs());
> > > }
> > > 
> > > add to qtest_dms.cpp:
> > >     model.addStatement( NCO::emailAddress(), RDF::type(),
> > >                         RDF::Property(), graph );
> > >     model.addStatement( NCO::emailAddress(), RDFS::range(),
> > >                         XMLSchema::string(), graph );
> > >     model.addStatement( NCO::emailAddress(), RDFS::domain(),
> > >                         NCO::EmailAddress(), graph );
> > > 
> > >     model.addStatement( NCO::hasEmailAddress(), RDF::type(),
> > >                         RDF::Property(), graph );
> > >     model.addStatement( NCO::hasEmailAddress(), RDFS::range(),
> > >                         NCO::EmailAddress(), graph );
> > >     model.addStatement( NCO::hasEmailAddress(), RDFS::domain(),
> > >                         NCO::Contact(), graph );
> > > 
> > >     model.addStatement( NCO::EmailAddress(), RDF::type(),
> > >                         RDFS::Resource(), graph );
> > >     model.addStatement( NCO::EmailAddress(), RDF::type(),
> > >                         RDFS::Class(), graph );
> > >     model.addStatement( NCO::EmailAddress(), RDFS::subClassOf(),
> > >                         NCO::ContactMedium(), graph );
> > > 
> > >> And here's an excerpt of the debugging output which shows the problem
> > >> in the actual feeders:
> > >> http://paste.kde.org/131377/
> > > 
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:zre""<http://www.semanticdesktop.org/ontologies/2007/08/15/nao#prefLabel>"""Sebastian Trueg""
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:zre""<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#PersonContact>"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:zre""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullname>"""Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:zre""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#hasEmailAddress>""_:gqe"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:gqe""<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#EmailAddress>"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:gqe""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#emailAddress>"""sebastian at trueg.de"^^<http://www.w3.org/2001/XMLSchema#string>"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:fqe""<http://www.semanticdesktop.org/ontologies/2007/08/15/nao#prefLabel>"""Sebastian Trueg""
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:fqe""<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#PersonContact>"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:fqe""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullname>"""Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>"
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources:
> > > "_:fqe""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#hasEmailAddress>""_:gqe"
> > > 
> > > This is the error returned after the storeResourceCall:
> > > nepomukstorage(21806)/nepomuk (storage service)
> > > Nepomuk::DataManagementModel::storeResources: Setting error! "Invalid
> > > argument (1)":
> > > "http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullnamehas a
> > > max cardinality of 1. Provided 2 values - "Sebastian
> > > Trueg"^^<http://www.w3.org/2001/XMLSchema#string>, "Sebastian
> > > Trueg"^^<http://www.w3.org/2001/XMLSchema#string>. Existing -  Affected
> > > Resource: nepomuk:/res/75164167-3ae0-413f-a991-ed73a08ca9ec, new card: 2,
> > > old card: 0"
> > > "/opt/devel/KDE/bin/nepomukservicestub(21806)" Soprano: "Invalid
> > > argument (1)":
> > > "http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullnamehas a
> > > max cardinality of 1. Provided 2 values - "Sebastian
> > > Trueg"^^<http://www.w3.org/2001/XMLSchema#string>, "Sebastian
> > > Trueg"^^<http://www.w3.org/2001/XMLSchema#string>. Existing -  Affected
> > > Resource: nepomuk:/res/75164167-3ae0-413f-a991-ed73a08ca9ec, new card: 2,
> > > old card: 0"
> > > 
> > >> As I understand your code you generate a hash of each resource to check
> > >> if two are exactly the same. That probably works for most use-cases, but
> > >> I'm not sure if it is the best solution.
> > >> Given the problem above you'd have to rerun the hashing for the
> > >> resources which were modified due to a merged resource, so that already
> > >> complicates matters.
> > >> 
> > >> I thought maybe it would be possible to leave the merging up to the
> > >> normal resource merger. This would have the effect that not only exactly
> > >> equal resources would be merged, but all, just as the resource merger
> > >> would normally merge them.
> > >> If you think of the SimpleResourceGraph as a tree, a post-order
> > >> traversal of the tree would allow you to store each resource one by one,
> > >> starting from the leaves of the branch going to the root. The
> > >> ResourceMerger would then automatically merge all resources as necessary.
> > >> 
> > >> Do you think that would be a viable option?
> > >> 
> > >> Cheers,
> > >> Christian
> > >> 
> > >> _______________________________________________
> > >> Nepomuk mailing list
> > >> Nepomuk at kde.org
> > >> https://mail.kde.org/mailman/listinfo/nepomuk

