[Nepomuk] Duplicates merging in DataManagementModel::storeResources

Sebastian Trüg trueg at kde.org
Mon Oct 31 11:12:35 UTC 2011


Hi Christian,

let's meet up this week to discuss the problem and hopefully fix it. So
far I have stayed clear of the storeResources code, but with Vishesh not
having much time I will dive into it.

Cheers,
Sebastian

On 10/31/2011 12:42 PM, Christian Mollekopf wrote:
> Hey,
> 
> This issue is getting pressing; a solution is needed for 4.8.
> Currently the feeders are broken because of it.
> 
> The code in storeResources is beyond me, and my attempts to fix it have failed
> so far. So if no one fixes it there, I'll have to work around the issue in the
> feeder code.
> 
> I don't mean to push anyone, I'd just like to know if somebody from the 
> nepomuk team (yes vishesh I'm looking at you ;-) is going to fix this, or if 
> I'm on my own. As said, I do understand if you currently lack the time to make 
> this happen, just tell me.
> 
> Thanks,
> Christian
> 
> PS: I added the pastes before they are deleted from pastie
> 
> On Saturday, October 08, 2011 03:12:51 PM Christian Mollekopf wrote:
>> Hi Vishesh,
>>
>> The duplicates merging code doesn't cut it for the feeders yet.
>> As far as I could track it down the problem is that I have hierarchies of
>> resources which need to be merged together.
>> I.e. I add a contact with its email address several times to the graph. The
>> email addresses are now correctly merged, but because the contacts had
>> different email uris in the first hashing run (before they have been
>> merged), the contacts remain duplicated.
>>
>> Here is the test which currently fails:
>> http://paste.kde.org/131371/
> 
> void DataManagementModelTest::testStoreResources_duplicates2()
> {
>     SimpleResource contact1;
>     contact1.addType( NCO::Contact() );
>     contact1.addProperty( NCO::fullname(), QLatin1String("Spiderman") );
>     contact1.addProperty( NAO::prefLabel(), QLatin1String("test") );
>  
>     SimpleResource email1;
>     email1.addType(NCO::EmailAddress());
>     email1.addProperty(NCO::emailAddress(), QLatin1String("email at foo.com"));
>     contact1.addProperty(NCO::hasEmailAddress(), email1.uri());
>  
>     SimpleResource contact2;
>     contact2.addType( NCO::Contact() );
>     contact2.addProperty( NCO::fullname(), QLatin1String("Spiderman") );
>     contact2.addProperty( NAO::prefLabel(), QLatin1String("test") );
>  
>     SimpleResource email2;
>     email2.addType(NCO::EmailAddress());
>     email2.addProperty(NCO::emailAddress(), QLatin1String("email at foo.com"));
>     contact2.addProperty(NCO::hasEmailAddress(), email2.uri());
>  
>     SimpleResourceGraph graph;
>     graph << email1 << contact1 << email2 << contact2;
>  
>     m_dmModel->storeResources( graph, "appA" );
>     QVERIFY(!m_dmModel->lastError());
>  
>     int contactCount = m_model->listStatements( Node(), RDF::type(), NCO::Contact() ).allStatements().size();
>     QCOMPARE( contactCount, 1 );
>  
>     int emailCount = m_model->listStatements( Node(), RDF::type(), NCO::EmailAddress() ).allStatements().size();
>     QCOMPARE( emailCount, 1 );
>  
>     QCOMPARE( m_model->listStatements( Node(), NCO::fullname(), Node() ).allStatements().size(), 1 );
>     QCOMPARE( m_model->listStatements( Node(), NAO::prefLabel(), Node() ).allStatements().size(), 1 );
>  
>     QVERIFY(!haveTrailingGraphs());
> }
>  
> add to qtest_dms.cpp:
>  
>     model.addStatement( NCO::emailAddress(), RDF::type(), RDF::Property(), graph );
>     model.addStatement( NCO::emailAddress(), RDFS::range(), XMLSchema::string(), graph );
>     model.addStatement( NCO::emailAddress(), RDFS::domain(), NCO::EmailAddress(), graph );
>     
>     model.addStatement( NCO::hasEmailAddress(), RDF::type(), RDF::Property(), graph );
>     model.addStatement( NCO::hasEmailAddress(), RDFS::range(), NCO::EmailAddress(), graph );
>     model.addStatement( NCO::hasEmailAddress(), RDFS::domain(), NCO::Contact(), graph );
>     
>     model.addStatement( NCO::EmailAddress(), RDF::type(), RDFS::Resource(), graph );
>     model.addStatement( NCO::EmailAddress(), RDF::type(), RDFS::Class(), graph );
>     model.addStatement( NCO::EmailAddress(), RDFS::subClassOf(), NCO::ContactMedium(), graph );
> 
>>
>> And here's an excerpt of the debugging output which shows the problem in the
>> actual feeders:
>> http://paste.kde.org/131377/
>>
> 
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:zre""<http://www.semanticdesktop.org/ontologies/2007/08/15/nao#prefLabel>"""Sebastian Trueg""
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:zre""<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#PersonContact>"
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:zre""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullname>"""Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>"
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:zre""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#hasEmailAddress>""_:gqe"
>  
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:gqe""<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#EmailAddress>"
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:gqe""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#emailAddress>"""sebastian at trueg.de"^^<http://www.w3.org/2001/XMLSchema#string>"
>  
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:fqe""<http://www.semanticdesktop.org/ontologies/2007/08/15/nao#prefLabel>"""Sebastian Trueg""
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:fqe""<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#PersonContact>"
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:fqe""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullname>"""Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>"
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: "_:fqe""<http://www.semanticdesktop.org/ontologies/2007/03/22/nco#hasEmailAddress>""_:gqe"
>  
> This is the error returned after the storeResources call:
> nepomukstorage(21806)/nepomuk (storage service) Nepomuk::DataManagementModel::storeResources: Setting error! "Invalid argument (1)": "http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullname has a max cardinality of 1. Provided 2 values - "Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>, "Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>. Existing -  Affected Resource: nepomuk:/res/75164167-3ae0-413f-a991-ed73a08ca9ec, new card: 2, old card: 0"
> "/opt/devel/KDE/bin/nepomukservicestub(21806)" Soprano: "Invalid argument (1)": "http://www.semanticdesktop.org/ontologies/2007/03/22/nco#fullname has a max cardinality of 1. Provided 2 values - "Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>, "Sebastian Trueg"^^<http://www.w3.org/2001/XMLSchema#string>. Existing -  Affected Resource: nepomuk:/res/75164167-3ae0-413f-a991-ed73a08ca9ec, new card: 2, old card: 0"
> 
>> As I understand your code, you generate a hash of each resource to check if
>> two are exactly the same. That probably works for most use-cases, but I'm
>> not sure if it is the best solution.
>> Given the problem above you'd have to rerun the hashing for the resources
>> which were modified due to a merged resource, so that already complicates
>> matters.
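
To make the "rerun the hashing" idea concrete, here is a minimal sketch that
repeats the duplicate detection until nothing changes any more. It does not
use the Nepomuk API: Res, Props and mergeDuplicates are hypothetical stand-ins
invented for illustration, and comparing the property sets directly stands in
for the hash. It is not the actual storeResources logic.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for SimpleResource: a URI plus (property, value)
// pairs. A value is either a literal or the URI of another resource in the
// same graph (this sketch does not distinguish the two cases by type).
using Props = std::multiset<std::pair<std::string, std::string>>;

struct Res {
    std::string uri;
    Props props;
};

// Merge resources whose property sets are identical, and repeat until a
// pass finds no more duplicates. After each pass, references to a removed
// duplicate are rewritten to the surviving URI, which can make further
// resources identical in the next pass.
std::vector<Res> mergeDuplicates(std::vector<Res> graph)
{
    bool changed = true;
    while (changed) {
        changed = false;
        std::map<Props, std::string> seen;            // content -> surviving URI
        std::map<std::string, std::string> replaced;  // duplicate URI -> surviving URI
        std::vector<Res> kept;

        for (const Res& r : graph) {
            auto it = seen.find(r.props);
            if (it == seen.end()) {
                seen.emplace(r.props, r.uri);
                kept.push_back(r);
            } else {
                replaced[r.uri] = it->second;
                changed = true;
            }
        }

        // Point all references at the surviving resources.
        for (Res& r : kept) {
            Props rewritten;
            for (const auto& pv : r.props) {
                auto rep = replaced.find(pv.second);
                rewritten.emplace(pv.first,
                                  rep == replaced.end() ? pv.second : rep->second);
            }
            r.props = std::move(rewritten);
        }
        graph = std::move(kept);
    }
    return graph;
}
```

Fed the four resources from testStoreResources_duplicates2 in the order
email1, contact1, email2, contact2, the first pass merges the two emails and
the second pass merges the two contacts, leaving one email and one contact.
The price is one extra pass per level of the hierarchy.
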
>>
>> I thought maybe it would be possible to leave the merging up to the normal
>> resource merger. This would have the effect that not only exactly equal
>> resources would be merged, but all, just as the resource merger would
>> normally merge them.
>> If you think of the SimpleResourceGraph as a tree, a post-order traversal of
>> the tree would allow you to store each resource one by one, starting from
>> the leaves of the branch going to the root. The ResourceMerger would then
>> automatically merge all resources as necessary.
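
The post-order idea could look roughly like the sketch below. Everything in
it is an assumption made for illustration: Res, Props, Store, storePostOrder
and storeGraph are hypothetical names, and Store::storeOrMerge only merges on
exact property equality, which is a much cruder rule than whatever the real
ResourceMerger applies. Cycles are only guarded against, not resolved.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Props = std::multiset<std::pair<std::string, std::string>>;

// Hypothetical stand-in for SimpleResource.
struct Res {
    std::string uri;
    Props props;
};

// Hypothetical stand-in for the store plus merger: a resource whose
// properties match an already stored one is identified with it.
class Store {
public:
    std::string storeOrMerge(const Props& props)
    {
        auto it = m_byContent.find(props);
        if (it != m_byContent.end())
            return it->second;
        std::string uri = "res:" + std::to_string(m_byContent.size() + 1);
        m_byContent.emplace(props, uri);
        return uri;
    }
    std::size_t size() const { return m_byContent.size(); }

private:
    std::map<Props, std::string> m_byContent;
};

// Store everything a resource refers to before the resource itself, so
// that its references already carry their final URIs when it is stored.
void storePostOrder(const std::string& uri,
                    const std::map<std::string, Res>& graph,
                    std::map<std::string, std::string>& resolved,
                    Store& store)
{
    if (resolved.count(uri))
        return;
    resolved[uri] = uri;  // placeholder; also stops endless recursion on cycles

    Props rewritten;
    for (const auto& pv : graph.at(uri).props) {
        if (graph.count(pv.second)) {
            // The value names another resource in the graph: store it first.
            storePostOrder(pv.second, graph, resolved, store);
            rewritten.emplace(pv.first, resolved[pv.second]);
        } else {
            rewritten.emplace(pv.first, pv.second);  // plain literal
        }
    }
    resolved[uri] = store.storeOrMerge(rewritten);
}

// Returns the number of distinct resources in the store afterwards.
std::size_t storeGraph(const std::vector<Res>& resources, Store& store)
{
    std::map<std::string, Res> graph;
    for (const Res& r : resources)
        graph.emplace(r.uri, r);

    std::map<std::string, std::string> resolved;
    for (const Res& r : resources)
        storePostOrder(r.uri, graph, resolved, store);
    return store.size();
}
```

With the two contacts and two emails from the failing test, both emails end
up as one stored resource before either contact is stored, so the contacts
are identical by the time they reach the store and no second pass is needed.
The order in which the resources are added to the graph does not matter.
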
>>
>> Do you think that would be a viable option?
>>
>> Cheers,
>> Christian
>>
>> _______________________________________________
>> Nepomuk mailing list
>> Nepomuk at kde.org
>> https://mail.kde.org/mailman/listinfo/nepomuk