[Nepomuk] Something to think about on the trip to Freiburg

Evgeny Egorochkin phreedom.stdin at gmail.com
Sat Nov 7 14:13:29 CET 2009


On Wednesday 04 November 2009 22:37:56 Sebastian Trüg wrote:
> Only two days until the Nepomuk workshop. Let me spoil your trip with some
>  use cases which I would like to solve this weekend. Some of you have a
>  long trip. Use the time to find answers to all the problems and implement
>  the solution. With any luck we are done Friday evening and can go clubbing
>  all weekend. ;)
> 
> 1. Copy meta data with a copied file to a removable storage
>    - How do we encode the data on the device (let's say USB stick)? A
>  simple TriG-encoded and gzipped file containing the graph data?

There was a similar discussion with the Tracker guys. I'd suggest everyone at 
least take a look at the issues we're going to face, although the solution 
proposed there may not be optimal. Later, Philip van Hoof proposed using SPARUL 
to store diffs, similar to how SQL databases dump their data for backup as SQL 
INSERT statements.
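
To make the SPARUL-diff idea concrete, here's a minimal sketch (Python with 
rdflib, purely for illustration; none of this is Nepomuk or Tracker API, and 
the URIs are made up) of changes being recorded as SPARQL Update statements 
and replayed later, just like replaying INSERTs from an SQL dump:

# Hypothetical "diff log": each metadata change is one SPARUL statement.
from rdflib import Graph, Literal, Namespace, URIRef

NAO = Namespace("http://www.semanticdesktop.org/ontologies/2007/08/15/nao#")
PREFIX = "PREFIX nao: <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#> "

store = Graph()
store.add((URIRef("urn:example:file1"), NAO.numericRating, Literal(6)))

diff_log = [
    PREFIX + "DELETE DATA { <urn:example:file1> nao:numericRating 6 }",
    PREFIX + "INSERT DATA { <urn:example:file1> nao:numericRating 8 }",
]
for statement in diff_log:          # replay the diff against a local copy
    store.update(statement)

print(store.serialize(format="turtle"))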

From http://live.gnome.org/MetadataOnRemovableDevices :
"Issues to solve:
* Writing the entire file in time is going to cause too much I/O and require 
too much time to succeed before unmount has completed 
* Rewriting the entire file for each change is going to cause too much I/O, 
and on older flash devices it is going to cause excessive wear. 
* Our users take USBSticks and MMC cards forcefully out of their sockets. The 
format therefore needs to be append-only unless the developer can do an atomic 
rename(). 
* The full format of Turtle might be too slow to parse for relatively easy 
records 
* We don't want to burden developers with a complex format or with a format 
that has no existing parsers. 

Reasoning behind some decisions:
* This format is append-only. This solves most I/O problems while writing. 
It's also the best strategy for avoiding data corruption whenever a user 
forcefully removes the removable device from its socket. 
* XML is not appendable. Therefore it's not a good format for large amounts of 
records nor is it a usable format when data corruption is a realistic 
possibility. 
* Google protocol buffers is an interesting component, but the format prepends 
instead of appends, which makes sense for network I/O, but not for disk I/O 
nor for avoiding data corruption. 
* CLucene sounded like too large a dependency and too much complexity, 
although we have found reasons to believe that its format is designed to be 
robust against data corruption too. 
* SQLite has transaction support, sure, but try pulling a USB-stick out of its 
socket while SQLite is writing to the .db file. Usually next time you open the 
database, it'll be corrupted. We also don't expect this from SQLite: its 
purpose is not to protect you against this use-case. 
* Since it's compatible with Turtle, this format has existing parsers, like 
for example Raptor. This makes it easier for implementers. 
* Sure, we have heard about XMP, but writing a sidecar .xmp file of a few 
hundred bytes next to each file on a FAT32 partition is going to waste space, 
especially given that a cluster is between 8 KB and usually 32 KB in size on 
FAT32. This format doesn't mean that you can't embed your XMP data into files 
that support embedding; you can still do that, of course. XMP sidecar files 
also create a filename conflict when you need to store metadata in a sidecar 
file for, for example, "My Work.txt" and "My Work.doc". "

>    - Which data do we export (the whole related project or person? Only the
>      resource URI? Only literal properties?)

Perhaps this should depend on privacy/sharing properties. I'll write another 
email about this.

>    - How is this data re-imported? Maybe even on another desktop?

The question is unclear.

>    - When this file is part of a search result do we display it anyway? How
>      would Dolphin for example show that the user needs to insert the USB
>      stick?

First of all, we really, really need a KIO slave for removable media, and we 
want an identical GIO plugin. This would fix both issues: the one you mention 
and the ugliness of relative URIs in nie:url. The URL would look like 
media://$UUID/somefile

When the user clicks the URL of such a file in Dolphin or anywhere else, it's 
handled similarly to how authorization is requested: you ask the user to 
insert the device, and if they say no, you fail the file operation.
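
To illustrate what such a kioslave would do internally (this is only a sketch; 
there is no such KIO/GIO API today, the scheme is hypothetical and the paths 
are Linux-specific), resolving a media:// URL is basically a UUID-to-mount-
point lookup:

import os
from urllib.parse import urlparse

def resolve_media_url(url):
    # media://1234-ABCD/music/song.ogg -> /run/media/user/stick/music/song.ogg
    parts = urlparse(url)
    uuid, rel_path = parts.netloc, parts.path.lstrip("/")
    device = os.path.realpath("/dev/disk/by-uuid/" + uuid)
    with open("/proc/mounts") as mounts:
        for line in mounts:
            dev, mount_point = line.split()[:2]
            if dev == device:
                return os.path.join(mount_point, rel_path)
    return None  # not mounted -> ask the user to insert the stick

print(resolve_media_url("media://1234-ABCD/music/song.ogg"))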


> 2. Send a file to another desktop via jabber or email or whatever including
> meta data
>    - Which meta data is exported?
Sharing/privacy settings. More to come in another email :)

>    - Should strigi-extracted data be excluded?
Probably. You can let the download stream pass through libstreamanalyzer and 
get the data at no cost (apart from the hash, maybe).

The exported data could contain unknown ontologies generated by some 3rd-party 
plugin.

On the next reindexing, any "extra" Strigi metadata may be dropped anyway.

Otherwise you'd be hoping that someone else did the metadata extraction 
properly, which is a source of sneaky bugs and problems.
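
The point is that the file only has to be streamed once; a rough sketch of the 
idea (analyze_chunk() is a made-up stand-in for whatever libstreamanalyzer's 
streaming interface actually looks like):

import hashlib

def analyze_chunk(chunk):
    pass  # placeholder for the real extractor

def save_and_index(stream, out_path, chunk_size=64 * 1024):
    sha1 = hashlib.sha1()
    with open(out_path, "wb") as out:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            sha1.update(chunk)    # the hash comes for free
            analyze_chunk(chunk)  # extraction happens in the same pass
    return sha1.hexdigest()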

>    - How "deep" do we traverse the tree? Do we copy a related person or a
>      project?

For Jabber we can do a trick and let the receiving side request data using 
e.g. a DESCRIBE SPARQL query (subject to your privacy/sharing settings). 
Essentially this is orthogonal to file sending; it's just the normal process 
of P2P metadata exchange where peers request metadata from each other.
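
A rough sketch of such an exchange (send_iq_to_peer() is a made-up stand-in 
for the XMPP transport; only the query string and the parsing of the reply are 
real, here done with rdflib for illustration):

from rdflib import Graph

def fetch_peer_metadata(peer_jid, resource_uri, send_iq_to_peer):
    query = "DESCRIBE <" + resource_uri + ">"
    # The peer runs the query against its own store, applying its
    # privacy/sharing settings, and answers with e.g. Turtle.
    reply = send_iq_to_peer(peer_jid, query)
    g = Graph()
    g.parse(data=reply, format="turtle")
    return g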

Sending everything related may mean sending lots and lots of data, so for 
email it's almost out of the question.
Since email is not interactive, there's no nice way around this... apart from 
providing your Jabber account in the metadata and letting the receiving side 
use it to retrieve more metadata :)

>    - How do we "mark" the meta data as coming from another desktop?

If we regenerate Strigi metadata, then realistically the only data to come 
over will be in the PIMO realm. I think this is already handled in PIMO :)

>  Embedding a jabber id or email address in the URIs?
>    - Do we only do the above or do we actually try to sync with local data.
>      Typical example here is a tag with the same name. Possibilities are a
>      merger (and thus, losing the origin of the meta data) or something
>  like equality on the ontology level (possibly more complex queries) or
>  simply ignoring it.

We can use named graphs to store provenance for each triple. If we already 
have the triple, or if the incoming triple conflicts with ours, we keep our 
local copy.

owl:sameAs support is still not in good shape AFAIK, and it doesn't provide a 
conflict-resolution mechanism, e.g. when you have two slightly different 
spellings of a person's name.

One more important issue is data updates: if a peer changes their rating of 
some resource, you should be able to track that change and delete the old 
rating from your datastore. So a total merge that pretends the data is yours 
is a bad idea.
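
A minimal sketch of what per-peer named graphs buy us (rdflib and the URIs are 
just for illustration):

from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef

NAO = Namespace("http://www.semanticdesktop.org/ontologies/2007/08/15/nao#")
store = ConjunctiveGraph()

# Everything received from Alice lives in her own named graph.
alice = store.get_context(URIRef("urn:provenance:alice@example.org"))
alice.add((URIRef("urn:example:file1"), NAO.numericRating, Literal(9)))

# When Alice changes or retracts her data, only her graph is touched;
# our own statements in other graphs stay as they are.
alice.remove((None, None, None))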

>    - How is privacy handled? We need an ontology and a mechanism to protect
>      sensitive data. Private/public key systems could be a good solution.
>    - Looks like basically the same problem as 1.

Privacy, or security and authentication? For the latter, I'd suggest a 
plugin-based approach and reuse of existing infrastructure. It could be 
provided by Jabber, possibly with additional twists like OTR 
(http://www.cypherpunks.ca/otr/), or it could be plain old GPG.

> 4. Query other desktops in the intra-net (or a dedicated subnet)
>    - privacy: how do we protect the sensitive data?
>    - which technologies to use? Avahi? Which protocol should we use? HTTPS

How is this different from P2P exchanges? Why not use the same tech? Either way 
this would be handled by plugins.

>  - maybe simply a dedicated SPARQL endpoint? But again: how to filter the
>  data then?

There are two approaches, which are not mutually exclusive (probably both 
should be implemented): letting others download the dataset (and receive 
updates), or letting others query the dataset.

Dataset mirroring should be OK if you directly exchange data with friends and 
they don't use bots to populate their PIMOs.

Larger datastores need to expose a SPARQL endpoint; however, exposing SPARQL 
and trying to filter it is going to be tricky. There will be many non-obvious 
ways to access data you aren't supposed to access, using tricky queries, 
aggregation, filters and whatnot.

So my bet is that you can't do filtering nicely unless your backend supports 
it.
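
A made-up example of the kind of sneaky query I mean: the private literal is 
never selected, yet the FILTER lets the client probe its value prefix by 
prefix (the property and data are invented for illustration):

leaky_query = """
PREFIX nao: <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#>
SELECT ?file WHERE {
  ?file nao:description ?private .
  FILTER(STRSTARTS(STR(?private), "salary"))
}
"""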

>    - should the network always be queried?

Let the user decide.

>    - when importing data from other desktops we are back at use case 2
> 
> 3. Integrate data from opendesktop.org.
>    - The easiest solution is probably to have an Akonadi feeder for this
>      data.
>    - Can we somehow make the public data from Nepomuk desktops available in
>      the opendesktop interface?

If OpenDesktop can handle the load, definitely yes. Let users push their 
datasets into OpenDesktop, and make it a web interface into the Nepomuk P2P 
community.

>    - Can opendesktop act as just another Nepomuk node in the network?

It can and should. Since opendesktop can avoid storing any private data in the 
RDF store, it can expose SPARQL.

> General:
> - Which kind of APIs can we provide to simplify life for app developers?
> - Which kind of services/tools do we need?

These will need an even longer email, I'm afraid :(

> Looking forward to seeing you this weekend.

-- Evgeny

