[Nepomuk] Something to think about on the trip to Freiburg
Evgeny Egorochkin
phreedom.stdin at gmail.com
Sat Nov 7 14:13:29 CET 2009
In a message of Wednesday 04 November 2009 22:37:56, Sebastian Trüg wrote:
> Only two days until the Nepomuk workshop. Let me spoil your trip with some
> use cases which I would like to solve this weekend. Some of you have a
> long trip. Use the time to find answers to all the problems and implement
> the solution. With any luck we are done Friday evening and can go clubbing
> all weekend. ;)
>
> 1. Copy meta data with a copied file to a removable storage
> - How do we encode the data on the device (let's say USB stick)? A
> simple trig-encoded and gzipped file containing a graph data?
There was a discussion like this with the Tracker guys. I'd suggest everyone at
least take a look at the issues we're going to face, although the solution
proposed there may not be optimal. At a later date Philip van Hoof proposed
using SPARUL to store diffs. This is similar to how SQL databases dump their
data for backup purposes as SQL INSERT statements.
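A rough sketch of what such a SPARUL change set could look like, assuming the diff is expressed as DELETE DATA / INSERT DATA statements (the function names and example URIs here are illustrative, not a real API):

```python
# Hypothetical sketch: serialize a metadata change set as SPARUL update
# statements, analogous to dumping an SQL database as INSERT statements.

def format_triple(s, p, o):
    """Render one triple in SPARQL syntax (assumes URI subject/predicate
    and a plain literal object, for simplicity)."""
    return '<%s> <%s> "%s" .' % (s, p, o)

def sparul_diff(added, removed):
    """Return SPARUL statements describing a change set as one string."""
    parts = []
    if removed:
        parts.append("DELETE DATA {\n  %s\n}" %
                     "\n  ".join(format_triple(*t) for t in removed))
    if added:
        parts.append("INSERT DATA {\n  %s\n}" %
                     "\n  ".join(format_triple(*t) for t in added))
    return ";\n".join(parts)

# Example: a rating changed from 3 to 5 on one resource.
diff = sparul_diff(
    added=[("urn:file1", "http://example.org/rating", "5")],
    removed=[("urn:file1", "http://example.org/rating", "3")],
)
print(diff)
```

Each change set is self-contained, so a sequence of them can simply be appended to a log, which fits the append-only requirements discussed below.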
From http://live.gnome.org/MetadataOnRemovableDevices :
"Issues to solve:
* Writing the entire file in time is going to cause too much I/O and require
too much time to succeed before unmount has completed
* Rewriting the entire file for each change is going to cause too much I/O and
on older flash devices it is going to cause level wearing
* Our users take USBSticks and MMC cards forcefully out of their sockets. The
format therefore needs to be append-only unless the developer can do an atomic
rename().
* The full format of Turtle might be too slow to parse for relatively easy
records
* We don't want to burden developers with a complex format or with a format
that has no existing parsers.
Reasoning behind some decisions:
* This format is append-only. This solves most I/O problems while writing.
It's also the best strategy for avoiding data corruption whenever a user
forcefully removes the removable device from its socket.
* XML is not appendable. Therefore it's not a good format for large amounts of
records nor is it a usable format when data corruption is a realistic
possibility.
* Google protocol buffers is an interesting component, but the format prepends
instead of appends. Which makes sense for Network I/O, but not for disk I/O
nor for avoiding data corruption.
* cLucene sounded like too large a dependency and too much complexity, although
we have found reasons to believe that its format is made to be robust against
data corruption too.
* SQLite has transaction support, sure, but try pulling a USB-stick out of its
socket while SQLite is writing to the .db file. Usually next time you open the
database, it'll be corrupted. We also don't expect this from SQLite: its
purpose is not to protect you against this use-case.
* Since it's compatible with Turtle, this format has existing parsers, like
for example Raptor. This makes it easier for implementers.
* Sure we have heard about XMP, but to write a sidekick.xmp file of a few
hundred bytes next to each file on a FAT32 partition is going to lead to sector
waste. Especially given that a sector is between 8k and usually 32kb in size
on FAT32. This file doesn't mean that you can't embed your XMP data into files
that can embed it. You can still do this of course. XMP sidekick files also
create a filename conflict when you need to store metadata in a sidekick file
for, for example, both "My Work.txt" and "My Work.doc". "
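The append-only discipline described above can be sketched in a few lines: each change is written as one self-contained record and forced to the device before returning, so a forcibly removed stick loses at most the last partial record. The file name and record format here are assumptions for illustration:

```python
# Sketch of an append-only metadata log for removable media. Each record
# is appended and fsync'ed so that yanking the device loses at most the
# record currently being written, never earlier data.
import os
import tempfile

def append_record(log_path, record):
    """Append one record and force it to the device before returning."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(record.rstrip("\n") + "\n")
        f.flush()
        os.fsync(f.fileno())  # don't return until the data is on the medium

# Demo on a temporary file standing in for $MOUNTPOINT/.metadata.log
fd, log = tempfile.mkstemp(suffix=".log")
os.close(fd)
append_record(log, '<urn:file1> nao:numericRating "5" .')
append_record(log, '<urn:file1> nao:numericRating "4" .')
records = open(log, encoding="utf-8").read().splitlines()
```

Because records are only ever appended, a reader can always recover every complete line even if the last write was interrupted.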
> - Which data do we export (the whole related project or person? Only the
> resource URI? Only literal properties?)
Perhaps this should depend on privacy/sharing properties. I will write another
email about this.
> - How is this data re-imported? Maybe even on another desktop?
The question is unclear.
> - When this file is part of a search result do we display it anyway? How
> would Dolphin for example show that the user needs to insert the USB
> stick?
First of all we really, really need a KIO slave for removable media, and we want
an identical GIO plugin. This will fix both issues: the one you mention and the
ugliness of relative URIs in nie:url. The URL would look like
media://$UUID/somefile
When the user clicks the URL of the file in Dolphin or anywhere else, it's
handled similarly to how authorization is requested: you ask the user, and if
he says no, you fail the file operation.
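Resolving such a URL could look roughly like this; the media:// scheme and the mount-table lookup are assumptions for the sketch (a real KIO slave would query Solid, or /dev/disk/by-uuid on Linux):

```python
# Illustrative sketch of resolving a hypothetical media://$UUID/path URL:
# look the filesystem UUID up in a mount table and fail the operation if
# the device is absent, so the caller can prompt the user to insert it.
from urllib.parse import urlparse

def resolve_media_url(url, mounted):
    """Map media://$UUID/path to a local path, or None if not mounted.

    `mounted` maps device UUIDs to mount points (assumption: the real
    implementation would ask the platform's hardware layer instead).
    """
    parsed = urlparse(url)
    mount_point = mounted.get(parsed.netloc)
    if mount_point is None:
        return None  # caller asks the user to insert the stick, then retries
    return mount_point + parsed.path

mounts = {"3f2a-9c1d": "/media/usb0"}
present = resolve_media_url("media://3f2a-9c1d/somefile", mounts)
absent = resolve_media_url("media://dead-beef/other", mounts)
```

The None case is exactly the hook where Dolphin would show its "please insert the USB stick" prompt.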
> 2. Send a file to another desktop via jabber or email or whatever including
> meta data
> - Which meta data is exported?
Sharing/privacy settings. More to come in another email :)
> - Should strigi-extracted data be excluded?
Probably. You let the download stream pass through libstreamanalyzer and get the
data at no cost (apart from hashing, maybe).
The exported data could contain unknown ontologies generated by some 3rd-party
plugin, and on the next reindexing any "extra" Strigi metadata may be dropped
anyway. Otherwise you would be hoping that someone else did metadata extraction
properly, which is a source of sneaky bugs and problems.
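The "at no cost" part can be sketched like this: the receiver hashes (and could analyze) the stream while writing it out, instead of trusting the sender's extracted data. The function name is illustrative; a real implementation would feed the chunks to libstreamanalyzer:

```python
# Sketch: re-extract metadata on the receiving side while the download
# is being written, rather than trusting the sender's Strigi data.
import hashlib
import io

def receive(stream, out):
    """Copy `stream` to `out`, hashing on the fly. An analyzer (e.g.
    libstreamanalyzer) could be fed the same chunks in this loop."""
    h = hashlib.sha1()
    for chunk in iter(lambda: stream.read(8192), b""):
        h.update(chunk)  # hash computed as a side effect of the copy
        out.write(chunk)
    return h.hexdigest()

out = io.BytesIO()
digest = receive(io.BytesIO(b"hello"), out)
```

Since the data is read exactly once, the hash (and any extraction) really does come for free in I/O terms.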
> - How "deep" do we traverse the tree? Do we copy a related person or a
> project?
For Jabber we can do a trick and let the receiving side request data using e.g. a
DESCRIBE SPARQL query (subject to your privacy/sharing settings). Essentially
this is orthogonal to file sending; it's just the normal process of P2P
metadata exchange where peers request metadata from each other.
Sending everything related may mean sending lots and lots of data, so for
email it's almost out of the question.
Since email is not interactive, there's no nice way... apart from providing
your Jabber account in the metadata and letting the receiving side use it to
retrieve more metadata :)
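On the sender's side, answering such a request subject to sharing settings could look like this minimal sketch, where a predicate whitelist stands in for real privacy/sharing settings (an assumption for illustration):

```python
# Sketch: a peer answers a DESCRIBE-style request for one resource,
# returning only the triples its sharing settings allow.

def answer_describe(store, resource, shared_predicates):
    """Return the triples about `resource` that the peer may see.

    `store` is a list of (s, p, o) tuples; `shared_predicates` is a
    stand-in for real privacy/sharing settings.
    """
    return [(s, p, o) for (s, p, o) in store
            if s == resource and p in shared_predicates]

store = [("urn:file1", "nao:rating", "5"),
         ("urn:file1", "nao:secretNote", "draft"),
         ("urn:file2", "nao:rating", "2")]
answer = answer_describe(store, "urn:file1", {"nao:rating"})
```

The private note and the unrelated resource are both filtered out before anything leaves the machine, which is easier to get right than filtering arbitrary SPARQL after the fact.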
> - How do we "mark" the meta data as coming from another desktop?
If we regenerate the Strigi metadata, then realistically the only data to come
over will be from the PIMO realm. I think this is already handled in PIMO :)
> Embedding a jabber id or email address in the URIs?
> - Do we only do the above or do we actually try to sync with local data.
> Typical example here is a tag with the same name. Possibilities are a
> merger (and thus, loosing the origin of the meta data) or something
> like equality on the ontology level (possibly more complex queries) or
> simply ignoring it.
We can use named graphs to store provenance for each triple. If we already
have the triple, or if the triples conflict, we keep our local copy.
owl:sameAs is still not in good shape afaik, and it doesn't have a conflict
resolution mechanism, e.g. for when you have two slightly different spellings
of a person's name.
One more important issue is data updates: if the peer changes their rating of
some resource, you should be able to track and delete that rating in your
datastore. So a total merge, pretending the data is yours, is a bad idea.
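A minimal sketch of that merge rule, with the data layout assumed for illustration: each statement remembers which graph/peer it came from, and on conflict the local value wins, so a peer's contributions can later be updated or deleted wholesale:

```python
# Sketch: merge incoming triples while recording per-statement
# provenance; local statements always win on conflict.

def merge(local, incoming, peer):
    """Merge incoming (s, p, o) triples into local {(s, p): (o, source)}.

    Keeping the source per statement lets us later drop or refresh
    everything that came from `peer` without touching local data.
    """
    merged = dict(local)
    for (s, p, o) in incoming:
        if (s, p) in merged:
            continue                # duplicate or conflict: keep local copy
        merged[(s, p)] = (o, peer)  # record where this statement came from
    return merged

local = {("urn:file1", "nao:rating"): ("5", "local")}
incoming = [("urn:file1", "nao:rating", "3"),           # conflicts: dropped
            ("urn:file1", "nao:description", "holiday photo")]
result = merge(local, incoming, "peer:alice")
```

In an RDF store the `source` field would naturally be the named graph holding the imported statements.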
> - How is privacy handled? We need an ontology and a mechanism to protect
> sensible data. Private/public key systems could be a good solution.
> - Looks like basically the same problem as 1.
Privacy, or security and authentication? For the latter, I'd suggest a plugin-
based approach and reuse of existing infrastructure. It could be provided by
Jabber, possibly with additional twists like OTR
(http://www.cypherpunks.ca/otr/),
or it could be plain old GPG.
> 4. Query other desktops in the intra-net (or a dedicated subnet)
> - privacy: how do we protect the sensible data?
> - which technologies to use? Avahi? Which protocol should we use? HTTPS
How is this different from P2P exchanges? Why not use the same technology?
Either way, this would be handled by plugins.
> - maybe simply a dedicated SPARQL endpoint? But again: how to filter the
> data then?
There are two approaches that are not mutually exclusive (probably both should
be implemented): letting others download the dataset (and receive updates), or
letting others query the dataset.
Dataset mirroring should be OK if you directly exchange data with friends and
they don't use bots to populate their PIMOs.
Larger datastores need to expose a SPARQL endpoint. However, exposing SPARQL
and trying to filter it is going to be tricky: there will be many non-obvious
ways to access data that you aren't supposed to access, using tricky
queries, aggregation, filters and whatnot.
So my bet is that you can't do filtering nicely unless your backend supports it.
> - should the network always be queried?
Let the user decide.
> - when importing data from other desktops we are back at use case 2
>
> 3. Integrate data from opendesktop.org.
> - The easiest solution is probably to have an Akonadi feeder for this
> data. - Can we somehow propose the public data from Nepomuk desktops
> available in the opendesktop interface?
If OpenDesktop can handle the load, definitely yes. Let users push their
datasets into OpenDesktop, and make it a web interface into the Nepomuk P2P
community.
> - Can opendesktop act as just another Nepomuk node in the network?
Can and should. Since you can avoid storing any private data in the RDF store,
you can expose SPARQL.
> General:
> - Which kind of APIs can we provide to simplify life for app developers?
> - Which kind of services/tools do we need?
These will need an even longer email I'm afraid :(
> Looking forward to seeing you this weekend.
-- Evgeny