[Nepomuk] Handling multiple sources of metadata

Tue May 3 01:33:22 CEST 2011

Hi,
    I am revisiting the idea of file tagging.
There are potentially several places to store metadata:

1. Stored in a database.
2. Embedded in the file format, e.g. XMP/EXIF.
3. Stored in extended file attributes.
4. Stored in a special metadata file associated with the original file.

Embedded data is explicitly mentioned here:

http://api.kde.org/4.0-api/kdelibs-apidocs/nepomuk/html/index.html

with ID3 tags used as the example.
What about XMP tags added, for example, in digiKam?

Unless I am mistaken, Nepomuk currently uses only its own database.
I understand the reason for this approach is that the database is the
only solution that works in all cases (though the FAQ link I tried to
follow was broken).
I personally think it is wrong to make it the primary location, as losing
metadata when you copy files around is broken behaviour.

I was wondering (especially with a sprint potentially coming up) what the
ideal system would be.
This is revisiting old ground, but bitrot seems to have affected my Google
search results, so forgive me for re-asking old questions.

If you have multiple sources of the same data and they disagree, which
should be considered primary?
Who is responsible for syncing them if they disagree?

My thinking is as follows:

File-embedded data is primary.
Extended file attributes are secondary and should only be used for data
when the file format does not allow embedding.
A metadata file associated with the original file is a simulation of the
above and hence comes next.

The database is last but definitely not least.
If it is able to, the server should sync the data.

For example:
 Given an image tagged in Nepomuk (e.g. via Gwenview), Nepomuk or a service
 acting on its behalf should add the embedded tags itself (on Gwenview's
 behalf, assuming Gwenview didn't do it).

 Given an image tagged outside of Nepomuk (e.g. in digiKam), Nepomuk should
 import the tags into its database the next time it needs to query the
 file (or when indexing it).
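
To make the ordering and the two examples above concrete, here is a
minimal sketch in Python. It is illustration only: the source names and
both functions are made up and are not Nepomuk API. It also assumes that
an available source with no tags counts as "no data", which is what lets
database-only tags flow back into the file, at the cost of not being able
to express "all tags were deliberately removed".

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical source names, most authoritative first, following the
# priority proposed above.
PRIORITY = ["embedded", "xattr", "sidecar", "database"]

Sources = Dict[str, Optional[Set[str]]]


def primary_tags(sources: Sources) -> Set[str]:
    """Return the tags of the highest-priority source that has any tags.

    None (or a missing key) means the source is unavailable, e.g. the file
    format cannot embed tags. An empty set means "available, but untagged"
    and is skipped here (assumption: empty means no data).
    """
    for name in PRIORITY:
        tags = sources.get(name)
        if tags:
            return set(tags)
    return set()


def sync_plan(sources: Sources) -> List[Tuple[str, Set[str], Set[str]]]:
    """List (source, tags_to_add, tags_to_remove) for every available
    source that disagrees with the primary tags."""
    wanted = primary_tags(sources)
    plan = []
    for name in PRIORITY:
        tags = sources.get(name)
        if tags is None or tags == wanted:
            continue  # unavailable, or already in agreement
        plan.append((name, wanted - tags, tags - wanted))
    return plan
```

With an image tagged only in the database, e.g.
{"embedded": set(), "database": {"holiday"}}, the plan is to add "holiday"
to the embedded data; with an image tagged only in the file, the plan is
to add those tags to the database.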

Similarly, I think extended file attributes should be imported/exported
where the file system supports them, with an optional fallback to
simulating them with .metadata files or similar.
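
The fallback could look roughly like the sketch below. It uses Python's
os.setxattr/os.getxattr, which exist on Linux only. The "<file>.metadata"
naming, the JSON layout and the function names are my own assumptions for
illustration, not an existing convention.

```python
import json
import os


def sidecar_path(path):
    # Hypothetical naming scheme: "photo.jpg" -> "photo.jpg.metadata".
    return path + ".metadata"


def write_attr(path, key, value):
    """Store key/value as an extended attribute, else in a sidecar file.

    Returns "xattr" or "sidecar" to show which storage was used.
    """
    if hasattr(os, "setxattr"):  # os.setxattr only exists on Linux
        try:
            os.setxattr(path, "user." + key, value.encode("utf-8"))
            return "xattr"
        except OSError:
            pass  # file system does not support (or allow) user xattrs
    side = sidecar_path(path)
    data = {}
    if os.path.exists(side):
        with open(side, encoding="utf-8") as f:
            data = json.load(f)
    data[key] = value
    with open(side, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return "sidecar"


def read_attr(path, key):
    """Read a value written by write_attr, or return None if absent."""
    if hasattr(os, "getxattr"):
        try:
            return os.getxattr(path, "user." + key).decode("utf-8")
        except OSError:
            pass  # not set as an xattr, or xattrs unsupported
    side = sidecar_path(path)
    if os.path.exists(side):
        with open(side, encoding="utf-8") as f:
            return json.load(f).get(key)
    return None
```

The caller does not need to know which storage was used: read_attr checks
the extended attribute first and only then looks for the sidecar file.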

I read something suggesting that extended file attributes are unsuitable
for Nepomuk data, as they are stored as pairs whereas Nepomuk uses triples.
Links to the details were either missing or broken.
I'm not sure I understand the problem. Surely triples and pairs can be
converted between easily enough when the base representations are
strings? E.g. "A" "B" "C" becomes either "A:B" "C" or "A" "B:C".

Are there some other limitations on extended file attributes that I'm not aware 
of?
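
To show what I mean by converting between triples and pairs, here is a
minimal sketch of the "A:B" "C" variant in Python. The function names are
made up, and percent-encoding the two parts is my own assumption so that a
":" inside a URI cannot be confused with the separator.

```python
from urllib.parse import quote, unquote


def triple_to_pair(subject, predicate, obj):
    """Fold subject and predicate into one key: ("A", "B", "C") -> "A:B", "C".

    Both parts are percent-encoded with no safe characters, so the only
    literal ":" left in the key is the separator.
    """
    key = quote(subject, safe="") + ":" + quote(predicate, safe="")
    return key, obj


def pair_to_triple(key, value):
    """Undo triple_to_pair by splitting on the single literal ":"."""
    subject, predicate = key.split(":", 1)
    return unquote(subject), unquote(predicate), value
```

For instance, triple_to_pair("A", "B", "C") gives ("A:B", "C"), and
pair_to_triple turns that back into ("A", "B", "C"). The sketch assumes
each key is used once per file; several triples sharing a subject and
predicate would have to be packed into one value.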

At the risk of re-opening old wounds, I notice Beagle uses extended
attributes, so I assume Xesam does too.
This would be another way to promote interoperability in spite of
incompatible ontologies.
For simple file tagging, Nepomuk <-> XMP <-> Xesam might work with only
the simple digiKam ontology as an interchange, e.g. xmp.digikam.TagsList.

Any thoughts?

Regards,

Bruce.