[Nepomuk] Re: Handling multiple sources of metadata

Tue May 3 18:34:37 CEST 2011

For now this is mostly about removable devices such as USB keys. In a
second step this could be applied to emails and IM messages where you
attach the additional information to the email or the message.

As far as USB keys go, the idea was to define one special hidden file or
folder in which the information would be saved as compressed text
containing some RDF serialization, preferably TriG. So the task is to
update this file whenever the meta-data of the files on the disk
changes. In this case meta-data refers to non-file meta-data, i.e. tags,
relations to people, projects, manual annotations of any kind.
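
To make the storage side a bit more concrete, here is a minimal sketch in
Python of writing and reading such a file. Everything named in it is an
assumption for illustration: the file name `.nepomuk-metadata.trig.gz` is
hypothetical (no name has been agreed on), gzip is just one possible
compression, and the TriG text is passed in as a ready-made string rather
than generated from the Nepomuk store.

```python
import gzip
from pathlib import Path

# Hypothetical file name; nothing has been decided yet.
METADATA_FILE = ".nepomuk-metadata.trig.gz"


def write_metadata(mount_point, trig_text):
    """Store TriG text gzip-compressed in a dot-file at the device root."""
    target = Path(mount_point) / METADATA_FILE
    tmp = target.with_name(target.name + ".tmp")
    # Write to a temporary file first so that an interrupted write does not
    # leave a truncated metadata file behind.
    with gzip.open(tmp, "wt", encoding="utf-8") as fh:
        fh.write(trig_text)
    tmp.replace(target)
    return target


def read_metadata(mount_point):
    """Return the stored TriG text; raises FileNotFoundError if absent."""
    target = Path(mount_point) / METADATA_FILE
    with gzip.open(target, "rt", encoding="utf-8") as fh:
        return fh.read()
```

Note that the leading dot only hides the file on Unix-like systems. On FAT
the hidden flag is a separate file attribute, so a real tool would have to
set that as well.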

At first glance this is rather simple but when you look at it a bit
deeper it becomes harder since it is not immediately clear what to save.
Example: One file on the disk is related to a project. Thus, you store
the project, too. But which details of the project do you store with it?
Do you include all the participants or only the title? Do you store the
tags the project has? And so on.

The good thing is that Vishesh and I came up with a solution for the
latter problem: we defined identifying and non-identifying properties.
In a situation like the above you would only save the identifying
properties of the project since that is all it takes to uniquely
identify it, allowing it to be merged with a counterpart representing
the same project later on.
Without a doubt, things like this need to be written up in more detail
and published. This is planned (like so many other things).
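
As a rough illustration of the idea, here is a small Python sketch.
Which properties count as identifying is purely an assumption here
(`type` and `title`); the real classification would come from the
ontologies, and resources are plain dicts only to keep the example short.

```python
# Hypothetical classification: in this sketch only "type" and "title"
# identify a resource; everything else is non-identifying.
IDENTIFYING = {"type", "title"}


def identifying_subset(resource):
    """Keep only the identifying properties of a related resource."""
    return {prop: value for prop, value in resource.items()
            if prop in IDENTIFYING}


def same_resource(a, b):
    """Two descriptions match if their identifying properties agree."""
    ia, ib = identifying_subset(a), identifying_subset(b)
    return bool(ia) and ia == ib


project = {
    "type": "Project",
    "title": "Thesis",
    "participants": ["Alice", "Bob"],   # non-identifying, not exported
    "tags": ["university"],             # non-identifying, not exported
}

# This is what would be written next to the file that references the project.
exported = identifying_subset(project)
```

On import, `same_resource(exported, local_project)` would decide whether
the project from the device is merged with one already in the local store
or created as a new resource.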

Anyway, I would suggest you start with a stand-alone tool that can
create such a file on a removable storage device when triggered
manually. The next step would then be to integrate it with Nepomuk and
let the updating be done automatically. After that we can look at
importing this information as soon as the device is mounted and
providing configuration like "Do you want meta-data to be stored on this
device?".

Cheers,
Sebastian

On 05/03/2011 03:10 PM, Bruce Adams wrote:
> 
> Hi,
> 
> I'm certainly interested. How much time I can dedicate to it is another matter.
> 
> Do you have a particular scheme in mind?
> 
> 
> Incremental improvements aside this also overlaps with network shares.
> How do you get data from one server to another, assuming both are running
> nepomuk?
> To tackle that properly you need to tackle security and multi-user issues.
> With the file-system approach you can leave it to the OS.
> 
> Regards,
> 
> Bruce.
> 
> 
> ----- Original Message ----
>> From: Sebastian Trüg <trueg at kde.org>
>> To: nepomuk at kde.org
>> Sent: Tue, May 3, 2011 10:14:26 AM
>> Subject: [Nepomuk] Re: Handling multiple sources of metadata
>>
>> Hi Bruce,
>>
>> this is what is done:
>> - We store everything in one db
>> - We index file metadata like id3 or xmp tags
>> - We have a GSoC project for metadata writeback, i.e. changed metadata in
>> the db will be written back to the file if possible
>>
>> about extended attributes:
>> - AFAIK most distributions disable them by default.
>> - they are not supported by file systems such as FAT, which is used on
>> most usb keys. thus, they do not increase interoperability much
>>
>> The idea we have is to store the metadata on the filesystem itself in a
>> cross-platform way. This has been looked into but we need someone to
>> really do it. Are you interested?
>>
>> Cheers,
>> Sebastian
>>
>> On 05/03/2011 01:33 AM, Bruce Adams wrote:
>>>
>>> Hi,
>>>     I am revisiting the idea of file tagging again.
>>> There are potentially several places to store meta data.
>>>
>>> 1 Stored in a database.
>>> 2 Embedded in the file format. E.g. XMP/EXIF
>>> 3 Stored in extended file attributes
>>> 4 Stored in a special meta-data file associated with the original file.
>>>
>>> Embedded data is explicitly mentioned here:
>>>
>>>  http://api.kde.org/4.0-api/kdelibs-apidocs/nepomuk/html/index.html
>>>
>>>
>>> with ID3 tags used as the example.
>>> What about XMP tags added, for example, in digikam?
>>>
>>> Unless I am mistaken Nepomuk currently only uses its own database.
>>> I understand the reason for this approach is that the database solution is
>>> the only one that works for all cases.
>>> (Though the link to the FAQ where I was looking was broken)
>>> I personally think it is wrong to make it the primary location as losing
>>> metadata when you copy files around is broken behaviour.
>>>
>>> I was wondering (especially with a sprint potentially coming) what the ideal
>>> system would be.
>>> This is revisiting old ground but bitrot seems to have affected my google
>>> search results so forgive me re-asking old questions.
>>>
>>> If you have multiple sources of the same data and they disagree which should
>>> be considered primary?
>>> Who is responsible for syncing them if they disagree?
>>>
>>> My thinking is as follows:
>>>
>>> File embedded data is primary.
>>> Extended file attributes are secondary and should only be used for data when
>>> the file format does not allow for embedding.
>>> Meta data associated with the original file is a simulation of the above and
>>> hence comes next.
>>>
>>> The database is last but definitely not least.
>>> If it is able the server should sync the data.
>>>
>>> For example:
>>> Given an image tagged in nepomuk (e.g. via gwenview) nepomuk or a service
>>> on its behalf should add the embedded tags itself (on gwenview's behalf,
>>> assuming gwenview didn't do it).
>>>
>>> Given an image tagged outside of nepomuk (e.g. in digikam) nepomuk should
>>> import the tags into its database the next time it needs to query the file
>>> (or when indexing it).
>>>
>>> Similarly I think extended file attributes should be imported/exported where
>>> the file system supports them
>>> and with an optional fall back to simulating them with .metadata files or
>>> similar.
>>>
>>> I read something alluding that extended file attributes are unsuitable for
>>> nepomuk data as they are stored as pairs
>>> whereas nepomuk uses triples. Hyperlinks to the details were either missing
>>> or broken.
>>> I'm not sure I understand the problem. Surely both triples and pairs can be
>>> converted between easily enough when the
>>> base representations are strings? E.g. "A" "B" "C" becomes either "A:B" "C"
>>> or "A" "B:C"
>>>
>>>
>>> Are there some other limitations on extended file attributes that I'm not
>>> aware of?
>>>
>>> At the risk of re-opening old wounds I notice beagle uses extended
>>> attributes so I assume xesam does too.
>>> This would be another way to promote interoperability in spite of
>>> incompatible ontologies.
>>> For the case of simple file tagging nepomuk <-> XMP <-> xesam might work
>>> for example with only the simple
>>> digikam ontology as an interchange. xmp.digikam.TagsList for example.
>>>
>>> Any thoughts?
>>>
>>> Regards,
>>>
>>> Bruce.
>>>  _______________________________________________
>>> Nepomuk mailing  list
>>> Nepomuk at kde.org
>>> https://mail.kde.org/mailman/listinfo/nepomuk
>>>