[Nepomuk] Re: Handling multiple sources of metadata

Wed May 4 10:53:05 CEST 2011

Hi Bruce,

On 05/03/2011 07:24 PM, Bruce Adams wrote:
>     That roughly accords with my original intentions anyway.
> I was thinking in terms of a standalone tool, library & api 
> for managing simple meta data (just tags)

IMHO it does not make sense to start with tags alone. I think it would
be much simpler to only start with literal properties, i.e. those for
which there is no need to store additional resources.
Then the next step would be to also store the additional resources, which
gets much more complicated as it also involves garbage collection when
the user removes a property.
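
To illustrate why the second step is harder, here is a minimal sketch in
Python. It is not Nepomuk code: the class names (Literal, Ref,
MetadataStore) and the URIs are made up for illustration. A literal value
can simply be dropped, while removing a reference may leave a resource
that nothing points to any more, and that resource has to be collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A plain value stored directly with the property (hypothetical type)."""
    value: str


@dataclass(frozen=True)
class Ref:
    """A pointer to an additional resource, e.g. a project (hypothetical type)."""
    uri: str


class MetadataStore:
    """Toy in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def remove(self, subject, predicate, obj):
        self.triples.discard((subject, predicate, obj))
        # Literals need no further work; references may orphan a resource.
        if isinstance(obj, Ref):
            self._collect(obj.uri)

    def _collect(self, uri):
        # Keep the resource as long as any statement still points at it.
        if any(o == Ref(uri) for (_, _, o) in self.triples):
            return
        orphaned = {t for t in self.triples if t[0] == uri}
        self.triples -= orphaned
        # Resources that were only referenced by the removed one go as well.
        for (_, _, o) in orphaned:
            if isinstance(o, Ref):
                self._collect(o.uri)
```

With two files pointing at the same project, the project's statements
survive the first removal and disappear with the second. Reference cycles
between resources are not handled by this sketch.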

> and later growing this to support integration with nepomuk.
> and incorporate other kinds of metadata.
> 
> I'm happy to hear suggestions.
> 
> There are two main design choices to consider.
>  1. the location of the metadata
>        one per file
>        one metadata area per directory
>        one per filesystem

IMHO there should be one file per file system. The reason is simple:
that way we only need to store additional resources like the previously
mentioned project once. If we had one file per dir or file then we would
have to store (and later merge) these additional resources over and over.
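
As a rough sketch of what that merging would look like (Python, purely
illustrative): resources are plain dicts here, and the set of identifying
properties is an assumption of mine, not the real list. Copies that agree
on their identifying properties are treated as the same resource.

```python
# Assumption for illustration only: a project is identified by its title.
IDENTIFYING = {"title"}


def identity_key(resource):
    """Return the identifying properties of a resource as a hashable key."""
    return tuple(sorted((k, v) for k, v in resource.items() if k in IDENTIFYING))


def merge_resources(resources):
    """Merge copies of the same resource found in several metadata files."""
    merged = {}
    unmatched = []
    for res in resources:
        key = identity_key(res)
        if not key:
            # Without identifying properties there is nothing to match on.
            unmatched.append(dict(res))
            continue
        # Later copies add to (or overwrite) the properties of earlier ones.
        merged.setdefault(key, {}).update(res)
    return list(merged.values()) + unmatched
```

With one file per file system this step is not needed at all, since each
resource is written exactly once.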

> on balance I believe per directory makes most sense.
> Though it is not that much extra complication to say a metadata area is not
> required for a sub-directory of a directory which already has one and this would
> keep the meta-data layout simple.
> 
> 2. the format of the metadata
>   binary or text
>   if text, TriG, Turtle, TriX or something else.

As I already mentioned I would prefer trig since that allows us to store
graph metadata which contains information like "when was the data
created" and "who created the data".

One could compress this file. Sadly there is no pseudo-standard for RDF
storage yet as there is for SQL (sqlite) so using redland seems weird to me.
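
For what it is worth, a compressed trig file needs nothing beyond a
standard library. The following Python sketch writes one named graph plus
a second graph describing it. The ex: namespace, the use of dcterms for
"created" and "creator", the function name to_trig and the choice of gzip
are all assumptions for the example; a real implementation would
presumably use the Nepomuk ontologies instead.

```python
import gzip
from datetime import datetime, timezone

PREFIXES = (
    "@prefix ex: <http://example.org/> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def to_trig(graph, statements, creator, created):
    """Serialize literal statements into one named graph plus its metadata.

    statements is a list of (subject IRI, prefixed predicate, literal value).
    Literal values are not escaped here, so they must not contain quotes.
    """
    body = "\n".join(f'    <{s}> {p} "{o}" .' for s, p, o in statements)
    return (
        f"{PREFIXES}\n"
        f"{graph} {{\n{body}\n}}\n\n"
        # The metadata graph records when and by whom the data was created.
        f"ex:metadata {{\n"
        f'    {graph} dcterms:created "{created.isoformat()}"^^xsd:dateTime ;\n'
        f'        dcterms:creator "{creator}" .\n'
        f"}}\n"
    )


text = to_trig(
    "ex:g1",
    [("photo.jpg", "ex:tag", "holiday")],
    "Sebastian",
    datetime(2011, 5, 4, 8, 53, 5, tzinfo=timezone.utc),
)
blob = gzip.compress(text.encode("utf-8"))  # what would be written to disk
```

Reading it back is gzip.decompress(blob).decode("utf-8") followed by
whatever trig parser is available.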

>    there is an advantage to the simplicity of <key>=<value> for just tags
>    but it will not scale well to complex meta data.
>   for binary I would imagine a standard database such as sqlite.
>   The advantage there is compactness.
> 
> There is nothing to stop either of these being configurable but it is
> sensible to start as you mean to go on.
> 
> I think metadata should live in a .metadata directory except that .metadata
> is used by eclipse.
> This is something that should be adoptable as part of the linux filesystem
> hierarchy.
> I don't think it should be .nepomuk as that might alienate gnomes.
> If all metadata is rdf .rdf might be a good choice.

I would personally go for .nepomuk for now since there will be no
collaboration with Gnome anyway (well, at least I do not believe in it
after trying for several years). But maybe you would have more luck ;)

Cheers,
Sebastian

> Anyway my intention is to start simple and go from there. No sense in running 
> before we can walk.
> 
> I will be able to test this on windows and linux. My primary target platform is 
> linux.
> I don't have access to a mac.
> The main thing is to get on and do something while I have the time and 
> enthusiasm.
> Hopefully my plans will complement yours.
> 
> Regards,
> 
> Bruce.
> 
> 
> ----- Original Message ----
>> From: Sebastian Trüg <trueg at kde.org>
>> To: Bruce Adams <tortoise_74 at yahoo.co.uk>
>> Cc: Nepomuk at kde.org
>> Sent: Tue, May 3, 2011 5:34:37 PM
>> Subject: Re: [Nepomuk] Re: Handling multiple sources of metadata
>>
>> For now this is mostly about removable devices such as USB keys. In a
>> second step this could be applied to emails and IM messages where you
>> attach the additional information to the email or the message.
>>
>> As far as USB keys go the idea was to define one special hidden file or
>> folder in which the information would be saved as compressed text
>> containing some RDF serialization - preferably trig. So the task is to
>> update this file whenever the meta-data of the files on the disk is
>> changed. In this case meta-data refers to non-file-meta-data, i.e. tags,
>> relations to people, projects, manual annotations of any kind.
>>
>> At first glance this is rather simple but when you look at it a bit
>> deeper it becomes harder since it is not immediately clear what to save.
>> Example: One file on the disk is related to a project. Thus, you store
>> the project, too. But which details of the project do you store with it?
>> Do you include all the participants or only the title? Do you store the
>> tags the project has? And so on.
>>
>> The good thing is that Vishesh and I came up with a solution for the
>> latter problem: we defined identifying and non-identifying properties.
>> In a situation like the above you would only save the identifying
>> properties of the project since that is all it takes to uniquely
>> identify it, allowing it to be merged with a counterpart representing
>> the same project later on.
>> Without a doubt things like this need to be put into more words and be
>> published. This is planned (like so many other things).
>>
>> Anyway, I would suggest you start with a stand-alone tool that can
>> create such a file on a removable storage device when triggered
>> manually. The next step would then be to integrate it with Nepomuk and
>> let the updating be done automatically. After that we can look at
>> importing this information as soon as the device is mounted and
>> providing configuration like "Do you want meta-data to be stored on this
>> device?".
>>
>> Cheers,
>> Sebastian
>>
>> On 05/03/2011 03:10 PM,  Bruce Adams wrote:
>>>
>>> Hi,
>>>
>>> I'm certainly interested. How much time I can dedicate to it is another
>>> matter.
>>>
>>> Do you have a particular scheme in mind?
>>>
>>>
>>> Incremental improvements aside, this also overlaps with network shares.
>>> How do you get data from one server to another, assuming both are running
>>> nepomuk?
>>> To tackle that properly you need to tackle security and multi-user issues.
>>> With the file-system approach you can leave it to the OS.
>>>
>>> Regards,
>>>
>>> Bruce.
>>>
>>>
>>> ----- Original Message ----
>>>> From: Sebastian Trüg <trueg at kde.org>
>>>> To: nepomuk at kde.org
>>>> Sent: Tue, May 3,  2011 10:14:26 AM
>>>> Subject: [Nepomuk] Re: Handling multiple sources of  metadata
>>>>
>>>> Hi Bruce,
>>>>
>>>> this is what is done:
>>>> - We store everything in one db
>>>> - We index file metadata like ID3 or XMP tags
>>>> - We have a GSoC project for metadata writeback, i.e. changed metadata in
>>>> the db will be written back to the file if possible
>>>>
>>>> about extended attributes:
>>>> - AFAIK most distributions disable them by default.
>>>> - they are not supported by file systems such as FAT, which is used on
>>>> most USB keys. Thus, they do not increase interoperability much
>>>>
>>>> The idea we have is to store the metadata on the filesystem itself in a
>>>> cross-platform way. This has been looked into but we need someone to
>>>> really do it. Are you interested?
>>>>
>>>> Cheers,
>>>>  Sebastian
>>>>
>>>> On 05/03/2011 01:33 AM, Bruce  Adams  wrote:
>>>>>
>>>>> Hi,
>>>>>     I am revisiting the idea of file tagging again.
>>>>> There are potentially several places to store meta data.
>>>>>
>>>>> 1 Stored in a database.
>>>>> 2 Embedded in the file format, e.g. XMP/EXIF
>>>>> 3 Stored in extended file attributes
>>>>> 4 Stored in a special meta-data file associated with the original file.
>>>>>
>>>>> Embedded  data is explicitly mentioned here:
>>>>>
>>>>>   http://api.kde.org/4.0-api/kdelibs-apidocs/nepomuk/html/index.html
>>>>>
>>>>>
>>>>>  with ID3 tags used as the example.
>>>>> What about XMP tags   added, for example, in digikam?
>>>>>
>>>>> Unless I am mistaken Nepomuk currently only uses its own database.
>>>>> I understand the reason for this approach is that the database solution
>>>>> is the only one that works for all cases.
>>>>> (Though the link to the FAQ where I was looking was broken)
>>>>> I personally think it is wrong to make it the primary location as losing
>>>>> metadata when you copy files around is broken behaviour.
>>>>>
>>>>> I was wondering (especially with a sprint potentially coming) what the
>>>>> ideal system would be.
>>>>> This is revisiting old ground but bitrot seems to have affected my google
>>>>> search results so forgive me re-asking old questions.
>>>>>
>>>>> If you have multiple sources of the same data and they disagree which
>>>>> should be considered primary?
>>>>> Who is responsible for syncing them if they disagree?
>>>>>
>>>>> My thinking is as  follows:
>>>>>
>>>>> File embedded data is primary.
>>>>> Extended file attributes are secondary and should only be used for data
>>>>> when the file format does not allow for embedding.
>>>>> Meta data associated with the original file is a simulation of the above
>>>>> and hence comes next.
>>>>>
>>>>> The database is last but definitely not least.
>>>>> If it is able the server should sync the data.
>>>>>
>>>>> For example:
>>>>>  Given an image tagged in nepomuk (e.g. via gwenview) nepomuk or a
>>>>>  service on its behalf should add the embedded tags itself (on
>>>>>  gwenview's behalf - assuming gwenview didn't do it)
>>>>>
>>>>>  Given an image tagged outside of nepomuk (e.g. in digikam) nepomuk
>>>>>  should import the tags into its database the next time it needs to
>>>>>  query the file (or when indexing it).
>>>>>
>>>>> Similarly I think extended file attributes should be imported/exported
>>>>> where the file system supports them
>>>>> and with an optional fall back to simulating them with .metadata files
>>>>> or similar.
>>>>>
>>>>> I read something alluding that extended file attributes are unsuitable
>>>>> for nepomuk data as they are stored as pairs
>>>>> whereas nepomuk uses triples. Hyperlinks to the details were either
>>>>> missing or broken.
>>>>> I'm not sure I understand the problem. Surely both triples and pairs can
>>>>> be converted between easily enough when the
>>>>> base representations are strings? E.g. "A" "B" "C" becomes either
>>>>> "A:B" "C" or "A" "B:C"
>>>>>
>>>>>
>>>>> Are there some other limitations on extended file attributes that I'm
>>>>> not aware of?
>>>>>
>>>>> At the risk of re-opening old wounds I notice beagle uses extended
>>>>> attributes so I assume xesam does too.
>>>>> This would be another way to promote interoperability in spite of
>>>>> incompatible ontologies.
>>>>> For the case of simple file tagging nepomuk <-> XMP <-> xesam might work
>>>>> for example with only the simple
>>>>> digikam ontology as an interchange. xmp.digikam.TagsList for example.
>>>>>
>>>>> Any  thoughts?
>>>>>
>>>>>  Regards,
>>>>>
>>>>> Bruce.
>>>>>   _______________________________________________
>>>>> Nepomuk  mailing  list
>>>>> Nepomuk at kde.org
>>>>> https://mail.kde.org/mailman/listinfo/nepomuk
>>>>>
>>>>
>>>
>>
>