[Digikam-users] Controlled vocabulary

Tue Feb 28 13:53:02 GMT 2012

Hello Mark,

On Mon, 27 Feb 2012, Mark Hayes (Hotmail) wrote:

> Is there anyone else generally interested in this, either specifically
> for Digikam or generally in the Open Source world ?

I am. But I totally agree with what Martin wrote a couple of hours ago.
You just can't expect to set up and use a general-purpose vocabulary;
it really depends too much on context.
An indexing system requires, as Martin said, three things :
  a vocabulary,
  + accurate semantics and meanings,
  + a language.

(In my opinion, there's also a fourth aspect: interoperability.
I will come back to this later on.)

If you don't have all three, your system is potentially dead.
In my job I use controlled vocabularies; they are all in English, and
they are restricted to strictly defined scientific areas.
For my personal images I do as Martin does, free tags, but in my
language, French, (weil mein Sohn spricht nicht Deutsch:-)
and with *my* meanings. Can't do otherwise.

Language is very important because we humans rebuild meaning from
whole sentences, not just from out-of-context keywords. I live in France,
near Paris, and if I document an image with the sentence
« Main entrance of the Hilton hotel in Paris », probably any English
reader will understand that the image shows a building, and which one.
But now if you extract the keywords entrance - hilton - hotel - paris,
what about a teenager looking for pictures of a well-known (and rich)
American woman, Paris Hilton? He or she will probably be very
disappointed with my picture:-)

And this is the limit of predefined vocabularies: they can only be
contextual. Think of the U.S. movie « Men in black », and then of
an image tagged « Black men ».
As indexing systems usually drop what is considered lexical noise
(articles, pronouns, conjunctions, etc.), you are left with the two
keywords *men* and *black*, and you're unable to guess whether they
relate to an image of African people or of people wearing black clothes.
Each word in an indexing vocabulary should have one and only one
meaning. So using context is mandatory.
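
To illustrate the point above, here is a minimal Python sketch of what
happens when an indexer drops the lexical noise. The stop-word list and
the function name are assumptions made up for this example, not taken
from any real indexing tool.

```python
# Hypothetical, very short stop-word list; real indexers use longer ones.
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "with"}


def extract_keywords(caption):
    """Lowercase the caption, split on whitespace, drop stop words."""
    words = (w.strip(".,;:!?«»\"'") for w in caption.lower().split())
    return frozenset(w for w in words if w and w not in STOP_WORDS)


movie = extract_keywords("Men in black")
people = extract_keywords("Black men")

# Both captions collapse to the same keyword set: {"men", "black"}.
print(movie == people)  # True

# The hotel caption also matches a search for the person.
hotel = extract_keywords("Main entrance of the Hilton hotel in Paris")
print({"paris", "hilton"} <= hotel)  # True
```

Once word order and the small words are gone, nothing in the keyword
set tells the two meanings apart.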

I also agree with Martin about location names. Which spelling?
We French have the really bad habit of renaming all foreign places in
something like a French way. E.g. we call the British capital, London,
« Londres ». This suits the case written by Martin,
« As long as the country/city starts with the same letter... »,
because the first four letters are the same. But we call the
capital of China, usually spelled Beijing, « Pékin ». Hem:-)
And how many German readers on this DK list will recognize the German
town we call « Aix la Chapelle »?
(It's not obvious that it's Aachen.)
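
One possible way to cope with this, sketched in Python under the
assumption that you maintain your own alias table: map every local
spelling to one spelling chosen as canonical. The table and the function
name are hypothetical, and which spelling counts as "canonical" is itself
a choice that depends on your context.

```python
import unicodedata

# Hypothetical alias table: French spellings mapped to one chosen spelling.
CANONICAL_PLACES = {
    "londres": "London",
    "pékin": "Beijing",
    "aix la chapelle": "Aachen",
}


def canonical_place(name):
    """Return the chosen spelling for a place name, or the name unchanged."""
    # Normalise accents, case, hyphens and spacing before the lookup.
    key = unicodedata.normalize("NFC", name).lower().replace("-", " ")
    key = " ".join(key.split())
    return CANONICAL_PLACES.get(key, name)


print(canonical_place("Londres"))          # London
print(canonical_place("Aix-la-Chapelle"))  # Aachen
print(canonical_place("Paris"))            # Paris (no alias, unchanged)
```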

Another problem with indexing is: what should be documented?
Answering "everything" is not a good answer, because one can probably
find hundreds of keywords for the same image. Indexing cannot be done
without considering the end users (i.e. the people who will run searches
on your index system).
Image banks that offer images for web designers and the publishing world
tag all the major colours of their images, because a designer looking
for images will usually have some design and colour scheme criteria
and will search for e.g. « woman wearing a red dress, on a blue background ».
But an image bank dedicated to wildlife will mostly tag the animals
shown (lion, springbok, shark) and not always the background colour.
(Even if, in the case of a shark, we could expect a blue background:-)

So, Mark, if you plan to sell photos in the future, perhaps the right
initial question is: to whom do you wish to sell?
If you know that, a good approach would be to select several professional
keyword lists relating to that future context.
If you don't know, prefer setting up your own tag system and tree, with
as much information as possible, for future use. Probably the best
vocabulary is the one that relates to the kind of photos you like to
shoot. No vocabulary serves all purposes and all domains.

About interoperability:
An important thing for future use is how the tag system will be saved
and which application software will need to use it.
And this is an important issue because there's no official and stable
tag schema. Digikam uses a tree-structured tag system, and this is
a real help in conveying meaning.
E.g. if I have a Digikam-tagged photo with: Localisation/France/Paris,
this seems almost clear. Localisation relates to the place where the
image was shot, France is a country, Paris is a town.

But this is Digikam specific, and stored in an application namespace,
xmp.digikam.tagslist
Another application reading the image metadata and not aware of specific
Digikam namespaces will find keywords in more standard places, e.g. the
xmp.dc.subject field, or iptc.keywords, but will find only Paris.
The tree structure is lost and the keyword Paris can be a town or the
first name of a rich American woman (see above).
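
The loss of the tree structure can be sketched in a few lines of Python.
This only models the effect described above; it does not read real image
metadata, and the second tag path (People/Paris) and the function names
are assumptions invented for the example.

```python
from collections import defaultdict


def leaf_keyword(tag_path, sep="/"):
    """Keep only the last element of a tree-structured tag path."""
    return tag_path.split(sep)[-1]


def flat_index(tag_paths):
    """Group full tag paths by the flat keyword they collapse to."""
    index = defaultdict(list)
    for path in tag_paths:
        index[leaf_keyword(path)].append(path)
    return dict(index)


# The first path is from the example above; the second is hypothetical.
paths = ["Localisation/France/Paris", "People/Paris"]
print(flat_index(paths))
# {'Paris': ['Localisation/France/Paris', 'People/Paris']}
```

An application that sees only the flat keyword cannot tell which of the
two branches the tag came from.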

And if you plan to become an image seller, you'll probably need to
set up an image bank, via a web database application.
This will not be Digikam (unless someone develops CGI tools able to
use the Digikam DB format directly).

This is a real problem that probably prevents setting up, a priori, an
indexing system for future use.
From my own point of view, it seems better to document images in free
text mode, with detailed sentences, and to keep this either in standard
metadata fields (dc.description, dc.subject, etc.) or, why not, in
separate sidecar info files. Free text can always be parsed in the future,
extracting keywords and building on demand a tag system suited to
a given application context. (And, when needed, language translations.)
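
As a rough idea of what "building on demand" could look like, here is a
small Python sketch. The two context vocabularies, the sample description
and the function name are all assumptions for illustration; a real
application would need far richer vocabularies and better text analysis.

```python
import re

# Hypothetical vocabularies, one per application context.
VOCABULARIES = {
    "design": {"red", "blue", "green", "black", "white"},
    "wildlife": {"lion", "springbok", "shark"},
}


def tags_for(description, context):
    """Extract, in order of appearance, the words known to one context."""
    vocabulary = VOCABULARIES[context]
    tags = []
    for word in re.findall(r"\w+", description.lower()):
        if word in vocabulary and word not in tags:
            tags.append(word)
    return tags


description = "A shark swimming against a deep blue background"
print(tags_for(description, "design"))    # ['blue']
print(tags_for(description, "wildlife"))  # ['shark']
```

The same free-text description yields a different tag set for each
context, and the original sentence stays available for a later, better
parser.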

Regards,
Jean-François