UTF-8 captions

Tue Jan 17 17:22:53 GMT 2017

On Tuesday 17 January 2017 08:48:30 CET Andrey Goreev wrote:
> Hello Remco,
> 
> Digikam shows the caption under the thumbnail as well as in the right panel:
> Properties -> digiKam properties/Caption
> Metadata -> EXIF/Image Description; IPTC/Caption (IPTC/Character Set shows
> UTF-8); XMP/Description, XMP/User comment; XMP/Image description;
> Captions -> Description/Captions
> 
I know where to find the captions within digiKam. What wasn't clear to me is
where _you_ saw that mutilated UTF-8.

> Here is an extract from the output of ExifTool -a -G1 -s command:
> 
>  [File]          Comment                         : ├â┬Ø├â┬¼├â┬«
> 
>  [IFD0]          ImageDescription                : ├Ø├¼├«
> 
>  [ExifIFD]       UserComment                     : ├â┬Ø├â┬¼├â┬«
> 
>  [XMP-tiff]      ImageDescription                : ├â┬Ø├â┬¼├â┬«
> 
>  [XMP-exif]      UserComment                     : ├â┬Ø├â┬¼├â┬«
> 
>  [XMP-acdsee]    Notes                           : ├â┬Ø├â┬¼├â┬«
> 
>  [XMP-dc]        Description                     : ├â┬Ø├â┬¼├â┬«
> 
>  [IPTC]          Caption-Abstract                : ├â┬Ø├â┬¼├â┬«
> 

Even stranger: this doesn't even look like the original string you posted, 
almost as if your terminal uses something like the IBM850 codepage.

So what seems to have happened: somewhere in your chain, a UTF-8 string was
interpreted using an 8-bit character encoding. And it looks like your terminal
does the same thing...
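
One way to test this diagnosis is to undo the suspected layers one at a time.
The Python below is only a sketch: it assumes the terminal really rendered
ExifTool's output as IBM850 (`cp850` in Python) and that the intermediate 8-bit
encoding was Latin-1. Both are guesses, and `undo_layer` is a made-up helper
name. The two input strings are copied from the listing above as they appeared
on screen.

```python
def undo_layer(shown: str, wrong_codec: str) -> str:
    """Reverse one mis-decoding: recover the bytes, then read them as UTF-8."""
    return shown.encode(wrong_codec).decode("utf-8")

# Strings copied from the ExifTool listing, as the terminal displayed them.
ifd0_shown = "├Ø├¼├«"          # [IFD0] ImageDescription
xmp_shown = "├â┬Ø├â┬¼├â┬«"    # [XMP-dc] Description

# Assumption: the terminal decoded ExifTool's output bytes as IBM850 (cp850).
ifd0_text = undo_layer(ifd0_shown, "cp850")
xmp_once = undo_layer(xmp_shown, "cp850")

# Assumption: the longer fields went through one extra Latin-1 misreading.
xmp_text = undo_layer(xmp_once, "latin-1")

# ascii() shows the code points without depending on the terminal encoding.
print(ascii(ifd0_text), ascii(xmp_once), ascii(xmp_text))
print(ifd0_text == xmp_text)
```

Under these assumptions both fields come out as the same three characters,
which fits the picture of one field holding the bytes as written and the others
holding a copy that was mis-decoded once more before being saved. Whether those
three characters are the caption that was originally typed is a separate
question.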

To give you an idea of what I'm talking about (hoping the strings survive the
trip through the mailing list...)
UTF-8 string:               æâÂ¢az#&ˇÉÉŠ
decoded as ISO-8859-15:     ÃŠÃ¢ÃÂ¢az#&ËÃÃÅ 
decoded as cp1254:          Ã¦Ã¢Ã‚Â¢az#&Ë‡Ã‰Ã‰Å 
decoded as IBM850:          ├ª├ó├é┬óaz#&╦ç├ë├ë┼á
(the last three are different codepages, i.e. different ways of assigning
character glyphs to 8-bit values, which were the norm before UTF-8 became more
or less universal). Note that the 4 ASCII chars in the middle (az#&) survive
intact: those are coded on 7 bits, and UTF-8 uses the same encoding as ASCII
for the first 128 characters (codes 0 to 127). Beyond that the codes differ:
UTF-8 uses 2 to 4 bytes per character, while a codepage still uses a single
byte.
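
Both points, the ASCII compatibility and the variable width, are easy to check.
A small Python sketch (the sample characters in `samples` are my own picks, not
from this thread):

```python
# ASCII text produces the same bytes in UTF-8 and in the 8-bit codepages,
# which is why "az#&" survives every misreading.
ascii_run = "az#&"
encodings = ("ascii", "utf-8", "iso8859_15", "cp1254", "cp850")
ascii_bytes = {enc: ascii_run.encode(enc) for enc in encodings}
print(ascii_bytes)

# Beyond ASCII, UTF-8 needs 2 to 4 bytes per character.
samples = ["a", "é", "€", "😀"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(widths)  # {'a': 1, 'é': 2, '€': 3, '😀': 4}
```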

Note that all of these examples use the exact same bytes, just interpreted
differently... (this would be even more striking with UTF-8 text in the
Cyrillic or Greek alphabet, but I don't have such a keyboard handy)
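
For what it's worth, the comparison can be regenerated with a few lines of
Python, which also covers the Cyrillic case without needing the keyboard. This
is a sketch: `misread` is a hypothetical helper name, and the Cyrillic word is
just a sample I picked.

```python
CODECS = ("iso8859_15", "cp1254", "cp850")

def misread(text: str) -> dict:
    """Encode text as UTF-8, then decode those same bytes with each 8-bit codec."""
    raw = text.encode("utf-8")
    return {codec: raw.decode(codec) for codec in CODECS}

mixed = "æâÂ¢az#&ˇÉÉŠ"    # the string from the examples above
cyrillic = "Привет"         # sample word, not from this thread

for text in (mixed, cyrillic):
    print(f"{'utf-8':<12}{text}")
    for codec, shown in misread(text).items():
        # repr() makes the invisible control characters of ISO-8859-15 visible
        print(f"{codec:<12}{shown!r}")
```

Encoding each garbled string back with the codec that produced it returns the
original UTF-8 bytes, so nothing is lost as long as the garbled text has not
been saved and re-encoded along the way.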