UTF-8 captions
Remco Viëtor
remco.vietor at wanadoo.fr
Tue Jan 17 17:22:53 GMT 2017
On Tuesday 17 January 2017 08:48:30 CET Andrey Goreev wrote:
> Hello Remco,
>
> Digikam shows the caption under the thumbnail as well as in the right panel:
> Properties -> digiKam properties/Caption
> Metadata -> EXIF/Image Description; IPTC/Caption (IPTC/Character Set shows
> UTF-8); XMP/Description, XMP/User comment; XMP/Image description;
> Captions -> Description/Captions
>
I know where to find the captions within Digikam. What wasn't clear to me is
where _you_ saw that mutilated UTF-8.
> Here is an extract from the output of ExifTool -a -G1 -s command:
>
> [File] Comment : ├â┬Ø├â┬¼├â┬«
>
> [IFD0] ImageDescription : ├Ø├¼├«
>
> [ExifIFD] UserComment : ├â┬Ø├â┬¼├â┬«
>
> [XMP-tiff] ImageDescription : ├â┬Ø├â┬¼├â┬«
>
> [XMP-exif] UserComment : ├â┬Ø├â┬¼├â┬«
>
> [XMP-acdsee] Notes : ├â┬Ø├â┬¼├â┬«
>
> [XMP-dc] Description : ├â┬Ø├â┬¼├â┬«
>
> [IPTC] Caption-Abstract : ├â┬Ø├â┬¼├â┬«
>
Even stranger: this doesn't even look like the original string you posted.
It looks almost as if your terminal uses something like the IBM850 codepage.
So what seems to have happened: somewhere in your chain, a UTF-8 string was
interpreted using an 8-bit character encoding. And it looks like your terminal
does the same thing when it displays the ExifTool output...
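A rough Python sketch of that guess, assuming the terminal really is IBM850 (Python calls it "cp850") and that ExifTool itself printed UTF-8. The helper name undo_misread is made up for this illustration. It turns the displayed characters back into the bytes the terminal received and reads those bytes as UTF-8:

```python
def undo_misread(shown: str, wrong_codec: str) -> str:
    """Re-encode text with the codec that was wrongly used to display it,
    then decode the resulting bytes as UTF-8."""
    return shown.encode(wrong_codec).decode("utf-8")

# Strings copied from the quoted ExifTool output above.
ifd0 = undo_misread("├Ø├¼├«", "cp850")            # [IFD0] ImageDescription
others = undo_misread("├â┬Ø├â┬¼├â┬«", "cp850")    # all the other fields

print(repr(ifd0))
print(repr(others))

# Assumption being tested: the other fields hold the IFD0 text misread
# one more time as an 8-bit encoding (Latin-1 is only a guess here).
print(undo_misread(others, "latin-1") == ifd0)
```

If the last line prints True, the terminal is only responsible for the outer layer of damage: the fields other than IFD0 already contain UTF-8 that was misread once before ExifTool ever printed it.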
To give you an idea what I'm talking about (hoping the strings pass...)
UTF-8 string: æâ¢az#&ˇÉÉŠ
same coded as cp-8859-15: Êââaz#&ËÃÃÅ
same coded as cp-1254: æâ¢az#&ˇÉÉÅ
same coded as IBM850: æâ¢az#&ˇÉÉŠ
(the last three are different 8-bit codepages, ISO-8859-15, Windows-1254 and
IBM850: different ways to assign character glyphs to 8-bit values, which was
the standard before UTF-8 became more or less generally used). Note that the
4 ASCII chars in the middle (az#&) survive intact: those are coded on 7 bits,
and UTF-8 uses the same encoding as ASCII for the first 128 code points (0 to
127). Above that, the codes differ: UTF-8 uses 2 to 4 bytes per character,
whereas the 8-bit codepages always use exactly one.
Note that all of these examples use the exact same bytes, just interpreted
differently... (this would be even more striking with the UTF-8 text in the
Cyrillic or Greek alphabet, but I don't have such a keyboard handy)
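For anyone who wants to reproduce this, a small Python sketch of the same experiment. The sample text is made up for the illustration (it is not the string from the examples above), and the codec names are Python's spellings of the three codepages:

```python
# Hypothetical sample: accented letters and a euro sign around the ASCII run "az#&".
text = "é€az#&ñ"
raw = text.encode("utf-8")  # the bytes as they would sit in the file

def misread(data: bytes, codec: str) -> str:
    """Decode the same bytes with the given codec; byte values the codec
    leaves undefined are shown as U+FFFD instead of raising an error."""
    return data.decode(codec, errors="replace")

print(len(text), "characters,", len(raw), "bytes")
for codec in ("utf-8", "iso8859-15", "cp1254", "cp850"):
    print(f"{codec:>10}: {misread(raw, codec)}")
```

Only the "utf-8" line shows the original text. The other three show different garbage around an unchanged "az#&", although the bytes never change.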