No subject

Wed Apr 17 13:17:15 BST 2019

data bits plus space for one optional parity bit.), in contrast to LatinX,
which are 8-bit encodings. The first 128 characters of Latin1 match the ASCII
character set, AFAIK, but all codes >= 128 are not defined for ASCII.
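
To make the 7-bit limit concrete, here is a minimal Python sketch of the
check I have in mind. The function names are mine (hypothetical), not taken
from any EXIF library:

```python
def is_seven_bit_ascii(text: str) -> bool:
    """Return True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in text)


def first_non_ascii(data: bytes) -> int:
    """Return the index of the first byte >= 128, or -1 if there is none."""
    for index, byte in enumerate(data):
        if byte >= 128:
            return index
    return -1
```

A Latin1 byte string passes this check only as long as it stays within the
first 128 codes, which is exactly the range shared with ASCII.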

To be sure, I looked up the standardisation papers which seem to back my
opinion:

The EXIF 2.2 standard (http://www.exif.org/Exif2-2.PDF) states on page 28
that the reference documentation for character code ASCII is ITU-T T.50 IA5
(ITU-T International Alphabet No. 5, now ITU-T IRA = International Reference
Alphabet).

The International Reference Alphabet is a 7-bit encoding; the ITU-T
recommendation document can be found at:
http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.50-199209-I!!PDF-E&type=items

So in my eyes, Latin1 strings containing characters with character codes
larger than 127 are not allowed in UserComment fields with an encoding type
of "ASCII" (or in any EXIF header field which mandates ASCII encoding). Such
a string must be recoded to Unicode and written as a UserComment field with
type "Unicode". (It'd probably be good for interoperability to use "ASCII"
if no invalid characters appear within the string.)
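
Roughly, the rule above could look like the following Python sketch. The
8-byte character codes are how I read the character code table in the EXIF
spec. The function name and the `unicode_codec` parameter are hypothetical,
and the default of UTF-8 is only an assumption, since which Unicode flavour
belongs in the payload is still open:

```python
# 8-byte character code prefixes of the UserComment field.
ASCII_CODE = b"ASCII\x00\x00\x00"
UNICODE_CODE = b"UNICODE\x00"


def build_user_comment(text: str, unicode_codec: str = "utf-8") -> bytes:
    """Prefer the "ASCII" type; fall back to "Unicode" when it does not fit.

    unicode_codec is an assumption: the Unicode flavour is not settled here.
    """
    try:
        # Pure 7-bit text: write it as "ASCII" for interoperability.
        return ASCII_CODE + text.encode("ascii")
    except UnicodeEncodeError:
        # At least one code point >= 128: switch to the "Unicode" type.
        return UNICODE_CODE + text.encode(unicode_codec)
```

Trying the strict "ascii" codec first means a plain comment never gets the
"Unicode" type by accident.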

In the case of header fields which only allow ASCII encoding, these invalid
characters would need to be transliterated. (iconv can do that, for example,
converting e.g. "ö" to "oe" and the like.)
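
As an illustration only, a transliteration step could be sketched in Python
as below. This does not call iconv and does not reproduce its rules: the
replacement table is hand-picked by me, and the accent-stripping fallback
via `unicodedata` is my own assumption about what would be acceptable:

```python
import unicodedata

# Hand-picked replacements; an illustrative table, not iconv's own rules.
REPLACEMENTS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # a, o, u with umlaut
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # A, O, U with umlaut
    "\u00df": "ss",                                  # sharp s
}


def transliterate_to_ascii(text: str, fallback: str = "?") -> str:
    """Replace every non-ASCII character with an ASCII approximation."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        else:
            # Decompose the character and keep only its ASCII base letters.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if ord(c) < 128)
            out.append(base or fallback)
    return "".join(out)
```

Characters with neither a table entry nor an ASCII base letter end up as the
fallback character, so the result is always valid 7-bit text.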

Latin1 would be acceptable with an "undefined" encoding type (8 null bytes,
see EXIF spec page 29), but that would not help interoperability at all...

The EXIF spec only refers to the Unicode spec in the case of a "Unicode"
encoding type, so just like you, I'm not sure which flavour of Unicode could
be used. I'm not familiar with the Unicode spec and have not looked up any
details so far, but the exact encoding of a Unicode file is determined by its
first few bytes, which must carry a Byte Order Mark (BOM) in the case of
UTF-16 and UTF-32, while this BOM is allowed but optional for UTF-8 files
(http://en.wikipedia.org/wiki/Byte-order_mark). Maybe the encoding used for
shorter Unicode sequences like the UserComment string is also distinguished
this way?
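
If the BOM idea holds, detecting the flavour could be sketched like this in
Python. The BOM constants come from the standard `codecs` module; the
function name is hypothetical, and whether EXIF writers actually put a BOM
into UserComment is an assumption I have not verified:

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if no BOM is present."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name, len(mark)
    return None, 0
```

The order of the table matters: FF FE 00 00 would otherwise be taken for a
UTF-16 mark. Without any BOM the function cannot decide, which is exactly
the open question for short strings.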

In this case it would probably be preferable to use UTF-8 if the input is
LatinX, as this should result in the shortest byte sequences after recoding.
The "deluxe solution" in this case would be to dynamically use the Unicode
encoding which produces the shortest byte sequence.
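
A minimal sketch of that "deluxe solution" in Python, assuming the reader
can tell the candidates apart (for instance via a BOM). The function name
and the candidate list are my own choices. Note that Python's "utf-16" and
"utf-32" codecs prepend a BOM, which counts towards the length:

```python
# Candidate encodings; on a tie the earlier entry (UTF-8) wins.
CANDIDATES = ("utf-8", "utf-16", "utf-32")


def shortest_unicode_encoding(text: str):
    """Return (codec name, encoded bytes) for the shortest candidate."""
    encoded = {name: text.encode(name) for name in CANDIDATES}
    best = min(CANDIDATES, key=lambda name: len(encoded[name]))
    return best, encoded[best]
```

For LatinX input this picks UTF-8, as expected. For text made up mostly of
CJK characters, UTF-16 tends to come out shorter, since those characters
need three bytes each in UTF-8.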
