text translation

Sat Oct 13 16:23:01 UTC 2012

On Sat, Oct 13, 2012 at 6:08 PM, jon bird <news at onastick.clara.co.uk> wrote:
> Livin’ On The Edge
>
> The issue seems to be on the translation of the accent character "’".
> In the ISO character set I believe this is 0xB4.
>
> The text is stored in the tag in unicode, with the accent character
> encoded as:
>
> 0x1920

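U+2019 (right single quotation mark) and U+00B4 (acute accent) are two
different characters, and both exist in Unicode. The bytes 0x19 0x20
you see in the tag are U+2019 in little-endian UTF-16. That character
is not representable in ISO-8859-1.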
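
The difference between the two characters can be checked outside of
TagLib. The following is only a small Python sketch using the standard
library; the variable names are made up for illustration and nothing
here is TagLib code.

```python
import unicodedata

right_quote = "\u2019"  # the character actually stored in the tag
acute = "\u00b4"        # the character the poster expected

# Look up the official Unicode names of both characters.
print(unicodedata.name(right_quote))
print(unicodedata.name(acute))

# U+00B4 has an ISO-8859-1 encoding: the single byte 0xB4.
acute_bytes = acute.encode("iso-8859-1")

# U+2019 has none, so a strict encode raises UnicodeEncodeError.
try:
    right_quote.encode("iso-8859-1")
    quote_encodable = True
except UnicodeEncodeError:
    quote_encodable = False

print(acute_bytes, quote_encodable)
```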

> As I understand it, the default text encoding is ISO-8859-1. I don't
> change this so I would expect this character to be converted to 0xb4 in
> the return string. However it isn't, what I end up with is 0x19 - in
> effect the lower byte of the original UTF-16 string.

You are right that toCString converts the string to ISO-8859-1, but it
does so by simply truncating each Unicode code point to 8 bits. That
works for characters in ISO-8859-1, because their code points are equal
to their byte values, but for characters outside of ISO-8859-1 it
returns the low byte instead of either dropping the character or
returning '?'. This could be seen as a bug, but you would not get the
original string back anyway, as it cannot be encoded in ISO-8859-1.
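
To make the effect concrete, here is a rough Python sketch of the two
behaviours: truncating each code point to its low byte, and replacing
unrepresentable characters with '?'. It is not TagLib's implementation;
the function names are hypothetical and only mimic what is described
above.

```python
def truncate_to_8_bits(text):
    """Keep only the low byte of each code point (hypothetical helper)."""
    return bytes(ord(ch) & 0xFF for ch in text)


def to_latin1_or_question_mark(text):
    """Encode to ISO-8859-1, substituting '?' for anything outside it."""
    return text.encode("iso-8859-1", errors="replace")


title = "Livin\u2019 On The Edge"

truncated = truncate_to_8_bits(title)
replaced = to_latin1_or_question_mark(title)

# The byte at index 5 is where U+2019 used to be.
print(hex(truncated[5]))
print(replaced)
```

With truncation, U+2019 turns into the control byte 0x19, which matches
what the original poster observed. With replacement the title would
come back as "Livin? On The Edge" instead. Neither variant can return
the original character.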

Lukas