Can I display Chinese character filenames in an

Mon Oct 4 22:28:13 BST 2004

On Monday 04 October 2004 18.35, James Richard Tyrer wrote:
> Robin Rosenberg wrote:
> > On Monday 04 October 2004 04.56, James Richard Tyrer wrote:
> >>Obviously, what I said is not Chinese specific.  It applies to any and
> >> all UTF-8 encoded file names.  ISO-8859-1 is a subset of UTF-8 so Latin
> >> characters will display just the same.
> >
> > No. ASCII is a subset of UTF-8.  ISO-8859-1 and UTF-8 are different and
> > incompatible (or I would be using UTF-8 today).
>
> I have: "LANG=en_us.utf8" and I have no problems.  IIRC, that is what I
> have read at authoritative sources.  But, do you mean that glyphs 128-255
> are not the same in ISO-8859-1 and UTF-8?  Perhaps there are some problems
> that I am not aware of since all I ever use (128-255) are Latin letters
> with diacritical marks.  It does appear that odd combinations of characters
> could be interpreted as something other than ISO-8859-1.

ISO-8859-1 is both an encoding and a character set, while UTF-8 is only an
encoding for the Unicode character set. The code points of the two character
sets coincide for the first 256 positions. Looked at as encodings, only the
first 128 positions (0-127, i.e. ASCII) are identical. UTF-8 can encode all
characters in the ISO-8859-1 character set, but it does so differently, using
a variable-length encoding: every character above 127 takes at least two
bytes.
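
As a rough illustration, assuming a Python 3 interpreter is at hand, the
following sketch checks which of the first 256 code points produce the same
bytes in both encodings. The variable names are made up for this example.

```python
# Compare the ISO-8859-1 and UTF-8 byte encodings of code points 0-255.
same = []       # code points whose bytes are identical in both encodings
different = []  # code points whose bytes differ

for cp in range(256):
    ch = chr(cp)
    if ch.encode("latin-1") == ch.encode("utf-8"):
        same.append(cp)
    else:
        different.append(cp)

print("identical:", same[0], "to", same[-1])
print("different:", different[0], "to", different[-1])
```

Only the ASCII range comes out identical; everything from 128 upward is a
single byte in ISO-8859-1 but two bytes in UTF-8.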

The filename "åäö" is stored as the byte sequence [e5 e4 f6] when my
locale is set to ISO-8859-1, or as [c3 a5 c3 a4 c3 b6] when using UTF-8. I
can't have it both ways. In an ISO-8859-1 locale the UTF-8 encoded name shows
up as "Ã¥Ã¤Ã¶" (unreadable garbage). In order to switch my locale from
ISO-8859-1 to UTF-8 I have to convert my filenames, as most non-ASCII
filenames would be illegal in UTF-8 (not that many programs care). The others
(non-ASCII again) will look wrong.
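
The byte sequences above can be reproduced with a short Python 3 sketch. It
is only meant to illustrate the mismatch, not how any file manager handles
it; the variable names are chosen for this example.

```python
name = "\u00e5\u00e4\u00f6"  # the filename "åäö"

latin1_bytes = name.encode("latin-1")  # what an ISO-8859-1 locale stores
utf8_bytes = name.encode("utf-8")      # what a UTF-8 locale stores

print(latin1_bytes.hex(" "))
print(utf8_bytes.hex(" "))

# UTF-8 bytes read as ISO-8859-1: every byte is valid, so you get garbage.
garbage = utf8_bytes.decode("latin-1")
print(garbage)

# ISO-8859-1 bytes read as UTF-8: the sequence is not valid UTF-8 at all.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)
```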

Do "ls filenamewithdiacriticalmarks | od -tx1" and you'll see a
variable-length encoding with one or two bytes depending on the character
(Chinese characters are even longer, typically three bytes). The original
UTF-8 design allowed up to six bytes for a single character, but RFC 3629
limits it to four, which is enough for every code point Unicode can assign.
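
To see the lengths without od, here is a small Python 3 sketch that counts
the UTF-8 bytes for a few sample characters. The samples and their labels
are picked for illustration only; U+10FFFF is the highest code point
Unicode allows.

```python
# One sample character per UTF-8 length class.
samples = {
    "ascii a": "a",                # U+0061
    "a with ring": "\u00e5",       # U+00E5
    "chinese zhong": "\u4e2d",     # U+4E2D
    "highest code point": "\U0010ffff",
}

# Number of bytes each sample needs when encoded as UTF-8.
lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, n in lengths.items():
    print(f"{label}: {n} byte(s)")
```

As Python encodes it, even the highest code point fits in four bytes.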

-- robin
___________________________________________________
This message is from the kde mailing list.