Character encodings (UTF16)

Wed Feb 9 18:17:41 GMT 2005

On Wednesday 09 February 2005 17:01, Waldo Bastian wrote:
> On Wednesday 09 February 2005 15:50, Andras Mantia wrote:
> > Hi,
> >
> >  I'm sending this to core-devel, as it affects many applications
> > including Kate (and everything using Katepart), KEdit, Konqueror and
> > maybe others. There seems to be a problem dealing with certain
> > UTF16 encoded files. The question is whether the problem is in KDE/Qt
> > or the files in question are broken. Attached is a file that renders
> > fine in Firefox and Opera, the reporter says that it was saved in NVU,
> > while it shows up as garbage in Konqueror, Kate if opened as UTF16.  In
> > KEdit it's the same as in Kate when opened in UTF8 mode ("space" after
> > every character), while Konqueror in UTF8 mode shows the source.
> >  Does anybody know if this is a real problem (wrong handling of such
> > files) or a problem in the file itself? Certainly for the user it
> > looks like a real problem, especially since there are applications out
> > there that can work with the file. If I run
> > "recode utf16LE..utf16 filename" on it, the resulting file can be opened
> > in every KDE application.
>
> I assume that the LE designation stands for "little endian" and that Qt
> defaults to "big endian". I believe one is supposed to insert a BOM (byte
> order mark) so that applications can guess correctly between utf16LE and
> utf16BE. The spaces that you see in utf8 mode are the NUL values from the
> high-bytes.
>
> I think it would be possible for konqueror to detect LE and BE by looking
> for "<NUL" versus "NUL<" and adjust accordingly. Would be easier if there
> was a separate "utf16le" codec.
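
As a rough illustration of the check described above: a byte order mark
settles the question when present, and otherwise the position of the NUL
next to the leading '<' gives the byte order. This is a sketch in Python,
not the KHTML code, and the function name is hypothetical.

```python
import codecs


def guess_utf16_by_bom_or_lt(data: bytes):
    """Return 'utf-16-be', 'utf-16-le', or None if undecided.

    Hypothetical helper for illustration only.
    """
    # A byte order mark, when present, is decisive.
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    # No BOM: assume an HTML file starts with '<' (0x3C).
    if data[:2] == b"\x00<":  # "NUL<": high byte first
        return "utf-16-be"
    if data[:2] == b"<\x00":  # "<NUL": low byte first
        return "utf-16-le"
    return None
```

A file that starts with whitespace or a comment in some other script
would return None here, so a caller would still need a fallback.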

Attached patch (khtml_utf16_endianness.patch) fixes Konqueror to correctly 
auto-detect the endianness.

Instead of relying on '<', it's perhaps nicer to have a slightly more
generic approach, something like the following algorithm:

nulcount_even = number of NULs at the first 5 even byte positions (0, 2, 4, 6, 8)
nulcount_odd = number of NULs at the first 5 odd byte positions (1, 3, 5, 7, 9)
if (nulcount_even == 0 && nulcount_odd == 5) encoding = utf16LE;
else if (nulcount_even == 5 && nulcount_odd == 0) encoding = utf16BE;

This basically relies on the fact that the first few characters in an HTML
file will all be in the ASCII range.

The second patch does just that (khtml_utf16_endianness2.patch).
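
For illustration only, the NUL-counting idea can be sketched in Python as
follows. This is not the patch itself; the function name and the `pairs`
parameter are made up for the example, and byte positions are counted
from zero.

```python
def guess_utf16_by_nul_count(data: bytes, pairs: int = 5):
    """Return 'utf-16-le', 'utf-16-be', or None if undecided.

    Hypothetical helper: inspects the first `pairs` two-byte units and
    assumes they all encode ASCII characters.
    """
    head = data[: 2 * pairs]
    if len(head) < 2 * pairs:
        return None  # not enough data to judge
    nul_even = head[0::2].count(b"\x00")  # positions 0, 2, 4, ...
    nul_odd = head[1::2].count(b"\x00")   # positions 1, 3, 5, ...
    if nul_even == 0 and nul_odd == pairs:
        return "utf-16-le"  # low byte first, NUL high byte second
    if nul_even == pairs and nul_odd == 0:
        return "utf-16-be"  # NUL high byte first
    return None
```

Anything that is not a clean pattern (plain 8-bit text, or a non-ASCII
character among the first five) leaves the result undecided, so the
normal encoding selection would still apply in that case.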

Please review.

Cheers,
Waldo
-- 
bastian at kde.org   |   Free Novell Linux Desktop 9 Evaluation Download
bastian at suse.com  |   http://www.novell.com/products/desktop/eval.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: khtml_utf16_endianness.patch
Type: text/x-diff
Size: 2139 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kde-core-devel/attachments/20050209/7780ece8/attachment.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: khtml_utf16_endianness2.patch
Type: text/x-diff
Size: 3443 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kde-core-devel/attachments/20050209/7780ece8/attachment-0001.patch>