make khtml/misc/decoder.* public

Thiago Macieira thiago at kde.org
Mon Mar 5 17:32:09 GMT 2007


Allan Sandfeld Jensen wrote:
>Safe encoding detection:
>* Look for Unicode BOMs

This is the only safe encoding detection that I know of. Everything else 
is speculative.

UTF-8: ef bb bf
UTF-16 little endian: ff fe
UTF-16 big endian: fe ff
UTF-32 little endian: ff fe 00 00
UTF-32 big endian: 00 00 fe ff
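
A minimal sketch of that check in Python, for illustration only. The
names BOMS and detect_bom and the labels it returns are made up for this
example and are not part of khtml. The one subtlety is ordering: the
UTF-16 little endian BOM is a prefix of the UTF-32 little endian one, so
the longer signatures have to be tested first.

```python
# BOM signatures from the table above, longest first so that
# ff fe 00 00 (UTF-32 LE) is not reported as ff fe (UTF-16 LE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8-bom"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (label, bom_length), or (None, 0) when no BOM is present."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

Even this is not airtight: a UTF-16 little endian file whose first
character after the BOM is U+0000 begins with the same four bytes as a
UTF-32 little endian BOM.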

Also note that "utf-8" and "utf-8 with BOM" should be treated as two
different encodings. Prepending a BOM to an existing UTF-8 file could be
catastrophic: a script, for example, no longer starts with "#!".

This file has a BOM:
$ cat /tmp/script.sh
#!/bin/sh
true
$ file /tmp/script.sh
/tmp/script.sh: Unicode text, UTF-8
$ /tmp/script.sh
bash: /tmp/script.sh: cannot execute binary file
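
One way to keep the two apart is to remember whether the BOM was there
when loading and write the file back the same way. A rough sketch in
Python, where "utf-8-sig" is Python's codec name for UTF-8 with a BOM;
the two helper functions are hypothetical names for this example:

```python
text = "#!/bin/sh\ntrue\n"

plain = text.encode("utf-8")         # no BOM, starts with "#!"
with_bom = text.encode("utf-8-sig")  # ef bb bf prepended


def decode_keeping_bom_flag(data: bytes):
    """Decode UTF-8 and report whether the input carried a BOM."""
    had_bom = data.startswith(b"\xef\xbb\xbf")
    # "utf-8-sig" strips a leading BOM if present and is plain UTF-8 otherwise.
    return data.decode("utf-8-sig"), had_bom


def encode_with_bom_flag(text: str, had_bom: bool) -> bytes:
    """Re-encode so that the BOM is written only if it was there before."""
    return text.encode("utf-8-sig" if had_bom else "utf-8")
```

With this, a script that was loaded without a BOM is saved without one,
and the first two bytes on disk stay "#!".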

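Detecting UTF-8 files without a BOM is doable because multi-byte
sequences follow a very specific high-order bit pattern. Unfortunately,
any UTF-8 file is also a legally valid byte sequence in ISO-8859-1, where
every byte value is defined, so the check can only say "probably UTF-8".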
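
To illustrate both halves of that, here is a small Python sketch. The
function name looks_like_utf8 is invented for this example; it relies
only on Python's strict UTF-8 decoder rejecting malformed sequences.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")  # strict by default: raises on bad sequences
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")         # 63 61 66 c3 a9
latin1_bytes = "café".encode("iso-8859-1")  # 63 61 66 e9

# The same UTF-8 bytes decode without error as ISO-8859-1, just wrongly.
misread = utf8_bytes.decode("iso-8859-1")   # "cafÃ©"
```

Here looks_like_utf8(utf8_bytes) is True and looks_like_utf8(latin1_bytes)
is False, but decoding as ISO-8859-1 never fails, which is why a positive
result is a guess and not a proof.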

>* Look for <?xml encoding?> in the beginning of XML-documents
>* When editing HTML documents use the whole shebang

I don't think this should be done by default. It should be offered as an
optional detection step, in addition to MIME type and BOM. Applications
that cannot afford misdetection should be able to turn it off, since it
can sometimes return wrong results.
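
Roughly what I have in mind, as a Python sketch and not a proposal for
the actual API: the names sniff_xml_encoding, detect and allow_sniffing
are hypothetical, the BOM step is left out for brevity, and letting the
MIME charset win over the XML declaration is an assumption of this
example. The regular expression only handles ASCII-compatible input.

```python
import re

# encoding="..." inside a leading <?xml ... ?> declaration.
_XML_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes):
    """Return the declared encoding in lower case, or None."""
    m = _XML_DECL.match(data[:1024])
    return m.group(1).decode("ascii").lower() if m else None


def detect(data: bytes, mime_charset=None, allow_sniffing=False):
    """Return (encoding, source); content sniffing is strictly opt-in."""
    if mime_charset:
        return mime_charset.lower(), "mime"
    if allow_sniffing:
        declared = sniff_xml_encoding(data)
        if declared:
            return declared, "xml-declaration"
    return None, "unknown"
```

An application that cannot afford a wrong guess simply leaves
allow_sniffing at its default and handles the "unknown" case itself.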

-- 
  Thiago Macieira  -  thiago (AT) macieira.info - thiago (AT) kde.org
    PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358


More information about the kde-core-devel mailing list