make khtml/misc/decoder.* public
Thiago Macieira
thiago at kde.org
Mon Mar 5 17:32:09 GMT 2007
Allan Sandfeld Jensen wrote:
>Safe encoding detection:
>* Look for Unicode BOMs
This is the only safe encoding detection that I know of. Everything else
is speculative.
UTF-8: ef bb bf
UTF-16 little endian: ff fe
UTF-16 big endian: fe ff
UTF-32 little endian: ff fe 00 00
UTF-32 big endian: 00 00 fe ff
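As a rough illustration of the table above, here is a minimal Python sketch of BOM sniffing. It is not the khtml decoder's code; the names `BOMS` and `detect_bom` and the label strings are made up for this example. The one subtle point is ordering: the UTF-32 little endian BOM begins with the same two bytes as the UTF-16 little endian one, so the longer signatures have to be tested first. (Strictly speaking, ff fe 00 00 could also be UTF-16 LE text starting with a NUL character; the sketch assumes that case does not occur.)

```python
# Hypothetical helper, not part of khtml/misc/decoder.*
# Longer signatures come first: the UTF-32 LE BOM (ff fe 00 00)
# starts with the UTF-16 LE BOM (ff fe).
BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8-with-bom"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (label, bom_length), or (None, 0) if no BOM is present."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

A caller would typically pass only the first few bytes of the file and skip `bom_length` bytes before decoding.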
Also note that "utf-8" and "utf-8 with bom" should be treated as two
different encodings. Prepending a BOM to an existing UTF-8 file could be
catastrophic (for example, for scripts, where "#!" must be the first two
bytes of the file).
This file has a BOM:
$ cat /tmp/script.sh
#!/bin/sh
true
$ file /tmp/script.sh
/tmp/script.sh: Unicode text, UTF-8
$ /tmp/script.sh
bash: /tmp/script.sh: cannot execute binary file
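The same effect can be reproduced in a few lines of Python, which happens to expose the two variants as separate codecs ("utf-8" and "utf-8-sig"). This is only a sketch to show why the two must not be confused; the variable names are arbitrary.

```python
script = "#!/bin/sh\ntrue\n"

plain = script.encode("utf-8")         # no BOM
with_bom = script.encode("utf-8-sig")  # ef bb bf prepended

# A script is only recognised as such if "#!" sits at offset 0.
print(plain[:2] == b"#!")      # True
print(with_bom[:2] == b"#!")   # False
print(with_bom[:3].hex())      # efbbbf

# On decoding, "utf-8" keeps the BOM as U+FEFF; "utf-8-sig" drops it.
print(with_bom.decode("utf-8")[0] == "\ufeff")   # True
print(with_bom.decode("utf-8-sig") == script)    # True
```

An editor that silently re-saves a BOM-less file as "utf-8 with bom" changes the first three bytes of the file in exactly this way.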
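Detecting UTF-8 files without BOM is doable because multi-byte UTF-8
sequences follow very specific high-order bit patterns, so text in another
8-bit encoding rarely validates as UTF-8 by accident. Unfortunately, the
reverse check proves nothing: any UTF-8 file is also a legally valid
ISO-8859-1 byte stream, it just decodes to the wrong characters.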
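A small Python sketch of that asymmetry, using a strict UTF-8 decode as the validity test. The function name `looks_like_utf8` is hypothetical, and a passing result is still only a guess: pure ASCII input passes too.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data is a well-formed UTF-8 byte sequence (speculative!)."""
    try:
        data.decode("utf-8")  # strict by default: malformed input raises
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "na\u00efve".encode("utf-8")         # b'na\xc3\xafve'
latin1_bytes = "na\u00efve".encode("iso-8859-1")  # b'na\xefve'

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False: 0xef is not followed by continuation bytes

# The UTF-8 bytes also decode as ISO-8859-1 without any error,
# but the result is two wrong characters instead of one right one.
print(utf8_bytes.decode("iso-8859-1") == "na\u00c3\u00afve")  # True
```

So the UTF-8 check can reject input, while an ISO-8859-1 check accepts everything and therefore tells you nothing.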
>* Look for <?xml encoding?> in the beginning of XML-documents
>* When editing HTML documents use the whole shebang
I don't think this should be done by default. It should be offered as an
optional detection step, in addition to MIME type and BOM. But applications
that cannot afford misdetection should be able to turn it off, since it
can sometimes return wrong results.
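To make the "option" idea concrete, here is a hedged Python sketch of what such an interface could look like. Everything in it is an assumption made for illustration: the function name `detect_encoding`, the `mime_charset` and `allow_guessing` parameters, the order in which sources are consulted, and the choice to put the XML declaration behind the switch. It is not the khtml decoder's API.

```python
import re

# encoding="..." inside an XML declaration at the very start of the data
XML_DECL = re.compile(
    rb'^<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def detect_encoding(data: bytes, mime_charset=None, allow_guessing=False):
    """Return an encoding label, or None if nothing was found."""
    # Non-speculative sources first (this order is an assumption).
    for bom, label in ((b"\xff\xfe\x00\x00", "utf-32-le"),
                       (b"\x00\x00\xfe\xff", "utf-32-be"),
                       (b"\xef\xbb\xbf", "utf-8-with-bom"),
                       (b"\xff\xfe", "utf-16-le"),
                       (b"\xfe\xff", "utf-16-be")):
        if data.startswith(bom):
            return label
    if mime_charset:
        return mime_charset.lower()
    if not allow_guessing:
        return None  # caller cannot afford misdetection

    # Everything below is opt-in and may return wrong results.
    match = XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

With the default `allow_guessing=False`, a file without BOM or MIME charset yields `None`, and the application has to ask the user or fall back to its own default.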
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358