[konsole] [Bug 395171] Remove UTF-16 and other non ASCII compatible encodings

Jayadevan bugzilla_noreply at kde.org
Sun May 2 20:26:03 BST 2021


https://bugs.kde.org/show_bug.cgi?id=395171

--- Comment #13 from Jayadevan <jayadevanraja at yandex.com> ---
(In reply to tcanabrava from comment #11)
> This thread is now under Community Working Group supervision.
> 
> (1) All strings should be sanitised, so that they will be perfectly safe,
> and will not break anything.
> 
> You clearly are ignoring the issues pointed out by Egmond, sanitization has
> nothing to do with this.
> 
> (2) It is racist to suggest that all non-English people are Chinese (or
> Japanese or Korean). 
> 
> Please take a look at the KDE Code of Conduct, we will not tolerate
> accusations of racism on what was meant to be an explanation based on an
> example. Whether more scripts than CJK use more bytes per encoding is
> irrelevant.
> 
> Most scripts in the world are given only 3 byte encodings per character in
> UTF-8, and not a code point per spoken word, as you say. That is a lie.
> 
> (3) The world has still not settled on UTF-16. But modern languages and
> platforms tend to do so. Java, Dotnet, ICU, KDE, QT, Windows NT, JavaScript,
> Dart, Flutter...
> In today's world, support for both the modern UTF-16 and the legacy UTF-8 is
> needed.
> 
> Patches welcome, I won't spend time working on this until the *base
> software* (bash, zsh, etc) supports it.


He said that in scripts other than English, one code point stands for a
"syllable or an entire word", and he used CJK as the example. That is a
cherry-picked example to prove a wrong point. His conclusion was that a code
point may take 3 bytes in non-Latin scripts, because each code point stands for
a word.

Most scripts, such as Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian,
Ethiopic, Cherokee, Unified Canadian Aboriginal Syllabics, Khmer, and many
others, used by billions of people, take 3 bytes per code point in UTF-8 and
have only one phoneme per code point, contrary to what he said.
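
The byte counts above can be checked with a short Python sketch. The sample
characters are illustrative picks (one letter per script, not taken from this
thread), and encoded_sizes is a hypothetical helper name. UTF-16 is encoded
without a byte order mark so that only the character itself is counted.

```python
# One sample letter per script; these picks are assumptions for illustration.
SAMPLES = {
    "Latin": "\u0061",       # a
    "Devanagari": "\u0915",  # DEVANAGARI LETTER KA
    "Tamil": "\u0B95",       # TAMIL LETTER KA
    "Thai": "\u0E01",        # THAI CHARACTER KO KAI
    "Georgian": "\u10D0",    # GEORGIAN LETTER AN
    "Ethiopic": "\u12A0",    # ETHIOPIC SYLLABLE GLOTTAL A
    "Cherokee": "\u13A0",    # CHEROKEE LETTER A
    "CJK": "\u4E2D",         # CJK ideograph "middle"
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for script, ch in SAMPLES.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{script:<12} U+{ord(ch):04X}  UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```

Every non-Latin sample here lies between U+0800 and U+FFFF, which is the range
that UTF-8 encodes in 3 bytes and UTF-16 encodes in 2.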

He cherry-picked examples to prove a wrong point. He said "The only sense
in which one can perhaps claim that UTF-8 is Anglo-centric, is that it uses 1
byte for English letters vs. 3 bytes for CJK (Chinese, Japanese, Korean)
symbols; whereas UTF-16 uses 2 for both. Given that an English letter
represents, well, a single letter of a word, whereas a CJK symbol represents a
syllable or an entire word, I actually do think UTF-8's 1:3 split is a way more
fair system." The implication is clearly that, other than English (or Latin),
the only scripts that matter are CJK. That is clearly inappropriate towards
people from South Asia and SE Asia, Cherokee people, Canadian Aboriginal
people, etc.

The scripts of South Asia and SE Asia, Cherokee, Canadian Aboriginal
Syllabics, etc. deserve the same status as English. These scripts are used by
billions of people. Claiming
that "The only sense in which one can perhaps claim that UTF-8 is
Anglo-centric, is that it uses 1 byte for English letters vs. 3 bytes for CJK"
ignores the importance of scripts used by billions of humans. It is a factually
wrong statement, and not just a case of using a bad example.

-- 
You are receiving this mail because:
You are the assignee for the bug.

