[kde] [Bug 463848] New: KDE Text Encoding for Korean (applies to KWrite and SubtitleComposer in Flatpaks)

Wed Jan 4 20:10:26 GMT 2023

https://bugs.kde.org/show_bug.cgi?id=463848

            Bug ID: 463848
           Summary: KDE Text Encoding for Korean (applies to KWrite and
                    SubtitleComposer in Flatpaks)
    Classification: I don't know
           Product: kde
           Version: unspecified
          Platform: Ubuntu
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: general
          Assignee: unassigned-bugs at kde.org
          Reporter: j_j_chiarella at posteo.net
  Target Milestone: ---

SUMMARY

The text encoding for Korean is wrong in KDE software such as KWrite and
SubtitleComposer. "Save As with Encoding ... EUC-KR" does not actually write
EUC-KR, but Unified Hangul Code/Windows-949/CP 949. "Save As with Encoding
... CP 949" just corrupts every single non-ASCII character.

STEPS TO REPRODUCE
1.  Create a text file in Unicode (UTF-8), which is the default.
2.  Insert Korean Hangul text like `로씨써쑤쪼뢔쌰쎼쓔쬬`
3a. Save As with Encoding ... EUC-KR
or
3b. Save As with Encoding ... CP 949

OBSERVED RESULT

With EUC-KR, all of the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` are present in the file.
(`로씨써쑤쪼` are in EUC-KR, but `뢔쌰쎼쓔쬬` are valid syllable blocks only in
principle and are *not* in EUC-KR.)

With CP 949 all of the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` become `??????????`.
(`로씨써쑤쪼뢔쌰쎼쓔쬬` *are* all in Windows-949/CP 949/UHC.)
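
One way to check which of the two encodings a saved file really contains is to
look at the raw bytes. The following is a minimal sketch, not anything KDE
ships: the function name `uses_uhc_extension` is hypothetical, and it assumes
that plain EUC-KR text uses only two-byte sequences with both bytes in
0xA1..0xFE, so anything outside that range must come from the CP 949/UHC
extension.

```python
def uses_uhc_extension(data: bytes) -> bool:
    """Return True if data contains a two-byte sequence outside the EUC-KR range.

    Assumption: EUC-KR (KS X 1001) uses only lead/trail bytes in 0xA1..0xFE.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Single ASCII byte (this includes a literal '?').
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated two-byte sequence")
        trail = data[i + 1]
        if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
            return True
        i += 2
    return False


# Python's own "cp949" codec stands in here for the bytes an editor would save.
in_euc_kr = "로씨써쑤쪼".encode("cp949")
everything = "로씨써쑤쪼뢔쌰쎼쓔쬬".encode("cp949")
print(uses_uhc_extension(in_euc_kr), uses_uhc_extension(everything))
```

A file saved through the menu entry labelled EUC-KR that makes this function
return True would match the behavior described above.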

EXPECTED RESULT

With EUC-KR, the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` should become `로씨써쑤쪼?????` or `로씨써쑤쪼`
because `로씨써쑤쪼` *are* in EUC-KR, but `뢔쌰쎼쓔쬬` are *not* in EUC-KR, despite being
theoretically possible arrangements of letters into pre-composed blocks.

With CP 949, the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` should all be preserved as
`로씨써쑤쪼뢔쌰쎼쓔쬬` because *all* are in CP 949/Windows-949/UHC.
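
The expected results can be sketched in Python. This is only an illustration
of the behavior requested, not how KDE or Qt implement it. The helper
`encode_strict_euc_kr` is hypothetical; it uses Python's `cp949` codec and then
keeps only characters whose bytes fall in the 0xA1..0xFE range, rather than
relying on any particular `euc_kr` codec's handling of characters outside
KS X 1001.

```python
def encode_strict_euc_kr(text: str, replacement: bytes = b"?") -> bytes:
    """Encode text, replacing anything that is not ASCII or plain EUC-KR."""
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode("cp949")
        except UnicodeEncodeError:
            # Not representable even in CP 949.
            out += replacement
            continue
        if len(encoded) == 1 or all(0xA1 <= b <= 0xFE for b in encoded):
            out += encoded          # ASCII, or inside the EUC-KR range
        else:
            out += replacement      # CP 949/UHC extension only
    return bytes(out)


text = "로씨써쑤쪼뢔쌰쎼쓔쬬"

# Expected "EUC-KR" result: the last five blocks are replaced.
print(encode_strict_euc_kr(text).decode("cp949"))

# Expected "CP 949" result: everything survives a round trip.
print(text.encode("cp949").decode("cp949") == text)
```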

SOFTWARE/OS VERSIONS

Latest Flatpak as of 2022-12-31, running on Linux (Ubuntu)

ADDITIONAL INFORMATION

EUC-KR *does* have `로씨써쑤쪼`, but it does *not* have `뢔쌰쎼쓔쬬` or `낥` or
thousands of other theoretically possible blocks. In Korean, one types letters
to form blocks. `낥` is theoretically possible. One just types `ㄴ` and `ㅏ` and
`ㄹ` and `ㅌ`. Then, the IME assembles these into the block `낥` and the computer
saves this block as a pre-composed block in Unicode.

However, this syllable `낥` never occurs in any native or borrowed words. It is
*not* in EUC-KR. A rough English/Latin script analogy would be writing "igloo"
as two pre-composed blocks, "ig" and "loo". Korean uses its own alphabet, but
with the letters arranged into monospaced blocks by morpho-phonemic syllable.
(Unicode also has conjoining individual letters. It can store `낥` as a
sequence of conjoining jamo: initial `ㄴ`, medial `ㅏ`, and the final cluster `ㄾ`
(`ㄹ` plus `ㅌ`). However, Unicode included pre-composed blocks for the sake of
round-trip conversion, and no IME has ever moved away from pre-composed blocks.
In other words, you will always see the pre-composed blocks in real-life text.)
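
The relationship between the pre-composed block and its individual letters can
be seen with Python's standard `unicodedata` module. This is just a small
illustration. Note that in canonical decomposition (NFD) the final consonant
cluster is a single conjoining jamo, so this block decomposes into three code
points.

```python
import unicodedata

syllable = "낥"

# NFD splits the pre-composed block into conjoining jamo.
decomposed = unicodedata.normalize("NFD", syllable)
code_points = [f"U+{ord(c):04X}" for c in decomposed]

# NFC puts the jamo back together into the single pre-composed block.
recomposed = unicodedata.normalize("NFC", decomposed)

print(f"U+{ord(syllable):04X}", code_points, recomposed == syllable)
```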

To correct this deficiency, Microsoft added *all* 11,172 possible pre-composed
modern Hangul blocks to a new encoding. The cost was sacrificing true ASCII
compatibility. In this encoding, as in Shift JIS and others, an ASCII-range
byte (0xxxxxxx) can appear either as a sole byte (an ASCII character) or as the
trailing byte of a two-byte character. Microsoft called its new encoding
"Windows-949" or "Code Page 949" or "Unified Hangul Code (UHC)." This price was
worth it to ensure that a typo character (`낥` instead of `날`) would not be
lost. UTF-8 everywhere is the way to go, of course. Still, many of us need to
work with the legacy encodings, especially with smart TVs. (Smart TVs and
players only seem to support some form of ISO-8859-# or a variable 1-2-byte
encoding.)
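
The ASCII-compatibility cost can be demonstrated with Python's `cp949` codec.
This is a sketch with a hypothetical helper name: it scans the pre-composed
Hangul block range (U+AC00..U+D7A3) for the first block whose trailing byte
falls in the ASCII range. A naive byte-level search for that ASCII letter would
wrongly match inside such a character.

```python
def first_syllable_with_ascii_trail():
    """Return (character, bytes) for the first Hangul block whose CP 949
    trailing byte is in the ASCII range, or None if there is none."""
    for cp in range(0xAC00, 0xD7A4):
        ch = chr(cp)
        encoded = ch.encode("cp949")
        if encoded[1] < 0x80:
            return ch, encoded
    return None


found = first_syllable_with_ascii_trail()
if found is not None:
    ch, encoded = found
    letter = encoded[1:]  # the trailing byte on its own
    # The ASCII letter is "found" in the bytes, but not in the text itself.
    print(ch, encoded, letter in encoded, letter.decode("ascii") in ch)
```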

KDE's KWrite and SubtitleComposer currently write Windows-949/CP 949/UHC when
the menu option titled `EUC-KR` is chosen, so that title is erroneous. The
separate menu option for CP 949 does not work at all. This is confusing.

SUGGESTION

1. Change the behavior of the menu entry that says `EUC-KR` so that it behaves
as expected and rejects characters like `낥`.
2. Make the menu entry that says `CP 949` just do what the menu entry called
`EUC-KR` does right now.

OR ...

1. Change the menu entry that currently and erroneously says `EUC-KR` so that
it will say `EUC-KR (Windows)` or `CP 949` or `Windows-949` or `UHC`.
2. Remove the broken menu entry that currently and erroneously claims to
support CP 949.
3. Forget about true `EUC-KR` support on saving.
