[kde] [Bug 465305] New: character counts are wrong when text includes emojis; each counted as two (2)

Sun Feb 5 05:24:25 GMT 2023

https://bugs.kde.org/show_bug.cgi?id=465305

            Bug ID: 465305
           Summary: character counts are wrong when text includes emojis;
                    each counted as two (2)
    Classification: I don't know
           Product: kde
           Version: unspecified
          Platform: Other
                OS: Other
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: general
          Assignee: unassigned-bugs at kde.org
          Reporter: kdebugs at toeai.com
  Target Milestone: ---

SUMMARY
Product/Component unknown; sorry.  Observed in kate and konsole, so probably
affects something they both depend on.

STEPS TO REPRODUCE (kate)
1. In kate, enter an emoji, e.g. 😊
2. Move the cursor back and forth from before to after the emoji.
3. Look at line:column indicator at bottom.

OBSERVED RESULT (kate)
Column jumps by two for a single character.

EXPECTED RESULT (kate)
Column should increase by only one per character.

STEPS TO REPRODUCE (konsole)
1. In kate, copy and paste the emoji until you have OVER 4000 (e.g. 4001). 
(Remember that the column number will say 8003 at the end of a line with 4001
emojis.)
2. Select them all and copy to clipboard.
3. In konsole, run 'python3'.  Then type:
len("""
4. Press Ctrl+Shift+V (or go to Edit, Paste; or right-click and select Paste).

OBSERVED RESULT (konsole)
It will ask you if you want to paste X number of characters (e.g. 8002) instead
of the correct number (e.g. 4001).
Answer 'yes'.  Then complete the python expression with:
""")
and hit enter.
The correct number of characters (e.g. 4001) is displayed.

EXPECTED RESULT (konsole)
It should count the characters correctly, not double-count them.
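
The numbers in the konsole steps can be reproduced outside of any KDE
application.  The following is a minimal sketch in Python, assuming (this is
not confirmed by the report) that the paste dialog reports a length in UTF-16
code units, which is how Qt's QString measures length.  The helper name
utf16_code_units is made up for illustration.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units `text` occupies in UTF-16."""
    # Encode without a byte-order mark; every code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F60A"   # the emoji from the report, U+1F60A
pasted = emoji * 4001  # same count as in the konsole steps

code_points = len(pasted)              # Python counts code points
code_units = utf16_code_units(pasted)  # assumed basis of the dialog's count

print(code_points, code_units)
```

If the assumption holds, the second number matches the 8002 shown in the paste
confirmation, while the first matches what python3 prints for len().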

SOFTWARE/OS VERSIONS
Kubuntu 22.10
KDE Plasma Version: 5.25.5
KDE Frameworks Version: 5.98.0
Qt Version: 5.15.6
Kate 22.08.2
Konsole 22.08.2

ADDITIONAL INFORMATION
For casual users, the number of characters may not really matter, but people
like me who do programming or work on data projects need correct character
counts, and should not have to wonder where X characters went or where X
extra characters came from.  A single Unicode code point (e.g. U+1F60A) needs
to be treated as just one character, regardless of how many bytes or code
units it requires in a particular encoding.  The whole point of working with
text instead of bytes is that you can work with characters without worrying
about how they are encoded under the hood.
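
To illustrate the point about encodings, here is a small Python sketch showing
how the same single code point takes a different number of units depending on
the encoding.  The function names encoding_sizes and surrogate_pair are
hypothetical helpers written for this example; the surrogate arithmetic follows
the standard UTF-16 rules for code points above U+FFFF.

```python
def encoding_sizes(text: str) -> dict:
    """Count the units `text` needs in several Unicode encodings."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf32_units": len(text.encode("utf-32-le")) // 4,
    }


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


sizes = encoding_sizes("\U0001F60A")
high, low = surrogate_pair(0x1F60A)
print(sizes)
print(f"U+1F60A -> {high:04X} {low:04X}")
```

Only the code point count stays at one; a count of two is what you get when
the two UTF-16 surrogates are each treated as a character.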
