Cpp Parser & multibyte chars (bug 274430)

Sun Nov 20 21:18:03 UTC 2011

On 20.11.11 20:39:21, David Nolden wrote:
> The parser uses IndexedString directly, and we have defined that the
> contents of IndexedString should be utf-8 encoded.

That's good.

> So, to get the encoding right, all we would have to do is:
> 1. Get the ranges right when the contents are utf-8 encoded
> 2. Convert contents that are not utf-8 encoded into utf-8 while reading them
> 
> Both are independent. However, I don't like the idea of using
> ".kateconfig" for configuring the encoding, that seems messy, because
> this file is meant to configure the editor, and using the information
> more extensively even for closed files feels like an obscure
> side-effect.
> 
> Doing the mapping while highlighting should not be too difficult,
> although it would require some work. We would have to read the utf-8
> encoded line, extract the specific set of column-offsets, and apply
> those offsets to the range before creating KTextEditor::Range from
> RangeInRevision. This would need reading the utf-8 specification to
> check how to extract the offsets, however I'm pretty sure that the
> utf-8 specification is easy enough regarding this.

Hmm, if the IndexedString is in utf-8, then all you really need to ensure
is that the positions generated for it are character-based rather than
byte-based, and likewise the lengths. Then no conversion is necessary at
all, since the number of characters stays the same.
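
To illustrate what I mean by character-based positions, here is a rough
sketch in plain C++ (no Qt) that turns a byte offset in a utf-8 line into a
character column. The name byteOffsetToColumn is made up for this mail and
is not an existing KDevPlatform function. It assumes the input is valid
utf-8 and counts code points; an editor that counts columns in some other
unit would need a different rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of KDevPlatform): convert a byte offset
// within a UTF-8 encoded line into a character (code point) column.
// Assumes utf8Line holds valid UTF-8.
std::size_t byteOffsetToColumn(const std::string& utf8Line, std::size_t byteOffset)
{
    std::size_t column = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8Line.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8Line[i]);
        // Continuation bytes look like 10xxxxxx; only the first byte of
        // each encoded character advances the column.
        if ((c & 0xC0) != 0x80)
            ++column;
    }
    return column;
}
```

For a line such as "int ä = 1;" the '=' sits at byte offset 7 but at
character column 6, because 'ä' takes two bytes in utf-8.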

Of course that leaves the task of actually converting the input file
to utf-8 using the correct encoding, including taking into account
whatever the user has chosen in kate if the file is opened.
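
As a toy example of that conversion step, the sketch below re-encodes
ISO-8859-1 (Latin-1) input as utf-8 in plain C++. Latin-1 is only one
possible source encoding, picked here because the mapping is trivial, and
latin1ToUtf8 is a made-up name. In practice this would presumably go
through Qt's codec classes, with the codec chosen from whatever encoding
setting we end up using.

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 (Latin-1) bytes as UTF-8.
// Every Latin-1 byte maps to the Unicode code point of the same value.
std::string latin1ToUtf8(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    for (const char ch : input) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII range is identical in both encodings.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The byte count grows for non-ASCII input, but the number of characters
does not change, which is why character-based ranges stay valid.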