Cpp Parser & multibyte chars (bug 274430)

Sat Nov 19 14:21:50 UTC 2011

On 2011-11-18 17:55, Milian Wolff wrote:
> Andreas tried to convince me in IRC that this is "broken code", since anything
> besides ASCII in C++ code is undefined.

That is not true, however. According to the standard (2.2, phases of translation):

1. Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. The set of physical source file characters accepted is
implementation-defined. Trigraph sequences (2.4) are replaced by
corresponding single-character internal representations. Any source
file character not in the basic source character set (2.3) is replaced
by the universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same
extended character expressed in the source file as a
universal-character-name (i.e., using the \uXXXX notation), are handled
equivalently.)

So any character is valid in source code, subject to an
implementation-defined mapping. Later on, 2.14.5 explains how these
characters behave in string literals.

So I am pretty sure it is perfectly valid to write "½" or "å" or whatever.
However, what exactly happens is implementation-defined (as I recall,
there is a rather big section in the GCC manual about this).
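
To make the "handled equivalently" requirement concrete, here is a
minimal sketch. It assumes the file is saved in an encoding that the
compiler actually decodes correctly (for example UTF-8 with a recent
GCC). The names kDirect, kEscaped, same_bytes and hex_bytes are made up
for this illustration.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// The character typed directly into the source file (U+00BD).
const char* const kDirect = "½";

// The same character spelled as a universal-character-name.
const char* const kEscaped = "\u00BD";

// True when both narrow strings contain exactly the same bytes.
bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Renders the bytes of a narrow string as hex, one "xx " group per byte,
// so the implementation-defined execution encoding becomes visible.
std::string hex_bytes(const char* s) {
    std::string out;
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
         *p != 0; ++p) {
        char buf[4];  // two hex digits, one space, terminating NUL
        std::snprintf(buf, sizeof buf, "%02x ", static_cast<unsigned>(*p));
        out += buf;
    }
    return out;
}
```

If the compiler decodes the file correctly, same_bytes(kDirect, kEscaped)
should hold, since the two spellings have to be treated alike. What
hex_bytes(kDirect) returns is the implementation-defined part: it depends
on the execution character set, which GCC lets you choose with
-fexec-charset (and the source encoding with -finput-charset).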

-- 
very kind regards,
Esben Mose Hansen