Patch for Java Tokenizer
Richard Dale
Richard_Dale at tipitina.demon.co.uk
Tue Apr 17 14:48:32 UTC 2001
I've been reading the definition of a Java identifier in 'The Java Language
Specification', second edition, by James Gosling, Bill Joy, Guy Steele and Gilad
Bracha:
"An identifier is an unlimited-length sequence of Java letters and Java digits,
the first of which must be a Java letter. An identifier cannot have the same
spelling (Unicode character sequence) as a keyword, boolean literal or the null
literal.
Letters and digits may be drawn from the entire Unicode character set...
...
A 'Java letter' is a character for which the method
Character.isJavaIdentifierStart returns true. A 'Java letter-or-digit' is a
character for which the method Character.isJavaIdentifierPart returns true.
...
The Java letters include uppercase and lowercase ASCII Latin letters A-Z, and
a-z, and, for historical reasons, the ASCII underscore and dollar sign."
It doesn't mention the tilde '~' character. We also need to be able to cater
for Unicode escape sequences of the form '\uxxxx', where 'x' is a hex digit. So
I think the grammar should look something like this:
WS [[:blank:]\r]+
JAVALETTER [A-Za-z_$]
UNICODE_ESCAPE \\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
LETTER [A-Za-z_~$\xc0-\xd6\xd8-\xf6\xf8-\xff]
DIGIT [0-9]
NUM {DIGIT}+
ID {JAVALETTER}({LETTER}|{DIGIT}|{UNICODE_ESCAPE})*
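
For comparison, the JLS rule quoted above can be checked directly in Java. The following is only a minimal sketch: the class name IdentifierCheck and the method isJavaIdentifier are hypothetical and not part of KDevelop or of the tokenizer. It tests the character rules only and does not reject keywords, boolean literals or null.

```java
public class IdentifierCheck {
    // Hypothetical helper: true if s satisfies the JLS character rules
    // for an identifier. Keywords and literals are NOT rejected here.
    static boolean isJavaIdentifier(String s) {
        // The first character must be a "Java letter".
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // Every following character must be a "Java letter-or-digit".
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifier("$foo_1"));      // dollar and underscore
        System.out.println(isJavaIdentifier("~Foo"));        // leading tilde
        System.out.println(isJavaIdentifier("na\u00efve"));  // Latin-1 letter
        System.out.println(isJavaIdentifier("1abc"));        // leading digit
    }
}
```

The tilde case is the one relevant to the LETTER definition: Character.isJavaIdentifierStart does not accept '~', while it does accept '$', '_' and letters outside ASCII.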
I think the user wants to edit in Unicode, and then have the Unicode characters
converted to Unicode escapes automatically when they save the file. I won't
commit anything to CVS yet, as I think this needs more thought, and it isn't
causing any serious problems.
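
To illustrate the kind of conversion on save described above, here is a rough sketch in Java. The class UnicodeEscaper and the method escapeNonAscii are hypothetical names chosen for this example; nothing like this exists in the KDevelop sources discussed here. It assumes the file contents are already available as a String and simply rewrites every non-ASCII UTF-16 code unit as a four-digit hex escape.

```java
public class UnicodeEscaper {
    // Hypothetical helper: copy ASCII characters unchanged and replace
    // every other UTF-16 code unit with a backslash, a 'u' and four
    // lowercase hex digits.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The identifier below contains two Latin-1 letters.
        System.out.println(escapeNonAscii("int gr\u00f6\u00dfe = 1;"));
    }
}
```

The output of such a step contains only ASCII, so identifiers in it would reach the tokenizer in the escaped form that the UNICODE_ESCAPE definition is meant to match. The sketch does not treat backslashes already present in the input specially, which a real implementation would have to consider.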
-- Richard
On Thu, 12 Apr 2001, you wrote:
> The following patch for the Java classparser should resolve some problems
> with non US-ASCII identifiers in Java source files. IIRC Flex does not allow
> the definition of Unicode tokens but the patch at least adds all characters
> legal for Java identifiers within the range of 0x00-0xff as defined by the
> Java Language Specification.
>
> Bye, Oliver
>
>
> - patch for parts/javasupport/tokenizer.l -
>
> @@ -50,7 +50,7 @@
> %}
>
> WS [[:blank:]\r]+
> -LETTER [A-Za-z_~]
> +LETTER [A-Za-z_~$\xc0-\xd6\xd8-\xf6\xf8-\xff]
> DIGIT [0-9]
> NUM {DIGIT}+
> ID {LETTER}+({LETTER}|{DIGIT})*
>