Patch for Java Tokenizer

Richard Dale Richard_Dale at tipitina.demon.co.uk
Tue Apr 17 14:48:32 UTC 2001


I've been reading the definition of a Java identifier in 'The Java Language
Specification', second edition, by James Gosling, Bill Joy, Guy Steele and Gilad
Bracha:

"An identifier is an unlimited-length sequence of Java letters and Java digits,
the first of which must be a Java letter. An identifier cannot have the same
spelling (Unicode character sequence) as a keyword, boolean literal or the null
literal.

Letters and digits may be drawn from the entire Unicode character set ...
...
A 'Java letter' is a character for which the method
Character.isJavaIdentifierStart returns true. A 'Java letter-or-digit' is a
character for which the method Character.isJavaIdentifierPart returns true.
...
The Java letters include uppercase and lowercase ASCII Latin letters A-Z, and
a-z, and, for historical reasons, the ASCII underscore and dollar sign."
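
As a rough illustration of the two methods named in the quote, here is a
minimal Java sketch that checks whether a string is shaped like an identifier.
The class and method names (JavaIdentifierCheck, looksLikeIdentifier) are made
up for this example, and it deliberately does not reject keywords, boolean
literals or the null literal.

```java
public final class JavaIdentifierCheck {
    private JavaIdentifierCheck() {}

    /**
     * Returns true if s starts with a "Java letter" and continues with
     * "Java letter-or-digit" characters. Keywords are NOT excluded here.
     */
    public static boolean looksLikeIdentifier(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // Work on code points so characters outside the BMP are handled too.
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With this check, a string such as "foo_$1" is accepted, while "1foo" and
"~foo" are not.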

It doesn't mention the tilde '~' character. We also need to be able to cater
for Unicode escape sequences of the form '\uxxxx', where 'x' is a hex digit. So
I think the grammar should look something like this:

WS              [[:blank:]\r]+
JAVALETTER      [A-Za-z_$]
UNICODE_ESCAPE  \\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
LETTER          [A-Za-z_~$\xc0-\xd6\xd8-\xf6\xf8-\xff]
DIGIT           [0-9]
NUM             {DIGIT}+
ID              {JAVALETTER}({LETTER}|{DIGIT}|{UNICODE_ESCAPE})*
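
To make it easier to try the ID pattern above against sample strings, here is
a hedged transliteration into java.util.regex, reading UNICODE_ESCAPE as a
reference to the definition of that name. The class name ProposedIdPattern and
the method isId are invented for this sketch. Flex matches bytes (Latin-1
here), whereas this regex matches Java chars, so it only approximates what the
tokenizer would do.

```java
import java.util.regex.Pattern;

public final class ProposedIdPattern {
    private ProposedIdPattern() {}

    // First character: ASCII letters, underscore and dollar sign only.
    private static final String JAVALETTER = "[A-Za-z_$]";
    // Later characters: ASCII letters, '_', '~', '$' and the Latin-1 letters
    // (0xc0-0xff without the multiplication and division signs 0xd7, 0xf7).
    private static final String LETTER =
        "[A-Za-z_~$\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u00ff]";
    private static final String DIGIT = "[0-9]";
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final String UNICODE_ESCAPE = "\\\\u[0-9a-fA-F]{4}";

    public static final Pattern ID = Pattern.compile(
        JAVALETTER + "(?:" + LETTER + "|" + DIGIT + "|" + UNICODE_ESCAPE + ")*");

    /** Returns true if the whole string matches the proposed ID pattern. */
    public static boolean isId(String s) {
        return ID.matcher(s).matches();
    }
}
```

If this transliteration is faithful, two things stand out: an identifier that
begins with a Latin-1 letter is not matched, because JAVALETTER is ASCII-only,
and a tilde is still accepted after the first character because LETTER keeps it.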

I think the user wants to edit in Unicode, and then have the Unicode characters
converted to Unicode escapes automatically when they save the file. I won't
commit anything to CVS yet, as I think this needs more thought and isn't
causing any serious problems.
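
The save-time conversion could look roughly like the following Java sketch. It
is only an illustration of the idea, not anything that exists in the editor:
the class name UnicodeEscaper and the method escapeNonAscii are hypothetical,
and it assumes that every char above 0x7f should become a '\uxxxx' escape.

```java
public final class UnicodeEscaper {
    private UnicodeEscaper() {}

    /**
     * Returns a copy of text in which every non-ASCII char is replaced by
     * a backslash, the letter u and four lowercase hex digits.
     * Surrogate pairs come out as two consecutive escapes.
     */
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);               // plain ASCII is copied as-is
            } else {
                String hex = Integer.toHexString(c);
                out.append("\\u");
                for (int pad = hex.length(); pad < 4; pad++) {
                    out.append('0');         // left-pad to four hex digits
                }
                out.append(hex);
            }
        }
        return out.toString();
    }
}
```

A real implementation would also need the reverse mapping when the file is
loaded, so that the user keeps editing in Unicode.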

-- Richard

On Thu, 12 Apr 2001, you wrote:
> The following patch for the Java classparser should resolve some problems 
> with non US-ASCII identifiers in Java source files. IIRC Flex does not allow 
> the definition of Unicode tokens but the patch at least adds all characters 
> legal for Java identifiers within the range of 0x00-0xff as defined by the 
> Java Language Specification.
> 
> Bye, Oliver
> 
> 
> - patch for parts/javasupport/tokenizer.l -
> 
> @@ -50,7 +50,7 @@
>  %}
>  
>  WS           [[:blank:]\r]+
> -LETTER       [A-Za-z_~]
> +LETTER       [A-Za-z_~$\xc0-\xd6\xd8-\xf6\xf8-\xff]
>  DIGIT        [0-9]
>  NUM          {DIGIT}+
>  ID           {LETTER}+({LETTER}|{DIGIT})*
> 
