[RFC] kdev-pg: AST improvements

Mon Jul 17 08:48:13 UTC 2006

Hi list,

recently I met Roberto on IRC again, and we concluded that kdevelop-pg 
generated parsers need better abstract syntax trees (ASTs) than the currently 
generated ones. (Roberto also concluded that I am now the official kdev-pg 
maintainer, which is, well, gnrr. I hate it when things depend on me ;)
But well, at least it fits in well with my current tasks :D )

Currently, kdev-pg just takes the structure of the grammar rules and applies 
that structure directly onto the generated tree. In other words, the current 
AST is not really an "abstract" tree, but rather the original parse tree.

Example: "LPAREN args=argument_list RPAREN -> parenthesized_argument_list ;;"
produces a function parse_parenthesized_argument_list() and the structure 
parenthesized_argument_list_ast with one element args, which is of the type 
argument_list_ast.
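
To make the 1:1 mapping concrete, here is a rough C++ sketch of what such 
generated code could look like. It is not actual kdev-pg output: the Token 
enum, the Parser struct, the "count" member and the use of std::unique_ptr 
are assumptions made only for this illustration.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical token kinds; a real lexer would provide these.
enum Token { Token_LPAREN, Token_RPAREN, Token_IDENTIFIER, Token_EOF };

// Simplified stand-in: an argument list is just a run of identifiers.
struct argument_list_ast {
    std::size_t count = 0;
};

// One struct per rule, one member per "name=rule" annotation.
struct parenthesized_argument_list_ast {
    std::unique_ptr<argument_list_ast> args;
};

struct Parser {
    std::vector<Token> tokens;
    std::size_t pos = 0;

    Token peek() const { return pos < tokens.size() ? tokens[pos] : Token_EOF; }

    // Consume the next token only if it has the expected kind.
    bool expect(Token t) {
        if (peek() != t) return false;
        ++pos;
        return true;
    }

    bool parse_argument_list(std::unique_ptr<argument_list_ast> &yynode) {
        yynode.reset(new argument_list_ast);
        while (expect(Token_IDENTIFIER)) ++yynode->count;
        return true;
    }

    // Mirrors the rule: LPAREN args=argument_list RPAREN
    bool parse_parenthesized_argument_list(
            std::unique_ptr<parenthesized_argument_list_ast> &yynode) {
        yynode.reset(new parenthesized_argument_list_ast);
        return expect(Token_LPAREN)
            && parse_argument_list(yynode->args)
            && expect(Token_RPAREN);
    }
};
```

The point is only that the struct layout repeats the rule layout, whether or 
not that layout is convenient for whoever walks the tree later.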

The problem with that approach lies in the characteristics of LL(1) parsers, 
which sometimes require that logically connected rules are split in two, or 
that several different rules are combined into one big rule. This can't 
really be avoided.

Example: All type declarations (class decls, interface decls, enum decls, ...) 
can begin with modifiers like public, const, abstract, and the like.
LL(1) parsers can't foresee the "class" or "interface" tokens, so all of those 
rules have to be combined into one rule, and split up after the modifiers:
"mods=optional_modifiers
(class_decl=class_declaration | interface_decl=interface_declaration | ...)
-> type_declaration ;;"
In the resulting parse tree, that makes two distinct nodes for each kind of 
type declaration where there should be only one per declaration.

Another example: The same problem, only that you don't always want to split 
the rule into subrules, like in the previous example. See and believe:
"mods=optional_modifiers type=type member_name=identifier
(LPAREN method_arguments=optional_argument_list RPAREN | 0)
SEMICOLON
-> class_member_declaration ;;"
That simplified example of a class member declaration can either be a variable 
or a method declaration. The same rule for two different kinds of 
declarations, which is necessary because both start the same way
(but splitting out the argument list doesn't really bring any gains).
For more complex rules like real class member declarations, this makes the 
generated parse tree something like VERY UGLY.
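
As an illustration of why this is awkward for users of the tree, consider the 
following C++ sketch. The node types are heavily simplified and hypothetical 
(plain strings stand in for the real sub-nodes, and describe() is a made-up 
consumer); only the member names come from the rule above.

```cpp
#include <memory>
#include <string>

// Hypothetical, heavily simplified node types for illustration only.
struct optional_argument_list_ast {
    int argument_count = 0;
};

struct class_member_declaration_ast {
    std::string mods;          // stands in for the optional_modifiers node
    std::string type;          // stands in for the type node
    std::string member_name;   // stands in for the identifier node
    // Null for variables, non-null for methods: the only hint about the kind.
    std::unique_ptr<optional_argument_list_ast> method_arguments;
};

// Every consumer has to re-derive the declaration kind from a null check.
std::string describe(const class_member_declaration_ast &node) {
    if (node.method_arguments)
        return "method " + node.member_name;
    return "variable " + node.member_name;
}
```

With more alternatives in the rule, the number of such checks grows, and each 
visitor has to repeat them.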

One possible approach for improving it would be a tree parser, which means 
transforming the original parse tree into something more usable. But that 
would require a lot of additional complexity, new concepts, yet another 
code generator, and whatnot. In general, I don't think it's a good idea.
Roberto seems to agree with me.

So, I recently had an idea that would solve several issues in kdev-pg at once, 
namely using rule arguments as the solution for this problem and for others.
This is what I plan to implement this week. I imagine it like this:

- At the end of each rule declaration in a grammar, you can specify arguments 
that ought to be passed to the rule when it is called (or should we 
say "encountered"). I think of three different modifier types:
  1. whether the variable should be stored in the AST ("member")
     or not ("temporary").
  2. whether the variable is an argument ("argument")
     or a self-defined variable ("").
  3. whether the variable is a node ("node") or a custom type ("variable").

Example:
"0  -- meaning: no tokens are consumed for this rule
-> method_declaration [
     member argument node mods: optional_modifiers;
     member argument node type: type;
     member argument node method_name: identifier;
     member argument node args: optional_arguments;
] ;;"

You get the idea.
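
For the C++ side, one possible reading is that "member" entries become 
members of the generated struct and "argument" entries become extra 
parameters of the parse function. The sketch below assumes exactly that; the 
empty placeholder structs, the function signature and the "position" 
parameter are inventions for this example, not existing kdev-pg output.

```cpp
#include <cstddef>

// Placeholder node types; names follow the grammar, contents are assumptions.
struct optional_modifiers_ast {};
struct type_ast {};
struct identifier_ast {};
struct optional_arguments_ast {};

// "member ... node" entries become members of the generated struct.
struct method_declaration_ast {
    optional_modifiers_ast *mods = nullptr;
    type_ast *type = nullptr;
    identifier_ast *method_name = nullptr;
    optional_arguments_ast *args = nullptr;
};

// "argument" entries become extra parameters of the parse function.
// The rule body is "0", so no tokens are consumed: position stays untouched.
bool parse_method_declaration(method_declaration_ast &node,
                              std::size_t &position,
                              optional_modifiers_ast *mods,
                              type_ast *type,
                              identifier_ast *method_name,
                              optional_arguments_ast *args) {
    (void)position;  // intentionally unused, see comment above
    node.mods = mods;
    node.type = type;
    node.method_name = method_name;
    node.args = args;
    return true;
}
```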

- When referencing a rule with arguments inside another rule, you have to 
supply the required arguments.

Example:
"mods=optional_modifiers type=type member_name=identifier
(  LPAREN method_arguments=optional_argument_list RPAREN SEMICOLON
   method_decl=method_declaration[mods, type, member_name, method_arguments]
 | SEMICOLON
   variable_decl=variable_declaration[mods, type, member_name]
) 
-> class_member_declaration ;;"

Starts making sense, right?

- When written like above, the original rules are still stored in 
class_member_declaration_ast, but we don't need that anymore, because the two 
subrules (and, consequently, their substructs) now contain everything we want.
So let's only store mods, type, member_name and method_arguments as local 
variables, using kdev-pg's ":" local assignment instead of the 
member-producing "=" sign. I plan to modify kdev-pg so that those local 
assignment variables are automatically declared at the beginning of the 
appropriate parse_*() function.
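
Put together, passing the shared prefix on as rule arguments and keeping it 
in locals could lead to a parse function roughly like the C++ sketch below. 
Everything here is a simplification under stated assumptions: modifiers are 
left out, types and names are plain identifier tokens, an argument list is 
just counted, and Tok, Parser, accept() and is_method are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum Token { Token_IDENTIFIER, Token_LPAREN, Token_RPAREN, Token_SEMICOLON, Token_EOF };
struct Tok { Token kind; std::string text; };

struct method_declaration_ast   { std::string type, name; int argument_count = 0; };
struct variable_declaration_ast { std::string type, name; };

// Only the two sub-declarations are members; the shared prefix is not stored.
struct class_member_declaration_ast {
    bool is_method = false;
    method_declaration_ast method_decl;
    variable_declaration_ast variable_decl;
};

struct Parser {
    std::vector<Tok> tokens;
    std::size_t pos = 0;

    Token peek() const { return pos < tokens.size() ? tokens[pos].kind : Token_EOF; }
    bool accept(Token kind, std::string *text = nullptr) {
        if (peek() != kind) return false;
        if (text) *text = tokens[pos].text;
        ++pos;
        return true;
    }

    // Rules with arguments: they consume no tokens and only fill their node.
    void parse_method_declaration(method_declaration_ast &node, const std::string &type,
                                  const std::string &name, int args) {
        node.type = type; node.name = name; node.argument_count = args;
    }
    void parse_variable_declaration(variable_declaration_ast &node, const std::string &type,
                                    const std::string &name) {
        node.type = type; node.name = name;
    }

    bool parse_class_member_declaration(class_member_declaration_ast &node) {
        // Locals declared up front, corresponding to ":" assignments.
        std::string type, member_name;
        int method_arguments = 0;

        if (!accept(Token_IDENTIFIER, &type)) return false;
        if (!accept(Token_IDENTIFIER, &member_name)) return false;
        if (accept(Token_LPAREN)) {
            while (accept(Token_IDENTIFIER)) ++method_arguments;
            if (!accept(Token_RPAREN) || !accept(Token_SEMICOLON)) return false;
            node.is_method = true;
            parse_method_declaration(node.method_decl, type, member_name, method_arguments);
            return true;
        }
        if (!accept(Token_SEMICOLON)) return false;
        parse_variable_declaration(node.variable_decl, type, member_name);
        return true;
    }
};
```

A visitor then sees a complete method or variable node and no longer needs to 
look at the surrounding class_member_declaration to find the type or name.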

- With all that, we don't really need the "%member" directive anymore, because 
additional members are declared right at the end of each rule. So drop it.
For enum declarations, there will be a "%namespace" directive instead, where 
you can put your favorite typedefs. With that, we also get rid of the 
annoying "_ast" structure suffix in custom member assignments like this:
"  REF       [: (*yynode)->modifier = parameter_modifier_ast::mod_ref; :]
 | OUT       [: (*yynode)->modifier = parameter_modifier_ast::mod_out; :]
-> parameter_modifier ;;"
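
How the contents of "%namespace" end up in the generated code is still open. 
One guess, sketched in C++ below, is a namespace named after the rule that 
holds the enum, so that actions no longer mention the "_ast" struct. The 
namespace name, modifier_enum, mod_none and the on_ref()/on_out() helpers are 
all assumptions for this example.

```cpp
// Assumed shape: code from a "%namespace" block lands in a namespace
// named after the rule (the name "parameter_modifier" is a guess).
namespace parameter_modifier {
    enum modifier_enum { mod_none, mod_ref, mod_out };
}

struct parameter_modifier_ast {
    parameter_modifier::modifier_enum modifier = parameter_modifier::mod_none;
};

// What the REF and OUT actions from the grammar could then assign.
void on_ref(parameter_modifier_ast **yynode) {
    (*yynode)->modifier = parameter_modifier::mod_ref;
}

void on_out(parameter_modifier_ast **yynode) {
    (*yynode)->modifier = parameter_modifier::mod_out;
}
```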

Short summary:

Advantages:
- We can have a customized AST while still taking advantage of kdevelop-pg's 
automatic AST and default visitor generation.
- We can selectively improve node structures where needed and leave the other 
ones as they are, with no need to duplicate them in a "tree parser" definition.
- No additional concepts to learn: the generated AST follows exactly the same 
rules as before, only with those modest additions.
- Custom member variables can be declared directly beneath the rule 
declaration instead of sitting at the top of the file, making for a better 
overview.
- We can drop class-wide instance variables like ellipsis_occured (in the Java 
parser) and let the two rules concerned communicate directly through rule 
arguments.
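
To illustrate the last advantage, here is a small C++ sketch of a flag that 
is passed between two rules as an argument instead of living in the parser 
class. It does not reproduce the Java parser: parameters are plain strings, a 
trailing "..." marks a variable-arity parameter, and both function names are 
hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a formal parameter rule.
// Sets the flag when the parameter text ends in "...".
bool parse_parameter(const std::string &text, bool &ellipsis_occured) {
    const std::string dots = "...";
    if (text.empty()) return false;
    // Look at the type part only, i.e. everything before the last space.
    const std::string type = text.substr(0, text.rfind(' '));
    if (type.size() >= dots.size()
        && type.compare(type.size() - dots.size(), dots.size(), dots) == 0)
        ellipsis_occured = true;
    return true;
}

// The flag is a local of the list rule and is handed down as a rule argument,
// so no class-wide instance variable is needed.
bool parse_parameter_list(const std::vector<std::string> &params) {
    bool ellipsis_occured = false;
    for (const std::string &p : params) {
        if (ellipsis_occured) return false;  // nothing may follow the ellipsis
        if (!parse_parameter(p, ellipsis_occured)) return false;
    }
    return true;
}
```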

Disadvantages:
- ?

Have I forgotten something? Stuff to consider? Roberto, your opinion please?
Unless anyone brings up a valid objection, I'm going to implement rule 
arguments this way from now on, and we can then have clean, proper ASTs as 
foundation for the code models / language parts afterwards.

Cheerio,
  Jakob