[RFC] kdev-pg: AST improvements
Jakob Petsovits
jpetso at gmx.at
Mon Jul 17 08:48:13 UTC 2006
Hi list,
recently I met Roberto on IRC again, and we concluded that kdevelop-pg
generated parsers need better abstract syntax trees (ASTs) than the currently
generated ones. (Roberto also concluded that I am now the official kdev-pg
maintainer, which is, well, gnrr. I hate it when things depend on me ;)
But well, at least it meets well with my current tasks :D )
Currently, kdev-pg just takes the structure of the grammar rules and applies
that structure directly onto the generated tree. In other words, the current
AST is not really an "abstract" tree, but rather the original parse tree.
Example: "LPAREN args=argument_list RPAREN -> parenthesized_argument_list ;;"
produces a function parse_parenthesized_argument_list() and the structure
parenthesized_argument_list_ast with one element args, which is of the type
argument_list_ast.
The problem with that approach are the characteristics of LL(1) parsers, which
sometimes require that logically connected rules are split in two, or that
several different rules are combined into one big rule. This can't really be
avoided.
Example: All type declarations (class decls, interface decls, enum decls, ...)
can begin with modifiers like public, const, abstract, and the likes.
LL(1) parsers can't foresee the "class" or "interface" tokens, so all of those
rules have to be combined into one rule, and split up after the modifiers:
"mods=optional_modifiers
(class_decl=class_declaration | interface_decl=interface_declaration | ...)
-> type_declaration ;;"
In the resulting parse tree, that makes two distinct rules for each kind of
type declaration where it should be only one per declaration.
Another example: The same problem, only that you don't always want to split
the rule into subrules, like in the previous example. See and believe:
"mods=optional_modifiers type=type member_name=identifier
(LPAREN method_arguments=optional_argument_list RPAREN | 0)
SEMICOLON
-> class_member_declaration ;;"
That simplified example of a class member declaration can either be a variable
or a method declaration. The same rule for two different kinds of
declarations, which is necessary because both start the same way
(but splitting out the argument list doesn't really bring any gains).
For more complex rules like real class member declarations, this makes the
generated parse tree something like VERY UGLY.
One possible approach for improving it would be a tree parser, which means
transforming the original parse tree into something more usable. But that
would require much additional complexity, new concepts, one more different
code generator, and whatnot. In general, I don't think it's a good idea.
Roberto seems to agree with me.
So, I recently had an idea that would solve several issues in kdev-pg at once,
namely how to make rule arguments the solution for this problem, and others.
This is what I plan to implement this week. I imagine it like this:
- At the end of each rule declaration in a grammar, you can specify arguments
that ought to be passed to the rule when it is called (or should we
say "encountered"). I think of three different modifier types:
1. if the variable should be stored in the AST ("member")
or not ("temporary").
2. if the variable is an argument ("argument")
or a self-defined variable ("").
3. if the variable is a node ("node") or a custom type ("variable").
Example:
"0 -- meaning: no tokens are consumed for this rule
-> method_declaration [
member argument node mods: optional_modifiers;
member argument node type: type;
member argument node method_name: identifier;
member argument node args: optional_arguments;
] ;;
You get the idea.
- When referencing a rule with arguments inside another rule, you have to
supply the required arguments.
Example:
"mods=optional_modifiers type=type member_name=identifier
( LPAREN method_arguments=optional_argument_list RPAREN SEMICOLON
method_decl=method_declaration[mods, type, member_name, args]
| SEMICOLON
variable_decl=variable_declaration[mods, type, member_name]
)
-> class_member_declaration ;;"
Starts making sense, right?
- When written like above, the original rules are still stored in
class_member_declaration_ast, but we don't need that anymore, because the two
subrules (and substructs, consecutively) now contain everything we want.
So let's only store mods, type, identifier and method_arguments as local
variable, using kdev-pg's ":" local assignment instead of the
member-producing "=" sign. I plan to modify kdev-pg so that those local
assignment variables are automatically declared at the beginning of the
appropriate parse_*() function.
- With all that, we don't really need the "%member" directive anymore, because
additional members are declared right at the end of each rule. So drop it.
For enum declarations, there will be a "%namespace" directive instead, where
you can put your favorite typedefs. With that, we also get rid of the
annoying "_ast" structure suffix in custom member assignments like this:
" REF [: (*yynode)->modifier = parameter_modifier_ast::mod_ref; :]
| OUT [: (*yynode)->modifier = parameter_modifier_ast::mod_out; :]
-> parameter_modifier ;;"
Short summary:
Advantages:
- We can have a customized AST while still taking advantage of kdevelop-pg's
automatic AST and default visitor generation.
- We can selectively improve node structures in need, and can leave the other
ones as they are, no need for duplicating them in a "tree parser" definition.
- No additional concepts to learn: the generated AST follows exactly the same
rules as before, only with those modest additions.
- Custom member variables can be declared directly beneath the rule
declaration instead of sitting at the top of the file, making for a better
overview.
- We can drop class-wide instance variables like ellipsis_occured (in the Java
parser) and let the two concerning rules communicate directly by rule
arguments.
Disadvantages:
- ?
Have I forgot something? Stuff to consider? Roberto, your opinion please?
Unless anyone brings up a valid objection, I'm going to implement rule
arguments this way from now on, and we can then have clean, proper ASTs as
foundation for the code models / language parts afterwards.
Cheerio,
Jakob
More information about the KDevelop-devel
mailing list