r++

Thu Aug 25 06:19:05 UTC 2005

I have so many thoughts and questions on this subject that I could write a 
book.  I guess I should probably take it one piece at a time.  First let me 
point out that I have found a very nice discussion of the subject of syntax 
and parsing here:

http://cs.wwc.edu/~aabyan/PLBook/HTML/Syntax.html

When I went to copy the link, I tried to get to the top of the book and found 
there is more here than I originally thought.  What I had been reading looks 
to be an older version.  There appears to be two (or more) separate books, 
and at least two version of the Programming Languages book.

What I had been looking at is rooted here:
http://cs.wwc.edu/~aabyan/PLBook/HTML/

There seems to be a more recent version here:
http://cs.wwc.edu/~aabyan/PLBook/book/node8.html

(I don't like the info-style layout as much as the free-flow of the older 
version FWIW.)

Then there's another book:

http://www.cs.wwc.edu/~aabyan/Logic/index.html
The Logical Foundations of Computer Science and Mathematics

That is the Queen of the Sciences!

Now, here's my question for Robe, or for anyone else who cares to address it.  
I now realize there are two distinct sets of nodes in a generalized parse 
tree.  There are abstract nodes, and there are concrete nodes.  The abstract 
nodes do not immediately represent tokens in the parsed language.  

Interestingly, two abstract nodes can represent the exact same range of 
terminals.  To expose this I hacked on Robe's DumpTree code a bit to create a 
poor-man's XML dumper:

// XmlDumpTree.h
#ifndef XMLDUMPTREE_H
#define XMLDUMPTREE_H

#include "default_visitor.h"
#include "XmlDumpTree.h"
#include <iostream>

class XmlDumpTree: protected DefaultVisitor {
public:
  XmlDumpTree(std::ostream &out_ = std::cout, unsigned char indent_ = 2);

  void dump(AST *node) { visit(node); }

protected:
  virtual void visit(AST *node);

private:
  std::ostream &_out;
  unsigned char _indent;    
};

#endif // XMLDUMPTREE_H

// XmlDumpTree.cpp
#include "XmlDumpTree.h" //N.B. the mixed case.
#include <iostream>

static char const * const names[] = {
  0,
  "AccessSpecifier",
  "AsmDefinition",
  "BaseClause",
  "BaseSpecifier",
  "BinaryExpression",
  "CastExpression",
  "ClassMemberAccess",
  "ClassSpecifier",
  "CompoundStatement",
  "Condition",
  "ConditionalExpression",
  "CppCastExpression",
  "CtorInitializer",
  "DeclarationStatement",
  "Declarator",
  "DeleteExpression",
  "DoStatement",
  "ElaboratedTypeSpecifier",
  "EnumSpecifier",
  "Enumerator",
  "ExceptionSpecification",
  "ExpressionOrDeclarationStatement",
  "ExpressionStatement",
  "ForStatement",
  "FunctionCall",
  "FunctionDefinition",
  "IfStatement",
  "IncrDecrExpression",
  "InitDeclarator",
  "Initializer",
  "InitializerClause",
  "LabeledStatement",
  "LinkageBody",
  "LinkageSpecification",
  "MemInitializer",
  "Name",
  "Namespace",
  "NamespaceAliasDefinition",
  "NewDeclarator",
  "NewExpression",
  "NewInitializer",
  "NewTypeId",
  "Operator",
  "OperatorFunctionId",
  "ParameterDeclaration",
  "ParameterDeclarationClause",
  "PostfixExpression",
  "PrimaryExpression",
  "PtrOperator",
  "PtrToMember",
  "ReturnStatement",
  "SimpleDeclaration",
  "SimpleTypeSpecifier",
  "SizeofExpression",
  "StringLiteral",
  "SubscriptExpression",
  "SwitchStatement",
  "TemplateArgument",
  "TemplateDeclaration",
  "TemplateParameter",
  "ThrowExpression",
  "TranslationUnit",
  "TryBlockStatement",
  "TypeId",
  "TypeIdentification",
  "TypeParameter",
  "Typedef",
  "UnaryExpression",
  "UnqualifiedName",
  "Using",
  "UsingDirective",
  "WhileStatement",
  "WinDeclSpec"
};

XmlDumpTree::XmlDumpTree(std::ostream &out_, unsigned char indent_)
  : _out(out_)
  , _indent(indent_)
{}

void XmlDumpTree::visit(AST *node) {
  static int _indent = 0;

  if (node) _out
    <<std::string(_indent * 2, ' ')<<"<"<<names[node->kind]
    <<" \"start="<<node->start_token<<"\""
    <<" \"end="<<node->end_token<<"\">"<<"\n";

  ++_indent;
  DefaultVisitor::visit(node);
  --_indent;
  if (node) _out<<std::string(_indent * 2, ' 
')<<"</"<<names[node->kind]<<">"<<'\n';

}

$ diff main.cpp /download/org/kdevelop/kdevelop-4/parser/main.cpp 
26c26
< #include "XmlDumpTree.h"
---
> #include "dumptree.h"
61,64c61,65
<   if (ast && dump) {
<     XmlDumpTree dump(std::cout);
<     dump.dump(ast);
<   }
---
>   if (ast && dump)
>     {
>       DumpTree dump(std::cout);
>       dump.dump(ast);
>     }

// HelloWorld.cpp 
#include <iostream>

int main () {
  std::cout<<"Hello World"<<std::endl;
}

 ./r++0 -dump HelloWorld.cpp
<TranslationUnit "start=0" "end=16">
  <FunctionDefinition "start=0" "end=16">
    <SimpleTypeSpecifier "start=0" "end=1">
    </SimpleTypeSpecifier>
    <InitDeclarator "start=1" "end=4">
      <Declarator "start=1" "end=4">
        <Name "start=1" "end=2">
          <UnqualifiedName "start=1" "end=2">
          </UnqualifiedName>
        </Name>
        <ParameterDeclarationClause "start=3" "end=3">
        </ParameterDeclarationClause>
      </Declarator>
    </InitDeclarator>
    <CompoundStatement "start=4" "end=16">
      <ExpressionStatement "start=5" "end=15">
        <BinaryExpression "start=5" "end=14">
          <BinaryExpression "start=5" "end=10">
            <PostfixExpression "start=5" "end=8">
              <PrimaryExpression "start=5" "end=8">
                <Name "start=5" "end=8">
                  <UnqualifiedName "start=5" "end=6">
                  </UnqualifiedName>
                  <UnqualifiedName "start=7" "end=8">
                  </UnqualifiedName>
                </Name>
              </PrimaryExpression>
            </PostfixExpression>
            <PostfixExpression "start=9" "end=10">
              <PrimaryExpression "start=9" "end=10">
                <StringLiteral "start=9" "end=10">
                </StringLiteral>
              </PrimaryExpression>
            </PostfixExpression>
          </BinaryExpression>
          <PostfixExpression "start=11" "end=14">
            <PrimaryExpression "start=11" "end=14">
              <Name "start=11" "end=14">
                <UnqualifiedName "start=11" "end=12">
                </UnqualifiedName>
                <UnqualifiedName "start=13" "end=14">
                </UnqualifiedName>
              </Name>
            </PrimaryExpression>
          </PostfixExpression>
        </BinaryExpression>
      </ExpressionStatement>
    </CompoundStatement>
  </FunctionDefinition>
</TranslationUnit>

Notice how 
<PostfixExpression "start=11" "end=14">
  <PrimaryExpression "start=11" "end=14">
     <Name "start=11" "end=14">
all share the same token range.  I assume that means [11,14).

My question is this: do any of the nodes in the parse tree formally represent 
the concrete tokens in the source code being parsed?  Can these be easily 
identified?  Is it simply a matter of determining if they are terminals?
-- 
Regards,
Steven