Adding location info to the C++ parser

Fri Mar 17 22:36:08 UTC 2006

On Friday 17 March 2006 03:17, Richard Dale wrote:
> On Friday 17 March 2006 03:53, Matt Rogers wrote:
> > Hi,
> >
> > I've come up with a plan that I intend to use to modify the C++ parser so
> > that it provides proper location info for use in KDevelop. I submit my
> > plan here so that it can be reviewed, suggestions can be provided, or I
> > can be flat out told I'm wrong.  :)
> >
> > ==========================
> >
> > Plan:
> >
> > Change: Modify the preprocessor so it does not strip indentation or blank
> > lines (blank lines mostly appear where comments have been removed)
> >
> > Reason: Proper column information is needed and if the preprocessor
> > removes indentation that will mess up column information. If the
> > preprocessor removes comments and the newline that follows them, then the
> > line information is automatically thrown off.
> >
> > Change: Verify that the preprocessor outputs line number markers similar
> > to those output by gcc -E and, if it does not, modify the preprocessor so
> > that it does
> >
> > Reason: This needs to be done to ensure that the parser (via the
> > tokenizer) has proper line numbers to work with.
> >
> > Change: Modify the tokenizer to store line and column information within
> > the tokens
> >
> > Reason: This needs to be done so that the parser can add this information
> > to the code model via the binder
> >
> > ==========================
> >
> > Please let me know what you think, if I'm on the right track, if I'm just
> > completely wrong, if I've left out something, etc. I would appreciate any
> > feedback. I will attempt to keep the parser as fast as it is now, but I
> > can't guarantee anything.
>
> I think you need to have two sets of tokens: the first set when parsing the
> original source before preprocessing, and these tokens would have
> line/column info for the original source. Then after preprocessing there
> would need to be a second set of tokens which are passed to the language
> parser. The second set of tokens might have pointers to the token in the
> first set that they were 'derived' from via a preprocessor expansion. The
> reason for this is that if the parser is to be used for refactoring it must
> be able to know which chunks of text in the original source correspond to a
> particular grammar rule, and as far as I can see this can only be done by
> introducing an extra set of tokens and with an extra level of indirection
> in the second set.
>
> I don't think you can get round the problem by not stripping comments and
> white space, because a macro expansion on a particular line will obviously
> screw up the column info of any items on the same line that follow it.
>
> -- Richard
>
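
A minimal sketch of the two-token-set idea described above, assuming hypothetical names (SourceToken, ParserToken, expand, originOf) that are not taken from the actual KDevelop parser. It uses an index into the first token set rather than a raw pointer for the "derived from" link, and handles only a single object-like macro, just to show how tokens produced by an expansion can still be traced back to a line/column in the original source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the real parser classes.

// First set: tokens of the original, unpreprocessed source.
struct SourceToken {
    std::string text;
    int line;    // line in the original file
    int column;  // column in the original file
};

// Second set: tokens handed to the language parser after preprocessing.
struct ParserToken {
    std::string text;
    std::size_t origin;  // index of the SourceToken this token was derived from
};

// Follow the extra level of indirection back to the original token.
inline const SourceToken& originOf(const ParserToken& tok,
                                   const std::vector<SourceToken>& source) {
    assert(tok.origin < source.size());
    return source[tok.origin];
}

// Expand one object-like macro `name` into `body`. Every token produced by
// the expansion records the index of the macro-name token it came from;
// all other tokens simply point at themselves in the first set.
inline std::vector<ParserToken> expand(const std::vector<SourceToken>& source,
                                       const std::string& name,
                                       const std::vector<std::string>& body) {
    std::vector<ParserToken> out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i].text == name) {
            for (const std::string& piece : body)
                out.push_back({piece, i});
        } else {
            out.push_back({source[i].text, i});
        }
    }
    return out;
}
```

With this layout, a token that follows a macro use on the same line still reports the column it had in the original file, however long the expansion was, which is the property a refactoring tool would rely on.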

Yes, I hadn't thought about that. After thinking about it a bit more, I'm
pretty sure a preprocess pass would only be needed to pull in symbols from
includes so that they're parseable for code completion purposes and to verify
that macros used are actually present.

Anyway, I guess the thing to do is to make the binder see the difference
between the preprocessed source and the original source and to sort of merge
the two. Sound sane?
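
One ingredient for relating preprocessed output back to the original source is the line markers mentioned in the plan. gcc -E emits them in the form `# linenum "filename" flags`. The following is only a rough sketch of reading such a marker; LineMarker and parseLineMarker are made-up names, the flags are ignored, and escaped quotes inside the file name are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type; not part of any existing parser.
struct LineMarker {
    int line = 0;
    std::string file;
};

// Parse a marker such as:  # 42 "foo.h" 1
// Returns true and fills `out` on success; returns false for anything else,
// including directives like "#pragma" that start with '#' but have no number.
inline bool parseLineMarker(const std::string& text, LineMarker& out) {
    std::size_t i = 0;
    if (text.empty() || text[i] != '#') return false;
    ++i;
    while (i < text.size() && text[i] == ' ') ++i;

    // Read the decimal line number.
    const std::size_t digitsStart = i;
    int line = 0;
    while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i]))) {
        line = line * 10 + (text[i] - '0');
        ++i;
    }
    if (i == digitsStart) return false;

    // Read the quoted file name.
    while (i < text.size() && text[i] == ' ') ++i;
    if (i >= text.size() || text[i] != '"') return false;
    const std::size_t close = text.find('"', i + 1);
    if (close == std::string::npos) return false;
    assert(close > i);  // the search started past the opening quote

    out.line = line;
    out.file = text.substr(i + 1, close - i - 1);
    return true;
}
```

A tokenizer working on preprocessed text could call something like this on each line that starts with '#' and reset its current file and line counter whenever it returns true.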
--
Matt