Adding location info to the C++ parser

Fri Mar 17 10:45:49 UTC 2006

On Friday 17 March 2006 03:53, Matt Rogers wrote:
> Hi,
>
> I've come up with a plan that I intend to use to modify the C++ parser so
> that it provides proper location info for use in KDevelop. I submit my plan
> here so that it can be reviewed, suggestions can be provided, or I can be
> flat out told I'm wrong.  :)
>
> ==========================
>
> Plan:
>
> Change: Modify the preprocessor so it does not strip indentation or blank
> lines (blank lines mostly arise where comments are removed)
>
> Reason: Proper column information is needed and if the preprocessor removes
> indentation that will mess up column information. If the preprocessor
> removes comments and the newline that follows them, then the line
> information is automatically thrown off.
>
> Change: Verify the preprocessor outputs line number markers similar to
> those output by gcc -E and, if it does not, modify it to do so
>
> Reason: This needs to be done to ensure that the parser (via the tokenizer)
> has proper line numbers to work with.
>
> Change: Modify the tokenizer to store line and column information within
> the tokens
>
> Reason: This needs to be done so that the parser can add this information
> to the code model via the binder
>
> ==========================
>
> Please let me know what you think, if I'm on the right track, if I'm just
> completely wrong, if I've left out something, etc. I would appreciate any
> feedback. I will attempt to keep the parser as fast as it is now, but I
> can't guarantee anything.

I think you need two sets of tokens. The first set comes from parsing the
original source before preprocessing, and these tokens would have line/column
info for the original source. After preprocessing there would be a second
set of tokens, which is passed to the language parser. The second set of
tokens might have pointers to the tokens in the first set that they were
'derived' from via a preprocessor expansion. The reason for this is that if
the parser is to be used for refactoring, it must be able to know which
chunks of text in the original source correspond to a particular grammar
rule, and as far as I can see this can only be done by introducing an extra
set of tokens, with an extra level of indirection in the second set.
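
To make the indirection concrete, here is a rough sketch of what I have in
mind. All the names (SourceToken, ParsedToken, originOf) are made up for
illustration and are not taken from the existing parser; the example data
assumes the line "int x = MAX;" with "#define MAX 1 + 2" in effect.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types; none of these names come from the KDevelop parser.

// A token lexed from the original, unpreprocessed file.
struct SourceToken {
    std::string text;
    int line;    // 1-based line in the original file
    int column;  // 1-based column in the original file
};

// A token handed to the language parser after preprocessing.
struct ParsedToken {
    std::string text;
    std::size_t origin;  // index of the SourceToken it was derived from
};

// Map a parser token back to its position in the original source.
const SourceToken& originOf(const std::vector<SourceToken>& source,
                            const ParsedToken& tok) {
    assert(tok.origin < source.size());
    return source[tok.origin];
}

// First set: tokens of the original line "int x = MAX;".
inline std::vector<SourceToken> exampleSource() {
    return {{"int", 1, 1}, {"x", 1, 5}, {"=", 1, 7}, {"MAX", 1, 9}, {";", 1, 12}};
}

// Second set: tokens after expanding MAX to "1 + 2".
// All three expansion tokens point back at the single MAX token.
inline std::vector<ParsedToken> exampleParsed() {
    return {{"int", 0}, {"x", 1}, {"=", 2}, {"1", 3}, {"+", 3}, {"2", 3}, {";", 4}};
}
```

With this layout a refactoring tool that matched a grammar rule over the
second set can follow the origin indices to find the span of original text
to rewrite, even when part of the rule came out of a macro.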

I don't think you can get round the problem by not stripping comments and 
white space, because a macro expansion on a particular line will obviously 
screw up the column info of any items on the same line that follow it.
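
A small illustration of the column problem, using a deliberately naive
stand-in for macro expansion (plain text replacement, no token boundaries or
arguments). The helper names expandOnce and columnOf are invented for this
example.

```cpp
#include <cstddef>
#include <string>

// Replace the first occurrence of `name` in `line` with `body`.
// A naive stand-in for macro expansion, for illustration only.
std::string expandOnce(std::string line, const std::string& name,
                       const std::string& body) {
    std::size_t pos = line.find(name);
    if (pos != std::string::npos) {
        line.replace(pos, name.size(), body);
    }
    return line;
}

// 1-based column of the first occurrence of `needle` in `line`, or 0 if absent.
std::size_t columnOf(const std::string& line, const std::string& needle) {
    std::size_t pos = line.find(needle);
    return pos == std::string::npos ? 0 : pos + 1;
}
```

For the line "x = MAX + y;" the identifier y sits at column 11. After
expanding MAX to "(1 + 2)" the line becomes "x = (1 + 2) + y;" and y has
moved to column 15, even though no whitespace or comment was stripped.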

-- Richard