Fixing the preprocessor

Tue May 26 11:52:38 UTC 2009

On Tuesday 26 May 2009 11:26:42, Christoph Bartoschek wrote:
> Hi,
>
> I am trying to fix the preprocessor. My goal is to correctly parse the
> following code:
>
> #define MA(x) T<x> a
> #define MB(x) T<x>
> #define MC(X) int
> #define MD(X) c
>
> template <typename P1> struct A {};
> template <typename P2> struct T {};
>
> int main(int argc, char ** argv) {
>   MA(A<int>);
>   A<MB(int)> b;
>   MC(a)MD(b);
>   MC(a)d;
> }
>
> Currently the output of the preprocessor is:
>
> template <typename P1> struct A {};
> template <typename P2> struct T {};
>
> int main(int argc, char ** argv) {
>   T<A<int>> a;
>   A<T<int>> b;
>   intc;
>   intd;
> }
>
> All four declarations are wrong. They are all instances of the same
> error: after preprocessing, tokens are not allowed to merge, but kdevelop
> ignores this.
>
> The cpp fixes this by adding spaces where necessary.
>
> What is the best way to handle this in kdevelop?
>
> 1. One solution would be to check, on macro expansion, what the last
> character in the output stream is and to insert a space if necessary.
> This would solve the first three declarations. The last requires a check
> after macro expansion. The check would also need a table of invalid
> character combinations. Altogether such a fix would be quite big and ugly.
>
> 2. Another solution would be to always add a space before and after a macro
> expansion. This would produce different output than cpp, but would it cause
> harm?
>
> 3. A third idea: The output stream of the preprocessor consists of a
> string split into substrings. What about having these
> substrings match the tokens of the program? Then further processing steps
> would no longer merge any tokens. This also requires lots of work but seems
> quite clean to me.
>
>
> What is your opinion and do you have a better solution for the problem?
>
> Christoph

The best option would probably be to simply make sure that tokens are not
merged where that is not desired. That is probably the closest to what cpp
does, although the "stringified" preprocessor output would still look the same
as it does now (this is roughly your idea no. 3).

From what I understand, it should work like this: merge tokens only within a
macro expansion and only when a "##" sits between them; otherwise always keep
them separate.
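
To illustrate that rule, here is a minimal sketch. It is not the KDevelop
code: TokenList, pasteTokens and render are hypothetical names, and it assumes
a macro's replacement list is already available as a list of token strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical representation: one string per preprocessing token.
using TokenList = std::vector<std::string>;

// Merge the two operands of every "##" in a replacement list.
// All other tokens are passed through untouched and stay separate.
// A leading or trailing "##" (ill-formed in real code) is kept as-is.
TokenList pasteTokens(const TokenList& replacement) {
    TokenList result;
    for (std::size_t i = 0; i < replacement.size(); ++i) {
        const bool isPaste = replacement[i] == "##" && !result.empty() &&
                             i + 1 < replacement.size();
        if (isPaste) {
            result.back() += replacement[++i];  // glue right operand to left
        } else {
            result.push_back(replacement[i]);
        }
    }
    return result;
}

// Stringify a token list with a space between neighbours, so that
// separate tokens can never fuse when the text is lexed again.
std::string render(const TokenList& tokens) {
    std::string text;
    for (const std::string& token : tokens) {
        if (!text.empty()) text += ' ';
        text += token;
    }
    return text;
}

// Usage sketch: tokens only merge where a "##" asks for it.
void demoPaste() {
    assert(render(pasteTokens({"int", "c"})) == "int c");
    assert(render(pasteTokens({"foo", "##", "bar"})) == "foobar");
}
```

With this shape, "int" followed by "c" stays two tokens, while "foo ## bar"
becomes the single token "foobar".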

See pp-scanner.cpp:131, which is where the token merging happens. It now looks
as if this should not be done there at all. Instead, the merging should be
done when encountering a "##" at the beginning.

Another problem, though, is the input text as a whole. The input text is
handed to the preprocessor in a "fake tokenization": each character is
represented as its own token. This means that at least one real initial
tokenization has to happen before the text is processed.

Since I know which knobs need to be turned, I thought this would be quick and
easy to implement, so I took a look at it. Unfortunately it cost me a few
hours, and in the end it still didn't work. The problem with this approach:
the lexer will still merge consecutive "<" and "<" tokens into "<<". The
preprocessor never merges those itself; it only creates tokens for strings.
Thus only 2 of these problems were fixed with this approach.

Then I tried the other approach of inserting whitespace, and all it took was 3
added "output << ' ';" statements in pp-macro-expander to make _all_ your
tests succeed, so this is probably the better approach given the architecture.
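
The idea behind the whitespace approach can be sketched as follows. This is a
toy under stated assumptions, not the change made in pp-macro-expander:
expandPadded is a hypothetical helper that handles only object-like macros
(no arguments) and ignores string literals, comments and numbers.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical toy expander: replaces each identifier that names a macro
// with its replacement text, padded by one space on each side, so the
// expansion cannot fuse with the characters around it.
std::string expandPadded(const std::string& input,
                         const std::map<std::string, std::string>& macros) {
    auto isIdentChar = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
    };
    std::string output;
    std::size_t i = 0;
    while (i < input.size()) {
        const char c = input[i];
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            // Collect a whole identifier.
            const std::size_t start = i;
            while (i < input.size() && isIdentChar(input[i])) ++i;
            const std::string word = input.substr(start, i - start);
            const auto it = macros.find(word);
            if (it != macros.end()) {
                output += ' ';         // space before the expansion
                output += it->second;
                output += ' ';         // space after the expansion
            } else {
                output += word;        // not a macro: copy unchanged
            }
        } else {
            output += c;               // any other character: copy unchanged
            ++i;
        }
    }
    return output;
}

// Usage sketch: the nested template no longer produces ">>".
void demoPadding() {
    const std::map<std::string, std::string> macros = {{"MB_INT", "T<int>"}};
    assert(expandPadded("A<MB_INT> b;", macros) == "A< T<int> > b;");
}
```

The output differs from what cpp prints, because cpp adds spaces only where
they are needed, but the extra spaces do not change the token sequence.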

So in short: the problem is fixed now. Thanks for investigating this, and
feel free to find and fix more problems in the parser/preprocessor.

Greetings, David