Implementing a general language plugin (Reposting from correct address)

Mon Apr 29 00:08:57 BST 2019

On Friday, 26 April 2019 22:06:03 BST Jonathan Verner wrote:
> Hi,
> 
> I got addicted to semantic coloring (the thing where kdevelop colors
> variables and their uses in separate colors) and now using other IDEs is a
> pain :-) However, I was recently doing some frontend JavaScript work and
> found out that kdevelop has trouble understanding modern JavaScript so I
> had to reluctantly resort to VSCode [4].  Also, I will probably be forced
> to write quite a lot of code in C# which, as far as I know, has no support
> in kdevelop.  So I decided to try to sit down and write a language plugin.
> Now there is no way that I would be able to write and *maintain* a decent
> parser for either of the languages so the only option left to me was to use
> the tools the languages already have.
> 
> However, the problem here is, that kdevelop needs language plugins to be
> written in C++. At first I thought I would look into writing a plugin which
> would be able to utilize the "Language Server Protocol" used by VSCode [1].
> However, it seems that the LSP is too limited to support the cool stuff
> kdevelop does (i.e. semantic coloring).

This isn't correct as such - all that's needed for semantic colouring is the 
category and declaration position for each name, and those can be queried 
through LSP.
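
As a rough sketch of what that query could look like on the wire (an illustration only, not code from KDevelop or from any particular server): `textDocument/definition` and the `Content-Length` framing are part of LSP, while the helper names, the palette, and the hashing scheme are made up for this example. The category of a name could similarly be taken from the `kind` field of a `textDocument/documentSymbol` result.

```python
import json
import zlib


def lsp_request(request_id, method, params):
    """Frame one JSON-RPC request as the LSP base protocol describes:
    a Content-Length header, a blank line, then the UTF-8 JSON body."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body


def definition_request(request_id, uri, line, character):
    """Ask where the name at (line, character) was declared."""
    return lsp_request(request_id, "textDocument/definition", {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character},
    })


# Hypothetical palette; a real highlighter would take colours from the theme.
PALETTE = ["#1f77b4", "#d62728", "#2ca02c", "#9467bd", "#ff7f0e"]


def colour_for(location):
    """Map a declaration's Location to a stable colour, so that every use
    resolving to the same declaration is drawn alike."""
    start = location["range"]["start"]
    key = "%s:%d:%d" % (location["uri"], start["line"], start["character"])
    return PALETTE[zlib.crc32(key.encode("utf-8")) % len(PALETTE)]
```

One request per name is obviously chatty; the point is only that the information needed for colouring is reachable through the protocol.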

I don't know if you've seen Emma's blog post, almost certainly the most 
detailed consideration so far of using LSP in KDevelop:
https://perplexinglyemma.blogspot.com/2017/06/language-servers-and-ides.html

I believe she was probably correct in the specific case of Rust, but that 
support for LSP in KDevelop would be very useful in general.

The limitations in information returned by a particular LSP server aren't 
relevant if you plan to implement your own. Also, anything is better than 
nothing - there are existing LSP servers for many more languages than are ever 
likely to have a high-quality KDevelop plugin, so even partial autocomplete/
highlighting support is an improvement.

The more serious hurdle to overcome is that LSP's design is poorly suited to 
feeding KDevelop's DUChain as in current language plugins - rather than 
providing definition/use/type information in bulk, LSP is oriented toward 
providing the final, user-facing information and actions for a code location.

In fact, LSP acts a lot like the current *querying* of the DUChain within 
KDevelop - it supplies very similar results, but the data backing them stays 
within the language server.

To look at it another way, KDevelop's existing DUChain and language plugins 
could themselves make rather good LSP servers...

> So my next plan is to implement
> something "along the lines of" LSP, but for kdevelop.  The main idea is,
> that the plugin would connect to separate servers for each supported
> language. The servers would then provide it with a DUChain which it would
> feed to KDevelop. I.e. the workflow would look roughly like this
> 
>   1. User updates a file
>   2. KDevelop calls my language plugin to start a Parse Job
>   3. The plugin connects to an appropriate language server and asks it to
>      produce a DU Chain for the updated file
>   4. Upon receiving the DUChain it updates the DUChain KDevelop has for
>      the file
> 
> I am currently experimenting with a quick and dirty implementation [5] where
> the communication between the plugin and the server is based on gRPC ([2])
> and ProtocolBuffers. I have a few questions:
> 
>   1. do you think this approach is workable (so far I didn't run into any
> obvious roadblocks, but I am new to the codebase)

I do think it could be made to work, but I'm not sure it's the best approach 
and there are some issues.

This is where, instead of actually answering your questions, I suggest 
something quite different. :P

TL;DR - I believe it would be better to implement a simplified "query" API on 
top of the DUChain, converting existing UI and non-language plugins to use it 
instead of direct lookups, and then add support for LSP as a backend to that 
while bypassing the DUChain entirely.

[the end goal would be to convert the DUChain and its current language plugins 
to an LSP server, but that's far more out-of-scope]
---
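
To make the shape of that idea concrete, here is a minimal sketch, assuming a single "declaration at position" query. Every class and function name below is hypothetical; none of this is existing KDevelop API. The in-memory backend merely stands in for a DUChain-backed implementation, and the LSP backend handles only the plain `Location` form of a `textDocument/definition` result.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeclarationInfo:
    """Plain value handed to the UI; it holds no reference to backend data."""
    uri: str
    line: int
    character: int
    name: Optional[str] = None
    kind: Optional[str] = None


class CodeQueryBackend(ABC):
    @abstractmethod
    def declaration_at(self, uri, line, character):
        """Return a DeclarationInfo for the name at this position, or None."""


class InMemoryBackend(CodeQueryBackend):
    """Stand-in for a backend that answers from the DUChain."""

    def __init__(self, uses):
        self._uses = dict(uses)  # (uri, line, character) -> DeclarationInfo

    def declaration_at(self, uri, line, character):
        return self._uses.get((uri, line, character))


class LspBackend(CodeQueryBackend):
    """Answers the same query by asking a language server.

    `send_request` is an assumed callable(method, params) -> result that
    hides the transport."""

    def __init__(self, send_request):
        self._send = send_request

    def declaration_at(self, uri, line, character):
        result = self._send("textDocument/definition", {
            "textDocument": {"uri": uri},
            "position": {"line": line, "character": character},
        })
        if not result:
            return None
        location = result[0] if isinstance(result, list) else result
        start = location["range"]["start"]
        return DeclarationInfo(location["uri"], start["line"], start["character"])


def tooltip(backend, uri, line, character):
    """UI-side code sees only the query interface, never the backend."""
    decl = backend.declaration_at(uri, line, character)
    if decl is None:
        return "no declaration found"
    return "%s at %s:%d" % (decl.name or "declaration", decl.uri, decl.line + 1)
```

The `tooltip` caller works unchanged with either backend, which is the property that would let UI code be ported one piece at a time.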

Most existing KDevelop language plugins /read/ the DUChain as much as they 
store things in it. To know the type of `foo.bar` and where it was declared, 
you have to look up where `foo` was declared, and with what type, and then find 
the declaration of `bar` within that type.
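
A toy version of that lookup chain, purely for illustration: the `Scope` and `Declaration` classes here are invented for the example and are far simpler than the real DUChain contexts and declarations.

```python
from dataclasses import dataclass, field


@dataclass
class Declaration:
    name: str
    type_name: str
    line: int


@dataclass
class Scope:
    declarations: dict = field(default_factory=dict)

    def declare(self, name, type_name, line):
        self.declarations[name] = Declaration(name, type_name, line)

    def find(self, name):
        return self.declarations.get(name)


def resolve_member(file_scope, type_scopes, object_name, member_name):
    """Resolve `object_name.member_name` by reading earlier declarations."""
    obj = file_scope.find(object_name)         # 1. where was `foo` declared?
    if obj is None:
        return None
    members = type_scopes.get(obj.type_name)   # 2. what type does it have?
    if members is None:
        return None
    return members.find(member_name)           # 3. find `bar` within that type
```

Each step reads data that an earlier parse pass stored, which is why such a plugin needs read access as well as write access.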

The big exception is kdev-clang, where all the declaration and typing 
information is deduced by Clang and the language plugin is fundamentally a 
Clang AST -> DUChain convertor.

Unless the proposed protocol is bidirectional (or the DUChain is shared cross-
process by some other means), plugins for it can't read the DUChain 
themselves. They must either keep their own copy of the data, reinvent an 
equivalent framework for analyzing declarations or types, or (most likely) be 
similar to kdev-clang and rely on an external library.

That is: plugins for it don't really /benefit/ in their own right from being 
KDevelop-specific.
Much as kdev-python etc. struggle with the DUChain in places (see below), all 
of their internal analysis is based on queries to it; they wouldn't be 
possible in anything like their current form as standalone utilities.
In contrast, without read access a plugin has to do all the work on its own 
and then (lossily) turn it into something the DUChain can understand.

It's important to note at this point that the DUChain is actually quite 
restrictive, and more so the further you get from C. The implementation is 
very elegant and performant, but it makes a lot of assumptions about how types 
and declarations work that can be hard to map other languages onto.
I spent a lot of time and effort trying to get kdev-python to store more things 
usefully in the DUChain - types are objects and vice versa, any assignment 
might be a declaration, and that's very difficult to represent.
This also seems to be an increasing problem with C++ itself - try anything 
involving non-trivial templates.
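
For readers less familiar with Python, a few lines show the kind of code that is awkward to represent (an illustrative snippet, not taken from kdev-python or its tests; all names are invented):

```python
class Point:
    pass


Alias = Point      # a plain assignment that introduces a new name for a type
p = Alias()        # so the type of `p` depends on what `Alias` was bound to
p.x = 1.5          # an assignment creating an attribute the class never declared


def make_class(name):
    # A type built at run time; there is no `class` statement to point at.
    return type(name, (object,), {"tag": name})


Dynamic = make_class("Dynamic")
```

Each of these assignments behaves like a declaration, and `Point` is itself an ordinary object, so a model that expects distinct type and declaration nodes has to bend to fit.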

Given that, adding content to it through a protocol from an external plugin 
might be quite painful. Current language plugins can define new DUChain data 
types and are versioned in lockstep with kdevplatform, so it's easy to make 
API changes if needed; even with those advantages, working with the DUChain 
can be frustrating.

[sorry, it's midnight and I'm going to sleep. Part 2 hopefully tomorrow].

The attached file is a somewhat-relevant IRC discussion from last year. 
Needless to say I didn't get to working on it. I might be able to this time 
around...

>   2. my knowledge of the DU Chain comes just from reading the available
> docs here [3] and the source code of the go plugin; however, I am
> sometimes quite confused and I wonder, if there is a tool which would
> output the DU Chain for a given file in a "human readable" format ---
> that would help me very much, I think
>   3. I haven't been able to figure out how kdevelop determines, what
> plugin should run the parse job for a given file; so far it seems that I
> need to provide a mime-type in the plugin json file and then kdevelop
> calls my plugin if the mimetype matches; unfortunately, this will
> obviously not work if my plugin wants to support different languages,
> ideally configured at runtime based on the available language servers...
> Is there a way around this?
> 
> Anyway, if you've read this far, thanks for taking the time, even if you
> don't have any answers :-)
> 
> 
> Best,
> 
> Jonathan
> 
> 
> [1] https://microsoft.github.io/language-server-protocol/
> [2] https://grpc.io/
> [3] https://api.kde.org/extragear-api/kdevelop-apidocs/kdevelop/kdevplatform/html/index.html
> [4] https://code.visualstudio.com/
> [5] It really is very dirty, but if you want to make fun of me, you can find
> it here: https://gitlab.com/Verner/duserver

-------------- next part --------------
<rakadam>	how undermanned is the Kdevelop project? If I have ideas for improving Kdevelop, should I code them myself?
<kfunk>	rakadam: patches are always welcome!
<kfunk>	we're all super busy with real life & paid jobs atm
<rakadam>	I have a few smaller ideas, which I think I can implement fine, and you would be able to merge them. But I have a big one too, (too big)
<rakadam>	I think the clang parser should be separated into another process, because it is constantly crashing the editor, which is a very bad user experience.
<rakadam>	this is probably too big to implement alone, or even too big for you to implement soon....
<kfunk>	rakadam: I'd rather spend that time improving the Clang parser itself, if possible, so it doesn't crash
<kfunk>	unfortunately changing the whole communication with the Clang parser, so it is out-of-process, will be a huge undertaking
<rakadam>	it will always crash. In my experience you will never have the time&energy to make it 100% stable. New features are higher priority.
<rakadam>	Currently it is "almost" stable. But at the moment I am forced to use Kate instead, because my code crashes it.
<FLHerne>	kdev-python used to have a thing where a separate parser script generated XML files, which the main process then fed into the DUChain. It doesn't anymore though...
<rakadam>	most of the time, it is cleaner and more maintainable to do them in the same process.
<rakadam>	There is some research about how to recover from a segfault. You might go that way, if you like the adventure. It seems simpler than implementing process separation for clang.
<FLHerne>	What proportion of KDevelop crashes do you get in the Clang parser itself? The only thing I remember was the documentation-parsing thing
<FLHerne>	(and the case where Clang manages to parse its own test-suite, which is sort of a niche problem)
<FLHerne>	I think most of the crashes I've experienced since getParsedComment() was worked-around have been in the cmake/include-path handling, i.e. wouldn't be helped
<rakadam>	I remember only crashes in the parsing. Mostly clang c++ parsing, sometimes python parsing.
<FLHerne>	Hm
<FLHerne>	Are you proposing to move the parsers themselves out-of-process, or the entire DUChain builders?
<FLHerne>	I've fixed quite a few crashes in the Python duchain builder, but never seen one in the parser alone
<rakadam>	I do not know. Maybe the entire DUChain would be better to move out. Mostly because the editor can live without it.
<rakadam>	maybe I remember wrong. I have only glanced at the stack trace.
<FLHerne>	Moving the whole DUChain out-of-process would be /really/ nice IMO; having everything related to it be asynchronous would also fix the various nasty UI hangs
<FLHerne>	But correspondingly difficult, I'm sure :-/
<rakadam>	I also have lots of UI hangs by the way.
<rakadam>	usually 1-2 seconds hangs
<rakadam>	but still annoying.
<FLHerne>	Yeah, there are tons of those
<rakadam>	separating DUChain would mean that you will have to maintain an extra interface between the processes. 
<rakadam>	if you are really undermanned, that might not be feasible at all (long term or short term).
<FLHerne>	Depends how reliable it is to start with, probably
<rakadam>	Attempting a segfault recovery would be much less work, but much more speculative.
<FLHerne>	The duchain storage code is rather complicated and I don't know if anyone but milian really understands it, but that doesn't seem to be a maintenance problem because it Just Works most of the time
<FLHerne>	I can just put things in the duchain and not worry about the implementation
<rakadam>	is the DUChain interface documented, or should I read the code if I want to understand it?
<FLHerne>	The degree of coupling between the UI and duchain doesn't seem all that high; mostly it's just querying information for some position in some version of a document
<FLHerne>	(also, settings that affect parser behaviour, like the interpreter and include-paths etc.)
<milian>	FLHerne: except if it crashes :D
<rakadam>	that is encouraging.
<FLHerne>	I guess the buildsystem would also be in the backend process?
<kfunk>	rakadam: sorry, not much time to talk to you. but re. segfault recovering: Clang already implements that. most segfaults inside Clang/libclang don't cause a crash of the parent process (as it's handled internally)
<kfunk>	but still, it can't recover from some worse segfaults
<FLHerne>	rakadam: https://api.kde.org/extragear-api/kdevelop-apidocs/kdevelop/kdevplatform/language/duchain/html/duchain-design.html
<milian>	reading the backlog: I'd love to see language parsers be separated into sub processes, but that's a big task
<milian>	kfunk: what clang does is not enough
<FLHerne>	(see also 'Implementing' and 'Using' at the top)
<milian>	and it seems to be badly leaking, too
<milian>	so, rakadam - if you are into this, please do continue. you can ask me any questions you have
<milian>	but be warned: this isn't going to be easy
<kfunk>	yeah, I'd guess if it detects a segfault and recovers from it, it'll just leave any allocated memory where it is... (as it cannot safely delete it anyway; might cause yet another segfault)
<kfunk>	right...
<milian>	though usually, tasks like that are highly rewarding from an "getting experience" POV
<rakadam>	at this point I do not know what I should even try to code.
<kfunk>	it has to be said: QtCreator does that. it runs libclang in a separate process. 
<milian>	yes, and we may even want to incorporate language servers eventually, if they provide enough information
<rakadam>	separate processes are more reliable than segfault recovery.
<milian>	yep
<kfunk>	that's by the way an awesome task for GSoC (in case you're eligible)
<milian>	indeed
<rakadam>	I am not eligible for GSoC. (and do not have the time anyway)
<milian>	rakadam: so, if you would like to prototype something, I guess the best way is to actually start with the kdev-clang plugin since it's relatively small
<FLHerne>	rakadam: You might want to look at `util/duchainify` (both the source, and the utility that it builds)?
<milian>	actually that's an awesome idea for me to look into during my parental leave - let's see if I actually get around to it though...
<FLHerne>	It's almost like a very primitive version of what you're wanting to build, AAUI :P
<FLHerne>	rakadam: e.g. http://www.flherne.uk/files/duchainify.txt
<rakadam>	Kdevelop would need a DUChain server and not just a tool, for real-time info right?
<FLHerne>	Yes, almost certainly
<milian>	rakadam: I'd personally suggest to start in small steps, instead of refactoring everything
<FLHerne>	Perhaps 'very primitive' was still an understatement
<milian>	so keep the DUChain in the KDevelop process space
<milian>	and have the language parsers stream their data to the KDevelop process, which then builds the duchain
<milian>	though, thinking about it, that will break all language plugins except for kdev-clang :D
<milian>	since they all operate directly on the duchain... oh boy
<FLHerne>	milian: Are you sure that's actually easier?
<FLHerne>	Well, yes, as you just said
<milian>	this is going to be... interesting :)
<rakadam>	so the editor and duchain are weakly coupled, but duchain and the lang plugins are strongly coupled?
<FLHerne>	All the language backends that aren't Clang (or qmljs?) interact hugely with the duchain
<milian>	I really would have to think more about this, it's not easy
<FLHerne>	Whereas the editor mostly just reads from it
<milian>	rakadam: the language plugins all are strongly coupled to the duchain, but kdev-clang is a bit of a special case - it actually is quite weakly coupled
<milian>	but still makes heavy use of the duchain API
<milian>	then we have multiple places in the "GUI" that also operate on the DUChain, e.g.:
<FLHerne>	I mean, I think running `duchainify` on each file change and then feeding that data into the editor would almost work :P
<milian>	quickopen
<milian>	outline
<milian>	document browser
<milian>	code browser / tooltips
<milian>	highlighting
<FLHerne>	It would be painfully slow and useless, but you should be able to highlight things with it
<milian>	problem browser
<FLHerne>	milian: Yeah, but all of those are only querying bits of data, right?
<milian>	yes, we'd have to refactor the code a lot though - so certainly not an easy task
<milian>	sure, but they all operate on the DUChain API, and the data itself is not cross-process safe
<FLHerne>	The actual API surface needed to feed them doesn't need to be that large
<milian>	I fear that the current DUChain API is leaking its implementation in a few areas
<FLHerne>	Oh, maybe your idea is different
<FLHerne>	I was envisaging something where the UI thread would never touch the DUChain data/storage at all
<FLHerne>	The backend process would have a socket interface of some kind to query it, and hand back the information by value
<rakadam>	FLHerne: that would be a much more clean interface
<FLHerne>	i.e. the UI just asks for "declarations in <file set> matching <name>" and gets a plaintext list of them and the locations
<FLHerne>	It never gets a reference to the underlying duchain data or anything
<FLHerne>	rakadam: Sure, but I bet milian will now explain why I'm naïve and it won't work like that :P
<milian>	FLHerne: no, that's fine - but it will mean refactoring the whole code that uses the duchain API outside the language plugins
<milian>	and I agree that it would be much cleaner
<milian>	but I was originally hoping we could get away with something that doesn't need such invasive code changes, but I doubt it's possible
<milian>	one side has to change, the question is which one
<milian>	I currently think that your idea is the best, really
<milian>	but we'll have to refactor all of this code:
<milian>	outline
<milian>	manpage
<milian>	qthelp
<milian>	switchtobuddy
<milian>	contextbrowser
<milian>	quickopen
<milian>	filetemplates
<milian>	codeutils
<milian>	problemreporter
<milian>	cmake (though potentially not)
<milian>	classbrowser
<FLHerne>	Yes, yes, I know :P
<milian>	parts of the shell, too
<FLHerne>	At least it should be doable incrementally, up to a point?
<milian>	potentially, yes
<FLHerne>	The new interface could just be implemented within the same process until all the uses were moved
<milian>	you mean, wrap the new interface in the old API?
<FLHerne>	Not as such
<FLHerne>	If the new interface doesn't change the way the duchain works fundamentally, it shouldn't inherently break anything that accesses the duchain directly using the existing APIs
<milian>	ah, that's the naive assumption on your side then :P
<milian>	as I said, the DUChain API leaks its implementation
<milian>	it does pointer magic and directly accesses mmapped files e.g.
<FLHerne>	Yes, I remember this
<milian>	which must not be done from multiple processes
<milian>	so once you move anything into a separate process, all the other users must not use the old API
<FLHerne>	<FLHerne> The new interface could just be implemented within the same process   <- that was what I meant
<milian>	then I don't understand that point :)
<FLHerne>	- Implement the shiny socket-like interface within the /same/ process initially
<FLHerne>	 - Incrementally port things to access things via the socket-querying API rather than direct duchain access
<milian>	ah!
<milian>	yes, good idea
<FLHerne>	 - When all the accessors have been ported, move the server side into a separate process
<milian>	man, now I _really_ want to start on this :)
<milian>	maybe in a few months, if rakadam isn't beating me to it
<FLHerne>	Ditto :P
<rakadam>	I can only work on it for half hour per day max.
<FLHerne>	I think in a few weeks I might be able to spend some time on it.