I have some negative feelings about this trend (of increased integration in compilers), but I can't quite put my finger on the reason.
Before the language server idea came along, all compilers were pure functions. Dependency management and caching were the responsibility of the build system. A flexible build system could handle things that the language designers hadn't thought of, like code generation, mixed-language projects, linking between different languages, etc. Things are very composable and extendable, and everything can be modeled as a DAG.
With language servers and incremental compilation becoming part of some long-running compiler process, the responsibilities are blurred. It all leads to increased integration and less flexibility, and when things break, you will not be able to tell why.
Aren't we giving up too much to get faster recommendations in IDEs?
Contrary to other replies, I think this is a reasonable concern. For example, the traditional C model allowed people to relatively easily write wrappers that do transparent caching (ccache) and parallelism (distcc), which is harder with a more integrated model.
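The nice thing is that those wrappers only need the compiler to behave like a pure command. Here's a rough Haskell sketch of the ccache idea (not ccache's actual implementation; the `cachedCompile` name, the cache layout, and the plain `cc` driver are made up for illustration): hash the preprocessed source plus the flags, and only invoke the real compiler on a miss.

```haskell
import           Data.Hashable    (hash)
import           System.Directory (copyFile, createDirectoryIfMissing, doesFileExist)
import           System.Process   (callProcess, readProcess)

-- Sketch only: cache an object file keyed by a hash of the preprocessed
-- source text and the compiler flags.
cachedCompile :: [String] -> FilePath -> FilePath -> IO ()
cachedCompile flags src obj = do
  preprocessed <- readProcess "cc" ("-E" : flags ++ [src]) ""
  let cacheDir = ".objcache"
      key      = cacheDir ++ "/" ++ show (abs (hash (preprocessed, flags))) ++ ".o"
  createDirectoryIfMissing True cacheDir
  hit <- doesFileExist key
  if hit
    then copyFile key obj                              -- cache hit: skip the compiler
    else do
      callProcess "cc" (flags ++ ["-c", src, "-o", obj])
      copyFile obj key                                 -- populate the cache
```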
Except:
> Aren't we giving up too much to get faster recommendations in IDEs?
The goal of incremental compilation isn't just IDE recommendations. It's faster compile times. C's system of header files and file-level dependency tracking works for C, but modern C++ severely strains its limits, considering how templates often force you to put all your code in header files. Rust doesn't even have header files and uses a single compilation unit for an entire library, which is much worse… but Rust is working on fine-grained query-based incremental compilation, which might someday let it leapfrog C++ in rebuild times.
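To give a flavour of what "query-based" means in practice, here's a minimal sketch (in Haskell only because it's short; the names are hypothetical and this is nothing like rustc's actual machinery): every compiler question becomes a keyed query whose result is memoized, so an edit only forces recomputation of queries whose inputs actually changed. Invalidation and dependency tracking are omitted to keep the sketch small.

```haskell
import           Data.IORef
import qualified Data.Map.Strict as Map

type Key   = String   -- e.g. "typeOf:my_crate::foo"
type Value = String   -- e.g. a rendered type signature

newtype Cache = Cache (IORef (Map.Map Key Value))

newCache :: IO Cache
newCache = Cache <$> newIORef Map.empty

-- Run a query, reusing a previously computed result when one exists.
memoQuery :: Cache -> Key -> IO Value -> IO Value
memoQuery (Cache ref) key compute = do
  cached <- Map.lookup key <$> readIORef ref
  case cached of
    Just v  -> pure v                         -- hit: no work repeated
    Nothing -> do
      v <- compute                            -- miss: run the real query
      modifyIORef' ref (Map.insert key v)
      pure v
```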
Maybe.
I wonder if you could approach this like you’d approach designing any other data structure, by first nailing down what operations you are optimizing for.
We could optimize for ease of iterating on the compiler, or for ease of integration into an ecosystem of various build systems. Or for correctness, consistency, compilation speed, etc.
Or, maybe it turns out that saving developers’ time is more important than any of these, as long as we keep all of them within reasonable bounds, since it’s the most common operation you’re going to be performing, by a factor of 10,000.
I honestly fail to see anything we're giving up.
What makes you assume compilers somehow won't still support the pipeline model? What "less flexibility" have you actually suffered from? When was the last time you had to debug compiler problems yourself?
If the compiler and the build system are so coupled, I am now stuck with that build system.
Specifically, compiling frontend code with a generic build tool like bazel is torture, because the entire javascript ecosystem consists of giant monolithic everything-in-a-box toolkits like webpack.
The assertion "all compilers were pure functions", is a strange one, because it is almost entirely backwards.
the purity of compilers was abandoned almost immediately (when they started creating a file a.out and writing to that instead of writing binaries to stdout, and in the c-preprocessor when #include was added, and in the assembler with the .incbin directive, If compilers were pure, there would be zero need for Makefile style build systems which stat files to see if they have changed.
while Makefiles and their ilk are modeled as a dag is true, The only reason an external file/dag is actually necessary is due to impurity in the compilation process.
There have been very few compilers which have even had a relatively pure core (TeX is the only one that I can actually think of), language servers are if anything moving them to a more pure model, simply due to the fact that its sending sources through some file descriptor rather than having to construct some graph out of filenames.
Long story short, "purity" in the sense of a compiler is a function from source text -> binary text, "foo.c" is not source text, and a bunch of errors is not binary text.
At least language servers take in source text as input.
> The purity of compilers was abandoned almost immediately: when they started creating a file a.out and writing to that instead of writing binaries to stdout
I don't understand your point. A function doesn't cease to be a function if it sends its output somewhere else.
> in the C preprocessor when #include was added,
The C preprocessor is not the compiler. It's a macro processor that expands all macros to generate the translation unit, which is the input that compilers use to generate their output.
And that is clearly a function.
> If compilers were pure, there would be zero need for Makefile style build systems which stat files to see if they have changed.
That assertion makes no sense at all. Compilers take source code as input and output binaries. That's it. The feature you're mentioning is just a convenient trick to cut down build times by avoiding recompiling source files that haven't changed. That's not the responsibility of the compiler. That's a function whose input is the source files' attributes and whose output is a DAG of files, which is used to run a workflow where each step invokes the compiler on a specific source file to generate a binary.
It's functions all the way down, but the compiler is just a layer in the middle.
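A toy version of one such make rule, written as a plain Haskell function (just a sketch; the file names and the bare `cc` invocation are illustrative, not any real build system's code):

```haskell
import Control.Monad    (when)
import System.Directory (doesFileExist, getModificationTime)
import System.Process   (callProcess)

-- Rebuild foo.o from foo.c only if the object is missing or older than the source.
rebuildIfStale :: FilePath -> FilePath -> IO ()
rebuildIfStale src obj = do
  objExists <- doesFileExist obj
  stale <- if not objExists
             then pure True
             else (<) <$> getModificationTime obj <*> getModificationTime src
  when stale $
    callProcess "cc" ["-c", src, "-o", obj]   -- the compiler stays a black box

main :: IO ()
main = rebuildIfStale "foo.c" "foo.o"
```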
> While it's true that Makefiles and their ilk are modeled as a DAG, the only reason an external file/DAG is actually necessary is impurity in the compilation process.
You have it entirely backwards: build systems exist because compilers are pure functions with specific and isolated responsibilities. Compilers take source code as input and generate binaries as output. That's it. And they are just one component in the whole build system, which is composed of multiple tools that are designed as pure functions as well.
> The only reason an external file/dag is actually necessary is due to impurity in the compilation process.
But files also make the various pieces of the compiler chain interoperable and allow me to define a DAG. That's exactly why make is so powerful, and that's exactly what I'd hate to lose.
Modern compilers do a lot, and understandably they are trying to avoid writing out partially calculated state to disk (e.g. serializing an AST to disk between stages would be doing work twice). But moving everything into the process means your compiler becomes a walled garden.
You can see this happening in the javascript world. Very few people actually know what WebPack does. It's a giant black box with an infinite number of switches, and everything is "magic".
> The only reason an external file/dag is actually necessary is due to impurity in the compilation process.
No compilation process can know about the parts of your project written in another language.
That's all UNIX compilers with their primitive toolchains; these kinds of ideas go all the way back to the Xerox development environments.
Lucid and IBM already had query-based compiler architectures for their C++ toolchains (Energize C++ and Visual Age for C++ v4).
If you cache only at the file level, then you're going to miss a lot of opportunities and still repeat a lot of work.
You could cache at the function level, or each template separately. You could automatically cache includes on files that change frequently, to avoid parsing headers again and again without having to manually specify precompiled headers (yes, sure, modules should help there soon).
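As a rough sketch of what function-level caching could look like (the names and String stand-ins here are made up, and this isn't any real compiler's scheme): key each function's generated code by a hash of that function's own source text, so editing one function invalidates only its own entry.

```haskell
import           Data.Hashable   (hash)
import qualified Data.Map.Strict as Map

type FnName   = String
type FnSource = String
type Artifact = String                  -- stand-in for generated code

-- Sketch: cache keyed by (function name, hash of its source text).
type FnCache = Map.Map (FnName, Int) Artifact

compileFn :: (FnSource -> Artifact)     -- the per-function code generator
          -> FnCache -> FnName -> FnSource -> (Artifact, FnCache)
compileFn codegen cache name src =
  let key = (name, hash src)
  in case Map.lookup key cache of
       Just a  -> (a, cache)                        -- unchanged: reuse
       Nothing -> let a = codegen src
                  in (a, Map.insert key a cache)    -- changed: recompile just this one
```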
"Traditional" compilers (e.g. GCC, GHC, javac, etc.) are essentially single-purpose black boxes: source code goes in, executable comes out.
Usually that source code must be on disk, often it must be arranged in a certain directory structure (e.g. src/module/...), and sometimes it must be in files with particular names (e.g. javac). This forces programmatic use of the compiler to be more complicated, e.g. setting up temporary directories to appease these rules.
That single purpose is a common use case, but certainly not the only one. Traditional compilers typically perform pre-processing, lexing, parsing, precedence resolution, name resolution, macro expansion, type inference, type checking, optimisation, code generation, linking, stripping, etc. all within the same binary (there are some exceptions, e.g. the C pre-processor can also be invoked separately).
In my experience, this is the opposite of composable and extendable! Each of these steps is very useful in its own right, yet we typically have no way to invoke them independently, e.g. to parse code into a structured form; or to infer the type of an expression; or to resolve a name; or to optimise an AST; etc.
To make this composable and extendable in the way you suggest, we would need to make these separate processes, piped together with a build tool (e.g. make, or a helper script). In practice this doesn't happen, but some projects have hooks into their code for extensibility; e.g. GCC can be run with different front- and back-ends, and the "middle-end" can be extended with new passes and plugins (finally!); GHC has some limited plugin functionality, and has a (very flaky!) Haskell API for invoking its different stages; etc.
My point is that the "traditional" world was pretty awful for composability and extendability. From the outside, we had big opaque compiler processes invoked by Make. If we're willing to drop down to the compiler's implementation language, there were some limited facilities to make them do something other than their usual source files -> binary task.
If we look at the post, we see that it's talking about "drop down to the compiler's implementation language" rather than standalone processes with stdio and Makefiles. However, the approach it's talking about is precisely one of pure functions (e.g. `fetchType`) and a flexible build system (Rock), providing composability and extendability. It even says this explicitly, e.g.
> The rules of our compiler, i.e. its "Makefile", then becomes the following function, reusing the functions from above:
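To make that concrete, here's a hypothetical sketch of what such a rules function can look like (this is not the post's actual code; the query names and the toy Result type are invented, and the real thing goes through Rock's task machinery): each kind of query becomes a case that may fetch other queries, and the build system memoizes the recursive fetches.

```haskell
data Query
  = ParsedModule FilePath
  | ResolvedName String
  | TypeOf String

data Result
  = Syntax String
  | Name   String
  | Type   String

-- 'fetch' is supplied by the build system and memoizes recursive queries.
rules :: Monad m => (Query -> m Result) -> Query -> m Result
rules fetch query = case query of
  ParsedModule path -> pure (Syntax ("parsed " ++ path))
  ResolvedName n    -> do
    _ <- fetch (ParsedModule "Example.hs")   -- name resolution needs the parse tree
    pure (Name n)
  TypeOf n          -> do
    _ <- fetch (ResolvedName n)              -- typing needs the resolved name
    pure (Type ("inferred type of " ++ n))

main :: IO ()
main = do
  -- Tie the knot naively (no memoization) just to show a query being answered.
  let fetch q = rules fetch q
  Type t <- fetch (TypeOf "Parser.parse")
  putStrLn t
```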
Note that the post isn't specifically about LSP; it only mentions "providing editor tooling, e.g. through a language server". It doesn't even talk about long-running processes. As a counter-example, it would be pretty trivial to expose these 'tasks' as standalone commands, piping through stdio, if we really wanted to. So we're not "giving up too much"; we would be gaining composability and extendability!
As for "faster recommendations in IDEs", that's a complete straw man. The post gives the example of querying for the type of a qualified function name, and a few others (e.g. resolving names). Sure, those would be useful for IDEs, but they would also be useful for many more systems. Some examples, off the top of my head:
- Code search, e.g. searching by type (like Hayoo, but more general and less flaky); finding usages across a package repo (this relies on name resolution)
- Chasing (resolved) name occurrences, e.g. to find downstream projects impacted by a breaking change; or to speed up delta-debugging by only checking commits which change code used by the test.
- Documentation generators can benefit from looking up types.
- Static analysers benefit from name resolution, type inference/lookup, etc.
Personally I've spent years on projects which use compilers for things other than their usual source files -> executable task, and their lack of composability and extendability is painful (my comment history is full of rants about this, mostly regarding GHC!). The approach described in this post would be amazing to see in "real" languages (i.e. those with lots of users and code, where more tooling and automation would provide a lot of benefit). I've often thought about a similar approach to this 'query-based' design, and would love to see things go even further in this direction (e.g. to a Prolog-style database of code).
Reminded me of this lecture from last year:
Responsive compilers - Nicholas Matsakis - PLISS 2019
https://youtube.com/watch?v=N6b44kMS6OM
(Of course it's based on Rust, but the same principles would be applicable elsewhere)
The blog post does cite salsa, which is the framework that was created to build Lark, the language that was created to prototype Rust's implementation of query-based compilation.
https://github.com/lark-exploration/lark
https://github.com/salsa-rs/salsa
Can someone build an HN filter for these types of links? Sick of all the political bikeshedding.
Really interesting ideas, though I think the content is hampered a bit by the use of what I think is Haskell in the examples. It hinders accessibility when you're having to learn two unrelated things in parallel, and I don't think the average audience for this piece can be expected to know Haskell well enough to follow along effectively.
This is from a blog that has 2 posts. Both discuss a compiler for an experimental dependently typed language. That's a fairly specialized topic.
They are talking about the idea, and their implementation, which is in Haskell.
I think the author linked a Rust example of a query-based compiler: https://github.com/salsa-rs/salsa
Good... but why does the author think modern compilers / language servers DON'T do caching? Or, put another way: why does the author think the caching mechanisms in modern compilers / language servers are insufficient? I think the author is proposing a caching mechanism that has finer granularity, and designing the whole system around this idea from day one. But first, at the very least, many components in LLVM have been doing caching for a long time (e.g. OrcJIT has a pretty decent caching layer, and libTooling also supports incremental parsing with AST caches). Second, what is the memory overhead of this fine-grained caching design when it's facing a larger real-world input program (e.g. an OS kernel)? Does it scale well?
I know it's refurbished from an old post, so I probably shouldn't be so harsh. But it would be better to compare these old ideas against state-of-the-art work and draw some insights, rather than doing pure archaeology.
I am not the author, but work in the same domain. Empirically, existing compilers are impossible to turn into good IDEs, unless the language has header files and forward declarations. Otherwise, you get a decent IDE by doing one of the following:
* writing two compilers (C#, Dart, Java, and, in some sense, every major language supported in JetBrains tooling)
* building an IDE-oriented compiler from the start (Kotlin & TypeScript)
The two examples where I’ve heard a batch compiler was successfully used to build a language server are C++ and OCaml (haven’t tried these language servers myself though). Curiously, they both use header files, which means that it’s the user who does the fine-grained caching.
I don't see how caching in LLVM is relevant to the task of building LSP servers.
In terms of prior art, I would suggest studying how IntelliJ works, specifically looking at the stubs:
https://www.jetbrains.org/intellij/sdk/docs/basics/indexing_...
There are ways to turn batch compilers into incremental IDE compilers with some tree and effect caching on the periphery of the compiler. You can even go all the way up to the eval level for a full live programming environment. You don’t need module signatures if you are willing to trace dependencies dynamically.
See https://www.microsoft.com/en-us/research/publication/program....
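For the "trace dependencies dynamically" part, the mechanism can be surprisingly small. A Haskell sketch (hypothetical, not taken from any of the systems above): hand the task an instrumented read function and record every key it actually touches; that recorded set is the dependency list for the next incremental run, with no declared module signatures needed.

```haskell
import Data.IORef

type Key = String

-- Run a task with an instrumented 'read', returning its result together with
-- the keys it actually read at run time.
traced :: ((Key -> IO String) -> IO a)   -- a task parameterised over how it reads inputs
       -> (Key -> IO String)             -- the underlying store
       -> IO (a, [Key])
traced task store = do
  deps <- newIORef []
  let readKey k = modifyIORef' deps (k :) >> store k
  result <- task readKey
  readKeys <- readIORef deps
  pure (result, reverse readKeys)

main :: IO ()
main = do
  let store k = pure ("contents of " ++ k)          -- stand-in for real inputs
      task readKey = do a <- readKey "A.hs"
                        b <- readKey "B.hs"
                        pure (length a + length b)
  (n, deps) <- traced task store
  print (n, deps)   -- deps = ["A.hs","B.hs"], discovered dynamically
```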
> OrcJIT has a pretty decent caching layer
(Sadly) this is not correct. If you want to cache compiled code between sessions, your only option so far is a handwritten object cache along the lines of this example:
https://github.com/llvm/llvm-project/blob/ae47d158a096abad43...