Comment by vidarh
10 years ago
> Essentially, compilation is rewriting of a source AST into the target machine code.
While that is true of most modern compilers, do consider that there's the whole Wirth school of compilers that don't use an AST at all but generate machine code directly during parsing.
> For a human it is much easier to structure such a rewrite as a sequence of very trivial changes.
I'm not convinced, as I've written elsewhere: while each individual change may be trivial, you need to keep track of what a program is expected to look like in between each stage, and that becomes harder the more stages you introduce. E.g. my forever-in-progress Ruby compiler has a stage where it makes closures explicit: it allocates an environment to store variables in and rewrites variable accesses accordingly. For all following stages the language I'm working with is different: I can no longer use closures.
The more stages you have, the more language dialects you need to mentally keep separate while working on the stages. And if you decide you need to reorder stages, you often face painful work adapting the rewrites to their new dialects.
The end result looks very simple. But try to make changes and it's not necessarily nearly as simple.
> that don't use an AST at all but generate machine code directly during parsing
It is not any different. A "sufficiently smart" compiler must be capable of fusing the lowering passes straight into a parser. Parsing is not that different from the other kinds of passes.
> that becomes harder the more stages you introduce
My usual trick is to have a "unit test" simple AST which, when building in a certain mode, is fed into the compilation pipeline and displayed with all the annotations. It helps a lot to understand what exactly each pass is doing. In the next version of my DSL for building compilers (currently a work in progress) it will even be a part of an IDE.
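A rough sketch of that trick (everything here is invented for illustration, not the actual DSL): a tiny fixed AST is pushed through the pipeline, and the tree is dumped after every pass so you can see exactly what each pass did.

```ruby
Node = Struct.new(:kind, :children)

# A sample pass: constant-fold (add 1 (add 2 3)) down to 6.
def const_fold(node)
  kids = node.children.map { |c| c.is_a?(Node) ? const_fold(c) : c }
  if node.kind == :add && kids.all? { |k| k.is_a?(Integer) }
    kids.sum
  else
    Node.new(node.kind, kids)
  end
end

PASSES = { const_fold: method(:const_fold) }

# Run the "unit test" AST through every pass, dumping the tree each time.
def debug_run(ast)
  PASSES.each do |name, pass|
    ast = pass.call(ast)
    puts "after #{name}: #{ast.inspect}"
  end
  ast
end

sample = Node.new(:add, [1, Node.new(:add, [2, 3])])
debug_run(sample)  # => 6
```

With one debug AST per language feature, a glance at the dump shows which pass mangled (or failed to touch) a given construct.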
> the more language dialects you need to manage to mentally separate when working on the stages
Ideally, one pass = one language. Nanopass handles it nicely.
I am using this approach for pretty much all kinds of languages, and never had any issues with too many IRs. In my book, the more distinct IRs, the better.
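A minimal sketch of what "one pass = one language" can look like mechanically (the node kinds and names are invented here, not nanopass's actual API): each pass declares the IR its output must live in, and the driver validates the tree after every step, so a node from the wrong dialect fails immediately instead of three passes later.

```ruby
Node = Struct.new(:kind, :children)

def kinds(node)
  node.children.grep(Node).flat_map { |c| kinds(c) }.push(node.kind).uniq
end

# Reject any node kind that does not belong to the given IR.
def check!(node, ir)
  bad = kinds(node) - ir
  raise "node(s) #{bad.inspect} illegal in this IR" unless bad.empty?
  node
end

IR_SURFACE = %i[seq if unless print nop]  # surface IR still has 'unless'
IR_CORE    = %i[seq if print nop]         # core IR does not

# Pass from IR_SURFACE to IR_CORE: (unless c body) => (if c nop body)
def desugar_unless(node)
  kids = node.children.map { |c| c.is_a?(Node) ? desugar_unless(c) : c }
  return Node.new(node.kind, kids) unless node.kind == :unless
  Node.new(:if, [kids[0], Node.new(:nop, []), kids[1]])
end

# Each pipeline entry names the IR its output is checked against.
PIPELINE = [[method(:desugar_unless), IR_CORE]]

def run(ast)
  PIPELINE.reduce(ast) { |a, (pass, out_ir)| check!(pass.call(a), out_ir) }
end

ast = Node.new(:unless, [false, Node.new(:print, ["hi"])])
run(ast).kind  # => :if
```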
> A "sufficiently smart" compiler must be capable of fusing the lowering passes straight into a parser.
Given the state of compiler-generation tools (i.e. they're not even viable for generating parsers for production quality compilers), I have little hope of seeing a tool like that in the next couple of decades at least. Probably longer. At least in a form I'll be able to use for anything.
> Ideally, one pass = one language. Nanopass handles it nicely.
That is exactly what I see as a maintenance nightmare. Even with half a dozen or so variations at most, I find it annoying.
You can solve this by having one "language" that can represent all intermediate stages. This has a nice side effect of allowing you to easily change the ordering of passes if that turns out to produce better results.
In my experience such languages are very dangerous, and it is better to have a chain of more restrictive languages instead. Most passes only make sense in a fixed sequence anyway. LLVM, btw., is infamously broken in this regard: there are too many implicit pass dependencies.
E.g., there is no point in keeping a one-handed 'if' in a language after a pass which lowers it into a two-handed 'if'. There is no point in keeping a 'lambda' node in your AST after lambda lifting is done.
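The lambda-lifting case can be sketched like so (all node kinds and the name-generation scheme are invented for illustration): the pass hoists each 'lambda' into a named top-level function and leaves only a reference behind, so the post-lifting IR simply has no :lambda kind for later passes to mishandle.

```ruby
Node = Struct.new(:kind, :children)

# Replace every :lambda node with a :funref to a freshly named
# top-level function, collected into `lifted`.
def lambda_lift(node, lifted)
  kids = node.children.map { |c| c.is_a?(Node) ? lambda_lift(c, lifted) : c }
  if node.kind == :lambda
    name = :"lifted_#{lifted.size}"
    lifted[name] = Node.new(:defun, [name, *kids])
    Node.new(:funref, [name])
  else
    Node.new(node.kind, kids)
  end
end

lifted = {}
ast  = Node.new(:assign, [:y, Node.new(:lambda, [Node.new(:call, [:puts, "hi"])])])
main = lambda_lift(ast, lifted)
# `main` now contains :funref instead of :lambda;
# the lifted body lives in lifted[:lifted_0].
```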
You will still have a different de-facto language for each intermediate stage.
E.g. even if "x = 1; y = lambda { puts x }" is technically legal in all stages, if one of the stages is responsible for "lowering" the above to an explicit environment allocation and rewriting accesses to x, then anything below that stage which tries to use the "lambda {}" syntax would fail to compile.
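A toy version of that lowering (the :env_set/:env_get kinds and the pre-computed capture set are my inventions for illustration): after the pass runs, 'x' exists only as a slot in an explicit environment, so a later stage that emits a plain local-variable read of x is speaking the wrong dialect.

```ruby
Node = Struct.new(:kind, :args) do
  def map_children
    Node.new(kind, args.map { |a| a.is_a?(Node) ? yield(a) : a })
  end
end

CAPTURED = [:x]  # pretend an earlier analysis discovered 'x' is captured

# Rewrite reads/writes of captured variables into explicit
# environment operations; everything else passes through.
def explicit_env(node)
  node = node.map_children { |c| explicit_env(c) }
  return node unless %i[assign lvar].include?(node.kind)
  return node unless CAPTURED.include?(node.args[0])
  Node.new(node.kind == :assign ? :env_set : :env_get, node.args)
end

# x = 1; y = lambda { puts x }
prog = Node.new(:seq, [
  Node.new(:assign, [:x, 1]),
  Node.new(:assign, [:y,
    Node.new(:lambda, [Node.new(:call, [:puts, Node.new(:lvar, [:x])])])])
])
lowered = explicit_env(prog)
# x's assignment became :env_set, and the read inside the
# lambda became :env_get; the uncaptured 'y' is untouched.
```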