Resources for Amateur Compiler Writers

5 years ago (c9x.me)

93 comments

kilodeca

This is definitely a backend, machine code focused list. If you're writing a compiler to WebAssembly or even the JVM you're probably not gonna need a lot of these resources.

Indeed I'd argue that these lists are more for an intermediate compiler writer. Beginner compiler writers don't need to know register allocation. They need to understand how to think about language translation, how to write an intermediate representation, how to learn a target language.

I've been toying with the idea of writing a post that goes through the thought process of translation. Specifically, how you build states, and invariants. When I write code generation I spend time thinking about the different states the stack could be in, or what invariant I need to keep across this expression (do I need to save a register or global?).

One simple but important example of this is realizing that given a proper, sound type checker, your code generator should almost never give errors. By the time you reach code generation you should have proven enough invariants that code generation cannot fail. This isn't always true but I've found it to be the case.

jstanley 5 years ago

> Indeed I'd argue that these lists are more for an intermediate compiler writer. Beginner compiler writers don't need to know register allocation.
I'm not disagreeing with you, but note that the title does not say it's for beginners, it says it's for amateurs, which is just anyone who does it for fun rather than money.
seanmcdirmid 5 years ago
That reminds me of the dragon book that goes heavily into VAX assembly and optimizing for CISC machines. Probably not the best resource for beginners or amateurs these days, but it won a Turing award at least.
The problem is what it means to write a compiler is pretty broad, some amateur projects focus on backend work, some on front end, many in between.
- coryrc 5 years ago
  
  Also tremendous amounts just on parsing C. I found Lisp In Small Pieces far more interesting.
- astrange 5 years ago
  
  The Dragon Book is just not very good. The newer editions spend far too long on parsing, which is not a big problem, and just make it look scary.
  Grune's Parsing Techniques is a good book (IIRC when I read it ten years ago), and there are some other ones here.
  https://gcc.gnu.org/wiki/ListOfCompilerBooks
orthoxerox 5 years ago
> Beginner compiler writers don't need to know register allocation. They need to understand how to think about language translation, how to write an intermediate representation, how to learn a target language.
I don't think people that want to write a compiler want to end up with helloworld.c instead of helloword.exe. At least I don't.
- hardwaregeek 5 years ago
  
  At least for me, compiling to a simpler target made learning the ideas easier. Compiling to to raw assembly does give you more nerd credit but it also makes it harder to get started.
- ajkjk 5 years ago
  
  The opposite, really. I started learning about compilers because I wanted to implement my own syntactic ideas. Don't care at all how it runs.
  
  1 reply →
- fooker 5 years ago
  
  Ideally, you'd emit LLVM IR or some similar bytecode or assembly. Then you'd use llc/opt or an appropriate assmebler to get an executable.
  Trying to emit machine code isn't too interesting when you're writing your first compiler.
  
  1 reply →
- bootwoot 5 years ago
  
  The creator of C++ would disagree, as C++ was devised as a compile-to-c language.
enos_feedler 5 years ago

You can certain take any path into compiler writing. I dont know anything about AST or front end compilers but ended up writing register allocators and constant folding for nvidia GPU JIT
jeffypoo 5 years ago

I would be interested in reading your yet-to-be-written post!
samatman 5 years ago

That would be great to read, I sincerely hope you write it, post it here, and I get to read it.

estebank 5 years ago

A lot of ink is spilled on the details of how to make compilers go from taking text and turn it into fast executables, but I believe that the User Experience of compilers is as important a subject as those, but there's very little information about it in the literature, outside of a handful of papers and the design documents of existing compilers that do focus on this.

dmitriid 5 years ago
No idea why you are being downvoted, but it's true.
Compiler UX is often horrible, but the new breed of languages tries to rectify parts of it:
- non-crpytic error messages. Show exactly what failed, where it failed, how to solve it.
- compiler-as-a-service. Tool makers will love you.
- compilation speed. If I need to wait 2 minutes for a 10 LOC project to compile, your compiler is doing something wrong.
- the compiler should be the only tool you need. If you rely on 15 different tools just to compile a project, you're doing something wrong
- estebank 5 years ago
  
  As you allude to, treating a compiler as an isolated batch process is an outdated strategy. Any compiler started today should be designed such that it can be used in an IDE from the start, even if you don't implement it in that way until after you've reached a certain level of maturity.
  For error messages in particular is a very expansive topic because a compiler might have the following components:
  Lexer Parser Macro Expansion Name Resolution Lowering Type Checking Privacy Checking Lifetime Analysis Code Generation Linking
  and all of them can produce errors. If you introduce error recovery in your parser, then you better make sure you're carrying that information forward to name resolution and type checking. If you introduce a error type placeholder, better account for it during the rest of your analysis to avoid complaining about non-existing methods on something that wasn't resolved in the first place. Error deduplication. Exploring the grammar space to identify things that a human may attempt but isn't allowed for good reason but that will not blow up until some quasi-random stage. Now that I think about it you could write a whole book about these topic.
  
  3 replies →
- hctaw 5 years ago
  
  I downvoted it because I think it's a meme that ignores where contemporary compiler engineering is happening. We don't even realize they exist anymore - V8, Graal, Roslyn, etc are all optimizing JIT compilers running in the background that can compile multiple languages on the fly. And they do a great job.
  Not everything is a C++ compiler, which is inherently difficult to compile correctly. Let alone quickly.
  I think the future of compilers is that everything will have an incremental parser and JIT compiler with optional AOT compilation for release builds. Hopefully LSP and DAP will take over the world of tooling integration.
  There are other fundamental mistakes to look at than error messages and performance, like pretending that build systems don't exist and that linking multiple compilation units isn't something the compiler can do. Compiling code is getting much easier, but building it is much harder. Compilers have the most information and internal structure to handle all our build problems, but don't do it.
- turminal 5 years ago
  
  > - non-crpytic error messages. Show exactly what failed, where it failed, how to solve it.
  I'm working on a compiler in my free time and I agree with you, but in defense of compilers that suck at this: error reporting is really, really hard to do right.

iTokio 5 years ago

By the author of the excellent https://c9x.me/compile/

A lightweight alternative to llvm https://c9x.me/compile/doc/llvm.html

UncleEntity 5 years ago

The Zephyr Abstract Syntax Description Language, Wang et al.

Has been invaluable for creating the AST (and soon the IR nodes)-- though it has been a source of much yak shaving as I've rewritten my asdl generator at least 3 times.

Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages Book, Terence Parr

Pretty java/antlr focused but wasn't too hard to take the concepts and apply them to a C++ based transpiler I've been slowly poking at.

Combining Analyses, Combining Optimizations, Cliff Click

His thesis and an extension of the linked paper Global Code Motion, Global Value Numbering -- my next step down the rabbit hole...

Tree Automata Techniques and Applications, Comon et al.

Still on the fence on this one vs the iburg/burs algorithm, probably be more useful and match better with the way Parr's book goes about things.

dmitriid 5 years ago

I'm not an expert by far, but I feel like many people who write compilers or develop languages no longer recommend The Dragon Book as it's very outdated.

Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing. It's not uncommon for a book or a course dedicate 60-70% of its time to parsing. And leaves nothing to everything else.

tom_mellior 5 years ago
> I'm not an expert by far, but I feel like many people who write compilers or develop languages no longer recommend The Dragon Book as it's very outdated.
Yes. There are newer editions that I can't judge, but the 1986 edition linked from the page should nowadays only be read by historians. If you want to implement a compiler, look elsewhere, for something newer.
This isn't to say that this was a bad book 35 years ago. It wasn't. But it is simply outdated. It does not contain things that are important today. It has very little treatment of type systems and type inference, for example. Its treatment of data flow analysis is based on intervals, which have limitations (though also some interesting theoretical advantages) and are used by zero compilers written in this millennium (and not even described very well in the Dragon book IIRC). And then of course there is the parsing problem, see my rant elsethread.
As for the article, I think A Catalogue of Optimizing Transformations (https://www.clear.rice.edu/comp512/Lectures/Papers/1971-alle...) would make an interesting addition.
Edit: Forgot to add that the old old Dragon book doesn't cover two topics that the article itself considers very relevant: SSA form (which I believe was discovered just around that time) and linear scan register allocation (which is much newer).
- aarchi 5 years ago
  
  What would you recommend that covers data flow analysis better than the dragon book? I'm currently starting data flow analysis for my own compiler using intervals and/or modular congruences (operations can be reduced to congruences, then combined with the Chinese Remainder Theorem). What are the theoretical advantages of intervals and what do you recommend over that?
  
  1 reply →
MaxBarraclough 5 years ago

Yep, here's a thread from two years ago on exactly that: https://news.ycombinator.com/item?id=20436399
rualca 5 years ago
> Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing.
I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.
Also, you're far more likely to write a parser for a language than implement anything remotely related to compilers and interpreters.
It's one thing to state that some topics might justify more content, but it's absurd to downplay the fundamental importance of parsers in putting together a compiler.
- tom_mellior 5 years ago
  
  The problem with the old old Dragon Book's treatment of parsing is that it explains a particular impractical algorithm (LR(1)? LR(k) maybe? it's been a while) in way too much detail. You would need that detail if you were to implement a parser generator (for that specific algorithm), or a parser using that specific algorithm, which isn't terribly well suited for generating useful error messages for your users.
  Today's computers are ridiculously powerful, and the perceived performance advantages of LR parsing are simply irrelevant. Further, even if you were to implement an LR parser, you would do it with the aid of a parser generator, which hides the boring details and saves you from needing to understand the LR algorithm itself. But more likely when you write a parser you use a parser generator using a different approach, or you write a recursive descent parser by hand. In either case, the dry parts of the old old Dragon Book are completely irrelevant.
  The people criticizing the 1986 Dragon Book's treatment of parsing aren't saying that parsing is irrelevant. They are saying that parsing that specific way is either irrelevant or done for you by a tool that is not the interesting part of a compiler.
- HelloNurse 5 years ago
  
  If parsing "influences how the language can be structured and also how it should be structured for performance reasons" it's probably going to be a mediocre language, in which self-imposed constraints interfere with useful special features resulting in a bland design lacking competitive advantages.
  Pascal has already been cited in other comments as having been designed for ease of parsing, but the Turbo Pascal compiler is an isolated masterpiece, and high-quality compilation at decent speed is normally more useful than cheap compilation at ludicrous speed.
  Maybe parsing seems a large part of constructing a compiler because it's naturally the first task and because it's likely to be entangled with designing the syntax of the language, which can pose difficulties (e.g. ambiguities, context sensitivity, messy operator precedence).
- dmitriid 5 years ago
  
  > Parsing is the single most important feature of building a programming language
  It realy isn't. Also, it's the most trivial part of parsing a programming language (in a compiler).
  > It influences how the language can be structured and also how it should be structured for performance reasons.
  Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi. Even if you do it with regexps. Everything that comes after parsing is very complex and is rarely explained in any detail. And everything that comes after actualy directly influences the structure, the performance etc.
  It really makes zero difference if you decide that your typeing info should be written like this:
  T x = new T();
  or like this:
  let x: T = T()
  or like this:
  var x := T()
  All this is trivial to parse. However, deciding on how you implement typing, what needs to be in the languaage to support your choice etc., now that definitely influences the structure, performance, trade-offs etc.
  How much of that is in the Dragon Book? Next to zero? But surely there are 4 chapters on how to parse this.
  
  3 replies →
- jcranmer 5 years ago
  
  The thing about parsers is this: we have tools that let you input any context-free grammar and will spit out a parser for you. Getting a working parser for your language is trivial, even if getting a good parser is non-trivial (though even there, I hesitate to say that it's hard).
  In terms of time investment, building a parser takes like an hour whereas the semantics for the same language would take a hundred hours. Even when I throw together a parser for a simple language for some project, I spend more time trying to wrangle the output of the parser (that is, the semantics) than I do on the actual parsing portion of it. All of that effort in the Dragon book explaining how to build an LL(1) or LR(1) parser, or even how to build a lexer? Never used it. Almost completely useless. I spend more time thinking if I want to use a peek/next approach or a get/unget approach than handling shift-reduce conflicts or building first/follow sets.
  Of everyone I know who works in compilers, I can't think of anyone who doesn't think that the Dragon book spends too much time on parsing.
  
  2 replies →
- jhgb 5 years ago
  
  > I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.
  Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago. The necessary treatment of all things parsing is then covered on twenty pages of his Compiler Construction text.
  
  4 replies →
Rochus 5 years ago

> no longer recommend The Dragon Book as it's very outdated
Only if they didn't realize there is a second edition from 2006.

jhgb 5 years ago

Perhaps Cooper's and Torczon's "Engineering a Compiler (2nd Edition)" should be among the entries in the "Books" category.

hasitseth 5 years ago

Alan Holub's 'Compiler Design in C' is not mentioned in the books. Holub's book had well-written code. I had learned a lot from his code.

elvis70 5 years ago

And a free copy of this book in PDF has been made available by the author: https://holub.com/compiler/

chandlore 5 years ago

Latest versions of the ABI specifications linked in the Machine Specific section

ARM: https://github.com/ARM-software/abi-aa/releases

x86-64: https://gitlab.com/x86-psABIs/x86-64-ABI (go to most recent CI job and download artifacts for a compiled PDF)

piinbinary 5 years ago

Are there any good resources for building a runtime and codegen for languages with garbage collection?

aardvark179 5 years ago

The Garbage Collection Handbook is excellent, but might go into more depth than you need. For a simple garbage collector there needn’t be any effect on codegen, all you need to worry about is the runtime side of things. Search for how to build a conservative garbage collector and you should find plenty of tutorials and introductory CS courses on the subject.
moonchild 5 years ago
For a basic GC, code generation should be unaffected. The only place you'd need to care about code generation is if you wanted to make a realtime GC, inserting read or write barriers.
- piinbinary 5 years ago
  
  My understanding from doing a bit of reading is that there are at least two things codegen-adjacent that an accurate collector will want (even if you stop the world while collecting):
  1. The locations of pointers inside of allocations.
  2. The locations of pointers in the stack (into which you may need to spill registers).
  I can think of ways to do these (e.g. perhaps each stack frame contains a bitmask at a certain offset), but I'd love to learn what the best practices are.
  
  1 reply →

phonebucket 5 years ago

One thing I find curious: lots of compiler writing seems to be done in Haskell, but I can hardly find any Haskell-based resources for writing compilers.

Can anyone here please recommend any such resources?

atsuzaki 5 years ago

This is a pretty good tutorial that I know of: https://www.stephendiehl.com/llvm/. The author went on to write a book too, though it's still in progress http://dev.stephendiehl.com/fun/
toolslive 5 years ago

https://www.microsoft.com/en-us/research/publication/impleme...
tester756 5 years ago
>lots of compiler writing seems to be done in Haskell
those 4fun of real world?
- mechanicalDigit 5 years ago
  
  Most compilers are written for fun. That’s the point of the thread.
  Also, there is a lot of modern language research is done from the “lens” (heh) of Haskell, or at least the Glasgow school.

nethunters 5 years ago

Any similar resources on writing interpreters?

Also is there are a technical difference between an interpreter and a VM?

mtlynch 5 years ago
I haven't read it, but Bob Nystrom's Crafting Interpreters has been on my list for a while:
https://craftinginterpreters.com/
- orthoxerox 5 years ago
  
  It's a great book. Bob's not a CS teacher, and it's great: the book is grounded in practice and has been structured in a way that lets you get a tangible result at the end of every chapter.
- smueller1234 5 years ago
  
  Don't wait, read it. It's fun and informative, worth every minute! I haven't had as much fun reading a technical book in years!
- LordN00b 5 years ago
  
  I second this. It's a great resource, very well written. Opinionated enough to have character and humour, but not so dry as to make your eyes glaze over. I guess I'm saying that for a technical tome it's very, very readable (as was his Game Design Patterns book incidentally)
CodeArtisan 5 years ago

A virtual machine executes instructions while an interpreter takes source code as input. An interpreter could be built on top of a virtual machine, obviously, but not necessary. For example, SICP/PLAI/OOPLAI[1] explain how to implement an interpreter on top of Scheme where you directly interpret the s-expressions. These may be a worth read if you want to learn about the usual concepts and techniques used in programming language implementations from a more general point of view. Like, for example, how to implement a static type system or an object model based on classes. Interpreters based on a virtual machine are actually compilers on the fly; the source code is not interpreted but translated into instructions beforehand.
[1] http://sarabander.github.io/sicp/
http://cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04...
https://users.dcc.uchile.cl/~etanter/ooplai/
theodpHN 5 years ago

Check out Peter Norvig's neat BASIC Interpreter https://github.com/norvig/pytudes/blob/master/ipynb/BASIC.ip...
MH15 5 years ago
Interpreters usually execute by walking the abstract syntax tree at runtime, bytecode VMs usually execute actual bytecode similar to machine code. The bytecode for such VMs is often way simpler than the instruction set for an actual computer, so it's far more portable between platforms.
- quelltext 5 years ago
  
  Keep in mind that bytecode interpreted languages (Ruby, Python) are typically called interpreted languages. Java is usually called "compiled" because of the explicit step vs. Ruby and Python, but it's essentially the same. And typically you'll find discussions wrt to JVM about interpretation in Java referring to bytecode interpretation vs. compiling being JIT compiling.
  Ultimately limiting interpreters to AST interpreters is not quite correct. The AST is just another IR that source code needs to be converted to just like bytecode. And the AST is essentially also executed by a virtual machine. Interpretation of the IR (the AST, or bytecode, etc.) is one part of a VM. Of course in some instances the VMness of a runtime is more pronounced (Java) than in others (Ruby).
  The difference between interpretation and compilation is that the latter is meant to run on real hardware vs. the former implies something that executes the IR of a program by dynamically choosing which machine instructions to run.
  Of course a compiler is also something that takes in some code and outputs some other, typically more low level representation.
  My point being I don't think there is a strict or consistent definition these days for what an interpreter is.
  Case in point: I've also heard people say interpreters read the code "line by line" (or rather, not that but more accurately, as far as they need to read before they know what to execute next) and execute it piece by piece. Which might be how some interpreters worked in some languages in the past. An AST interpreter OTOH already implies some whole source file pre-processing. Is an AST interpreter then less of an interpreter than one that streams the source code line by line. Is a program that does that more an interpreter than another which goes a step further and, e.g. executes a simplified AST, or one that executes a straightforward trivial bytecode representation of the simplified AST?
- tom_mellior 5 years ago
  
  > Interpreters usually execute by walking the abstract syntax tree at runtime, bytecode VMs usually execute actual bytecode similar to machine code.
  This isn't the right way to separate these concepts. A VM that executes bytecodes by interpretation is also an interpreter (Python, Ruby, PHP are well-known examples). Many VMs don't interpret, or they interpret mostly during startup, but actually try to compile hot parts of the code and execute that machine code rather than interpreting (Java and JavaScript VMs are well-known examples).
  The VM part more properly refers to things like data representation and memory management. Interpreters of an abstract syntax tree, interpreters of bytecodes, and just in time compilers all need to run inside a system that takes care of loading code, allocating memory when requested, and often doing garbage collection. These services are what make a VM. The exact execution strategy is not what determines VM-ness.
- nethunters 5 years ago
  
  Thanks for clearing that up.
  Which languages or implementations of languages directly interpret the AST without the intermediary bytecode compilation?
  I know Python, Java and JavaScript (V8 and SpiderMonkey) all compile to bytecode first probably to speed up subsequent runs and run some optimisations on the bytecode.
  What other benefits are there to compiling to bytecode first vs directly interpreting the AST?
  
  8 replies →

ChicagoDave 5 years ago

I've toyed with compiler stuff for years. My current idea is a fluent/piping language. Is there any reference that would help with that?

Rochus 5 years ago

There is a more recent edition of "Compilers: Principles, Techniques, and Tools" worth considering: https://www.amazon.com/Compilers-Principles-Techniques-Tools....

ahelwer 5 years ago

Does anybody know of good introductory resources on error recovery techniques when writing a parser & interpreter? Bonus points if they use combinators instead of a hand-written parser.

jstimpfle 5 years ago
Not a resource, but depending on what kind of parser you are doing, maybe an idea.
The thing I'm going to try out next for one of my batch parsers (written recursive descent style as many or most parsers in the wild) is to have AST nodes of kind "invalid". The idea is that the parser should always complete, with minimal in-line error checking, even when I have say "require_token_kind()" or similar calls in my parser code.
Maybe not even an "invalid" kind, but an additional "invalid" flag, so that even those invalid nodes can still have all the regular node-kind-depending structure on them.
The other obvious idea is to use a parser generator. But I'm pretty sure I wouldn't be happy on that route - from the little experiments that I've done, I infer it's too bureaucratic, too much boilerplate for me to suffer.
The more general advice for error handling is that you should try to make APIs that require very little in-line error handling. Instead, keep errors behind the APIs, maybe collect them in a list, or only keep the first or last error, and allow to handle them at a later point (out-of-line).
And boom, most of the talk about smart syntax and type systems and other clever systems to make error handling more convenient suddenly becomes obsolete.
Yet more abstractly, don't try to build systems that make cumbersome things more convenient, but think how you can do with less of these cumbersome things. (That works wonders for runtime performance, too).
- tester756 5 years ago
  
  >The more general advice for error handling is that you should try to make APIs that require very little in-line error handling. Instead, keep errors behind the APIs, maybe collect them in a list, or only keep the first or last error, and allow to handle them at a later point (out-of-line).
  I guess you wanted to say that instead of printing just what's wrong at some point, be smarter and try to build some handier error handling like attaching error to ast's node and then maybe do something with it
  
  1 reply →
ruste 5 years ago

I've actually been meaning to write something up about this. I've found you can get fairly good error messages from a parser combinator style parser just by saving the deepest location you successfully parsed to and the parser you failed in.

andromeduck 5 years ago

Wish there were more resources about or tools for bidirectional parsing.

failwhaleshark 5 years ago

ProTips after writing numerous kinds of compilers in university:

Recursive descent with backtracking is much easier to construct and maintain for diagnostics than bison/flex.

Derivative parsing is very interesting.

testrvb 5 years ago

test

alanafarias22 5 years ago

Amei esse posto

alanafarias22 5 years ago

Esses comentários ajudam muito