Comment by Revex
11 years ago
How does something like this happen? How can you write a compiler in the language that it is supposed to be compiling?
11 years ago
How does something like this happen? How can you write a compiler in the language that it is supposed to be compiling?
TLDR: You can't write the first compiler for a language in the language. However, you can write the second compiler in the language and use the first compiler to compile it.
If you already have a compiler in the language you're compiling, you can update the compiler with new features by the following process, called "bootstrapping". This process is used in gcc for example:
1. Use the old binary version to compile the new source version.
2. Use the binary produced in step 1 to compile the new source version.
3. Use the binary provided in step 2 to compile the new source version.
The results of stages 2 and 3 should be the same (assuming the old and new compilers assign the same semantics to the compiler source code and don't have non-determinism, e.g. using wall-clock time to determine when to stop searching for optimizations).
The bootstrap process can't be used on the first compiler for a brand-new language, because there is no "old compiler" that can be used in stage 1. The only way around this is to write the first compiler in a different language that already exists.
Of course, if a self-hosted compiler is your goal, you can afford letting the very first compiler (the one written in another language) be limited, "dirty," or "hacky". You don't have to implement the entire language in the first compiler; the first compiler just has to support enough of the new language to allow you to write the second compiler.
Anyway, once you have the first compiler, you write the first version of the second compiler in whatever subset of the language the first compiler supports. Once the first compiler builds the first version of the second compiler, the first compiler is no longer needed; you can add all new features to the second compiler only.
New versions of the second compiler (perhaps supporting more language features) can now be produced by bootstrapping. Of course, you can also use the second compiler to compile a totally different compiler implementation written in the new language (this would be the third compiler for the new language).
First you solve the chicken and egg problem by writing a bootstrap compiler for the language in some other language (e.g. a C# compiler in C or C++, or a C compiler in assembly language). Then you can compile programs in the new language, and compilers are programs, so you can write a C# compiler in C# and the binary so produced will be able to compile its own source.
Most compilers I know of start with an implementation in another language and eventually write a self-hosted one. Microsoft already had compilers for .NET, though, so you just compile it with those until it's self-hosting.
Bootstrapping a compiler from nothing is harder and involves a lot of incremental steps. http://en.wikipedia.org/wiki/Bootstrapping_(compilers)
Funny you should mention that! There was a really cool article about bootstrapping a compiler nearly from scratch just the other day[1]. But yeah, most people just start with C or C++ until they can compile enough of the language to write a compiler in it.
Some other interesting contemporary examples:
I think as we speak Go is working on its plan to switch from using a compiler written in Go[2].
Rust[3] has kind of a hybrid approach where the front-end of the compiler is written in rust, but generates llvm bytecode which is then compiled the rest of the way by llvm itself. It takes three "stages" for them to do a full compile, first a binary "snapshot" compiler is downloaded and used to compile the compiler (stage 0), then that generated compiler is used to compile the compiler again (stage 1), and then that generated compiler is used to compile the compiler again (stage 2). Stage 2 is actually just a test - its output should be identical to that of stage 1, and if it isn't, something went wrong.
There is a classic lecture from Ken Thompson[4] that I think is really illuminating on how this stuff works, if you're unfamiliar with it.
It's really fascinating stuff!
[1]: http://homepage.ntlworld.com/edmund.grimley-evans/bcompiler.... [2]: https://docs.google.com/document/d/1P3BLR31VA8cvLJLfMibSuTdw... [3]: http://www.rust-lang.org/ [4]: http://cm.bell-labs.com/who/ken/trust.html
They wrote the original C# compiler in a different language (probably C++) first.
Then they wrote the second C# compiler in C#, and used the first compiler to compile the second compiler. Voila!
We regularly dogfood, so we're on like N iterations of your structure, but that's the idea!
Could you ask around the office and find out some of the details about the C# bootstrap process? I would be fascinated to know what language the non-C# compiler was written in and also how long they used/improved the non-C# compiler until they switched to the C# compiler.
This source release is so cool!
2 replies →
http://en.wikipedia.org/wiki/Bootstrapping_(compilers)
This is almost second-nature in Lisp:
http://c2.com/cgi/wiki?StructureAndInterpretationOfComputerP...
And tends to happen to other languages... you just have to bootstrap the first compiler.