Bootstrappable Builds

9 months ago (bootstrappable.org)

The big issue with bootstrappable builds is how to get started and have good examples. It's an ambitious goal, like landing on the moon, and takes a lot to get there. My understanding is that you need to (a) have a compiler that can be built from understandable code, which itself may require a chain of increasingly complex compilers. I've heard this referred to as a "compiler pilgrimage," but I can't find where I heard the term. (b) Then you need to be able to build the code with that compiler and its dependencies. This is a fairly well-solved problem these days, assuming you can pin all your dependencies and transitive dependencies. (c) Then all of this needs to be reproducible so that you can actually trust the output, and that is still a hard problem today. A small sketch of one reason (c) is hard follows.
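
As a minimal sketch of why (c) is hard, consider build timestamps: anything that embeds the current time produces a different binary on every rebuild. The reproducible-builds convention for pinning this is the SOURCE_DATE_EPOCH environment variable, which a build step can honor like this (illustrative C, not any particular project's code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Embedding the wall clock makes every rebuild differ... */
    time_t stamp = time(NULL);

    /* ...so reproducible builds honor SOURCE_DATE_EPOCH, a pinned
       timestamp, whenever it is set in the environment. */
    const char *sde = getenv("SOURCE_DATE_EPOCH");
    if (sde != NULL)
        stamp = (time_t)strtoll(sde, NULL, 10);

    printf("build stamp: %lld\n", (long long)stamp);
    return 0;
}
```

Run it twice with SOURCE_DATE_EPOCH=0 set and the output is identical; run it without, and the stamp (and any binary embedding it) differs on every build.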

  • Yeah, pretty much. The best example I have found for showcasing a solution to this issue is this one[1], where you start from the most basic of "compilers" (quite literally the equivalent of a `sed` command; a sketch of that stage follows the link) and work your way up to Linux 4.9, if I remember correctly. The biggest issues are circular dependencies (a lot of lower-level build tools depend on themselves nowadays, so we end up needing to build 4-5 older versions to work our way to a modern toolchain) and different architectures. Since x86 has always been a given, having a completely bootstrappable toolchain on something like ARM or even RISC-V is a harder problem, where older versions of the needed programs may not compile, or may not be able to compile further programs, on those architectures.

    [1]: https://github.com/fosslinux/live-bootstrap
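
    To make "the equivalent of a `sed` command" concrete: the first stage does little more than turn hand-written hex (with comments) into raw bytes. A toy sketch of that shape in C, not the real hex0:

    ```c
    /* Minimal hex0-style "compiler": hex digits in, raw bytes out.
       Comments (from ';' or '#' to end of line) and whitespace are
       skipped; every two hex digits become one output byte. */
    #include <stdio.h>

    static int nibble(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;                        /* not a hex digit */
    }

    int main(void) {
        int c, hi = -1;
        while ((c = getchar()) != EOF) {
            if (c == ';' || c == '#') {   /* comment: skip to newline */
                while ((c = getchar()) != EOF && c != '\n')
                    ;
                continue;
            }
            int n = nibble(c);
            if (n < 0) continue;          /* whitespace, punctuation */
            if (hi < 0) hi = n;           /* first nibble of a byte */
            else { putchar((hi << 4) | n); hi = -1; }  /* emit byte */
        }
        return 0;
    }
    ```

    Roughly everything above this level (an assembler, then cc_x86, then TCC, then an old GCC) is just source text processed by the stage below it.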

    • > To avoid using an existing toolchain, we need some way to be able to compile a GCC version without C. We can use a less well-featured compiler, TCC, to do this. And so forth, until we get to a fairly primitive C compiler written in assembly, cc_x86

      I imagined going a slightly different route.

      A minimal Forth can be written partly in assembly and partly in itself. It suffices to implement a console over a serial port, a primitive FAT filesystem to access SPI flash, and maybe even an interface to USB mass storage.

      Forth is not very easy to audit, but likely still easier than raw assembly.

      One can write a C compiler right on top of that, sufficient to compile TCC.

      Alternatively, a simple Lisp can be written on top of the Forth; that's much simpler than writing one in assembly. Using the Lisp, a much more understandable and auditable C compiler can be written.

      Much of the Forth, all of the Lisp, and much of the C compiler (except code generation) would be portable and reusable across multiple architectures, without the need to audit them fully every time.

      The fun part here is (potentially) not using QEMU or cross-compilers, and running everything on sufficiently powerful target hardware, for the extra paranoid. A sketch of the kind of stack-machine core this would all rest on is below.
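
      To give a feel for how small the trusted core is, here is a toy of the Forth idea in C: a bare stack machine whose handful of primitive words would each be a few auditable lines of assembly in the real thing. There is no dictionary and no defining words, so this is only the inner core, not a usable Forth:

      ```c
      /* Toy Forth-style stack machine: reads whitespace-separated
         tokens; numbers are pushed, known words are executed.
         No dictionary, no ':' definitions, no error recovery. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      static long stack[256];
      static int sp = 0;

      static void push(long v) { stack[sp++] = v; }
      static long pop(void)    { return stack[--sp]; }

      int main(void) {
          char tok[64];
          while (scanf("%63s", tok) == 1) {
              char *end;
              long v = strtol(tok, &end, 10);
              if (end != tok && *end == '\0') { push(v); continue; }
              if      (strcmp(tok, "+") == 0)   { long b = pop(); push(pop() + b); }
              else if (strcmp(tok, "*") == 0)   { long b = pop(); push(pop() * b); }
              else if (strcmp(tok, "dup") == 0) { long a = pop(); push(a); push(a); }
              else if (strcmp(tok, ".") == 0)   { printf("%ld\n", pop()); }
              else if (strcmp(tok, "bye") == 0) { break; }
              else fprintf(stderr, "unknown word: %s\n", tok);
          }
          return 0;
      }
      ```

      Feeding it `2 3 + dup * .` prints 25. A real Forth layers `: ... ;` definitions on top of exactly this kind of loop, and the C compiler mentioned above would be written as such definitions.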


  • I feel like the biggest issue of all is that bootstrappable builds are by and large solving the wrong problem. The main concern in bootstrapping is packages that require an older version of themselves to build, because if you skip a version for whatever reason, you may end up without a clear way forward (this isn't a theoretical concern--Debian's Gradle packaging is horribly broken because of this, and they've been trying for years without success to fix it).

    When you have a build dependency on one of multiple stage 0 compilers, the problem of a cycle basically disappears. You need a C++ compiler to build a C++ compiler these days, but you have your choice of two C++ compilers, so the probability you wake up one day without a working C++ compiler is quite low. And the mostly theoretical trusting-trust problem basically disappears once a second stage 0 compiler is available (this is the intuition behind diverse double-compiling; see the sketch after this comment), so the marginal benefit of a third or fourth or nth stage 0 compiler is basically nil.

    And yet, the vast majority of bootstrappable build efforts focus on "how do I go from, say, a hex editor to a working C compiler," which is one of the least useful efforts imaginable. You can sort of see this in their project list: they tout their efforts to get to GCC from "stage0," but when they start talking about Java, it becomes "here's how to build a 20-year-old version of Java, but, uh, most of this stuff is unmaintained, so good luck?" And the JVM languages are in a state of "uhhh... we don't know how to break these cycles, any ideas?"
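
    For reference, the second-compiler argument is roughly David A. Wheeler's diverse double-compiling: compile the compiler's source with compiler A and, independently, with compiler B, then use each result to compile the same source again. Both second-stage compilers were produced by a compiler built from the same source, so if the build is deterministic they must be bit-for-bit identical, and a trusting-trust backdoor hiding in A or B shows up as a mismatch. The final check reduces to a byte comparison (what `cmp` already does); a minimal sketch:

    ```c
    /* Compare two files byte-for-byte; exit 0 if identical.
       This is the last step of a diverse-double-compiling check:
       the two second-stage compiler binaries must match exactly. */
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
            return 2;
        }
        FILE *a = fopen(argv[1], "rb"), *b = fopen(argv[2], "rb");
        if (!a || !b) { perror("fopen"); return 2; }
        long off = 0;
        int ca, cb;
        do {
            ca = fgetc(a);
            cb = fgetc(b);
            if (ca != cb) { printf("differ at byte %ld\n", off); return 1; }
            off++;
        } while (ca != EOF);
        puts("identical");
        return 0;
    }
    ```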

The story referenced as part of the motivation for the project[1] is pretty chilling. The laws of physics can give you a trustworthy baseline if you have an old-school analog oscilloscope handy to watch for unexpected network packets.

If you have old-school TTL chips, EPROMs, RAM, and time, you could build a CPU you can test all the parts of, and trust. You could even work your way up to floppy disks and an analog CRT display.

Once you want to ramp up the speed and complexity, things get dicey. I have ideas that would help, but nothing provably secure.

[1] https://www.teamten.com/lawrence/writings/coding-machines/

> Current versions of GCC are written in C++, which means that a C++ compiler is needed to build it from source. GCC 4.7 was the last version of the collection that could be built with a plain C compiler, a much simpler task.

Which C++ compiler was used to build GCC 4.8?

Regarding the "security" aspect, I'm interested in what an attack vector would look like against a build system.

Like, say you are building code, where all of the functions below are compilers and * denotes an evil compiler. Every link in the chain is a compiler building another compiler, until the last node, which builds the code:

A() -> B() -> Evil*() -> D() -> E(code) -> binary

How in the world would the evil compiler in this situation inject something malicious into the final binary?

  • Any compiler (or binary) after the evil compiler is compromised. It can inject malicious code into anything it creates (or anything that is produced by what it makes).

    Essentially, the evil compiler can include its evil parts in whatever compiler it outputs. Even worse, it can include the self-replicating logic itself in that output, so D, E, and everything they build inherit it.

    You can follow this logic down as long a chain as you like. This is Ken Thompson's "Reflections on Trusting Trust" attack; a toy sketch of the mechanism is below.
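
    A toy sketch of the mechanism (the function names and trigger strings here are made up, and a real attack hides in code generation rather than string matching): an "evil compiler" that copies its input through unchanged except for two triggers, which is exactly what lets Evil*() reach the final binary through D() and E():

    ```c
    /* Toy of the trusting-trust attack: a fake "compiler" that passes
       source through verbatim, except for two pattern-triggered
       injections. Hypothetical patterns, for illustration only. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[4096];
        while (fgets(line, sizeof line, stdin)) {
            fputs(line, stdout);
            /* (1) Attack the end product: plant a backdoor when we
               think we are compiling a login-style program. */
            if (strstr(line, "int check_password("))
                fputs("    /* injected */ if (!strcmp(pw, \"letmein\")) return 1;\n",
                      stdout);
            /* (2) Attack the next compiler: re-plant this injection
               logic when we think we are compiling a compiler, so
               rebuilding from clean sources does not remove it. */
            if (strstr(line, "/* compiler main loop */"))
                fputs("    /* injected: copy of this whole injection logic */\n",
                      stdout);
        }
        return 0;
    }
    ```

    Trigger (1) compromises the final binary; trigger (2) is the self-replicating part: when Evil*() builds D() from perfectly clean sources, the injection logic rides along into D(), then into E(), and finally into the binary.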

Then the trust is in your silicon: not only the CPU, but also the network card, hard drive, memory controller, PCI bus...