Comment by er4hn
9 months ago
The big issue with bootstrappable builds is how to get started and have good examples. This is an ambitious goal, like landing on the moon, and takes a lot to get there. My understanding of this has been you need to (a) Be able to have a compiler that can be compiled from understandable code, which itself may require a set of increasingly complex compilers. I've heard this referred to before as a "compiler pilgrimage" but I can't find where I heard that term. (b) Then you need to be able to build the code with that compiler / dependencies. This is a pretty well solved problem these days assuming you can pin all your dependencies and transitive dependencies. (c) Then this all needs to be reproducible so that you can actually trust the output and that is a pretty hard problem today.
Yeah pretty much, the best example I have found for showcasing a solution to this issue is this[1] example, where you can see we start from the most basic of "compilers" (quite literally the equivalent to a `sed` command) and work our way up to Linux 4.9 if I remember correct. Biggest issue is circular dependencies (a lot of lower level build tools depend on themselves nowadays, so we end up needing to build like 4-5 older versions to work our way to a modern toolchain) and different architectures. Since x86 has always been a given, having a completely bootstrappable toolchain on something like arm or even risc-v is a more complex problem, where older versions of programs needed may not necessarily compile or be able to compile further programs on these architectures.
[1]: https://github.com/fosslinux/live-bootstrap
> To avoid using an existing toolchain, we need some way to be able to compile a GCC version without C. We can use a less well-featured compiler, TCC, to do this. And so forth, until we get to a fairly primitive C compiler written in assembly, cc_x86
I imagined going a slightly different route.
A minimal Forth can be written in assembly and in itself. It suffices to write a console using a serial port, a primitive FAT filesystem to access SPI Flash, and maybe even an interface to USB mass storage.
Forth is not very easy to audit, but likely still easier than raw assembly.
One can write a C compiler right on top of that, sufficient to compile TCC.
Alternatively, a simple Lisp can be written on top of the Forth, it's much simpler than writing it in assembly. Using the Lisp, a much more understandable and auditable C compiler can be written.
Much of the Forth, all of the Lisp, and much of the C compiler (except code generation) would be portable and reusable across multiple architectures, without the need to audit them fully every time.
The fun part here is (potentially) not using QEMU and cross-compilers, and running everything on a sufficiently powerful target hardware, for the extra paranoid.
This[0] repo which has the stage0 code used by GPs linked repo includes some explanation in the README about why not Forth or Lisp - in summary it turned out to be harder than that, at least for the authors of that project.
0. https://github.com/oriansj/stage0
2 replies →
i think this is a good approach, but almost everyone who has tried it has gotten bogged down in the usual problems with forth, where you get distracted by writing clever code instead of writing fairly boring, straightforward code that gets the job done in a dumb way. and where things fail in difficult-to-debug ways, so you spend a lot of time debugging. virgil dupras's duskos is the only candidate for an exception to this rule. hopefully you will write a second one!
2 replies →
I feel like the biggest issue of all is that bootstrappable builds are by and large solving the wrong problem. The main concern of bootstrapping is the existence of packages that require on older version of them to build themselves, because if you end up skipping a version for whatever reason, you may end up without a clear way forward (this isn't a theoretical concern--Debian gradle packaging is horribly broken because of this, and they've been trying for years without success to fix it).
When you have a build dependency on one of multiple stage 0 compilers, the problem of a cycle basically disappears. You need a C++ compiler to build a C++ compiler these days, but you have your choice of two C++ compilers, so the probability you wake up one day without a working C++ compiler that you need is quite low. And the mostly theoretical trusting-trust problem basically disappears on the second stage 0 compiler availability, so the marginal benefit of a third or fourth or nth stage 0 compiler is basically nil.
And yet, the vast majority of bootstrappable build efforts are basically focusing on "how do I go from, say, a hex editor to a working C compiler," which is one of the least useful efforts imaginable. You can sort of see this with their project list: they highly tout their efforts to get to gcc from "stage0", and when they start talking about Java, it instead becomes "here's how to build a 20-year old version of Java, but, uh, most of this stuff is unmaintained so good luck?" And the JVM languages are in a state of "uhhh... we don't know how to break these cycles, any ideas?"