Comment by api
3 days ago
An open secret in our field: the current market-leading OSes and (to some extent) system architectures are antiquated and sub-optimal at their foundations because of backward-compatibility requirements.
If we started green field today and managed to mitigate second system syndrome, we could design something faster, safer, overall simpler, and easier to program.
Every decent engineer and CS person knows this. But it's unlikely to happen, for two reasons.
One is that doing it while avoiding second system syndrome takes teams with a huge amount of both expertise and discipline. That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard.
The second is that there isn’t strong demand. What we have is good enough for what most of the market wants, and right now all the demand for new architecture work is in the GPU/NPU/TPU space for AI. Nobody is interested in messing with the foundation when all the action is there. The CPU in that world is just a job manager for the AI tensor math machine.
Quantum computing will be similar. QC will be controlled by conventional machines, making the latter boring.
We may be past the window where rethinking architectural choices is possible. If you told me we still had Unix in 2000 years I would consider it plausible.
Aerospace, automotive, and medical devices represent a strong demand. They sometimes use and run really interesting stuff, thanks to the absence of such strong backwards-compatibility demands and the very high cost of software malfunction. Your onboard engine control system can run an OS based on seL4 with software written using Ada SPARK, or something. Nobody would bat an eye; nobody needs to run 20-year-old third-party software on it.
I don’t think these devices represent a demand in the same way at all. Secure boot firmware is another “demand” here that’s not really a demand.
All of these things, generally speaking, run unified, trusted applications, so there is no need for dynamic address-space protection mechanisms or "OS-level" safety. These systems can easily ban dynamic allocation, statically precompute all input sizes, and, given enough effort, can mostly be statically proven correct thanks to the constrained input and output space (see the sketch below).
Or, to make this thesis more concise: I believe that OS and architecture level memory safety (object model addressing, CHERI, pointer tagging, etc.) is only necessary when the application space is not constrained. Once the application space is fully constrained you are better off fixing the application (SPARK is actually a great example in this direction).
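To make the "ban dynamic allocation, precompute all input sizes" point concrete, here is a minimal sketch in C of what such a constrained, unified application tends to look like. It's illustrative only: the names, sizes, and the averaging filter are made up, not taken from any real control system.

    /* Illustrative only: a fixed-topology control task with no heap.
     * Every buffer is sized at compile time, so worst-case memory use
     * is known statically and there is no allocator to fail at runtime. */
    #include <stdint.h>
    #include <stddef.h>

    #define SENSOR_COUNT 8u   /* fixed by the hardware design */
    #define FILTER_TAPS  4u

    typedef struct {
        int32_t history[SENSOR_COUNT][FILTER_TAPS]; /* statically sized state */
        size_t  head;
    } filter_state_t;

    static filter_state_t g_filter;  /* static storage; no malloc anywhere */

    /* Bounded loop, no recursion: amenable to WCET and static analysis. */
    static int32_t filter_sample(filter_state_t *f, size_t sensor, int32_t raw)
    {
        f->history[sensor][f->head % FILTER_TAPS] = raw;
        int64_t acc = 0;
        for (size_t i = 0; i < FILTER_TAPS; i++)
            acc += f->history[sensor][i];
        return (int32_t)(acc / (int64_t)FILTER_TAPS);
    }

    void control_step(const int32_t raw[SENSOR_COUNT], int32_t out[SENSOR_COUNT])
    {
        for (size_t s = 0; s < SENSOR_COUNT; s++)
            out[s] = filter_sample(&g_filter, s, raw[s]);
        g_filter.head++;
    }

Because every size is a compile-time constant and every loop bound is fixed, tools can bound memory, stack depth, and execution time, which is exactly the property that makes OS-level protection feel redundant here.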
Mobile phones are the demand and where we see the research and development happening. They're walled off enough to throw away some backwards compatibility and cross-compatibility, but they still have to run multiple applications that are not statically analyzed and are untrusted by default. And indeed, this is where we see object-store-style / address-space-unflattening mitigations like pointer tagging come into play.
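For readers who haven't run into it, here is a toy C illustration of the pointer-tagging idea itself: a pointer that carries metadata beyond a raw offset. This is a software-only sketch, not how Arm MTE actually works (hardware schemes check tags in the load/store path and tag the memory side as well); it just assumes a 64-bit platform where user addresses fit below bit 56.

    /* Toy sketch of pointer tagging: stash a small tag in the otherwise
     * unused top byte of a 64-bit pointer and check it before use, so a
     * stale or forged pointer with the wrong tag traps instead of being
     * silently dereferenced. Not real MTE; illustration only. */
    #include <stdint.h>
    #include <assert.h>

    #define TAG_SHIFT 56
    #define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)

    static inline void *tag_ptr(void *p, uint8_t tag)
    {
        return (void *)(((uintptr_t)p & ADDR_MASK) | ((uintptr_t)tag << TAG_SHIFT));
    }

    static inline void *check_and_strip(void *p, uint8_t expected)
    {
        uintptr_t bits = (uintptr_t)p;
        assert((uint8_t)(bits >> TAG_SHIFT) == expected); /* tag mismatch: trap */
        return (void *)(bits & ADDR_MASK);
    }

The point is only that the address space stops being a flat, forgeable line of integers: every pointer also says something about what it is allowed to refer to.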
In a way, I agree. If you can verify the entire system throughout, you can remove certain runtime checks, such as the separation between the OS and tasks. If you have only one program to run, you can use a unikernel.
I suspect that car / aircraft / spacecraft computers, specifically, receive regular updates, and that these updates change the smallest part they can. So they have separate programs / services running on top of a more general OS. The principle of defense in depth requires that each component be hardened separately, to minimize the blast radius if a bug slips in.
> we could design something faster, safer, overall simpler, and easier to program
I do remain doubtful about this for general-purpose computing: hardware for low latency / high throughput is at odds with full security (the absence of observable side channels). Optimal latency/throughput requires time-constrained hardware programming with FPGAs or building custom hardware (high cost), usually programmed on dedicated hardware/software or via things like system-bypass solutions. And simplicity is at odds with generality; compare weak/strong formal systems vs strong/weak semantics.
If you factor those compromises in, you'll end up with the current state plus historical mistakes like the missing vertical system integration of software stacks above kernel space as the TCB, bad APIs due to missing formalization, CHERI with its current shortcomings, etc.
I do expect things to change once security with a mandatory security processor becomes more of a requirement, leading to multi-CPU solutions and the potential for developers to use both complex and simple CPUs on the same system, meaning roughly time-accurate virtual and/or real ones.
> The second is that there isn’t strong demand.
This is not true for virtualization and security use cases, though that's not obvious yet because widespread attacks are still missing; see the side-channel leaks of cloud solutions. Take a look at the growth of hardware security module vendors.
> That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard.
You need to make a product that outperforms your competitors. If their chip is faster, then your work will be ignored regardless of how pure you managed to keep it.
> We may be past the window where rethinking architectural choices is possible.
I think your presumption that our architectures are extremely sub-optimal is wrong. They're exceptionally optimized. Just spend some time thinking about branch prediction and register renaming. It's a steep cliff for any new entrant. You not only have to produce something novel and worthwhile but you have to incorporate decades of deep knowledge into the core of your product, and you have to do all of that without introducing any hardware bugs.
You stand on the shoulders of giants and complain about the style of their footwear.
That’s another reason current designs are probably locked in. It’s called being stuck at a local maximum.
I’m not saying what we have is bad, just that the benefit of hindsight reveals some things.
Computing is tougher than other areas of engineering when it comes to greenfielding due to the extreme interlocking lock-in effects that emerge from things like instruction set and API compatibility. It’s easier to greenfield, say, an engine or an aircraft design, since doing so does not break compatibility with everything. If aviation were like computing, coffee mugs from propeller aircraft would fail to hold coffee (or even be mugs) on a jet aircraft.
Aviation does have a lot of backwards compatibility problems. It's one reason Boeing kept revving the 737 to make the Max version. The constraints come from things like training, certification, runway length, fuel mixes, radio protocols, regulations...
> something faster
How true is this, really? When does the OS kernel take up more than a percent or so of a machine's resources nowadays? I think the problem is that there is so little juice there to squeeze that it's not worth the huge effort.
The problem isn't direct overhead. The problem is shit APIs: blocking I/O that we constantly have to work around via heroic extensions like io_uring (sketched below), an inefficient threading model that forces every app to roll its own scheduler (async, etc.), a lack of OS-level support for advanced memory management that would be faster than doing it in user space, and so on.
Look behind the curtains, and the space for improvement over the UNIX model is enormous. Our only saving grace is that computers have gotten ridiculously fast.
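To make the blocking-I/O complaint concrete, here is a rough sketch contrasting a classic blocking read() with the same read issued through io_uring via liburing (Linux-only, error handling mostly trimmed, and the file path is arbitrary). The point isn't line count: the second form lets a single thread keep many operations in flight instead of parking on each one.

    /* Sketch: blocking read vs. the same read via io_uring (liburing).
     * Build on Linux with liburing installed: gcc demo.c -luring */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
        if (fd < 0) return 1;

        /* 1. The POSIX way: the calling thread blocks until data arrives. */
        ssize_t n = read(fd, buf, sizeof(buf));
        printf("blocking read: %zd bytes\n", n);

        /* 2. The io_uring way: queue the read, submit, and pick up the
         *    completion later; many such reads can be outstanding at once. */
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);   /* we only wait here for the demo */
        printf("io_uring read: %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }

Which rather proves the comment's point: the asynchronous path exists, but it sits beside the original blocking interfaces as an extension rather than being the native model.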
The thing about AI, though, is that it has indirect effects down the line. E.g. as the prevalence of AI-generated code increases, I would argue that we'll need more guardrails, both in development (to ground the model) and at runtime (to ensure that when it still fails, the outcome is not catastrophic).