
Comment by tgma

2 days ago

It doesn't necessarily have to be brushed away as "brute force." We can, and do, build more reliable systems out of less reliable components. In fact, most industrial engineering accepts some defect rate and builds margins around it.

Software is no different. Even without AI, you already have buggy compilers and buggy OSes and buggy libraries. You just tend to accept the risk because you have some idea of what the failure modes are, and you can work around them or manage the risk in some other way (e.g., buy literal insurance).
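The "reliable from unreliable" argument can be put into rough numbers. Below is a minimal sketch, assuming every component fails independently with the same probability (real failures are often correlated, which is exactly where this breaks down). The function names are illustrative only, not from any library. A series chain needs every part to work, so it is weaker than any single part; a majority vote over redundant replicas (triple modular redundancy, in the 3-replica case) tolerates a minority of failures.

```python
from math import comb


def series_reliability(p: float, n: int) -> float:
    """Probability that a chain of n independent parts all work.

    Each part works with probability p; any single failure breaks the chain.
    """
    return p ** n


def majority_reliability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent replicas work,
    i.e. a majority vote over their outputs still gives the right answer.

    Assumes n is odd (no ties) and that failures are independent.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of replicas to avoid ties")
    need = n // 2 + 1  # smallest strict majority
    # Sum the binomial probabilities of exactly k working replicas, k >= need.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    p = 0.9  # hypothetical per-component reliability
    print(f"single part:       {p:.4f}")
    print(f"3 parts in series: {series_reliability(p, 3):.4f}")
    for n in (3, 5, 7):
        print(f"majority of {n}:     {majority_reliability(p, n):.4f}")
```

With p = 0.9, three parts in series come out at 0.729, while a 3-way majority vote comes out at 0.972. The vote only helps when each replica beats a coin flip (p > 0.5), and it does nothing against correlated failures, such as every replica running the same buggy code.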

> you already have buggy compilers and buggy OSes and buggy libraries.

Which run, I must add, on effectively infallible hardware. Most software straight up assumes that the CPU and RAM will function perfectly and doesn't even bother trying to detect failures: unless a failure manifests itself catastrophically, the show simply goes on.

So in effect, we also can, and do, build less reliable systems out of more reliable components, and that's how software is different.

  • I'm not sure I understood your point correctly. In one sense, you are hinting at another anecdote that proves my point: hardware failure (specifically, bit flips in non-ECC memory) is practically guaranteed to happen at scale, yet people are mostly okay with absorbing that risk. I feel you are overselling the hardware reliability story. Sure, we can build less reliable systems out of reliable components. That goes without saying, and no, it's definitely not software-specific. Almost by default, a composite system is less reliable than its primitives (a simple example is nailing two pieces of wood together) unless specific care is taken to build in guardrails or redundancy. The point, however, is that building more reliable systems out of less reliable components is possible, and there is vast precedent for it.

  • > Which run, I must add, on effectively infallible hardware.

    Keyword: effectively. Hardware is also built on top of components with error tolerances. What do you think ECC memory is for? Or why chips have "yields", with defective parts "turned off" before shipping? Or why thermal throttling happens? Or CPU clocks, whose jitter has to be corrected? There are tons of other examples, all the way down to transistors and capacitors.

    And let's not even get into HDDs and SSDs.

    • My point is that when you go from hardware to software, there is a noticeable change in attitude and a corresponding drop in reliability. Inside hardware, a lot of effort goes into detecting and correcting glitches and providing reliably correct behaviour; in software, the most common assumption is that the hardware will just work as intended. How many filesystems store checksums, even back in the days when HDDs were far less reliable? And ECC exists in no small part because writing software that is reliable in the face of memory corruption (or that can at least detect such corruption), while possible, is hard, costly, and (obviously) produces less performant software.

      Anyway, the larger point of my comment is that when we go from traditional software to AI, a) there seems to be an even bigger drop in reliability, b) attitudes are even more cavalier, and c) software people largely lack the discipline and knowledge required to build more reliable systems from less reliable ones, because they have spent most of their careers doing exactly the opposite: building software they know will behave incorrectly even on perfect hardware, and being fine with it. "You just tend to accept the risk", my ass. Yes, people do this when it comes to software, but when it comes to hardware... remember when Intel messed up the division algorithm inside its chips (the Pentium FDIV bug)?


  • You should talk to an electrical engineer or materials scientist about how reliable transistors emerge from noisy voltages in wires.
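On the question above about filesystems storing checksums: a checksum buys detection, not correction. A minimal sketch follows, assuming a toy in-memory block store; `ChecksummedStore`, `ChecksumError`, and their methods are hypothetical and not any real filesystem's API. Each block's CRC32 is stored alongside it, so a silent bit flip surfaces as an error on read instead of being returned as bad data.

```python
import zlib


class ChecksumError(Exception):
    """Raised when a block's contents no longer match its stored checksum."""


class ChecksummedStore:
    """Toy block store (hypothetical): keeps a CRC32 next to every block."""

    def __init__(self) -> None:
        self._blocks: dict[int, bytearray] = {}
        self._crcs: dict[int, int] = {}

    def write(self, block_id: int, data: bytes) -> None:
        # Store the data and its checksum together at write time.
        self._blocks[block_id] = bytearray(data)
        self._crcs[block_id] = zlib.crc32(data)

    def read(self, block_id: int) -> bytes:
        # Recompute the checksum on every read and refuse mismatched data.
        data = bytes(self._blocks[block_id])
        if zlib.crc32(data) != self._crcs[block_id]:
            raise ChecksumError(f"block {block_id} is corrupted")
        return data

    def flip_bit(self, block_id: int, bit: int) -> None:
        """Simulate a silent single-bit flip in storage (checksum untouched)."""
        self._blocks[block_id][bit // 8] ^= 1 << (bit % 8)


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write(0, b"hello, world")
    print(store.read(0))
    store.flip_bit(0, 13)  # corrupt one bit behind the store's back
    try:
        store.read(0)
    except ChecksumError as err:
        print("detected:", err)
```

Correcting the block, rather than just refusing to return it, would additionally need redundancy, such as a mirrored copy or an error-correcting code of the kind ECC memory uses.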