Comment by d_silin

20 hours ago

In aviation safety, there is a concept called the "Swiss cheese" model, where each successive layer of safety may not be 100% perfect, but has a different set of holes, so overlapping layers create a net gain in safety metrics.

One can treat current LLMs as a layer of "cheese" for any software development or deployment pipeline, so the goal of adding them should be an improvement in a measurable metric (code quality, uptime, development cost, successful transactions, etc.).

Of course, one has to understand the chosen LLM's behaviour for each specific scenario - is it like Swiss cheese (a small number of large holes) or more like Havarti (a large number of small holes) - and treat it accordingly.

LLMs are very good at first-pass PR checks, for example. They catch the silly stuff actual humans just miss sometimes: typos, copy-paste mistakes, etc.

Before any human is pinged about a PR, have a properly tuned LLM look at it first so actual people don't have to waste their time pointing out typos in log messages.
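
For what it's worth, here is a minimal sketch of what that first-pass step could look like. It is my own illustration, not a description of any existing pipeline: it assumes an OpenAI-compatible Python client, an API key in the environment, and a CI job that runs before reviewers are assigned; the model name, prompt, and diff range are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch of a first-pass LLM PR check (illustrative only).

Assumptions: the `openai` Python package is installed, OPENAI_API_KEY is set,
and CI pipes the PR's diff through this script before pinging a human.
"""
import subprocess
import sys

from openai import OpenAI

REVIEW_PROMPT = (
    "You are a first-pass PR reviewer. Flag only mechanical problems: "
    "typos, copy-paste mistakes, misleading log messages, dead code. "
    "Do not comment on design. Reply 'LGTM' if nothing is found."
)

def main() -> int:
    # Diff against the target branch; a real CI system would normally
    # provide the correct base/head range itself.
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not diff.strip():
        return 0

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": diff[:100_000]},  # crude size cap
        ],
    )
    print(resp.choices[0].message.content or "")
    # Advisory only: never block the merge on the model's opinion; it is
    # just one more slice of cheese in front of the human reviewer.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```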

Interesting concept, but as of now we don't apply these technologies as a new compounding layer. We are not using them after the fact, once we have constructed the initial solution. We are not ingesting the code to compare it against specs. We are not using them to curate and analyze current hand-written tests (prompt: is this test any good? assistant: it is hot garbage, you are asserting that the expected result equals your mocked result). We are not really at this phase yet. Not in general, not intelligently. But when the "safe and effective" crowd leaves the technology, we will find good use cases for it, I am certain (unlike UML, VB and Delphi).
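
To make that test-curation idea concrete, here is a small sketch of the sort of check being described. Same assumed OpenAI-compatible client as the PR-check sketch above; the prompt wording, model name, and file handling are illustrative only.

```python
"""Minimal sketch: ask a model whether a hand-written test actually
checks anything, or just asserts its own mock back at itself.
Illustrative only; client, model name, and prompt are assumptions."""
import pathlib
import sys

from openai import OpenAI

CRITIC_PROMPT = (
    "Review this unit test. Is it any good? In particular, say whether "
    "the expected result is just the mocked result asserted back at "
    "itself, and whether any real behaviour is exercised."
)

def review_test(path: str) -> str:
    source = pathlib.Path(path).read_text()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": source},
        ],
    )
    return resp.choices[0].message.content or ""

if __name__ == "__main__":
    # Usage: python review_test.py tests/test_invoices.py
    print(review_test(sys.argv[1]))
```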

LLMs are Kraft Singles: stuff that only kind of looks like cheese. Once you know it's in there, someone has to inspect, and sign off on, the entire wheel for any credible semblance of safety.

  • How sure are you that an LLM won't be better at reviewing code for safety than most humans, and eventually most experts?

    • It will only get better at generating random slop and other crap. Maybe at helping morons who are unable to eat and breathe without consulting the "helpful assistant".

> One can treat current LLMs as a layer of "cheese" for any software development or deployment pipeline

It's another interesting attempt at normalising the bullshit output by LLMs, but NO. Even with the enshittified Boeing, the aviation industry's safety and reliability records are far, far, far above those of deterministic software (itself known for plenty of unreliability), and deterministic B2C software is to LLMs what Boeing and Airbus software and hardware reliability are to B2C software... So you cannot even begin to apply aviation-industry paradigms to the shit machines, please.

  • I understand the frustration, but factually that is not true.

    Engines are reliable to about one anomaly per million flight hours or so; current flight software is more reliable, on the order of one fault per billion hours. In-flight engine shutdowns are fairly common, while major software anomalies are much rarer.

    I have used LLMs for coding and troubleshooting, and while they can definitely "hit" and "miss", they don't only "miss".

    • I was actually comparing aviation HW+SW vs. consumer software... and making the point that an old C++ invoice-processing application, while being way less reliable than aviation HW or SW, is still orders of magnitude more reliable than LLMs. The LLMs don't always miss, true... but they miss far too often for the "hit" part to be relevant at all.
