Comment by d_silin

20 hours ago

In aviation safety, there is a concept called the "Swiss cheese" model, where each successive layer of safety may not be 100% perfect, but has a different set of holes, so overlapping layers create a net gain in safety metrics.

One can treat current LLMs as a layer of "cheese" for any software development or deployment pipeline, so the goal of adding them should be an improvement in a measurable metric (code quality, uptime, development cost, successful transactions, etc.).

Of course, one has to understand the chosen LLM's behaviour for each specific scenario - is it like Swiss cheese (a small number of large holes) or more like Havarti (a large number of small holes) - and treat it accordingly.

LLMs are very good at first-pass PR checks, for example. They catch the silly stuff actual humans just miss sometimes: typos, copy-paste mistakes, etc.

Before any human is pinged about a PR, have a properly tuned LLM look at it first so actual people don't have to waste their time pointing out typos in log messages.
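
For what it's worth, here is a minimal sketch of what that first-pass step could look like. It is my own illustration, not a description of any existing pipeline: it assumes an OpenAI-compatible Python client, an API key in the environment, and a CI job that runs before reviewers are assigned; the model name, prompt, and diff range are placeholders.

```python
#!/usr/bin/env python3
"""Minimal sketch of a first-pass LLM PR check (illustrative only).

Assumptions: the `openai` Python package is installed, OPENAI_API_KEY is set,
and CI pipes the PR's diff through this script before pinging a human.
"""
import subprocess
import sys

from openai import OpenAI

REVIEW_PROMPT = (
    "You are a first-pass PR reviewer. Flag only mechanical problems: "
    "typos, copy-paste mistakes, misleading log messages, dead code. "
    "Do not comment on design. Reply 'LGTM' if nothing is found."
)

def main() -> int:
    # Diff against the target branch; a real CI system would normally
    # provide the correct base/head range itself.
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not diff.strip():
        return 0

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": diff[:100_000]},  # crude size cap
        ],
    )
    print(resp.choices[0].message.content or "")
    # Advisory only: never block the merge on the model's opinion; it is
    # just one more slice of cheese in front of the human reviewer.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```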

Interesting concept, but as of now we don't apply these technologies as a new compounding layer. We are not using them after the fact, once we have constructed the initial solution. We are not ingesting the code to compare it against specs. We are not using them to curate and analyze current hand-written tests (prompt: is this test any good? assistant: it is hot garbage, you are asserting that the expected result equals your mocked result). We are not really at this phase yet. Not in general, not intelligently. But when the "safe and effective" crowd leaves the technology, we will find good use cases for it, I am certain (unlike UML, VB and Delphi).
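
To make that test-curation idea concrete, here is a small sketch of the sort of check being described. Same assumed OpenAI-compatible client as the PR-check sketch above; the prompt wording, model name, and file handling are illustrative only.

```python
"""Minimal sketch: ask a model whether a hand-written test actually
checks anything, or just asserts its own mock back at itself.
Illustrative only; client, model name, and prompt are assumptions."""
import pathlib
import sys

from openai import OpenAI

CRITIC_PROMPT = (
    "Review this unit test. Is it any good? In particular, say whether "
    "the expected result is just the mocked result asserted back at "
    "itself, and whether any real behaviour is exercised."
)

def review_test(path: str) -> str:
    source = pathlib.Path(path).read_text()
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": source},
        ],
    )
    return resp.choices[0].message.content or ""

if __name__ == "__main__":
    # Usage: python review_test.py tests/test_invoices.py
    print(review_test(sys.argv[1]))
```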

LLMs are Kraft Singles: stuff that only kind of looks like cheese. Once you know it's in there, someone has to inspect, and sign off on, the entire wheel for any credible semblance of safety.

  • How sure are you that an LLM won't be better at reviewing code for safety than most humans, and eventually most experts?

    • It will only get better at generating random slop and other crap. Maybe at helping morons who are unable to eat and breathe without consulting the "helpful assistant".

> One can treat current LLMs as a layer of "cheese" for any software development or deployment pipeline

It's another interesting attempt at normalising the bullshit output by LLMs, but NO. Even with the enshittified Boeing, the aviation industry's safety and reliability records are far, far, far above those of deterministic software (itself known for plenty of unreliability), and deterministic B2C software is to LLMs what Boeing and Airbus software and hardware reliability are to B2C software... So you cannot even begin to apply aviation-industry paradigms to the shit machines, please.

  • I understand the frustration, but factually that is not true.

    Engines are reliable to about one anomaly per million flight hours or so; current flight software is more reliable, on the order of one fault per billion hours. In-flight engine shutdowns are fairly common, while major software anomalies are much rarer.

    I have used LLMs for coding and troubleshooting, and while they can definitely "hit" and "miss", they don't only "miss".

    • I was actually comparing aviation HW+SW vs. consumer software... and making the point that an old C++ invoice-processing application, while being way less reliable than aviation HW or SW, is still orders of magnitude more reliable than LLMs. The LLMs don't always miss, true... but they miss far too often for the "hit" part to be relevant at all.
