Comment by jcgrillo
1 year ago
It's amazing that's legal. Like, why do we accept software that does this? It can be done in such a way that these things don't happen. Put another way, why aren't the companies involved being fined and sued out of business? Why aren't their managers facing criminal negligence charges? It's outrageous.
Because there has never been a single commercial jetliner fatality caused by software in its intended operational domain failing to operate according to specification. That makes the commercial jetliner software development and deployment process by far the safest and highest reliability ever conceived by multiple orders of magnitude. We are talking in the 10-12 9s range.
And just to get ahead of “Well, what about the 737 MAX?”: that was a system specification error, not “buggy” software failing to conform to its specification. The software did what it was supposed to do, but it should not have been designed to do that given the characteristics of the plane and the safety process around its usage.
>“Well what about the 737 MAX”, that was a system specification error, not due to “buggy” software failing to conform to its specification. The software did what it was supposed to do
Exactly: the system was designed to fly the plane into the ground if a single sensor failed, and that's exactly what the software did. Boeing really thought this system specification was a good idea.
That is a massive oversimplification, and it invites patently false characterizations, like it being a "stupid mistake" that would have been fixed if they were not stupid (i.e., had adopted an average development process). That is absolutely not the case. They were really capable, but aerospace problems are really, really hard, and their safety capability had regressed from really, really capable to merely really capable.
They modified the flight characteristics of the system. They tuned the control scheme to provide the "same" outputs as the old system. However, the tuning relied on a sensor that was not previously safety-critical. Because the sensor was not previously safety-critical, it was not subject to safety-critical requirements, like having at least two redundant copies, as would normally be required. They failed to identify that the sensor had become safety-critical and should thus be subject to such requirements. They sold configurations with redundant copies, which most high-end airlines purchased, but because of their oversight they failed to make redundancy mandatory, and some purchasers decided to cheap out on sensors, since these were characterized as non-safety-critical even if they were useful and valuable. The manual, which pilots actually read, has instructions on how to disable the automatic tuning and enable redundant control systems, and such procedures were correctly deployed at least once, if not multiple times, to avert crashes at premier airlines. Only a combination of all of those failures occurring simultaneously caused fatalities to occur at a rate nearly comparable to driving the same distance. How horrifying!
An error in UX tuning dependent on a sensor that was not made properly redundant was the "cause". That is not a "stupid mistake". That is a really hard mistake, and downplaying it as a stupid mistake underestimates the challenges involved in designing these systems. That does not excuse their mistake: they used to do better, much better, like 1,000x better, and we know how to do better, and the better way is empirically economical. But it does the entire debacle a disservice to claim it was just "being stupid". It was not. It was only qualifying for the Olympics when they needed to get the gold medal.
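To make the redundancy point concrete, here is a minimal sketch of the kind of cross-check a safety-critical input is normally expected to get. Everything in it is hypothetical (the function names, the 5-degree threshold, the units); it illustrates the general idea of comparing two redundant readings and is not a description of Boeing's actual logic.

```python
from typing import Optional

# Hypothetical disagreement threshold, chosen only for illustration.
DISAGREE_THRESHOLD_DEG = 5.0


def validated_reading(
    sensor_a_deg: float,
    sensor_b_deg: float,
    threshold_deg: float = DISAGREE_THRESHOLD_DEG,
) -> Optional[float]:
    """Return the mean of two redundant readings, or None if they disagree."""
    if abs(sensor_a_deg - sensor_b_deg) > threshold_deg:
        # The sensors disagree, so neither reading can be trusted.
        return None
    return (sensor_a_deg + sensor_b_deg) / 2.0


def automatic_adjustment_allowed(sensor_a_deg: float, sensor_b_deg: float) -> bool:
    """Permit automatic adjustment only when the redundant readings agree."""
    return validated_reading(sensor_a_deg, sensor_b_deg) is not None
```

With a single sensor there is nothing to compare against, which is why a sensor that quietly becomes safety-critical without picking up a redundancy requirement is such an easy thing to miss.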
So what should we make of these issues described in the article? When, not if, this kind of thing kills people, will it be a specification error? Will we blame it on maintenance? Surely it can't be the software's fault!
First of all, who got blamed for the 737 MAX? Boeing did. This is one of the few industries where the responsibility does not get easily sloughed off.
Second, 787s have been flying for ~13 years and ~4.5 million flights [1]. Assuming they were unaware of the problem for the majority of that time, their unknowing maintenance and usage processes avoided critical failures due to the stated problems for a tremendous number of flights. Now that they know about it and are issuing a directive to enhance their processes to handle the problem explicitly, we can assume it is even less likely to occur than before, and it was already empirically shown to be ludicrously unlikely. Suing someone into oblivion for an error that has never manifested as a serious failure, and that is exceedingly unlikely to manifest, is a little excessive.
Third, they should be remediating problems as they arise, balanced against the risks introduced by specification changes and against the alternative of other process modifications. Given Boeing’s other recent failings, they should be scrutinized strictly to confirm that they are faithfully following the traditional, highly effective remediation processes. It should only be worrisome if they are seeing disproportionately more problems than would be expected in an aircraft design of its age and are not remediating problems robustly and promptly.
[1] https://www.boeing.com/commercial/787#overview
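For a rough sense of what the ~4.5 million flights figure supports on its own, here is a back-of-the-envelope sketch. It assumes each flight is an independent trial and that zero critical failures from this cause were observed; both are simplifying assumptions, and the function name is made up for illustration. Under those assumptions, the statistical "rule of three" puts the 95% upper bound on the per-flight failure probability at about 3/n.

```python
import math


def failure_rate_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Upper bound on per-trial failure probability after zero failures.

    Solves (1 - p) ** trials = 1 - confidence for p, assuming the trials
    are independent and identically distributed.
    """
    # expm1 keeps precision when the resulting probability is tiny.
    return -math.expm1(math.log(1.0 - confidence) / trials)


flights = 4_500_000  # approximate figure quoted from [1]
exact_bound = failure_rate_upper_bound(flights)
rule_of_three = 3 / flights  # common approximation of the same bound

print(f"exact 95% bound: {exact_bound:.2e}")
print(f"rule of three:   {rule_of_three:.2e}")
```

Both come out to roughly 6.7e-7 per flight, i.e. about one in 1.5 million. That is only what the raw flight count can show; it says nothing about the process evidence behind the design.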
Because it works fine. A maintenance tech gets one extra line item on the weekly or monthly inspection checklist.
It works fine until it doesn't and people die. At which point the blame falls on the maintenance crew? That's wrong. And where there's smoke there's fire. If the software has this horrible bug, likely the broken culture that created it has written worse, more subtle bugs.
Commercial air travel in the US is incredibly safe. The last fatal crash was in 2009.
Because changes to that software go through an enormous amount of testing, validation, and documentation before a new baseline becomes a flashable item. Meanwhile, an always-working workaround is needed now.
Have you even found the documentation around things like ACPI? It's kinda coupled with UEFI these days, I think, and hell, I'm not even sure of the hardware boards/revisions aircraft makers are using these days... Are they still on BIOS? Or old-as-sin Linux/RTOS kernels/microcontrollers?
Point being, when you start talking about high QA systems, where the Quality is non-negotiable (you will have everything documented and tested); barring exec/managerial malfeasance in preventing that work from being done, you reach for the same simple things over and over again since it takes a hell of a lot of work to actually characterize and certify a thing to the requisite level of reliability/operating conditions.
Testing ain't free, ya know.