
Comment by tjr

6 years ago

When lives are on the line software should be tested for reliability beyond 51 days.

Avionics software is written in a world of verifiable requirements.

For how many days should the software be required to operate?

Is it acceptable to add that many [more] days to the software verification schedule in order to verifiably demonstrate that it works according to requirements?

Why is 51 days not long enough?

Taking a plane from design to commercial delivery takes years. I'm sure they can spare 2-3 months to do some long running tests. Especially if those can run in parallel with other fit-and-finish work unrelated to software.

  • So, your idea is that a plane comes with firmware when you buy it (say in 1985) and that's the only version forever? Every problem between 1985 and now, too bad, this passed QA back in 1985 and we're not changing anything? No.

    Airliners are very long-lived equipment. So in fact they ship new releases. New releases have features that may be really valuable to safety, as well as features that are nice quality of life improvements. They're not shipping once per hour like a web startup, or even once per day like the NT internal team, but they do need to ship more than "once per new model of aircraft".

    I've written before about an accident I spent a bunch of time looking at. No fatalities, just a smashed runway light but still reportable because of the "But for..." rationale. Two of the easiest things that would have prevented that from occurring were firmware tweaks. One was a recommended (but not mandatory) change in a newer build and the other exists only in Airbus planes so far.

    Specifically, the newer build does an OAT disagree check. If you tell the plane "It is -20°C outside" (so the automatic takeoff thrust is much lower), the plane considers the temperature sensor at the engine inlet and says to itself: this reads +15°C, which is 35 K different, and that's the difference between flying and crashing into the fence at the end of the runway. I disagree with your guess about the temperature, so I refuse to try to figure out what to do next. You can realise you entered it wrong and type a more realistic value in, or you can set the thrust yourself manually if my sensors are broken.
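    The check described above amounts to a simple plausibility test. A minimal sketch in C (the function name, signature, and the 10 K disagree threshold are all my assumptions for illustration, not real avionics code):

    ```c
    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Reject a pilot-entered outside air temperature (OAT) that disagrees
     * too far with the engine-inlet temperature sensor. The threshold is
     * a made-up illustrative value. */
    bool oat_plausible(double entered_oat_c, double inlet_sensor_c,
                       double max_disagree_k)
    {
        return fabs(entered_oat_c - inlet_sensor_c) <= max_disagree_k;
    }

    int main(void)
    {
        /* The scenario above: pilot enters -20 C, inlet sensor reads +15 C. */
        if (!oat_plausible(-20.0, 15.0, 10.0))
            printf("OAT DISAGREE: re-enter temperature or set thrust manually\n");
        return 0;
    }
    ```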

    The fancier Airbus approach was not to focus on the result of air temperature calculations. If the plane isn't accelerating enough, it can't fly; we don't care why it isn't accelerating - maybe the wheels are square - we need to abort takeoff so we don't crash. So teach the plane how long runways are: it can use GPS to figure out which runway it's using, and then it can tell pilots when they aren't getting enough acceleration. They'll abort, because they don't care why the acceleration is insufficient either; they don't want to die in a fireball.
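    The core of that idea can be sketched with basic kinematics: the distance needed to reach rotation speed at the current acceleration versus the runway remaining. Everything here (names, the constant-acceleration simplification) is my illustration, not the Airbus implementation:

    ```c
    #include <stdbool.h>
    #include <stdio.h>

    /* Distance (m) needed to accelerate from current speed v to rotation
     * speed vr (both m/s) at constant acceleration a (m/s^2),
     * from vr^2 = v^2 + 2*a*s. */
    static double distance_to_rotate_m(double v, double vr, double a)
    {
        return (vr * vr - v * v) / (2.0 * a);
    }

    /* Abort if, at the current acceleration, we can't reach vr in the
     * runway we have left - regardless of WHY acceleration is low. */
    bool should_abort(double v, double vr, double a, double runway_left_m)
    {
        if (a <= 0.0)
            return true; /* not accelerating at all */
        return distance_to_rotate_m(v, vr, a) > runway_left_m;
    }

    int main(void)
    {
        /* 50 m/s now, need 80 m/s, gaining 1 m/s^2, 1000 m left:
         * 1950 m would be needed, so abort. */
        printf(should_abort(50.0, 80.0, 1.0, 1000.0) ? "ABORT\n" : "continue\n");
        return 0;
    }
    ```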

    • Long-running tests don't mean firmware cannot be updated; the updates themselves will just need the same soak time. And with better upfront testing, updates shouldn't need to be as frequent.

    • > So in fact they ship new releases

      I had an A380 flight that was slightly delayed due to a “software update” taking longer than expected.

      It was at the SIN layover for QF1 LHR to SIN, so it was kind of worrying/amusing to have your plane need a software update halfway through your journey.

  • There are an infinite number of tests that could be performed.

    That this test could have been performed does not mean that all possible tests could have been performed.

    Which is really what we're talking about here.

    Is "able to run 51 days without reboot" a requirement, or not? If not, and it's not a use case, then it shouldn't have been tested for.

    Instead, the limited time and resources available should have been spent on more important things.

    • This just gets back to the issue though - since all this software is proprietary and closed, we have no visibility into how thoroughly this bug was characterised. We don't know whether there is a test out there confirming, after each patch, that this data corruption still occurs only after the expected amount of uptime.


  • 1. You talk as if aircraft were manufactured by laymen. Despite all the recent problems with Boeing, that's not the case. 2. Running a battery of formal proof tests is expensive and far more complicated than running a unit test suite for ordinary software. 3. Solving this issue probably requires more complexity, and where there is more complexity, there may be more risk.

    I'm not saying that this is even acceptable or a great trade-off, but the way you worded your comment is presumptuous.

    • We can't see what's in the box (since it's closed source), but I personally would be okay with this being a clearly laid out limitation, i.e. having a nice blinking red function comment saying "This integer will overflow if the system is up for more than 50 days, but due to hardware limitations we're unable to properly do X, Y & Z with a 64-bit integer on these subsystems."

      If this issue is clearly identified and tested around, that's alright; it isn't a huge deal to have to reboot periodically... I'm more concerned that this issue is one of those "Oh well, it just... gets a bit off after fifty days - try rebooting it, that seems to fix it" situations.
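      For scale: the classic failure mode behind roughly-50-day limits is a 32-bit millisecond uptime counter, which wraps after about 49.7 days. The real system's counter width and tick rate aren't public, so this is only an analogy:

      ```c
      #include <stdio.h>

      /* Days until a 32-bit millisecond tick counter wraps: 2^32 ms total. */
      double uint32_ms_wrap_days(void)
      {
          const double ms_per_day = 24.0 * 60.0 * 60.0 * 1000.0; /* 86,400,000 */
          return 4294967296.0 / ms_per_day;
      }

      int main(void)
      {
          printf("uint32 ms counter wraps after %.1f days\n", uint32_ms_wrap_days());
          return 0;
      }
      ```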

  • Indeed, they do perform long-running tests. Is 51 days not long enough? How many days would be long enough?