Comment by chrischattin

6 years ago

Commercial pilot here. Can confirm that "turn it off and back on again" works for troubleshooting avionics issues.

But seriously, this is clickbait and there's nothing to see here. Many things on the aircraft are checked, cycled, etc. before every flight, let alone on a 51-day mx schedule.

> Can confirm that "turn it off and back on again" works for troubleshooting avionics issues.

Have you ever had to do this mid-flight?

  • I worked in aviation for a while. This is super common. There isn't a pilot on the planet who hasn't turned avionics off in flight (there are always redundant and backup systems). There probably isn't a working pilot in the world who hasn't had to cycle a circuit breaker in flight this month.

    Edit: Well, if this was a normal month.

  • Oh yeah. Many times. But, it's not as scary as it sounds. There are multiple redundancies and backup systems. So, you can cycle something on/off without touching the other systems. It's often a step in abnormal procedures checklists.

    This article is turning a routine checklist/maintenance item into scary sounding clickbait.

Sure, but to me this sounds scary.

If I knew it were the classic "(2^32 - 1) ms" overflow, which works out to 49 days, 17 hours, 2 minutes, 47.3 seconds (milliseconds stored in an unsigned 32-bit integer), then I'd be at ease. But 51 days doesn't say anything to me.

I just hope that they know why it's max 51 days.

  • Instead of 1 ms, try running the math with a tick rate of 1.024 ms (1024 µs) instead.

    These things are often driven from a 32.768 kHz watch crystal, in case anyone's wondering why the tick isn't a nice even 1.000 ms.

    On another forum, way back, I wrote about debugging a huge system that reset after 248-odd days of uptime. Yep, run the math...
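The rollover arithmetic in this subthread is easy to check; a quick sketch (the 10 ms-tick, 31-usable-bit reading of the 248-day anecdote is my guess, not something the commenter states):

```python
def wrap_days(tick_ms: float, bits: int = 32) -> float:
    """Days of uptime before a tick counter of the given bit width wraps."""
    return (2 ** bits) * tick_ms / 1000 / 86400  # ticks -> ms -> s -> days

print(f"{wrap_days(1.0):.1f} days")       # 1.000 ms tick: ~49.7 days, the classic millis() rollover
print(f"{wrap_days(1.024):.1f} days")     # 1.024 ms tick: ~50.9 days, i.e. the reported ~51
print(f"{wrap_days(10.0, 31):.1f} days")  # 10 ms tick, 31 usable bits: ~248.6 days
```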

That, and even if it were a real problem, it would just be one more item on a very rigorous checklist anyway, right?

That's fine, and I'm sure it won't be a huge maintenance problem, but it suggests the underlying software is such a mess that they can't adequately fix even a simple issue.

In software, it's what we call an "ugly hack." An ugly hack meant that the 737 MAX's MCAS relied on a single sensor instead of both, and people died. An ugly hack meant that the Ariane 5 rocket was destroyed in mid-air.

Ugly hacks should not be a part of any project where lives are at stake.

  • No, it indicates that the problem domain is dangerous enough that the risk of leaving the bug in place must be balanced against the risk of the fix introducing a different, unknown error. There were ~500,000 787 flights in 2017, with an average of 200 people per flight. The 737 MAX crashes resulted in 346 fatalities, so if a fix had a 1 in 250,000 chance of introducing a different error that could result in a fatal crash, then it would be worse than the 737 MAX problems. Do you have confidence that the systems you have worked on have processes in place to guarantee less than a 1 in 250,000 chance of a fix introducing another error? If not, are you aware of any organization whose development practices you know first-hand and that you are confident could give such a guarantee? That is the risk analysis that must be done when shipping a fix.

    To be fair, this is somewhat of an overstatement of the requirements, since not all systems are critical and not all errors cause critical problems. In addition, the risk must be balanced against the alternative, in this case the risk entailed by making sure a reboot is done every 51 days, so you would need to analyze the failure probability and possible consequences of the status quo and compare that against the possible error modes of a software fix.

    As an addendum to the risk analysis: the above covered only one year of flights, on a per-flight basis. If you expect the 787 to fly for ~30 years, then the fix must not cause two crashes over those 30 years, i.e. a 1 in 7,500,000 chance per flight. The average flight is ~5,000 km, which is ~4-5 hours, for a total fleet time of ~60,000,000 hours. A plane takes ~3 minutes to fall from cruising altitude, so we can tolerate at most 6 minutes of fleet downtime per 60,000,000 hours, which is 1 in 600,000,000 downtime. That is 99.9999998% uptime: eight nines, ~6,000x the holy grail of five-nines availability in the cloud industry, and ~60,000x the availability guaranteed by the AWS SLA (again, somewhat of an overstatement, since you would need correlated failures adding up to ~3 minutes of continuous-ish failure, and that depends on mean-time-between-failures and mean-time-to-resolution figures I don't have access to).

    • > .. if a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it would be worse than the 737-MAX problems.

      That's an improper calculation. It's a slight nuance, but it makes orders of magnitude of difference. A better approximation is:

      If a fix had a 1 in 250,000 chance of causing a different error that would result in a fatal crash then it would be worse than the 737-MAX problems.

      And conversely:

      If a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash then it could be worse than the 737-MAX problems.

      The MAX's MCAS generated errors far more often than 1 in 250k. A few of them resulted in a crash.


    • Well, what I mean is that perhaps such critical software should be carefully rewritten with rigorous architectural and QA standards so they don't have to use an ugly hack in the first place.

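The fleet-level numbers in the comment above (~500,000 flights a year, ~4 hours a flight, a two-crash budget over 30 years) can be reproduced in a few lines; a back-of-envelope sketch using those same assumed figures, which are the commenter's estimates rather than verified fleet data:

```python
flights_per_year = 500_000   # assumed 787 flights in 2017
years = 30                   # assumed service life
hours_per_flight = 4         # ~5,000 km per flight at cruise speed
crash_budget = 2             # "must not cause two crashes over 30 years"

total_flights = flights_per_year * years          # 15,000,000 flights
crash_odds = total_flights / crash_budget         # 1 in 7,500,000 per flight
total_hours = total_flights * hours_per_flight    # 60,000,000 fleet hours
downtime_hours = crash_budget * 3 / 60            # two 3-minute falls = 0.1 h
downtime_fraction = downtime_hours / total_hours  # 1 in 600,000,000

print(f"crash budget: 1 in {crash_odds:,.0f} flights")
print(f"downtime budget: 1 in {1 / downtime_fraction:,.0f}")
print(f"required uptime: {1 - downtime_fraction:.8%}")
print(f"vs five-nines availability: {1e-5 / downtime_fraction:,.0f}x stricter")
```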

I'm sorry to hear that people have become so accustomed to fixing failures of the mind (which is what these defects are) with reboots.

It takes a certain type of person to fly a plane, and resilience in the face of unknowns and discipline in following checklists are among the qualities they have.

In the same vein, these kinds of failures have been known to programmers for decades, just as metallurgists are aware of metal fatigue _and plan for it_.

The failure of software professionals to plan for and mitigate this kind of foreseeable problem is inexcusable, and I liken them to incompetent metallurgists in an alternate universe who brushed off the de Havilland Comet crashes as if there were nothing to learn from them.

  • "I'm sorry to hear that people have become so accustomed to fixing failures of the mind (which these defects are) with reboots."

    That's nothing, I fix "failures of the mind" by shutting off my brain for eight hours every day. That's a reboot!

  • I don't know why this is downvoted.

    Yes, software bugs happen, but they are fixing this with documentation rather than root-causing the problem.

    That should worry people, because until it has been root-caused, the actual implications are unknown. For all we know, it's a symptom of a bug that will cause a more severe problem somewhere else.

    • Do you have a source for the claim that they haven't identified the root cause before determining that a regular reboot is a suitable mitigation?


    • There is nothing like the truth rubbing people the wrong way.

      The off-the-charts arrogance of programmers enables the creation of unbelievable tools, and at the same time enslaves the person to their blind ego.