Boeing 787s must be turned off and on every 51 days to prevent 'misleading data'

6 years ago (theregister.co.uk)

Commercial pilot here. Can confirm that turn it off and back on again works for troubleshooting avionics issues.

But seriously, this is clickbait and nothing to see here. Many things on the aircraft are checked, cycled, etc before every flight, let alone on a 51 day mx schedule.

  • > Can confirm that turn it off and back on again works for troubleshooting avionics issues.

    Have you ever had to do this mid-flight?

    • I worked in aviation for a while. This is super common. There isn't a pilot on the planet who hasn't turned avionics off in flight (there are always redundant and backup systems). There probably isn't a working pilot in the world that hasn't had to cycle a circuit breaker in flight this month.

      Edit: Well, if this was a normal month.

    • Oh yeah. Many times. But, it's not as scary as it sounds. There are multiple redundancies and backup systems. So, you can cycle something on/off without touching the other systems. It's often a step in abnormal procedures checklists.

      This article is turning a routine checklist/maintenance item into scary sounding clickbait.

  • Sure, but to me this sounds scary.

    If I knew it were a "(2^32 - 1) ms" limit -> "49 days 17 hours 2 minutes 47.3 seconds" (milliseconds stored in an unsigned 32-bit integer), then I'd be at ease, but 51 days doesn't say anything to me.

    I just hope that they know why it's max 51 days.

    • Instead of 1 ms, try running the math with a tick rate of 1.024 ms (1024 µs).

      These things are often driven by a 32.768 kHz crystal, in case anyone's wondering why not just a nice even 1.000 ms.

      On another forum, way back, I talked about debugging a huge system that reset after 248.x days of uptime. Yep, run the math...
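The rollover figures in this subthread are easy to verify. A quick sketch (plain Python, assuming a 32-bit tick counter):

```python
# Days until a 32-bit tick counter wraps, for a given tick period.
def rollover_days(tick_seconds: float) -> float:
    return (2**32) * tick_seconds / 86_400  # 86,400 seconds per day

print(rollover_days(0.001))     # 1 ms tick     -> ~49.71 days (the classic Win95 figure)
print(rollover_days(0.001024))  # 1.024 ms tick -> ~50.90 days, suspiciously close to 51

# A signed 32-bit counter with a 10 ms tick wraps at 2^31 ticks:
print(2**31 * 0.010 / 86_400)   # -> ~248.55 days, matching the "248.x days" anecdote
```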

  • That and even if it was a real problem it would just be one more thing on a very rigorous checklist anyway, right?

  • That's fine, and I'm sure it won't be a huge maintenance problem, but it indicates the underlying software is such a mess that they can't even adequately fix a simple issue.

    In software, it's what we call an "ugly hack." An "ugly hack" meant that 737s didn't rely on both sensors, and people died. An ugly hack meant that the Ariane 5 rocket exploded in mid-air.

    Ugly hacks should not be a part of any project where lives are at stake.

    • No, it indicates that the problem domain is sufficiently dangerous that the risk of leaving the bug in place must be balanced against the risk of the fix causing a different, unknown error. There were ~500,000 787 flights in 2017 with an average of 200 people per flight. The 737-MAX resulted in 385 fatalities, so if a fix had a 1 in 250,000 chance of causing a different error that could result in a fatal crash, then it would be worse than the 737-MAX problems. Do you have confidence that systems you have worked on have processes in place to guarantee that there is less than a 1 in 250,000 chance that a fix would cause another error? If not, are you aware of any organization whose development practices you know first-hand and that you are confident could give such a guarantee? That is the risk analysis that must be done when making a fix.

      To be fair, this is somewhat of an over-exaggeration of the requirements since not all systems are critical and not all errors cause critical problems. In addition, the risk must be balanced against the alternative, in this case the risk caused by making sure a reboot is done every 51 days, so you would need to do an analysis of the failure probability and possible consequences of the status quo and compare that against the possible error modes of a software fix.

      As an addendum to the risk analysis, the above analysis was only for one year error and on a per-flight basis. If you expect the 787 to fly for ~30 years then the fix must not cause two crashes over 30 years so a 1 in 7,500,000 chance. The average flight is ~5,000 KM which is ~4-5 hours per flight for a total flight time of ~60,000,000 hours. A plane takes ~3 minutes to fall from cruising altitude, so we need fleet downtime of 6 minutes per 60,000,000 hours which is 1 in 600,000,000 downtime. That is 99.9999998% uptime, 8 9s, 6,000x the holy grail of 5 9's availability in the cloud industry, 60,000x the availability guaranteed by the AWS SLA (again, somewhat of an over-exaggeration since you need correlated failures to amount to 3 minutes of continuous-ish failure, but that depends on an analysis of mean-failure time and mean-time-to-resolution which I do not have access to).

      5 replies →
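For what it's worth, the availability figures in the comment above reproduce with a few lines (plain Python; all inputs are that comment's own round estimates, not verified fleet data):

```python
# Rough fleet-lifetime availability budget from the parent comment's numbers.
flights_per_year = 500_000
years = 30
hours_per_flight = 4                  # the parent's ~4-5 hour estimate, low end
total_hours = flights_per_year * years * hours_per_flight
print(total_hours)                    # -> 60000000 fleet flight hours

# Two crashes * ~3 minutes of fall time = 6 minutes of "downtime" budget.
downtime_fraction = 6 / (total_hours * 60)
print(round(1 / downtime_fraction))   # -> 600000000, i.e. 1 in 600,000,000
print(f"{(1 - downtime_fraction) * 100:.7f}%")  # -> 99.9999998% required uptime
```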

  • I'm sorry to hear that people have become so accustomed to fixing failures of the mind (which these defects are) with reboots.

    It takes a certain type of person to fly a plane; resilience in the face of unknowns and disciplined adherence to checklists are some of the qualities they have.

    In the same vein, these kinds of failures have been known to programmers for decades, just like metallurgists are aware of metal fatigue _and plan for it_.

    Failure of software professionals to plan for and mitigate these kinds of foreseeable problems is inexcusable, and I liken them to incompetent metallurgists in an alternate universe who brushed off the De Havilland Comet failures as if there were nothing to learn from them.

    • "I'm sorry to hear that people have become so accustomed to fixing failures of the mind (which these defects are) with reboots."

      That's nothing, I fix "failures of the mind" by shutting off my brain for eight hours every day. That's a reboot!

    • I don't know why this is downvoted.

      Yes, software bugs happen, but they are fixing this with documentation rather than root-causing the problem.

      That should worry people, because until it's been root-caused, the actual implications are unknown. For all we know, it's a symptom of a bug that will cause a more severe problem somewhere else.

      3 replies →

I am going out on a limb here but I seem to remember reading somewhere that airliners do have maintenance schedules that are very strictly kept, for obvious reasons. If the maintenance schedule is N days, then any news article pointing out how amusing it is that an airliner needs to be rebooted every >N days is at best sensationalism, at worst pure fearmongering.

I don't know for a fact that this is the case here for the 787, but I think there are far better things to worry about when it comes to technical security in airliners than how often they need to be rebooted. For example, whether the on-board WiFi is sufficiently separated from the in-flight systems, and (as discussed recently here on HN) whether the touchscreens now used for critical flight systems are sufficiently durable, tested and redundant.

  • > is at best sensationalism, at worst pure fearmongering

    I don't know about the case here, but any time I've hit an issue in my work where "Thing X needs to be done every Y or bugs start happening" it's a pretty clear sign of some deeper issues and likely a lot of underlying bad dev processes.

    This issue might be as "simple" as a memory leak that will suddenly require reboots every N minutes when a seemingly unrelated patch exacerbates an issue.

    • Devices in systems like this are full of monotonically increasing sequence numbers used for all manner of coordination and diagnostic functions. In this case it appears to be a way to ensure some recency constraint on critical data. This is an extremely common method of attempting to assess/identify staleness of critical data (i.e. "Is the sequence number I'm looking at before or after the last one I saw, and by how much?") in critical real-time systems.

      Probably this is a counter that rolls over if it's not reset; the predictability of needing to reset it before time T is an indicator that it's a sequence number driven by a hard real-time trigger with an extremely predictable cadence.

      17 replies →
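The staleness check described above ("is the sequence number I'm looking at before or after the last one I saw?") is usually implemented with wraparound-safe serial-number arithmetic, the same idea as RFC 1982. A minimal sketch for a 32-bit counter, assuming nothing about the actual avionics code:

```python
MASK = 0xFFFF_FFFF  # 32-bit wraparound

def seq_newer(a: int, b: int) -> bool:
    """True if sequence number a is 'after' b, treating the 32-bit space
    as a circle: a is newer if it sits less than half the space ahead of
    b, which stays correct across the rollover point."""
    diff = (a - b) & MASK
    return diff != 0 and diff < 2**31

# Works across the rollover boundary:
print(seq_newer(5, MASK - 2))   # True: 5 is 8 steps after 0xFFFFFFFD
print(seq_newer(MASK - 2, 5))   # False: going the other way is "older"
```

The catch, as the comment suggests, is that the comparison is only unambiguous while the two numbers are less than half the counter space apart, which is exactly why a hard deadline before rollover shows up in a maintenance schedule.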

    • Bug-free software of any complexity is at the very least exceedingly improbable, so there is always a tradeoff to be made and a lesser evil to be chosen.

      Aircraft firmware requiring mandatory reboots in alignment with maintenance schedule, but working reliably otherwise, inspires more confidence than firmware advertised to run bug-free forever.

      20 replies →

    • Continuous Deployment environments are susceptible to versions of this. You have a slow leak, and nobody notices until Thanksgiving, when the processes that used to run for 3 days now run for 10. And just about the time you think you got that sorted out, Christmas comes along and busts the rest of your teeth out.

      2 replies →

    • Have you heard the anecdote about buffer overflows in missiles? The short story is: they don't matter if they happen after you know the missile has exploded. It doesn't by definition make it an "underlying bad dev process".

    • What about running repairs in something like Cassandra? In some cases it is by design. Here I'm a little surprised an airliner would even go that long without a reboot

  • > but I seem to remember reading somewhere ... at best sensationalism, at worst pure fearmongering

    This type of thinking is proving to be more dangerous than the ideas expressed.

    • Agreed. How is it okay to brush off serious issues like this?

      Power cycling as a maintenance requirement is absurd. You don't sell someone a garage door opener for example and say "oh yeah, by the way, part of the ownership maintenance is that you need to unplug it at least every month, otherwise it might just start opening and closing randomly. No biggie".

      This is an indication of a serious problem with the engineering, and after everything that has been failing with Boeings products lately, makes me want to avoid getting on a Boeing aircraft.

      This has to be a joke. I can't fathom that this sort of issue is in production. It means that there is some sort of unaccounted saturation of buffers, or worse, memory leaks. In a deterministic system, these are the obvious things you need to get right.

      4 replies →

  • >If the maintenance schedule is N days, then any news article pointing out how amusing it is that an airliner needs to be rebooted every <N days is at best sensationalism, at worst pure fearmongering.

    Why not treat it like security? Yes, there are other layers of defense, but any given layer needs to be measured on its own. If I find that some large government website allows JavaScript to be injected via an XSS, but prevents it from running because it only allows JavaScript from a specific origin, it is still a security flaw, because some user might be running a browser that does not implement Content Security Policy. Yes, the user shouldn't be using such an insecure browser, but the website itself still should not allow scripts to be injected without proper encoding.

  • Don't you mean >N days? If a maintenance schedule requires maintenance every 60 days but the plane needs to be rebooted every 50 days, that would be cause for concern.

  • Maybe regular maintenance does not always include rebooting the computer, removing the batteries, power-cycling, etc. If it is a safety issue, it is safer to make it explicit and not leave it optional.

    • If regular maintenance does not include rebooting the computer, this is newsworthy.

      If the maintenance schedule does document system reboots, this is boring and business as expected... no different than periodically reinflating tires or replacing oil. I'd have no concerns flying on such a plane.

      3 replies →

    • Regular maintenance includes replacing parts which have lifetimes measured in hours, having to reboot a computer on a schedule is just boring.

  • Indeed, working with a Boeing subcontractor, I've seen a few cases where something is "fixed by process control", where rules for people are designed to circumvent the deficiency in the software.

    Basically, adding more code to make software "smarter" for those edge-cases was judged (rightly or wrongly) as having an even higher risk of introducing new bugs and creating new test procedures or invalidating previous testing.

  • I strongly disagree. If the 787 would show an obvious warning before takeoff saying that it must be rebooted, that would be a different story.

    The

  • "A checks", the smallest and most frequent, are every 1000 flight hours. That's a longer interval than 51 days.

  • It does make you wonder what other bugs the software has though? Presumably it's not intentionally designed so that it needs rebooting periodically...

    • > It does make you wonder what other bugs the software has though? Presumably it's not intentionally designed so that it needs rebooting periodically...

      Not necessarily. If the computer will never actually need to run for 51 days continuously, it may be a reasonable trade-off to require the reboot instead of writing (potentially buggy) code to handle a scenario that can be easily prevented from happening.

      It reminds me of this story:

      https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...:

      > I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customer's chief software engineer said "Of course it leaks". He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.

      5 replies →
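The sizing in that story is just a budget calculation. A sketch with invented numbers (none of these constants are from the original anecdote):

```python
# Hypothetical leak budget for a device with a bounded lifetime.
leak_bytes_per_second = 512        # assumed measured worst-case leak rate
max_flight_seconds = 30 * 60       # assumed longest possible flight
safety_factor = 2                  # the story's "doubled that number"

extra_ram_needed = leak_bytes_per_second * max_flight_seconds * safety_factor
print(extra_ram_needed)            # -> 1843200 bytes, ~1.8 MB of headroom
```

The same reasoning applies to the 787 case: a bounded interval between reboots turns an unbounded resource problem into a fixed, provisionable budget.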

    • Not necessarily - memory management depends on the application. There are situations where it is better to just let memory usage grow than to garbage collect. Since airplanes require very routine maintenance anyway, this may be safer.

    • Lots of things are designed with periodic restarts in mind. From memory, one of the JVM garbage collectors is designed with daily restarts in mind. This is done to avoid having to deal with the expense of memory fragmentation.

  • This reports that the bug was found and a mitigation was put in place (maintenance schedules were amended to include a reboot). Where is the "fearmongering"? Because it mentions that failing to implement the mitigation would be bad?

  • I am going out on a limb here but Google makes this easy enough to verify.

    Waxing anecdotal from a rando's unverified, admittedly fuzzy memory is a pretty banal application of agency. "People aren't sharing info correctly! Allow me to fix this while admitting to a fuzzy understanding myself!"

    Imagine what could be done if we channeled that emotional urge to think and know into actually learning, rather than bloviating online, bending readers' moods to our position, and disregarding literal effort at verification.

    Add in behind the back character assassination of the author, and yeah really buying into your expertise. At least I’m being dismissive to your face.

    • A lot of people on HN are at work at these hours, and don't have the time to do the full research.

      You don't know that he didn't do a quick search and nothing came up.

      2 replies →

  • You know, there's a trend going on that makes things kind of confusing, and that is accusing everything of being fearmongering even when it may not be.

    Things like the coronavirus not being that bad, or surgical masks being useless. I genuinely question whether what you're saying is actually true or just following the trend.

    I honestly can't tell these days.

  • What I find funny is how developers still "forget" to account for counters that increment and will overflow after X days (and the process doesn't catch it).

    It was funny in Windows 95 days (and Unixes knew how to handle those) now it's just sad.

    Of course the problem might be a bit more complex and it might be a combination of issues. Still not good though

I have heard this anecdote before. In case you haven't: 51 days is awfully close to 2^32 ms ...

Safe use of Microsoft Windows also requires rebooting on a slightly shorter schedule, because GetTickCount will overflow. In particular if you're running a real-time simulation which is likely to use delta time as a critical parameter, and you can't audit the code or know for a fact that it uses GetTickCount.

  • I've often wondered whether this is how people came to the conclusion that "Windows is unstable, you can't leave it on for more than a couple of months!"

    At one company I worked for, all of our National Instruments test equipment would start to fail with communication problems after about two months on our Windows XP computers. Being familiar with GetTickCount, I rebooted the computers, recorded the date, verified the next failure was 49 days later, then emailed National Instruments with a link to the GetTickCount documentation. They pushed out an update with a fix 3 days later. Oops.
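The usual defense when you're stuck with a 32-bit tick source like GetTickCount is to compute elapsed time with modular subtraction, which stays correct across a single rollover. A sketch (plain Python, emulating the 32-bit wraparound explicitly since Python ints don't overflow):

```python
MASK = 0xFFFF_FFFF  # 32-bit wraparound

def tick_delta(now: int, then: int) -> int:
    """Elapsed ticks from 'then' to 'now', correct across one 32-bit
    rollover: (now - then) mod 2^32. In C this is what plain unsigned
    subtraction gives you for free."""
    return (now - then) & MASK

# Just before and just after the wrap:
then = MASK - 10                # 10 ticks before rollover
now = 20                        # 20 ticks after rollover
print(tick_delta(now, then))    # -> 31, not a huge bogus value
```

Code that instead computes `now - then` as a signed quantity, or compares raw timestamps directly, is exactly the kind that fails after 49.7 days.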

One thing I'm curious about is what is classified as a "reboot" of the plane. If it is parked overnight, are all the systems shut down and restarted the next day when it is put back in service? Does it sit "running" in some sleep mode? Last time I checked, a plane cannot (practically) stay airborne for 51 days. Is the reboot a painful five-day procedure? There are too many unknowns to sound alarm bells.

Do safety-critical systems have memory burned as ROM, instead of having dynamic memory allocation? From my point of view, the plane doesn't really change, so the avionics suite shouldn't require changing either. You build a physics model of the plane, translate it into memory, then bake it in, and dynamic allocation is only needed for when you need inputs. Or is this dangerous because the physics model does change significantly for different loadouts?

I'm not an aerospace or embedded engineer.

I wish some upstart would create a plane-making startup to crush Boeing at its inefficient, bloated game.

We need an Elon Musk of flight, Boeing has gotten away with too much mediocrity

  • Are we talking about the same Elon Musk whose car company created an autopilot that has been known to crash into the sides of trucks? Have I missed the obvious sarcasm?

  • Is there as much opportunity? It could be that aircraft are much more optimized than rockets were. We already reuse airplanes.

Getting late to the discussion, but people have been tackling this in software engineering for a long time; it's called Software Rejuvenation, with models of repairing systems, Markovian assumptions, applications in the JVM, etc. Interesting topic. It was used to analyse the Patriot missile, which needed the same approach to periodically replenish its internal variables.

I think the Airbus A350 needs a periodic reboot as well.

I am guessing the manufacturer doesn't have the budget to fix this; they are too busy sorting out the 737 pitch controls. I am guessing they need a bigger buffer that would clear itself out, some good GPS and timestamp database integration, and a "clear" button on the console to wipe the historic altitude and speed data. The historic data can go to the black box and the new data can be stored in the buffer. A sensor should only look at data from the past week, not calculate from data 49 days in the past. What use would pilots have for it, other than service and maintenance? What was the OS written in, Objective-C?

Overflow is not the only kind of bug triggered by uptime. In February 1991 a Patriot missile failed to intercept an incoming Scud due to an accumulation of time-based errors. The missile system had been online for 100 hours, and this resulted in enough error that the intercept calculation was incorrect. People died.

http://www-users.math.umn.edu/~arnold//disasters/patriot.htm...

  • I wonder if there were ever uptime issues caused by heap fragmentation.

    • This happens on set top boxes, especially when the graphics memory heap is allocated separately from the system memory heap. The graphics memory heap can be fragmented and surfaces stop being rendered because there are no contiguous memory blocks large enough. Having two heaps on a low memory device leads to unfortunate compromises.
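The Patriot time-drift arithmetic upthread can be sketched in a few lines (plain Python; the constants come from the linked analysis, and the 23-fractional-bit truncation is the commonly cited model of the 24-bit register):

```python
# The Patriot failure: the system counted time in tenths of a second and
# converted by multiplying by 0.1 stored as a truncated binary fraction.
# 0.1 has no exact binary representation; the truncation loses ~9.5e-8
# per tick (equivalent to keeping 23 fractional bits of 0.1).
tenth_fixed = int(0.1 * 2**23) / 2**23   # truncated fixed-point value of 0.1
error_per_tick = 0.1 - tenth_fixed       # ~9.54e-8 seconds per tick

ticks = 100 * 3600 * 10                  # 100 hours of 0.1 s ticks
clock_error = error_per_tick * ticks
print(round(clock_error, 2))             # -> 0.34 seconds of accumulated drift

scud_speed = 1676                        # m/s, per the linked page
print(round(clock_error * scud_speed))   # tracking gate off by ~575 m
```

No counter overflowed here; the error was silent accumulation, which is why "it ran fine in testing" proved nothing about 100 hours of uptime.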

I'm a bit ashamed of this, but I guess I'm not the only one. At work we have a system that started crashing, and we could not figure out why. It runs normally, but restarts after some time and then continues to function properly. So what did we do? Ran multiple instances behind a proxy and let instances crash. The cluster as a whole functions perfectly even when parts of it are restarting because of an unknown error that we have no capacity to identify and fix.

Is it because it runs on Windows 95?

  • It was 49.7 days for Windows 95:

    https://sites.google.com/site/edmarkovich2/whywindows95andwi...

    Still, it's remarkable that two separate Seattle-based companies have produced a similarly short time bomb on very expensive and highly visible product development projects.

    • This wasn't noticed for a few years after Win95 was released.

      The joke was that nobody had ever had a Win95 system stay up for 49 days. Mwah-hah-hah.

    • GetTickCount is a monotonic tick count still supported in Windows 10, available for any application to use. These days you should use GetTickCount64, but any application that doesn't handle the rollover of GetTickCount is buggy.

  • Good thing we improved with Windows 10 and have to reboot only every 52 days. Progress!

    Seriously, Windows 10 does tend to get dodgy if you don't reboot for a few weeks. I'm not the only one who's noticed. Granted, it's less likely to outright crash, but it acts increasingly drunk.

      It gets slower over time. I still cannot understand how an OS can make things slower over time.

      I made a practice of rebooting my work computer every Monday, but it really wasn't enough. Now I restart it every chance I get.

I honestly think the safest solution is that an aircraft should refuse to take off after two weeks until you reboot it. Instead, Boeing and Airbus leave it to customers to test whether the plane still flies after six months.

Is it common to leave the electronics running for that long anyway? My naive understanding is that it would be rebooted after every flight anyway.

  • Ideally a plane is spending as little time as possible not doing anything. It's on the ground for as short as possible and there's ideally always something happening that needs monitoring or communication. Restarting a bunch of low-level systems just because doesn't fit into that, so apparently a 51 day span without powering it off wouldn't be unheard of.

And to think of the billions given to Boeing to bail it out, while the management team who got it into this state got golden parachutes.

If the government deems that Boeing must be saved, it should also deem that the prior management was negligent and the cause of this situation, seize their personal assets, and hold them criminally accountable.

The article doesn't mention this, but 51 days is approximately 2^32 milliseconds...

  • 2^32 ms is about 49.71 days ( (2^32)/(1000 * 3600 * 24) ), so less than the reboot cycle of 51 days.

    • I mentioned this in another thread, but 2^32 * 1024us is 50.9 days. So it's probably a systick at 1.024ms overflowing a uint32_t. If you've got a 1us timer it's a lot cleaner for the CPU to make the tick happen at 1024us than at 1000.