Boeing 787s must be reset every 51 days or 'misleading data' is shown (2020)

1 year ago (theregister.com)

I was speaking with a 787 pilot last Sunday. I told him that the week before, at an airport, two pilots sitting next to me were talking about how "This is the third bloody 787 rescue we've had this month... I can't believe we had full engine and <I think he said auxiliary?> failure at the same time". I asked him if this is common and he said "I hear of it, but I haven't had that many major failures, just lots of little things. Last time I flew in from <city>, a few moments after we touched down we lost auxiliary power from the rear engine and all the cabin lighting went black along with a number of other things. Thankfully we were straight and had already lost most of the speed we were carrying, so we were fine and taxied to the disembark location. They had it up and flying again within the day, but it certainly was disconcerting, to say the least".

I'm slightly paraphrasing from memory there, but I was certainly quite surprised by how calm he was about the whole thing. There's no way I'd board one of those things.

  • Modern two-engine planes like the 787 have an auxiliary power unit (APU) in the tail. This is a small turbine that runs a generator and a pump for the hydraulics. It’s typically only turned on when the plane is on the ground, or if there’s an emergency in mid-air. It is also needed to start the main engines, so if the APU is faulty the plane will probably be stuck where it is. In theory a 787 can take off with just one engine, but this is not very safe and would only be done in the most exceptional circumstances.

    There are variations on this depending on the plane model, of course. Some older planes can use an external starter for their engines, but I think that’s very rare now.

    • Aircraft with INOP APUs can generally be "air started" with a ground-based high-pressure air system. It's relatively common and I've been on a plane that had to do the procedure. It was entirely undramatic other than engines being started before the pushback, but I doubt most passengers even noticed.

      Now, interestingly, the 787 is a "bleedless" aircraft, so it doesn't use high-pressure air from the APU to spool up the engines. I believe it can use its hefty bank of lithium-ion batteries to start its engines if the APU (and associated electrical generator) is INOP.

      Not a pilot/engineer - just an enthusiast. Someone more au fait with the 787 might be able to correct me on the above.

    • >Modern two-engine planes like the 787 have an auxiliary power unit (APU)

      Where "modern" here includes jet airliners made in the 70s yes.

      >It is also needed to start the main engines

      The engines need an air source, and the APU can be an air source, but at one point at least, big airlines preferred using ground hookup provided air sources for starting, in order to save gas. Next time you fly, look at the jetway. There will be a large yellow duct system underneath it that can be hooked up to the plane to provide pneumatic pressure and air conditioning air without starting the APU. There are similar hookups for electrical power so that a plane won't drain its battery during routine turnover operations.

      The bottom price flights I've taken recently don't seem to hook either up though, preferring instead to start the APU during taxi to the gate while shutting down one engine, shut down the other engine once they are at the gate, and reverse the process to taxi back out to the runway. The turnaround time is so short, and the required work to clean and restock the cabin so little, I bet they just don't pay for ground hookups.

  • APU failure maybe? That would be troublesome indeed; with no engines and no APU you'd lose most instrumentation and a lot of the hydraulics.

    • There is also a RAT (ram air turbine) at the back that can be deployed to generate some power (~5-10 minutes max) in case of a severe emergency in the air. It is what you sometimes hear when an aircraft makes a very shrill noise flying over your head.

      However, if it is not a test flight, a RAT deployment should make you very uncomfortable and worried…

    • I thought the guy I was speaking with mentioned something about instrumentation, but I wasn't 100% sure and that sounded more serious, so I didn't mention it. But if the aux engine failing would do that, I guess that lines up!

Previous: https://ioactive.com/reverse-engineers-perspective-on-the-bo...

The 47 bit timestamp at 32MHz would explain the duration (Though not why it isn't 33MHz?).

  • I have a way simpler explanation. IEEE 754 double can only represent integers up to 2^53 without precision loss, so if you naively average two numbers greater than 2^52, you get an erroneous result.

    It just so happens that 2^52 nanoseconds is a little bit over 52 days.

    I've seen the same thing with AMD CPUs where they hang after ~1042 days which is 2^53 10-nanosecond intervals.
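
The candidate explanations in this sub-thread can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted above (a 47-bit counter at 32 MHz or 33 MHz, 2^52 nanoseconds, 2^53 ten-nanosecond intervals). It assumes nothing about how the 787 software actually keeps time, and the helper name `days_until` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def days_until(ticks: int, tick_hz: float) -> float:
    """Days needed for a counter to reach `ticks` at `tick_hz` ticks per second."""
    return ticks / tick_hz / SECONDS_PER_DAY

# 47-bit counter at 32 MHz vs 33 MHz (figures from the parent comment).
print(f"2^47 ticks @ 32 MHz: {days_until(2**47, 32e6):.1f} days")
print(f"2^47 ticks @ 33 MHz: {days_until(2**47, 33e6):.1f} days")

# 2^52 nanoseconds and 2^53 ten-nanosecond intervals (figures from the reply).
print(f"2^52 ns:             {days_until(2**52, 1e9):.1f} days")
print(f"2^53 x 10 ns:        {days_until(2**53, 1e8):.1f} days")

# Above 2^53 a double can no longer represent every integer,
# so adding 1 is silently lost.
big = float(2**53)
print(big + 1.0 == big)
```

Of these, the 47-bit counter at 32 MHz is the one that lands closest to 51 days, but that is a match on the numbers only, not evidence of the actual cause.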

Had a similar problem to this many years ago. Happened every 24 days approximately and lost one user setting. Had a logic analyser connected to it for days trying to reproduce the issue in some way. Went to go for a piss and get a coffee one afternoon and came back and there it was triggered!

What happened? Well, it turns out there was a timer that no one used that overflowed and caused an interrupt which wasn’t handled any more. The interrupt handler fell through and caused a halt, the WDT fired, rebooting it, and some idiot hadn’t stored that one setting in the NVRAM.

So then we had more problems. 5000 things with EPROMs in that were rebooting every 24 days which were spread all over the planet. Many questions to ask over how the hell it ended up like that.

I hope people are asking these sorts of questions at Boeing.

Edit: also the source code we had did not match what was on the devices. Turned out the engineer who provided the hex file hadn’t copied that code to the file server and had left a year beforehand. We didn’t find that until the WDT fired and piqued our interest and could reproduce it on the dev board because the software was different (should have checked that past the label on the ROM which was wrong!)

I’d note that commercial airplanes generally operate with 6-7 9’s of availability. For anyone that’s ever built a system with 5 9’s, this is impressive. In fact it’s impressive enough you probably don’t think twice about sleeping on a flight.

  • Six 9s would be half a minute of downtime per year.

    I don't see how that is possible given the maintenance required for these planes. Even the simple A checks ground a plane for hours every couple hundred flights while D checks take months to complete every 6-10 years.

    Edit: minute not hour
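
The half-a-minute figure follows directly from the definition of "nines". A rough sketch of the conversion, assuming a 365-day year and using a made-up helper name `downtime_seconds_per_year`:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years

def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability

for n in (5, 6, 7):
    print(f"{n} nines: {downtime_seconds_per_year(n):.1f} s of downtime per year")
```

Whether "availability" is even the right metric for an aircraft (as opposed to a failure rate per flight hour) is a separate question that the numbers alone don't settle.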

  • That is presumably historic data though?

    6-7 nines is a lot of nines and we’ve had a couple of issues in quick succession now

  • > it’s impressive enough you probably don’t think twice about sleeping on a flight.

    I don't think twice about sleeping on a flight because I've already made my bed at that point - nothing I can do if something goes wrong.

    (Well, I've woken my wife when a doctor was called for before, but that's about the extent of my usefulness.)

    • I’ll wager if you got into a situation you can’t escape where you had a 30% chance of a horrific death over the next six hours you wouldn’t snuggle into your sound suppressing headphones and doze off between snacks no matter how inevitable things are.

  • > I’d note that commercial airplanes generally operate with 6-7 9’s of availability.

    Maybe they used to, but Boeing has been doing rather worse and that’s the point here isn’t it?

Airbus A350s had the same issue: https://www.theregister.com/2019/07/25/a350_power_cycle_soft...

We’re just going to see more and more issues like this as more and more software is used in applications like this. I would be willing to bet that a Tesla would also spontaneously crash if left on for hundreds of hours, but they just rarely if ever are left on that long.

  • Ford F150 Lightning had a similar issue on a cross country road race some YT'ers put on. It died at 13% battery, Ford said it was due to not letting the truck rest.

This is remarkably business-as-usual for airplane electronics.

As a more mundane example: the wifi on planes does temporary [edit: DHCP, not NAT] leases. But the system on many has expiration windows on the order of hours, possibly more than a day... Couple that with the number of passengers planes serve and busy routes can easily exhaust the lease pool.

The solution: there's a button the flight attendants can push to reboot the router, dumping the lease table.

  • Even with super long leases, couldn’t they just have a larger subnet? A /8 oughta do it.

    But I guess we’re talking about the same people who made the mistake in the first place…

    • To steelman the choice, the reserved IP /8 subnet is 10.x.x.x and is often used for corporate networks and other larger subnets experience similar usage. People on the plane using WiFi are likely to access their corporate networks via VPN, potentially causing routing issues.

      Users VPNing into the reused address space for their own home VPN are probably knowledgeable enough to figure out what is going on and a small enough user base to not care about.
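
Both halves of this exchange can be illustrated with Python's standard `ipaddress` module: how much room a /8 gives compared with a small pool, and why reusing 10.0.0.0/8 on the plane can collide with a corporate network reached over VPN. The specific subnets below (a /24 cabin pool and a 10.20.0.0/16 corporate range) are invented for illustration and are not taken from any real in-flight system.

```python
import ipaddress

cabin_small = ipaddress.ip_network("192.168.1.0/24")  # hypothetical small lease pool
cabin_large = ipaddress.ip_network("10.0.0.0/8")      # the /8 suggested above
corporate = ipaddress.ip_network("10.20.0.0/16")      # hypothetical VPN-side network

print(cabin_small.num_addresses)  # total addresses in the /24
print(cabin_large.num_addresses)  # total addresses in the /8

# If the cabin network and the corporate network overlap, a laptop cannot tell
# whether 10.20.x.x is local or on the far side of the VPN.
print(cabin_large.overlaps(corporate))
print(cabin_small.overlaps(corporate))
```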

Scary as it is, is there any reason for a passenger jet to have uptime of more than, say, 24hrs? Wouldn't you just switch it off and on again between every flight, regardless?

If this issue was in a car, we would never know as no one keeps their car running for 50 days straight.

  • Overnight, planes tend to be plugged in to ground power, to ventilate, keep the batteries charged, for the cleaning crews, etc. Most get rebooted once in a while, but it's always possible one won't be, hence the directive to be certain.

    This particular problem has been known for years (the article is from 2020).

    • Unfortunately, an aircraft has no “reboot”. It is just a violent power cut. A lot of headache is introduced in non-critical aircraft software because there is no “graceful shutdown” or long power duration. In fact, certain hardware has an upper limit (much lower than a week) before which it needs a power cut (sometimes called a power cycle), or it suffers from various buffer overflows and counter overflows and starts acting mysteriously.

  • Many cars' control units continue to run while the car is off. If you want to reboot your vehicle, you need to unplug the 12v battery for at least a minute.

    • On some cars (recent VWs in particular) when you plug the battery back in you need to twiddle some settings in the computer otherwise the charging circuit will fry the battery prematurely. We've gotten ahead of our skis with this nonsense, time to rein it in.

  • Some of these planes are constantly flying as long as they're not in maintenance. A plane not in the air is a plane the company bought that's not currently generating profit.

  • I’ll bet you the typical EV stays powered on 24/7 with reboots around OTA updates.

    • unsure what you mean here. most of the systems go to a sleep state in modern vehicles ev or not. the 12v battery keeps only certain ECU's up - think ECUs that control alarm, lock and unlock state and any communication with the mobile app via LTE... but the rest of the systems are OFF, you don't want an EV battery to hit 0% and 12V to also hit 0% - that would basically make it a brick from what I understand- because EV's have contactors which need to shut for the battery to be 'engaged' the 12V battery controls these contactors.

  • Very strange, because for me, a medium-sized aircraft is never alive for more than 24h. A big one like the 787 may be alive for up to 72h (assuming longer routes). 50 days would be a dream for me and a lot less headache, but it is very expensive to keep an aircraft powered that long with ground power.

    • > it is very expensive to keep an aircraft powered that long with ground power.

      Why do you say this?

  • I know someone on the north slope of Alaska. He does not turn his personal truck off all winter. This is even more typical for semi trucks and whatnot around there.

  • I think it's about the worst case scenario. You wouldn't want this to happen even rarely, especially when it can be solved by putting more time (and god forbid, money) into R&D.

  • Airlines will run the aircraft as long as possible. As another commenter mentioned, if an aircraft isn't in flight, it's in maintenance. All of these times it's on.

51 days? That looks like the old Windows 95/98 bug, where it used a 32-bit variable to store uptime in milliseconds
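
For reference, a 32-bit millisecond counter wraps after about 49.7 days, and code that compares raw counter values misbehaves once it does. The sketch below simulates such a counter in Python by masking to 32 bits. The names `tick32`, `naive_elapsed` and `wrapping_elapsed` are illustrative only, and nothing here is meant to describe the actual Windows or Boeing code.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit unsigned counter holds values 0 .. 2^32 - 1
MS_PER_DAY = 86_400_000

def tick32(uptime_ms: int) -> int:
    """What a 32-bit millisecond counter reads after `uptime_ms` of real uptime."""
    return uptime_ms & MASK32

def naive_elapsed(start: int, now: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now - start

def wrapping_elapsed(start: int, now: int) -> int:
    """Subtraction modulo 2^32: stays correct across a single wrap."""
    return (now - start) & MASK32

print(f"wraps after {2**32 / MS_PER_DAY:.1f} days")

start = tick32(2**32 - 1_000)  # one second before the wrap
now = tick32(2**32 + 1_000)    # one second after the wrap
print(naive_elapsed(start, now))     # negative: the bug
print(wrapping_elapsed(start, now))  # 2000 ms, as expected
```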

In the software world I call this an end-user-discovered issue. But when the issue involves a plane that is carrying actual souls, that can feel very scary.

I am sure this has been resolved by now since it's from 2020.

  • I don't think airplane software ships updates the way npm packages do. I would be more surprised if this is fixed.

    • > I don't think airplane software ships updates the way npm packages do.

      I'd ideally like to sleep tonight, thanks.

    • They do get software updates. Watch "Stig Aviation" "Stig Shift" series on youtube. He's shown how to do updates in a few of his videos.

  • Scary would be right.

    Reminds me of the F-22 Raptor crossing the International Dateline error in 2007. They were flying a squadron of them from Hawaii to Japan. They crossed the IDL and all nav/fuel systems went down, as well as some communications gear.

    They only made it back because they were flying with tankers at the time, who led them back to base.

  • That depends on how much code was having trouble, and what you mean by "resolved".

    The safe option might be to avoid the situation, and I could imagine that even if there is a code update it might just make the plane balk at getting ready to take off after a certain amount of uptime.

There was a similar problem with a specific generation of 688-class submarines, where a calculated temperature would slowly drift. The metric wasn’t used for any protective actions, so it wasn’t a “shut down immediately and return home surfaced on the diesel” situation, but still disconcerting.

I assume that after this the software was soak-tested for weeks / months to eliminate that class of bug. Naval Reactors is many things, but repeating the same mistake twice isn’t one of them.

It sounds like my random Raspberry Pi sitting somewhere in my server room that has to be restarted every <x> weeks.

There are just too many worrying signs from Boeing in the last years.

I have no idea about these things at all but some of the issues seem almost unforgivable to me.

They should work very hard for the industry, and the ultimate end users to regain confidence in them again. I'm not sure they are doing this.

If problems persist after rebooting, you may need to use a giant paperclip to perform a reset.

I'm honestly impressed that the Register included a prominent blurb explaining to the reader that while this sounds like a catastrophic issue, the most likely outcome if this is experienced in flight is a safe and controlled landing.

> Sidenote
>
> Pitch and power is a simple concept. If you have the throttles, say, three-quarters open and the nose of the aeroplane is pointing a few degrees above the horizon, chances are you're probably flying straight and level at a safe speed. Training manuals normally contain a number of precise pitch and power settings (they vary between aeroplane types) so if display systems start failing, pilots can fall back to these with confidence.

> This alarming-sounding situation

That's not what's alarming to me. What's alarming is that the plane could possibly be in a position to be continuously powered on for 51 days in the first place.

  • When a minute of downtime costs thousands, why wouldn't you expect planes to be in constant utilization?

    • > why wouldn't you expect planes to be in constant utilization?

      They require weekly maintenance which takes them out of service for at least 12 hours.

      What we may think of as 'constant utilization' is quite different in a regulated fleet environment like airlines.

easy:

  while (true) {
    // pseudocode: uptimeDays is assumed to be the time since power-on, in days
    if (
      this.sys.uptimeDays >= 51
        && !this.sys.isFlying  // only reset when safely on the ground
    ) {
      this.sys.resetNow();
    }
    time.sleep(1000);
  }

  • well now your system doesn't do anything because it's stuck in a forever loop checking the time. it's most likely programmed in C so you can remove the OOP as well.

51 days * 86400 seconds * 1000

=> 4406400000

2^32

=> 4294967296

the coincidence seems unlikely, it's basically ~~5 hours and a half~~ 30 hours of difference if one has a 1-ms counter increment
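
The same arithmetic as a small script, for anyone who wants to check the gap. It assumes a counter that increments once per millisecond, as the comment does, and the variable names are just for illustration.

```python
MS_PER_DAY = 86_400 * 1_000
MS_PER_HOUR = 3_600 * 1_000

ms_in_51_days = 51 * MS_PER_DAY  # 4,406,400,000
wrap_32bit = 2**32               # 4,294,967,296

# How far past the 32-bit wrap point a full 51 days would be.
gap_ms = ms_in_51_days - wrap_32bit
print(f"gap: {gap_ms} ms = {gap_ms / MS_PER_HOUR:.2f} hours")
```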

this is what happens when you hire based on checked checkboxes and not qualifications.

This company just can’t stay out of the news. Their planes are trash. Software is straight garbage. Many people have died because of this company and suffered undue stress/anxiety because of the massive dip in quality.

Boeing engineers/builders caught on audio stating they wouldn’t be caught dead in their own planes unless feeling suicidal.

  • The company definitely can't stay out of the news and it's gone downhill over the recent years but you've picked an interesting post to lament about those on. The news they can't stay out of is over 4 years old in this case. The model of plane it's about (787) has never had a single fatality despite >15 years of operations and >1,000 units operating today. In all, deaths are probably the worst possible metric to berate Boeing on - including every death (e.g. hijackings, not just engineering failures) their popular 747 line has had comes to <6,000 fatalities despite carrying billions of passengers over a period of >50 years.

    Despite their ever increasing incompetence on delivery speed, test compliance, and innovation... commercial air travel with Boeing (and other major air manufacturers) has always been one of, if not the, safest mechanisms of travel we've ever executed on. Particularly the last 5 years have been the safest period in terms of air travel deaths or injuries.

    None of that means we shouldn't criticize Boeing by any means, just that doing it over perceived death and accident counts because of what news headlines imply is complete nonsense in terms of actual numbers no matter how you slice it. It's important those kinds of things are reported but it's equally important to not get swept up in paranoia over it.

    • Agreed, my 737 fears were relieved by researching how many of them are in the air at any moment, how many millions of trips they fly each year, how old airframes can get before they get retired, etc. Even the "worse" models are feats of engineering.