Comment by hnarn
6 years ago
I am going out on a limb here but I seem to remember reading somewhere that airliners do have maintenance schedules that are very strictly kept, for obvious reasons. If the maintenance schedule is N days, then any news article pointing out how amusing it is that an airliner needs to be rebooted every >N days is at best sensationalism, at worst pure fearmongering.
I don't know for a fact this is the case here for the 787, but I think there are far better things to worry about when it comes to technical security in airliners than how often they need to be rebooted. For example, whether the on-board WiFi is sufficiently separated from the in-flight systems, and (as discussed recently here on HN) whether the advent of touchscreens for critical flight systems is sufficiently durable, tested and redundant.
> is at best sensationalism, at worst pure fearmongering
I don't know about the case here, but any time I've hit an issue in my work where "Thing X needs to be done every Y or bugs start happening" it's a pretty clear sign of some deeper issues and likely a lot of underlying bad dev processes.
This issue might be as "simple" as a memory leak that will suddenly require reboots every N minutes when a seemingly unrelated patch exacerbates an issue.
Devices in systems like this are full of monotonically increasing sequence numbers used for all manner of coordination and diagnostic functions. In this case it appears to be a way to ensure some recency constraint on critical data. This is an extremely common method of attempting to assess/identify staleness of critical data (i.e. "Is the sequence number I'm looking at before or after the last one I saw, and by how much?") in critical real-time systems.
This is probably a counter that rolls over if it's not reset; the predictability of needing to reset it before time T is an indicator that it's a sequence number driven by a hard real-time trigger with an extremely predictable cadence.
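As a generic illustration of that staleness check (a sketch only, not anything from the 787's actual code), wraparound-safe comparison of a fixed-width sequence number typically uses serial-number arithmetic:

```python
# Wraparound-safe sequence-number comparison for a 32-bit counter.
# Purely illustrative; names and widths are assumptions, not the 787's code.
SEQ_MOD = 1 << 32  # counter width: 32 bits

def seq_newer(a: int, b: int) -> bool:
    """True if sequence number a comes 'after' b, tolerating rollover.

    The signed distance modulo 2**32 decides the ordering, so a counter
    that just wrapped (e.g. a=5 after b=2**32-3) still compares correctly.
    """
    diff = (a - b) % SEQ_MOD
    return 0 < diff < SEQ_MOD // 2

# A naive `a > b` comparison fails at the wrap point; this does not:
print(seq_newer(5, (1 << 32) - 3))  # True: the counter rolled over
```

The failure mode in the article is the other half of the trade-off: if nothing in the system does this kind of modular comparison (or the counter is simply never allowed to wrap), a reboot before the rollover point becomes the mitigation.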
Exactly. I even worked on a medical instrument that had a similar problem.
Early in development we ran everything (real-time embedded system) on a system interval whose finest level of granularity was a 1ms tick. The system scheduler used a 32-bit accumulator that we knew would roll over after 50 days. However, we were given assurances that the system would have to be powered down for maintenance weekly so it didn't matter. Since proper maintenance is a hard requirement (or the instrument will start reporting failures) that was OK.
Eventually, some time after release we started getting feedback that the system was shutting down for no apparent reason. We investigated and found that those failures were all due to not having been powered down in months.
Apparently, since it could take up to 30 minutes after powering on the instrument before it was ready to run, some labs were performing maintenance with the power still on, so the machines were hitting much higher than expected uptimes. In many cases it wasn't an issue, but if time rolled over in the middle of a test, the instrument would flag the "impossible" time change as a fatal error and immediately shut everything down.
Next release moved to a 64-bit timer. I think we're good :-)
3 replies →
I think you are right. 50 days is 4.32e9 ms, which is just a bit over the maximum value of an unsigned 32-bit int (about 4.29e9), so the rollover actually hits slightly before the 50-day mark.
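A quick check of that arithmetic (assuming the 1 ms tick and 32-bit accumulator described above):

```python
# Where a 32-bit millisecond tick counter rolls over.
MAX_TICKS = 2**32                      # 4_294_967_296 distinct values
MS_PER_DAY = 1000 * 60 * 60 * 24

rollover_days = MAX_TICKS / MS_PER_DAY
print(round(rollover_days, 1))         # 49.7 -- just short of 50 days
```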
11 replies →
Bug-free software of any complexity is at the very least exceedingly improbable, so there is always a tradeoff to be made and a lesser evil to be chosen.
Aircraft firmware requiring mandatory reboots in alignment with maintenance schedule, but working reliably otherwise, inspires more confidence than firmware advertised to run bug-free forever.
Aren't Ada and similar languages designed for safety critical cases like this?
When lives are on the line software should be tested for reliability beyond 51 days. Having to restart is a symptom of reckless disregard for safety IMO.
18 replies →
I have audio devices which, when installed, will run flawlessly until power or hardware failure. The firmware isn't bug free, but the operation never encounters bugs.
Continuous Deployment environments are susceptible to versions of this. You have a slow leak, and nobody notices until Thanksgiving, when the processes that used to run for 3 days now run for 10. And just about the time you think you got that sorted out, Christmas comes along and busts the rest of your teeth out.
I worked at Google way way back when. We had an emergency code-red situation where dozens of engineers from all over the company had to sit in a room and figure out what was making our network overload. After a bit of debugging it became clear that Gmail services were talking to Calendar services with far more traffic than anybody would have expected. A little more debugging later and it became clear that restarting the Gmail server fixed the issue. One global rolling restart later and all was well.
But then the debugging started. Turns out the service discovery component would health check backend destinations once a second. This was fine as it made sure we would never try to call against a server that was long gone. The bug was that it never stopped health checking a backend. Even if the service discovery had removed a host from the pool long ago. Gmail had stopped deploying while it got ready for Christmas, and Calendar was doing a ton of small stability improvement deploys. We created the perfect storm for this specific bug.
The most alarming part? This bug existed in the shared code that did RPC calls/health checking for all services across Google and had existed for quite a long time. In the end though, Gmail almost took Google offline by not deploying. =)
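The leak pattern described above is easy to reproduce in miniature. This is an illustrative sketch only (class and method names are invented, not Google's actual discovery/RPC code):

```python
# Sketch of the bug: per-backend health-check state is created on add
# but never torn down on remove, so check traffic grows with the number
# of deploys rather than the size of the live pool. (Invented names;
# not Google's actual code.)

class DiscoveryClient:
    def __init__(self):
        self.pool = set()       # backends currently eligible for traffic
        self.checkers = set()   # backends still being health-checked

    def add_backend(self, addr):
        self.pool.add(addr)
        self.checkers.add(addr)

    def remove_backend(self, addr):
        self.pool.discard(addr)
        # BUG: self.checkers is never pruned, so the removed backend
        # keeps receiving one health-check RPC per second, forever.

    def rpcs_per_second(self):
        return len(self.checkers)

# 100 rolling deploys of Calendar, each replacing the previous backend:
c = DiscoveryClient()
for deploy in range(100):
    c.add_backend(f"calendar-{deploy}")
    if deploy:
        c.remove_backend(f"calendar-{deploy - 1}")
print(len(c.pool), c.rpcs_per_second())  # 1 live backend, 100 RPCs/s
```

A service that deploys often resets this state constantly, which is exactly why the bug only surfaced when Gmail stopped deploying.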
1 reply →
Have you heard the anecdote about buffer overflows in missiles? The short story is: they don't matter if they happen after you know the missile has exploded. It doesn't by definition make it an "underlying bad dev process".
What about running repairs in something like Cassandra? In some cases it is by design. Here I'm a little surprised an airliner would even go that long without a reboot
If it goes from 51 days to N minutes, it'll be caught real quick by QA testing for the patch, won't it?
I have no idea, hopefully? Or maybe this bug just manifests as a minor fuzziness on calculations that'd fall within acceptable error for seemingly unrelated tests. I also have no idea what Boeing's QA is like and I feel like assuming the best is clearly incorrect.
> but I seem to remember reading somewhere ... at best sensationalism, at worst pure fearmongering
This type of thinking is proving to be more dangerous than the ideas expressed.
Agreed. How is it okay to brush off serious issues like this?
Power cycling as a maintenance requirement is absurd. You don't, for example, sell someone a garage door opener and say "oh yeah, by the way, part of the ownership maintenance is that you need to unplug it at least every month, otherwise it might just start opening and closing randomly. No biggie".
This is an indication of a serious problem with the engineering, and after everything that has been failing with Boeings products lately, makes me want to avoid getting on a Boeing aircraft.
This has to be a joke. I can't fathom that this sort of issue is in production. It means that there is some sort of unaccounted-for saturation of buffers, or worse, memory leaks. In a deterministic system, these are the obvious things you need to get right.
Did you really just compare a passenger aircraft to a garage door opener?
It would also be absurd to say that part of the maintenance for your garage door opener is tearing the whole thing apart after x amount of use, but that's absolutely standard for aircraft engines.
1 reply →
Commercial airliners, as I'm sure you are aware, are not garage doors.
If it's an indication of serious problems with engineering then Airbus aren't immune to them either:
https://ad.easa.europa.eu/blob/EASA_AD_2017_0129_R1.pdf/AD_2...
Why?
>If the maintenance schedule is N days, then any news article pointing out how amusing it is that an airliner needs to be rebooted every <N days is at best sensationalism, at worst pure fearmongering.
Why not treat it like security? Yes, there are other layers of defense, but any given layer needs to be measured on its own. If I find that some large government website allows for javascript to be inserted for an XSS, but prevents it from running because it only allows javascript executed from a specific javascript origin, it is still a security flaw because some user might use a browser that does not implement content security policy. Yes, the user shouldn't be using such an insecure browser, but the website itself should not allow for scripts to be injected and not properly encoded.
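The defense-in-depth point amounts to: encode untrusted output even when a CSP layer would also block the script. A minimal sketch using Python's stdlib escaping (illustrative only, not any particular site's code):

```python
# Encode untrusted input before rendering it into HTML, independent of
# whatever Content-Security-Policy headers the site also sends.
import html

user_input = '<script>alert(1)</script>'
safe = html.escape(user_input)
print(safe)  # &lt;script&gt;alert(1)&lt;/script&gt;
```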
Don't you mean >N days? If a maintenance schedule requires maintenance every 60 days but the plane needs to be rebooted every 50 days, that would be cause for concern.
Of course. Edited.
Maybe regular maintenance does not always include rebooting the computer, or removing the batteries and powering it off and on, etc. If it is a safety issue, it is safer to make it explicit and not leave it optional.
If regular maintenance does not include rebooting the computer, this is newsworthy.
If the maintenance schedule does document system reboots, this is boring and business as expected... no different than periodically reinflating tires or replacing oil. I'd have no concerns flying on such a plane.
I understand, but as developers it is interesting that even in such critical software there are such bugs (in case it was not designed with the reboots in mind from the start)
1 reply →
Regular maintenance includes replacing parts which have lifetimes measured in hours, having to reboot a computer on a schedule is just boring.
Indeed, working with a Boeing subcontractor, I've seen a few cases where something is "fixed by process control", where rules for people are designed to circumvent the deficiency in the software.
Basically, adding more code to make software "smarter" for those edge-cases was judged (rightly or wrongly) as having an even higher risk of introducing new bugs and creating new test procedures or invalidating previous testing.
Not all airlines are as strict with maintenance as you’d like:
https://en.wikipedia.org/wiki/Alaska_Airlines_Flight_261#Ext...
I strongly disagree. If the 787 would show an obvious warning before takeoff saying that it must be rebooted, that would be a different story.
The "A checks", the smallest and most frequent, are every 1000 flight hours. That's a longer interval than 51 days.
Recent discussion on touch screens in avionics: https://news.ycombinator.com/item?id=22739718
It does make you wonder what other bugs the software has though? Presumably it's not intentionally designed so that it needs rebooting periodically...
> It does make you wonder what other bugs the software has though? Presumably it's not intentionally designed so that it needs rebooting periodically...
Not necessarily. If the computer will never actually need to run for 51 days continuously, it may be a reasonable trade-off to require the reboot instead of writing (potentially buggy) code to handle a scenario that can be easily prevented from happening.
It reminds me of this story:
https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...
> I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customer's chief software engineer said "Of course it leaks". He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.
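The sizing exercise in that story is just arithmetic; the numbers below are made up purely for illustration:

```python
# Back-of-the-envelope leak budget from the missile anecdote.
# All figures here are hypothetical, not from the original story.
leak_rate_bytes_per_s = 4096        # worst-case measured leak rate
max_flight_time_s = 30 * 60         # longest possible flight: 30 min
safety_factor = 2                   # the "doubled that number" step

extra_ram = leak_rate_bytes_per_s * max_flight_time_s * safety_factor
print(extra_ram // (1024 * 1024))   # 14 -- MiB of extra RAM to fit
```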
If it were working as designed (and properly documented), it does not seem likely that the FAA would find it necessary to issue an Airworthiness Directive.
It's like Facebook engineering: their PHP infrastructure leaks like hell. (I worked on it.)
But it's not an issue, because servers are constantly redeployed for each code deploy.
2 replies →
To me the implication was not that you should write more potentially buggy code to prevent the need to reboot every 50 days, but rather that you should fix the bug that caused it to need rebooting every 50 days.
Not necessarily - memory management depends on the application. There are situations where it is better to just let memory usage grow than to garbage collect. Since airplanes require very routine maintenance anyway, this may be safer.
Lots of things are designed with periodic restarts in mind. From memory, one of the JVM garbage collectors is designed with daily restarts in mind. This is done to avoid having to deal with the expense of memory fragmentation.
This reports that the bug was found and a mitigation was put in place (maintenance schedules were amended to include a reboot). Where is the "fearmongering"? Because it mentions that not implementing the mitigation would be bad?
It's not like when you skip a day in changing engine oil.
I am going out on a limb here but Google makes this easy enough to verify.
Waxing anecdotal from a rando's unverified, admittedly fuzzy memory is a pretty banal application of my agency. “People aren’t sharing info correctly! Allow me to fix this while admitting to a fuzzy understanding myself!”
Imagine what could be done if we curbed that emotional sense of thinking and knowing into agency towards actually learning, rather than bloviating online, bending readers’ moods to our position, and disregarding any literal effort at verification.
Add in behind the back character assassination of the author, and yeah really buying into your expertise. At least I’m being dismissive to your face.
A lot of people on HN are at work at these hours, and don't have the time to do the full research.
You don't know that he didn't do a quick search and nothing came up.
I don’t literally know that but I do literally know how to read.
“I’m going out on a limb here but I seem to remember...“
Offers no sources. Just rambling on a limb.
Throw in the behind-the-back claims of fearmongering against the author, and I know the type. Do they actually know that’s the author’s motivation? Note how rhetorically circular the emotional positioning gets.
Thanks for assuming all of us aren’t paying attention and couldn’t possibly possess a legit view if you don’t.
Having achieved grad degrees in math, joined into debates of reality and consciousness at length with experts, and generally just existed in human culture for 40 years, I have a pretty good emotional sense of when an argument is just ego fluffing nonsense.
I am a recursive axiomatic system too and can pick from numerous conscious positions to build on. None of them have to be the one you’d start from. Or vetted in reality, since we’ll take rambling on limbs as enough to get into a debate about.
Which is my point ultimately, social media is just low effort social normalization to inaction. Let’s upvote this anecdote and not bother further.
If the effort isn’t going to be there to take it seriously, why bother with social media discourse and all the mechanical effort to babysit it? What a waste of time and precious resources to prop up millions of digital tribal spaces for low effort thought work.
Tech nerds build the most useless spaces for themselves to wank in.
1 reply →
You know, there's this trend going on which makes things kind of confusing, and that is accusing everything of being fearmongering even when it may not be.
Things like the coronavirus not being that bad, or surgical masks being useless. I genuinely question whether what you're saying is actually true or just following the trend.
I honestly can't tell these days.
What I find funny is how developers still "forget" to account for counters that increment and will overflow after X days (and the process doesn't catch it).
It was funny in the Windows 95 days (and Unixes knew how to handle those); now it's just sad.
Of course the problem might be a bit more complex and it might be a combination of issues. Still not good though
> It was funny in Windows 95 days (and Unixes knew how to handle those) now it's just sad.
Ummm, this says otherwise:)
https://en.wikipedia.org/wiki/Year_2038_problem
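For the record, that limit comes from a signed 32-bit seconds counter. A quick sketch of where it lands:

```python
# The largest timestamp a signed 32-bit time_t can hold.
from datetime import datetime, timezone

MAX_TIME_T_32 = 2**31 - 1   # 2_147_483_647 seconds since the epoch
limit = datetime.fromtimestamp(MAX_TIME_T_32, tz=timezone.utc)
print(limit.isoformat())    # 2038-01-19T03:14:07+00:00
```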
Now that the turn of the century was 20 years ago and the next century is 80 years away, I feel it's safe to go back to two-digit years for most things.