
Comment by munk-a

6 years ago

> is at best sensationalism, at worst pure fearmongering

I don't know about the case here, but any time I've hit an issue in my work where "Thing X needs to be done every Y or bugs start happening" it's a pretty clear sign of some deeper issues and likely a lot of underlying bad dev processes.

This issue might be as "simple" as a memory leak that will suddenly require reboots every N minutes when a seemingly unrelated patch exacerbates an issue.

Devices in systems like this are full of monotonically increasing sequence numbers used for all manner of coordination and diagnostic functions. In this case it appears to be a way to ensure some recency constraint on critical data. This is an extremely common method of attempting to assess/identify staleness of critical data (i.e. "Is the sequence number I'm looking at before or after the last one I saw, and by how much?") in critical real-time systems.

This is probably a counter that rolls over if it's not reset; the predictability of needing to reset it before time T suggests it's a sequence number driven by a hard real-time trigger with an extremely predictable cadence.
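
A minimal sketch of the kind of staleness check described above (the 16-bit width and the function name are assumptions for illustration, not anything from the advisory):

```python
# Hypothetical wraparound-aware staleness check for a 16-bit
# monotonically increasing sequence number.
SEQ_BITS = 16
SEQ_MOD = 1 << SEQ_BITS          # 65536 distinct values
HALF = SEQ_MOD // 2

def seq_delta(last_seen: int, current: int) -> int:
    """Signed distance from last_seen to current, modulo 2**SEQ_BITS.

    Positive means current is newer; negative means current is stale
    (older than the last one we saw).
    """
    d = (current - last_seen) % SEQ_MOD
    return d - SEQ_MOD if d >= HALF else d

# The comparison still works across the rollover from 65535 back to 0:
assert seq_delta(65534, 2) == 4      # newer, even though 2 < 65534
assert seq_delta(2, 65534) == -4     # stale
```

The subtraction-modulo trick is the standard way to answer "before or after, and by how much?" without a rollover edge case, but it only works if both sides are sampled within half the counter's period of each other.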

  • Exactly. I even worked on a medical instrument that had a similar problem.

    Early in development we ran everything (real-time embedded system) on a system interval whose finest level of granularity was a 1ms tick. The system scheduler used a 32-bit accumulator that we knew would roll over after 50 days. However, we were given assurances that the system would have to be powered down for maintenance weekly so it didn't matter. Since proper maintenance is a hard requirement (or the instrument will start reporting failures) that was OK.

    Eventually, some time after release we started getting feedback that the system was shutting down for no apparent reason. We investigated and found that those failures were all due to not having been powered down in months.

    Apparently, since it could take up to 30 minutes after powering on the instrument before it was ready to run, some labs were performing maintenance with the power still on, so the machines were hitting much higher than expected uptimes. In many cases it wasn't an issue, but if time rolled over in the middle of a test, the instrument would flag the "impossible" time change as a fatal error and immediately shut everything down.

    Next release moved to a 64-bit timer. I think we're good :-)

    • That's an interesting issue and makes me happy I'm working with a non-critical device. I like to follow practices (again, in non-critical settings) where cases like that can be accepted - but if such a case is detected we bail fatally. With an airplane or even a medical instrument, the cost of suddenly aborting in the middle of an action could be the plane falling out of the sky or some surgical tool becoming unresponsive at a critical time... So I think trying to keep working is the best course of action, but I thank the stars I work with non-critical applications where I can always declare a bad state and refuse to continue.

      2 replies →

  • I think you are right. 50 days is 4.32e9 ms, which is just a bit under max value of unsigned 32-bit int.

    • It's actually a bit over the max value[1] - I agree though that I'd strongly suspect this issue is related to overflowing a millisecond counter stored in a 32-bit int. The numbers are way too close.

Hey, maybe <51 was just an off-by-one error... or maybe the actual advisory is to be <50 and some PM decided that number was too round or violated an SLA.

      1. 4,294,967,296 or 4.29e9

      7 replies →
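A quick sketch makes the arithmetic in the thread above concrete (Python standing in for the embedded code; the 1 ms tick and 32-bit accumulator are taken from the instrument anecdote, not from anything known about the 787):

```python
# A 32-bit accumulator ticked once per millisecond wraps after 2**32 ms.
UINT32_MAX = 2**32 - 1

def tick(ms: int) -> int:
    """Advance the accumulator by one 1 ms tick, wrapping at 32 bits."""
    return (ms + 1) & UINT32_MAX

# 2**32 ms is just under 50 days...
days_to_wrap = 2**32 / (1000 * 60 * 60 * 24)
assert 49 < days_to_wrap < 50            # ~49.71 days

# ...so a full 50 days of milliseconds slightly overflows the counter:
assert 50 * 24 * 60 * 60 * 1000 > 2**32  # 4.32e9 > 4,294,967,296

# At the wrap, a naive "current - previous" elapsed-time check goes
# hugely negative -- the "impossible" time change the instrument
# flagged as a fatal error:
before = UINT32_MAX
after = tick(before)
assert after == 0 and after - before < 0
```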

Bug-free software of any complexity is at the very least exceedingly improbable, so there is always a tradeoff to be made and a lesser evil to be chosen.

Aircraft firmware that requires mandatory reboots aligned with the maintenance schedule, but works reliably otherwise, inspires more confidence than firmware advertised to run bug-free forever.

  • Aren't Ada and similar languages designed for safety critical cases like this?

    When lives are on the line software should be tested for reliability beyond 51 days. Having to restart is a symptom of reckless disregard for safety IMO.

> When lives are on the line software should be tested for reliability beyond 51 days.

Avionics software is written in a world of verifiable requirements.

      For how many days should the software be required to operate?

      Is it acceptable to add that many [more] days to the software verification schedule in order to verifiably demonstrate that it works according to requirements?

      Why is 51 days not long enough?

      11 replies →

    • I’d be more alarmed about the fact that FAA had to issue a directive to deal with this situation. Either Boeing did not include the reboot in operation or maintenance procedures, or operators did not follow those procedures.

      The requirement of a reboot on its own, though, would not strike me as a blatant disregard for safety, as long as the period between reboots is long enough to exceed the maximum possible length of flight (taking any emergencies into account) with leeway to spare.

      2 replies →

    • > Having to restart is a symptom of reckless disregard for safety IMO

No, it's a symptom of having bugs in your code.

And they can be there for a host of reasons, ranging from "this is a one-off accident" to "systematic failure in the software engineering process".

      2 replies →

  • I have audio devices which, when installed, will run flawlessly until power or hardware failure. The firmware isn't bug free, but the operation never encounters bugs.

Continuous Deployment environments are susceptible to versions of this. You have a slow leak, and nobody notices until Thanksgiving, when the processes that used to run for 3 days now run for 10. And just about the time you think you got that sorted out, Christmas comes along and busts the rest of your teeth out.

  • I worked at Google way, way back when. We had an emergency code-red situation where dozens of engineers from all over the company had to sit in a room and figure out what was making our network overload. After a bit of debugging it became clear that Gmail services were talking to Calendar services with an excessive amount of traffic that nobody would have expected. A little debugging later, it became clear that restarting the Gmail servers fixed the issue. One global rolling restart later and all was well.

    But then the real debugging started. Turns out the service discovery component would health check backend destinations once a second. This was fine, as it made sure we would never try to call a server that was long gone. The bug was that it never stopped health checking a backend, even if service discovery had removed that host from the pool long ago. Gmail had stopped deploying while it got ready for Christmas, and Calendar was doing a ton of small stability-improvement deploys. We created the perfect storm for this specific bug.

    The most alarming part? This bug existed in the shared code that did RPC calls/health checking for all services across Google and had existed for quite a long time. In the end though, Gmail almost took Google offline by not deploying. =)

    • Statistics being what they are, eventually you will have, in the same build, an unfixed bug that requires a restart, and an unfixed bug that only works until you restart. That is never a fun day.
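
A toy model of the leak described in that story (all names here are hypothetical; this is a sketch of the failure mode, not Google's actual RPC library):

```python
# Toy service-discovery client: a periodic health check is started for
# every backend that is added, but removing a backend from the pool
# never stops its checker -- so probe traffic grows with churn.

class Discovery:
    def __init__(self):
        self.pool = set()        # backends eligible for RPCs
        self.checkers = set()    # backends being probed once a second

    def add(self, backend):
        self.pool.add(backend)
        self.checkers.add(backend)

    def remove(self, backend):
        self.pool.discard(backend)
        # BUG: self.checkers is never pruned, so the once-a-second
        # probe keeps firing against backends removed long ago.

    def probes_per_second(self):
        return len(self.checkers)

d = Discovery()
for i in range(100):             # frequent deploys churn backends...
    d.add(f"calendar-{i}")
    d.remove(f"calendar-{i}")
assert len(d.pool) == 0              # pool looks empty and healthy
assert d.probes_per_second() == 100  # ...but probe traffic only grows
```

The fix is a one-liner (cancel the checker in `remove`), which is part of what made it so easy for the bug to hide in shared code for years: nothing misbehaves until one side of the system stops churning.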

Have you heard the anecdote about buffer overflows in missiles? The short story: they don't matter if they happen after you know the missile has exploded. That doesn't, by definition, make it an "underlying bad dev process".

What about running repairs in something like Cassandra? In some cases it's by design. That said, I'm a little surprised an airliner would even go that long without a reboot.

If it goes from 51 days to N minutes, it'll be caught real quick by QA testing for the patch, won't it?

  • I have no idea, hopefully? Or maybe this bug just manifests as a minor fuzziness on calculations that'd fall within acceptable error for seemingly unrelated tests. I also have no idea what Boeing's QA is like and I feel like assuming the best is clearly incorrect.