
Comment by im_down_w_otp

6 years ago

Devices in systems like this are full of monotonically increasing sequence numbers used for all manner of coordination and diagnostic functions. In this case it appears to be a way to ensure some recency constraint on critical data. This is an extremely common method of attempting to assess/identify staleness of critical data (i.e. "Is the sequence number I'm looking at before or after the last one I saw, and by how much?") in critical real-time systems.
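A common way to make that before-or-after check tolerate counter wraparound is modular ("serial number") arithmetic, comparing the modular difference against half the counter's range. A minimal sketch, assuming a hypothetical 16-bit sequence number (the names and width here are illustrative, not from any particular system):

```python
SEQ_BITS = 16
SEQ_MOD = 1 << SEQ_BITS      # sequence numbers live in [0, 2^16)
HALF_RANGE = SEQ_MOD // 2

def seq_newer(a: int, b: int) -> bool:
    """True if sequence number a comes after b, tolerating wraparound."""
    # The modular difference is small and positive when a is "after" b,
    # even if the counter wrapped between the two observations.
    return 0 < (a - b) % SEQ_MOD < HALF_RANGE

def seq_gap(a: int, b: int) -> int:
    """How far ahead a is of b ("by how much?"), modulo the counter range."""
    return (a - b) % SEQ_MOD
```

With this, a reading of 3 is correctly judged newer than 0xFFFE even though a raw comparison says otherwise; the scheme only breaks down if more than half the range elapses between observations.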

This is probably a counter that rolls over if it's not reset; the predictability of needing to reset it before time T suggests it's a sequence number driven by a hard real-time trigger with an extremely predictable cadence.

Exactly. I even worked on a medical instrument that had a similar problem.

Early in development we ran everything (real-time embedded system) on a system interval whose finest level of granularity was a 1ms tick. The system scheduler used a 32-bit accumulator that we knew would roll over after 50 days. However, we were given assurances that the system would have to be powered down for maintenance weekly so it didn't matter. Since proper maintenance is a hard requirement (or the instrument will start reporting failures) that was OK.

Eventually, some time after release we started getting feedback that the system was shutting down for no apparent reason. We investigated and found that those failures were all due to not having been powered down in months.

Apparently, since it could take up to 30 minutes after powering on the instrument before it was ready to run, some labs were performing maintenance with the power still on, so the machines were hitting much higher than expected uptimes. In many cases it wasn't an issue, but if time rolled over in the middle of a test, the instrument would flag the "impossible" time change as a fatal error and immediately shut everything down.
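The failure mode above can be reproduced in miniature. A sketch (not the instrument's actual code) of a 32-bit millisecond tick wrapping mid-test: a naive comparison sees time jump backwards, while modular subtraction still yields the correct elapsed time:

```python
U32 = 1 << 32  # a 32-bit millisecond tick wraps at 2^32 ms (~49.7 days)

def tick_advance(t: int, dt_ms: int) -> int:
    """Advance the tick counter, wrapping as 32-bit hardware would."""
    return (t + dt_ms) % U32

def elapsed_ms(now: int, then: int) -> int:
    """Wrap-safe elapsed time: modular subtraction survives rollover."""
    return (now - then) % U32

test_start = U32 - 1_000               # test begins 1 s before rollover
now = tick_advance(test_start, 5_000)  # 5 s later the counter has wrapped

assert now < test_start                      # naive view: time went "backwards"
assert elapsed_ms(now, test_start) == 5_000  # modular view: still correct
```

Treating the naive comparison as authoritative is exactly what turns the rollover into an "impossible" time change and a fatal error.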

Next release moved to a 64-bit timer. I think we're good :-)

  • That's an interesting issue, and it makes me happy I'm working with a non-critical device. I like to follow practices (again, in non-critical settings) where cases like that are anticipated - but if such a case is detected, we bail fatally. With an airplane or even a medical instrument, the cost of suddenly aborting in the middle of an action could be the plane falling out of the sky or a surgical tool becoming unresponsive at a critical moment... So for those systems I think trying to keep working is the best course of action, but I thank the stars I work with non-critical applications where I can always declare a bad state and refuse to continue.

    • > the cost of suddenly aborting in the middle of an action

      This is where FMEA (Failure Modes Effects Analysis) is useful: the likelihood of critical failures is assessed and the ones that are both dangerous and unacceptably likely to occur are removed by design. The rest are assigned specific ways of being handled.

      In this particular case, the severity of the failure (not completing a test) is relatively high, but not unacceptably so, and the fault is very unlikely to occur, since it requires that (a) a lab violate the maintenance protocol we specified and (b) the rollover happen during the window between a test starting and ending. In all other cases it's a non-issue.

      If we were to continue running in this scenario, the outcome could be far worse than shutting down since we would now have the possibility of providing incorrect diagnostic data to a physician. Again, the FMEA would say that although shutting down is bad, continuing to run is far worse.


I think you are right. 50 days is 4.32e9 ms, which is just a bit over the max value of an unsigned 32-bit int (about 4.29e9), so a 1ms counter rolls over at roughly 49.7 days.
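For reference, the arithmetic can be checked directly (a quick sketch; the constants are just unit conversions):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000     # 86_400_000 ms in a day
U32_MAX = 2**32 - 1                  # 4_294_967_295

fifty_days_ms = 50 * MS_PER_DAY      # 4_320_000_000 ms
rollover_days = 2**32 / MS_PER_DAY   # ~49.71 days until a 1ms tick wraps

assert fifty_days_ms > U32_MAX       # 50 days just overshoots 32 bits
assert 49.70 < rollover_days < 49.72
```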