
Comment by im_down_w_otp

6 years ago

Devices in systems like this are full of monotonically increasing sequence numbers used for all manner of coordination and diagnostic functions. In this case it appears to be a way to ensure some recency constraint on critical data. This is an extremely common method of attempting to assess/identify staleness of critical data (i.e. "Is the sequence number I'm looking at before or after the last one I saw, and by how much?") in critical real-time systems.
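A common way to make that before-or-after check tolerate counter wraparound is modular ("serial number") arithmetic, comparing the modular difference against half the counter's range. A minimal sketch, assuming a hypothetical 16-bit sequence number (the names and width here are illustrative, not from any particular system):

```python
SEQ_BITS = 16
SEQ_MOD = 1 << SEQ_BITS      # sequence numbers live in [0, 2^16)
HALF_RANGE = SEQ_MOD // 2

def seq_newer(a: int, b: int) -> bool:
    """True if sequence number a comes after b, tolerating wraparound."""
    # The modular difference is small and positive when a is "after" b,
    # even if the counter wrapped between the two observations.
    return 0 < (a - b) % SEQ_MOD < HALF_RANGE

def seq_gap(a: int, b: int) -> int:
    """How far ahead a is of b ("by how much?"), modulo the counter range."""
    return (a - b) % SEQ_MOD
```

With this, a reading of 3 is correctly judged newer than 0xFFFE even though a raw comparison says otherwise; the scheme only breaks down if more than half the range elapses between observations.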

This is probably a counter that rolls over if it's not reset; the predictability of needing to reset it before time T suggests it's a sequence number driven by a hard real-time trigger with an extremely predictable cadence.

Exactly. I even worked on a medical instrument that had a similar problem.

Early in development we ran everything (real-time embedded system) on a system interval whose finest level of granularity was a 1ms tick. The system scheduler used a 32-bit accumulator that we knew would roll over after 50 days. However, we were given assurances that the system would have to be powered down for maintenance weekly so it didn't matter. Since proper maintenance is a hard requirement (or the instrument will start reporting failures) that was OK.

Eventually, some time after release we started getting feedback that the system was shutting down for no apparent reason. We investigated and found that those failures were all due to not having been powered down in months.

Apparently, since it could take up to 30 minutes after powering on the instrument before it was ready to run, some labs were performing maintenance with the power still on, so the machines were hitting much higher than expected uptimes. In many cases it wasn't an issue, but if time rolled over in the middle of a test, the instrument would flag the "impossible" time change as a fatal error and immediately shut everything down.
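The failure mode above can be reproduced in miniature. A sketch (not the instrument's actual code) of a 32-bit millisecond tick wrapping mid-test: a naive comparison sees time jump backwards, while modular subtraction still yields the correct elapsed time:

```python
U32 = 1 << 32  # a 32-bit millisecond tick wraps at 2^32 ms (~49.7 days)

def tick_advance(t: int, dt_ms: int) -> int:
    """Advance the tick counter, wrapping as 32-bit hardware would."""
    return (t + dt_ms) % U32

def elapsed_ms(now: int, then: int) -> int:
    """Wrap-safe elapsed time: modular subtraction survives rollover."""
    return (now - then) % U32

test_start = U32 - 1_000               # test begins 1 s before rollover
now = tick_advance(test_start, 5_000)  # 5 s later the counter has wrapped

assert now < test_start                      # naive view: time went "backwards"
assert elapsed_ms(now, test_start) == 5_000  # modular view: still correct
```

Treating the naive comparison as authoritative is exactly what turns the rollover into an "impossible" time change and a fatal error.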

Next release moved to a 64-bit timer. I think we're good :-)

  • That's an interesting issue, and it makes me happy I'm working with a non-critical device. I like to follow practices (again, in non-critical settings) where cases like that are anticipated - but if such a case is detected, we bail fatally. With an airplane or even a medical instrument, the cost of suddenly aborting in the middle of an action could be the plane falling out of the sky or a surgical tool becoming unresponsive at a critical moment... So for those systems I think trying to keep working is the best course of action, but I thank the stars I work with non-critical applications where I can always declare a bad state and refuse to continue.

    • > the cost of suddenly aborting in the middle of an action

      This is where FMEA (Failure Modes Effects Analysis) is useful: the likelihood of critical failures is assessed and the ones that are both dangerous and unacceptably likely to occur are removed by design. The rest are assigned specific ways of being handled.

      In this particular case, the severity of the failure (not completing a test) is relatively high, but not unacceptably so, and the fault is very unlikely to occur, since it requires that (a) a lab violate the maintenance protocol we specified and (b) the rollover happen during the window between a test starting and ending. In all other cases it's a non-issue.

      If we were to continue running in this scenario, the outcome could be far worse than shutting down since we would now have the possibility of providing incorrect diagnostic data to a physician. Again, the FMEA would say that although shutting down is bad, continuing to run is far worse.


I think you are right. 50 days is 4.32e9 ms, which is just a bit over the max value of an unsigned 32-bit int (about 4.29e9), so a 1ms counter rolls over at roughly 49.7 days.
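For reference, the arithmetic can be checked directly (a quick sketch; the constants are just unit conversions):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000     # 86_400_000 ms in a day
U32_MAX = 2**32 - 1                  # 4_294_967_295

fifty_days_ms = 50 * MS_PER_DAY      # 4_320_000_000 ms
rollover_days = 2**32 / MS_PER_DAY   # ~49.71 days until a 1ms tick wraps

assert fifty_days_ms > U32_MAX       # 50 days just overshoots 32 bits
assert 49.70 < rollover_days < 49.72
```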