Comment by HeyLaughingBoy

6 years ago

Exactly. I even worked on a medical instrument that had a similar problem.

Early in development we ran everything (a real-time embedded system) on a system timer whose finest granularity was a 1 ms tick. The system scheduler used a 32-bit accumulator that we knew would roll over after about 50 days (2^32 milliseconds is roughly 49.7 days). However, we were given assurances that the system would have to be powered down for maintenance weekly, so it didn't matter. Since proper maintenance is a hard requirement (the instrument will start reporting failures without it), that was OK.

Eventually, some time after release, we started getting feedback that the system was shutting down for no apparent reason. We investigated and found that those failures were all due to instruments not having been powered down in months.

Apparently, since it could take up to 30 minutes after powering on the instrument before it was ready to run, some labs were performing maintenance with the power still on, so the machines were hitting much higher than expected uptimes. In many cases it wasn't an issue, but if time rolled over in the middle of a test, the instrument would flag the "impossible" time change as a fatal error and immediately shut everything down.

Next release moved to a 64-bit timer. I think we're good :-)

That's an interesting issue, and it makes me happy I'm working on a non-critical device. I like to follow practices (again, in non-critical settings) where cases like that can be accepted, but if such a case is detected, we bail fatally. With an airplane, or even a medical instrument, the cost of suddenly aborting in the middle of an action could be the plane falling out of the sky or a surgical tool becoming unresponsive at a critical moment. So in those settings I think trying to keep working is the best course of action, and I thank the stars I work on non-critical applications where I can always declare a bad state and refuse to continue.

  • > the cost of suddenly aborting in the middle of an action

    This is where FMEA (Failure Modes and Effects Analysis) is useful: the likelihood of critical failures is assessed, and the ones that are both dangerous and unacceptably likely to occur are removed by design. The rest are assigned specific ways of being handled.

    In this particular case, the severity of the failure (not completing a test) is relatively high, but not unacceptably so, and the fault is very unlikely to occur, since it requires that (a) a lab violate the maintenance protocol we specified and (b) the rollover happen in the window between a test starting and ending. In all other cases it's a non-issue.

    If we were to continue running in this scenario, the outcome could be far worse than shutting down, since we would now risk providing incorrect diagnostic data to a physician. Again, the FMEA would say that although shutting down is bad, continuing to run is far worse.

    • I think FMEAs are a very good procedural tool, because they put you into a more "failure first" mindset when considering your system and its functionality.

      However, they're also difficult and time-consuming to perform and to keep updated throughout the development lifecycle. For complex systems, they're also necessarily sparse in their real coverage of the system's operational and behavioral domain.

      That said, I think way more software engineering organizations should be doing them as a matter of course, even outside safety-critical systems. At the very least, they're a useful way to highlight blind spots.