Comment by munk-a

6 years ago

That's an interesting issue, and it makes me happy I'm working on a non-critical device. I like to follow practices (again, in non-critical settings) where cases like that can be accepted - but if such a case is detected, we bail fatally. With an airplane or even a medical instrument, the cost of suddenly aborting in the middle of an action could be the plane falling out of the sky or a surgical tool becoming unresponsive at a critical moment... So in those settings trying to keep working may be the best course of action, but I thank the stars I work on non-critical applications where I can always declare a bad state and refuse to continue.
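A minimal sketch of that "detect a bad state and bail fatally" practice, assuming hypothetical names (`require`, `apply_discount`) purely for illustration:

```python
import sys

def require(invariant: bool, message: str) -> None:
    """Abort the whole process the moment a 'cannot happen' state is seen.

    Acceptable for a non-critical service; unacceptable where sudden
    termination is itself dangerous (avionics, medical devices).
    """
    if not invariant:
        print(f"FATAL: {message}", file=sys.stderr)
        sys.exit(1)

def apply_discount(price: float, discount: float) -> float:
    # Refuse to continue on out-of-range input rather than silently
    # producing a nonsense price.
    require(0.0 <= discount <= 1.0, f"discount out of range: {discount}")
    return price * (1.0 - discount)
```

The point is that the check is unconditional and the failure is loud: no attempt to limp along with corrupt state.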

> the cost of suddenly aborting in the middle of an action

This is where FMEA (Failure Mode and Effects Analysis) is useful: the likelihood and severity of failures are assessed, and the failure modes that are both dangerous and unacceptably likely to occur are removed by design. The rest are assigned specific ways of being handled.
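That assessment is commonly reduced to a risk priority number (RPN): severity, occurrence, and detection each rated 1-10, with their product used to rank which failure modes need design changes first. A minimal sketch, with illustrative failure modes and ratings (not from any real FMEA):

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (almost certain)
    detection: int   # 1 (certain to be detected) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        # Risk priority number: higher means mitigate first.
        return self.severity * self.occurrence * self.detection

modes = [
    FailureMode("test aborted mid-run", severity=7, occurrence=2, detection=2),
    FailureMode("incorrect result reported", severity=10, occurrence=2, detection=8),
]

# Rank worst-first; top entries get design mitigation, the rest get
# a specified handling strategy (e.g. shut down rather than continue).
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.name}: RPN={m.rpn}")
```

With these (made-up) ratings, "incorrect result reported" outranks "test aborted mid-run" even though both are rare, because it's both more severe and much harder to detect - which is exactly the trade-off described below.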

In this particular case, the severity of the failure (not completing a test) is relatively high, but not unacceptably so, and the fault is very unlikely to occur, since it requires that (a) a lab violates the maintenance protocol we specified and (b) this happens in the window between a test starting and ending. In all other cases it's a non-issue.

If we were to continue running in this scenario, the outcome could be far worse than shutting down, since we would now risk providing incorrect diagnostic data to a physician. Again, the FMEA would say that although shutting down is bad, continuing to run is far worse.

  • I think FMEAs are a very good procedural tool, because they put you into a mindset of considering your system and its functionality that's more "failure first" oriented.

    However, they're also very difficult and time-consuming to perform and keep updated throughout the development lifecycle. And for complex systems, they're necessarily sparse in their real coverage of the system's operational/behavioral domain.

    That said, I think way more software engineering organizations should be doing them as a matter of course, even outside safety-critical systems. At the very least, they're a very useful procedural tool for highlighting blind spots.