Comment by HeyLaughingBoy

6 years ago

> the cost of suddenly aborting in the middle of an action

This is where FMEA (Failure Mode and Effects Analysis) is useful: the likelihood of critical failures is assessed and the ones that are both dangerous and unacceptably likely to occur are removed by design. The rest are assigned specific ways of being handled.

In this particular case, the severity of the failure (not completing a test) is relatively high, but not unacceptably so, and the fault is very unlikely to occur, since it requires both that (a) a lab violates the maintenance protocol we specified and (b) the violation happens between a test starting and ending. In all other cases it's a non-issue.

If we were to continue running in this scenario, the outcome could be far worse than shutting down since we would now have the possibility of providing incorrect diagnostic data to a physician. Again, the FMEA would say that although shutting down is bad, continuing to run is far worse.
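To make the triage above concrete, here's a rough sketch in Python. Everything in it is an assumption for illustration: the 1-10 scales, the two thresholds, the `FailureMode` and `disposition` names, and especially the scores, which are made-up numbers and not values from our actual FMEA. A real FMEA typically also rates detectability, which is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # assumed scale: 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (remote) .. 10 (frequent)


# Hypothetical cut-offs; a real project would define these in its risk plan.
DANGEROUS_SEVERITY = 8
UNACCEPTABLE_OCCURRENCE = 4


def disposition(mode: FailureMode) -> str:
    """Dangerous AND unacceptably likely: design it out. Otherwise: handle it."""
    if (mode.severity >= DANGEROUS_SEVERITY
            and mode.occurrence >= UNACCEPTABLE_OCCURRENCE):
        return "remove by design"
    return "assign a handling strategy"


# Illustrative scores only, loosely modeled on the two outcomes discussed above.
modes = [
    FailureMode("abort in the middle of a test", severity=6, occurrence=1),
    FailureMode("report incorrect diagnostic data", severity=10, occurrence=5),
]

for mode in modes:
    print(f"{mode.name}: {disposition(mode)}")
```

With these invented numbers, the mid-test abort lands in the "handle it" bucket, while continuing to run and risking bad data is the one you design out, which is the same conclusion as above.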

I think FMEAs are a very good procedural tool because they put you into a more "failure first" mindset when considering your system and its functionality.

However, they're also very difficult and time-consuming to perform and keep updated throughout the development lifecycle. For complex systems, they're also necessarily sparse in terms of real coverage of the system's operational/behavioral domain.

That said, I think way more software engineering organizations should be doing them as a matter of course, even outside safety-critical systems. At the very least, they're a very useful procedural tool for highlighting blind spots.