Comment by kayodelycaon
10 months ago
Crash-only is really hard to implement if another system is involved that isn't crash-only. If you crash in the middle of a network request, you may not know what state the other system is in.
I've had to deal with buggy mainframe software whose error messages bore no relation to how much of an operation had actually succeeded. (And no way to ask it after the fact...) Welcome to the special hell.
Idempotent APIs + sane timeouts + retries.
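To make that concrete, here is a minimal sketch of the pattern, assuming the remote side accepts an idempotency key. `FlakyServer`, `deposit`, and `deposit_with_retries` are hypothetical names invented for illustration, and the timeout is simulated by raising `TimeoutError` rather than by a real network call.

```python
import uuid


class FlakyServer:
    """Hypothetical in-memory stand-in for a remote service."""

    def __init__(self, drop_first_n_responses):
        self.balance = 0
        self.seen = {}  # idempotency key -> result already computed
        self.drops_left = drop_first_n_responses

    def deposit(self, key, amount):
        # Apply the operation at most once per idempotency key.
        if key not in self.seen:
            self.balance += amount
            self.seen[key] = self.balance
        # Simulate the response being lost AFTER the work was done,
        # which is the case the client cannot distinguish from a failure.
        if self.drops_left > 0:
            self.drops_left -= 1
            raise TimeoutError("response lost")
        return self.seen[key]


def deposit_with_retries(server, amount, max_attempts=5):
    # One key for every attempt of this single logical operation.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return server.deposit(key, amount)
        except TimeoutError:
            continue  # safe to retry: the key makes the call idempotent
    raise RuntimeError("gave up after %d attempts" % max_attempts)


server = FlakyServer(drop_first_n_responses=2)
result = deposit_with_retries(server, 100)
```

The first two calls do the work (once) but lose the reply; the third returns the stored result, so the balance is only credited a single time. A real client would also want a backoff between attempts, and none of this helps if the other system cannot deduplicate requests.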
Regular software can crash in the middle of a network request too (e.g., someone accidentally unplugs the wrong network cable, a power outage, etc.).
Crash-only software is likely to test recovery from such situations.
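As a rough illustration of what "testing recovery" can look like, here is a sketch in which startup recovery replays a durable intent journal. Everything here is an assumption made for the example: the journal file format, the `send` and `recover` helpers, and a `remote` dict standing in for an idempotent remote system (`setdefault` plays the role of an idempotent write).

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for the process dying at an arbitrary point."""


def load_journal(path):
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_journal(path, journal):
    # Write to a temp file and rename, so a crash never leaves a
    # half-written journal behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(journal, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def recover(path, remote):
    # Run on every start: finish anything still recorded as pending.
    journal = load_journal(path)
    for key, amount in list(journal.items()):
        remote.setdefault(key, amount)  # no effect if already applied
        del journal[key]
        save_journal(path, journal)


def send(path, remote, key, amount, crash_after_request=False):
    journal = load_journal(path)
    journal[key] = amount
    save_journal(path, journal)       # 1. record intent durably
    remote.setdefault(key, amount)    # 2. perform the idempotent request
    if crash_after_request:
        raise SimulatedCrash()        # die before recording completion
    del journal[key]
    save_journal(path, journal)       # 3. record completion


workdir = tempfile.mkdtemp()
journal_path = os.path.join(workdir, "journal.json")
remote = {}

try:
    send(journal_path, remote, "op-1", 100, crash_after_request=True)
except SimulatedCrash:
    pass

pending_after_crash = load_journal(journal_path)
recover(journal_path, remote)  # what the next startup would do
pending_after_recovery = load_journal(journal_path)
```

The crash leaves `op-1` pending in the journal even though the remote already applied it; recovery replays it harmlessly and clears the entry. This only works because the replay is idempotent, which is exactly the property the original complaint says some external systems lack.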
Your comment suggests that you believe crash-only software to be inherently less reliable than the alternative. But that is the opposite of the stated goal and supposed benefits.
Tbf, isn't that equivalent to a network partition followed by rebooting or replacing one node? The network can go down at any point in the middle of an operation.