
Comment by thesz

10 hours ago

Erlang has a "die and be restarted" philosophy towards process failures, so these "bugs that happen to erlang systems in prod" may not be fixed at all if they are rare enough.

As of now, the post you're replying to says "bugs that regularly happen ... in prod".

Now, if it crashes every 10 years, that is regular, but I think the intended meaning is that it happens often. Back when I operated a large dist cluster, yes, some rare crashes happened that either never got noticed or were triaged as 'wait and see if it happens again', and then they didn't happen again. But "let it crash and restart from a known good state" is a philosophy about structuring error checking more than an operational philosophy: always check for success, and if you don't know how to handle an error, fail loudly and return to a good state to continue.
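To make the "fail loudly and return to a good state" idea concrete, here is a rough sketch in Python rather than Erlang. It is only an analogy, not how OTP supervisors are implemented; `Worker`, `supervise`, and `max_restarts` are names made up for this illustration.

```python
class Worker:
    """Holds state that is rebuilt from a known good value on every restart."""

    def __init__(self):
        self.total = 0  # known good initial state

    def handle(self, msg):
        # Handle only the case we understand; anything else fails loudly
        # instead of guessing at a recovery.
        if not isinstance(msg, int):
            raise ValueError(f"unexpected message: {msg!r}")
        self.total += msg
        return self.total


def supervise(messages, max_restarts=3):
    """Feed messages to a worker, replacing it with a fresh one when it crashes."""
    worker = Worker()
    crashes = []
    for msg in messages:
        try:
            worker.handle(msg)
        except Exception as exc:
            # Record the crash so that someone can monitor and triage it later.
            crashes.append(repr(exc))
            if len(crashes) > max_restarts:
                raise RuntimeError("too many restarts, giving up") from exc
            # Throw away the possibly bad state and start clean.
            worker = Worker()
    return worker.total, crashes
```

Calling `supervise([1, 2, "oops", 5])` ends with a total of 5 and one recorded crash: the state built up before the bad message is thrown away, and the crash is logged rather than swallowed, which is the part operators are supposed to watch.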

Operationally, you are expected to monitor for crashes and figure out how to prevent them in the future. And, IMHO, be prepared to hot load fixes in response, although a lot of organizations don't hot load.