Comment by fl0ki

10 months ago

Intentional crashing can be fine. Unintentional crashing with telemetry can be fine because you're going to fix it.

Unintentional crashing without telemetry is terrible. I've seen too many systems built to "just panic because it'll restart and retry" that never converge because the retry hits the same conditions and no thought was put into how to monitor what is going wrong.

As you all know, such systems also tend to neglect jitter and backoff, so the retrying clients hot-loop and slam every dependency, even ones that weren't erroring prior to the crash.
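For anyone who hasn't had to write it, here is a minimal sketch of capped exponential backoff with full jitter. The names (`backoff_delays`, `retry`) and the defaults for `base`, `cap`, and `attempts` are made up for illustration, not taken from any particular library; tune them for your own dependencies.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Yield one delay per attempt: a random value in [0, min(cap, base * 2**n))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick anywhere below the ceiling so clients that
        # failed together do not all retry at the same instant.
        yield rng() * ceiling


def retry(operation, base=0.5, cap=30.0, attempts=6, sleep=time.sleep):
    """Call operation until it succeeds; re-raise the error from the final attempt."""
    delays = list(backoff_delays(base, cap, attempts))
    for attempt, delay in enumerate(delays):
        try:
            return operation()
        except Exception:
            if attempt == len(delays) - 1:
                raise  # out of attempts: surface the failure instead of looping
            sleep(delay)
```

The point of the jitter is that a fleet of clients that all crashed on the same bad condition spreads its retries out instead of hammering the dependency in lockstep. The same idea applies to process restarts, not just in-process retries.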

I've seen people shell into k8s pods and poke around at files manually for an all-nighter because they didn't invest even one hour in telemetry beforehand. Even that was a second penance for the first crime: finding out about an outage because of a user escalation rather than an automated alert.

Ironically, sometimes an attempt at monitoring was made but undermined by the crash itself, e.g. Prometheus metrics were exported but lost before they could be scraped.
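One way around the lost-scrape problem is to write something durable on the way down rather than relying only on a pull. This is a rough stdlib-only sketch, not a Prometheus integration: `record_crash` and `install_crash_hook` are hypothetical helper names, and it assumes `path` points somewhere that survives the restart (a mounted volume, say, not the container's scratch filesystem).

```python
import json
import os
import sys
import time
import traceback


def record_crash(exc, path):
    """Append one JSON line describing the exception and force it to disk."""
    record = {
        "time": time.time(),
        "type": type(exc).__name__,
        "message": str(exc),
        "trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # do not leave the record sitting in a buffer
    return record


def install_crash_hook(path):
    """Record unhandled exceptions to path, then run the default handler."""
    def hook(exc_type, exc, tb):
        record_crash(exc, path)
        sys.__excepthook__(exc_type, exc, tb)

    sys.excepthook = hook
```

It only covers unhandled Python exceptions in the main thread, not a SIGKILL or an OOM kill, so it complements external alerting rather than replacing it.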

We have a long way to go educating most developers about production maturity before it's safe to endorse crashing without accounting for the downsides.

This was written in 2006 when monitoring was barely on anyone's radar. It's understandable in that context. People reading it in a modern context have to BYO production maturity.