Comment by 0xbadcafebee

5 years ago

> If I'm awaken at 2am from being on-call for more than once per quarter, then something is seriously wrong and I will either fix it or quit.

Yes, something is wrong. But it could be many things. Here is how you can find out:

1. Is the thing that's broken a bug in your code? Then it's your fault, so fix it. This means you need better testing too, and maybe a redesign that tolerates failures. Try to trigger the alerts at 9am in dev so that they don't come in at 2am from production.

2. Is the thing that's broken an infrastructure problem that Ops is supposed to deal with? You should probably quit. You can also work with Ops to redesign the infrastructure into something less prone to failure. Often Ops can't do this themselves because they don't know enough about how your apps work. Go talk to them, help them out. Or quit.

3. Is the thing that's broken a false alarm, or not important? Quit. Or work with Ops to create better alarms and tests. Ops doesn't know your app, so you need to help them craft the SLIs and SLOs.

4. Did Ops create all these alerts themselves without your involvement? Quit. Or take ownership of the tests and alerts for your apps.

5. Is it a huge slog to try to figure out how the alerting works, to work with Ops to make changes, to add tests, or to figure out what's broken or not and troubleshoot it? Definitely quit.
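
For point 3, here is a rough sketch of what "better alarms" built on an SLI and SLO could look like: page only when the error budget is burning fast in both a short and a long window, so a brief blip doesn't wake anyone up. Everything here is an assumption for illustration: the `Window` type, the function names, the 99.9% target, and the 14.4 burn-rate threshold are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one time window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(w: Window) -> float:
    """SLI: fraction of requests that failed; 0.0 when there was no traffic."""
    if w.total_requests == 0:
        return 0.0
    return w.failed_requests / w.total_requests


def burn_rate(w: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate(w) / error_budget


def should_page(short: Window, long: Window,
                slo_target: float = 0.999,   # assumed SLO
                threshold: float = 14.4) -> bool:  # assumed burn-rate limit
    """Page only if both windows are burning budget faster than the threshold."""
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)


if __name__ == "__main__":
    # Sustained failures in both windows: worth waking someone up.
    print(should_page(Window(1000, 20), Window(12000, 240)))
    # Short spike only, long window is healthy: let it wait until morning.
    print(should_page(Window(1000, 20), Window(12000, 24)))
```

The point of requiring both windows is that the short one tells you the problem is happening right now and the long one tells you it isn't just noise. The actual numbers are something to work out with Ops, since you know which failures matter for your app.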