Comment by TheDong
5 hours ago
Do you know of a single service at a single company that actually does that?
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and GitHub all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
I've been oncall for a different G service that paged on nearly every error. It used the standard error budget tooling, but on hundreds of user buckets because the engineering around locality-specific configuration was... suspect. Many of these buckets had single-digit user counts. If a user was on their phone and lost signal, I was paged. Very poor oncall experience.
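To make the small-bucket problem concrete, here is a minimal sketch of a per-bucket error-ratio check. The names `BucketStats` and `should_page`, the 0.1% target, and the `min_requests` floor are all assumptions for illustration, not how the tooling described above actually works.

```python
from dataclasses import dataclass


@dataclass
class BucketStats:
    name: str
    requests: int  # requests seen in the alerting window
    errors: int    # failed requests in the same window


def error_ratio(bucket: BucketStats) -> float:
    """Fraction of failed requests; 0.0 for an empty bucket."""
    return bucket.errors / bucket.requests if bucket.requests else 0.0


def should_page(bucket: BucketStats, slo_error_ratio: float = 0.001,
                min_requests: int = 0) -> bool:
    """Page when the bucket's error ratio exceeds the target.

    min_requests is a hypothetical guard: buckets with less traffic
    than this are never paged on individually.
    """
    if bucket.requests < min_requests:
        return False
    return error_ratio(bucket) > slo_error_ratio


# One user losing signal in a tiny bucket versus a large one.
tiny = BucketStats("locale-xx", requests=40, errors=1)
large = BucketStats("locale-yy", requests=200_000, errors=1)

print(should_page(tiny))                     # True: 1/40 = 2.5% > 0.1%
print(should_page(large))                    # False: 1/200000 is far below 0.1%
print(should_page(tiny, min_requests=1000))  # False: below the traffic floor
```

With single-digit users per bucket, one failed request is enough to blow through any reasonable ratio, which is why a traffic floor or aggregating small buckets is a common mitigation.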
The sub-service at IBM Cloud I worked on had an insanely small error budget such that pages were nearly constant. On call was hell week until a few of us insisted on fixing the issues. The "few" of us were contractors. The employees seemed more than willing to just let the pages continue.
Some companies pay more if people are paged. It can create a perverse incentive not to fix problems or, in extreme cases, to watch things go wrong, wait for the page, and then be ready to fix it straight away.
I worked at a large fintech moving billions of dollars in volume a day.
I had a fairly long tenure, where I maintained multiple key services in the critical online payments flow: authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff. You needed one or more of our services to take a payment or serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
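A rough sketch of that policy, assuming each 500 carries some error class label that can be marked as noise. The `Pager` class and the class names are hypothetical, not the tooling we actually used.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Pager:
    noise_classes: Set[str] = field(default_factory=set)
    pages_sent: List[str] = field(default_factory=list)

    def mark_as_noise(self, error_class: str) -> None:
        """Stop paging for a class of 500 judged spurious or not worth fixing."""
        self.noise_classes.add(error_class)

    def on_response(self, status: int, error_class: str) -> bool:
        """Return True if this response paged the on-call engineer."""
        if status != 500 or error_class in self.noise_classes:
            return False
        self.pages_sent.append(error_class)
        return True


pager = Pager()
print(pager.on_response(500, "risk-lookup-failed"))    # True: every 500 pages
pager.mark_as_noise("deploy-connection-reset")
print(pager.on_response(500, "deploy-connection-reset"))  # False: marked as noise
print(pager.on_response(200, ""))                      # False: not a 500
```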
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
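For a sense of scale, here is the back-of-the-envelope arithmetic for five nines. This is a sketch only: whether the SLA is measured in time or in requests is an assumption here, and the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of full outage permitted per period at N nines of availability."""
    return period_minutes / 10 ** nines


def failure_budget(nines: int, total_requests: int) -> int:
    """Whole failed requests permitted out of total_requests at N nines."""
    return total_requests // 10 ** nines


print(f"{downtime_budget_minutes(5):.3f} min/year")  # 5.256 min/year
print(failure_budget(5, 1_000_000))                  # 10
print(failure_budget(5, 50_000))                     # 0
```

At five nines, any window with fewer than 100,000 requests has a failure budget of zero, so a single 500 in that window already exceeds the target.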
We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.
> We paged for every single 500.
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
A reliability engineer from Jane Street gave a great talk about this: five nines of correctness in reporting, etc., isn't enough for the SEC.
https://youtu.be/zR9PpXWsKFQ
A client network timeout shouldn't result in a 500. With a 408 and a retry you should, depending on the business criteria, get either an upsert (the transaction is retried) or a 422 (validation reports that the given entry already exists).
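A minimal sketch of that flow, assuming the client sends an idempotency key with each request. `PaymentService`, `submit_with_retry`, `flaky_sender`, and the 201/200 status choices are illustrative assumptions, with an in-memory dict standing in for the database.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, dict]


class PaymentService:
    """In-memory stand-in for a service keyed by a client-supplied idempotency key."""

    def __init__(self, upsert: bool) -> None:
        self.upsert = upsert
        self.payments: Dict[str, dict] = {}

    def create(self, key: str, payload: dict) -> Response:
        if key not in self.payments:
            self.payments[key] = dict(payload)
            return 201, self.payments[key]
        if self.upsert:
            # Business rule A: a retry overwrites the existing entry.
            self.payments[key].update(payload)
            return 200, self.payments[key]
        # Business rule B: a retry is rejected by validation.
        return 422, {"error": "entry already exists", "key": key}


def submit_with_retry(send: Callable[[], Response], max_attempts: int = 3) -> Response:
    """Retry only on 408; every other status is returned to the caller."""
    response: Response = (408, {})
    for _ in range(max_attempts):
        response = send()
        if response[0] != 408:
            break
    return response


def flaky_sender(service: PaymentService, key: str, payload: dict) -> Callable[[], Response]:
    """Simulate a first attempt whose write lands but whose reply times out."""
    attempts = {"count": 0}

    def send() -> Response:
        attempts["count"] += 1
        result = service.create(key, payload)
        return (408, {}) if attempts["count"] == 1 else result

    return send


upserting = PaymentService(upsert=True)
strict = PaymentService(upsert=False)
print(submit_with_retry(flaky_sender(upserting, "k1", {"amount": 100}))[0])  # 200
print(submit_with_retry(flaky_sender(strict, "k1", {"amount": 100}))[0])     # 422
```

In both cases the payment is stored exactly once and the caller never sees a 500.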
Even for a "the DB in the datacenter I tried to save to was hit by a meteor" event, you can arrange for it not to result in a 500 (i.e., DB unreachable, retry in a couple of minutes); the question is whether you want to.
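One way to sketch this is to catch the storage failure and answer with a 503 plus a `Retry-After` header instead of letting it surface as a 500. The choice of 503, the `DatabaseUnavailable` exception, and `handle_save` are assumptions for illustration; the two-minute delay just mirrors "a couple of minutes" above.

```python
from typing import Callable, Dict, Tuple

HttpResult = Tuple[int, Dict[str, str], dict]  # status, headers, body


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the storage layer when the DB is unreachable."""


def handle_save(save: Callable[[dict], None], record: dict,
                retry_after_seconds: int = 120) -> HttpResult:
    """Map an unreachable database to a retryable response rather than a 500."""
    try:
        save(record)
    except DatabaseUnavailable:
        headers = {"Retry-After": str(retry_after_seconds)}
        return 503, headers, {"error": "storage temporarily unavailable"}
    return 201, {}, record


def meteor_struck_save(record: dict) -> None:
    raise DatabaseUnavailable("primary datacenter unreachable")


status, headers, body = handle_save(meteor_struck_save, {"id": 1})
print(status, headers["Retry-After"])  # 503 120
```

Whether this is the right call is the business question: it keeps 500s reserved for genuine bugs, but it also means an outage no longer shows up in anything that only watches for 500s.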