Comment by sunrunner

4 hours ago

> We paged for every single 500.

Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the investigation would only conclude 'a router dropped something from an internal buffer, but the transaction as a whole was retried by a parent, so the service itself recovered'?

LPisGood 3 hours ago

A reliability engineer from Jane Street gave a great talk about this: five nines of correctness in reporting, etc., isn’t enough for the SEC.

https://youtu.be/zR9PpXWsKFQ

eithed 3 hours ago

A client network timeout shouldn't result in a 500. With a 408 and a retry you should, depending on the business criteria, get either an upsert (the transaction is retried) or a 422 (validation that the given entry already exists).
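
As a rough illustration of the two retry outcomes, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the in-memory `Store`, the `save_entry` function, and the `on_duplicate` switch standing in for "the business criteria". It is not tied to any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the real database."""
    rows: dict = field(default_factory=dict)


def save_entry(store, key, value, on_duplicate="upsert"):
    """Return (status_code, body) for a save request that may be a retry.

    on_duplicate is an assumed policy switch:
      "upsert" -> a retried request overwrites the entry and succeeds
      "reject" -> a retried request gets a 422 because the entry exists
    """
    if key in store.rows:
        if on_duplicate == "reject":
            return 422, {"error": f"entry {key!r} already exists"}
        store.rows[key] = value
        return 200, {"key": key, "value": value}
    store.rows[key] = value
    return 201, {"key": key, "value": value}
```

Either way, the client that timed out and retried gets a definite answer about its entry instead of a generic 500.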

Even if it's a "DB in the datacenter I tried to save to was hit by a meteor" event, you can cater for this so that it doesn't result in a 500 (i.e. DB unreachable, retry in a couple of minutes); the question is whether you want to.
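
One way to sketch the "DB unreachable, retry in a couple of minutes" case is below. The comment above doesn't name a status code; using 503 with a `Retry-After` header is my assumption, and `DatabaseUnreachable`, `handle_save`, and the 120-second value are hypothetical names and numbers chosen for the example.

```python
class DatabaseUnreachable(Exception):
    """Hypothetical error raised when the database cannot be contacted."""


RETRY_AFTER_SECONDS = 120  # assumed value for "a couple of minutes"


def handle_save(save, payload):
    """Call save(payload) and return (status_code, headers, body).

    A known, expected failure (DB unreachable) is mapped to a 503 with a
    Retry-After hint instead of surfacing as an unhandled 500.
    """
    try:
        result = save(payload)
    except DatabaseUnreachable:
        headers = {"Retry-After": str(RETRY_AFTER_SECONDS)}
        return 503, headers, {"error": "DB unreachable, retry later"}
    return 200, {}, result
```

Anything that still escapes as a 500 is then, by construction, a failure nobody anticipated, which is a better candidate for a page.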