Comment by rco8786

3 months ago

There are no truly automated status pages. It's an impossible problem. I mean that seriously. At scale you're collecting hundreds of thousands (or millions) of metrics, spans, and logs across tens or hundreds of loosely coupled systems. Building a system that can accurately analyze all of that and decide what the status page should say, in real time and without human intervention, is just not possible with current technology.

Even just the basic question of "are we down, or is our monitoring system just having issues?" requires a human. And it's never simply "are we down", because these are distributed systems we're talking about.

If service X goes down entirely, does that warrant a status page update? Yes? Turns out service X is just running ML jobs in the background and has no customer impact.

If service Z's p95 response latency jumps from 10ms to 1500ms for 5 minutes and 500s spike at the same time, but the overall rate of 200s stays around 98%, are we down? Is that a status page update? Is that one bad actor trying to cause issues? Is that indicative of 2,000 customers experiencing an outage while the other 98,000 operate normally? Is that a bad rack switch causing a few random 500s across the whole customer base, where the service will reject that node and auto-recover in a moment?
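To make the ambiguity concrete, here is a minimal Python sketch (not anyone's real monitoring code). The `summarize` helper, the `(customer_id, status)` record shape, and both synthetic scenarios are assumptions invented for illustration, scaled down to 1,000 customers. It shows two very different incidents producing the same aggregate success rate:

```python
from collections import defaultdict


def summarize(requests):
    """Summarize an iterable of (customer_id, http_status) pairs (hypothetical shape)."""
    total = 0
    ok = 0
    per_customer = defaultdict(lambda: [0, 0])  # customer_id -> [successes, requests]
    for customer_id, status in requests:
        success = status < 500  # treat any non-5xx response as a success
        total += 1
        ok += success
        per_customer[customer_id][0] += success
        per_customer[customer_id][1] += 1
    return {
        "success_rate": ok / total,
        "customers": len(per_customer),
        "customers_fully_down": sum(1 for good, _ in per_customer.values() if good == 0),
        "customers_with_any_error": sum(1 for good, seen in per_customer.values() if good < seen),
    }


# Scenario A (synthetic): 20 of 1,000 customers fail on every one of their 50 requests.
concentrated = [(c, 500 if c < 20 else 200) for c in range(1000) for _ in range(50)]

# Scenario B (synthetic): every customer sees exactly 1 failure in 50 requests.
diffuse = [(c, 500 if i == 0 else 200) for c in range(1000) for i in range(50)]

a = summarize(concentrated)
b = summarize(diffuse)
print(a)
print(b)
```

Both scenarios report a 98% success rate. In A, 20 customers are completely down; in B, nobody is down but all 1,000 customers saw an error. Even this per-customer breakdown only separates two of the cases above. It says nothing about whether the errors come from a bad actor or a failing rack switch, which is the part that still needs a human.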
