Comment by arkx

5 hours ago

Me too. What good is a status page that's not automated?

There are no truly automated status pages. It's an impossible problem. I mean that seriously. At scale you're collecting hundreds of thousands (or millions) of metrics/spans/logs across tens or hundreds of loosely coupled systems. Building a system that can accurately analyze these and decide what the status page should say, in real time, without human intervention, is just not possible with current technology.

Even just the basic question of "are we down or is our monitoring system just having issues" requires a human. And it's never "are we down", because these are distributed systems we're talking about.

If service X goes down entirely, does that warrant a status page update? Yes? Turns out system X is just running ML jobs in the background and has no customer impact.

If service Z's p95 response latency jumps from 10ms to 1500ms for 5 minutes and 500s spike at the same time, but the overall rate of 200s stays around 98%, are we down? Is that a status page update? Is that one bad actor trying to cause issues? Is that indicative of 2,000 customers experiencing an outage while the other 98,000 operate normally? Is that a bad rack switch that's causing a few random 500s across the whole customer base, where the service will reject that node and auto-recover in a moment?
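
To make that ambiguity concrete, here's a rough Python sketch. Everything in it is hypothetical (the function names, the 10 requests per customer, the "every 50th request fails" pattern); only the 98% figure and the 2,000-of-100,000 split come from the scenario above. It builds two synthetic request logs with the same aggregate success rate but very different customer impact:

```python
def aggregate_success_rate(requests):
    """Fraction of requests that returned 200. requests: list of (customer_id, status)."""
    ok = sum(1 for _, status in requests if status == 200)
    return ok / len(requests)


def affected_customers(requests):
    """Customers that saw at least one 5xx response."""
    return {cid for cid, status in requests if 500 <= status < 600}


def concentrated_outage(n_customers=100_000, n_down=2_000, per_customer=10):
    """Hypothetical case A: the first n_down customers fail every request."""
    return [
        (cid, 500 if cid < n_down else 200)
        for cid in range(n_customers)
        for _ in range(per_customer)
    ]


def scattered_errors(n_customers=100_000, per_customer=10, every=50):
    """Hypothetical case B: every `every`-th request fails, whoever sent it."""
    requests = []
    i = 0
    for cid in range(n_customers):
        for _ in range(per_customer):
            requests.append((cid, 500 if i % every == 0 else 200))
            i += 1
    return requests


if __name__ == "__main__":
    for name, build in [("concentrated", concentrated_outage),
                        ("scattered", scattered_errors)]:
        log = build()
        print(name,
              f"success_rate={aggregate_success_rate(log):.2%}",
              f"affected_customers={len(affected_customers(log))}")
```

With these made-up defaults both logs come out at a 98% success rate, but in one case 2,000 customers are completely down and in the other 20,000 customers each see an occasional failed request. An automated rule that only looks at the aggregate can't tell them apart, and even with the per-customer breakdown someone still has to decide which one deserves a status page update.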

I can answer that: once the lawyers take an interest in your SLAs, you need to check with them whether this is really an incident. Otherwise, you might lose some contract money, and nobody wants that.