Comment by skywhopper
7 hours ago
At some level, the status updates have to be manual. Any automation you try to build on top is inevitably going to break in a crisis situation.
I found GitHub's old "how many visits to this status page have there been recently" graph on their status page to be an absurdly neat solution to this.
It required zero insight into other infrastructure and absolutely minimal automation, but it immediately gave you an idea of whether it was down for just you or for everybody. Sadly now deceased.
I like that https://discordstatus.com/ shows the API response times as well. There are times when Discord seems to have issues, and those usually correlate very well with increased API response times.
Reddit Status used to show API response times as well, back when I used the site, but they've really watered it down since then. Everything that goes there has to be entered manually now AFAIK. Not to mention that one of the few sections is for "ads.reddit.com", classic.
https://steamstat.us still has this - while not official, it's pretty nice.
They are manual AND political (depending on how big the company is), because having a dashboard go red usually has a bunch of project work behind it.
Yeah, this is something people think is super easy to automate, and it is for the most basic implementation: something like a single test runner. But that basic implementation is prone to false positives and, as you say, to breaking when the rest of your stuff breaks.
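A minimal sketch of that single-runner idea, assuming a hypothetical health endpoint, just to show how little it takes and where the ambiguity already starts:

    # Single external health check (sketch; the URL is hypothetical).
    # Note the built-in ambiguity: a failure could be the service,
    # or just this runner's own network.
    import urllib.error
    import urllib.request

    HEALTH_URL = "https://example.com/health"  # hypothetical endpoint
    TIMEOUT_SECONDS = 5

    def check_once() -> bool:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
                return resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            # Can't tell "service down" apart from "runner's network down".
            return False

    if __name__ == "__main__":
        print("UP" if check_once() else "DOWN")

Even this trivially simple version only tells you something useful if it runs somewhere other than the infrastructure it's watching.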
You can put your test runner on different infrastructure, and now you have a whole new class of false positives to deal with. And it costs you a bit more because you're probably paying someone for the different infra.
You can put several test runners on different infrastructure in different parts of the world. This increases your costs further. The only truly clear signals you get from this are when all runners are passing or all are failing. Any mixture of passes and failures leaves room for misinterpretation. Why is Sydney timing out while all the others are passing? Is that an issue with the test runner or its local infra, or is there an internet event happening (cable cut, BGP hijack, etc.) beyond the local infra?
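A rough sketch of what aggregating those runners might look like, assuming each region reports a simple pass/fail (the region names are made up); the point is that only the unanimous cases are automatable, and the mixed ones still need a person:

    # Combine results from several regional runners (sketch; regions are illustrative).
    from typing import Dict

    def interpret(results: Dict[str, bool]) -> str:
        passes = sum(results.values())
        if passes == len(results):
            return "all clear"
        if passes == 0:
            return "likely outage"
        # Mixed signal: could be a flaky runner, a regional network event
        # (cable cut, BGP hijack), or a genuine partial outage.
        failing = ", ".join(region for region, ok in results.items() if not ok)
        return f"ambiguous: {failing} failing, needs a human before posting"

    if __name__ == "__main__":
        print(interpret({"us-east": True, "eu-west": True, "sydney": False}))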
And thus nearly everyone has a human in the loop to interpret the test results and make a decision about whether to post, regardless of how far they've gone with automation.