Comment by oxymoron

2 days ago

Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
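To make the "blast radius" point concrete, here is a minimal sketch of the counting argument behind shuffle sharding. It is not AWS's implementation; the function name `shard_overlap_counts` and the numbers (8 nodes, 2 nodes per customer shard) are assumptions chosen for illustration. It enumerates every possible shard and counts how many nodes each one shares with a single fixed shard.

```python
from itertools import combinations
from math import comb


def shard_overlap_counts(n_nodes: int, shard_size: int) -> dict[int, int]:
    """Count possible shards by how many nodes they share with one fixed shard."""
    fixed = set(range(shard_size))  # the shard that is assumed to fail
    counts = {k: 0 for k in range(shard_size + 1)}
    for shard in combinations(range(n_nodes), shard_size):
        counts[len(fixed & set(shard))] += 1
    return counts


# Hypothetical fleet: 8 nodes, each customer assigned 2 of them.
counts = shard_overlap_counts(n_nodes=8, shard_size=2)
total = comb(8, 2)

for overlap, n in sorted(counts.items()):
    print(f"shards sharing {overlap} node(s) with the failed shard: {n} of {total}")
```

Under these assumptions, only 1 of the 28 possible shards loses both nodes when the fixed shard fails, 12 lose one node, and 15 are untouched, which is how a service can be completely down for some customers and unaffected for others.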

"there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.

Downdetector reports started spiking over an hour ago, but every status on the status page is still a green checkmark.

  • With highly distributed services there's always something failing, some small percentage.

    • Sure, but you can still put a message up when some <numeric value> exceeds some <threshold value>, e.g. errors are 50% higher than normal (maybe the SLO is that 99.999% of requests are processed successfully).

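The threshold idea from the reply above could look something like this. It is only a sketch: the function name `status_label`, the baseline error rate, and the 1.5x factor (errors 50% higher than normal) are assumptions for illustration, not how any provider's status page actually works.

```python
def status_label(
    errors: int,
    requests: int,
    baseline_error_rate: float,
    degraded_factor: float = 1.5,  # assumed: flag when errors are 50% above normal
) -> str:
    """Return a status label based on the observed error rate vs. a baseline."""
    if requests == 0:
        return "unknown"  # no traffic, so no basis for a claim either way
    rate = errors / requests
    if rate > baseline_error_rate * degraded_factor:
        return "degraded"
    return "available"


# Hypothetical numbers: the normal error rate is 0.1% of requests.
print(status_label(errors=30, requests=10_000, baseline_error_rate=0.001))
print(status_label(errors=10, requests=10_000, baseline_error_rate=0.001))
```

Comparing against a baseline rather than against zero is what answers the "there's always something failing" objection: the normal background failure rate stays green, and only a deviation from it flips the label.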

> Because a lot of the time, not everyone is impacted

Then such pages should report a partial failure. Indeed, the GCP outage page shows an orange "One or more regions affected" marker, yet every service shows the green "Available" marker, which is apparently not true.

  • There's always a partial outage in large systems, some very small percentage. All clouds should report all red then.

It's not rocket science. Put up a message: "The service is currently degraded and some users may see errors."

They could still show that some issues exist. Their monitoring must know.

The issue is that they don't want to. (It lets them claim good uptime, which may even be true for the average user if most outages affect only small groups.)