Comment by ransom1538

2 days ago

Why can't companies be honest about being down? It helps us all out so we don't spend an hour internalizing.

We are truly in God's hands.

$ prod

Fetching cluster endpoint and auth data.

ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists

Because they have unrealistic targets so they make up fake uptime numbers. 99.999% would mean not even having an hour of downtime in 10 years.

I remember Reddit being down for a whole day or so, and they still claimed 99.5% for that month.

  • Ma Bell hit that decently often.

    • Is that even knowable? Like, I know they called it “The Astonishing, Unfailing, Bell System” but if they had an outage somewhere did they actually have an infrastructure of “canary phones” and such to tell in real time? (As in, they’d know even if service was restored in an hour)

      Not trying to snark, I legit got nerdsniped by this comment.

    • Running a much simpler system with much more independent nodes.

      It's a lot easier to keep packets flowing than to keep non-self-contained servers serving.
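
The uptime figures quoted in this subthread are easy to check by hand. The sketch below is only illustrative arithmetic: the function names are made up for this example, and the 365.25-day year and 30-day month are assumptions.

```python
def downtime_budget_minutes(target_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed over period_days at the given availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - target_pct / 100)


def achieved_availability_pct(downtime_minutes: float, period_days: float) -> float:
    """Availability percentage actually achieved given total downtime in the period."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / period_minutes)


# "Five nines" over ten years (assuming 365.25-day years)
ten_year_budget = downtime_budget_minutes(99.999, 10 * 365.25)

# A 99.5% claim over an assumed 30-day month
month_budget = downtime_budget_minutes(99.5, 30)

# What a full day of downtime in that month would actually work out to
one_day_down = achieved_availability_pct(24 * 60, 30)

print(f"99.999% over 10 years allows {ten_year_budget:.1f} minutes of downtime")
print(f"99.5% over 30 days allows {month_budget:.0f} minutes of downtime")
print(f"24 hours down in 30 days is {one_day_down:.2f}% availability")
```

Under those assumptions, five nines allows roughly 52.6 minutes per decade, 99.5% allows 216 minutes (3.6 hours) per month, and a full day of downtime in a month comes out to about 96.67%.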

Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.

  • "there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.

    Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.

  • > Because a lot of the time, not everyone is impacted

    Then such pages should report a partial failure. Indeed, the GCP outage page shows an orange "One or more regions affected" marker, but every service still shows the green "Available" marker, which apparently is not true.

    • There's always a partial outage in large systems, some very small percentage. All clouds should report all red then.

  • It's not rocket science. Put up a message: "The service is currently degraded and some users may see errors."

  • They could still show that some issues exist. Their monitoring must know.

    The issue is that they don't want to. (For claiming good uptime, which may even be true for the average user, if most outages affect only small groups.)
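
For anyone unfamiliar with the shuffle sharding idea mentioned at the top of this subthread, here is a minimal sketch of how it can leave one customer fully down while others are untouched or only degraded. This is not AWS's or Google's implementation; `shard_for`, `customer_status`, the worker counts, and the customer IDs are all hypothetical.

```python
import hashlib
import random


def shard_for(customer_id: str, num_workers: int, shard_size: int) -> frozenset[int]:
    """Deterministically pick shard_size workers (out of num_workers) for a customer."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))  # seed from the customer ID
    return frozenset(rng.sample(range(num_workers), shard_size))


def customer_status(shards: dict[str, frozenset[int]], failed: set[int]) -> dict[str, str]:
    """Classify each customer as 'ok', 'degraded', or 'down' given the failed workers."""
    status = {}
    for customer, workers in shards.items():
        lost = len(workers & failed)
        if lost == 0:
            status[customer] = "ok"
        elif lost == len(workers):
            status[customer] = "down"
        else:
            status[customer] = "degraded"
    return status


# Hypothetical fleet: 8 workers, each customer assigned 2 of them
shards = {f"customer-{i}": shard_for(f"customer-{i}", 8, 2) for i in range(10)}

# Suppose exactly the workers serving customer-0 fail
failed = set(shards["customer-0"])
for customer, state in customer_status(shards, failed).items():
    print(customer, sorted(shards[customer]), state)
```

Only customers whose shard is entirely inside the failed set are "down". The rest are "ok" or "degraded", which is exactly the situation where a single green or red marker cannot describe everyone's experience.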

Because there are contracts related to uptime :)

  • The companies holding those contracts will be monitoring service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable whether they report the outage properly or not.

    • The real point of SLAs is to give you a reason to break contracts. If a vendor doesn't meet their contractual promises, that gives you a lot of room to get out of contracts.

  • Does any service even say they're "down" anymore? All I see is "elevated error rates".

    • 4 to 6 hours after the flames are visible from orbit and management has finally given up on the 37th quick fix, you do get that red X.

      But really not until after it's been on CNN a while.

If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route out through the backbone and back in, it won't pick the problem up. Which is good in some sense, as it means we can tell whether the problem is on the path TO the service or in the service itself.
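
The distinction drawn here can be written down as a tiny decision rule. This is only a sketch of the reasoning, assuming you have one probe running inside the provider's network and one running from the public internet; the `Diagnosis` names and `diagnose` function are made up for illustration.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "service and path both look fine"
    PATH_FAULT = "service is up, but the path to it is broken"
    SERVICE_FAULT = "the service itself is failing"


def diagnose(internal_ok: bool, external_ok: bool) -> Diagnosis:
    """Combine an internal probe and an external probe into a rough diagnosis."""
    if not internal_ok:
        # The probe that never leaves the provider's network already fails,
        # so the backbone cannot be the (only) cause.
        return Diagnosis.SERVICE_FAULT
    if not external_ok:
        # Healthy from the inside, unreachable from the outside.
        return Diagnosis.PATH_FAULT
    return Diagnosis.HEALTHY


print(diagnose(internal_ok=True, external_ok=False).value)
```

A status page driven only by the internal probe would stay green in the `PATH_FAULT` case, even though users on the outside cannot reach the service.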

> Why can't companies be honest with being down

SLA agreements.

  • Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.

The program that updates the status page is hosted on Google Cloud.

  • It's not. You might be joking, but that comment still isn't helpful.

    My understanding is this is part of Google's internal PSD offering (Public Status Board) which uses SCS (Static Content Service) behind GFE (Google Frontend) which is hosted on Borg, and deploys other large scale apps such as Search, Drive, YouTube, etc.

    • How could it not be helpful given that it gave you reason to provide more details that you wouldn't have otherwise shared? You may not have thought this through. There is nothing more helpful. Unless you think your own comment isn't helpful, but then...

  • So even then, it should have been able to report the status correctly. It suggests that the status page is not automated and that any change there has to go through a person manually.

    • A program that updates the status page failing does not imply that the status page is manually edited. It is not like you would generate a status page on every request.

    • The services ARE healthy; the status page is correct. The backbone that links YOU to the service isn't healthy. Take a look at Cloudflare, they are already working on it.
