Comment by ransom1538

2 days ago

Why can't companies be honest about being down? It helps us all out so we don't spend an hour internalizing.

We are truly in God's hands.

$ prod

Fetching cluster endpoint and auth data. ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists

42 comments


kingstnap 2 days ago

Because they have unrealistic targets so they make up fake uptime numbers. 99.999% would mean not even having an hour of downtime in 10 years.

I remember Reddit being down for like a whole day or so and they claimed 99.5% in that month.
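
A quick sanity check of those numbers, as a minimal sketch. It assumes an average year of 365.25 days and a 30-day month (both assumptions made for illustration). Under them, 99.999% allows about 52.6 minutes of downtime over 10 years, 99.5% allows 3.6 hours in a month, and a full day of downtime in a month works out to roughly 96.7% availability.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year, including leap days
HOURS_PER_MONTH = 30 * 24     # assumption: a 30-day month


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Downtime budget, in hours, for an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours


def achieved_availability_pct(downtime_hours: float, period_hours: float) -> float:
    """Availability actually achieved, as a percentage of the period."""
    return 100 * (1 - downtime_hours / period_hours)


# Five nines over ten years, converted to minutes.
five_nines_minutes = allowed_downtime_hours(99.999, 10 * HOURS_PER_YEAR) * 60

# Downtime budget for a 99.5% month, in hours.
budget_995_hours = allowed_downtime_hours(99.5, HOURS_PER_MONTH)

# Availability for a month that included one full day of downtime.
one_day_down_pct = achieved_availability_pct(24, HOURS_PER_MONTH)

print(f"99.999% over 10 years: {five_nines_minutes:.1f} minutes of downtime allowed")
print(f"99.5% over a month: {budget_995_hours:.1f} hours of downtime allowed")
print(f"One day down in a month: {one_day_down_pct:.2f}% availability")
```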

wbl 2 days ago
Ma Bell hit that decently often.
- Uehreka 2 days ago
  
  Is that even knowable? Like, I know they called it “The Astonishing, Unfailing, Bell System” but if they had an outage somewhere did they actually have an infrastructure of “canary phones” and such to tell in real time? (As in, they’d know even if service was restored in an hour)
  Not trying to snark, I legit got nerdsniped by this comment.
  
- Dylan16807 2 days ago
  
  Running a much simpler system with much more independent nodes.
  It's a lot easier to keep packets flowing than to keep non-self-contained servers serving.

oxymoron 2 days ago

Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
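
To illustrate how shuffle sharding can leave a service fully down for some customers and untouched for others, here is a toy sketch. It is not how any particular cloud implements it; the worker count, shard size, customer names, and function names are all hypothetical.

```python
import random


def assign_shards(customers, num_workers, shard_size, seed=0):
    """Give each customer a pseudo-random subset of workers (its shuffle shard)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    workers = range(num_workers)
    return {c: frozenset(rng.sample(workers, shard_size)) for c in customers}


def impact(shards, failed_workers):
    """Split customers by how much of their shard is on failed workers."""
    failed = set(failed_workers)
    fully_down = [c for c, s in shards.items() if s <= failed]
    degraded = [c for c, s in shards.items() if (s & failed) and not s <= failed]
    unaffected = [c for c, s in shards.items() if not (s & failed)]
    return fully_down, degraded, unaffected


# Hypothetical fleet: 1000 customers, 8 workers, 2 workers per customer.
customers = [f"customer-{i}" for i in range(1000)]
shards = assign_shards(customers, num_workers=8, shard_size=2)

# Suppose both workers serving customer-0 fail (e.g. a poison request).
failed = shards["customer-0"]
fully_down, degraded, unaffected = impact(shards, failed)

print(f"fully down: {len(fully_down)}, degraded: {len(degraded)}, "
      f"unaffected: {len(unaffected)}")
```

With 8 workers and shards of 2 there are 28 possible shards, so only customers who happen to share the exact same pair are fully down; everyone else keeps at least one healthy worker.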

hnuser123456 2 days ago
"there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.
Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.
- deepsun 2 days ago
  
  With highly distributed services there's always something failing, some small percentage.
  
- spwa4 2 days ago
  
  Just say it: they want to lie to 95% of customers.
Eduard 2 days ago
> Because a lot of the time, not everyone is impacted
Then such pages should report a partial failure. Indeed, the GCP outage page lists an orange "One or more regions affected" marker, but all services show the green "Available" marker, which apparently is not true.
- deepsun 2 days ago
  
  There's always a partial outage in large systems, some very small percentage. All clouds should report all red then.
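
One possible middle ground between "always green" and "always red" is to compare the observed error rate against a normal background level. The sketch below only illustrates that idea; the threshold values, status names, and the `classify` function are hypothetical and not taken from any real status page.

```python
from enum import Enum


class Status(Enum):
    AVAILABLE = "available"
    DEGRADED = "degraded"
    OUTAGE = "outage"


def classify(error_rate: float,
             baseline: float = 0.001,
             outage_threshold: float = 0.5) -> Status:
    """Map an error rate (0.0 to 1.0) to a public status.

    baseline: hypothetical background failure rate that is always present.
    outage_threshold: hypothetical rate at which the service counts as down.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0.0 and 1.0")
    if error_rate <= baseline:
        return Status.AVAILABLE  # normal background noise
    if error_rate < outage_threshold:
        return Status.DEGRADED   # some users see errors
    return Status.OUTAGE         # most requests are failing


print(classify(0.0005))  # below the background level
print(classify(0.05))    # "5% chance your instance is down"
print(classify(0.9))     # nearly everything failing
```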
nijave 2 days ago

It's not rocket science. Put a message up: "The service is currently degraded and some users may see errors."
johannes1234321 2 days ago

They still could show that some issues exist. Their monitoring must know.
The issue is that they don't want to. (For claiming good uptime, which may even be true for the average user, if most outages affect only small groups.)
jobs_throwaway 2 days ago

That is still 100% an outage and should be displayed as such

jeanlucas 2 days ago

Because there are contracts related to uptime :)

rixthefox 2 days ago
Those contracts will be monitoring their service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable whether they report the outage properly or not.
- datadrivenangel 2 days ago
  
  The real point of SLAs is to give you a reason to break contracts. If a vendor doesn't meet their contractual promises, that gives you a lot of room to get out of contracts.
rustc 2 days ago
Does any service even say they're "down" anymore? All I see is "elevated error rates".
- colechristensen 2 days ago
  
  4 to 6 hours after the flames are visible from orbit and management has finally given up on the 37th quick fix you do get that red X
  But really not until after it's been on CNN a while.

rapus95 2 days ago

If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route out through the backbone and back in, it won't pick it up. Which is good in some sense, as it means that we can see whether the problem is on the path TO the service or in the service itself.
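
The reasoning above amounts to comparing two vantage points. A minimal sketch, assuming you have one probe result from inside the provider's network and one from the public internet (the function name and return strings are hypothetical):

```python
def localize_fault(internal_ok: bool, external_ok: bool) -> str:
    """Guess where a problem sits from an internal and an external probe."""
    if internal_ok and external_ok:
        return "healthy"
    if internal_ok and not external_ok:
        # The service answers from inside but not from outside.
        return "path to the service"
    if not internal_ok and not external_ok:
        # Both vantage points fail.
        return "the service itself"
    # Internal probe fails while external succeeds: the two disagree in an
    # unexpected way, e.g. a broken internal probe or a partial failure.
    return "inconclusive"


print(localize_fault(internal_ok=True, external_ok=False))
```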

voytec 2 days ago

> Why can't companies be honest with being down

SLA agreements.

organsnyder 2 days ago

Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.
remram 2 days ago

Service level agreements agreements?

9rx 2 days ago

The program that updates the status page is hosted on Google Cloud.

tfsh 2 days ago
It's not. You might be joking, but that comment still isn't helpful.
My understanding is that this is part of Google's internal PSD offering (Public Status Board), which uses SCS (Static Content Service) behind GFE (Google Frontend), which is hosted on Borg, which also runs other large-scale apps such as Search, Drive, YouTube, etc.
- 9rx 2 days ago
  
  How could it not be helpful given that it gave you reason to provide more details that you wouldn't have otherwise shared? You may not have thought this through. There is nothing more helpful. Unless you think your own comment isn't helpful, but then...
  
ashu1461 2 days ago
So even then, it should have been able to report the status correctly. It somehow shows that the status page is not automated and that any change there has to be made manually by someone.
- 9rx 2 days ago
  
  A program that updates the status page failing does not imply that the status page is manually edited. It is not like you would generate a status page on every request.
  
- rapus95 2 days ago
  
  The services ARE healthy; the status page is correct. The backbone which links YOU to the service isn't healthy. Take a look at Cloudflare; they are already working on it.
  

supportengineer 2 days ago

Nobody gets a promotion, that's why.

rozap 2 days ago

Please, won't somebody think of the KPIs.