Comment by thanhhaimai
2 days ago
The status page is green, but there are outages reported: https://downdetector.com/status/google-cloud/
Why even have a status page? Someone reported that their org of >100,000 users can't use Google Meet. If corps aren't going to update their status page, might as well just not have one.
https://www.google.com/appsstatus/dashboard/
https://status.cloud.google.com/index.html
Edit: The GCP status page got updated <1 minute after I posted this, showing affected services are Cloud Data Fusion, Cloud Memorystore, Cloud Shell, Cloud Workstations, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataproc, Google Cloud Storage, Identity and Access Management, Identity Platform, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Vertex AI Search
There's no situation in which you can trust a status page to have accurate information when the corporation itself controls it. None. The incentives will never be aligned in this regard. It's just too tempting and easy for the corp to control the narrative when they maintain their own status page.
The only accurate status pages are provided by third party service checkers.
> The incentives will never be aligned in this regard.
Well, yes, incentives: do big customers with wads of cash have an incentive to demand accurate reporting from their suppliers, so they can react faster instead of trying to identify issues themselves? If there's systematic underreporting, then apparently not. Though in this case they did update their page.
Who gets a promotion from a working status board?
I have zero faith in status pages. It's easier and more reliable to just check Twitter.
Heroku was down for _hours_ the other day before there was any mention of an incident. Meanwhile there were hundreds of comments across Twitter, HN, Reddit, etc.
Anecdotally, the status pages have been taken away from engineering and are run by customer support and marketing.
> might as well just not have one
This is my position.
Here's the incident: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...
We were nearly an hour into our company's internal incident channel on this before GCP finally declared that yes, in fact, things are on fire.
… I get that PR-types probably want to massage the message, but going radio dark is not good PR.
It's updated now, shows the impact to console, dataproc, GCS, IAM and Identity Platform: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...
Yeah, my company of hundreds of people working remotely is seeing 90%+ failures connecting to Google Meet meetings: joining a meeting just results in a 504.
Why can't companies be honest about being down? It helps us all out so we don't spend an hour investigating internally.
We are truly in God's hands.
$ prod
Fetching cluster endpoint and auth data. ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists
Because they have unrealistic targets so they make up fake uptime numbers. 99.999% would mean not even having an hour of downtime in 10 years.
I remember reddit being down for like a whole day or so and they claimed 99.5% in that month.
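To put rough numbers on the claims above, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the 30-day month and 365.25-day year are assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed over period_days at the given availability."""
    return (1 - availability_pct / 100) * period_days * MINUTES_PER_DAY


def achieved_availability(downtime_minutes: float, period_days: float) -> float:
    """Availability (in percent) implied by downtime_minutes over period_days."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


# Five nines over ten years (assuming 365.25 days per year).
ten_years = downtime_budget_minutes(99.999, 10 * 365.25)

# 99.5% over an assumed 30-day month.
month_995 = downtime_budget_minutes(99.5, 30)

# A full day of downtime in an assumed 30-day month.
day_down = achieved_availability(MINUTES_PER_DAY, 30)

print(f"99.999% over 10 years: {ten_years:.1f} minutes of downtime allowed")
print(f"99.5% over 30 days:    {month_995 / 60:.1f} hours of downtime allowed")
print(f"1 day down in 30 days: {day_down:.2f}% availability")
```

Under those assumptions, five nines allows about 52.6 minutes of downtime over ten years, 99.5% allows about 3.6 hours in a month, and a full day down in a 30-day month works out to roughly 96.7%.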
Ma Bell hit that decently often.
Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
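For anyone unfamiliar with shuffle sharding, a toy sketch of the idea follows. It is not AWS's or Google's implementation; the worker count, shard size, and the names `shard_for` and `fully_down` are all assumptions for illustration.

```python
import hashlib
import random

WORKERS = 8      # hypothetical fleet size
SHARD_SIZE = 2   # hypothetical number of workers per customer


def shard_for(customer_id: str) -> frozenset:
    """Deterministically pick a pseudo-random subset of workers for a customer."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return frozenset(random.Random(seed).sample(range(WORKERS), SHARD_SIZE))


def fully_down(customer_id: str, failed_workers: set) -> bool:
    """A customer is completely down only if every worker in its shard failed."""
    return shard_for(customer_id) <= failed_workers


customers = [f"customer-{i}" for i in range(1000)]

# Simulate losing exactly the two workers that serve customer-0.
failed = set(shard_for("customer-0"))
down = [c for c in customers if fully_down(c, failed)]

print(f"{len(down)} of {len(customers)} customers are fully down")
```

If the two workers in one customer's shard fail, only customers who happen to share exactly that pair are completely down; everyone else keeps at least one healthy worker. That is how a service can be "down" for some customers and look fine to the rest.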
"there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.
Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.
> Because a lot of the time, not everyone is impacted
Then such pages should report a partial failure. Indeed, the GCP outage page shows an orange "One or more regions affected" marker, but every service still shows the green "Available" marker, which is apparently not true.
It's not rocket science. Put a message up "The service is currently degraded and some users may see errors"
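As a rough sketch of what that could look like, the roll-up below goes green only when every region is healthy. The thresholds, region names, and the `service_status` function are hypothetical, not how any real provider computes its status page.

```python
from typing import Dict

# Hypothetical thresholds on the fraction of failing requests.
DEGRADED_THRESHOLD = 0.01
OUTAGE_THRESHOLD = 0.50

MESSAGES = {
    "available": "Available",
    "degraded": "The service is currently degraded and some users may see errors",
    "outage": "The service is currently unavailable for some or all users",
    "unknown": "No monitoring data available",
}


def service_status(error_rate_by_region: Dict[str, float]) -> str:
    """Roll per-region error rates up into one status, driven by the worst region."""
    if not error_rate_by_region:
        return "unknown"
    worst = max(error_rate_by_region.values())
    if worst >= OUTAGE_THRESHOLD:
        return "outage"
    if worst >= DEGRADED_THRESHOLD:
        return "degraded"
    return "available"


# "There is a 5% chance your instance is down" in one region is not green.
status = service_status({"region-a": 0.05, "region-b": 0.0})
print(status, "->", MESSAGES[status])
```

Using the worst region rather than the average is the point: an average across regions hides exactly the partial outages people are complaining about here.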
They still could show that some issues exist. Their monitoring must know.
The issue is that they don't want to. (They want to claim good uptime, which may even be true for the average user if most outages affect only small groups.)
That is still 100% an outage and should be displayed as such
Because there are contracts related to uptime :)
The companies holding those contracts will be monitoring service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable whether they report the outage properly or not.
Does any service even say they're "down" anymore? All I see is "elevated error rates".
If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route out through the backbone and back in, it won't pick the problem up. That is good in some sense, as it lets us see whether the fault is on the path TO the service or in the service itself.
> Why can't companies be honest with being down
SLA agreements.
Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.
Service level agreements agreements?
The program that updates the status page is hosted on Google Cloud.
It's not. You might be joking, but that comment still isn't helpful.
My understanding is this is part of Google's internal PSD offering (Public Status Board) which uses SCS (Static Content Service) behind GFE (Google Frontend) which is hosted on Borg, and deploys other large scale apps such as Search, Drive, YouTube, etc.
So even then, it should have been able to report the status correctly. This suggests that the status page is not automated and that any change there has to go through someone manually.
Nobody gets a promotion, that's why.
Please, won't somebody think of the KPIs.
Whichever product person is in charge of the status page should be ashamed
How could you possibly trust them with your critical workloads? They don't even tell you whether or not their services work (despite obviously knowing)
AWS is fine: https://health.aws.amazon.com/health/status
My guess is whatever system downdetector uses to "detect downtime" relies on either GCP or Cloudflare (also having issues at the moment: https://www.cloudflarestatus.com/)
So’s Azure? https://downdetector.com/status/windows-azure/
This is where we get to learn about the one common system all of our “distributed cloud” systems rely on, isn’t it?
My gut says all clouds spike when one goes down from people misreporting issues.
But I suppose there's always "something something BGP" but that feels less likely.
Aren't some of these sites partially based on hits (on the assumption that if enough people are suddenly googling "Is youtube down", then YouTube must be having some sort of issue)?
I could see a big outage like this causing people to google "Is AWS down?"
Almost everything on the downdetector home page is listed as having downtime...
At this point I don’t know if I must assume people are trolling or the entire internet is down.
wtf is going on
It's the entire internet. Check oracle cloud, etc etc. The ENTIRE INTERNET.
Quick! Pirate as much music as possible before it goes for good! ;)
Hacker News is fine.
Oracle and Azure report no issues on their status pages; it's likely just Downdetector getting hammered.
Is there a nuclear war or something???