Comment by thanhhaimai

8 months ago

The status page is green, but there are outages reported: https://downdetector.com/status/google-cloud/

74 comments

thanhhaimai

Why even have a status page? Someone reported that their org of >100,000 users can't use Google Meet. If corps aren't going to update their status page, might as well just not have one.

https://www.google.com/appsstatus/dashboard/

https://status.cloud.google.com/index.html

Edit: The GCP status page got updated <1 minute after I posted this, showing affected services are Cloud Data Fusion, Cloud Memorystore, Cloud Shell, Cloud Workstations, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataproc, Google Cloud Storage, Identity and Access Management, Identity Platform, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Vertex AI Search

SOLAR_FIELDS 8 months ago
There's no situation in which a corporation controls its own status page and you can trust that page to have accurate information. None. The incentives will never be aligned in this regard. It's just too tempting and easy for the corp to control the narrative when they maintain their own status page.
The only accurate status pages are provided by third party service checkers.
- the8472 8 months ago
  
  > The incentives will never be aligned in this regard.
  Well, yes, incentives. Do big customers with wads of cash have an incentive to demand accurate reporting from their suppliers, so they can react faster instead of trying to identify issues themselves? If there's systematic underreporting, then apparently not. Though in this case they did update their page.
  
supportengineer 8 months ago

Who gets a promotion from a working status board?
nikcub 8 months ago
I have zero faith in status pages. It's easier and more reliable to just check twitter.
Heroku was down for _hours_ the other day before there was any mention of an incident - meanwhile there were hundreds of comments across twitter, hn, reddit etc.
- fooey 8 months ago
  
  Anecdotally, the status pages have been taken away from engineering and are run by customer support and marketing.
paulddraper 8 months ago

> might as well just not have one
This is my position.

jorts 8 months ago

Here's the incident: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

deathanatos 8 months ago

Our company's internal incident channel on this had been going for nearly an hour before GCP finally declared that yes, in fact, things are on fire.
… I get that PR-types probably want to massage the message, but going radio dark is not good PR.

ransom1538 8 months ago

Why can't companies be honest with being down? It helps us all out so we don't spend an hour troubleshooting internally.

We are truly in God's hands.

$ prod

Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists

kingstnap 8 months ago
Because they have unrealistic targets so they make up fake uptime numbers. 99.999% would mean not even having an hour of downtime in 10 years.
I remember reddit being down for like a whole day or so and they claimed 99.5% in that month.
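As a rough check on the arithmetic in the comment above, here is a minimal Python sketch of downtime budgets. The function names and the 30-day month are illustrative assumptions, not taken from any actual SLA; real SLAs usually define their own measurement windows and exclusions.

```python
# Hypothetical helpers for illustration only; not from any provider's SLA tooling.
HOURS_PER_YEAR = 365.25 * 24  # average year, counting leap days


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Downtime budget, in hours, for an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours


def availability_pct(downtime_hours: float, period_hours: float) -> float:
    """Availability, in percent, implied by a given amount of downtime."""
    return 100 * (1 - downtime_hours / period_hours)


ten_years = 10 * HOURS_PER_YEAR
month = 30 * 24  # assumed 30-day month

five_nines_minutes = allowed_downtime_hours(99.999, ten_years) * 60
monthly_budget_hours = allowed_downtime_hours(99.5, month)
full_day_outage_pct = availability_pct(24, month)

print(f"99.999% over 10 years: {five_nines_minutes:.1f} minutes of downtime allowed")
print(f"99.5% over a 30-day month: {monthly_budget_hours:.1f} hours of downtime allowed")
print(f"24 hours down in a 30-day month: {full_day_outage_pct:.2f}% availability")
```

Under these assumptions, 99.999% over ten years allows roughly 52.6 minutes of downtime, 99.5% over a 30-day month allows 3.6 hours, and a full day of downtime in that month works out to about 96.7% availability.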
- wbl 8 months ago
  
  Ma Bell hit that decently often.
  
oxymoron 8 months ago
Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.
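To illustrate why a failure can take out some customers completely while leaving others untouched, here is a minimal sketch of the shuffle-sharding idea linked above. It is not AWS's or Google's implementation; the worker count, shard size, customer names, and hash-seeded selection are all assumptions chosen for illustration.

```python
import hashlib
import random
from math import comb

# All names and sizes below are hypothetical, chosen only for illustration.
NUM_WORKERS = 8
SHARD_SIZE = 2


def shuffle_shard(customer_id: str, num_workers: int, shard_size: int) -> frozenset:
    """Deterministically pick `shard_size` distinct workers for a customer."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # seeded per customer, so the choice is stable
    return frozenset(rng.sample(range(num_workers), shard_size))


def fully_down(shard: frozenset, failed: frozenset) -> bool:
    """A customer is fully down only if every worker in its shard has failed."""
    return shard <= failed


customers = [f"customer-{i}" for i in range(1000)]
shards = {c: shuffle_shard(c, NUM_WORKERS, SHARD_SIZE) for c in customers}

# Assume one customer's traffic takes down every worker in its own shard.
failed = shards["customer-0"]

down = [c for c in customers if fully_down(shards[c], failed)]
degraded = [
    c for c in customers
    if shards[c] & failed and not fully_down(shards[c], failed)
]

print(f"{len(down)} fully down, {len(degraded)} degraded, of {len(customers)} customers")
print(f"possible shards: {comb(NUM_WORKERS, SHARD_SIZE)}")
```

With 8 workers and shards of 2 there are 28 possible shards, so in expectation about 1 in 28 customers shares the failed customer's exact shard and is fully down, while a larger group loses only one of its two workers and is degraded.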
- hnuser123456 8 months ago
  
  "there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.
  Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.
  
- Eduard 8 months ago
  
  > Because a lot of the time, not everyone is impacted
  Then such pages should report a partial failure. Indeed, the GCP outage page shows an orange "One or more regions affected" marker, but all services show the green "Available" marker, which apparently is not true.
  
- nijave 8 months ago
  
  It's not rocket science. Put up a message: "The service is currently degraded and some users may see errors."
- johannes1234321 8 months ago
  
  They still could show that some issues exist. Their monitoring must know.
  The issue is that they don't want to. (For claiming good uptime, which may even be true for the average user, if most outages affect only small groups.)
- jobs_throwaway 8 months ago
  
  That is still 100% an outage and should be displayed as such
jeanlucas 8 months ago
Because there are contracts related to uptime :)
- rixthefox 8 months ago
  
  Those contracts will be monitoring their service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable whether they report the outage properly or not.
  
- rustc 8 months ago
  
  Does any service even say they're "down" anymore? All I see is "elevated error rates".
  
rapus95 8 months ago

If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route out through the backbone and back in, it won't pick the problem up. Which is good in some sense, as it means we can see whether the problem is on the path TO the service or in the service itself.
voytec 8 months ago
> Why can't companies be honest with being down
SLA agreements.
- organsnyder 8 months ago
  
  Any customer with enough leverage to negotiate meaningful SLA agreements will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.
- remram 8 months ago
  
  Service level agreements agreements?
9rx 8 months ago
The program that updates the status page is hosted on Google Cloud.
- tfsh 8 months ago
  
  It's not. You might be joking, but that comment still isn't helpful.
  My understanding is this is part of Google's internal PSD offering (Public Status Board) which uses SCS (Static Content Service) behind GFE (Google Frontend) which is hosted on Borg, and deploys other large scale apps such as Search, Drive, YouTube, etc.
  
- ashu1461 8 months ago
  
  So even then, it should have been able to report the status correctly. It suggests that the status page is not automated and that any change there has to go through someone manually.
  
supportengineer 8 months ago

Nobody gets a promotion, that's why.
rozap 8 months ago

Please, won't somebody think of the KPIs.

FireBeyond 8 months ago

Yeah, my company of hundreds of people working remotely is seeing 90%+ failures connecting to Google Meet; joining a meeting just results in a 504.

milesward 8 months ago

It's updated now, shows the impact to console, dataproc, GCS, IAM and Identity Platform: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

DrBenCarson 8 months ago

Whichever product person is in charge of the status page should be ashamed

How could you possibly trust them with your critical workloads? They don't even tell you whether or not their services work (despite obviously knowing)

carter-0 8 months ago

[dead]

aylmao 8 months ago

AWS is fine: https://health.aws.amazon.com/health/status
My guess is whatever system downdetector uses to "detect downtime" relies on either GCP or Cloudflare (also having issues at the moment: https://www.cloudflarestatus.com/)
roughly 8 months ago
So’s Azure? https://downdetector.com/status/windows-azure/
This is where we get to learn about the one common system all of our “distributed cloud” systems rely on, isn’t it?
- deathanatos 8 months ago
  
  My gut says all clouds spike when one goes down from people misreporting issues.
  I suppose there's always "something something BGP", but that feels less likely.
Macha 8 months ago

Aren't some of these sites partially based on hits (because of the assumption that if enough people are suddenly googling "Is youtube down", then youtube must be having some sort of issue)?
I could see a big outage like this causing people to google "Is AWS down?"
bicx 8 months ago
Almost everything on the downdetector home page is listed as having downtime...
- 0xCAP 8 months ago
  
  At this point I don’t know if I must assume people are trolling or the entire internet is down.
tonyhart7 8 months ago

wtf is going on
ransom1538 8 months ago
It's the entire internet. Check oracle cloud, etc etc. The ENTIRE INTERNET.
- cyberpunk 8 months ago
  
  Quick! Pirate as much music as possible before it goes for good! ;)
- deepsun 8 months ago
  
  Hacker News is fine.
- cyberpunk 8 months ago
  
  Oracle and Azure report no issues on their status pages, so it's likely just Downdetector getting hammered.
  
- tonyhart7 8 months ago
  
  is there a nuclear war or something???