Comment by thanhhaimai

2 days ago

The status page is green, but there are outages reported: https://downdetector.com/status/google-cloud/

Why even have a status page? Someone reported that their org of >100,000 users can't use Google Meet. If corps aren't going to update their status page, might as well just not have one.

https://www.google.com/appsstatus/dashboard/

https://status.cloud.google.com/index.html

Edit: The GCP status page got updated <1 minute after I posted this, showing affected services are Cloud Data Fusion, Cloud Memorystore, Cloud Shell, Cloud Workstations, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataproc, Google Cloud Storage, Identity and Access Management, Identity Platform, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Vertex AI Search

  • If the corporation controls the status page, there's no situation where you can trust it to have accurate information. None. The incentives will never be aligned in this regard. It's just too tempting and easy for the corp to control the narrative when they maintain their own status page.

    The only accurate status pages are provided by third party service checkers.

    • > The incentives will never be aligned in this regard.

      Well, yes, incentives. Do big customers with wads of cash have an incentive to demand accurate reporting from their suppliers, so they can react faster instead of trying to identify issues themselves? If there's systematic underreporting, then apparently not. Though in this case they did update their page.

  • I have zero faith in status pages. It's easier and more reliable to just check Twitter.

    Heroku was down for _hours_ the other day before there was any mention of an incident. Meanwhile there were hundreds of comments across Twitter, HN, Reddit, etc.

    • Anecdotally, the status pages have been taken away from engineering and are run by customer support and marketing.
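
For what it's worth, the third-party checker idea from the replies above is simple to sketch. This is a minimal Python illustration under stated assumptions, not how Downdetector or any real checker works: `probe`, `classify`, and both thresholds are hypothetical names and values picked for the example. It probes a URL from outside the provider's network and turns a batch of results into a three-level status instead of a binary green check.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers without a server-side (5xx) error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx means the service answered; 5xx (e.g. 503, 504) counts as a failure.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, and timeouts all count as failures.
        return False


def classify(outcomes, degraded_threshold=0.05, outage_threshold=0.5):
    """Map a list of probe results (True = success) to a coarse status.

    The thresholds are arbitrary illustrative values, not an industry standard.
    """
    if not outcomes:
        return "unknown"
    failure_rate = outcomes.count(False) / len(outcomes)
    if failure_rate >= outage_threshold:
        return "outage"
    if failure_rate >= degraded_threshold:
        return "degraded"
    return "available"
```

In practice the outcomes would come from many vantage points and accounts, e.g. `classify([probe(url) for url in urls])` with your own list of endpoints, since a single probe location can miss an outage that only hits some users.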

Yeah, my company of hundreds of people working remotely is seeing 90%+ failures connecting to Google Meet. Joining a meeting just results in a 504.

Why can't companies be honest with being down? It helps us all out so we don't spend an hour investigating internally.

We are truly in God's hands.

$ prod

Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=503, message=Visibility check was unavailable. Please retry the request and contact support if the problem persists

  • Because they have unrealistic targets, so they make up fake uptime numbers. 99.999% would mean less than an hour of downtime in 10 years (about 53 minutes).

    I remember Reddit being down for like a whole day or so and they claimed 99.5% for that month.

  • Because a lot of the time, not everyone is impacted, as the systems are designed to contain the "blast radius" of failures using techniques such as cellular architecture and [shuffle sharding](https://aws.amazon.com/builders-library/workload-isolation-u...). So sometimes a service is completely down for some customers and fully unaffected for other customers.

    • "there is a 5% chance your instance is down" is still a partial outage. A green check should only mean everything (about that service) is working for everyone (in that region) as intended.

      Downdetector reports started spiking over an hour ago but there still isn't a single status that isn't a green checkmark on the status page.

    • > Because a lot of the time, not everyone is impacted

      Then such pages should report a partial failure. Indeed, the GCP outage page lists an orange "One or more regions affected" marker, but all services show the green "Available" marker, which apparently is not true.

    • It's not rocket science. Put up a message: "The service is currently degraded and some users may see errors."

    • They could still show that some issues exist. Their monitoring must know.

      The issue is that they don't want to. (For claiming good uptime, which may even be true for the average user, if most outages affect only small groups.)

  • Because there are contracts related to uptime :)

    • The companies holding those contracts will be monitoring service availability on their own. If Google can't be honest, you can bet your bottom dollar the companies paying for that SLA are going to hold them accountable whether they report the outage properly or not.

  • If half the internet is down, which it apparently is, it's usually not the service in question but some backbone service like Cloudflare. And since internal health monitoring doesn't route out through the backbone and back in, it won't pick that up. Which is good in some sense, as it means we can tell whether the problem is on the path TO the service or in the service itself.

  • > Why can't companies be honest with being down

    SLAs.

    • Any customer with enough leverage to negotiate meaningful SLAs will also have the leverage to insist that uptime is not derived from the absence of incidents on public-facing status pages.

  • The program that updates the status page is hosted on Google Cloud.

    • It's not. You might be joking, but that comment still isn't helpful.

      My understanding is this is part of Google's internal PSD offering (Public Status Board), which uses SCS (Static Content Service) behind GFE (Google Frontend). That is hosted on Borg, which also runs other large-scale apps such as Search, Drive, YouTube, etc.

    • So even then, it should have been able to report the status correctly. It suggests that the status page is not automated and that any change there has to go through a person manually.

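
The uptime figures quoted in the replies above (99.999%, and 99.5% for a month with roughly a day of downtime) are easy to sanity-check. A minimal sketch in Python, assuming a 365.25-day year and a 30-day month; the function names are made up for illustration:

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of downtime allowed in the period at the given availability."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1 - availability_percent / 100)


def achieved_availability_percent(downtime_minutes, period_days):
    """Availability actually achieved given the observed downtime."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


ten_years_days = 10 * 365.25

# Five nines over ten years: under an hour in total.
print(f"{downtime_budget_minutes(99.999, ten_years_days):.1f}")  # 52.6

# 99.5% over a 30-day month: a budget of 3.6 hours.
print(f"{downtime_budget_minutes(99.5, 30):.1f}")  # 216.0

# A full day down in a 30-day month: nowhere near 99.5%.
print(f"{achieved_availability_percent(24 * 60, 30):.2f}")  # 96.67
```

So a provider that was really down for a whole day could only report 99.5% for that month by counting the outage as partial, or by not counting most of it at all.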

Whichever product person is in charge of the status page should be ashamed.

How could you possibly trust them with your critical workloads? They don't even tell you whether or not their services work (despite obviously knowing).
