← Back to context

Comment by jlhawn

2 days ago

The outage was global. For my team specifically, a global Identity and Access Management outage meant that our internal service accounts could not refresh their short-lived access tokens and so different parts of our infrastructure began to fail over the course of an hour or so, regardless of what region or zone they were in. Services were up, but they could not access critical GCP services because of auth-related issues which resulted in internal service errors for us.

To give an example, our web servers connect to our GCP CloudSQL database via a Cloud SQL Auth Proxy (there's also a connection pooler in between but that also stayed up). The connection to the proxy was always available, but the proxy wasn't able to renew auth tokens it uses to tunnel to the database, regardless of where the webserver or database happened to be located. To mitigate this in the future we're planning to stop using the auth proxy and connect directly via mutual TLS but now it means we have to manage TLS certificates.