Comment by SageThrowaway
2 days ago
(Throwaway since I was part of a related team a while back)
Service Control (Chemist) is a fairly old service; it's been around for about a decade and is critical to a lot of GCP APIs for authn, authz, auditing, quota, etc. It's almost mandated in Cloud.
There's a proxy in the path of most GCP APIs that calls Chemist before forwarding requests to the backend. (Hence I don't think the fail-open mitigation mentioned in the post-mortem will work.)
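To make the fail-open point concrete, here's a rough sketch in Python. Everything in it is made up for illustration (`handle_request`, `check_policy`, `PolicyCheckUnavailable`, `Decision`); it is not the real proxy or Chemist code, just the general shape of "check policy, then forward". The thing to notice is that when the check is unavailable, failing open forwards the request with no authz decision at all.

```python
from dataclasses import dataclass
from typing import Callable


class PolicyCheckUnavailable(Exception):
    """Hypothetical error: the policy service could not give an answer."""


@dataclass
class Decision:
    forwarded: bool  # True if the proxy would pass the request to the backend
    reason: str


def handle_request(
    request: dict,
    check_policy: Callable[[dict], bool],
    fail_open: bool,
) -> Decision:
    """Run the policy check first, then decide whether to forward."""
    try:
        allowed = check_policy(request)
    except PolicyCheckUnavailable:
        # The check itself is down. Failing open keeps the API up but skips
        # the policy decision entirely; failing closed rejects the request.
        if fail_open:
            return Decision(True, "policy check unavailable; failing open")
        return Decision(False, "policy check unavailable; failing closed")

    # The check answered, so its verdict is honored in either mode.
    if allowed:
        return Decision(True, "allowed by policy")
    return Decision(False, "denied by policy")
```

With a `check_policy` that always raises `PolicyCheckUnavailable`, `fail_open=True` forwards every request and `fail_open=False` rejects every request. An explicit deny is never overridden in either mode.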
Both Chemist and the proxy are written in C++, and have picked up a ton of legacy cruft over the years.
The teams have extensive static analysis & testing, gradual rollouts, feature flags, red buttons and strong monitoring/alerting systems in place. The SREs in particular are pretty amazing.
Since Chemist handles a lot of policy checks like IAM, quotas, etc., other teams involved in those areas have contributed to the codebase. Over time, shortcuts have been taken so those teams don’t have to go through Chemist's approval for every change.
However, in the past few years, the organization has seen a lot of churn and a lot of offshoring too, which has led to a bigger focus on flashy new projects led by L8/L9s to justify headcount instead of prioritizing quality, maintenance, and reliability. This shift has contributed to a drop in quality standards and increased pressure to ship things faster (and is one of the reasons I ended up leaving Cloud).
Also, many of the server/service best practices common at Google are not so common here.
That said, in this specific case, the issue seems to be more about lackluster code and code review (iirc the code was merged despite some failures), and pushing config changes instantly through Spanner made it worse.