Comment by singron
2 days ago
Nearly every global outage at Google has looked vaguely like this. I.e. a bespoke system that rapidly deploys configs globally gets a bad config.
All the standard tools for binary rollouts and config pushes will typically do some kind of gradual rollout.
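To make "gradual rollout" concrete, here is a minimal sketch in Python of the usual shape of that kind of tooling: push to a small canary slice, let it bake, check health, then widen wave by wave. The wave names, bake time, and health check here are hypothetical, not Google's actual tooling.

```python
import time

# Hypothetical rollout waves: a canary slice first, then progressively wider.
WAVES = [
    ["us-central1-canary"],
    ["us-central1", "us-east1"],
    ["europe-west1", "asia-east1"],
    # ... remaining regions
]

BAKE_SECONDS = 5  # illustrative; real bake times are minutes to hours per wave


def push_config(region: str, config: dict) -> None:
    """Stand-in for the actual per-region config push."""
    print(f"pushing {config} to {region}")


def healthy(region: str) -> bool:
    """Stand-in for health checks (error rates, crash loops, rejected requests)."""
    return True


def gradual_rollout(config: dict) -> None:
    for wave in WAVES:
        for region in wave:
            push_config(region, config)
        time.sleep(BAKE_SECONDS)  # let problems surface before widening
        if not all(healthy(r) for r in wave):
            raise RuntimeError(f"halting rollout at wave {wave}; roll back")


gradual_rollout({"quota_policy_version": "v2"})
```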
In some ways Google Cloud has actually greatly improved the situation, since a bunch of global systems were forced to become regional and/or much more reliable. Google also used to have short global outages that weren't publicly remarked on (at the time, if you couldn't connect to Google, you assumed your own ISP was broken), so this event wasn't as rare as you might think. Overall I don't think there is a worsening trend, unless someone has a spreadsheet of incidents proving otherwise.
[I was an SRE at Google several years ago]
From the OP:
> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag signalling that quota policy changes must go through the standard gradual rollout tooling.
The baseline can’t be “the trend isn’t worsening”; the baseline should be that if global config rollouts are a common cause of problems, the bar for letting a config system bypass best practices keeps getting higher. Clearly that didn’t happen here.
One of the problems has been that most users have asked for quotas to be updated as fast as possible and to be consistent across regions, even for global quotas. As a result, people have been prioritising user experience over availability.
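One hypothetical way to reconcile the two goals (fast, globally consistent quota metadata and a gradual rollout) is to let the policy data replicate quickly but gate its activation per region, advancing the gate with the normal staged tooling. A rough sketch only, not how Service Control actually works:

```python
from dataclasses import dataclass


@dataclass
class QuotaPolicy:
    version: int
    limits: dict


# Policy rows replicate everywhere within seconds (fast, consistent).
POLICIES = {
    1: QuotaPolicy(1, {"api_calls_per_min": 1000}),
    2: QuotaPolicy(2, {"api_calls_per_min": 500}),  # the newly inserted change
}

# Per-region activation gate, advanced wave by wave like any other rollout.
ACTIVE_VERSION = {
    "us-central1-canary": 2,  # canary acts on the new policy first
    "us-central1": 1,
    "europe-west1": 1,
}


def effective_policy(region: str) -> QuotaPolicy:
    """Serve whichever policy version the region's gate has reached."""
    return POLICIES[ACTIVE_VERSION.get(region, 1)]


print(effective_policy("us-central1-canary").limits)  # new policy, canary only
print(effective_policy("europe-west1").limits)        # still on the old policy
```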
I hope the pendulum now swings the other way in that discussion.
[disclaimer: I worked as a GCP SRE for a long time, but left a while ago]
Why did you leave, if you don’t mind me asking?