Comment by btown
2 days ago
From the OP:
> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”
The baseline can’t be “the trend isn’t worsening;” the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.
One of the problems has been that most users have requested that quotas get updated as fast as possible and that they should be consistent across regions, even for global quotas. As such, people have been prioritising user experience rather than availability.
I hope the pendulum swings the other way around now in the discussion.
[disclaimer that I worked as a GCP SRE for a long time, but left recently]
Why did you leave? If you don’t mind my asking.
Not sure if you'll get an answer (I'd be interested in a response as well), but from the blog in their profile it looks like they moved to be a 'member of technical staff working in the AI Reliability Engineering (AIRE) team at Anthropic'. So it might have just been an upward move to something different/more-exciting.