
Comment by ofrzeta

16 hours ago

The incident report is interesting. Fast reaction time by the SRE team (2 minutes), then the "red button" rollout. But then "Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load."
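
For anyone curious what the missing piece looks like, here is a minimal sketch of randomized ("full jitter") exponential backoff around a startup read; the function name, error, and delays are invented for illustration, not Google's actual Service Control code:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// readSpannerQuotaTable stands in for the startup read that overloaded the
// regional Spanner table when many Service Control tasks restarted at once.
func readSpannerQuotaTable() error {
	return errors.New("DEADLINE_EXCEEDED") // pretend the table is overloaded
}

// readWithBackoff retries with exponential backoff plus full jitter, so a
// fleet of restarting tasks spreads its retries out instead of hitting the
// database in lockstep.
func readWithBackoff(maxAttempts int) error {
	base := 100 * time.Millisecond
	maxDelay := 30 * time.Second
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = readSpannerQuotaTable(); err == nil {
			return nil
		}
		backoff := base << attempt // 100ms, 200ms, 400ms, ...
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // "full jitter"
		fmt.Printf("attempt %d failed (%v), sleeping %v\n", attempt, err, sleep)
		time.Sleep(sleep)
	}
	return err
}

func main() {
	if err := readWithBackoff(5); err != nil {
		fmt.Println("giving up:", err)
	}
}
```

The random sleep up to the exponential cap is what actually breaks up the herd; plain exponential backoff alone still has every restarted task retrying at the same instants.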

In my experience this happens more often than not: in an exceptional situation like the recovery of many nodes, quotas that make sense in regular operation are exceeded quickly, and you run into another failure scenario. As long as the underlying infrastructure can cope with it, it's good if you can disable quotas temporarily and quickly, or throttle the recovery operations, which naturally take longer in that case.
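
A shared, adjustable rate limit on the recovery path is one way to get that. A rough sketch using golang.org/x/time/rate, with made-up rates and a hypothetical restartTask:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// restartTask is a placeholder for whatever recovery work hits the shared
// backend (here, recreating one Service Control task).
func restartTask(id int) {
	fmt.Printf("%s restarting task %d\n", time.Now().Format("15:04:05.000"), id)
}

func main() {
	// Normal operations: e.g. 5 restarts/second, burst of 5.
	limiter := rate.NewLimiter(5, 5)

	// During a mass recovery an operator (or automation) can lower the rate so
	// the backing database isn't overwhelmed, or raise it once there is
	// headroom; rate.Inf would effectively disable the throttle.
	limiter.SetLimit(2)

	ctx := context.Background()
	for id := 1; id <= 10; id++ {
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		restartTask(id)
	}
}
```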

Exponential backoff is misinformation here. During server startup it reads critical data; there is intentionally no retry, hence no backoff.

A better fix is to quickly spread the load to backup databases that already exist. There are other options too.