Comment by foota

2 days ago

My reading is that this happens on startup, e.g., some config needs to be read when tasks come up. It's easy to have backoff in your normal API path but miss it in all the other places your code talks to services.
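To make the "backoff in the other places" point concrete, here is a minimal sketch of a startup config read wrapped in capped exponential backoff with full jitter. Everything here is hypothetical and for illustration only: `fetch_with_backoff`, its parameters, and the `fetch` callable are not taken from the incident report or from any real client library.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=8, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: between attempts it sleeps for a random time in
    [0, min(max_delay, base_delay * 2**attempt)) ("full jitter").
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # The jitter spreads retries out so that tasks which started
            # together do not all hit the config service in lockstep.
            sleep(rng() * ceiling)
```

At startup you would call something like `config = fetch_with_backoff(load_config)`, where `load_config` stands in for whatever actually reads the config. The `sleep` and `rng` parameters are only there so the behaviour can be tested without real waiting.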

Does Borg not back off your task if it crash-loops? That is how k8s does it.

  • I'm not sure, but it clearly wasn't sufficient if it does.

    I guess the issue here is that if you're crash looping, each task generates load retrying to get the config once it comes up, so even when you're no longer crash looping (and hence no longer being backed off by Borg) you're still causing overload.

    As long as the initial rate of tasks coming up is enough to cause overload, the outage will persist even once all tasks are up (assuming the overload is severe enough to drive the goodput of tasks becoming healthy to near zero).

    Interestingly, one of the mitigations they describe was to fan out config reads to the multi-regional mirrors of the database instead of just the regional us-central1 mirror. Presumably the multi-regional mirrors brought in significantly more capacity than the regional mirror alone, spreading the load.

    I'd be curious to know how much configuration they're loading that it caused such load.
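
The "overload persists even once all tasks are up" argument can be sketched with a toy model. All of this is an assumption made for illustration, not a description of the real system: time moves in ticks, the config store can serve `capacity` reads per tick, and if more than that arrive in one tick, every read in that tick fails (goodput drops to zero).

```python
import random


def simulate(num_tasks, capacity, ticks, backoff, max_delay=64, seed=0):
    """Return how many tasks still lack their config after `ticks` ticks.

    Toy assumption: if more than `capacity` reads arrive in one tick,
    the store is overloaded and every read in that tick fails.
    """
    rng = random.Random(seed)
    next_try = [0] * num_tasks   # tick at which each pending task retries
    failures = [0] * num_tasks   # consecutive failures per task
    pending = set(range(num_tasks))
    for now in range(ticks):
        attempting = [t for t in pending if next_try[t] <= now]
        if len(attempting) <= capacity:
            pending.difference_update(attempting)  # within capacity: all served
            continue
        for t in attempting:  # overloaded: every read in this tick fails
            failures[t] += 1
            if backoff:
                # Capped exponential window with jitter, in ticks.
                window = min(max_delay, 2 ** failures[t])
                next_try[t] = now + 1 + rng.randrange(window)
            else:
                next_try[t] = now + 1  # retry immediately on the next tick
    return len(pending)
```

In this model, `simulate(1000, 50, 2000, backoff=False)` never makes progress: all 1000 tasks retry every tick, the store stays overloaded, and no task ever gets its config. With `backoff=True` the retries spread out until the per-tick load falls under capacity and the backlog should drain. Raising `capacity` to cover the whole fleet, which is roughly what fanning reads out to more mirrors buys you, also clears it even without backoff.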