Comment by jstanley

14 days ago

My view is that you basically never want exponential backoff.

The only time exponential backoff is useful is if the failure is due to a rate limit and you specifically need a mechanism to reduce the rate at which you are attempting to use it.

In the common case that the thing you're trying to talk to is just down, exponential backoff with base N (e.g. wait 2x longer each time) increases your expected downtime by a factor of N (e.g. 2), because by the time your dependency is working again, you may be waiting up to the same amount of time again before you even retry it! Meanwhile, your service is down and your customers can't use it and your program is doing nothing but sleeping for another 30 minutes before it even checks to see if it can work.

And for what? What is the downside to you if your program retries much more frequently?

I much prefer setting a fixed time period to wait between retries (would you call that linear backoff? no backoff?), so for example if the thing fails you just sleep 1 second and try again, forever. And then your service is working again within 1 second of your dependency coming back up.
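For illustration, a minimal sketch of that fixed-interval retry in Python. The name `retry_fixed` and the injectable `sleep` parameter are made up for this example (the latter only so the loop can be exercised without actually waiting); catching every `Exception` is a simplification.

```python
import time


def retry_fixed(operation, interval=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, waiting `interval` seconds after each failure."""
    while True:
        try:
            return operation()
        except Exception:
            # The delay never grows: once the dependency is back, the next
            # attempt happens within `interval` seconds.
            sleep(interval)
```

In real code you would probably catch only the errors that are worth retrying rather than every `Exception`.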

If you really must use exponential backoff then pick a quite-low upper bound on how long you'll wait between retries. It is extremely frustrating to find out that something wasn't working just because it was sleeping for a long time because the previous handful of attempts failed.
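A rough sketch of what a low upper bound could look like in Python. The function name, the 30-second default cap, and the injectable `sleep` are assumptions for illustration, not anything prescribed above.

```python
import time


def retry_capped_exponential(operation, first=1.0, base=2.0, cap=30.0,
                             sleep=time.sleep):
    """Retry `operation` forever; delays grow by `base` but never exceed `cap`."""
    delay = first
    while True:
        try:
            return operation()
        except Exception:
            sleep(delay)
            # Grow the delay, but clamp it so the worst-case wait stays small.
            delay = min(delay * base, cap)
```

With these defaults the waits go 1, 2, 4, 8, 16, 30, 30, ... seconds, so a recovered dependency is noticed within at most 30 seconds.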

> The only time exponential backoff is useful is if the failure is due to a rate limit and you specifically need a mechanism to reduce the rate at which you are attempting to use it.

Exponential backoff is applicable to any failure where the time it has so far gone unresolved is the primary piece of available data on how long it is likely to take before being resolved. That is a very common situation, which is why it is a good default for most situations where you don't have better knowable-in-advance information about the probability distribution of time to resolve, and where delays aren't super costly (though knowledge of when delays become costly can be used to set a cap on exponential backoff, too).

  • > and where delays aren't super costly

    This is key.

    If delays aren't costly, sure, any algorithm is fine.

    But if you want your service up as soon as possible, why spend pointless minutes calling sleep() when you could be getting things working sooner?

    • > If delays aren't costly, sure, any algorithm is fine.

      Not true, delays aren't the only costs. Compute, network, and developer time digging through logs all cost money, and hammering a service fruitlessly when it is down adds to all of those.

      (It also doesn't help that “system is overloaded” may sometimes be communicated clearly, but also in many systems can manifest in... just about any other kind of error, too.)

> The only time exponential backoff is useful is if the failure is due to a rate limit and you specifically need a mechanism to reduce the rate at which you are attempting to use it.

That's what you should be using exponential backoff for. In practice, the extra latency introduced by the backoff should be maintained for some time even after a successful request has been received, and the interval reduced only gradually over time.
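One way to read that, as a hedged Python sketch: keep a per-dependency delay that grows on failure and only decays slowly on success, instead of resetting to zero on the first good response. The class name `AdaptiveDelay` and every parameter value here are hypothetical.

```python
class AdaptiveDelay:
    """Tracks a pacing delay that rises quickly on failure and falls slowly on success."""

    def __init__(self, first=1.0, base=2.0, cap=60.0, decay=0.9, floor=0.05):
        self.first = first    # delay after the first failure
        self.base = base      # growth factor per further failure
        self.cap = cap        # upper bound on the delay
        self.decay = decay    # shrink factor applied per success
        self.floor = floor    # below this the delay snaps back to zero
        self.delay = 0.0

    def on_failure(self):
        if self.delay < self.first:
            self.delay = self.first
        else:
            self.delay = min(self.delay * self.base, self.cap)
        return self.delay

    def on_success(self):
        # Do not reset immediately: keep pacing requests and relax gradually.
        self.delay *= self.decay
        if self.delay < self.floor:
            self.delay = 0.0
        return self.delay
```

The caller would sleep for `delay` seconds before each request, successful or not, so the rate ramps back up instead of jumping straight to full speed.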

> I much prefer setting a fixed time period to wait between retries (would you call that linear backoff? no backoff?)

I've heard it referred to as truncated exponential backoff.

  • That's only truncated exponential backoff if you do exponential backoff to some point.

    If it's just a fixed retry interval, then it's... a fixed retry interval.

Instead of using a fixed backoff, just use a token-bucket algorithm. Try a few requests every now and then and have each successful request enable another retry.
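A sketch of one possible interpretation, in Python: a retry budget where each retry spends a token, tokens trickle back over time ("every now and then"), and each success earns one back. `RetryBudget` and all of its parameters are invented for this example.

```python
import time


class RetryBudget:
    """Token bucket that limits retries; successes and elapsed time refill it."""

    def __init__(self, capacity=10.0, refill_per_second=0.1,
                 credit_per_success=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.credit_per_success = credit_per_success
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        earned = (now - self.last) * self.refill_per_second
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def try_acquire(self):
        """Return True and spend a token if a retry is currently allowed."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def record_success(self):
        """Each successful request enables another retry."""
        self._refill()
        self.tokens = min(self.capacity, self.tokens + self.credit_per_success)
```

While the dependency is down the bucket drains and retries slow to the trickle rate; once requests start succeeding the budget recovers quickly.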

> basically never want exponential backoff.

> [unless] due to a rate limit

Pretty common use-case for automatic retries...

  • That's fine!

    If it's for a rate limit, sure, use exponential backoff, and put a known upper bound on how long you'll back off for.

    But don't make your application wait for 50 minutes to come back up just because the database was down for 25 minutes. That's the kind of thing I am protesting here.
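To put rough numbers on the database example, here is a small Python sketch that computes when the first retry lands after an outage ends. The schedules (1 second initial delay, base 2, no cap, versus a fixed 1 second) are assumptions chosen for illustration; the exact overshoot depends on where the outage ends relative to the retry schedule.

```python
import itertools


def exponential_delays(first=1.0, base=2.0):
    """Yield uncapped delays: first, first*base, first*base**2, ..."""
    delay = first
    while True:
        yield delay
        delay *= base


def fixed_delays(interval=1.0):
    """Yield the same delay forever."""
    return itertools.repeat(interval)


def first_retry_after(outage, delays):
    """Return the elapsed time of the first retry at or after `outage` seconds."""
    elapsed = 0.0
    for delay in delays:
        elapsed += delay
        if elapsed >= outage:
            return elapsed


outage = 25 * 60  # database down for 25 minutes
print(first_retry_after(outage, exponential_delays()))  # 2047.0 (about 34 minutes)
print(first_retry_after(outage, fixed_delays()))        # 1500.0 (25 minutes)

# Worst case for base 2: the outage ends just after a retry at 1023 s.
print(first_retry_after(1024, exponential_delays()))    # 2047.0 (about 2x the outage)
```

This ignores how long each failed attempt itself takes, which would shift the numbers slightly without changing the shape of the result.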