Comment by NovemberWhiskey

2 months ago

Outside of a very narrow range of safety- or otherwise ultra-critical systems, no-one is designing for actual guarantees of performance attributes like throughput or latency. The compromises involved in guarantees are just too high in terms of over-provisioning, cost to build and so on.

In large, distributed systems the best we're looking for is statistically acceptable. You can always tailor a workload that will break a guarantee in the real world.

So you engineer with techniques that reduce the likelihood that workloads you have characterized as realistic can be handled with headroom, and you worry about graceful degradation under oversubscription (i.e. maintaining "good-put"). In my experience, that usually comes down to good load-balancing, auto-scaling and load-shedding.

Virtually all of the truly bad incidents I've seen in large-scale distributed systems are caused by an inability to recover back to steady-state after some kind of unexpected perturbation.

If I had to characterize problem number one, it's bad subscriber-service request patterns that don't provide back pressure appropriately. e.g. subscribers that don't know how to back-off properly and services that don't provide back-pressure. Classical example is a subscriber that retries requests on a static schedule and gives up on requests that have been in-flight "too long", coupled with services that continue to accept requests when oversubscribed.

9 comments

NovemberWhiskey

amw-zero 2 months ago

I think this is less about guarantees and more about understanding behavioral characteristics in response to different loads.

I personally could care less about proving that an endpoint always responds in less than 100ms say, but I care very much about understanding where various saturation points are in my systems, or what values I should set for limits like database connections, or how what the effect of sporadic timeouts are, etc. I think that's more the point of this post (which you see him talk about in other posts on his blog).

NovemberWhiskey 2 months ago
I am not sure that static analysis is ever going to give answers to those questions. I think the best you can hope to do is surface knowledge about the tacit assumptions about dependencies in order to explore their behaviors through simulation or testing.
I think it often boils down to "know when you're going to start queuing, and how you will design the system to bound those queues". If you're not using that principle at design stage then I think you're already cooked.
- amw-zero 2 months ago
  
  Who brought up static analysis?
  I think simulation is definitely a promising direction.
  
  2 replies →

AlotOfReading 2 months ago

It's just realtime programming. I wouldn't say that realtime techniques are limited to a very narrow range of ultra critical systems, given that they encompass everything from the code on your SIM card to games in your steam library.

    In large, distributed systems the best we're looking for is statistically acceptable. You can always tailor a workload that will break a guarantee in the real world.

This is called "soft" realtime.

NovemberWhiskey 2 months ago
"Soft" realtime just means that you have a time-utility function that doesn't step-change to zero at an a priori deadline. Virtually everything in the real world is at least a soft realtime system.
I don't disagree with you that it's a realtime problem, I do however think that "just" is doing a lot of work there.
- AlotOfReading 2 months ago
  
  There are multiple ways to deal with deadline misses for soft systems. Only some of them actually deliver the correct data, just late. A lot of systems will abort the execution and move on with zeros/last computed data instead, or drop the data entirely. A modern network AQM system like CAKE uses both delayed scheduling and intelligent dropping.
  Agreed though, "just" is hiding quite a deep rabbit hole.

bluGill 2 months ago

While you don't need performance guarantees for most things, you still need performance. You can safely let "a small number" of requests "take too long", but if you let "too many" your users will start to complain and go elsewhere. Of course everything in quotes is fuzzy (though sometimes we have very accurate measures for specific things), but you need to meet those requirements even if they are not formal.