Comment by mjb

3 months ago

When does AP help?

It helps when clients (a) can contact a minority partition, (b) can tolerate eventual consistency, and (c) can’t contact the majority partition. These cases are quite rare in modern internet-connected applications.

Consider a 3AZ cloud deployment with remote clients on the internet, and one AZ partitioned off. Most often, clients from the outside will either be able to contact the remaining majority (the two healthy AZs), or will be able to contact nobody. Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.
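To make the three conditions concrete, here is a minimal sketch of the 3AZ scenario in Python. The AZ names and the helpers `has_quorum` and `ap_helps` are hypothetical, made up purely for illustration; the sketch only models which AZs a client can reach, not any real system.

```python
# Hypothetical model of a three-AZ deployment; names are illustrative only.
ALL_AZS = frozenset({"az1", "az2", "az3"})


def has_quorum(side, all_azs=ALL_AZS):
    """A side of the partition can make CP progress only with a strict majority."""
    return len(set(side) & all_azs) > len(all_azs) // 2


def ap_helps(client_reachable, minority_side, majority_side, tolerates_eventual_consistency):
    """AP is useful only when all three conditions hold:
    (a) the client can reach the minority side,
    (b) the client tolerates eventual consistency,
    (c) the client cannot reach the majority side.
    """
    reaches_minority = bool(set(client_reachable) & set(minority_side))
    reaches_majority = bool(set(client_reachable) & set(majority_side))
    return reaches_minority and tolerates_eventual_consistency and not reaches_majority


# One AZ is partitioned off from the other two.
majority = {"az1", "az2"}
minority = {"az3"}

# Common case 1: the client reaches the healthy majority, so CP already works.
common_majority = ap_helps({"az1", "az2"}, minority, majority, True)

# Common case 2: the client reaches nobody, so neither AP nor CP can serve it.
common_nobody = ap_helps(set(), minority, majority, True)

# Rare case: the client has a path into the minority side only.
rare_minority_only = ap_helps({"az3"}, minority, majority, True)
```

Only the last case returns `True`, which is the situation described above as rare for remote internet clients.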

What about internal clients in the partitioned-off DC? Yes, the trade-off is that they won’t be able to make isolated progress. If they’re web servers or whatever, that’s moot because they’re partitioned off and there’s no work to do. The same goes for a training cluster or other highly connected workloads. There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.

Weak consistency is much more interesting as a mechanism for reducing latency (as DynamoDB does, for example) or increasing scalability (as the typical RDBMS ‘read replicas’ pattern does).

2 comments

mjb

lmm 3 months ago

> Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.

It happens any time you have a real partition, where e.g. one country or one office is cut off from the rest of the network. You're assuming that all of your system's users are external, coming in from the internet, and that you don't care about losing a small region when it's isolated from the internet. But most software systems are internal, and if you're a company with 3 locations, being able to continue to work when one is cut off from the other 2 is pretty valuable.

> There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.

I'd say it's pretty normal if you've got a system that actually does anything rather than just gluing together external stuff. Although sadly that may be the minority these days.

mrkeen 3 months ago

Latency = Partition.

There is code running on a server. When you poll that server, that code will choose (in the best case) between giving you whatever answer it has Available, or checking with its peers (on a different continent) to make sure it serves you up-to-date (Consistent) information.

CAP problems being a "rare occurrence" isn't a thing. The running code is either executing AP code or CP code.
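A rough sketch of those two code paths, in Python. The `Replica` class, its `read_ap` and `read_cp` methods, and the version counter are all hypothetical and not taken from any real database. The majority read here is a simplified stand-in for a consistent read, not a full consensus protocol.

```python
class Unavailable(Exception):
    """Raised when a CP read cannot reach a majority of replicas."""


class Replica:
    # Hypothetical in-memory replica; 'reachable' simulates the network.
    def __init__(self, name, version, value):
        self.name = name
        self.version = version
        self.value = value
        self.peers = []
        self.reachable = True

    def read_ap(self):
        """AP path: answer immediately from local state, possibly stale."""
        return self.value

    def read_cp(self):
        """CP path: ask peers first and refuse to answer without a majority."""
        responses = [(self.version, self.value)]
        for peer in self.peers:
            if peer.reachable:  # a real system would pay a network round trip here
                responses.append((peer.version, peer.value))
        cluster_size = len(self.peers) + 1
        if len(responses) <= cluster_size // 2:
            raise Unavailable("cannot reach a majority of replicas")
        # Return the value carrying the highest version seen.
        return max(responses, key=lambda r: r[0])[1]


# Local replica 'a' is behind its two peers.
a = Replica("a", 1, "old")
b = Replica("b", 2, "new")
c = Replica("c", 2, "new")
a.peers = [b, c]

stale_answer = a.read_ap()   # fast, served from local state
fresh_answer = a.read_cp()   # slower, checks with peers first
```

The choice between the two methods is made in the code on every request, whether or not the network happens to be partitioned at that moment. If `b` and `c` become unreachable, `read_ap` still answers and `read_cp` raises `Unavailable`.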