
Comment by mrkeen

2 days ago

It's wishful thinking. It's like choosing Newtonian physics over relativity because it's simpler or the equations are neater.

If you have strong consistency, then you have at best availability xor partition tolerance.

"Eventual" consistency is the best tradeoff we have for an AP system.

Computation happens at a time and a place. Your frontend is not the same computer as your backend service, or your database, or your cloud providers, or your partners.

So you can insist on full ACID on your DB (which it probably isn't running anyway, btw — search "READ COMMITTED"), but your DB will only be consistent with itself.

We always talk about multiple bank accounts in these consistency modelling exercises. Do yourself a favour and start thinking about multiple banks.

There is no reason a database can’t be both strongly consistent (linearizable, or equivalent) and available to clients on the majority side of a partition. This is, by far, the common case of real-world partitions in deployments with 3 data centers. One is disconnected or fails. The other two can continue, offering both strong consistency and availability to clients on their side of the partition.
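A minimal sketch of the majority-side rule described above (names and numbers here are illustrative, not from any particular system): a replica group of size n can keep serving linearizable operations as long as one side of the partition can still reach a strict majority of members.

```python
# Hypothetical sketch: majority-quorum availability during a partition.

def quorum_size(n: int) -> int:
    """Smallest group that must agree: floor(n/2) + 1."""
    return n // 2 + 1

def side_is_available(reachable: int, n: int) -> bool:
    """True if this side of a partition holds a majority and can
    continue offering strong consistency to its clients."""
    return reachable >= quorum_size(n)

# 3 data centers, one partitioned off: the 2-DC side keeps serving.
assert side_is_available(reachable=2, n=3)
# The isolated DC cannot make progress on its own.
assert not side_is_available(reachable=1, n=3)
```

This is exactly why 3-DC deployments are the common case: any single failure or disconnection still leaves a majority.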

The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!

If you don’t believe me, check out Phil Bernstein’s paper (Bernstein and Das) about this. Or read the Gilbert and Lynch proof carefully.

  • > The Gilbert and Lynch definition of CAP calls this state ‘unavailable’, in that it’s not available to all clients. Practically, though, it’s still available for two thirds of clients (or more, if we can reroute clients from the outside), which seems meaningfully ‘available’ to me!

    That's great for those two thirds but not for the other one third. (Indeed you will notice that it's "available" precisely to the clients that are not "partitioned").

    • When does AP help?

      It helps in the case where clients are (a) able to contact a minority partition, and (b) can tolerate eventual consistency, and (c) can’t contact the majority partition. These cases are quite rare in modern internet-connected applications.

      Consider a 3AZ cloud deployment with remote clients on the internet, and one AZ partitioned off. Most often, clients from the outside will either be able to contact the remaining majority (the two healthy AZs), or will be able to contact nobody. Rarely, clients from the outside will have a path into the minority partition but not the majority partition, but I don’t think I’ve seen that happen in nearly two decades of watching systems like this.

      What about internal clients in the partitioned off DC? Yes, the trade-off is that they won’t be able to make isolated progress. If they’re web servers or whatever, that’s moot because they’re partitioned off and there’s no work to do. Same if they’re a training cluster, or other highly connected workloads. There are workloads that can tolerate a ton of asynchrony where being able to continue while disconnected is interesting, but they’re the exception rather than the rule.

      Weak consistency is much more interesting as a mechanism for reducing latency (as DynamoDB does, for example) or increasing scalability (as the typical RDBMS ‘read replicas’ pattern does).
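A toy sketch of that read-replica trade-off (the log and lag values are made up for illustration): replicas apply the primary's log asynchronously, so a replica read may return a stale value, which is the price of the cheaper/scalable read path.

```python
# Hypothetical sketch: asynchronous replication means replica reads can lag.

primary_log = ["x=1", "x=2", "x=3"]  # committed writes, in order
replica_applied = 2                  # replica has replayed only 2 entries

def read_primary() -> str:
    """Strongly consistent read: always sees the latest committed write."""
    return primary_log[-1]

def read_replica() -> str:
    """Weakly consistent read: sees only what the replica has applied."""
    return primary_log[replica_applied - 1]

assert read_primary() == "x=3"  # latest, but served by the primary
assert read_replica() == "x=2"  # stale, but offloads the primary
```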


The author addresses that in a linked post: https://brooker.co.za/blog/2024/07/25/cap-again.html

  • They don't address it so much as assume it away. Of course if all your load balancers and end clients can still talk to both sides of your partition then you don't have a problem - that's because you don't actually have a partition in that case.

    • The author’s position seems to be that “actual partition” in this sense is a very unusual case for most cloud applications.

Having worked as the lead architect for bank core payment systems, I'd say the multiple-bank scenario is a special case that is way too complex for the purpose of these discussions.

It is a multi-layered process that ultimately makes it very probable that the state of a payment transaction is consistent between banks, involving reconciliation processes, manual handling of failed transactions over an extended time period if reconciliation fails, settlement accounts for each of the involved banks, and sometimes even central banks for instant payments.
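The reconciliation step can be sketched roughly like this (the transaction ids and states are invented for illustration; real systems compare settlement records, not toy dicts): each bank keeps its own view of payment states, and a periodic job diffs the two views and queues every disagreement for follow-up, manual if necessary.

```python
# Hypothetical sketch: inter-bank reconciliation as a diff of two ledgers.

sending_bank = {"tx1": "settled", "tx2": "settled", "tx3": "sent"}
receiving_bank = {"tx1": "settled", "tx2": "received"}  # tx3 never arrived

def reconcile(ours: dict, theirs: dict) -> list:
    """Return transaction ids whose state disagrees between the two
    banks (including transactions one side has never seen)."""
    all_ids = set(ours) | set(theirs)
    return sorted(tx for tx in all_ids if ours.get(tx) != theirs.get(tx))

# tx2 disagrees on state; tx3 is missing on the receiving side.
assert reconcile(sending_bank, receiving_bank) == ["tx2", "tx3"]
```

Everything the reconciliation flags goes into the slow path: retries, manual handling, settlement-account corrections.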

But I can imagine scenarios where even those can fail to make the transaction state globally consistent. For example: a catastrophic event destroys a small bank's systems, that bank has failed to take off-site backups, and one payment hits some hiccup so that the receiving bank cannot know what happened with the transaction. So they would probably have to assume something.