Comment by LAC-Tech
2 years ago
"FDB can tolerate f failures with only f+1 replicas."
Wait a minute, I know that formula... Viewstamped Replication?? I need to read the FoundationDB paper. (I mainly read CRDT stuff, so hopefully it's understandable.)
---
In general I'm really impressed by the FoundationDB folks.
The talk "Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson blew my mind. TL;DR: they spent the majority of their initial dev effort on building a deterministic simulation of the database, then, once they were happy with that, plugged in real storage, time, and networking at the end. Well worth a watch for anyone interested in distributed systems & reliability.
https://www.youtube.com/watch?v=4fFDFbi3toc&pp=ygUgZGV0ZXJta...
"f failures with f+1 replicas" is the standard for all non-byzantine fault tolerant systems out there. You will find it in Paxos, Raft, Viewstamped Replication, etc.
It makes sense if you think about it: these systems follow a leader/replica model, and naturally you only need one leader to make progress.
Amending my own parent comment, since it won't let me edit: I was wrong about this being standard in Paxos/Raft etc. They actually require "f failures with 2f+1 replicas" (meaning that at a minimum a strict majority of replicas need to be available). I blame my morning brain.
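To spell out the corrected arithmetic, here's a tiny illustrative sketch (Python, just my own toy code, not anything from the Paxos/Raft/VSR codebases):

    # Majority-quorum consensus (Paxos, Raft, VSR): a cluster of n replicas
    # needs a strict majority alive to make progress, so it tolerates
    # f failures only when n >= 2f + 1.
    def majority_quorum(n):
        return n // 2 + 1

    def tolerated_failures(n):
        return (n - 1) // 2

    for n in (1, 3, 5):
        print(n, majority_quorum(n), tolerated_failures(n))
    # n=1 -> quorum of 1, tolerates 0 failures
    # n=3 -> quorum of 2, tolerates 1 failure
    # n=5 -> quorum of 3, tolerates 2 failures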
Replica is ambiguous here: is it 1 leader and n replicas? Or is it just n replicas, one of which is assigned "leader"?
I thought "these systems follow a leader/replica model" would be the former, but "f failures with f+1 replicas" the latter.
It's a cluster size of n replicas, with one of the n being the (current) leader.
"f failures with f+1 replicas" means a cluster of n replicas can sustain n-1 failures: n = f+1, i.e. f = n-1. If you want to be able to sustain f failures, you need a cluster size (n) of f+1.
When there is a failure, a non-failing node becomes the leader (or there's no leader change if the current leader isn't the one that failed). A cluster size of 1 has 1 leader, and can sustain 0 failures.
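For concreteness, the same arithmetic as a toy sketch (Python, purely illustrative of the f+1 replication model, not FDB's actual code):

    # Keep f+1 copies of each piece of data, so as long as one copy
    # survives, the data is still readable. With n replicas you can
    # therefore lose up to n-1 of them: f = n - 1.
    def replicas_needed(f):
        return f + 1

    def failures_tolerated(n):
        return n - 1

    assert replicas_needed(2) == 3     # tolerate 2 failures with 3 copies
    assert failures_tolerated(1) == 0  # a cluster of 1 tolerates 0 failures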
It is the same for all CP systems in terms of CAP. During a partition, clients that have access to the leader can read and write. Clients that can only reach non-leader servers can only read data that was consistent up to the point when the non-leader lost its connection to the leader (i.e. old data, but still consistent).
Raft (at least) goes offline if more than half the replicas are gone, doesn't it? It won't accept writes, and it won't serve reads unless you've explicitly chosen to serve stale reads.
That's the behaviour of quorum systems - majority voting. It guarantees no inconsistent writes in the event of a network partition where each half of the replicas is working fine and can talk to each other, but gets no response from the other half.
But if you can reliably confirm that all but one node have "failed", for a suitably robust definition of failed, that's a different scenario. It means that even though you can't communicate with a failed node in the normal way, you can get confirmation that the node cannot respond to normal messages from any other nodes or clients, and that something (maybe controlling the node, or software on the node itself) guarantees to prevent those responses until the node goes through a recovery and reintegration process.
Some ways this is done are remote-controlled power, remote-controlled reboot, or reconfiguring the network switches to cut off the node, just to ensure it can't come back and carry on responding as if nothing had happened except a temporary delay. There's some subtlety to doing this robustly: consider a response packet that got onto the network before the cut-off event but is delayed a long time inside the network due to a queue or fault.
After reliable "failure" confirmation, you can shrink the quorum size dynamically in response, even down to a single node, and then resume forward progress.
You're right, I misspoke in my original comment. See https://news.ycombinator.com/item?id=37560124
What usually happens is that a leader won't confirm an operation as successful until such operation has been applied in a quorum of replicas (see: synchronous replication).
In theory, nothing prevents a leader from accepting new writes even if it can't reach a quorum, provided it never serves reads of operations that haven't yet been replicated to enough replicas.
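Roughly, the write path looks like this (a minimal Python sketch with made-up names, not any particular database's implementation):

    # Synchronous replication: the leader confirms a write to the client
    # only after a quorum of copies (its own included) has applied it.
    class Replica:
        def __init__(self, up=True):
            self.up = up
            self.log = []

        def apply(self, write):
            if self.up:
                self.log.append(write)
                return True
            return False

    def commit(write, leader, followers, quorum):
        acks = 1 if leader.apply(write) else 0
        acks += sum(1 for f in followers if f.apply(write))
        return acks >= quorum  # only acknowledge to the client if True

    leader = Replica()
    followers = [Replica(), Replica(up=False)]
    print(commit("x=1", leader, followers, quorum=2))  # True: 2 of 3 applied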
Thanks for putting it in context! VSR is the only one I've read into by virtue of it having a really readable paper.
Re: Will’s talk, which I agree is awesome.
They recently turned that knowledge into a product, still early/rough but holy crap it feels like dark wizardry to use it.
Plus these folks are really top shelf humans to work with.
https://antithesis.com/
"FDB can tolerate f failures with only f+1 replicas." is too vague. What kind of failures and in which situations?
If "failure" is a netsplit, only single partition would allow writes, because they choose CP from CAP theorem.
Oh man I've just had a (friendly!) debate on this with some distsys folks on twitter.
The general consensus (no pun intended!) is that the term availability is not really well defined, and the CAP theorem is not a useful way to think about things (see Martin Kleppmann's "The Unhelpful CAP Theorem" in DDIA).
The problem is that the term available is overloaded. In CAP “(A)vailable” specifically means you can keep making db updates as long as you can talk to any db node (e.g. you and a db node have split off the internet together). In every other distributed systems context “available” means the system doesn’t stop working overall when failures happen. These are very different usages and it confuses a lot of people.
Available = if you can see one of the replicas, you are good to go. CAP is good for understanding what the limitations are when you have a partitioned network.
FoundationDB does not give you Availability though, only CP.