In this post, we explore one of the most celebrated and widely used techniques for reaching consensus: the Lock-Commit paradigm. This approach is a key technique of DLS88, Lamport’s Paxos, and many subsequent protocols. Protocols like Raft, PBFT, Tendermint, SBFT, Casper, HotStuff, etc are all based on this paradigm.
We exemplify the Lock-Commit paradigm with a simple single-shot synchronous protocol (with message delay at most $\Delta$) for uniform consensus that is tolerant to $f$ omission failures, given $2f<n$.
In synchrony, this post shows how to tolerate $k<n$ crash failures. A different post shows a non-uniform consensus protocols that tolerates $t<n/2$ omission failures (recall that for omision failures, non-uniform consensus means that faulty replicas may commit on incorrect values).
This related post shows that it is impossible to tolerate $2f\geq n$ omission failures. As an exercise, you can extend this lower bound to show it is impossible to tolerate $k+ft \geq n$ for $k$ crash failures and $f$ omission failures.
Our goal is solving consensus in a system with clients and $n$ replicas. In the Primary-Backup approach, the protocol progresses in views. In each view $v$, one designated replica is the primary of this view and the others are the backups of this view. There is a view change trigger protocol that decides when to globally execute a view change protocol that increments the view (in a single shot synchronous model, it is enough to have $n$ views and simply assign replica $v$ to be the primary of view $v$, but more generally you could have a round-robin or randomized leader election matching of Primaries to views).
The safety risk in consensus is that a replica commits to a value but its decision is not known to other replicas. In particular, a new primary in a new view may emerge that does not know about the committed value. The Lock-Commit paradigm introduces a Lock step before the Commit decision and entails the following two important parts:
Lock a quorum before you Commit. To commit a value $cmd$, the primary of view $v$ first makes sure there is a quorum of replicas that store a lock that consists of a lock-value $cmd$ and a lock-view $v$. This is typically done by having the primary propose the value $cmd$ and receive $n-f$ acknowledgments that the value $cmd$ is locked in view $v$.
Read a quorum before you Propose. When a new primary starts a new view, it runs a view change protocol. In this protocol, the new primary of view $v$ needs to decide which value to propose (and eventually commit) in view $v$. The new Primary must read the locks of previous views from a quorum of replicas and must adopt the lock-value with the highest lock-view it sees.
Intuitively, this approach is safe is because any two quorum sets must have a non-empty intersection, and choosing the lock-value with the highest lock-view will guarantee seeing the committed value (see the proof below). In this post, we assume $n=2f+1$ and use majority quorums that consist of $n-f=f+1$ replicas.
This Lock-Commit paradigm is very powerful and can be extended to many settings:
Since Lock-Commit is based on quorum intersection, safety does not rely on any synchrony. Indeed unlike the Commit-Notify approach, the Lock-Commit approach can be extended to an asynchronous (or partially synchronous) models.
The Lock-Commit approach guarantees uniform consensus (so even omission-faulty replicas commit on the same value). In essence, it guarantees that a replica does not decide before the system is committed.
The Lock-Commit approach can be extended to tolerate malicious adversaries. For $n>2f$, by using signatures and synchrony; and for $n>3f$, by using a quorum systems where every two sets intersect by at least $f+1$ replicas.
Here is a Lock-Commit based (uniform) consensus protocol tolerating $f<n/2$ omission faults for a single slot:
Lock-Commit for omission failures
// pseudocode for Replica j log =  // a log (of size 1) of committed values view = 1 // view number that indicates the current Primary lock-view = 0 // the highest view a propose was heard lock-value = null // the value associated with the highest lock start timer(1) // start timer for view 1 (for duration 8 Delta) while true: // as primary 1: replica 1 and view is 1 on receiving first cmd from client and j == 1 and view == 1: send ("propose", cmd, view) to all replicas // as a backup replica on receiving ("propose", cmd, v) and v == view: lock-view = v lock-value = cmd send ("lock", cmd, v) to the primary j // as a primary: replica j and view is j on receiving ("lock", cmd, view) from n-f distinct replicas and view == j: // commit (after you know enough have locked) log = cmd send ("commit", cmd) to all replicas terminate // as a replica: commit and terminate on receiving ("commit", cmd): log = cmd send ("commit", cmd) to all replicas terminate
When does a replica move from one view to another? When it sees that the current primary is not making progress. This is the view change trigger protocol:
on timer(i) expiring and log == null and view == i; or on receiving ("blame", i) from f+1 distinct replicas send ("blame", i) to all replicas on receiving ("blame", i) from n-f distinct and view <= i: // this will trigger a timer and a "highest lock message" send ("view change", i+1) to all replicas (including self)
Note that the view change trigger protocol can be simplified and also altered to have a linear communication optimistic path. Assuming synchrony, we could for example, simply trigger a view change after each 8 message delays. The more elaborate option described above will allow us to generalize in later posts.
What do the next primaries propose? To maintain safety, they must read from a quorum and use the lock-value with the highest lock-view. This is the essence of the view change protocol:
// send your highest lock on receiving ("view change", v) and view < v: view = v // start timer for 8 Delta start timer(v) send ("highest lock", lock-value, lock-view, v) to replica v // as the primary (you are replica j and view is j) on receiving ("highest lock", l-val, l-view, j) from n-f distinct replicas and view == j: if all l-val are null (all l-view are 0): my-val = any value heard from the clients otherwise: my-val = l-val with the highest l-view send ("propose", my-val, view) to all replicas
Argument for Safety
The key intuition for safety is a quorum intersection argument between two quorums: $W$: A set of replicas who sent a lock message on the committed value (and are hence locked on it). $R$: A set of replicas who send their highest locks to a primary in any higher view.
If $|W \cap R| \geq 1$ replica, then the committed value is always passed on to the next leader as a lock-value whose lock-view is maximial. This argument thus holds for all subsequent views and is formalized using induction in the following claim.
Claim: Let $v$ be the first view where some replica commits to some value $cmd$. Then, no primary will propose $cmd’ \neq cmd$ at any view $v’\geq v$.
By induction on $v’ \geq v$. For view $v’=v$, this follows since the primary sends just one “propose” value per view. Assume the hypothesis holds for all views $\leq v’$ and consider the view change of primary $v’+1$ to view $v’+1$.
Let $W$ be the set of $n-f$ distinct replicas that set $lock-view = v$ and sent $(“lock”, cmd, v)$ to the primary $v$ in view $v$.
Let $R$ be the $n-f$ replicas that the primary $v’+1$ received their $(“highest$ $lock”, lock$-$val, lock$-$view, v’+1)$ for view $v’+1$.
Since $W \cap R \neq \emptyset$, then the primary of $v’+1$ must hear from a member of $R$. Fix some $a \in W \cap R$. By the definition of $R$ and from the induction hypothesis we know that for the view change to view $v’+1$, replica $a$’s lock-view is at least view $v$ and its lock-value must remain $cmd$. In addition, from the induction hypothesis, we know that no other member of $W$ can have a lock that has a lock-view that is at least $v$ with a lock-value $cmd’ \neq cmd$.
Hence, during the view change of view $v’+1$, the value with the maximum view in $W$ must have the value $cmd$ and be with a view $\geq v$.
Observe that this argument did not rely on synchrony.
Argument for Liveness
Claim: Let $v$ be the first view with a non-faulty primary. Then, all non-faulty replicas will commit by the end of view $v$.
Observe that in any view $<v$, either some non-faulty commits and hence all non-faulty commit and terminate one message delay later; or otherwise, all non-faulty do not commit, and hence will send a “blame” and hence all non-faulty will send a “view change” and join the next view within one round trip.
Observe that this argument requires synchrony: it uses the fact that the timers will expire and all start the next view within one message delay.
If some non-faulty replicas have not decided before entering view $v$, then all non-faulty will enter view $v$ within one message delay. In view $v$, the non-faulty primary will gather $n-f$ distinct “lock” messages and will send a commit message that will arrive to all non-faulty replicas before their $timer(v)$ expires (assuming the timer is larger than 8 message delays and using the fact that they all started their timer with a gap of at most one message delay). Hence, even if all faulty replicas send a “blame” message, there will not be enough “blame” messages to form a “view change” message.
Again observe the use of synchrony.
We did not fully specify how the clients send commands to the replicas. For simplicity, we can assume that clients broadcast their requests to all replicas. In practice, one can add a mechanism to track the primary and resend commands to the new primary when there is a view change.
In this post, we do not talk about executing the commands (as required in a state machine replication protocol) nor about how clients can learn about consensus. The protocol can be easily modified to handle these aspects.
Observe that the only requirement is that $|W \cap R| \geq 1$ (honest) replica. Thus, while we consider using quorums of size $n-f = f+1$, the sizes are flexible and do not have to be symmetric. This is the idea behind Flexible Paxos.
Many thanks to Alin Tomescu for valuable comments and insights!
Please answer/discuss/comment/ask on Twitter.