The Lock-Commit Paradigm

In this post, we explore one of the most celebrated and widely used techniques for reaching consensus: the Lock-Commit paradigm. This approach is a key technique of DLS88, Lamport’s Paxos, and many subsequent protocols. Protocols like Raft, PBFT, Tendermint, SBFT, Casper FFG, HotStuff, etc., are all based on this paradigm.

We exemplify the Lock-Commit paradigm with a simple single-shot synchronous protocol (with message delay at most $\Delta$) for uniform consensus that is tolerant to $f$ omission failures, given $2f<n$.

It is impossible to solve consensus with $2f \geq n$ omission failures. As an exercise: extend this lower bound to show it is impossible to tolerate $k+2f \geq n$ for $k$ crash failures and $f$ omission failures.
Using the Commit-Notify paradigm in synchrony, non-uniform consensus is possible for $t<n/2$ omission failures (recall that for omision failures, non-uniform consensus means that faulty replicas may commit on incorrect values).
In synchrony, consensus is possible for $k$ crash failures, for any $k<n$.
In a follow-up post, we extend the lock-commit paradigm to a multi-shot protocol that can tolerate both $f$ omission failures and $k$ crash failures given $k+2f<n$.
Protocols in partial synchrony that use the lock-commit paradigm: Streamlet, Benign Hotstuff, Raft with Chaining, Information Theoretic HotStuff.

Lock-Commit Paradigm

Consider the task of solving consensus in a system with clients and $n$ replicas. Clients send commands to the replicas and the goal of the replicacs (in the single shot case) is to decide on one command. In the Primary-Backup approach, the protocol progresses in views. In each view $v$, one designated replica is the primary replica and the other replicas are the backup replicas of this view. The primary of a view needs to propose a command and replica need to eventually commit (or decide) on a command. There is a view change trigger protocol that decides when to globally execute a view change protocol that increments the view.

The main safety risk in Primary-Backup based consensus protocols is that a replica commits to a value and later a different replica commits to a different value. The only way this can happen is if a new primary in a new view proposes a different value. To guarantee this will not happen the Lock-Commit paradigm does two crucial things:

Lock a quorum before you Commit. To commit a value $cmd$, the primary of view $v$ first makes sure there is a quorum of replicas that store a lock that consists of a lock-value $cmd$ and a lock-view $v$. This is typically done by having the primary propose the value $cmd$ and receive $n-f$ acknowledgments that the value $cmd$ is locked in view $v$.
Read a quorum before you Propose. When a new primary starts a new view, it runs a view change protocol. In this protocol, the new primary of view $v$ needs to decide which value to propose (and eventually commit) in view $v$. The new Primary must read the locks of previous views from a quorum of replicas and must adopt the lock-value with the highest lock-view it sees.

Intuitively, this approach is safe is because any two quorum sets must have a non-empty intersection, hence if the primary proposes the lock-value with the highest lock-view then this value must be the the committed value (see the proof below). In this post, we assume $n=2f+1$ and use majority quorums that consist of $n-f=f+1$ replicas.

This Lock-Commit paradigm is the core of Paxos. It is extended to many settings:

Since Lock-Commit is based on quorum intersection, safety does not rely on any synchrony. Indeed unlike the Commit-Notify paradigm, the Lock-Commit paradigm is often used for safety in asynchrony (and partially synchrony).
The Lock-Commit paradigm guarantees uniform consensus (so even omission-faulty replicas commit on the same value). In essence, it guarantees that a replica does not decide before the system is in a committed state.
The Lock-Commit paradigm can be extended to tolerate malicious adversaries. For $n>2f$, by using signatures and synchrony; and for $n>3f$, by using a quorum system where every two sets intersect by at least $f+1$ replicas.

Here is a Lock-Commit based (uniform) consensus protocol tolerating $f<n/2$ omission faults for a single slot:

Lock-Commit for omission failures

// pseudocode for Replica j

log = []     // a log (of size 1) of committed values
view = 1     // view number that indicates the current Primary
lock-view = 0     // the highest view a propose was heard
lock-value = null // the value associated with the highest lock
start timer(1) // start timer for view 1 (for duration 8 Delta)


while true:
   // as primary 1: replica 1 and view is 1
   on receiving first cmd from client and j == 1 and view == 1:
         send ("propose", cmd, view) to all replicas
   
   // as a backup replica 
   on receiving ("propose", cmd, v) and v == view:
         lock-view = v
         lock-value = cmd
         send ("lock", cmd, v) to the primary j
   
   // as a primary: replica j and view is j
   on receiving ("lock", cmd, view) from n-f distinct replicas and view == j:
         // commit (after you know enough have locked)
         log[0] = cmd
         send ("commit", cmd) to all replicas
         terminate
   
   // as a replica: commit and terminate
   on receiving ("commit", cmd):
         log[0] = cmd
         send ("commit", cmd) to all replicas
         terminate

When does a replica move from one view to another? When it sees that the current primary is not making progress. This is the view change trigger protocol:

  on timer(i) expiring and log[0] == null and view == i; or
  on receiving ("blame", i) from f+1 distinct replicas
         send ("blame", i) to all replicas
  on receiving ("blame", i) from n-f distinct and view <= i:
         // this will trigger a timer and a "highest lock message"
         send ("view change", i+1) to all replicas (including self)

Note that the view change trigger protocol can be simplified and also altered to have a linear communication optimistic path. Assuming synchrony, we could for example, simply trigger a view change after each 8 message delays. The more elaborate option described above will allow us to generalize in later posts.

What do the next primaries propose? To maintain safety, they must read from a quorum and use the lock-value with the highest lock-view. This is the essence of the view change protocol:

   // send your highest lock
   on receiving ("view change", v) and view < v:
         view = v
         // start timer for 8 Delta
         start timer(v)
         send ("highest lock", lock-value, lock-view, v) to replica v
         
   // as the primary (you are replica j and view is j)
   on receiving ("highest lock", l-val, l-view, j) from n-f distinct replicas and view == j:
         if all l-val are null (all l-view are 0):
             my-val = any value heard from the clients
         otherwise:
             my-val = l-val with the highest l-view
         send ("propose", my-val, view) to all replicas

Argument for Safety

The key intuition for safety is a quorum intersection argument between two quorums: $W$: A set of replicas who sent a lock message on the committed value (and are hence locked on it). $R$: A set of replicas who send their highest locks to a primary in any higher view.

If $|W \cap R| \geq 1$ replica, then the committed value is always passed on to the next leader as a lock-value whose lock-view is maximal. This argument thus holds for all subsequent views and is formalized using induction in the following claim.

Claim: Let $v$ be the first view where some replica commits to some value $cmd$. Then, no primary will propose $cmd’ \neq cmd$ at any view $v’\geq v$.

Proof:

By induction on $v’ \geq v$. For view $v’=v$, this follows since the primary sends just one “propose” value per view. Assume the hypothesis holds for all views $\leq v’$ and consider the view change of primary $v’+1$ to view $v’+1$.

Let $W$ be the set of $n-f$ distinct replicas that set $lock-view = v$ and sent $(“lock”, cmd, v)$ to the primary $v$ in view $v$.

Let $R$ be the $n-f$ replicas that the primary $v’+1$ received their $(“highest$ $lock”, lock$-$val, lock$-$view, v’+1)$ for view $v’+1$.

Since $W \cap R \neq \emptyset$, then the primary of $v’+1$ must hear from a member of $R$. Fix some $a \in W \cap R$. By the definition of $R$ and from the induction hypothesis we know that for the view change to view $v’+1$, replica $a$’s lock-view is at least view $v$ and its lock-value must remain $cmd$. In addition, from the induction hypothesis, we know that no other member of $W$ can have a lock that has a lock-view that is at least $v$ with a lock-value $cmd’ \neq cmd$.

Hence, during the view change of view $v’+1$, the value with the maximum view in $W$ must have the value $cmd$ and be with a view $\geq v$.

Observe that this argument did not rely on synchrony.

Argument for Liveness

Claim: Let $v$ be the first view with a non-faulty primary. Then, all non-faulty replicas will commit by the end of view $v$.

Proof:

Observe that in any view $<v$, either some non-faulty commits and hence all non-faulty commit and terminate one message delay later; or otherwise, all non-faulty do not commit, and hence will send a “blame” and hence all non-faulty will send a “view change” and join the next view within one round trip.

Observe that this argument requires synchrony: it uses the fact that the timers will expire and all start the next view within one message delay.

If some non-faulty replicas have not decided before entering view $v$, then all non-faulty will enter view $v$ within one message delay. In view $v$, the non-faulty primary will gather $n-f$ distinct “lock” messages and will send a commit message that will arrive to all non-faulty replicas before their $timer(v)$ expires (assuming the timer is larger than 8 message delays and using the fact that they all started their timer with a gap of at most one message delay). Hence, even if all faulty replicas send a “blame” message, there will not be enough “blame” messages to form a “view change” message.

Again observe the use of synchrony.

Remarks

We did not fully specify how the clients send commands to the replicas. For simplicity, we can assume that clients broadcast their requests to all replicas. In practice, one can add a mechanism to track the primary and resend commands only to the new primary when there is a view change.
In this post, we do not talk about executing the commands (as discussed in a state machine replication protocol) or running multiple consensus instantces.
The only requirement for Safety is that $|W \cap R| \geq 1$ (honest) replica. Thus, while we consider using quorums of size $n-f = f+1$, the sizes are flexible and do not have to be symmetric. This is the idea behind Flexible Paxos.

Acknowlegment

Many thanks to Alin Tomescu for valuable comments and insights!

Please answer/discuss/comment/ask on Twitter.