Reasoning about Distributed Protocols with Smart Casual Verification

Here at decentralized thoughts, we spend a lot of time reasoning about distributed protocols. Often, we focus on solving distributed consensus, personally it’s my favorite CS problem, but it’s also famously one of the most difficult and subtle problems in distributed computing. Reasoning about distributed algorithms is hard at the best of times, with state split across remote nodes, asynchrony, concurrency, and non-determinism in the order that events, such as message delivery, occur as well as failures, like nodes crashing, message loss, or even Byzantine attacks.

Confidential Consortium Framework

This is exactly the problem we faced in the Confidential Consortium Framework (CCF). CCF is an open-source platform for developing trustworthy and reliable decentralized applications (you can learn more about CCF in our VLDB paper). CCF powers critical services and as such it is vital to build confidence in the correctness of CCF’s design and implementation. In particular, CCF has a custom distributed consensus protocol, including its own dynamic reconfiguration logic. CCF’s consensus logic, though based on the popular Raft protocol has been sufficiently modified such that it is now based on an unproven algorithm. This is a common problem: the Chubby authors similarly note that the additional requirements imposed by a real-world system over a vanilla Paxos implementation require significant changes. Moreover, distributed consensus protocols are notoriously hard to get right, and even well-known protocols, including PBFT, Egalitarian Paxos, and Zyzzyva, as well as the previously mentioned Raft and Chubby, have been found to contain subtle bugs.

Smart Casual Verification

CCF already had an extensive suite of tests. Software testing (“casual verification”) is approachable, widely utilized, and flexible, however, it is challenging to achieve coverage of the edge cases, where many things go wrong simultaneously. Full formal verification, synthesizing an executable implementation from a formally proven specification, has been successfully applied to distributed systems, including consensus protocols, however such approaches require significant upfront investment in time and expertise. At the time of writing, CCF is 63 kLoC in C++ and it was not an option to rewrite CCF for the purpose of verification.

We (the CCF team) opted for a different approach which we refer to as “smart casual verification”. Generally, smart casual refers to a style of dress that is neat and stylish, without the expense and discomfort associated with formal attire such as a business suit. Smart casual verification is therefore a pragmatic yet systematic approach to verification that combines the rigor of formal methods with the ease of use and flexibility of more casual methods. Smart casual verification combines systematic formal specification and techniques such as model checking with trace validation to tie the formal specification back to the real-world implementation.

The full details of our adventures applying smart casual verification to CCF are outlined in our recent paper at NSDI 2025. At a high-level, we used TLA+ to develop a formal model of our distributed consensus (and consistency) protocols. We were able to use TLC, one of the model checkers for TLA+, to both exhaustively search finite subsets of the state space and randomly explore the depths of the space. We were able to harness the existing “casual verification” to generate concrete traces from our implementation and check that behaviors exhibited by the implementation match what is described by our formal model. As a result, we found six bugs in our consensus protocols, five concerning safety and one regarding liveness. We were thus able to fix these before they caused issues in production.

Specification & model checking in TLA+

TLA+ was a natural choice of formal specification language for this effort. In fact, we were able to use Raft’s original TLA+ specification as a starting point. A TLA+ specification consists of a system’s safety and liveness properties and a description of the system as a state machine. The state machine’s main components are:

a collection of variables, representing the current state of a system,
a set of initials states, describing the starting values of the variables, and
a set of actions, defining the possible state transitions from the current values of the variables to the next values.

TLA+ specifications can then be verified against their desired properties, for instance, in the case of Raft, the term of each node is monotonically increasing or there is at most one committed log entry per index. To describe a distributed system, like CCF, the variables are the local states of each node in the system and the collection of messages currently in flight. Each action models a node taking a single step within the protocol, updating its local state, and optionally adding to the collection of in-flight messages. Actions can either be initiated from a message receipt or by a node itself, for instance, a candidate stepping up as a leader. Like most TLA+ specifications, our specification has an infinite number of reachable states, there can be any number of nodes, a node can always increase its term, or a client can always send a new transaction. TLC checks properties by enumerating states therefore we needed to make the number of states finite. We tackled this in two ways. Firstly, we constrained the specification by limiting the number of client requests, number of configuration and number of terms. We can then check properties by (breath-first) exhaustive exploration of the set of reachable states. Whilst satisfying, this approach is time-consuming (we are talking days for larger numbers of client requests, configs and terms!). In particular, CCF continues to evolve so we wanted a lightweight way to check the spec inside the existing CI pipeline. For this purpose, we instead opted for simulation, weighted random exploration of the reachable states, which proved surprisingly effective at quickly finding property violations.

Trace Validation

Once we had a model describing our consensus algorithm, we were concerned that perhaps the implementation and specification did not line up. This concern was not unfounded, sometimes we found bugs in the specification which did not exist in the actual implementation and vice versa. Moreover, once we were certain that the implementation and specification were consistent, how could we be sure that this remained true as CCF evolves? In other words, how can we be sure that the specification and implementation are in sync?

This is where trace validation entered the picture. We instrumented our implementation with some simple optional print statements to log the internal state of the nodes during key points in the consensus. Running the existing test harness thus produced a set of traces from the implementation. We used TLC to check that each trace matched a sequence of states described in the specification. If there is a 1:1 correspondence between the variables and actions in the traces and the specification, then this process is straightforward. Unfortunately, this was not the case for us, the trace and specification had different “grains of atomicity”. As we describe in our paper, we were still able to make trace validation work for us. Like simulation, we were able to integrate trace validation to our CI pipeline to help us keep the implementation and specification in sync.

Conclusion

Smart casual verification is not without its limitations; it does not come with the same iron strong guarantees as rigorous formally verified implementations and requires more time investments than casual verification alone. However, it might still be worth considering adding smart causal verification to your verification wardrobe in your latest distributed systems project.

Learn more about our journey with smart casual verification in “Smart Casual Verification of the Confidential Consortium Framework” by Heidi Howard, Markus Kuppe, Edward Ashton, Amaury Chamayou and Natacha Crooks, presented last month at NSDI 2025.

Acknowledgements: Thank you to Markus Kuppe for his helpful feedback on this post.