If anyone asks you: how can I scale my State Machine Replication (Blockchain) system?

You should answer back with a question: what is your bottleneck? Is it Data, Consensus or Execution?

  1. Data: Shipping the commands to all the replicas. For example, if a block contains 1MB of commands, then all of these bits need to arrive at all the validating replicas. The obvious bottleneck is your system’s channel capacity.

  2. Consensus: Once the commands arrive, replicas engage in a consensus protocol (like the ones discussed here for partial synchrony or synchrony). For example, if a consensus protocol needs two round trips and the validating replicas are spread across the globe, then the obvious latency bottleneck is the speed of light and the size of the Earth.

  3. Execution: After the commands have arrived and consensus has been reached on their total ordering, the replicas need to execute the commands. The execution engine is a function that takes the old state and applies the ordered commands to compute the new state (and the outputs). For example, if execution requires many cryptographic operations, then the obvious bottleneck is that every replica needs to re-execute these cryptographic operations.

These three bottlenecks are not a tradeoff or a dilemma or even a trilemma. They are independent bottlenecks. The ability of a State Machine Replication system to scale is bottlenecked by the minimum of all three. Below we provide a partial list of directions for addressing these challenges.
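
To make the execution bottleneck (item 3) concrete, here is a minimal sketch of the deterministic transition function every replica applies, using a toy key-value state and a hypothetical payment-command format:

```python
from typing import Dict, List, Tuple

State = Dict[str, int]            # toy state: account -> balance
Command = Tuple[str, str, int]    # hypothetical format: (sender, receiver, amount)

def execute(state: State, commands: List[Command]) -> Tuple[State, List[bool]]:
    """Deterministically apply an ordered batch of commands to the old state,
    returning the new state and one output (a success flag) per command."""
    new_state = dict(state)
    outputs = []
    for sender, receiver, amount in commands:
        if new_state.get(sender, 0) >= amount:
            new_state[sender] = new_state.get(sender, 0) - amount
            new_state[receiver] = new_state.get(receiver, 0) + amount
            outputs.append(True)
        else:
            outputs.append(False)   # insufficient funds: no state change
    return new_state, outputs

state, outputs = execute({"alice": 10}, [("alice", "bob", 7), ("bob", "carol", 3)])
assert state == {"alice": 3, "bob": 4, "carol": 3}
```

Since every replica runs the same function on the same ordered commands, all replicas compute the same new state; this determinism is also why, in the baseline design, every replica must pay the full cost of execution.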

1. Scaling Data

Better network solutions

In Bitcoin and other cryptocurrencies, the ability to scale depends crucially on reducing the time it takes a winning block of commands to propagate to all other miners. Systems like FIBRE, Falcon, and bloXroute aim to reduce this latency by propagating blocks using pipelining and forward error correction codes.
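
As a toy illustration of the forward error correction idea (this is not the actual FIBRE or Falcon protocol): split a block into k chunks plus one XOR parity chunk, so a receiver can reconstruct the block even if any single chunk is lost, without waiting a full round trip for a retransmission:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(block: bytes, k: int) -> list:
    """Split a block into k equal-size chunks plus one XOR parity chunk."""
    size = -(-len(block) // k)   # ceiling division
    chunks = [block[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return chunks + [reduce(xor, chunks)]

def decode(received: list) -> bytes:
    """Reconstruct the block when at most one chunk was lost in transit (None).
    (Toy framing: a real codec also carries explicit lengths and indices.)"""
    if None in received:
        lost = received.index(None)
        survivors = [c for c in received if c is not None]
        received[lost] = reduce(xor, survivors)   # XOR of survivors = lost chunk
    return b"".join(received[:-1]).rstrip(b"\0")

chunks = encode(b"1MB of block commands...", k=4)
chunks[1] = None   # one chunk dropped by the network
assert decode(chunks) == b"1MB of block commands..."
```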

Another way to improve data scalability centers on the ability to discover peers and access content via a content-addressable network. See Kademlia, which inspired Ethereum’s RLPx and was generalized in libp2p.
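
At the core of Kademlia is its XOR distance metric: peers and content hash into the same ID space, and a lookup greedily contacts peers whose IDs are closer (by XOR) to the content ID. A minimal sketch:

```python
import hashlib

def to_id(name: str) -> int:
    """Hash peers and content into the same 160-bit ID space, as in Kademlia."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    return a ^ b   # Kademlia's distance metric over the shared ID space

# A lookup repeatedly queries the known peers closest (by XOR) to the content
# ID, roughly halving the distance each hop, so routing takes O(log n) steps.
peers = [to_id(f"peer-{i}") for i in range(20)]
target = to_id("some-content")
closest = min(peers, key=lambda p: xor_distance(p, target))
```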

Pushing data to layer two

One solution to the bottleneck generated by replicating all the commands is simply not to replicate them! Systems like Lightning, Plasma, and other layer two solutions aim to reduce data replication by pushing some of the intermediate commands to a smaller private group and only reporting periodic summaries to the main system (see our post on payment channels). This approach has a natural drawback: not replicating all the data creates a data availability problem. Safety depends on having at least one honest party in each private group that can react in a timely manner.
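
Here is a toy sketch in the spirit of a payment channel (not any specific protocol): many off-chain updates inside a small group, with only the opening deposits and the closing summary replicated on the main system:

```python
class Channel:
    """Toy two-party channel: intermediate payments stay inside the channel;
    only the opening deposits and the closing balances touch the main chain."""
    def __init__(self, a_deposit: int, b_deposit: int):
        self.balances = {"A": a_deposit, "B": b_deposit}   # on-chain: open
        self.seq = 0   # in a real channel, both parties sign every update

    def pay(self, frm: str, to: str, amount: int) -> None:
        assert self.balances[frm] >= amount
        self.balances[frm] -= amount
        self.balances[to] += amount
        self.seq += 1   # the highest-sequence signed state wins at closing

    def close(self) -> dict:
        return dict(self.balances)   # on-chain: the only replicated summary

ch = Channel(a_deposit=100, b_deposit=0)
for _ in range(500):       # 1000 payments, none of them replicated...
    ch.pay("A", "B", 1)
    ch.pay("B", "A", 1)
summary = ch.close()       # ...and a single closing summary on-chain
```

The `seq` counter hints at why timeliness matters: safety rests on an honest party posting the latest signed state before a stale one is finalized.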

2. Scaling Consensus

Throughput vs latency trade-off

Some people talk about Transactions-Per-Second (TPS) as the measure of how scalable a consensus protocol is. But TPS measures only throughput, and optimizing it alone misses the point of the consensus scalability challenge: a solution for scaling consensus must address both throughput and latency. Improving consensus throughput (at the cost of latency) via batching is easy: reach consensus on the hash of a batch of all the data just once a day instead of every few seconds. The cost of running consensus once a day is amortized over the whole batch, so consensus will not be the throughput bottleneck. Batching is an important technique that increases throughput for consensus protocols, but it also increases latency, so it is not a magic solution for scaling consensus performance.
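
A back-of-the-envelope sketch of this trade-off, with purely illustrative numbers: batching amortizes the fixed per-decision consensus cost, so throughput approaches the arrival rate, while worst-case latency grows with the batch interval.

```python
CONSENSUS_COST = 2.0   # seconds of consensus work per decision (illustrative)
TX_RATE = 1000         # transactions arriving per second (illustrative)

for interval in (1, 60, 86400):            # decide once per second / minute / day
    batch_size = TX_RATE * interval        # commands accumulated per decision
    throughput = batch_size / (interval + CONSENSUS_COST)
    latency = interval + CONSENSUS_COST    # worst case: a tx waits a whole batch
    print(f"decide every {interval:>6}s: ~{throughput:6.0f} tx/s, "
          f"worst-case latency ~{latency:.0f}s")
```

With once-a-day batching, throughput is essentially the full 1000 tx/s, but a transaction may wait an entire day to commit.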

The PBFT journal version has a good discussion of latency and throughput for BFT State Machine Replication.

For Nakamoto Consensus based protocols, there is a series of works (Bitcoin-NG, FruitChains, Prism) that aim to improve throughput and latency.

Performance vs security trade-off

Some people suggest improving the performance of consensus by running it on a smaller group of validating replicas. Decreasing the set of validating replicas indeed improves performance, but it also reduces security. Before we discuss this path, let’s consider ways to improve consensus performance without reducing the number of validating replicas.

One way to improve consensus performance without reducing security is to reduce the consensus protocol’s complexity, for example, by reducing the number of rounds or by improving the message complexity from quadratic to linear. This post discusses some protocol improvements in partial synchrony and this post discusses some protocol improvements in synchrony.
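
A simplistic count (ignoring all protocol details) shows why moving from quadratic to linear message complexity matters at scale:

```python
for n in (4, 100, 1000):        # number of validating replicas
    all_to_all = n * (n - 1)    # quadratic: every replica sends to every other
    via_leader = 2 * (n - 1)    # linear: replicas exchange messages with a leader
    print(f"n={n:>4}: quadratic {all_to_all:>9} msgs/round, linear {via_leader:>5}")
```

At n=1000, that is nearly a million messages per round versus about two thousand.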

Scale vs adaptivity trade-off

Consensus protocols based on the PBFT view-based paradigm are susceptible to an adversary that adaptively attacks the primary. The security of a consensus protocol is not just about the size of the adversary (which is controlled by the total number of validating replicas) but also about the adaptive power of the adversary.

Protocols that handle an adaptive adversary often incur a higher cost and are more challenging to scale. An adaptive adversary can use Denial-of-Service attacks to block the system from making progress; HoneyBadger suggests the first practical asynchronous BFT protocol, which guarantees liveness without making any timing assumptions. Algorand suggests using round-based cryptographic sampling to scale Byzantine consensus and protect it from adaptive attackers, an approach with promising simulation results.
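
Here is a toy sketch of round-based sampling (hash-based self-selection for simplicity; Algorand itself uses verifiable random functions so that a replica can also prove its selection). Because each round’s committee is unpredictable and short-lived, an adaptive attacker cannot choose its targets in advance:

```python
import hashlib

def in_committee(secret_key: str, round_num: int, frac: float) -> bool:
    """Toy per-round self-selection. Algorand instead uses a VRF, so a replica
    can also prove its selection without revealing anyone else's."""
    digest = hashlib.sha256(f"{secret_key}|{round_num}".encode()).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64   # pseudo-uniform in [0, 1)
    return draw < frac

replicas = [f"sk-{i}" for i in range(1000)]
committee = [sk for sk in replicas if in_committee(sk, round_num=7, frac=0.02)]
# Each round yields a fresh, unpredictable committee (~20 of 1000 replicas here),
# so an adaptive attacker cannot know whom to target before the round starts.
```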

Avoiding a total ordering of all commands

When commands all depend on each other, consensus has no choice but to create a complete total order of all commands. But in many workloads, commands do not depend on or interfere with each other. For example, in some cases a command where $A$ pays $B$ and a command where $C$ pays $D$ do not interfere, and there is no reason to bottleneck the system and spend precious consensus efforts on internally ordering these two commands. This approach has been taken in the non-Byzantine model in EPaxos. Protocols like Avalanche and other DAG based protocols improve consensus throughput by allowing non-interfering commands to be committed concurrently.
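
A minimal sketch of the non-interference test such protocols rely on, assuming a toy payment-command format: two commands conflict only if they touch a common account, and conflict-free commands can be committed concurrently:

```python
def accounts(cmd: tuple) -> set:
    sender, receiver, _ = cmd    # hypothetical format: (sender, receiver, amount)
    return {sender, receiver}

def conflict(c1: tuple, c2: tuple) -> bool:
    """Two payments interfere iff they touch a common account."""
    return bool(accounts(c1) & accounts(c2))

assert not conflict(("A", "B", 5), ("C", "D", 7))   # commit concurrently
assert conflict(("A", "B", 5), ("B", "C", 2))       # must be ordered
```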

Sharding

We are now ready to discuss the idea of improving the performance of consensus by running it on a smaller group of validating replicas and by doing so allowing many consensus systems to run in parallel.

At a high level, sharding is the idea of partitioning both the state and the set of validating replicas: each shard controls some part of the state, and its consensus is run by some part of the total validating replica population. Some cross-shard mechanism must also be in place. The comprehensive “Sharding FAQ” of Ethereum is a great resource to learn more about sharding.
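
In its simplest form, the state partition can be a deterministic map from a state key to a shard, for example by hashing the key (a toy sketch; a real design must also handle cross-shard commands and resharding):

```python
import hashlib

NUM_SHARDS = 64   # illustrative

def shard_of(account: str) -> int:
    """Deterministically map a state key (here, an account) to a shard."""
    digest = hashlib.sha256(account.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# A payment between accounts on the same shard is ordered and executed by that
# shard's replicas alone; otherwise the cross-shard mechanism is needed.
needs_cross_shard = shard_of("alice") != shard_of("bob")
```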

Sharding is a way to parallelize the data, consensus, and execution bottlenecks. Parallelizing the data and execution depends on having a workload with low contention. From a consensus perspective, sharding is essentially a performance vs security trade-off: instead of using all validating replicas to secure one state machine, sharding creates multiple state machines, and each validating replica is used to secure just one of them.

Having many shards (when contention is low) can obviously improve performance, but since each shard is secured by fewer validating replicas, it may also reduce security. See Omniledger and Ethereum 2.0 for examples of systems that use sharding techniques.

Ethereum 2.0 plans to combine the lower security of each shard with a higher security global chain. Much like layer 2 solutions, the idea is that each lower security shard can periodically finalize itself on the higher security global chain. This creates a security-latency tradeoff: waiting for high security requires waiting for the periodic global chain finalization.

3. Scaling Execution

The separation of consensus and execution is one of the fundamental architectural principles of State Machine Replication (see Base 2003). The advantages of this separation are highlighted in Yin et al 2003. In the traditional SMR design, after a command is replicated and committed, it needs to be executed on all validating replicas.

In many systems, the cost of executing the commands is the biggest scalability bottleneck. A major type of denial-of-service attack on State Machine Replication systems is to issue legal commands that cause the system to waste time during execution (for example, see here and here). Many systems design Domain Specific Languages to prevent these attacks: Bitcoin uses Bitcoin Script, which carefully limits the computational complexity of each transaction, and Ethereum uses a gas mechanism to limit execution complexity and incentivize efficient usage.
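
A toy sketch of the gas idea (inspired by Ethereum’s mechanism, not its actual implementation): every execution step draws from a prepaid budget, so a legal-looking command cannot consume unbounded replica time:

```python
class OutOfGas(Exception):
    pass

COST = {"ADD": 3, "MUL": 5, "STORE": 100}   # illustrative per-opcode costs

def run(ops: list, gas_limit: int) -> int:
    """Execute a list of opcodes, charging gas per step; abort when exhausted."""
    gas = gas_limit
    for op in ops:
        gas -= COST[op]
        if gas < 0:
            # The attempt still pays: wasting replica time is never free.
            raise OutOfGas(f"aborted at {op}")
    return gas_limit - gas   # gas actually consumed

assert run(["ADD", "MUL", "STORE"], gas_limit=200) == 108
# run(["STORE"] * 10**6, gas_limit=200) would raise OutOfGas almost immediately.
```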

Parallelizing execution

One proposed way to attack the execution bottleneck is to leverage machine parallelism. This approach works when the commands in a block are mostly contention free (commutative or independent). The main idea is to simulate a sequential execution via a protocol that exploits parallelism in the optimistic contention-free case but maintains safety even when there is contention. See Eve 2012, Dickerson, Gazzillo, Herlihy, Koskinen 2017 and Saraph, Herlihy 2019.
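
Here is a minimal sketch of the optimistic pattern (much simpler than the cited works): if no account appears in two commands of the block, the commands commute and can run in parallel; otherwise fall back to sequential execution:

```python
from concurrent.futures import ThreadPoolExecutor

def apply(state: dict, cmd: tuple) -> None:
    sender, receiver, amount = cmd
    if state.get(sender, 0) >= amount:                 # balance check
        state[sender] = state.get(sender, 0) - amount
        state[receiver] = state.get(receiver, 0) + amount

def execute_block(state: dict, commands: list) -> dict:
    new_state = dict(state)
    touched = [k for s, r, _ in commands for k in (s, r)]
    if len(touched) == len(set(touched)):
        # Optimistic fast path: no account appears in two commands, so the
        # commands commute and no two threads ever write the same key.
        with ThreadPoolExecutor() as pool:
            list(pool.map(lambda c: apply(new_state, c), commands))
    else:
        # Contended fallback: execute sequentially to preserve safety.
        for c in commands:
            apply(new_state, c)
    return new_state
```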

Don’t execute, verify using economic incentives and fraud proofs (optimistic rollups)

In this solution, the commands are committed as data but the execution is not done by the validating replicas. The validating replicas just act as a data availability layer.

Instead of having replicas execute the commands, there is an economically incentivized game where players can become executors by posting bonds. A bonded executor can then commit the execution outcome. Any bonded reporting agent can provide a fraud proof claiming that the executor committed an incorrect execution outcome. If the fraud proof is correct, then the executor is slashed and the reporting agent is partially rewarded; if the reporting agent lies about the fraud proof, then its own bond can be slashed. Protocols for efficiently challenging the executor have their origins in the work of Feige and Kilian 2000, followed by Canetti, Riva, Rothblum 2011, and were later adapted to use on-chain incentives in TrueBit (Teutsch, Reitwießner 2017) and in Buterin’s Off-Chain Oracles. This approach is now being developed under the name optimistic rollups (also see merged consensus, Adler, Mikerah, Quintyne-Collins, Al-Bassam, Sonnino, Buterin and LazyLedger).
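
A toy sketch of the incentive game, with hypothetical rules and numbers (the cited protocols differ in many details, e.g. they use interactive bisection rather than full re-execution):

```python
def execute(commands: list) -> int:
    return sum(commands)        # toy stand-in for the expensive execution

class OptimisticRollup:
    BOND = 100                  # illustrative bond size

    def __init__(self):
        self.claims = []        # replicas store commands as data, plus claims

    def commit(self, executor: str, commands: list, result: int) -> None:
        """A bonded executor posts commands and a claimed execution outcome."""
        self.claims.append({"executor": executor, "commands": commands,
                            "result": result, "bond": self.BOND})

    def challenge(self, idx: int, challenger: str) -> str:
        """During the challenge window, anyone may re-execute and prove fraud."""
        claim = self.claims[idx]
        if execute(claim["commands"]) != claim["result"]:
            claim["bond"] = 0                       # slash the lying executor
            return f"fraud proven: {challenger} rewarded {self.BOND // 2}"
        return "claim is correct; a lying challenger would lose its own bond"

rollup = OptimisticRollup()
rollup.commit("eve", [1, 2, 3], result=7)           # wrong: the true result is 6
assert rollup.challenge(0, "alice").startswith("fraud proven")
```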

Don’t execute, verify using succinct proofs (zk rollups)

In this solution, again the commands are committed as data but the execution is not done by the validating replicas. The validating replicas just act as a data availability layer for the commands.

Instead of using games and fraud proofs to verify execution, it is possible to leverage succinct non-interactive proofs (PCP, Groth 2010, Groth 2016, Ben-Sasson, Chiesa, Tromer, Virza 2013-2019, survey). These cryptographic techniques allow a prover to generate very short proofs that can be efficiently verified with high cryptographic soundness and completeness. Execution (and proof generation) needs to happen at only one entity. Once a short proof exists, the validating replicas of the execution engine just need to validate the short proof instead of re-executing the long transactions. In Bitcoin, this approach was proposed by Maxwell (see CoinWitness, 2013) as a way to improve “Bitcoin’s scalability and securely integrate new and innovative alternative transaction methods and expand Bitcoin’s zero-trust nature to more types of transactions.” Zexe uses this approach to build a nano-kernel based proof system that allows private transactions in the UTXO setting.

This approach for scaling transactions is highlighted in Buterin’s zk-roll-up post. See podcasts by Ben-Sasson and Gluchowski and videos by Gluchowski and Buterin for a way to add privacy (zero knowledge) to the succinct proofs (zk zk rollups). Gluchowski wrote a survey/comparison of optimistic and zk rollups and recently posted an announcement of ZK Sync.

Using a succinct proof has a huge advantage: once the proof is created, the cost of verification is very low. The disadvantage is that creating the proof of execution is often significantly more resource intensive than simply executing the commands (both in CPU and in memory). Another disadvantage is that these protocols add non-trivial complexity. Moreover, succinct proofs come in several flavors, some of which require non-trivial trusted setup ceremonies.

Rollups aim to remove the execution bottleneck, but they do not change the data bottleneck.

Does executing (validating) require state, or can it be stateless?

Yet another challenge for execution is that it requires not only the command but also the existing state. For example, in Ethereum a full node requires hundreds of gigabytes and an archive node requires a few terabytes; in Bitcoin, a full node also requires hundreds of gigabytes. Having a large state increases the barrier to entry and significantly slows down the rate at which a new node can start validating commands. This is particularly challenging in a sharding solution, where validators need to be shuffled around to reduce the power of an adaptive adversary: if the state of each shard is large, then the time it takes to sync the state limits the speed at which validators can be shuffled between shards.

So reducing the amount of data needed for validation allows for more secure sharding, which in turn allows for more scalability. One approach is to have each command contain cryptographic proofs of all the data it needs for validation. This is the idea of stateless clients, and it can be implemented with Merkle tree proofs. New cryptographic techniques allow for ever more efficient stateless validation; for example, see Coda. Alin’s podcast is a great introduction and contains many links.
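
A minimal sketch of the Merkle proof machinery behind stateless validation: a validator holding only the state root can check a claimed (key, value) pair against a logarithmic-size proof shipped with the command (for simplicity, this assumes a power-of-two number of leaves):

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def merkle_root(leaves: list) -> bytes:
    level = [h(l) for l in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def prove(leaves: list, idx: int) -> list:
    """Collect the sibling hashes on the path from leaf idx to the root."""
    level, proof = [h(l) for l in leaves], []
    while len(level) > 1:
        proof.append(level[idx ^ 1])   # sibling at this level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(root: bytes, leaf: bytes, idx: int, proof: list) -> bool:
    """A stateless validator needs only the root, not the full state."""
    acc = h(leaf)
    for sibling in proof:
        acc = h(acc + sibling) if idx % 2 == 0 else h(sibling + acc)
        idx //= 2
    return acc == root

state = [b"alice:10", b"bob:3", b"carol:0", b"dan:7"]   # toy state leaves
root = merkle_root(state)
assert verify(root, b"bob:3", 1, prove(state, 1))
```

The proof grows logarithmically with the state size, so a command can carry everything a validator needs while the validator itself stores only a single hash.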

Acknowledgment. We would like to thank Ling Ren, Kartik Nayak, Alin Tomescu, Pratyush Mishra, Louis Guthmann, and John Adler for helpful feedback on this post.

Please leave comments on Twitter