Shoal++: High Throughput DAG-BFT Can Be Fast!

TL;DR: Shoal++ is a novel DAG-BFT system that supercharges Shoal to achieve near-optimal theoretical latency while preserving the high throughput and robustness of state-of-the-art certified DAG BFT protocols.

We evaluated Shoal++ against state-of-the-art DAG BFT protocols, such as Bullshark and Shoal — as well as a concurrent DAG effort, Mysticeti (similar to Cordial Miners). Our results show that Shoal++ outperforms all of them, even in failure-free scenarios, reducing state-of-the-art latency by up to 60% and achieving sub-second latency even at 100k tps throughput.

DAG-BFT Background

The core idea of DAG-BFT is to decouple consensus logic from data dissemination, which is done in parallel in an effort to fully utilize available system bandwidth. We explain the motivation and design of DAG-BFT protocols in two previous blog posts: DAG Meets BFT and Shoal.

DAG construction

Shoal++ follows the certified DAG structure proposed in Narwhal, which uses the round-based DAG first intreduced in Aleph. Validators collaboratively proceed through a series of structured rounds, forming a DAG in the process. Each round consists of one node per validator (a proposal), each of which contains a set of transactions, and at least $n-f$ unique reference links (edges) to nodes of the preceding round. Each node is certified via a consisted broadcast before being added to the DAG. The figure below illustrates a Narwhal-based DAG construction.

The consisted broadcast process to certify and add node to the DAG requires three steps, for a total of three message delays:

A validator broadcasts its node proposal
Other validators vote for the proposal, and the original validator aggregates a certificate consisting of 2f+1 signatures on the node hash.
The validator broadcasts its certified node, and recipients add the node to their local view of the DAG.

The certification process, hence a certified DAG, guarantees three desirable properties:

It guarantees that Byzantine validators cannot equivocate, which simplifies the consensus logic
The DAG grows at network speed since a node with a certificate can be immediately added to the DAG even if its causal history is missing.
Fetching missing nodes is done off the critical path and is better load-balanced according to the certificate signers.

Embedded consensus

Modern DAG-based protocols leverage the structured DAG rounds to implicitly embed consensus: validators simply interpret their local view of the DAG, allowing them to reach consensus without exchanging additional messages. A more detailed description of state-of-the-art protocols, Bullshark and Shoal, can be found in the previous posts DAG Meets BFT and Shoal. We provide here a short overview of the Bullshark protocol.

Every odd round in Bullshark has a pre-defined anchor node, which simulates a leader. The protocol first agrees on what anchors to commit and then orders their respective causal histories according to the committed order. There are two ways to commit an anchor:

An anchor in round r is directly committed if it gets $f+1$ votes, i.e., $f+1$ nodes in round $r+1$ link to it.
An anchor in round r is indirectly committed if the first (directly or indirectly) committed anchor in a round $r’ > r$ has a path to it.

A direct commit requires 2 DAG rounds to commit an anchor; indirect commitment require several more rounds. The figure below demonstrates the Bullshark commit rule. Validator 4 directly commits anchor A1 since it has $2=f+1$ votes in its local DAG. Validator 1 indirectly commits anchor A1 since the next (directly or indirectly) committed anchor has a path to A1.

Latency breakdown

We measure transaction end-to-end (e2e) latency in a consensus system as the time between a validator first receiving the transaction and the time the transaction is committed (ordered+all data is available and log prefix is committed). In a DAG BFT system, this latency can be broken into three sequential phases:

Queuing latency: the time it takes for a transaction to be proposed, i.e., included in a broadcasted node.
Anchoring latency: the time it takes for a non-anchor node to be linked by an eventually committing anchor.
Anchor Commit latency: the time it takes to commit an anchor node.

To aid in illustration of these phases we will use Bullshark as an example.

Queuing latency: It takes 3 message delays to certify and add a node to the DAG in Bullshark. Therefore, a validator proposes a new node every 3 message delays. As a result, under uniform distribution, a transaction submitted to a validator must wait, on average, 1.5 message delays before being proposed.

Anchoring latency: The Bullshark protocol has a pre-defined anchor in every odd round. The rest of the nodes are ordered as part of a committed anchor’s causal history, which leads to higher latency. Specifically, nodes in even rounds require at least an additional DAG round (resulting in an extra 3 message delays), and non-anchor nodes in odd rounds require at least two additional DAG rounds (resulting in an extra 6 message delays).

Anchor Commit latency: The Bullshark protocol requires 2 DAG rounds (6 message delays) to commit an anchor in the good case — one DAG round of certifying and broadcasting the anchor and another round for the voting nodes to become certified.

Introducing Shoal++

In Shoal++ we independently tackle each of the DAG BFT latency phases reducing the best case e2e latency to only 4.5 message delays, which is 7.5 message delays fewer than Bullshark. More formally,

During failure free synchronous network periods, the e2e average transaction latency is $4.5\delta$, where $\delta$ is the network messgae delay.

We describe the key ideas below, while more practical consideration can be found in the paper.

Fast anchors

Our first observation is that the direct Bullshark commit rule can be optimized in common-case scenarios. This perhaps comes as a surprise as Bullshark has been subject to multiple iterations and is running in a production system for over a year now. The idea is simple: rather than only commit upon observing $f+1$ certified nodes in round $r+1$ that link to an anchor (votes) in round $r$, Shoal++ also allows validators to commit anchors upon observing $2f+1$ (uncertified) node proposals.

The safety argument is straightforward. Since there are at most 𝑓 Byzantine validators, receiving $2f+1$ uncertified nodes implies that $f+1$ of them will eventually be certified, and thus the direct commit rule is guaranteed to be fulfilled. This optimization reduces the direct commit latency in the good case from 6 to 4 message delays: 3 message delays to certify and add the anchor to the DAG and another delay to broadcast the uncertified nodes (the first step of the certification process). A similar idea was discussed in a previous sailfish post.

More anchors

The Shoal paper introduced anchor pipelining and discussed the theoretical concept of having multiple anchors per round. In Shoal++, we transform this idea into a practical solution to reduce anchoring latency. By treating all nodes as anchors, we can potentially save 3 message delays by eliminating anchoring latency entirely.

In essence, the idea is to run a new Bullshark instance for each node, where each node serves as the instance’s first predefined anchor. Each Bullshark instance acts as a single-shot consensus mechanism to decide whether to commit or skip the associated node. Skipped nodes are committed as part of a committed anchor’s causal history, as before.

The practical challenge is that increasing the number of anchors per round also raises the likelihood that one of them cannot be directly committed. Since all validators must follow the same deterministic order, an anchor that cannot be directly committed introduces higher latency for the subsequent anchors in the round. To address this issue, Shoal++ incorporates the anchor reputation mechanism from Shoal and introduces a round timeout. Specifically, it uses the reputation system to filter out nodes associated with crashed or extremely slow validators and employs a small round timeout to advance the DAG in a synchronized manner. This approach results in slightly slower DAG construction but significantly lowers end-to-end latency by eliminating anchoring latency for most nodes.

More DAGs

Finally, Shoal++ minimizes queuing latency by operating multiple DAGs in parallel. These DAGs are staggered using a small offset, allowing waiting transactions that miss a round to be quickly included in the next DAG. Each DAG uses the same internal ordering logic, and the global ordering logic interleaves their outputs in a round-robin fashion. This approach reduces the average queuing latency from 1.5 to 0.5 message delays.

Evaluation

Experimental Setup We used Google Cloud to mimic the deployment of a globally decentralized network. Our testbed consists of 100 validators spread evenly across 10 regions in 6 continents around the world with machine specs similar to those used by production blockchains and qualifies as commodity grade. More details can be found in the paper.

Baselines We compare Shoal++ against three popular, high throughput DAG-BFT consensus protocols, Bullshark, Shoal, and Mysticeti, as well as one traditional low-latency traditional-style BFT protocol, Jolteon. Bullshark and Shoal represent state-of-the-art certified DAG-based protocols, while Mysticeti (a practical implementation of the Cordial Miner protocol) is a concurrent work to Shoal++ proposing an uncertified design to reduce latency. Jolteon is a state-of-the-art leader-based BFT protocol, which improves Hotstuff latency by 33%. For Mysticeti, we run the publicly available source code referenced from the paper. We implemented the other protocols in our codebase. All protocols are written in Rust, utilizing the Tokio asynchronous runtime. All prototypes except Mysticeti persist consensus data to storage.

Results The graph below compares the protocol performance under the no-failure case. More experiments can be found in the paper. Each data point represents the 50 percentile with error bars representing the 25, and 75th percentiles. Shoal++ outperforms all protocols and achieves sub-second latency even at 100k tps throughput.

Certified VS Uncertified DAGs

A recent group of protocols propose the design of uncertified DAGs. Such constructions can, in theory, reduce latency but in practice they are prone to data fetching on the critical path. One notable example is Mysticeti (which implements a similar protocol to the one proposed in Cordial Miners; and with multiple anchors per round as proposed in Shoal). Specifically, they replace the certification process in each round with a best-effort broadcast, reducing the latency to add a node to the DAG to 1 message delay. The embedded consensus on top resembles PBFT.

The main issue is that edges to uncertified, yet locally unavailable nodes cause fetching on the critical path. In other words, a node cannot be added to the DAG until all its casual history is locally available. This is because the DAG no longer has the proof of availability guarantee provided by the node certificates. In theory, such fetching can result in an additional 2 message delay in each round, while in practice it might be even worse since too much fetching creates bottlenecks in the system.

The figure below illustrates the fetching issue in uncertified DAGs. Solid squares represent locally available nodes to validator 1, out of which the black squares represent its local DAG. In an uncertified DAG, validator 1 cannot advance to round 2 and broadcast node n’’ since node n’ cannot be added to the DAG yet. This is because node n has not yet been received.

Note that validator 3 might be completely honest and yet suffer due to validator 4’s inability to disseminate its node, which in turn might be just a slow honest validator. A single byzantine validator, however, might completely destroy the protocol latency by sending its nodes to an arbitrary subset of validators. A similar effect can be caused by insatiable networks with a non-negligible probability of packet loss.

Conclusion

Shoal++ aims to demonstrate that certification is not the root cause of high latencies and that latency can be significantly reduced without sacrificing DAG BFT robustness. By incorporating three orthogonal latency optimizations on top of Shoal, Shoal++ manages to achieve near-optimal latency while maintaining high throughput and robustness, even under less favorable networking conditions.