Since 2018, blockchain research has seen a surge in DAG-based BFT protocols aimed at higher throughput. These protocols trace back to Hashgraph (2018) and Aleph (2019), the theoretical work of All You Need is DAG (2021), and the systems papers Narwhal and Bullshark.

In this post, we highlight 6 aspects of this body of research:

  1. The separation of dispersal and consensus
  2. Dispersal is live in asynchrony and DACs can be combined
  3. Single-disperser BFT vs multi-disperser BFT
  4. Certified vs Uncertified DAGs
  5. Virtual voting: piggybacking the consensus on the DAG for free
  6. Ordering transactions

1. The separation of dispersal and consensus

The separation of consensus and execution improves modularity. Here we focus on the separation of dispersal and consensus (a minimal sketch follows the steps below):

  1. Disperse the data and obtain a short data availability certificate (DAC) that proves the data can be retrieved; the DAC is bound to the data.
  2. Reach consensus on the DAC.
  3. Retrieve the data using the DAC, obtaining totality.
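
Below is a minimal end-to-end sketch of this three-step pattern. It is purely illustrative: the names (`Party`, `disperse`, `retrieve`, `DAC`), the quorum choice, and the in-memory "network" are our own simplifications, not any particular system's implementation.

```python
# Minimal in-memory sketch of the disperse / consensus-on-DAC / retrieve pattern.
import hashlib
from dataclasses import dataclass

N, F = 4, 1                    # n parties, up to f faults
QUORUM = N - F                 # n - f acknowledgements form a DAC

@dataclass(frozen=True)
class DAC:
    digest: str                # binds the certificate to the dispersed data
    acks: frozenset            # ids of parties that acknowledged storing the data

class Party:
    def __init__(self, pid):
        self.pid, self.store = pid, {}

    def receive(self, digest, data):
        self.store[digest] = data      # keep a copy, indexed by digest
        return self.pid                # acknowledge

def disperse(data: bytes, parties) -> DAC:
    digest = hashlib.sha256(data).hexdigest()
    acks = frozenset(p.receive(digest, data) for p in parties[:QUORUM])
    return DAC(digest, acks)           # short certificate; the data itself stays behind

def retrieve(dac: DAC, parties) -> bytes:
    for p in parties:                  # any party holding the digest suffices in this toy setting
        if dac.digest in p.store:
            return p.store[dac.digest]
    raise RuntimeError("data unavailable despite a DAC")

parties = [Party(i) for i in range(N)]
dac = disperse(b"batch of transactions", parties)  # step 1: disperse, keep only the short DAC
# step 2: run consensus on `dac` alone (not shown); the DAC is small and constant-size
print(retrieve(dac, parties))                      # step 3: retrieve the full data via the DAC
```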

This pattern of dispersal and retrieval using a DAC is called provable verifiable information dispersal (provable VID). Provable VID can be implemented as provable broadcast, or using error-correcting codes for better message complexity at the cost of worse latency. The latter approach seems essential for getting $O(1)$ scaling of blockchains.
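
To see why error-correcting codes help with message complexity, here is a back-of-the-envelope comparison of the disperser's outgoing bandwidth. The parameters and helper names are illustrative; we assume $n = 3f+1$ and a code with $n-2f$ data shares, so that any $n-f$ responses contain enough correct shares to reconstruct.

```python
# Back-of-the-envelope disperser bandwidth: full replication vs. erasure-coded dispersal.
# Assumption: n = 3f + 1 and a code with n - 2f data shares, so any n - f responses
# contain at least n - 2f correct shares and the block can be reconstructed.
def replicated_bytes(n, block_size):
    return n * block_size                      # send the whole block to every party

def erasure_coded_bytes(n, f, block_size):
    share_size = block_size / (n - 2 * f)      # each party receives one coded share
    return n * share_size                      # roughly 3x the block size when n = 3f + 1

n, f, block = 100, 33, 1_000_000               # 100 parties, 1 MB block
print(replicated_bytes(n, block))              # 100000000: grows linearly with n
print(erasure_coded_bytes(n, f, block))        # ~2941176: a constant factor, independent of n
```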

Yet another option is to disperse via provable reliable broadcast, which obtains totality and removes the need for a retrieval protocol. This option, as is, requires a higher message complexity at dispersal but lower overall latency. The message complexity can be improved by using message-efficient reliable broadcast (which uses error-correcting codes).

2. Dispersal is live in asynchrony and DACs can be combined

  • Dispersal is live in asynchrony: dispersal does not solve consensus (it is a single-writer object), so it can make progress in asynchrony (and is not limited by the FLP lower bound).

  • DACs can be combined: a set of data availability certificates can be aggregated as part of the content of a new dispersal. This new combined dispersal again has a small DAC, but retrieving it recursively retrieves all the previously dispersed data (see the sketch below).
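
A small sketch of this combination, building on the illustrative `Party` / `disperse` / `retrieve` code above (again, our own simplification rather than any system's implementation):

```python
# Builds on the illustrative Party / disperse / retrieve sketch above.
import json

def disperse_combined(new_data: str, child_dacs, parties) -> DAC:
    # The payload carries only the *short* child certificates, not the data they
    # certify, so the new combined DAC itself stays small.
    payload = json.dumps({"data": new_data,
                          "children": [c.digest for c in child_dacs]}).encode()
    return disperse(payload, parties)

def retrieve_recursive(digest: str, parties) -> list:
    # Fetch the payload for this digest from any party storing it, then recurse.
    raw = next(p.store[digest] for p in parties if digest in p.store)
    payload = json.loads(raw)
    blocks = [payload["data"]]
    for child in payload["children"]:
        blocks += retrieve_recursive(child, parties)
    return blocks

d1 = disperse_combined("batch-1", [], parties)
d2 = disperse_combined("batch-2", [], parties)
d3 = disperse_combined("batch-3", [d1, d2], parties)        # one new dispersal, two old DACs
print(retrieve_recursive(d3.digest, parties))               # ['batch-3', 'batch-1', 'batch-2']
```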

Some systems make no progress in asynchrony because they run dispersal in lock-step with consensus. These systems then suffer a performance degradation when the network becomes synchronous again, as they need to work through the backlog. In contrast, systems that separate dispersal from consensus can continue to disperse in asynchrony and then commit the whole batch once synchrony returns. Indeed, many DAG-based protocols, as well as protocols such as Autobahn, do this.
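
As a rough illustration of this decoupling (reusing the toy `disperse` sketch above; the loop structure is ours, not a specific system's): dispersal keeps producing DACs during asynchrony, and a single later proposal commits the whole backlog.

```python
from collections import deque

pending_dacs = deque()     # backlog of certificates produced while consensus is stalled

def dispersal_loop(batches, parties):
    # Dispersal is a single-writer object: it keeps making progress under asynchrony.
    for batch in batches:
        pending_dacs.append(disperse(batch, parties))

def consensus_step():
    # Invoked only once consensus can decide again (e.g., after synchrony returns):
    # a single proposal carries every buffered DAC, committing the whole backlog at once.
    proposal = list(pending_dacs)
    pending_dacs.clear()
    return proposal

dispersal_loop([b"b1", b"b2", b"b3"], parties)   # progress made during asynchrony
print(len(consensus_step()))                     # 3: one decision commits all three batches
```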

3. Single-disperser BFT vs Multi-disperser BFT

BFT consensus protocols have a single proposer (aka primary or leader) in each view. Separating the consensus from the dispersal opens two options:

  • In a single-disperser BFT this primary is both the single proposer and the single disperser of the view.

  • In a multi-disperser BFT multiple parties act as dispersers concurrently. The single proposer then combines multiple DACs.

Examples of multi-disperser BFT include All You Need is DAG, Narwhal, Bullshark, Shoal, Sailfish, Shoal++, Cordial Miners, Star, Autobahn, and Mysticeti. They show how using multiple dispersers can give a throughput boost and obtain $O(n)$ amortized communication complexity in a setting where each disperser has an $O(n)$-size non-overlapping set of transactions it wants to commit.

We note that a similar throughput boost can be obtained in a single-disperser BFT when it disperses using error-correcting codes (cf. Figure 1 in the Pipes paper).

When do multiple dispersers help?

Multiple dispersers help only if their transaction sets are mostly disjoint. If they overlap heavily, bandwidth is wasted since each transaction is dispersed repeatedly.

From the client perspective (assuming clients have disjoint transactions), the ideal case is that each client uses a single non-faulty disperser, giving a real bandwidth boost.

However, if $f$ dispersers are faulty, in the worst case each client must send its transactions to $f+1$ dispersers. Without coordination, this repeats every transaction $f+1$ times and multiple dispersers may still disperse the same transactions redundantly.
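
Toy arithmetic for this duplication effect, assuming $n = 3f+1$ dispersers that can each disperse $r$ transactions per second (all numbers and names are illustrative):

```python
# Toy arithmetic for the duplication effect; all numbers and names are illustrative.
def effective_throughput(num_dispersers, rate_per_disperser, copies_per_tx):
    # Useful transactions per second after discounting duplicate dispersals.
    return num_dispersers * rate_per_disperser // copies_per_tx

n, f, r = 100, 33, 1_000                        # n = 3f + 1 dispersers, r tx/s each
print(effective_throughput(n, r, 1))            # 100000: disjoint sets, one good disperser per client
print(effective_throughput(n, r, f + 1))        # 2941: worst case, every tx sent to f + 1 dispersers
```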

4. Multi-disperser BFT: Certified and Uncertified Directed Acyclic Graph (DAG)

Certified DAG: Aleph initiated this pattern, where each party waits to receive $n-f$ round $i$ DACs from $n-f$ different dispersers and combines them (along with any new data) into a new round $i+1$ dispersal. Recent works like Sailfish and Shoal++ focus on improving latency for certified DAGs.

Uncertified DAG: Hashgraph initiated the pattern where each party waits for $n-f$ signed round $i$ messages and combines those (along with any new data) to create a signed round $i+1$ message. This is often called an uncertified DAG and is used in Cordial Miners, Mysticeti, and Mahi-Mahi.
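
The round-advancement rule common to both patterns can be sketched as follows. This is our own simplified framing: the parents are DACs in a certified DAG and signed messages in an uncertified DAG, but the $n-f$ threshold is the same.

```python
# Simplified round-advancement rule: a party creates its round i+1 vertex only after
# collecting n - f round-i parents from distinct parties (names are illustrative).
from dataclasses import dataclass

N, F = 4, 1

@dataclass
class Vertex:
    author: int
    round: int
    parents: tuple         # references to n - f vertices of the previous round
    payload: bytes = b""

def try_advance(my_id: int, round_i_vertices: list, new_data: bytes):
    # round_i_vertices: the round-i vertices received so far.
    authors = {v.author for v in round_i_vertices}
    if len(authors) < N - F:
        return None                              # not enough distinct parents yet, keep waiting
    parents = tuple(round_i_vertices[:N - F])
    return Vertex(my_id, round_i_vertices[0].round + 1, parents, new_data)

genesis = [Vertex(a, 0, ()) for a in range(N)]
print(try_advance(0, genesis[:2], b"tx"))        # None: only 2 of the required 3 parents
print(try_advance(0, genesis, b"tx").round)      # 1: the party advances to round 1
```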

Data Synchronization for DAGs

In DAGs the dispersed data includes not just the new data from round $i$ but also data from all prior rounds ($i-1$, $i-2$, etc.). If reliable broadcast is not used for dispersal then some parties may not have all prior data from earlier rounds, and so a synchronization process is needed to fetch missing data.

For uncertified DAGs, Byzantine parties can attack the system by selectively sending data to some parties while withholding that data from other parties, forcing these parties to synchronize on the critical path. For certified DAGs that use DACs, parties do not have to synchronize on the critical path, but they have to recursively fetch missing data from the DAG before execution, which may require multiple network round trips.

Autobahn is an example of a system that achieves non-blocking synchronization and avoids recursive synchronization by carefully designing the dispersal protocol so that retrieval for an arbitrarily large amount of data requires just a single round-trip.

5. Virtual voting: piggybacking the consensus on the DAG for free

The idea of virtual voting is to piggyback the consensus messages onto the DAG messages. It’s a throughput-latency trade-off:

The additive bandwidth overhead of virtual voting for consensus is zero. However, virtual voting has latency disadvantages: a consensus protocol may need different timeouts or different triggers (for example, not just arbitrary $n-f$ messages, but a message from some designated leader or messages with some specific content). Moreover, some virtual voting protocols on certified DAGs may add unneeded queuing latency (if a block “misses the bus”).
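
As a highly simplified illustration of virtual voting, building on the `Vertex` sketch above (our own framing, not any specific protocol's commit rule): DAG edges double as votes, so committing a leader costs no extra messages.

```python
def leader_votes(leader_vertex, next_round_vertices):
    # A round r+1 vertex "votes" for the round-r leader simply by referencing it.
    return sum(leader_vertex in v.parents for v in next_round_vertices)

def committed(leader_vertex, next_round_vertices, threshold):
    # `threshold` is protocol-specific (for example f + 1 or 2f + 1 references).
    return leader_votes(leader_vertex, next_round_vertices) >= threshold

round1 = [try_advance(a, genesis, b"") for a in range(N)]   # every party advanced to round 1
leader0 = genesis[0]                                        # say party 0 is the round-0 leader
print(committed(leader0, round1, threshold=F + 1))          # True, with zero extra messages
```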

6. Ordering transactions

Most DAG protocols maintain a typical DAG structure where every block in a view (round) references a super-majority of blocks in the previous view. When committing, despite the multiple dispersers, there is a single proposer (leader) in each view. Thus, the parties only agree on whether the proposer of view $v$ proposed a block in that view. Given a sequence of ordered proposer blocks across views, non-proposer blocks can be ordered among them using any deterministic algorithm. The super-majority block references from view $v$ to blocks of the previous view $v-1$ ensure that sufficiently many non-proposer blocks from $v-1$ are ordered before the block of view $v$.
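
A sketch of this ordering step, reusing the illustrative `Vertex` structure from above (the tie-breaking rule here, by round then author, is just one possible deterministic choice):

```python
def causal_history(vertex):
    # Collect the vertex and everything reachable through its parent references.
    seen, stack, out = set(), [vertex], []
    while stack:
        v = stack.pop()
        if id(v) in seen:
            continue
        seen.add(id(v))
        out.append(v)
        stack.extend(v.parents)
    return out

def order_dag(committed_leaders):
    # For each committed leader block (in commit order), flush its not-yet-ordered
    # causal history before it, using a deterministic tie-break.
    ordered, delivered = [], set()
    for leader in committed_leaders:
        pending = [v for v in causal_history(leader) if id(v) not in delivered]
        pending.sort(key=lambda v: (v.round, v.author))
        ordered.extend(pending)
        delivered.update(id(v) for v in pending)
    return ordered

committed_leader = try_advance(0, genesis, b"tx")            # a committed round-1 proposer block
print([(v.round, v.author) for v in order_dag([committed_leader])])
# [(0, 0), (0, 1), (0, 2), (1, 0)]: round-0 blocks are ordered before the round-1 leader
```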

Acknowledgements

This post follows insightful discussions at the Dagstuhl seminar on Next-Generation Secure Distributed Computing. We would like to thank the organizers and participants for their valuable feedback.

We would like to thank Adi Seredinschi and Nibesh Shrestha for useful feedback on the post.

Your thoughts and comments on X.