In this blog post, we will explain the core ideas behind Sailfish, a latency-efficient DAG-based protocol. In essence, Sailfish is a reliable-broadcast (RBC) based DAG protocol that supports a leader vertex in every RBC round. It commits leader vertices within 1 RBC + $\delta$ time and non-leader vertices within 2 RBC + $\delta$ time, outperforming the state-of-the-art in terms of these latencies (where $\delta$ represents the actual network delay).
What are the key characteristics of DAG-based protocols?
Before we dive into the details, let’s pause and ask a more basic question: Why do we need DAG-based protocols when there are leader-based protocols? Or, what are the key characteristics of DAG-based protocols compared to leader-based protocols?
Leader-based protocols. The literature on BFT protocols has progressed significantly over the past four decades. Many of these protocols, especially those designed for partially synchronous networks, use the Phase-King paradigm. These protocols are often referred to as leader-based protocols. In this paradigm, the protocols operate in iterations, and in each iteration, there is a designated party known as the leader. The protocol ensures that a Byzantine leader cannot cause a consistency violation. On the other hand, an honest leader sends a proposal to all other parties, and when the network is synchronous, this proposal enables all honest parties to reach a decision within a few rounds.
Several protocols, including PBFT, Tendermint, HotStuff, Streamlet, Simplex, and HotStuff-2 use this approach. These protocols aim to optimize two key properties: latency and communication complexity. Notably, they have achieved optimality in these metrics. For instance, PBFT, Tendermint and Simplex can achieve the optimal latency of $3\delta$ when the leader is honest. HotStuff and HotStuff-2 achieve optimal communication complexity of $O(n^2)$ words for consensus on a single value and $O(n)$ words when the leader is honest.
Given that these protocols already obtain optimality, haven’t we achieved all that we need?
In practice, throughput, i.e., the number of transactions decided in a given span of time, is a crucial metric. While we intuitively link communication efficiency to throughput (assuming that low communication complexity implies high throughput), there are additional factors to consider. Specifically, in a vanilla leader-based protocol, the leader can become a bottleneck in two ways:
(1) Unequal scheduling of work. While the leader is sending a proposal, the other parties’ processors and their networks are underutilized, leading to uneven resource usage across parties. For instance, if the leader is sending a 1MB block to a thousand parties, it is transmitting about 1GB of data while each other party is only receiving 1MB. This issue is also discussed in this post.
(2) Lack of progress when the leader fails. In leader-based protocols, if the leader fails, progress halts until it is replaced. Consequently, throughput drops during the leader replacement period.
These concerns are well-known, and several approaches have been discussed to address them. There is a long line of work on extension protocols (GP’16, NRSVX’20, BLLN’22, S’23), which extend 1-bit consensus protocols to arbitrary $\ell$-bit consensus protocols efficiently using erasure coding techniques. Similarly, data availability oracles such as Danksharding, Tiramisu, EigenDA, and others also rely on erasure coding techniques. Finally, Narwhal is a systems-level optimization that separates transaction dissemination (using separate worker nodes) from consensus. These approaches somewhat mitigate the first concern but not the second.
DAG-based protocols. DAG-based protocols address both of these concerns in a more natural manner. These protocols require every party to disseminate transactions independently without relying on a leader. The resulting data structure maintained by each party forms a DAG, giving rise to the term “DAG-based protocols”. With respect to scheduling, each party performs a similar amount of work at all times, effectively solving the problem of uneven resource usage. Additionally, because each party is responsible for disseminating its own transactions, the protocol continues to progress in constructing the DAG even if a party fails during a round. Although these protocols typically have higher communication complexity compared to leader-based counterparts, they often achieve better throughput in practice for a moderate number of parties (e.g., Bullshark).
A note on the terminology. We should note that while these sets of protocols are generally termed “leader-based” or “DAG-based”, these terminologies can be somewhat misleading. Many leader-based protocols construct chains, which are technically also DAGs (and can potentially be misclassified as DAG-based). At the same time, most DAG-based protocols rely on a leader to drive consensus (and thus, perhaps could be called leader-based).
Finally, while DAG-based protocols naturally provide the two properties described earlier, this does not imply that non-DAG-based protocols cannot achieve these properties and correspondingly obtain better throughput. For instance, asynchronous protocols (e.g., HoneyBadger BFT) also exhibit both of these properties.
Sailfish: Towards Improving the Latency of DAG-based Protocols
We will now introduce Sailfish, a partially synchronous DAG-based protocol designed to tolerate $t < n/3$ faults while prioritizing low latency.
Sailfish data structure. The protocol progresses through a series of rounds. In each round $r$, each party makes a proposal, represented as a DAG vertex. The vertex includes references to at least $2t+1$ vertices proposed in round $r-1$ (where $t$ is the maximum number of Byzantine faults). These references form the edges of the DAG. The edges and paths formed from these edges are used for committing vertices in the DAG. Sailfish relies on a reliable broadcast protocol (RBC) to disseminate the vertices; RBC ensures non-equivocation (two honest parties do not disagree on the output) and totality (if some honest party delivers, all honest parties will deliver).
Each party $P_i$ maintains a local copy of the DAG, indexed by round number. When a round $r$ vertex is delivered (via RBC), it is added to $DAG_i[r]$. Although different honest parties may initially observe different views of the DAG, the reliable broadcast of the vertices ensures that all honest parties eventually converge on the same view of the DAG.
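To make this concrete, here is a minimal Python sketch of a vertex and the per-party local DAG indexed by round. The class and field names are illustrative, not the paper’s actual interfaces.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Vertex:
    """A DAG vertex proposed by one party in some round (illustrative fields)."""
    round: int
    author: int                            # index of the proposing party
    block: bytes                           # payload: the party's batch of transactions
    edges: FrozenSet[bytes]                # ids of >= 2t+1 round r-1 vertices (the DAG edges)
    no_vote_cert: Optional[object] = None  # a leader vertex may carry a round r-1 no-vote certificate

class LocalDAG:
    """Party P_i's local copy of the DAG, indexed by round number."""
    def __init__(self) -> None:
        self.by_round: dict[int, dict[bytes, Vertex]] = {}

    def add(self, vid: bytes, v: Vertex) -> None:
        # Called when an RBC instance delivers v; RBC's non-equivocation and totality
        # ensure all honest parties eventually hold the same vertices in DAG_i[r].
        self.by_round.setdefault(v.round, {})[vid] = v
```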
DAG construction. In each round $r$, each party $P_i$ proposes one vertex. To propose a vertex in round $r$, party $P_i$ waits until it has received at least $2t+1$ round $r-1$ vertices along with the round $r-1$ leader vertex, or until a round $r-1$ timeout expires. If the party receives $2t+1$ round $r-1$ vertices along with the round $r-1$ leader vertex, it can immediately enter round $r$ and propose a round $r$ vertex. Including a reference to the round $r-1$ leader vertex serves as a vote for the round $r-1$ leader vertex and is used to commit that leader vertex.
On the other hand, if the party does not receive the round $r-1$ leader vertex before the timeout, it sends a round $r-1$ no-vote message to the round-$r$ leader, indicating that it did not vote for the round $r-1$ leader vertex.
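The round-advancement rule can be sketched as follows. The helper callbacks (`leader_of`, `timer_expired`, `send_no_vote`, `propose`) are placeholders for the real protocol machinery, not names from the paper.

```python
def try_advance(received, r, t, leader_of, timer_expired, send_no_vote, propose):
    """Attempt to enter round r and propose P_i's round-r vertex.

    `received[r-1]` is the set of round r-1 vertices delivered so far (via RBC)."""
    prev = received[r - 1]
    has_leader = any(v.author == leader_of(r - 1) for v in prev)

    if len(prev) >= 2 * t + 1 and has_leader:
        # Referencing the round r-1 leader vertex doubles as a vote for it.
        propose(round=r, edges=prev)
        return True

    if len(prev) >= 2 * t + 1 and timer_expired(r - 1):
        if not has_leader:
            # Tell the round-r leader that we did not vote for the round r-1 leader vertex.
            send_no_vote(round=r - 1, to=leader_of(r))
        propose(round=r, edges=prev)
        return True

    return False
```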
In Sailfish, there is a specific constraint on the leader vertex. A round $r$ leader vertex needs to either have a path to the round $r-1$ leader vertex or include a round $r-1$ no-vote certificate, which is a quorum of round $r-1$ no-vote messages. The no-vote certificate serves as proof that a quorum of parties did not vote for the round $r-1$ leader vertex. Hence, the round $r-1$ leader vertex cannot be directly committed, and it is safe for the round $r$ leader vertex to lack a path to it. This requirement, that the round $r$ leader vertex either have a path to the round $r-1$ leader vertex or contain a no-vote certificate, is the key idea enabling Sailfish to support a leader vertex in each round.
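A receiving party would enforce this constraint before treating a round-$r$ leader vertex as valid. A hedged sketch of that check (the helpers `has_path` and `valid_no_vote_cert` are placeholders):

```python
def valid_leader_vertex(v, prev_leader_vertex, has_path, valid_no_vote_cert, t):
    """A round-r leader vertex v is valid only if it has a path to the round r-1
    leader vertex, or it carries a round r-1 no-vote certificate (a quorum of
    2t+1 round r-1 no-vote messages)."""
    if prev_leader_vertex is not None and has_path(v, prev_leader_vertex):
        return True
    return (v.no_vote_cert is not None
            and valid_no_vote_cert(v.no_vote_cert, round=v.round - 1, quorum=2 * t + 1))
```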
Committing and ordering the DAG. In Sailfish, only leader vertices are committed, while non-leader vertices are ordered as part of the causal history of the leader vertices in some deterministic order.
To commit the leader vertex with a latency of 1 RBC + $\delta$, Sailfish leverages the observation that when the sender (of the RBC) is honest, the first value received from the sender is the value that is eventually delivered. However, if the sender is Byzantine, the final delivered value can differ from the first value received. To account for this potential Byzantine behavior, an honest party $P_i$ directly commits the round $r$ leader vertex only when it observes $2t+1$ “first messages” (of the RBC) for round $r+1$ vertices that have paths to the round $r$ leader vertex. At least $t+1$ of those messages come from honest parties, and an honest party that votes for the round $r$ leader vertex does not send a round $r$ no-vote message. This ensures that a round $r$ no-vote certificate cannot be formed, which in turn implies that the leader vertex of any round $r'>r$ must have a path to the round $r$ leader vertex.
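A sketch of this direct-commit rule, under the same illustrative interfaces as above (`first_messages(r + 1)` stands in for the set of first RBC messages received so far for round $r+1$ vertices):

```python
def try_direct_commit(r, leader_vertex, first_messages, references, t, commit):
    """Directly commit the round-r leader vertex once 2t+1 first RBC messages for
    round r+1 vertices reference it. At most t of these senders are Byzantine, so
    at least t+1 honest parties voted for the leader vertex and will never send a
    round-r no-vote message; hence no round-r no-vote certificate can form."""
    supporters = [m for m in first_messages(r + 1) if references(m, leader_vertex)]
    if len(supporters) >= 2 * t + 1:
        commit(leader_vertex)  # committed 1 RBC + delta after the leader vertex was proposed
        return True
    return False
```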
Upon directly committing a leader vertex $v_k$ in round $r$, $P_i$ first indirectly commits leader vertices $v_{m}$ in smaller rounds such that there exists a path from $v_k$ to $v_m$ (based on its local copy of the DAG), going back until it reaches a round $r' < r$ in which it previously directly committed a leader vertex. The protocol guarantees that when a round $r$ leader vertex $v_k$ is directly committed by an honest party, the leader vertex of any subsequent round $r' > r$ has a path to $v_k$, ensuring $v_k$ will be committed by all honest parties, either directly or indirectly.
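The commit-and-order step can be sketched as below: walk back from the directly committed leader vertex, indirectly committing earlier leader vertices it has a path to, then order each committed leader’s causal history deterministically (again, helper names are illustrative).

```python
def commit_and_order(vk, r, last_direct_round, leader_vertex_of, has_path,
                     causal_history, deterministic_order, deliver):
    """vk is the round-r leader vertex that was just directly committed;
    last_direct_round is the most recent round with a directly committed leader."""
    committed = [vk]
    for rr in range(r - 1, last_direct_round, -1):
        vm = leader_vertex_of(rr)
        if vm is not None and has_path(vk, vm):
            committed.append(vm)           # indirect commit of an earlier leader vertex
    for leader in reversed(committed):     # order older committed leaders first
        for u in deterministic_order(causal_history(leader)):
            deliver(u)                     # deliver vertices not already ordered earlier
```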
Key Argument for Consistency
Since honest parties will eventually have the same DAG structure, the crux of the consistency analysis is to ensure that no two honest parties disagree on the set of leader vertices committed for each round. Observe that leader vertices can be directly committed using the commit rule described earlier, or indirectly as a consequence of directly committing a leader vertex in a higher round. Once the set of committed leader vertices is agreed upon, the ordering of non-leader vertices follows.
Claim: If an honest party directly commits the round $r$ leader vertex $v$, then the round $r'$ leader vertex (if it exists) has a path to $v$ for every $r' > r$.
Suppose $P_i$ directly committed $v$ in round $r$. Then there exists a set $Q$ of $2t+1$ round-$(r+1)$ vertices with paths to $v$. Let $H \subseteq Q$ be the set of vertices in $Q$ proposed by honest parties. Let $v_{\ell}$ be the round $r'$ leader vertex. The claim can be shown in two parts: first, showing that it holds for $r' = r+1$; then, showing that it holds for any $r' > r+1$.
Part (i): $r' = r+1$. If $v_{\ell} \in H$, we are trivially done. Otherwise, the vertices in $H$ are from round $r+1$ honest non-leader parties. When a round $r+1$ honest non-leader party includes a reference to leader vertex $v$, it does not send a round $r$ no-vote message. Since $|H| \ge t+1$, by a standard quorum intersection argument, any set of $2t+1$ round $r$ no-vote messages would have to include one from a party in $H$; hence, a round $r$ no-vote certificate does not exist. Moreover, the parties in $H$ have delivered $v$, so by the totality property, the round $r+1$ leader will eventually deliver $v$. Thus, if the round $r+1$ leader vertex exists, it must include a reference to $v$, and there exists a path from $v_\ell$ to $v$.
Part (ii): $r' > r+1$. Note that all vertices in $H$ will eventually be delivered by all honest parties and included in $DAG[r+1]$. Additionally, a round $r+2$ vertex has paths to at least $2t+1$ round $r+1$ vertices. By standard quorum intersection, these include at least one vertex in $H$, which has a path to $v$. Thus, all round $r+2$ vertices (including the round $r+2$ leader vertex) have a path to $v$. Moreover, each round $r'' > r+2$ vertex has paths to at least $2t+1$ vertices in round $r''-1$. By transitivity, each vertex in round $r''$ has a path to at least one vertex in round $r+2$, all of which have paths to $v$. This implies $v_{\ell}$ must have a path to $v$.
Commit latency. A round-$r$ leader vertex gets committed when $2t+1$ round-$(r+1)$ vertices have paths to it. Our protocol enables parties to commit the leader vertex as soon as they receive the first RBC messages for such round-$(r+1)$ vertices from $2t+1$ parties, without waiting for those vertices to be delivered through the RBC. As a result, the latency to commit a leader vertex is 1 RBC + $\delta$. A round-$r$ non-leader vertex is ordered as part of the causal history of a later leader vertex, so it must wait for one more RBC (the round in which it gets referenced) before it is committed, for a total of 2 RBC + $\delta$.
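As a rough accounting (assuming the relevant leaders are honest and no timers fire; this is a sketch, not the paper’s formal latency analysis):

$$
\begin{aligned}
\text{round-}r\text{ leader vertex: } & \underbrace{1\,\mathrm{RBC}}_{\text{parties deliver round-}r\text{ vertices and enter round }r+1} + \underbrace{\delta}_{\text{first RBC messages of round-}(r+1)\text{ vertices arrive}}\\[2pt]
\text{round-}r\text{ non-leader vertex: } & \underbrace{1\,\mathrm{RBC}}_{\text{delivered and referenced by round-}(r+1)\text{ vertices}} + \underbrace{1\,\mathrm{RBC} + \delta}_{\text{round-}(r+1)\text{ leader vertex gets committed}}
\end{aligned}
$$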
For completeness, here is how Sailfish’s latency compares with the state-of-the-art:
Almost signature-free and post-quantum safety. When a signature-free RBC is utilized, Sailfish requires signatures solely for forming no-vote certificates, which are needed only for liveness. In typical failure-free executions, it does not require signatures and provides post-quantum safety.
Evaluation. We compared the performance of Sailfish with state-of-the-art DAG-based protocols across 5 GCP regions spanning the US, Finland, Australia, and Japan. As shown in the figure below, Sailfish achieves a significant improvement in latency over prior art, consistent with our theoretical analysis. In the figure below, LV represents the average latency to commit the leader vertices, NLV represents the average latency to commit non-leader vertices proposed in the round immediately before the committed leader vertex, and Avg represents the average latency over all vertices (including those from prior rounds that were ordered when the leader vertex was committed).
Extension to Multi-leader Sailfish
In Sailfish, the latency to commit the leader vertex is shorter than that of the non-leader vertices. To improve the latency for more vertices, we extend Sailfish to support multiple leaders within a single round. In the best case, when all of these leaders are honest, their leader vertices can be committed with a latency of 1 RBC + $\delta$, which improves the average commit latency since more vertices are committed with this reduced latency.
For further insights into Sailfish and Multi-leader Sailfish, please refer to the paper. Please leave comments on Twitter.