In the standard distributed computing model, communication uncertainty is modeled by an adversary that controls message delays. The communication model defines the limits to the power of the adversary to delay messages.

There are three basic models: synchronous, asynchronous, and partial synchrony.

In the Synchronous model, there exists some known finite time bound $\Delta$. For any message sent, the adversary can delay its delivery by at most $\Delta$.

In the Asynchronous model, for any message sent, the adversary can delay its delivery by any finite amount of time. There is no delivery bound, but every message must eventually be delivered.

The Partial synchrony model (see DLS88) aims to find a middle ground between these two models. The assumption is that there exists some known finite time bound $\Delta$ and a special event called GST (Global Stabilization Time) such that:

  • The adversary must cause the GST event to eventually happen after some unknown finite time.
  • Any message sent at time $x$ must be delivered by time $\Delta + \max (x,GST)$.

Informally, the system behaves asynchronously until GST and synchronously afterward. Note that the adversary can delay the GST event by any finite amount of time and that no protocol can explicitly detect that GST has occurred. There is no external signal that tells you that GST happened.

At first thought, the Synchronous model may seem to be good enough. Why not just assume, for example, that any message sent over the internet will be delivered by say 2 minutes? First, there is a trade-off:

  • Setting a large and conservative $\Delta$ of say 10 minutes may indeed always faithfully model the real world. However, protocol designers who depend on $\Delta$ may incur very long timeouts and hence degrade performance.
  • Setting a small, aggressive $\Delta$ (e.g., 0.1s) may not always match the real world. This means that protocols whose safety depends on this bound may suffer safety violations in the real world.

Second, even if you think you found a magical sweet spot for $\Delta$, imagine a sender that broadcasts a message to two receivers, one message arrives after $\Delta - \epsilon$ time and the other after $\Delta + \epsilon$. Here the real world would behave differently than the model and this could again potentially cause safety problems.

To overcome the problems with synchrony, the asynchronous model forces protocol designers to assume nothing about network delays. The result is often robust protocols:

  • Since they do not depend on any time bound, message delays cannot cause unexpected safety violations.
  • Since they cannot use any fixed values for timeouts, they must inherently adapt to the actual latency of the system.

The main problem with the asynchronous model is that protocols in this model tend to be more complex and harder to reason about. Moreover, there are many known complexity gaps between synchrony and asynchrony. Two examples:

  • The celebrated Fischer, Lynch and Paterson 1985 lower bound says that in the asynchronous model, any protocol that solves consensus withstanding an adversary (who can fail-stop just one party) must have an infinite execution (see this blog post).
  • Authenticated Byzantine Agreement is possible for $n>2f$ in the synchronous model (see blog post) but not possible for $n \leq 3f$ in the asynchronous model (where $f$ is the number of parties the adversary can corrupt). See this blog post for the lower bound.

The Partial synchrony model was suggested in 1988 by Dwork, Lynch, and Stockmeyer (with a preliminary version in PODC 1984). Their paper “Consensus in the presence of partial synchrony” was awarded the Edsger W. Dijkstra Prize in Distributed Computing in 2007. This Partial synchrony model is also (implicitly) used in two other publications from around the same time: Lamport’s Paxos and Oki and Liskov’s viewstamped replication.

Partial synchrony captures the idea of designing protocols for systems that are usually synchronous but tolerate temporary violations (e.g., denial-of-service attacks). In particular, a recurring theme in the Partial synchrony model is to design protocols that are always safe (even when the system is asynchronous) but provide liveness and termination guarantees only after GST (only when the system is synchronous).

Many large-scale systems use protocols that can be successfully reasoned about in the Partial Synchrony model. Some examples of such systems: Zookeeper, etcd, Spanner, Cosmos Hub.

An alternative definition for Partial synchrony is to assume that there is some finite unknown upper bound $\Delta$ on message delivery. This bound is not known in advance and can be chosen by the adversary. See this follow-up post or DLS88 for this and for the equivalence of these definitions.

Acknowledgments

Special thanks to Alin Tomescu and Kai Mast for reviewing this post and sending insightful comments. Please leave comments on Twitter