In the standard distributed computing model, the communication uncertainty is captured by an adversary that can control the message delays. The communication model defines the limits to the power of the adversary to delay messages.
There are three basic communication models: the Synchronous model, the Asynchronous model, and the Partial synchrony model.
In the Synchronous model, there exists some known finite time bound $\Delta$. For any message sent, the adversary can delay its delivery by at most $\Delta$.
In the Asynchronous model, for any message sent, the adversary can delay its delivery by any finite amount of time. So, on the one hand, there is no bound on the time to deliver a message but, on the other hand, each message must eventually be delivered.
The Partial synchrony model (see DLS88) aims to find a middle ground between these two models. The assumption is that there exists some known finite time bound $\Delta$ and a special event called GST (Global Stabilization Time) such that:
- The adversary must cause the GST event to eventually happen after some unknown finite time.
- Any message sent at time $x$ must be delivered by time $\Delta + \max (x,GST)$.
Informally, in the Partial synchrony model, the system behaves asynchronously till GST and synchronously after GST. Note that the adversary can delay the GST event by any finite amount of time and that no protocol can explicitly detect that GST has occurred. There is no external signal that tells you that GST happened.
At first thought, the Synchronous model may seem to be good enough. Why not just assume, for example, that any message sent over the internet will be delivered by say 2 minutes? First, there is a trade-off:
- Setting a large and conservative $\Delta$ of say 10 minutes may indeed always faithfully model the real world. However, protocol designers who depend on $\Delta$ may incur very long timeouts and hence degrade performance.
- Setting a small and aggressive $\Delta$ of say 0.1 seconds may actually not always faithfully model the real world. This means that protocols whose safety depends on this bound may suffer safety violations in the real world.
Second, even if you think you found a magical sweet spot for $\Delta$, imagine a sender that broadcasts a message to two receivers, one message arrives after $\Delta - \epsilon$ time and the other after $\Delta + \epsilon$. Here the real world would behave differently than the model and this could again potentially cause safety problems.
To overcome the problems with synchrony, the asynchronous model forces protocol designers to assume nothing about network delays. The outcome is often very robust protocols:
- Since they do not depend on any time bound, message delays cannot cause unexpected safety violations.
- Since they cannot use any fixed values for timeouts, they must inherently adapt to the actual latency of the system.
The main problem with the asynchronous model is that protocols in this model tend to be more complex and harder to reason about. Moreover, there are many known complexity gaps between synchrony and asynchrony. Two examples:
- The celebrated Fischer, Lynch and Paterson 1985 lower bound says that in the asynchronous model, any protocol that solves consensus withstanding an adversary (who can fail-stop just one party) must have an infinite execution (see this blog post).
- Authenticated Byzantine Agreement is possible for $n>2f$ in the synchronous model (see blog post) but impossible for $n \leq 3f$ in the asynchronous model (where $f$ is the number of parties the adversary can corrupt). See this blog post for the lower bound.
The Partial synchrony model was suggested in 1988 by Dwork, Lynch, and Stockmeyer (with a preliminary version in PODC 1984). Their paper “Consensus in the presence of partial synchrony” was awarded the Edsger W. Dijkstra Prize in Distributed Computing in 2007. This Partial synchrony model is also (implicitly) used in two other publications from around the same time: Lamport’s Paxos and Oki and Liskov’s viewstamped replication.
Partial synchrony captures the intuition that we would like to design protocols for systems that are usually synchronous but have reasonable guarantees, even if the synchrony assumptions become temporarily violated by some extreme event (like a denial of service attack). In particular, a recurring theme in the Partial synchrony model is to design protocols that are always safe (even when the system is asynchronous) but provide liveness and termination guarantees only after GST (only when the system is synchronous).
Many large-scale systems use protocols that can be successfully reasoned about in the Partial Synchrony model. Some examples of such systems: Zookeeper, etcd, Spanner, Cosmos Hub and many more.
An alternative definition for Partial synchrony is to assume that there is some finite unknown upper bound $\Delta$ on message delivery. This bound is not known in advance and can be chosen by the adversary. See this follow-up post or DLS88 for this and for the equivalence of these definitions.
Special thanks to Alin Tomescu and Kai Mast for reviewing this post and sending insightful comments. Please leave comments on Twitter