Since the introduction of PBFT, the first practical BFT replication solution in the partial synchrony model, numerous BFT solutions have been built around its core two-phase paradigm. The first phase guarantees proposal uniqueness through a QC. The second phase guarantees that a new leader can convince replicas to vote for a safe proposal. This requires the leader to relay information from (n−f) replicas, each reporting its own highest QC or vote. Generations of two-phase works thus suffer from a quadratic communication bottleneck on leader replacement.
HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. This alleviates the above complexity and at the same time considerably simplifies the leader replacement protocol. Having (almost) canonized the phases, it is very easy to pipeline HotStuff, and to frequently rotate leaders.
We have implemented HotStuff as a library in roughly 4K lines of C++ code. Most noticeably, the core consensus logic specified in the pseudocode consumes only around 200 lines. In this section, we first examine baseline throughput and latency by comparing with a state-of-the-art system, BFT-SMaRt [13]. We then focus on the message cost of view changes, where HotStuff's advantages are most pronounced.
We conducted our experiments on Amazon EC2 using c5.4xlarge instances. Each instance had 16 vCPUs supported by Intel Xeon Platinum 8000 processors. All cores sustained a Turbo CPU clock speed up to 3.4GHz. We ran each replica on a single VM instance, and so BFT-SMaRt, which makes heavy use of threads, was allowed to utilize 16 cores per replica, as in their original evaluation [13]. The maximum TCP bandwidth measured by iperf was around 1.2 Gigabytes per second. We did not throttle the bandwidth in any run. The network latency between two machines was less than 1 ms.
Our prototype implementation of HotStuff uses secp256k1 for all digital signatures in both votes and quorum certificates. BFT-SMaRt uses hmac-sha1 for MACs (Message Authentication Codes) in the messages during normal operation and uses digital signatures in addition to MACs during a view change.
All results for HotStuff reflect end-to-end measurement from the clients. For BFT-SMaRt, we used the microbenchmark programs ThroughputLatencyServer and ThroughputLatencyClient from the BFT-SMaRt website (https://github.com/bft-smart/library). The client program measures end-to-end latency but not throughput, while the server-side program measures both throughput and latency. We used the throughput results from servers and the latency results from clients.
Figure 4 depicts three batch sizes for both systems, 100, 400, and 800, though because these systems have different batching schemes, these numbers mean slightly different things for each system. BFT-SMaRt drives a separate consensus decision for each operation, and batches the messages from multiple consensus protocols. Therefore, it has a typical L-shaped latency/throughput performance curve. HotStuff batches multiple operations in each node, and in this way mitigates the cost of digital signatures per decision. However, above 400 operations per batch, the latency incurred by batching becomes higher than the cost of the crypto. Despite these differences, both three-phase (“HS3-”) and two-phase (“HS2-”) HotStuff achieve latency comparable to BFT-SMaRt (“BS-”) for all three batch sizes, while their maximum throughput noticeably outperforms BFT-SMaRt's.
For batch sizes of 100 and 400, the lowest-latency HotStuff point provides throughput better than that achievable by BFT-SMaRt at its highest throughput, while incurring only a small increase in latency. This increase is partly due to the batching strategy employed by HotStuff: it needs three additional full batches (two in the two-phase variant) to arrive at a decision on a batch. Our experiments kept the number of outstanding requests high, but the higher the batch size, the longer it takes to fill the batching pipeline. Practical deployments could be further optimized to adapt the batch size to the number of outstanding operations.
Figure 5 depicts three client request/reply payload sizes (in bytes) of 0/0, 128/128, and 1024/1024, denoted “p0”, “p128”, and “p1024” respectively. At all payload sizes, both three-phase and two-phase HotStuff outperformed BFT-SMaRt in throughput, with similar or comparable latency.
Notice that BFT-SMaRt uses MACs based on symmetric cryptography, which is orders of magnitude faster than the asymmetric cryptography behind the digital signatures used by HotStuff, and that three-phase HotStuff has more round trips than the two-phase PBFT variant used by BFT-SMaRt. Yet HotStuff still achieves comparable latency and much higher throughput. Below we evaluate both systems in more challenging situations, where the performance advantages of HotStuff become more pronounced.
To evaluate the scalability of HotStuff in various dimensions, we performed three experiments. For the baseline, we used zero-size request/response payloads while varying the number of replicas. The second evaluation repeated the baseline experiment with 128-byte and 1024-byte request/response payloads. The third test repeated the baseline (with empty payloads) while introducing network delays between replicas that were uniformly distributed in 5ms ± 0.5ms or in 10ms ± 1.0ms, implemented using NetEm (see https://www.linux.org/docs/man8/tc-netem.html). For each data point, we repeated five runs with the same setting and show error bars to indicate the standard deviation for all runs.
The first setting is depicted in Figure 6a (throughput) and Figure 6b (latency). Both three-phase and two-phase HotStuff show consistently better throughput than BFT-SMaRt, while their latencies remain comparable to BFT-SMaRt's, with graceful degradation. Their performance scales better than BFT-SMaRt's when n < 32; at larger scales the advantage shrinks because we currently still use a list of secp256k1 signatures for a QC. In the future, we plan to reduce the cryptographic computation overhead in HotStuff by using a fast threshold signature scheme.
The second setting with payload size 128 or 1024 bytes is denoted by “p128” or “p1024” in Figure 7a (throughput) and Figure 7b (latency). Due to its quadratic bandwidth cost, the throughput of BFT-SMaRt scales worse than HotStuff for reasonably large (1024-byte) payload size.
The third setting is shown in Figure 8a (throughput) and Figure 8b (latency) as “5ms” or “10ms”. Again, due to BFT-SMaRt's heavier use of communication, HotStuff consistently outperformed BFT-SMaRt in both cases.
To evaluate the communication complexity of leader replacement, we counted the number of MAC or signature verifications performed within BFT-SMaRt’s view-change protocol. Our evaluation strategy was as follows. We injected a view change into BFT-SMaRt every one thousand decisions. We instrumented the BFT-SMaRt source code to count the number of verifications upon receiving and processing messages within the view-change protocol. Beyond communication complexity, this measurement underscores the cryptographic computation load associated with transferring these authenticated values.
Figure 9a and Figure 9b show the number of extra authenticators (MACs and signatures, respectively) processed for each view change, where “extra” is defined to be those authenticators that would not be sent if the leader remained stable. The two figures show that BFT-SMaRt uses a cubic number of MACs and a quadratic number of signatures per view change. HotStuff has no “extra” authenticators by this definition, since the number of authenticators remains the same whether or not the leader changes, and so it is omitted from the graphs.
Evaluating the real-time performance of leader replacement is tricky. First, BFT-SMaRt got stuck when triggering frequent view changes; our authenticator-counting benchmark had to average over as many successful view changes as possible before the system got stuck, repeating the experiment many times. Second, the actual elapsed time for leader replacement depends highly on timeout parameters and the leader-election mechanism. It is therefore impossible to provide a meaningful comparison.
In this section, we examine four BFT replication protocols spanning four decades of research in Byzantine fault tolerance, casting them into a chained framework similar to Chained HotStuff.
Figure 3 provides a bird's-eye view of the commit rules of the five protocols we consider, including HotStuff.
In a nutshell, the commit rule in DLS [25] is a One-Chain, allowing a node to be committed only by its own leader. The commit rules in PBFT [20], Tendermint [15, 16], and Casper [17] are almost identical, and consist of Two-Chains. They differ in the mechanisms they introduce for liveness: PBFT has leader “proofs” of quadratic size (no Linearity), while Tendermint and Casper introduce a mandatory ∆ delay before each leader proposal (no Optimistic Responsiveness). HotStuff uses a Three-Chain rule, and has a linear leader protocol without delay.
The simplest commit rule is a One-Chain. Modeled after Dwork, Lynch, and Stockmeyer (DLS), the first known asynchronous Byzantine Consensus solution, this rule is depicted in Figure 3(a). A replica becomes locked in DLS on the highest node it voted for.
Unfortunately, this rule would easily lead to a deadlock if at some height, a leader equivocates and two correct replicas become locked on the conflicting proposals at that height. Relinquishing either lock is unsafe unless f + 1 replicas indicate they did not vote for the locked value.
Indeed, in DLS only the leader of each height can itself reach a commit decision by the One-Chain commit rule. Thus, only the leader itself is harmed if it has equivocated. Replicas can relinquish a lock either if f + 1 replicas did not vote for it, or if there are conflicting proposals (signed by the leader). The unlocking protocol occurring at the end of each height in DLS turns out to be fairly complex and expensive. Together with the fact that only the leader for a height can decide, in the best scenario where no fault occurs and the network is timely, DLS requires n leader rotations, and O(n^4) message transmissions, per single decision. While it broke new ground in demonstrating a safe asynchronous protocol, DLS was not designed as a practical solution.
Modeled after PBFT, a more practical approach uses a Two-Chain commit rule; see Figure 3(b). When a replica votes for a node that forms a One-Chain, it becomes locked on it. Conflicting One-Chains at the same height are simply not possible, as each has a QC, hence the deadlock situation of DLS is avoided.
However, if one replica holds a higher lock than others, a leader may not know about it even if it collects information from n − f replicas. This could prevent leaders from reaching decisions ad infinitum, purely due to scheduling. To get “unstuck”, the PBFT leader unlocks all replicas by carrying a proof consisting of the highest One-Chains reported by n − f replicas. This proof is quite involved, as explained below.
In the original PBFT, which has been open-sourced [20] and adopted in several follow-up works [13, 34], a leader proof contains a set of messages collected from n − f replicas, each reporting the highest One-Chain it voted for. Each One-Chain contains a QC, hence the total communication cost is O(n^3). Harnessing the signature-combining methods of [45, 18], SBFT [30] reduces this cost to O(n^2) by turning each QC into a single value.
In the PBFT variant in [21], a leader proof contains the highest One-Chain the leader collected from the quorum, included only once. It also includes one signed value from each member of the quorum, proving that the member did not vote for a higher One-Chain. Broadcasting this proof incurs communication complexity O(n^2). Note that whereas the signatures on a QC may be combined into a single value, the proof as a whole cannot be reduced to constant size because messages from different members of the quorum may have different values.
In both variants, a correct replica unlocks even if it has a higher One-Chain than the one in the leader's proof. Thus, a correct leader can force its proposal to be accepted during a period of synchrony, and liveness is guaranteed. The cost is quadratic communication per leader replacement.
Tendermint has a Two-Chain commit rule identical to PBFT's, and Casper has a Two-Chain rule in which the QC of the leaf need not refer to its direct parent; that is, in Casper the Two-Chain may have gaps. Figure 3(c,d) depicts the commit rules for Tendermint and Casper, respectively.
In both methods, a leader simply sends the highest One-Chain it knows along with its proposal. A replica unlocks a One-Chain if it receives from the leader a higher one.
However, because correct replicas may not vote for a leader's node, to guarantee progress a new leader must obtain the highest One-Chain in the system by waiting for the maximal network delay. Otherwise, if leaders only wait for the first n − f messages to start a new height, there is no progress guarantee. Leader delays are inherent both in Tendermint and in Casper, in order to provide liveness.
This simple leader protocol embodies a linear leap in the communication complexity of leader replacement, which HotStuff borrows from. As already mentioned above, a QC can be captured in a single value using threshold signatures, hence a leader can collect and disseminate the highest One-Chain with linear communication complexity. However, crucially, due to the extra QC step, HotStuff does not require the leader to wait for the maximal network delay.
HotStuff is a practical protocol for building efficient SMR systems. Because of its simplicity, we can easily turn Algorithm 3 into an event-driven-style specification that is almost like the code skeleton for a prototype implementation.
As shown in Algorithm 4, the code is further simplified and generalized by extracting the liveness mechanism from the body into a module named Pacemaker. Instead of the next leader always waiting for a genericQC at the end of the generic phase before starting its reign, this logic is delegated to the Pacemaker. A stable leader can skip this step and streamline proposals across multiple heights. Additionally, we relax the direct parent constraint for maintaining the highest genericQC and lockedQC, while still preserving the requirement that the QC in a valid node always refers to its ancestor. The proof of correctness is similar to Chained HotStuff and we also defer it to the appendix of [50].
Data structures. Each replica u keeps track of the following main state variables:
- V[·]: mapping from a node to its set of votes.
- vheight: height of the last voted node.
- b_lock: locked node (similar to lockedQC).
- b_exec: last executed node.
- qc_high: highest known QC (similar to genericQC), kept by the Pacemaker.
- b_leaf: leaf node kept by the Pacemaker.
It also keeps a constant b_0, the same genesis node known by all correct replicas. To bootstrap, b_0 contains a hardcoded QC for itself; b_lock, b_exec, and b_leaf are all initialized to b_0; and qc_high contains the QC for b_0.
Pacemaker. A Pacemaker is a mechanism that guarantees progress after GST. It achieves this through two ingredients.
The first one is “synchronization”, bringing all correct replicas, and a unique leader, into a common height for a sufficiently long period. The usual synchronization mechanism in the literature [25, 20, 15] is for replicas to increase the count of ∆'s they spend at larger heights, until progress is made. A common way to deterministically elect a leader is to use a rotating leader scheme in which all correct replicas keep a predefined leader schedule and rotate to the next one when the leader is demoted.
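As a concrete illustration, the rotating scheme can be sketched as follows (a minimal sketch; the function name and round-robin rule are hypothetical choices, not the paper's pseudocode):

```python
# A deterministic rotating-leader schedule: every correct replica derives
# the same leader for a view from the view number alone, so no extra
# communication is needed to agree on who leads next.

def leader(view_number: int, n: int) -> int:
    """Replica id of the leader for the given view (round-robin over n)."""
    return view_number % n

# When the leader of view v is demoted (no progress before the timeout),
# replicas move to view v + 1 and the schedule yields the next leader.
```

Any deterministic mapping that eventually rotates through all replicas works equally well here.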
Second, a Pacemaker needs to provide the leader with a way to choose a proposal that will be supported by correct replicas. As shown in Algorithm 5, after a view change, in onReceiveNewView, the new leader collects new-view messages sent by replicas through onNextSyncView to discover the highest QC, satisfying the second part of the condition in onReceiveProposal for liveness (Line 18 of Algorithm 4). During the same view, however, the incumbent leader chains the new node to the end of the leaf it last proposed, so no new-view message is needed. Based on some application-specific heuristic (for example, waiting until the previously proposed node gets a QC), the current leader invokes onBeat to propose a new node carrying the command to be executed.
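The leader-side Pacemaker logic just described can be sketched as follows (an illustrative sketch only; the QC and Pacemaker types and method names are hypothetical stand-ins for Algorithm 5):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QC:
    view_number: int
    node: str  # stands in for the node the QC certifies

class Pacemaker:
    """Liveness module: tracks the highest QC the replica knows of."""
    def __init__(self, genesis_qc: QC):
        self.qc_high = genesis_qc

    def update_qc_high(self, qc: QC) -> None:
        # Keep only the highest QC; a new leader chains its proposal onto it.
        if qc.view_number > self.qc_high.view_number:
            self.qc_high = qc

    def on_receive_new_view(self, new_view_qcs: list) -> QC:
        # A new leader discovers the highest QC among new-view messages,
        # satisfying the liveness half of the voting condition.
        for qc in new_view_qcs:
            self.update_qc_high(qc)
        return self.qc_high
```

An incumbent leader skips on_receive_new_view entirely and keeps extending its own last proposal.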
It is worth noting that even if a bad Pacemaker invokes onPropose arbitrarily, or selects a parent and a QC capriciously, and regardless of any scheduling delays, safety is always guaranteed. Therefore, the safety guaranteed by Algorithm 4 alone is entirely decoupled from the liveness provided by any potential instantiation of Algorithm 5.
Two-phase HotStuff variant. To further demonstrate the flexibility of the HotStuff framework, Algorithm 6 shows the two-phase variant of HotStuff. Only the update procedure is affected: a Two-Chain is required for reaching a commit decision, and a One-Chain determines the lock. As discussed above (Section 4.4), this two-phase variant loses Optimistic Responsiveness, and is similar to Tendermint/Casper. The benefit is fewer phases, while liveness may be addressed by incorporating in the Pacemaker a wait based on the maximum network delay. See Section 7.3 for further discussion.
It takes three phases for a Basic HotStuff leader to commit a proposal. These phases do little “useful” work beyond collecting votes from replicas, and they are all very similar. In Chained HotStuff, we improve the utility of the Basic HotStuff protocol while at the same time considerably simplifying it. The idea is to change the view on every prepare phase, so each proposal has its own view. This reduces the number of message types and allows for pipelining of decisions. A similar approach for message type reduction was suggested in Casper [17].
More specifically, in Chained HotStuff the votes over a prepare phase are collected in a view by the leader into a genericQC. Then the genericQC is relayed to the leader of the next view, essentially delegating responsibility for the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually carry out a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare phase for view v + 1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v + 2 simultaneously serves as the pre-commit phase for view v + 1 and as the commit phase for view v. This is possible because all the phases have identical structure.
The pipeline of Basic HotStuff protocol phases embedded in a chain of Chained HotStuff proposals is depicted in Figure 1. Views v1, v2, v3 of Chained HotStuff serve as the prepare, pre-commit, and commit Basic HotStuff phases for cmd1 proposed in v1. This command becomes committed by the end of v4. Views v2, v3, v4 serve as the three Basic HotStuff phases for cmd2 proposed in v2, and it becomes committed by the end of v5. Additional proposals generated in these views continue the pipeline similarly, and are denoted by dashed boxes. In Figure 1, a single arrow denotes the b.parent field for a node b, and a double arrow denotes b.justify.node.
Hence, there are only two types of messages in Chained HotStuff, a new-view message and a generic-phase generic message. The genericQC functions in all logically pipelined phases. We next explain the mechanisms in the pipeline to take care of locking and committing, which occur only in the commit and decide phases of Basic HotStuff.
Dummy nodes. The genericQC used by a leader in some view viewNumber may not directly reference the proposal of the preceding view (viewNumber −1). The reason is that the leader of a preceding view may fail to obtain a QC, either because there are conflicting proposals, or due to a benign crash. To simplify the tree structure, createLeaf extends genericQC.node with blank nodes up to the height (the number of parent links on a node's branch) of the proposing view, so view numbers are equated with node heights. As a result, the QC embedded in a node b may not refer to its parent, i.e., b.justify.node may not equal b.parent (the last node in Figure 2).
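The padding performed by createLeaf can be sketched as follows (an illustrative sketch; the Node type and its fields are hypothetical stand-ins for the pseudocode's node):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    height: int
    parent: Optional["Node"]
    cmd: Optional[str]  # None marks a blank (dummy) node

def create_leaf(parent: Node, cmd: str, view_number: int) -> Node:
    """Extend parent with dummy nodes so the new leaf's height equals the
    proposing view's number, equating view numbers with node heights."""
    node = parent
    while node.height + 1 < view_number:
        node = Node(node.height + 1, node, None)  # blank padding node
    return Node(view_number, node, cmd)
```

If the preceding views produced QCs without gaps, the loop body never runs and the leaf is a direct child of the parent.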
One-Chain, Two-Chain, and Three-Chain. When a node b∗ carries a QC that refers to a direct parent, i.e., b∗.justify.node = b∗.parent, we say that it forms a One-Chain. Denote b′′ = b∗.justify.node. Node b∗ forms a Two-Chain, if in addition to forming a One-Chain, b′′.justify.node = b′′.parent. It forms a Three-Chain, if b′′ forms a Two-Chain.
Looking at the chain b∗ → b′′ → b′ → b, where b′ = b′′.justify.node and b = b′.justify.node, ancestry gaps might occur at any one of the nodes. These situations are similar to a leader of Basic HotStuff failing to complete any one of three phases, and getting interrupted to the next view by nextView.
If b∗ forms a One-Chain, the prepare phase of b′′ has succeeded. Hence, when a replica votes for b∗, it should remember genericQC ← b∗.justify. We remark that it is safe to update genericQC even when a One-Chain is not direct, so long as it is higher than the current genericQC. In the implementation code described in Section 6, we indeed update genericQC in this case.
If b∗ forms a Two-Chain, then the pre-commit phase of b′ has succeeded. The replica should therefore update lockedQC ← b′′.justify. Again, we remark that the lock can be updated even when a Two-Chain is not direct (safety will not break), and indeed, this is done in the implementation code in Section 6.
Finally, if b∗ forms a Three-Chain, the commit phase of b has succeeded, and b becomes a committed decision.
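The three cases above can be gathered into one update sketch (illustrative only; height comparisons on the genericQC and lockedQC updates are elided, and the type and field names are hypothetical stand-ins for Algorithm 3):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    height: int
    parent: Optional["Node"] = None
    justify: Optional["Node"] = None  # node referenced by the embedded QC

def update(b_star: Node, state: dict) -> Optional[Node]:
    """Walk b* -> b'' -> b' -> b via justify links; return the node
    committed by a direct Three-Chain, if any."""
    b2 = b_star.justify
    if b2 is None:
        return None
    state["generic"] = b2          # One-Chain: prepare phase of b'' succeeded
    b1 = b2.justify
    if b1 is None:
        return None
    state["locked"] = b1           # Two-Chain: pre-commit phase of b' succeeded
    b = b1.justify
    if b is None:
        return None
    # Three-Chain: commit b when the links b''->b' and b'->b are direct.
    if b2.parent is b1 and b1.parent is b:
        return b
    return None
```

In the real protocol the generic and locked updates take effect only when the new QC is higher than the one currently held.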
Algorithm 3 shows the pseudocode for Chained HotStuff. The proof of safety given by Theorem 5 in Appendix A is similar to the one for Basic HotStuff. We require that the QC in a valid node refer to its ancestor. For brevity, we assume the constraint always holds and omit checking it in the code.
HotStuff solves the State Machine Replication (SMR) problem. At the core of SMR is a protocol for deciding on a growing log of command requests by clients. A group of state-machine replicas apply commands in sequence order consistently. A client sends a command request to all replicas, and waits for responses from (f + 1) of them. For the most part, we omit the client from the discussion, and defer to the standard literature for issues regarding numbering and de-duplication of client requests.
The Basic HotStuff solution is presented in Algorithm 2. The protocol works in a succession of views numbered with monotonically increasing view numbers. Each viewNumber has a unique dedicated leader known to all. Each replica stores a tree of pending commands as its local data structure. Each tree node contains a proposed command (or a batch of them), metadata associated with the protocol, and a parent link. The branch led by a given node is the path from the node all the way to the tree root by visiting parent links. During the protocol, a monotonically growing branch becomes committed. To become committed, the leader of a particular view proposing the branch must collect votes from a quorum of replicas in three phases, prepare, pre-commit, and commit.
A key ingredient in the protocol is a collection of votes over a leader proposal, referred to as a quorum certificate (or “QC” in short). The QC is associated with a particular node and a view number. The tcombine utility employs a threshold signature scheme to generate a representation of signed votes as a single authenticator.
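As a toy illustration of the QC abstraction (this is NOT a real threshold signature scheme; a signer set merely stands in for the combined cryptographic authenticator, and the function name mirrors the text):

```python
# A QC forms once n - f distinct replicas have "signed" the same
# (type, viewNumber, node) tuple; partials are (replica_id, tuple) pairs.

def tcombine(msg_tuple, partials, n, f):
    signers = {rid for (rid, t) in partials if t == msg_tuple}
    if len(signers) >= n - f:
        # A real threshold scheme would emit one constant-size
        # authenticator; here the signer set plays that role.
        return {"tuple": msg_tuple, "signers": frozenset(signers)}
    return None
```

Note that duplicate votes from the same replica do not count toward the quorum.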
Below we give an operational description of the protocol logic by phases, followed by a precise specification in Algorithm 2, and conclude the section with safety, liveness, and complexity arguments.
prepare phase. The protocol for a new leader starts by collecting new-view messages from replicas. The new-view message is sent by a replica as it transitions into viewNumber (including the first view) and carries the highest prepareQC that the replica received (⊥ if none), as described below.
The leader processes these messages in order to select a branch that has the highest preceding view in which a prepareQC was formed. The leader selects the prepareQC with the highest view, denoted highQC, among the new-view messages. Because highQC is the highest among n − f replicas, no higher view could have reached a commit decision. The branch led by highQC.node is therefore safe.
Collecting new-view messages to select a safe branch may be omitted by an incumbent leader, who may simply select its own highest prepareQC as highQC. We defer this optimization to Section 6 and only describe a single, unified leader protocol in this section. Note that, different from PBFT-like protocols, including this step in the leader protocol is straightforward, and it incurs the same, linear overhead as all the other phases of the protocol, regardless of the situation.
The leader uses the createLeaf method to extend the tail of highQC.node with a new proposal. The method creates a new leaf node as a child and embeds a digest of the parent in the child node. The leader then sends the new node in a prepare message to all other replicas. The proposal carries highQC for safety justification.
Upon receiving the prepare message for the current view from the leader, replica r uses the safeNode predicate to determine whether to accept it. If it is accepted, the replica sends a prepare vote with a partial signature (produced by tsign_r) for the proposal to the leader.
safeNode predicate. The safeNode predicate is a core ingredient of the protocol. It examines a proposal message m carrying a QC justification m.justify, and determines whether m.node is safe to accept. The safety rule to accept a proposal is that the branch of m.node extends from the currently locked node lockedQC.node. The liveness rule, on the other hand, is that the replica accepts m if m.justify has a higher view than the current lockedQC. The predicate is true as long as either one of the two rules holds.
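A minimal sketch of the predicate (the types and helper are hypothetical; the real predicate operates on the message and replica state of Algorithm 1):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    parent: Optional["Node"] = None

@dataclass
class QC:
    view_number: int
    node: Node

def extends(node: Optional[Node], ancestor: Node) -> bool:
    # True iff ancestor lies on the branch led by node.
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False

def safe_node(m_node: Node, m_justify: QC, locked_qc: QC) -> bool:
    safety = extends(m_node, locked_qc.node)                   # safety rule
    liveness = m_justify.view_number > locked_qc.view_number   # liveness rule
    return safety or liveness
```

The disjunction is what lets a correct leader's higher QC override a stale lock without compromising safety.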
pre-commit phase. When the leader receives (n − f) prepare votes for the current proposal curProposal, it combines them into a prepareQC. The leader broadcasts prepareQC in pre-commit messages. A replica responds to the leader with a pre-commit vote carrying a signed digest of the proposal.
commit phase. The commit phase is similar to pre-commit phase. When the leader receives pre-commit votes, it combines them into a precommitQC and broadcasts it in commit messages; replicas respond to it with a commit vote. Importantly, a replica becomes locked on the precommitQC at this point by setting its lockedQC to precommitQC (Line 25 of Algorithm 2). This is crucial to guard the safety of the proposal in case it becomes a consensus decision.
decide phase. When the leader receives commit votes, it combines them into a commitQC. Once the leader has assembled a commitQC, it sends it in a decide message to all other replicas. Upon receiving a decide message, a replica considers the proposal embodied in the commitQC a committed decision, and executes the commands in the committed branch. The replica increments viewNumber and starts the next view.
nextView interrupt. In all phases, a replica waits for a message at view viewNumber for a timeout period, determined by an auxiliary nextView(viewNumber) utility. If nextView(viewNumber) interrupts waiting, the replica also increments viewNumber and starts the next view.
Messages. A message m in the protocol has a fixed set of fields that are populated using the Msg() utility shown in Algorithm 1. m is automatically stamped with curView, the sender's current view number. Each message has a type m.type ∈ {new-view, prepare, pre-commit, commit, decide}. m.node contains a proposed node (the leaf node of a proposed branch). There is an optional field m.justify. The leader always uses this field to carry the QC for the different phases. Replicas use it in new-view messages to carry the highest prepareQC. Each message sent in a replica role contains a partial signature m.partialSig by the sender over the tuple ⟨m.type, m.viewNumber, m.node⟩, which is added in the voteMsg() utility.
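A toy rendering of these utilities (the dictionary fields mirror the text; the partial signature is mocked as the signed tuple itself rather than a cryptographic value):

```python
# Msg() populates the fixed message fields; voteMsg() additionally adds a
# partial signature over <type, viewNumber, node> only (justify excluded).

def msg(msg_type, cur_view, node, justify=None):
    return {"type": msg_type, "viewNumber": cur_view,
            "node": node, "justify": justify, "partialSig": None}

def vote_msg(msg_type, cur_view, node, justify=None):
    m = msg(msg_type, cur_view, node, justify)
    m["partialSig"] = ("sig", m["type"], m["viewNumber"], m["node"])
    return m
```

Excluding justify from the signed tuple is what lets votes for the same node in the same view be combined into one QC.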
Quorum certificates. A Quorum Certificate (QC) over a tuple is a data type that combines a collection of signatures for the same tuple signed by (n − f) replicas. Given a QC qc, we use qc.type, qc.viewNumber, and qc.node to refer to the matching fields of the certified tuple.
Tree and branches. Each command is wrapped in a node that additionally contains a parent link which could be a hash digest of the parent node. We omit the implementation details from the pseudocode. During the protocol, a replica delivers a message only after the branch led by the node is already in its local tree. In practice, a recipient who falls behind can catch up by fetching missing nodes from other replicas. For brevity, these details are also omitted from the pseudocode. Two branches are conflicting if neither one is an extension of the other. Two nodes are conflicting if the branches led by them are conflicting.
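The conflict relation can be sketched as follows (illustrative; the Node type is a hypothetical stand-in for the pseudocode's node):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    parent: Optional["Node"] = None

def extends(node: Optional[Node], ancestor: Node) -> bool:
    # Walk parent links from node toward the root, looking for ancestor.
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False

def conflicting(a: Node, b: Node) -> bool:
    # Two nodes conflict iff neither branch extends the other.
    return not extends(a, b) and not extends(b, a)
```

In practice the parent link is a hash digest, so the walk dereferences nodes from the local tree rather than following in-memory pointers.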
Bookkeeping variables. A replica uses additional local variables for bookkeeping the protocol state: (i) a viewNumber, initially 1 and incremented either by finishing a decision or by a nextView interrupt; (ii) a locked quorum certificate lockedQC, initially ⊥, storing the highest QC for which a replica voted commit; and (iii) a prepareQC, initially ⊥, storing the highest QC for which a replica voted pre-commit. Additionally, in order to incrementally execute a committed log of commands, the replica maintains the highest node whose branch has been executed. This is omitted below for brevity.
The protocol given in Algorithm 2 is described as an iterated view-by-view loop. In each view, a replica performs phases in succession based on its role, described as a succession of “as” blocks. A replica can have more than one role; for example, a leader is also a (normal) replica. Execution of “as” blocks across roles can proceed concurrently. The execution of each “as” block is atomic. A nextView interrupt aborts all operations in any “as” block, and jumps to the “Finally” block.
4.4 Safety, Liveness, and Complexity
Safety. We first define a quorum certificate qc to be valid if tverify(⟨qc.type, qc.viewNumber, qc.node⟩, qc.sig) is true.
Lemma 1. For any valid qc1, qc2 in which qc1.type = qc2.type and qc1.node conflicts with qc2.node, we have qc1.viewNumber ≠ qc2.viewNumber.
Proof. To show a contradiction, suppose qc1.viewNumber = qc2.viewNumber = v. Because a valid QC can be formed only with n − f = 2f + 1 votes (i.e., partial signatures) for it, there must be a correct replica who voted twice in the same phase of v. This is impossible because the pseudocode allows voting only once for each phase in each view.
Theorem 2. If w and b are conflicting nodes, then they cannot both be committed, each by a correct replica.
Proof. We prove this important theorem by contradiction. Let qc1 denote a valid commitQC (i.e., qc1.type = commit) such that qc1.node = w, and qc2 denote a valid commitQC such that qc2.node = b. Denote v1 = qc1.viewNumber and v2 = qc2.viewNumber. By Lemma 1, v1 ≠ v2. W.l.o.g. assume v1 < v2.
We will now denote by vs the lowest view higher than v1 for which there is a valid prepareQC, qcs (i.e., qcs.viewNumber = vs), where qcs.node conflicts with w. Formally, we define the following predicate for any prepareQC:
E(prepareQC) := (v1 < prepareQC.viewNumber ≤ v2) ∧ (prepareQC.node conflicts with w).
We can now set the first switching point qcs:
qcs := argmin { prepareQC.viewNumber | prepareQC is valid ∧ E(prepareQC) }.
Note that, by assumption, such a qcs must exist; for example, qcs could be the prepareQC formed in view v2.
Of the correct replicas that sent a partial result tsign_r(⟨qc1.type, qc1.viewNumber, qc1.node⟩), let r be the first that contributed tsign_r(⟨qcs.type, qcs.viewNumber, qcs.node⟩); such an r must exist, since otherwise one of qc1 and qcs could not have been created. During view v1, replica r updates its lock lockedQC to a precommitQC on w at Line 25 of Algorithm 2. Due to the minimality of vs, the lock that replica r has on the branch led by w is not changed before qcs is formed. Otherwise, r must have seen some other prepareQC with a lower view conflicting with w, because Line 17 comes before Line 25, contradicting the minimality. Now consider the invocation of safeNode in the prepare phase of view vs by replica r, with a message m carrying m.node = qcs.node. By assumption, m.node conflicts with lockedQC.node, and so the disjunct at Line 26 of Algorithm 1 is false. Moreover, m.justify.viewNumber > v1 would violate the minimality of vs, and so the disjunct in Line 27 of Algorithm 1 is also false. Thus, safeNode must return false and r cannot cast a prepare vote on the conflicting branch in view vs, a contradiction.
Liveness. There are two functions left undefined in the previous section: leader and nextView. Their definitions will not affect the safety of the protocol, though they do matter for liveness. Before giving candidate definitions for them, we first show that after GST, there is a bounded duration Tf such that if all correct replicas remain in view v during Tf and the leader for view v is correct, then a decision is reached. Below, we say that qc1 and qc2 match if qc1 and qc2 are valid, qc1.node = qc2.node, and qc1.viewNumber = qc2.viewNumber.
Lemma 3. If a correct replica is locked such that lockedQC = precommitQC, then at least f + 1 correct replicas voted for some prepareQC matching lockedQC.
Proof. Suppose replica r is locked on precommitQC. Then, (n−f) votes were cast for the matching prepareQC in the prepare phase (Line 10 of Algorithm 2), out of which at least f + 1 were from correct replicas.
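The counting behind this lemma, under the standard assumption n = 3f + 1, can be checked directly; the helper below is purely illustrative arithmetic, not part of the protocol.

```cpp
#include <cassert>

// Quorum arithmetic behind Lemma 3, assuming n = 3f + 1: a quorum of
// (n - f) voters contains at most f faulty replicas, so at least
// (n - f) - f = f + 1 of them must be correct.
int minCorrectInQuorum(int f) {
    int n = 3 * f + 1;
    return (n - f) - f; // = f + 1
}
```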
Theorem 4. After GST, there exists a bounded time period Tf such that if all correct replicas remain in view v during Tf and the leader for view v is correct, then a decision is reached.
Proof. Starting in a new view, the leader collects (n−f) new-view messages and calculates its highQC before broadcasting a prepare message. Suppose among all replicas (including the leader itself), the highest kept lock is lockedQC = precommitQC*. By Lemma 3, we know there are at least f + 1 correct replicas that voted for a prepareQC* matching precommitQC*, and have already sent them to the leader in their new-view messages. Thus, the leader must learn a matching prepareQC* in at least one of these new-view messages and use it as highQC in its prepare message. By the assumption, all correct replicas are synchronized in their view and the leader is nonfaulty. Therefore, all correct replicas will vote in the prepare phase, since in safeNode, the condition on Line 27 of Algorithm 1 is satisfied (even if the node in the message conflicts with a replica's stale lockedQC.node, and so Line 26 is not). Then, after the leader assembles a valid prepareQC for this view, all replicas will vote in all the following phases, leading to a new decision. After GST, the duration Tf for these phases to complete is of bounded length.
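The highQC calculation in this proof is a simple maximum over the prepareQCs carried by the collected new-view messages. A minimal sketch, with illustrative struct names rather than the library's actual types:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct QC { int node; int viewNumber; }; // illustrative certificate record

// The new leader adopts, as highQC, the prepareQC with the highest view
// among the (n - f) new-view messages it collected. Because at least one
// correct voter for the highest lock is among any (n - f) senders (Lemma 3),
// this maximum is at least as high as any correct replica's lock.
QC computeHighQC(const std::vector<QC>& newViewQCs) {
    return *std::max_element(
        newViewQCs.begin(), newViewQCs.end(),
        [](const QC& a, const QC& b) { return a.viewNumber < b.viewNumber; });
}
```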
The protocol is Optimistically Responsive because there is no explicit “wait-for-∆” step, and the logical disjunction in safeNode is used to override a stale lock with the help of the three-phase paradigm.
We now provide simple constructions for leader and nextView that suffice to ensure that after GST, eventually a view will be reached in which the leader is correct and all correct replicas remain in this view for Tf time. It suffices for leader to return some deterministic mapping from view number to a replica, eventually rotating through all replicas. A possible solution for nextView is to utilize an exponential back-off mechanism that maintains a timeout interval. A timer is set upon entering each view. When the timer goes off without any decision being made, the replica doubles the interval and calls nextView to advance the view. Since the interval is doubled each time, the waiting intervals of all correct replicas will eventually have at least Tf overlap in common, during which the leader can drive a decision.
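The back-off mechanism above can be sketched as a small timer state machine. This is a hedged illustration of the described policy, not the pacemaker from the actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Exponential back-off for nextView, as described above: keep a timeout
// interval, double it every time a view ends without a decision, and
// reset it once a decision is reached.
struct PacemakerTimer {
    uint64_t baseMs;     // initial timeout interval
    uint64_t intervalMs; // current timeout interval for this view

    explicit PacemakerTimer(uint64_t base) : baseMs(base), intervalMs(base) {}

    // Timer fired with no decision: double the interval and report the
    // new timeout to arm for the next view.
    uint64_t onTimeout() {
        intervalMs *= 2;
        return intervalMs;
    }

    // A decision was reached in time: the back-off resets.
    void onDecide() { intervalMs = baseMs; }
};
```

Because every correct replica doubles its interval independently, the intervals of all correct replicas grow without bound and eventually overlap for at least Tf, which is what the liveness argument requires.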
Lack of liveness with two phases. We now briefly demonstrate an infinite non-deciding scenario for a "two-phase" HotStuff. This explains the necessity for introducing a synchronous delay in Casper and Tendermint, and hence for abandoning (Optimistic) Responsiveness.
In the two-phase HotStuff variant, we omit the pre-commit phase and proceed directly to commit; a replica becomes locked when it votes on a prepareQC. Suppose that in view v, a leader proposes b. It completes the prepare phase, and some replica rv votes for the prepareQC, say qcv, such that qcv.node = b. Hence, rv becomes locked on qcv. Asynchronous network scheduling causes the rest of the replicas to move to view v+1 without receiving qcv.
We now repeat ad infinitum the following single-view transcript. We start view v+1 with only rv holding the highest prepareQC (i.e., qcv) in the system. The new leader collects new-view messages from (n−f) replicas excluding rv. The highest prepareQC among these, qc', has a view lower than v and conflicts with qcv.node. The leader then proposes b' extending qc'.node, to which the other correct replicas respond with a vote, but rv rejects it because it is locked on qcv: b' conflicts with qcv.node, and qc' is lower than qcv. Eventually, the replicas give up and move to the next view. Just then, a faulty replica responds to the leader's proposal; the leader puts together a prepareQC, and one replica, say rv+1, votes for it and becomes locked on it.
Complexity. In each phase of HotStuff, only the leader broadcasts to all replicas, and each replica responds to the sender once with a partial signature to certify its vote. In the leader's message, the QC consists of a proof of votes collected previously, which can be encoded by a single threshold signature. In a replica's response, the partial signature from that replica is the only authenticator. Therefore, in each phase, O(n) authenticators are received in total. As there is a constant number of phases, the overall complexity per view is O(n).