【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Conclusion

9. 结论

自从引入PBFT(部分同步模型中的第一个实用的BFT副本解决方案)以来,围绕其核心两阶段架构,构建了许多BFT解决方案。 第一阶段通过仲裁证书保证提案的唯一性。 第二阶段保证新领导者可以说服复制者投票赞成安全的提案。 这要求领导者中继来自n-f个副本的信息,每个副本都报告自己最高的QC或投票。 因此,一代代的两阶段工作都面临着更换领导者的二次沟通瓶颈。

Since the introduction of PBFT, the first practical BFT replication solution in the partial synchrony model, numerous BFT solutions were built around its core two-phase paradigm. The first phase guarantees proposal uniqueness through a QC. The second phase guarantees that a new leader can convince replicas to vote for a safe proposal. This requires the leader to relay information from (n−f) replicas, each reporting its own highest QC or vote. Generations of two-phase works thus suffer from a quadratic communication bottleneck on leader replacement.

HotStuff围绕三相核心展开,使新领导者可以轻松选择已知的最高仲裁证书。 这减轻了上述复杂性,同时大大简化了领导者更换协议。 (几乎)完成了阶段的标准化之后,很容易通过管道传递HotStuff,并经常轮换领导者。

HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. This alleviates the above complexity and at the same time considerably simplifies the leader replacement protocol. Having (almost) canonized the phases, it is very easy to pipeline HotStuff, and to frequently rotate leaders.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Evaluation

8. 模型评估

我们已经在大约4K行C ++代码中将HotStuff实现为一个库。 最值得注意的是,伪代码中指定的核心共识逻辑仅占用了大约200行。 在本节中,我们将自己的模型与State-Of-Art的模型BFT-SMaRt 进行比较,首先检查baseline吞吐量和延迟。 然后,我们将重点放在观察视图更改的消息成本上,以了解HotStuff在这种情况下的优势。

We have implemented HotStuff as a library in roughly 4K lines of C++ code. Most noticeably, the core consensus logic specified in the pseudocode consumes only around 200 lines. In this section, we will first examine baseline throughput and latency by comparing to a state-of-art system, BFT-SMaRt [13]. We then focus on the message cost for view changes to see our advantages in this scenario.

8.1 设置

我们使用c5.4xlarge实例在Amazon EC2上进行了实验。每个实例具有16个由Intel Xeon Platinum 8000处理器组成的vCPU。所有内核均具有高达3.4GHz的Turbo CPU时钟。我们将每个副本运行在单个VM实例上,因此,大量使用线程的BFT-SMaRt被允许每个副本使用16个核心,就像它们的原始评估一样[13]。 iperf测得的最大TCP带宽约为每秒1.2 GB。我们没有在任何测试中限制带宽。两台计算机之间的网络延迟小于1毫秒。

在我们的HotStuff原型实现中,secp256k1算法被用于投票和仲裁证书中的所有数字签名。 BFT-SMaRt在正常消息上,使用hmac-sha1生成消息的身份验证代码(MAC),并在视图更改期间外额外还使用数字签名。

HotStuff的所有结果均都是使用客户端的端到端测量的。对于BFT-SMaRt,我们使用了BFT-SMaRt网站(https://github.com/bft-smart/library)的微型基准程序ThroughputLatencyServer和ThroughputLatencyClient。使用客户端程序测量端到端延迟,但不测量吞吐量,服务器端程序测量吞吐量和延迟。最终我们使用了服务器的吞吐量结果和客户端的延迟结果。

We conducted our experiments on Amazon EC2 using c5.4xlarge instances. Each instance had 16 vCPUs supported by Intel Xeon Platinum 8000 processors. All cores sustained a Turbo CPU clock speed up to 3.4GHz. We ran each replica on a single VM instance, and so BFT-SMaRt, which makes heavy use of threads, was allowed to utilize 16 cores per replica, as in their original evaluation [13]. The maximum TCP bandwidth measured by iperf was around 1.2 Gigabytes per second. We did not throttle the bandwidth in any run. The network latency between two machines was less than 1 ms.

Our prototype implementation of HotStuff uses secp256k1 for all digital signatures in both votes and quorum certificates. BFT-SMaRt uses hmac-sha1 for MACs (Message Authentication Codes) in the messages during normal operation and uses digital signatures in addition to MACs during a view change.

All results for HotStuff reflect end-to-end measurement from the clients. For BFT-SMaRt, we used the microbenchmark programs ThroughputLatencyServer and ThroughputLatencyClient from the BFT-SMaRt website (https://github.com/bft-smart/library). The client program measures end-to-end latency but not throughput, while the server-side program measures both throughput and latency. We used the throughput results from servers and the latency results from clients.

8.2 基准

图4描绘了两个系统,分别在三个批处理大小(batch-size)的表现,batch-size分别取100、400和800。由于这些系统具有不同的batch-size定义,所以这些数字对于每个系统的意义略有不同。 BFT-SMaRt为每个操作驱动一个单独的共识决策,并批量处理来自多个共识协议的消息。 因此,它具有典型的L形等待时间/吞吐量性能曲线。 HotStuff在一个节点中分使用批处理打包了多个操作,从而降低了每个决策的数字签名成本。 但是,如果每批处理超过400次操作,批处理引起的等待时间变得比加密耗时的成本高。 尽管存在这些差异,但三步骤(HS3-*)和两步骤(HS2-*)HotStuff均可在所有三个批次大小上实现与BFT-SMaRt(BS-*)相当的延迟性能,同时最大吞吐量显著胜过BFT-SMaRt。

Figure 4 depicts three batch sizes for both systems, 100, 400, and 800, though because these systems have different batching schemes, these numbers mean slightly different things for each system. BFT-SMaRt drives a separate consensus decision for each operation, and batches the messages from multiple consensus protocols. Therefore, it has a typical L-shaped latency/throughput performance curve. HotStuff batches multiple operations in each node, and in this way, mitigates the cost of digital signatures per decision. However, above 400 operations per batch, the latency incurred by batching becomes higher than the cost of the crypto. Despite these differences, both three-phase (“HS3-”) and two-phase (“HS2-”) HotStuff achieves comparable latency performance to BFT-SMaRt (“BS-”) for all three batch sizes, while their maximum throughput noticeably outperformed BFT-SMaRt.

对于batch-size为100和400的场景,HotStuff的最低延时情景所对应的延迟和吞吐量要比BFT-SMaRT在其最高吞吐量下同时可实现的延迟和吞吐量要好,同时会导致延迟稍有增加。这种增加部分是由于HotStuff采用的批次策略:它需要三个额外的完整批次(两阶段变种算法则需要两个批次)才能使得一个批次的决策完成。我们的实验使未完成的请求数量保持较高,但是批处理量越大,填充批处理流水线所需的时间就越长。可以进一步在实际部署优化模型,以使批次大小适合未完成的操作数量。

For batch sizes of 100 and 400, the lowest-latency HotStuff point provides latency and throughput that are better than the latency and throughput simultaneously achievable by BFT-SMaRT at its highest throughput, while incurring a small increase in latency. This increase is partly due to the batching strategy employed by HotStuff: It needs three additional full batches (two in the two-phase variant) to arrive at a decision on a batch. Our experiments kept the number of outstanding requests high, but the higher the batch size, the longer it takes to fill the batching pipeline. Practical deployments could be further optimized to adapt the batch size to the number of outstanding operations.

图5描绘了0/0、128/128和1024/1024的三种客户端请求/答复有效载荷大小(以字节为单位),分别表示为”p0″,”p128″和”p1024″。在所有有效负载大小下,三阶段和两阶段HotStuff的吞吐量在性能上均优于BFT-SMaRt,且延迟时间相似或相当。

[?graph]

Figure 5 depicts three client request/reply payload sizes (in bytes) of 0/0, 128/128, and 1024/1024, denoted “p0”, “p128”, and “p1024” respectively. At all payload sizes, both three-phase and two-phase HotStuff outperformed BFTSMaRt in throughput, with similar or comparable latency.

注意,BFT-SMaRt使用基于对称加密的MAC,其速度比HotStuff使用的数字签名中的非对称加密快几个数量级,并且与BFT-SMaRt使用的两阶段PBFT变体相比,三阶段HotStuff具有更多的往返行程。但是,HotStuff仍然能够实现可比的延迟和更高的吞吐量。下面我们将在更具挑战性的情况下评估这两种系统,在这些情况下,HotStuff的性能优势将更加明显。

Notice BFT-SMaRt uses MACs based on symmetric crypto that is orders of magnitude faster than the asymmetric crypto in digital signatures used by HotStuff, and also three-phase HotStuff has more round trips compared to two-phase PBFT variant used by BFT-SMaRt. Yet HotStuff is still able to achieve comparable latency and much higher throughput. Below we evaluate both systems in more challenging situations, where the performance advantages of HotStuff will become more pronounced.

8.3 扩展性

为了评估HotStuff在各个维度上的可伸缩性,我们进行了三个实验。对于基准模型,我们使用零大小的请求/响应有效负载,同时更改了副本的数量。第二次评估使用128字节和1024字节的请求/响应有效负载重复了基准实验。第三次测试重复了基线(有效载荷为空),同时引入了副本之间的网络延迟,这些延迟以5ms±0.5ms或10ms±1.0ms的均匀分布(使用NetEm实现)(请参见https://www.linux.org/docs/ man8 / tc-netem.html)。对于每个数据点,我们以相同的设置重复进行五次运行,并显示误差区间以指示所有运行的标准偏差。

To evaluate the scalability of HotStuff in various dimensions, we performed three experiments. For the baseline, we used zero-size request/response payloads while varying the number of replicas. The second evaluation repeated the baseline experiment with 128-byte and 1024-byte request/response payloads. The third test repeated the baseline (with empty payloads) while introducing network delays between replicas that were uniformly distributed in 5ms ± 0.5ms or in 10ms ± 1.0ms, implemented using NetEm (see https://www.linux.org/docs/man8/tc-netem.html). For each data point, we repeated five runs with the same setting and show error bars to indicate the standard deviation for all runs.

第一个设置如图6a(吞吐量)和图6b(延迟)所示。三阶段和两阶段HotStuff始终显示出比BFT-SMaRt更好的吞吐量,而它们的延迟仍然可以与BFTSMaRt相提并论。当n <32时,性能扩展比BFT-SMaRt好。这是因为我们目前仍将secp256k1签名列表用于QC。将来,我们计划通过使用快速阈值签名方案来减少HotStuff中的密码计算开销。

图6 0/0载荷,400batch-size对应的扩展性

The first setting is depicted in Figure 6a (throughput) and Figure 6b (latency). Both three-phase and two-phase HotStuff show consistently better throughput than BFT-SMaRt, while their latencies are still comparable to BFTSMaRt with graceful degradation. The performance scales better than BFT-SMaRt when n < 32. This is because we currently still use a list of secp256k1 signatures for a QC. In the future, we plan to reduce the cryptographic computation overhead in HotStuff by using a fast threshold signature scheme.

有效负载大小为128或1024字节的第二个设置在图7a(吞吐量)和图7b(等待时间)中用“ p128”或“ p1024”表示。由于其二次带宽成本,对于相当大的有效载荷大小(1024字节),BFT-SMaRt的吞吐能力比HotStuff差。

图7 128/128或1024/1024载荷,400batch-size的扩展性

The second setting with payload size 128 or 1024 bytes is denoted by “p128” or “p1024” in Figure 7a (throughput) and Figure 7b (latency). Due to its quadratic bandwidth cost, the throughput of BFT-SMaRt scales worse than HotStuff for reasonably large (1024-byte) payload size.

第三种设置在图8a(吞吐量)和图8b(等待时间)中显示传输延迟为“ 5ms”或“ 10ms”的对比。同样,由于在BFT-SMaRt中使用了更多的通信,因此HotStuff在这两种情况下均始终优于BFT-SMaRt。

图8 0/0载荷、中途传输延时5ms或10ms,batch-size400ms的扩展性

The third setting is shown in Figure 8a (throughput) and Figure 8b (latency) as “5ms” or “10ms”. Again, due to the larger use of communication in BFT-SMaRt, HotStuff consistently outperformed BFT-SMaRt in both cases.

8.4 View Change

为了评估leader更换的通信复杂性,我们计算了在BFT-SMaRt的视图更改协议中执行的MAC或签名验证的数量。 我们的评估策略如下。 每隔一千个决策,我们就会向BFT-SMaRt中注入视图更改。 我们检测了BFT-SMaRt源代码,以计算在视图更改协议内接收和处理消息后的验证次数。 除了通信的复杂性之外,此测量还强调了与传输这些经过身份验证的值相关的密码计算负担。

To evaluate the communication complexity of leader replacement, we counted the number of MAC or signature verifications performed within BFT-SMaRt’s view-change protocol. Our evaluation strategy was as follows. We injected a view change into BFT-SMaRt every one thousand decisions. We instrumented the BFT-SMaRt source code to count the number of verifications upon receiving and processing messages within the view-change protocol. Beyond communication complexity, this measurement underscores the cryptographic computation load associated with transferring these authenticated values.

图9a和图9b显示了为每个视图更改处理的额外的身份验证器(分别为MAC和签名)的数量,其中“额外”定义为如果领导者保持稳定就不会发送的那些身份验证器。 请注意,根据此定义,HotStu ff没有“额外”的身份验证者,因为身份验证者的数量保持不变,而不管领导者是否保持不变。 这两个数据表明,BFT-SMaRt使用的是MAC的立方数和签名的二次数。 HotStu ff不需要额外的身份验证器即可更改视图,因此在图中将其省略。

图9 视图转换的时候,额外使用验证器个数

Figure 9a and Figure 9b show the number of extra authenticators (MACs and signatures, respectively) processed for each view change, where “extra” is defined to be those authenticators that would not be sent if the leader remained stable. Note that HotStuff has no “extra” authenticators by this definition, since the number of authenticators remains the same regardless of whether the leader stays the same or not. The two figures show that BFT-SMaRt uses cubic numbers of MACs and quadratic numbers of signatures. HotStuff does not require extra authenticators for view changes and so is omitted from the graph.

评估领导者更换的实时性能非常棘手。 首先,BFT-SMaRt在触发频繁的视图更改时被卡住; 我们的验证者计数基准必须在系统卡死之前对尽可能多的成功视图更改进行平均,并重复多次实验。 其次,更换领导者的实际花费时间在很大程度上取决于超时参数和领导者选举机制。 因此,不可能提供有意义的比较。

Evaluating the real-time performance of leader replacement is tricky. First, BFT-SMaRt got stuck when triggering frequent view changes; our authenticator-counting benchmark had to average over as many successful view changes as possible before the system got stuck, repeating the experiment many times. Second, the actual elapsed time for leader replacement depends highly on timeout parameters and the leader-election mechanism. It is therefore impossible to provide a meaningful comparison.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – One-Chain and Two-Chain BFT Protocols

7. One-Chain/Two-Chain的BFT协议

在本节中,我们研究了四个BFT副本协议,这些协议跨越了拜占庭式容错研究的四十余年,并将它们植入类似于Chained HotStuff的基于链的框架中。

图3提供了我们考虑的包括HotStuff在内的五个协议的提交规则的鸟瞰图。

简而言之,DLS的提交规则是单链(One-Chain)的,允许节点仅由其自己的领导者提交。 PBFT、Tendermint和Casper中的提交规则几乎相同,由双链(Two-Chain)组成。 他们在介绍活动的机制上有所不同,PBFT具有二次方规模的leader“证明复杂度”(非线性),Tendermint和Casper在每个leader提议之前引入了强制性性的Δ延迟(非乐观响应性)。 HotStuff使用三链(Three-Chain)规则,并具有无延迟的线性leader协议。

图3 不同BFT协议的提交规则

In this section, we examine four BFT replication protocols spanning four decades of research in Byzantine fault tolernace, casting them into a chained framework similar to Chained HotStuff.

Figure 3 provides a birds-eye view of the commit rules of five protocols we consider, including HotStuff.

In a nutshell, the commit rule in DLS [25] is One-Chain, allowing a node to be committed only by its own leader. The commit rules in PBFT [20], Tendermint [15, 16] and Casper [17] are almost identical, and consist of Two-Chains. They differ in the mechanisms they introduce for liveness, PBFT has leader “proofs” of quadratic size (no Linearity), Tendermint and Casper introduce a mandatory ∆ delay before each leader proposal (no Optimistic Responsiveness). HotStuff uses a Three-Chain rule, and has a linear leader protocol without delay.

7.1 DLS

最简单的提交规则是单链(One-Chain)。服从该规则有第一个已知的异步拜占庭共识解决方案Dwork,Lynch和Stockmeyer(DLS)模型,如图3(a)所示。副本将锁定在其投票的最高节点上的DLS中。

不幸的是,如果在某个高度处,领导者模棱两可,此规则将很容易导致僵局,并且在该高度处的两个有冲突的提议被锁定在了两个正确的副本中。除非有2f +1表示他们没有投票赞成锁定的值,否则放弃任何一个锁定都是不安全的。

实际上,在DLS中,只有每个高度的leader本身都可以通过单链提交规则来达成提交决定。因此,如果领导者模棱两可,只会受到伤害。如果2f +1个副本没有投票赞成该副本,或者存在冲突的提议(由领导者签名),则副本可以放弃该锁。出现在DLS中每个高度末端的解锁协议非常复杂且昂贵。在没有故障且网络及时的最佳情况下,只有高空领导者可以决定这一事实,DLS每次决策都需要n次领导者轮换和O(n^4)条消息传输。虽然DLS在演示安全的异步协议方面开辟了新天地,但DLS并非设计为实用的解决方案。

The simplest commit rule is a One-Chain. Modeled after Dwork, Lynch, and Stockmeyer (DLS), the first known asynchronous Byzantine Consensus solution, this rule is depicted in Figure 3(a). A replica becomes locked in DLS on the highest node it voted for.

Unfortunately, this rule would easily lead to a deadlock if at some height, a leader equivocates, and two correct replicas became locked on the conflicting proposals at that height. Relinquishing either lock is unsafe unless there are 2f + 1 that indicate they did not vote for the locked value.

Indeed, in DLS only the leader of each height can itself reach a commit decision by the One-Chain commit rule. Thus, only the leader itself is harmed if it has equivocated. Replicas can relinquish a lock either if 2f +1 replicas did not vote for it, or if there are conflicting proposals (signed by the leader). The unlocking protocol occurring at the end of each height in DLS turns out to be fairly complex and expensive. Together with the fact that only the leader for a height can decide, in the best scenario where no fault occurs and the network is timely, DLS requires n leader rotations, and O(n^4) message transmissions, per single decision. While it broke new ground in demonstrating a safe asynchronous protocol, DLS was not designed as a practical solution.

7.2 PBFT

以PBFT为模型,更实际的方法是使用两链提交规则,请参见图3(b)。当副本为构成单链的节点投票时,它就锁定在该节点上。完全不可能在同一高度上冲突单链,因为每个链都有一个QC,因此避免了DLS的死锁情况。

但是,如果一个副本具有比其他副本更高的锁定,则即使领导者从n-f个副本中收集信息,领导者也可能对此一无所知。这可能会导致领导者纯粹由于调度而无法做出最终决定。为了获得“解粘”效果,PBFT会通过携带由2f +1个副本的最高单链组成的证明来解锁所有副本。如下所述,该证明相当复杂。

原始的PBFT已开源[20]并在多项后续工作中采用[13,34],领导者证明包含从n-f个副本收集的一组消息,报告每个成员投票支持的最高单链。每个单链包含一个QC,因此总通信成本为O(n^3)。利用[45,18],SBFT [30]中的签名组合方法,可以通过将每个QC设置为单个值,将成本降低到O(n^2)

在[21]中的PBFT变体中,领导者证明包含从仲裁群体中收集的最高领导者一次链。它还包括每个仲裁成员的一个已签名的值,证明它没有投票赞成更高的单链。广播该证明会导致通信复杂度O(n^2)。请注意,尽管QC上的签名可以组合为一个值,但是由于来自仲裁的不同成员的消息可能具有不同的值,因此作为一个整体的证明不能减少到恒定的大小。

在这两种变体中,正确的副本都可以解锁,即使它的单链比领导者的证明更高。因此,正确的领导者可以在同步期间强制其提议被接受,从而确保活动的活力。费用是每次更换领导者的二次通信费用。

Modeled after PBFT, a more practical appraoch uses a Two-Chain commit rule, see Figure 3(b). When a replica votes for a node that forms a One-Chain, it becomes locked on it. Conflicting One-Chains at the same height are simply not possible, as each has a QC, hence the deadlock situation of DLS is avoided.

However, if one replica holds a higher lock than others, a leader may not know about it even if it collects information from n − f replicas. This could prevent leaders from reaching decisions ad infinitum, purely due to scheduling. To get “unstuck”, the PBFT unlocks all replicas by carrying a proof consisting of the highest One-Chain’s by 2f + 1 replicas. This proof is quite involved, as explained below.

The original PBFT, which has been open-sourced [20] and adopted in several follow up works [13, 34], a leader proof contains a set of messages collected from n − f replicas reporting the highest One-Chain each member voted for. Each One-Chain contains a QC, hence the total communication cost is O(n^3). Harnessing signature combining methods from [45, 18], SBFT [30] reduces this cost to O(n^2) by turning each QC to a single value.

In the PBFT variant in [21], a leader proof contains the highest One-Chain the leader collected from the quorum only once. It also includes one signed value from each member of the quorum, proving that it did not vote for a higher One-Chain. Broadcasting this proof incurs communication complexity O(n^2). Note that whereas the signatures on a QC may be combined into a single value, the proof as a whole cannot be reduced to constant size because messages from different members of the quorum may have different values.

In both variants, a correct replica unlocks even it has a higher One-Chain than the leader’s proof. Thus, a correct leader can force its proposal to be accepted during period of synchrony, and liveness is guaranteed. The cost is quadratic communication per leader replacement.

7.3 Tendermint 和 Casper

Tendermint具有与PBFT相同的“双链”提交规则,而Casper具有“双链”规则,其中叶子不需要具有控制父节点的QC。也就是说,在Casper中,图3(c,d)分别描述了Tendermint和Casper的提交规则。

在这两种方法中,领导者都只是简单地发送其所知道的最高单链及其提议。如果副本从领导者接收到更高的副本,它将解锁单链。

但是,由于正确的副本可能无法投票给领导者的节点,因此要保证进度,新领导者必须通过等待最大的网络延迟来获得最高的单链。否则,如果领导者仅等待第一个n-f条消息开始新的高度,则无法保证进度。领导者的延迟在Tendermint和Casper中都是固有的,以便提供生气。

这个简单的领导者协议体现了HotStu ff借鉴的领导者协议的通信复杂度的线性飞跃。如上所述,可以使用阈值签名将QC捕获为单个值,因此领导者可以收集和传播具有线性通信复杂性的最高单链。但是,至关重要的是,由于额外的QC步骤,HotStu ff不需要领导者等待最大的网络延迟。

Tendermint has a Two-Chain commit rule identical to PBFT, and Casper has a Two-Chain rule in which the leaf does not need to have a QC to direct parent. That is, in Casper, Figure 3(c,d) depicts the commit rules for Tendermint and Casper, respectively.

In both methods, a leader simply sends the highest One-Chain it knows along with its proposal. A replica unlocks a One-Chain if it receives from the leader a higher one.

However, because correct replicas may not vote for a leader’s node, to guarantee progress a new leader must obtain the highest One-Chain by waiting the maximal network delay. Otherwise, if leaders only wait for the first n − f messages to start a new height, there is no progress guarantee. Leader delays are inherent both in Tendermint and in Casper, in order to provide liveness.

This simple leader protocol embodies a linear leap in the communication complexity of the leader protocol, which HotStuff borrows from. As already mentioned above, a QC could be captured in a single value using threshold signatures, hence a leader can collect and disseminate the highest One-Chain with linear communication complexity. However, crucially, due to the extra QC step, HotStuff does not require the leader to wait the maximal network delay.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Implementation

6. 实现

HotStuff是用于构建高效SMR系统的实用协议。 由于其简易性,我们可以轻松地将算法3转换为事件驱动样式的规范,该规范几乎类似于原型实现的代码框架。

HotStuff is a practical protocol for building efficient SMR systems. Because of its simplicity, we can easily turn Algorithm 3 into an event-driven-style specification that is almost like the code skeleton for a prototype implementation.

如算法4所示,可以进一步简化代码,方法是将活跃性相关机制简化并泛化成为一个独立的模块,该模块我们称之为pacemaker。一般来说,下一个leader需要在generic阶段末尾等待genericQC,再开始它的“统治”。 稳定的leader可以略过此步骤,并简化多个高度的提案。 此外,在维护最高genericQC和lockedQC的过程中,我们放宽了直接父亲的约束,同时仍然保留了有效节点中的QC始终引用其祖先的要求。 正确性的证明与Chained HotStuff类似,我们也将其推迟到附录中。

As shown in Algorithm 4, the code is further simplified and generalized by extracting the liveness mechanism from the body into a module named Pacemaker. Instead of the next leader always waiting for a genericQC at the end of the generic phase before starting its reign, this logic is delegated to the Pacemaker. A stable leader can skip this step and streamline proposals across multiple heights. Additionally, we relax the direct parent constraint for maintaining the highest genericQC and lockedQC, while still preserving the requirement that the QC in a valid node always refers to its ancestor. The proof of correctness is similar to Chained HotStuff and we also defer it to the appendix of [50].

数据结构 任意的副本u持续追踪下列的核心状态变量:

  • V[\cdot] 将节点映射到投票
  • vheight 最近投票节点的高度
  • b_{lock} 被锁定的节点(和lockedQC相似)
  • b_{exec} 最后执行的节点
  • qc_{high} 最高的已知QC(与genericQC相似),为Pacemaker维护的
  • b_{leaf} 为Pacemaker维护的叶节点

同时维护一个常量 b_0 , 一个系统共享的的创世纪节点,被所有副本所公认,用于系统的启动。b_0 包含了一个通过硬编码指向自身的QC, b_{lock} , b_{exec} , b_{leaf} 都初始化为 b_0 , 且 qc_{high} 包含 b_0的QC .

Data structures. Each replica u keeps track of the following main state variables:

  • V[\cdot] mapping from a node to its votes.
  • vheight height of last voted node.
  • b_{lock} locked node (similar to lockedQC).
  • b_{exec} last executed node.
  • qc_{high} highest known QC (similar to genericQC) kept by a Pacemaker.
  • b_{leaf} leaf node kept by a Pacemaker.

It also keeps a constant b_0 , the same genesis node known by all correct replicas. To bootstrap, b_0 contains a hardcoded QC for itself, b_{lock} , b_{exec} , b_{leaf} are all initialized to b_0 , and qc_{high} contains the QC for b_0 .

起搏器(Pacemaker) Pacemaker是这样一个机制,保证GST之后一定会取得进展,通过下面两种机制保证的。

第一个是“同步”,同步会将所有正确的副本和唯一的领导者带入一个共同高度的视图,持续足够长的时间。 文献[25、20、15]中通常使用的同步机制是使副本在更大的高度的视图上增加Δ的数量,直到取得进展。 常用的选取leader的方法是使用轮换leader方案,其中所有正确的副本都保留一个预先确定的leader时间表,并在转移leader时轮换到时间表中下一个leader。

Pacemaker. A Pacemaker is a mechanism that guarantees progress after GST. It achieves this through two ingredients.

The first one is “synchronization”, bringing all correct replicas, and a unique leader, into a common height for a sufficiently long period. The usual synchronization mechanism in the literature [25, 20, 15] is for replicas to increase the count of ∆’s they spend at larger heights, until progress is being made. A common way to deterministically elect a leader is to use a rotating leader scheme in which all correct replicas keep a predefined leader schedule and rotate to the next one when the leader is demoted.

其次,Pacemaker需要为leader提供一种选择提案的方式,该提案将得到正确副本的支持。 如算法5所示,在视图更改后,新领导者在onReceiveNewView中收集副本通过onNextSyncView发送的新视图消息,以发现最高的仲裁证书,以满足onReceiveProposal中第二个条件的要求(算法4的第18行) )。 但是,在同一视图中,现任leader会将新节点链接到自己最后提出的叶的末尾,在此不需要新视图消息。 基于某些特定于应用程序的启发式方法(例如,等到先前提出的节点获得QC之后),当前领导者调用onBeat提出一个新节点,该节点携带要执行的命令。

Second, a Pacemaker needs to provide the leader with a way to choose a proposal that will be supported by correct replicas. As shown in Algorithm 5, after a view change, in onReceiveNewView, the new leader collects new-view messages sent by replicas through onNextSyncView to discover the highest QC to satisfy the second part of the condition in onReceiveProposal for liveness (Line 18 of Algorithm 4). During the same view, however, the incumbent leader will chain the new node to the end of the leaf last proposed by itself, where no new-view message is needed. Based on some application-specific heuristics (to wait until the previously proposed node gets a QC, for example), the current leader invokes onBeat to propose a new node carrying the command to be executed.

值得注意的是,即使不良的Pacemaker任意调用onPropose或任意选择父级和QC,并且在任何计划延迟的情况下,也始终可以确保安全性。 因此,仅通过算法4保证的安全性就可以通过算法5的任何潜在实例与生命力完全脱钩。

It is worth noting that even if a bad Pacemaker invokes onPropose arbitrarily, or selects a parent and a QC capriciously, and against any scheduling delays, safety is always guaranteed. Therefore, safety guaranteed by Algorithm 4 alone is entirely decoupled from liveness by any potential instantiation of Algorithm 5.

两阶段HotStuff变型 为了进一步说明HotStuff框架的灵活性,算法6显示了HotStuff的两阶段变体。 仅执行更新过程,需要Two-Chain才能做出提交决定,而One-Chain确定锁定。 如上所述(第4.4节),此两阶段变型失去了乐观响应能力,类似于Tendermint / Casper。 好处是更少的阶段,而活跃性可以通过在Pacemaker中加入基于最大网络延迟的等待来解决。 参见第7.3节进一步讨论。

Two-phase HotStuff variant. To further demonstrate the flexibility of the HotStuff framework, Algorithm 6 shows the two-phase variant of HotStuff. Only the update procedure is affected, a Two-Chain is required for reaching a commit decision, and a One-Chain determines the lock. As discussed above (Section 4.4), this two-phase variant loses Optimistic Responsiveness, and is similar to Tendermint/Casper. The benefit is fewer phases, while liveness may be addressed by incorporating in Pacemaker a wait based on maximum network delay. See Section 7.3 for further discussion.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Chained HotStuff

5. Chained HotStuff

上一节讲到的Basic HotStuff里面,leader想要提交一个提案,需要三个步骤。这些步骤除了从副本处收集投票之外,并没有做其他什么有意义的工作,因此从形式上来看非常相似。在Chained HotStuff里面,我们一方面优化了Basic HotStuff的功能,同时对其进行了简化。核心思想是在prepare阶段就改变视图,因此每一个提案都会有自己对应的视图。这样做会简化消息的类别,并且允许决策的并行化。一个相似的消息类别简化思路在Casper算法的文章里也有所提及。

It takes three phases for a Basic HotStuff leader to commit a proposal. These phases are not doing “useful” work except collecting votes from replicas, and they are all very similar. In Chained HotStuff, we improve the Basic HotStuff protocol utility while at the same time considerably simplifying it. The idea is to change the view on every prepare phase, so each proposal has its own view. This reduces the number of message types and allows for pipelining of decisions. A similar approach for message type reduction was suggested in Casper [1].

更具体地说,在Chained HotStuff协议中,prepare阶段的投票由leader收集到一个视图中,储存在状态变量genericQC里。接下来,genericQC被转发给下一个视图的leader,实质上是将下一阶段的职责(这曾经由pre-commit负责的)委托给下一个leader节点。然而,下一位leader实际上并不单独进行pre-commit阶段,而是启动一个新的prepare阶段,添加自己的提议。视图v+1的prepare阶段同时充当视图v的pre-commit阶段。视图v+2的prepare阶段同时充当视图v+1的pre-commit阶段和视图v的commit阶段。这中流水线化的思路是完全可行的,因为Basic HotStuff的所有的阶段都有相同的结构。

More specifically, in Chained HotStuff the votes over a prepare phase are collected in a view by the leader into a genericQC. Then the genericQC is relayed to the leader of the next view, essentially delegating responsibility for the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually carry a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare phase for view v + 1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v + 2 simultaneously serves as the pre-commit phase for view v + 1 and as the commit phase for view v. This is possible because all the phases have identical structure.

图1描绘了将Basic HotStuff协议的各个阶段流水线化,最终得到Chained HotStuff协议的示例。在Chained HotStuff协议里,视图v_1v_2v_3对于cmd 1来说充当了Basic HotStuff中的prepare, pre-commit, commit阶段。该命令将在v_4结束时提交。视图v_2v_3v_4作为v_2中提议的cmd 2的三个Basic Hotstuff阶段,并在v_5结束时提交。在这些阶段生成的其他提议以类似的方式继续流水线化,用虚线框表示。在图1中,单箭头表示节点bb.parent字段,双箭头表示b.justify.node

The pipeline of Basic HotStuff protocol phases embedded in a chain of Chained HotStuff proposals is depicted in Figure 1. Views v_1 , v_2 , v_3 of Chained HotStuff serve as the prepare, pre-commit, and commit Basic HotStuff phases for cmd 1 proposed in v_1 . This command becomes committed by the end of v_4 . Views v_2 , v_3 , v_4 serve as the three Basic HotStuff phases for cmd 2 proposed in v_2 , and it becomes committed by the end of v_5 . Additional proposals generated in these phases continue the pipeline similarly, and are denoted by dashed boxes. In Figure 1, a single arrow denotes the b.parent field for a node b, and a double arrow denotes b.justify.node.

图1 Chained HotStuff是Basic HotStuff的流水线版本,QC可以同时处于多个阶段。

基于上述的结构,事实上Chained HotStuff只有两种类别的消息—— NewView和GenericPhase消息。一个通用的QC函数在所有的流水线阶段运行。接下来,我们将解释流水线的机制,重点关注锁定与提交部分,他们对应了Basic HotStuff的commit和deside阶段。

Hence, there are only two types of messages in Chained HotStuff, a new-view message and generic-phase generic message. The generic QC functions in all logically pipelined phases. We next explain the mechanisms in the pipeline to take care of locking and committing, which occur only in the commit and decide phases of Basic HotStuff.

虚拟节点 在某些视图viewNumber中,leader使用的genericQC不能直接引用前一个视图的提议(viewNumber-1)。原因是之前视图的leader未能获得QC,要么是因为有冲突的提议,要么是因为节点的崩溃。为了简化树结构,createLeaf将genericQC.node扩展为空白节点,直到提议视图的高度(节点分支上父链接的数量),因此视图编号与节点高度相等。这样就会导致,嵌入在节点b中的仲裁证书QC不能正确引用其父节点,即b.justify.node不能等于b.parent(图2中的最后一个节点)。

Dummy nodes. The genericQC used by a leader in some view viewNumber may not directly reference the proposal of the preceding view (viewNumber −1). The reason is that the leader of a preceding view fails to obtain a QC, either because there are conflicting proposals, or due to a benign crash. To simplify the tree structure, createLeaf extends genericQC.node with blank nodes up to the height (the number of parent links on a node’s branch) of the proposing view, so view-numbers are equated with node heights. As a result, the QC embedded in a node b may not refer to its parent, i.e., b.justify.node may not equal b.parent (the last node in Figure 2).

图2 节点v4,v5,v6组成了Three-Chain,节点v8没有成为一个合法的One-Chain。

一链(One-Chain),二链(Two-Chain)和三链(Three-Chain) 当节点b^*带有和父节点相同的QC时,即b^∗.justify.node = b^∗.parent,我们说它形成了一个One-Chain。 令b''=b^∗.justify.node,如果除了形成单链外,还满足b''.justify.node = b''.parent,那么节点b^∗形成Two-Chain。 如果b''已经是Two-Chain了,那么b^*将形成Three-Chain。

One-Chain, Two-Chain, and Three-Chain. When a node b^∗ carries a QC that refers to a direct parent, i.e., b^∗ .justify.node = b^∗ .parent, we say that it forms a One-Chain. Denote by b'' = b^∗ .justify.node. Node b^∗ forms a Two-Chain, if in addition to forming a One-Chain, b''.justify.node = b''.parent. It forms a Three-Chain, if b'' forms a Two-Chain.

查看链b = b'.justify.node, b'= b''.justify.node, b'' = b^∗ .justify.node,祖先空缺可能发生在任何一个 节点。 这些情况类似于Basic HotStuff的leader未能完成三个阶段中的任何一个阶段,并被nextView中断到下一个视图。

Looking at chain b = b'.justify.node, b'= b''.justify.node, b'' = b^∗ .justify.node, ancestry gaps might occur at any one of the nodes. These situations are similar to a leader of Basic HotStuff failing to complete any one of three phases, and getting interrupted to the next view by nextView.

如果b^*形成One-Chain,则b''的prepare阶段是成功的。 因此,当副本对b^∗投票时,它应该记录了genericQC \leftarrow b^∗.justify。 我们需要注意,即使One-Chain不是直接的,只要它高于当前的genericQC,也可以安全地更新genericQC。 在第6节中描述的实现代码中,在这种情况下,我们确实更新了genericQC

If b^∗ forms a One-Chain, the prepare phase of b'' has succeeded. Hence, when a replica votes for b^∗ , it should remember genericQC \leftarrow b^∗.justify. We remark that it is safe to update genericQC even when a One-Chain is not direct, so long as it is higher than the current genericQC. In the implementation code described in Section 6, we indeed update genericQC in this case.

如果b^*形成两链,则b'的pre-commit阶段成功。 因此,副本应更新lockedQC \leftarrow b''.justify。 再次注意,即使Two-Chain不是直接的,也可以更新该锁(安全不会中断),而实际上,这在第6节的实现代码中已给出。

If b^* forms a Two-Chain, then the pre-commit phase of b 0 has succeeded. The replica should therefore update lockedQC \leftarrow b''.justify. Again, we remark that the lock can be updated even when a Two-Chain is not direct—safety will not break—and indeed, this is given in the implementation code in Section 6.

最后,如果b^*形成Three-Chain,则b的commit阶段成功,并且b成为已提交的决策。

Finally, if b ∗ forms a Three-Chain, the commit phase of b has succeeded, and b becomes a committed decision.

算法3显示了Chain HotStuff的伪代码。 附录A中定理5给出的安全性证明与Basic HotStuff的证明相似。 我们要求在有效节点中的仲裁证书(QC)引用其祖先。 为简便起见,我们假设约束始终成立,并且省略了代码中的检查。

Algorithm 3 shows the pseudocode for Chained HotStuff. The proof of safety given by Theorem 5 in Appendix A is similar to the one for Basic HotStuff. We require the QC in a valid node refers to its ancestor. For brevity, we assume the constraint always holds and omit checking in the code.

算法3 Chained HotStuff协议

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Basic HotStuff

这是《HotStuff:BFT Consensus in the Lens of Blockchain》论文翻译的第四篇,包含了论文的Basic HotStuff章节。


4. 简化HotStuff协议

HotSuff解决了状态复制机(SMR)问题,在SMR的核心问题,是如何设计一个协议,引导客户端发来的指令自然执行。状态复制机的副本们按照统一的顺序执行请求。一个客户端向所有副本发送一个请求,并等待其中的f+1个副本的回执。在大多数情况下,我们在讨论中省略了客户端这个角色,在有关请求的编号和重复数据消除的问题上,遵循行业标准做法。

HotStuff solves the State Machine Replication (SMR) problem. At the core of SMR is a protocol for deciding on a growing log of command requests by clients. A group of state-machine replicas apply commands in sequence order consistently. A client sends a command request to all replicas, and waits for responses from (f + 1) of them. For the most part, we omit the client from the discussion, and defer to the standard literature for issues regarding numbering and de-duplication of client requests.

算法2给出了简易版HotStuff的解决方案。这个协议在一系列编号单调递增的视图(view)中工作,每一个编号的视图都有唯一的leader节点被其他副本所公认。所有节点都在其内存中储存一个待执行指令树。所有的树节点包括一个客户端提议的指令(command)、协议的关联元数据(metadata)和一个父节点链接。节点的分支路径被定义为从该节点出发,沿着父节点链接走到根节点的整条路径。在协议过程中,单调增长的分支被提交(commited),为了提交分支,leader节点需要提议一个能够获得n-f个副本投票支持的分支,而投票分为三个步骤,准备阶段prepare, 预提交阶段pre-commit,提交阶段commit。

【由于这里介绍的是一个简化版模型,Basic HotStuff里面不存在leader节点的选取细节,leader节点可以看做在运行之初被公认了,后面的文章会提到,leader的选取由一个独立的pacemaker算法实现。同时,视图可以理解为账本变动的最小单位,相同视图中账本内容完全相同,而记账则体现在视图的变化。】

The Basic HotStuff solution is presented in Algorithm 2. The protocol works in a succession of views numbered with monotonically increasing view numbers. Each viewNumber has a unique dedicated leader known to all. Each replica stores a tree of pending commands as its local data structure. Each tree node contains a proposed command (or a batch of them), metadata associated with the protocol, and a parent link. The branch led by a given node is the path from the node all the way to the tree root by visiting parent links. During the protocol, a monotonically growing branch becomes committed. To become committed, the leader of a particular view proposing the branch must collect votes from a quorum of (n - f) replicas in three phases, prepare, pre-commit, and commit.

协议的核心部分是leader节点发起的n-f个副本的投票,这个投票的结果被称为仲裁证书(quorum certificate, QC)。仲裁证书被关联到一个特定的树节点以及视图编号。tcombine函数可以将n-f个被签名的投票合并成一个可进行合法性验证的仲裁证书。

A key ingredient in the protocol is a collection of (n - f) votes over a leader proposal, referred to as a quorum certificate (or “QC” in short). The QC is associated with a particular node and a view number. The tcombine utility employs a threshold signature scheme to generate a representation of (n - f) signed votes as a single authenticator.

下面将首先描述Basic HotStuff协议的各阶段执行逻辑,紧接着是基于伪代码(算法2)更精细的描述。最后用对Basic HotStuff的安全性、活跃性、复杂度分析结束本节。

Below we give an operational description of the protocol logic by phases, followed by a precise specification in Algorithm 2, and conclude the section with safety, liveness, and complexity arguments.

4.1 阶段

准备(prepare)阶段 一个新leader节点的协议过程起始于从n-f个副本中收集NewView消息。副本在进入新的viewNumber时(包括最开始的那个视图),会发送一个NewView消息给leader节点,并且在消息中捎带该副本当前的prepareQC值(如果不存在则发送\perp)。

leader节点选择一个prepareQC最高的分支,这个值被定义为highQC。因为highQCn-f个副本中的最高值,所以没有更高的视图可以做提交决定,因此highQC对应的副本作为leader节点对于该分支来说是最安全的。

收集NewView消息之后,如何选择一个安全的分支这个部分会被当前leader节点忽略,在这里我们简单地选择最高prepareQC作为highQC,为了简便起见,我们将此优化推迟到第6节。在本节中我们仅描述一个单一leader节点的协议。请注意,与类似PBFT的协议不同,我们现在这个单一leader节点协议中,上述步骤是直接跳过的,并强制与协议的其他阶段一样产生相同的线性开销。

leader使用createLeaf方法用一个新的提案扩展highQC.node的尾部。该方法创建一个新的叶节点作为子节点,并在子节点中嵌入父节点的摘要。然后,leader节点使用一个prepare类型消息将新节点发送到所有其他副本。该消息同时包含了highQC以支持后续的安全性验证。

从leader接收到当前视图的prepare类型消息后,副本r使用safeNode谓词函数判断是否接受它。如果被接受,复制品会将带有部分签名(由tsign(r)生成)的prepare类型投票发送给leader节点。

prepare phase. The protocol for a new leader starts by collecting new-view messages from (n - f) replicas. The new-view message is sent by a replica as it transitions into viewNumber (including the first view) and carries the highest prepareQC that the replica received (⊥ if none), as described below.

The leader processes these messages in order to select a branch that has the highest preceding view in which a prepareQC was formed. The leader selects the prepareQCwith the highest view, denoted highQC, among the new-view messages. Because highQC is the highest among (n - f) replicas, no higher view could have reached a commit decision. The branch led by highQC.node is therefore safe.

Collecting new-view messages to select a safe branch may be omitted by an incumbent leader, who may simply select its own highest prepareQC as highQC. We defer this optimization to Section 6 and only describe a single, unified leader protocol in this section. Note that, different from PBFT-like protocols, including this step in the leader protocol is straightforward, and it incurs the same, linear overhead as all the other phases of the protocol, regardless of the situation.

The leader uses the createLeaf method to extend the tail of highQC.node with a new proposal. The method creates a new leaf node as a child and embeds a digest of the parent in the child node. The leader then sends the new node in a prepare message to all other replicas. The proposal carries highQC for safety justification.

Upon receiving the prepare message for the current view from the leader, replica r uses the safeNode predicate to determine whether to accept it. If it is accepted, the replica sends a prepare vote with a partial signature (produced by tsign_r ) for the proposal to the leader.

安全节点(safeNode)谓词函数 safeNode谓词是协议的核心组成部分。它检查带有仲裁证明域(QC)m.justify的消息m,并确定m.node是否可以安全接受。一方面,如果m.node分支是从当前lockedQC.node分支扩展而来,则可以安全的接受提案。另一方面,由活跃性规则保证,如果m.justify的视图高于当前lockedQC,则副本将接受m。只要两个规则中的任何一个成立,谓词函数就返回真。

【我的理解是,safety rule指的是leader在lockedQC.node后面加了一个command节点,对于副本来说,接受这个节点毫无坏处,甚至可以和leader保持同步。liveness rule指的是,当前副本发现自己锁定的副本以及过时了,leader都已经跑到下一个视图了,所以说也会无条件支持leader。safeNode谓词函数为false意味着副本和leader在lockedQC上有冲突,且位于同一个视图,这个时候,副本不给leader投票】

safeNode predicate. The safeNode predicate is a core ingredient of the protocol. It examines a proposal message m carrying a QC justification m.justify, and determines whether m.node is safe to accept. The safety rule to accept a proposal is the branch of m.node extends from the currently locked node lockedQC.node. On the other hand, the liveness rule is the replica will accept m if m.justify has a higher view than the current lockedQC. The predicate is true as long as either one of two rules holds.

预提交(pre-commit)阶段 当leader节点收到n-f个为当前提案curProposal投票时,将他们合并为prepareQC。leader使用pre-commit类型的消息广播prepareQC。一个副本通过PRE COMMIT投票对leader节点做出响应,并对提案进行签名摘要。

pre-commit phase. When the leader receives (n - f) prepare votes for the current proposal curProposal, it combines them into a prepareQC. The leader broadcasts prepareQC in pre-commit messages. A replica responds to the leader with pre-commit vote having a signed digest of the proposal.

提交(commit)阶段 commit阶段类似于pre-commit阶段。当leader接收到n-f个PRE COMMIT投票时,它将它们组合成一个precommitQC,并在commit类型消息中广播它;副本用COMMIT投票来响应它。重要的是,此时通过将副本的lockedQC设置为precommitQC(算法2的第25行),副本将被锁定在precommitQC上。这对于保护提案的安全至关重要,以防提案成为一项协商一致的决定。

commit phase. The commit phase is similar to pre-commit phase. When the leader receives (n-f) pre-commit votes, it combines them into a precommitQC and broadcasts it in commit messages; replicas respond to it with a commit vote. Importantly, a replica becomes locked on the precommitQC at this point by setting its lockedQC to precommitQC (Line 25 of Algorithm 2). This is crucial to guard the safety of the proposal in case it becomes a consensus decision.

决定(decide)阶段 当leader节点收到n-f个COMMIT的选票时,它将它们合并成一个commitQC。一旦leader节点获得了commitQC,它就会以DECIDE消息的形式将其发送给所有其他副本。在接收到DECIDE消息时,副本将commitQC中包含的提案视图视为已提交的决策,并在已提交的分支中执行命令。副本将增加viewNumber并启动下一个视图。

decide phase. When the leader receives (n - f) commit votes, it combines them into a commitQC. Once the leader has assembled a commitQC, it sends it in a decide message to all other replicas. Upon receiving a decide message, a replica considers the proposal embodied in the commitQC a committed decision, and executes the commands in the committed branch. The replica increments viewNumber and starts the next view.

下一个视图(nextView)中断 在所有阶段中,副本都会在viewNumber视图处等待由nextView(viewNumber)函数确定的超时时间段。如果nextView(viewNumber)中断等待,副本也会增加viewNumber并开始下一个视图。

nextView interrupt. In all phases, a replica waits for a message at view viewNumber for a timeout period, determined by an auxiliary nextView(viewNumber) utility. If nextView(viewNumber) interrupts waiting, the replica also increments viewNumber and starts the next view.

4.2 数据结构

消息(message) 协议中的消息m由一组固定的字段组成,这些字段使用算法1中所示的Msg()程序填充。m会被自动打上curView标记,即发送者当前的视图编号。每个消息都有一个类型m.type \in \{new view, prepare, pre commit, commit, decide\}。m.node包含一个建议的节点(建议分支的叶节点)。有一个可选的字段m.justify。leader节点总是在不同的阶段使用这个域来储存仲裁证书(QC)。副本节点则在新视图消息中使用它来携带最高的prepareQC。以副本角色发送的每个消息都包含部分签名m.partialSig,该值是通过将三元组\langle m.type, m.viewNumber, m.node \rangle通过voteMsg()函数签名获得的。

Messages. A message m in the protocol has a fixed set of fields that are populated using the Msg() utility shown in Algorithm 1. m is automatically stamped with curView, the sender’s current view number. Each message has a type m.type ∈ {new-view, prepare, pre-commit, commit, decide}. m.node contains a proposed node (the leaf node of a proposed branch). There is an optional field m.justify. The leader always uses this field to carry the QC for the different phases. Replicas use it in new-view messages to carry the highest prepareQC. Each message sent in a replica role contains a partial signature m.partialSig by the sender over the tuple hm.type, m.viewNumber, m.nodei, which is added in the voteMsg() utility.

仲裁证书(Quorum certificates) 一个仲裁证书是建立在三元组\langle type, viewNumber, node \rangle之上的,并且组合了该三元组在n-f个副本上的签名。给定一个仲裁证书 qc,我们使用qc.type, qc.viewNumber, qc.node来指示原三元组中的项。

Quorum certificates. A Quorum Certificate (QC) over a tuple \langle type, viewNumber, node \rangle is a data type that combines a collection of signatures for the same tuple signed by (n-f) replicas. Given a QC qc, we use qc.type, qc.viewNumber, qc.node

树(Tree)和分枝(Branches) 每个命令都包装在一个节点中,该节点还包含一个父链接,该父链接可以是父节点的哈希摘要。我们在伪代码中省略了实现细节。在协议期间,副本仅在节点所对应的分支已在其本地树中之后才传递消息。实际上,落后的副本可以通过从其他副本中获取丢失的节点数据来赶上。为简洁起见,伪代码中也省略了这些细节。如果两个分支都不是另一个分支的扩展,那么这两个分支就是冲突的。如果两个节点所对应的分支冲突,则两个节点冲突。

Tree and branches. Each command is wrapped in a node that additionally contains a parent link which could be a hash digest of the parent node. We omit the implementation details from the pseudocode. During the protocol, a replica delivers a message only after the branch led by the node is already in its local tree. In practice, a recipient who falls behind can catch up by fetching missing nodes from other replicas. For brevity, these details are also omitted from the pseudocode. Two branches are conflicting if neither one is an extension of the other. Two nodes are conflicting if the branches led by them are conflicting.

记账变量 副本使用额外的局部变量来记录协议状态:
i) viewNumber,最初为1,通过完成决策或nextView中断实现递增;
ii) lockedQC,最初为⊥,存储副本投票提交的最高QC;
iii) prepareQC,最初为⊥,存储副本投票预提交的QC。
此外,为了增量地执行一个提交的命令日志,副本维护其分支已被执行的最高节点。为了简洁起见,下面省略这一点。

Bookkeeping variables. A replica uses additional local variables for bookkeeping the protocol state: (i) a viewNumber, initially 1 and incremented either by finishing a decision or by a nextView interrupt; (ii) a locked quorum certificate lockedQC, initially ⊥, storing the highest QC for which a replica voted commit; and (iii) a prepareQC, initially ⊥, storing the highest QC for which a replica voted pre-commit. Additionally, in order to incrementally execute a committed log of commands, the replica maintains the highest node whose branch has been executed. This is omitted below for brevity.

4.3 协议规范

算法2中给出的协议被描述为逐视图循环迭代。在每轮视图中,副本根据扮演的其角色(描述为一系列as块)连续执行阶段(注意,副本可以具有多个角色,例如,leader也是一个normal副本)。不同角色的as块的执行可以并行化。每个as块的执行本身则是原子的。nextView中断中止任何as块中的所有操作,并跳到Finally块。

The protocol given in Algorithm 2 is described as an iterated view-by-view loop. In each view, a replica performs phases in succession based on its role, described as a succession of “as” blocks. A replica can have more than one role. For example, a leader is also a (normal) replica. Execution of as blocks across roles can be proceeded concurrently. The execution of each as block is atomic. A nextView interrupt aborts all operations in any as block, and jumps to the “Finally” block.

4.4 安全性、活跃性以及复杂度

安全性(Safety). 我们定义仲裁证书 qc是合法的当且仅当tverify(\langle qc.type, qc.viewNumber, qc.node \rangle, qc.sig)为真。

Safety. We first define a quorum certificate qc to be valid if tverify(\langle qc.type, qc.viewNumber, qc.node \rangle, qc.sig) is true.

引理1. 对于任意合法的qc_1, qc_2,如果有qc_1.type = qc_2.typeqc_1.nodeqc_2.node 冲突,那么我们有qc_1.viewNumber \neq qc_2.viewNumber

Lemma 1. For any valid qc_1, qc_2 in which qc_1.type = qc_2.type and qc_1.node conflicts with qc_2.node, we have qc_1.viewNumber \neq qc_2.viewNumber.

证明. 反证法,假设qc_1.viewNumber = qc_2.viewNumber = v,由于一个合法的仲裁证书只能够通过n - f = 2f + 1个副本投票(部分签名)生成。因此,一定有一个正常工作的副本对v投票了两次,伪代码里面并不允许这个情况出现。

Proof. To show a contradiction, suppose qc_1.viewNumber = qc_2.viewNumber = v. Because a valid QC can be formed only with n - f = 2f + 1 votes (i.e., partial signatures) for it, there must be a correct replica who voted twice in the same phase of v. This is impossible because the pseudocode allows voting only once for each phase in each view.

定理2 如果wb是冲突的节点,那么他们不能够同时被一个正确的副本所提交(commit)。

Theorem 2. If w and b are conflicting nodes, then they cannot be both committed, each by a correct replica.

证明 我们也会用反证法证明这个重要的结论。 令qc_1表示一个合法的commitQC(即q_1.type = commit),此时q_1.node = w,同时qc_2是另一个合法的commitQC,满足q_2.node = b。令v_1 = qc_1.viewNumberv2=qc_2.viewNumber。由引理1我们知道,v_1 \neq v_2,不失一般性,可以假设v_1 < v_2

现在,我们假设v_s是比v_1高的prepareQC里面的最小视图。qc_s.type=prepare, qc_s.viewNumber = v_sqc_s.nodew冲突,那么我们可以定义下面的谓词逻辑:

E(prepareQC):=(v_1 < prepareQC.viewNumber \leq v_2 ) \wedge (prepareQC.node\ conflicts \quad with\quad w).

我们现在可以按照下列形式定义出qc_sqc_s可以看做是“冲突”的起始位置:

qc_s := argmin_{prepareQC} \{prepareQC.viewNumber | prepareQC\quad is\quad valid \wedge E(prepareQC)\} .

注意,基于之前的假设,这样的qc_s必须存在;例如,在最坏情况,qc_s可以是在视图v_2中形成的prepareQC。

一个正确的副本,会发送部分签名的结果tsign_r(\langle qc_1.type, qc_1.viewNumber, qc_1.node \rangle)给leader节点,令r成为对于tsign_r(\langle qc_1.type, qc_1.viewNumber, qc_1.node\rangle)有贡献的第一个副本;这样的r必须存在,否则,qc_1.sigqc_s.sig其中一个就不能创建。在视图v_1期间,副本r在算法2的第25行使用precommitQC更新它的lockedQC,对应树节点w。由于v_s的最小化定义,副本r上曾经在w处生成的lockedQC在qc_s形成之前不会改变。否则r一定看到了其他一些低视图的prepareQC,因为第17行在第25行之前,与最小值相矛盾。现在考虑副本rz在视图v_s的准备阶段调用safeNode,其中消息m包含m.node=qc_s.node。假设m.nodelockedQC.node冲突,因此算法1第26行的判断为false。此外,m.justify.viewNumber>v_1将违反v_s的最小值的假设,因此算法1第27行中的判断也是false的。因此,safeNode必须返回false,而r不能针对v_s的观点进行prepare投票,发现矛盾,证毕。

Proof. We prove this important theorem by contradiction. Let qc_1 denote a valid commitQC (i.e., qc_1.type = commit) such that qc_1.node = w, and qc_2 denote a valid commitQC such that qc_2.node = b. Denote v_1 = qc_1.viewNumber and v_2 = qc_2.viewNumber. By Lemma 1, v_1 \neq v_2 . W.l.o.g. assume v_1 < v_2 .

We will now denote by v_s the lowest view higher than v_1 for which there is a valid prepareQC, qc_s (i.e., qc_s.type = prepare) where qc_s.viewNumber = v_s , and qc_s.node conflicts with w. Formally, we define the following predicate for any prepareQC:

E(prepareQC):=(v_1 < prepareQC.viewNumber \leq v_2 ) \wedge (prepareQC.node\ conflicts\ with\ w).

We can now set the first switching point qc_s :

qc_s := argmin_{prepareQC} \{prepareQC.viewNumber | prepareQC\ is\ valid \wedge E(prepareQC)\} .

Note that, by assumption such a qc_s must exist; for example, qc_s could be the prepareQC formed in view v_2 .

Of the correct replicas that sent a partial result tsign_r(\langle qc_1.type, qc_1.viewNumber, qc_1.node \rangle), let r be the first that contributed tsign_r(\langle qc_1.type, qc_1.viewNumber, qc_1.node \rangle); such an r must exist since otherwise, one of qc_1.sig and qc_s.sig could not have been created. During view v_1 , replica r updates its lock lockedQC to a precommitQC on w at Line 25 of Algorithm 2. Due to the minimality of v_s , the lock that replica r has on the branch led by w is not changed before qc_s is formed. Otherwise r must have seen some other prepareQC with lower view because Line 17 comes before Line 25, contradicting to the minimality. Now consider the invocation of safeNode in the prepare phase of view v_s by replica r, with a message m carrying m.node = qc_s.node. By assumption, m.node conflicts with lockedQC.node, and so the disjunct at Line 26 of Algorithm 1 is false. Moreover, m.justify.viewNumber > v_1 would violate the minimality of v_s , and so the disjunct in Line 27 of Algorithm 1 is also false. Thus, safeNode must return false and r cannot cast a prepare vote on the conflicting branch in view v_s , a contradiction.

活跃性(Liveness). 前面的叙述中,缺少了两个函数,leader和nextView。他们的定义与协议的安全性无关,但和协议的活跃性有关。在给出候选定义之前,我们首先定义GST之后,存在一个有界的持续时间T_f,满足如果所有副本在T_f期间仍然在视图v中,且视图v的leader节点正常运行,那么决定(decision)可以到达。之后的叙述中,我们定义qc_1qc_2匹配(match)当且仅当qc_1qc_2是合法的,且qc_1.node = qc_2.node,且qc_1.viewNumber = qc_2.viewNumber

Liveness. There are two functions left undefined in the previous section: leader and nextView. Their definition will not affect safety of the protocol, though they do matter to liveness. Before giving candidate definitions for them, we first show that after GST, there is a bounded duration T_f such that if all correct replicas remain in view v during T_f and the leader for view v is correct, then a decision is reached. Below, we say that qc_1 and qc_2 match if qc_1 and qc_2 are valid, qc_1.node = qc_2.node, and qc_1.viewNumber = qc_2.viewNumber.

引理3 如果所有正常运行的副本已经锁定了,即lockedQC = precommitQC,那么至少f+1个副本投票了prepareQC,且prepareQC匹配lockedQC

Lemma 3. If a correct replica is locked such that lockedQC = precommitQC, then at least f+1 correct replicas voted for some prepareQC matching lockedQC.

证明 如果一些副本r锁定了precommitQC,那么prepareQC在prepare阶段一定获得了(n-f)个投票(算法2的第10行),其中至少f+1个是正常运行的副本。

Proof. Suppose replica r is locked on precommitQC. Then, (n - f) votes were cast for the matching prepareQC in the prepare phase (Line 10 of Algorithm 2), out of which at least f + 1 were from correct replicas.

定理4 在GST之后,存在一个时间界限T_f,满足如果所有副本在T_f期间仍然在视图v中,且视图v的leader节点是正确的,那么决定(decision)可以在T_f结束时到达。

Theorem 4. After GST, there exists a bounded time period T_f such that if all correct replicas remain in view v during T_f and the leader for viewv is correct, then a decision is reached.

证明 从一个新的视图开始,leader节点收集(n-f)个newView消息,并计算出他们的highQC,最后广播prepare消息。假设在所有副本中(包括leader本身),最高的锁定QC是lockedQC = precommitQC^*,通过引理3, 我们知道了,至少有f+1个正确的副本曾经向一个匹配precommitQCprepareQC投票,并且该值已经被附在NewView消息中,传送给leader节点。基于上述假设,所有正确的副本在该视图中处于同步状态(synchronized),且leader节点是无缺陷的。因此,所有正确的副本都会在准备阶段投票,由于在函数safeNode里面,算法1第27行的条件被满足了(即使节点的消息和副本的lockedQC.node冲突,即26行条件不满足)。然后,在leader为这个视图聚合出有效的prepareQC之后,所有副本将在接下来的所有阶段进行投票,从而产生一个新的决策。在GST之后,这些阶段完成的持续时间T_f是有界的。

Proof. Starting in a new view, the leader collects (n - f) new-view messages and calculates its highQC before broadcasting a prepare messsage. Suppose among all replicas (including the leader itself), the highest kept lock is lockedQC = precommitQC^∗ . By Lemma 3, we know there are at least f+1 correct replicas that voted for a prepareQC^∗ matching precommitQC^∗ , and have already sent them to the leader in their new-view messages. Thus, the leader must learn a matching prepareQC^∗ in at least one of these new-view messages and use it as highQC in its prepare message. By the assumption, all correct replicas are synchronized in their view and the leader is nonfaulty. Therefore, all correct replicas will vote in the prepare phase, since in safeNode, the condition on Line 27 of Algorithm 1 is satisfied (even if the node in the message conflicts with a replica’s stale lockedQC.node, and so Line 26 is not). Then, after the leader assembles a valid prepareQC for this view, all replicas will vote in all the following phases, leading to a new decision. After GST, the duration T_f for these phases to complete is of bounded length.

协议是乐观响应性的,因为没有严格的“等待\Delta时间”的步骤,并且使用三步骤的逻辑,让我们可以通过safeNode的判断覆盖过时的锁。

The protocol is Optimistically Responsive because there is no explicit “wait-for-∆” step, and the logical disjunction in safeNode is used to override a stale lock with the help of the three-phase paradigm.

我们现在为leader和nextView提供简单的构造,以确保GST之后,最终将到达一个视图,在该视图中leader节点是正确的,并且所有正确的副本将在此视图中保留T_f的时间。对于leader来说,一个特定的从视图号到副本的映射就已经足够了,最终leader的选择在所有副本之间进行旋转。nextView的一个可能的解决方案是利用一种exponential back-off机制来保持超时时间间隔。然后在进入每个视图时设置一个计时器。当计时器在没有做出任何决定的情况下结束,副本将间隔加倍,并调用nextView来推进视图。由于每次间隔都是原来的两倍,所有正确副本的等待间隔最终都会有至少T_f的共同重叠,在此期间,leader可以驱动决策。

We now provide simple constructions for leader and nextView that suffice to ensure that after GST, eventually a view will be reached in which the leader is correct and all correct replicas remain in this view for T_f time. It suffices for leader to return some deterministic mapping from view number to a replica, eventually rotating through all replicas. A possible solution for nextView is to utilize an exponential back-off mechanism that maintains a timeout interval. Then a timer is set upon entering each view. When the timer goes off without making any decision, the replica doubles the interval and calls nextView to advance the view. Since the interval is doubled at each time, the waiting intervals of all correct replicas will eventually have at least T_f overlap in common, during which the leader could drive a decision.

HotStuff两阶段变种的活跃性 现在,我们将展示一个“两阶段”HotStuff的无限非决定性场景。这解释了在Casper和Tendermint中引入同步延迟的必要性,并因此放弃(乐观的)响应。

在HotStuff的两阶段变种中,我们删除pre-commit阶段,直接到commit阶段。一个副本在投票prepareQC的时候即锁定。假设在视图v,一个leader发起提议b。当完成prepare阶段时,一些副本r_v向prepareQC投票,得到qc,满足qc.node=b。因此r_v都会锁定qc,一个异步的网络调度使得其余的副本在没有接收qc的情况下直接进入下一个视图v+1

现在,我们重复无穷多次这个单视图脚本。最开始,我们在视图v+1中,副本r_v持有系统中最高的prepareQC(如qc)。新的leaderl从除了r_v之外的2f+1个副本中收集了NewView消息。在这些节点中最高的prepareQCqc',对应了视图v-1b'=qc'.node,与b矛盾。l进一步提议基于b'的提案b'',对于这个来说,2f个诚实的副本提供了投票。但是副本r_v拒绝了这个提案,因为他已经被锁定在了qc上面。b''b是冲突的,qc'qc要低,最终,2f个副本放弃向下一个视图的转换。就在这个时候,一个崩溃的副本回应了l的提议,l合并了一个prepareQC(v+1, b''),并且一个副本r_{v+1}表示会投票给这个提议,并且锁定了他。

Livelessness with two-phases. We now briefly demonstrate an infinite non-deciding scenario for a “two-phase” HotStuff. This explains the necessity for introducing a synchronous delay in Casper and Tendermint, and hence for abandoning (Optimistic) Responsiveness.

In the two-phase HotStuff variant, we omit the pre-commit phase and proceed directly to commit. A replica becomes locked when it votes on a prepareQC. Suppose, in view v, a leader proposes b. It completes the prepare phase, and some replica r_v votes for the prepareQC, say qc, such that qc.node = b. Hence, r_v becomes locked on qc. An asynchronous network scheduling causes the rest of the replicas to move to view v + 1 without receiving qc.

We now repeat ad infinitum the following single-view transcript. We start view v + 1 with only r_v holding the highest prepareQC (i.e. qc) in the system. The new leader l collects new-view messages from 2f + 1 replicas excluding r_v . The highest prepareQC among these, qc' , has view v-1 and b' = qc'.node conflicts with b. l then proposes b'' which extends b' , to which 2f honest replicas respond with a vote, but r_v rejects it because it is locked on qc, b'' conflicts with b and qc' is lower than qc. Eventaully, 2f replicas give up and move to the next view. Just then, a faulty replica responds to l’s proposal, l then puts together a prepareQC(v+1, b'') and one replica, say r_{v+1} votes for it and becomes locked on it.

复杂度 在HotStuff的每个阶段,只有leader向所有副本广播,而副本则用部分签名向发送者回复一次以完成投票。在leader的消息中,仲裁证书由先前收集到的(n-f)张选票组成,这些选票可以通过一个阈值签名进行编码。在副本的响应中,来自该副本的部分签名是唯一的身份验证程序。因此,在每个阶段中,总共接收到O(n)个验证器(authenticators)。由于有固定数量的阶段,每个视图的总体复杂性为O(n)

Complexity. In each phase of HotStuff, only the leader broadcasts to all replicas while the replicas respond to the sender once with a partial signature to certify the vote. In the leader’s message, the QC consists of a proof of (n-f) votes collected previously, which can be encoded by a single threshold signature. In a replica’s response, the partial signature from that replica is the only authenticator. Therefore, in each phase, there are O(n) authenticators received in total. As there is a constant number of phases, the overall complexity per view is O(n).