【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Conclusion

9. 结论

自从引入PBFT(部分同步模型中第一个实用的BFT复制解决方案)以来,研究者们围绕其核心的两阶段范式构建了许多BFT解决方案。第一阶段通过仲裁证书(QC)保证提案的唯一性;第二阶段保证新领导者可以说服各副本为安全的提案投票。这要求领导者转发来自n-f个副本的信息,其中每个副本都报告自己最高的QC或投票。因此,历代两阶段方案都面临着更换领导者时的二次方通信瓶颈。

Since the introduction of PBFT, the first practical BFT replication solution in the partial synchrony model, numerous BFT solutions were built around its core two-phase paradigm. The first phase guarantees proposal uniqueness through a QC. The second phase guarantees that a new leader can convince replicas to vote for a safe proposal. This requires the leader to relay information from (n−f) replicas, each reporting its own highest QC or vote. Generations of two-phase works thus suffer from a quadratic communication bottleneck on leader replacement.

HotStuff的核心是三阶段的共识,使新领导者可以轻松选择已知的最高仲裁证书。 这减轻了上述通信复杂性,同时大大简化了领导者更换协议。 由于各个阶段基本同质化,所以很容易通过流水线的方式优化HotStuff,并支持频繁的领导者更迭。

HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. This alleviates the above complexity and at the same time considerably simplifies the leader replacement protocol. Having (almost) canonized the phases, it is very easy to pipeline HotStuff, and to frequently rotate leaders.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Evaluation

8. 评估

我们已将HotStuff实现为一个大约4K行C++代码的库。最值得注意的是,伪代码中给出的核心共识逻辑只占其中约200行。在本节中,我们首先通过与当前最先进的系统BFT-SMaRt [13]进行比较,考察基准吞吐量和延迟;然后重点观察视图更换(view change)时的消息开销,以了解HotStuff在这种场景下的优势。

We have implemented HotStuff as a library in roughly 4K lines of C++ code. Most noticeably, the core consensus logic specified in the pseudocode consumes only around 200 lines. In this section, we will first examine baseline throughput and latency by comparing to a state-of-art system, BFT-SMaRt [13]. We then focus on the message cost for view changes to see our advantages in this scenario.

8.1 设置

我们使用c5.4xlarge实例在Amazon EC2上进行实验。每个实例拥有由Intel Xeon Platinum 8000系列处理器支持的16个vCPU,所有核心的Turbo时钟频率最高可达3.4GHz。我们将每个副本运行在单独的VM实例上,因此大量使用线程的BFT-SMaRt可以像其原始评估[13]中那样,每个副本使用16个核心。iperf测得的最大TCP带宽约为每秒1.2 GB,我们在任何测试中都没有限制带宽。两台机器之间的网络延迟小于1毫秒。

在我们的HotStuff原型实现中,投票和仲裁证书中的所有数字签名均使用secp256k1算法。BFT-SMaRt在正常消息上使用hmac-sha1生成消息验证码(MAC),并在视图更换期间额外使用数字签名。

HotStuff的所有结果均为从客户端进行的端到端测量。对于BFT-SMaRt,我们使用了BFT-SMaRt网站(https://github.com/bft-smart/library)提供的微基准程序ThroughputLatencyServer和ThroughputLatencyClient。客户端程序测量端到端延迟但不测量吞吐量,服务器端程序则同时测量吞吐量和延迟。最终我们采用服务器端的吞吐量结果和客户端的延迟结果。

We conducted our experiments on Amazon EC2 using c5.4xlarge instances. Each instance had 16 vCPUs supported by Intel Xeon Platinum 8000 processors. All cores sustained a Turbo CPU clock speed up to 3.4GHz. We ran each replica on a single VM instance, and so BFT-SMaRt, which makes heavy use of threads, was allowed to utilize 16 cores per replica, as in their original evaluation [13]. The maximum TCP bandwidth measured by iperf was around 1.2 Gigabytes per second. We did not throttle the bandwidth in any run. The network latency between two machines was less than 1 ms.

Our prototype implementation of HotStuff uses secp256k1 for all digital signatures in both votes and quorum certificates. BFT-SMaRt uses hmac-sha1 for MACs (Message Authentication Codes) in the messages during normal operation and uses digital signatures in addition to MACs during a view change.

All results for HotStuff reflect end-to-end measurement from the clients. For BFT-SMaRt, we used the microbenchmark programs ThroughputLatencyServer and ThroughputLatencyClient from the BFT-SMaRt website (https://github.com/bft-smart/library). The client program measures end-to-end latency but not throughput, while the server-side program measures both throughput and latency. We used the throughput results from servers and the latency results from clients.

8.2 基准

我们首先在评估其他BFT复制系统时常见的设置下测量吞吐量和延迟。我们以允许单个故障(即f = 1)的配置运行了4个副本,并不断提高操作请求速率直到系统饱和。该基准测试使用空的(零大小)操作请求和响应,并且不触发任何视图更换;其他设置将在下文逐一扩展。尽管具有响应性的HotStuff是三阶段的,但由于BFT-SMaRt基线只有两个阶段,我们也将HotStuff的两阶段变体作为额外的基线运行。

图4描绘了两个系统在三个批处理大小(batch size,分别为100、400和800)下的表现。由于两个系统对batch size的定义不同,这些数字对各自的意义也略有不同:BFT-SMaRt为每个操作驱动一个单独的共识决策,并对来自多个共识实例的消息进行批处理,因此它呈现出典型的L形延迟/吞吐量曲线;HotStuff则把多个操作打包进同一个节点,以此摊薄每个决策的数字签名开销。但是,当每批超过400个操作时,批处理带来的等待时间会超过密码学计算的开销。尽管存在这些差异,三阶段(HS3-*)和两阶段(HS2-*)HotStuff在所有三个批大小上都实现了与BFT-SMaRt(BS-*)相当的延迟,同时最大吞吐量明显优于BFT-SMaRt。

图4 不同batch size下的吞吐量与延迟对比,4副本,0/0载荷

Figure 4 depicts three batch sizes for both systems, 100, 400, and 800, though because these systems have different batching schemes, these numbers mean slightly different things for each system. BFT-SMaRt drives a separate consensus decision for each operation, and batches the messages from multiple consensus protocols. Therefore, it has a typical L-shaped latency/throughput performance curve. HotStuff batches multiple operations in each node, and in this way, mitigates the cost of digital signatures per decision. However, above 400 operations per batch, the latency incurred by batching becomes higher than the cost of the crypto. Despite these differences, both three-phase (“HS3-”) and two-phase (“HS2-”) HotStuff achieve comparable latency performance to BFT-SMaRt (“BS-”) for all three batch sizes, while their maximum throughput noticeably outperformed BFT-SMaRt.

对于batch size为100和400的场景,HotStuff延迟最低的那个数据点所提供的延迟和吞吐量,优于BFT-SMaRt在其最高吞吐量下能够同时达到的延迟和吞吐量,代价是延迟略有增加。这一增加部分源于HotStuff的批处理策略:一个批次的决策需要再经过三个完整批次(两阶段变体则需要两个)才能完成。我们的实验保持了较多的未完成请求,但批越大,填满批处理流水线所需的时间就越长。实际部署中可以进一步优化,让批大小自适应于未完成操作的数量。

For batch sizes of 100 and 400, the lowest-latency HotStuff point provides latency and throughput that are better than the latency and throughput simultaneously achievable by BFT-SMaRT at its highest throughput, while incurring a small increase in latency. This increase is partly due to the batching strategy employed by HotStuff: It needs three additional full batches (two in the two-phase variant) to arrive at a decision on a batch. Our experiments kept the number of outstanding requests high, but the higher the batch size, the longer it takes to fill the batching pipeline. Practical deployments could be further optimized to adapt the batch size to the number of outstanding operations.

图5描绘了0/0、128/128和1024/1024的三种客户端请求/答复有效载荷大小(以字节为单位),分别表示为”p0″,”p128″和”p1024″。在所有有效负载大小下,三阶段和两阶段HotStuff的吞吐量在性能上均优于BFT-SMaRt,且延迟时间相似或相当。

图5 不同载荷大小下的吞吐量与延迟对比,4副本,batch size为400

Figure 5 depicts three client request/reply payload sizes (in bytes) of 0/0, 128/128, and 1024/1024, denoted “p0”, “p128”, and “p1024” respectively. At all payload sizes, both three-phase and two-phase HotStuff outperformed BFT-SMaRt in throughput, with similar or comparable latency.

注意,BFT-SMaRt使用基于对称密码的MAC,其速度比HotStuff所用数字签名中的非对称密码快几个数量级,而且与BFT-SMaRt使用的两阶段PBFT变体相比,三阶段HotStuff需要更多的通信往返。即便如此,HotStuff仍然实现了相当的延迟和高得多的吞吐量。下面我们将在更具挑战性的场景中评估这两个系统,HotStuff的性能优势在这些场景中会更加明显。

Notice BFT-SMaRt uses MACs based on symmetric crypto that is orders of magnitude faster than the asymmetric crypto in digital signatures used by HotStuff, and also three-phase HotStuff has more round trips compared to two-phase PBFT variant used by BFT-SMaRt. Yet HotStuff is still able to achieve comparable latency and much higher throughput. Below we evaluate both systems in more challenging situations, where the performance advantages of HotStuff will become more pronounced.

8.3 扩展性

为了评估HotStuff在各个维度上的可扩展性,我们进行了三组实验。基准实验使用零大小的请求/响应载荷,同时改变副本数量;第二组实验用128字节和1024字节的请求/响应载荷重复基准实验;第三组实验在载荷为空的基准设置上,为副本之间引入均匀分布在5ms±0.5ms或10ms±1.0ms的网络延迟(使用NetEm实现,参见 https://www.linux.org/docs/man8/tc-netem.html )。对于每个数据点,我们在相同设置下重复运行五次,并用误差条表示所有运行的标准差。

To evaluate the scalability of HotStuff in various dimensions, we performed three experiments. For the baseline, we used zero-size request/response payloads while varying the number of replicas. The second evaluation repeated the baseline experiment with 128-byte and 1024-byte request/response payloads. The third test repeated the baseline (with empty payloads) while introducing network delays between replicas that were uniformly distributed in 5ms ± 0.5ms or in 10ms ± 1.0ms, implemented using NetEm (see https://www.linux.org/docs/man8/tc-netem.html). For each data point, we repeated five runs with the same setting and show error bars to indicate the standard deviation for all runs.

第一组设置如图6a(吞吐量)和图6b(延迟)所示。三阶段和两阶段HotStuff始终表现出比BFT-SMaRt更好的吞吐量,而它们的延迟仍然与BFT-SMaRt相当,并随副本数增加而平缓退化。当n < 32时,其性能扩展性好于BFT-SMaRt。这是因为我们目前仍将secp256k1签名列表用作QC。将来,我们计划通过使用快速的阈值签名方案来减少HotStuff中的密码学计算开销。

图6 0/0载荷、batch size为400时的可扩展性

The first setting is depicted in Figure 6a (throughput) and Figure 6b (latency). Both three-phase and two-phase HotStuff show consistently better throughput than BFT-SMaRt, while their latencies are still comparable to BFT-SMaRt with graceful degradation. The performance scales better than BFT-SMaRt when n < 32. This is because we currently still use a list of secp256k1 signatures for a QC. In the future, we plan to reduce the cryptographic computation overhead in HotStuff by using a fast threshold signature scheme.

第二组设置使用128或1024字节的载荷,在图7a(吞吐量)和图7b(延迟)中分别以“p128”和“p1024”表示。由于其二次方的带宽开销,在较大的载荷(1024字节)下,BFT-SMaRt的吞吐量扩展性比HotStuff差。

图7 128/128或1024/1024载荷、batch size为400时的可扩展性

The second setting with payload size 128 or 1024 bytes is denoted by “p128” or “p1024” in Figure 7a (throughput) and Figure 7b (latency). Due to its quadratic bandwidth cost, the throughput of BFT-SMaRt scales worse than HotStuff for reasonably large (1024-byte) payload size.

第三组设置如图8a(吞吐量)和图8b(延迟)所示,分别以“5ms”和“10ms”表示注入的传输延迟。同样,由于BFT-SMaRt的通信量更大,HotStuff在这两种情况下都始终优于BFT-SMaRt。

图8 0/0载荷、副本间传输延迟为5ms或10ms、batch size为400时的可扩展性

The third setting is shown in Figure 8a (throughput) and Figure 8b (latency) as “5ms” or “10ms”. Again, due to the larger use of communication in BFT-SMaRt, HotStuff consistently outperformed BFT-SMaRt in both cases.

8.4 视图更换(View Change)

为了评估leader更换的通信复杂度,我们统计了BFT-SMaRt视图更换协议中执行的MAC或签名验证次数。我们的评估策略如下:每隔一千个决策,向BFT-SMaRt注入一次视图更换;我们对BFT-SMaRt源代码进行了插桩,以统计视图更换协议中接收和处理消息时的验证次数。除了反映通信复杂度之外,这一测量还凸显了传输这些经过认证的值所带来的密码学计算负担。

To evaluate the communication complexity of leader replacement, we counted the number of MAC or signature verifications performed within BFT-SMaRt’s view-change protocol. Our evaluation strategy was as follows. We injected a view change into BFT-SMaRt every one thousand decisions. We instrumented the BFT-SMaRt source code to count the number of verifications upon receiving and processing messages within the view-change protocol. Beyond communication complexity, this measurement underscores the cryptographic computation load associated with transferring these authenticated values.

图9a和图9b分别显示了每次视图更换所处理的额外身份验证器(分别为MAC和签名)的数量,其中“额外”定义为如果领导者保持稳定就不会发送的那些身份验证器。请注意,按照这个定义,HotStuff没有“额外”的身份验证器,因为无论领导者是否更换,身份验证器的数量都保持不变。这两幅图表明,BFT-SMaRt使用了三次方数量级的MAC和二次方数量级的签名。HotStuff的视图更换不需要额外的身份验证器,因此在图中被省略。

图9 视图更换时额外处理的身份验证器数量

Figure 9a and Figure 9b show the number of extra authenticators (MACs and signatures, respectively) processed for each view change, where “extra” is defined to be those authenticators that would not be sent if the leader remained stable. Note that HotStuff has no “extra” authenticators by this definition, since the number of authenticators remains the same regardless of whether the leader stays the same or not. The two figures show that BFT-SMaRt uses cubic numbers of MACs and quadratic numbers of signatures. HotStuff does not require extra authenticators for view changes and so is omitted from the graph.

评估leader更换的实时性能并不容易。首先,BFT-SMaRt在频繁触发视图更换时会卡死:我们的身份验证器计数基准必须在系统卡死之前对尽可能多的成功视图更换取平均,并多次重复实验。其次,leader更换的实际耗时在很大程度上取决于超时参数和leader选举机制,因此无法给出有意义的比较。

Evaluating the real-time performance of leader replacement is tricky. First, BFT-SMaRt got stuck when triggering frequent view changes; our authenticator-counting benchmark had to average over as many successful view changes as possible before the system got stuck, repeating the experiment many times. Second, the actual elapsed time for leader replacement depends highly on timeout parameters and the leader-election mechanism. It is therefore impossible to provide a meaningful comparison.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – One-Chain and Two-Chain BFT Protocols

7. One-Chain/Two-Chain的BFT协议

在本节中,我们考察四个横跨拜占庭容错研究四十年历程的BFT复制协议,并把它们放入一个与Chained HotStuff类似的基于链的框架中加以分析。

图3提供了我们考虑的包括HotStuff在内的五个协议的提交规则的鸟瞰图。

简而言之,DLS的提交规则是单链(One-Chain),只允许节点由其自己的领导者提交。PBFT、Tendermint和Casper中的提交规则几乎相同,由双链(Two-Chain)构成。它们的区别在于为保证活跃性而引入的机制:PBFT的leader“证明”具有二次方的规模(不满足线性性),Tendermint和Casper则在每次leader提议之前引入强制性的Δ延迟(不满足乐观响应性)。HotStuff使用三链(Three-Chain)规则,并拥有无需延迟的线性leader协议。

图3 不同BFT协议的提交规则

In this section, we examine four BFT replication protocols spanning four decades of research in Byzantine fault tolerance, casting them into a chained framework similar to Chained HotStuff.

Figure 3 provides a birds-eye view of the commit rules of five protocols we consider, including HotStuff.

In a nutshell, the commit rule in DLS [25] is One-Chain, allowing a node to be committed only by its own leader. The commit rules in PBFT [20], Tendermint [15, 16] and Casper [17] are almost identical, and consist of Two-Chains. They differ in the mechanisms they introduce for liveness, PBFT has leader “proofs” of quadratic size (no Linearity), Tendermint and Casper introduce a mandatory ∆ delay before each leader proposal (no Optimistic Responsiveness). HotStuff uses a Three-Chain rule, and has a linear leader protocol without delay.

7.1 DLS

最简单的提交规则是单链(One-Chain)。该规则以第一个已知的异步拜占庭共识解决方案(即Dwork、Lynch和Stockmeyer提出的DLS)为蓝本,如图3(a)所示。在DLS中,副本会锁定在其投过票的最高节点上。

不幸的是,如果某个高度的领导者发出相互矛盾的提议(模棱两可),而两个正确的副本分别锁定在该高度的两个冲突提议上,此规则就很容易导致死锁。除非有2f+1个副本表明它们没有为被锁定的值投过票,否则放弃任何一个锁都是不安全的。

实际上,在DLS中,只有每个高度的leader自己才能通过单链提交规则达成提交决定。因此,如果领导者发出矛盾的提议,受损的也只有领导者自己。如果有2f+1个副本没有为某个锁投过票,或者存在(由领导者签名的)冲突提议,副本就可以放弃该锁。DLS中出现在每个高度末尾的解锁协议相当复杂且昂贵。再加上只有该高度的leader才能做出决定这一事实,即使在没有故障且网络及时的最佳情况下,DLS每做出一个决定也需要n次领导者轮换和O(n^4)次消息传输。虽然DLS在演示安全的异步协议方面开辟了新天地,但它并非被设计为实用的解决方案。

The simplest commit rule is a One-Chain. Modeled after Dwork, Lynch, and Stockmeyer (DLS), the first known asynchronous Byzantine Consensus solution, this rule is depicted in Figure 3(a). A replica becomes locked in DLS on the highest node it voted for.

Unfortunately, this rule would easily lead to a deadlock if at some height, a leader equivocates, and two correct replicas became locked on the conflicting proposals at that height. Relinquishing either lock is unsafe unless there are 2f + 1 that indicate they did not vote for the locked value.

Indeed, in DLS only the leader of each height can itself reach a commit decision by the One-Chain commit rule. Thus, only the leader itself is harmed if it has equivocated. Replicas can relinquish a lock either if 2f +1 replicas did not vote for it, or if there are conflicting proposals (signed by the leader). The unlocking protocol occurring at the end of each height in DLS turns out to be fairly complex and expensive. Together with the fact that only the leader for a height can decide, in the best scenario where no fault occurs and the network is timely, DLS requires n leader rotations, and O(n^4) message transmissions, per single decision. While it broke new ground in demonstrating a safe asynchronous protocol, DLS was not designed as a practical solution.

7.2 PBFT

以PBFT为蓝本的更实用的方法使用双链(Two-Chain)提交规则,见图3(b)。当副本为构成单链的节点投票时,它就锁定在该节点上。同一高度上不可能出现相互冲突的单链,因为每条单链都带有一个QC,因此避免了DLS中的死锁情况。

但是,如果某个副本持有比其他副本更高的锁,那么即使领导者从n-f个副本收集信息,也可能对此一无所知。这可能导致领导者纯粹由于调度原因而永远无法做出决定。为了“解除卡住”的状态,PBFT让领导者携带一个由2f+1个副本各自最高单链组成的证明来解锁所有副本。如下所述,该证明相当复杂。

在已开源[20]并被多项后续工作[13, 34]采用的原始PBFT中,领导者证明包含从n-f个副本收集的一组消息,每条消息报告该成员投票支持过的最高单链。每条单链都包含一个QC,因此总通信成本为O(n^3)。SBFT[30]利用[45, 18]中的签名合并方法,把每个QC压缩为单个值,从而将该成本降低到O(n^2)。

在[21]中的PBFT变体中,领导者证明只包含一次领导者从仲裁集合中收集到的最高单链,此外还包括来自仲裁集合每个成员的一个签名值,证明该成员没有为更高的单链投过票。广播该证明导致O(n^2)的通信复杂度。请注意,尽管QC上的签名可以合并为单个值,但由于来自仲裁集合不同成员的消息可能具有不同的值,整个证明无法被压缩到常数大小。

在这两种变体中,即使正确副本自己的单链比领导者证明中的更高,它也会解锁。因此,正确的领导者可以在同步期间强制其提议被接受,从而保证了活跃性。代价是每次更换领导者都要付出二次方的通信开销。

Modeled after PBFT, a more practical approach uses a Two-Chain commit rule, see Figure 3(b). When a replica votes for a node that forms a One-Chain, it becomes locked on it. Conflicting One-Chains at the same height are simply not possible, as each has a QC, hence the deadlock situation of DLS is avoided.

However, if one replica holds a higher lock than others, a leader may not know about it even if it collects information from n − f replicas. This could prevent leaders from reaching decisions ad infinitum, purely due to scheduling. To get “unstuck”, PBFT unlocks all replicas by carrying a proof consisting of the highest One-Chains of 2f + 1 replicas. This proof is quite involved, as explained below.

In the original PBFT, which has been open-sourced [20] and adopted in several follow-up works [13, 34], a leader proof contains a set of messages collected from n − f replicas reporting the highest One-Chain each member voted for. Each One-Chain contains a QC, hence the total communication cost is O(n^3). Harnessing signature combining methods from [45, 18], SBFT [30] reduces this cost to O(n^2) by turning each QC into a single value.

In the PBFT variant in [21], a leader proof contains the highest One-Chain the leader collected from the quorum only once. It also includes one signed value from each member of the quorum, proving that it did not vote for a higher One-Chain. Broadcasting this proof incurs communication complexity O(n^2). Note that whereas the signatures on a QC may be combined into a single value, the proof as a whole cannot be reduced to constant size because messages from different members of the quorum may have different values.

In both variants, a correct replica unlocks even if it has a higher One-Chain than the leader’s proof. Thus, a correct leader can force its proposal to be accepted during a period of synchrony, and liveness is guaranteed. The cost is quadratic communication per leader replacement.

7.3 Tendermint 和 Casper

Tendermint具有与PBFT相同的双链(Two-Chain)提交规则,而Casper的双链规则中,叶节点不需要带有指向其直接父节点的QC;也就是说,在Casper中,链中的链接不必是直接的。图3(c,d)分别描绘了Tendermint和Casper的提交规则。

在这两种方法中,领导者都只是随其提议一起发送它所知道的最高单链。如果副本从领导者那里收到一条更高的单链,它就会解除当前的锁。

但是,由于正确的副本可能无法投票给领导者的节点,因此要保证进度,新领导者必须通过等待最大的网络延迟来获得最高的单链。否则,如果领导者仅等待第一个n-f条消息开始新的高度,则无法保证进度。领导者的延迟在Tendermint和Casper中都是固有的,以便提供活跃性。

这个简单的领导者协议体现了领导者协议通信复杂度向线性的飞跃,HotStuff正是借鉴了这一点。如前所述,QC可以用阈值签名表示为单个值,因此领导者能够以线性的通信复杂度收集并分发最高的单链。但至关重要的是,得益于额外的QC步骤,HotStuff不需要领导者等待最大网络延迟。

Tendermint has a Two-Chain commit rule identical to PBFT, and Casper has a Two-Chain rule in which the leaf does not need to have a QC to its direct parent; that is, in Casper, the links in the chain need not be direct. Figure 3(c,d) depicts the commit rules for Tendermint and Casper, respectively.

In both methods, a leader simply sends the highest One-Chain it knows along with its proposal. A replica unlocks a One-Chain if it receives from the leader a higher one.

However, because correct replicas may not vote for a leader’s node, to guarantee progress a new leader must obtain the highest One-Chain by waiting the maximal network delay. Otherwise, if leaders only wait for the first n − f messages to start a new height, there is no progress guarantee. Leader delays are inherent both in Tendermint and in Casper, in order to provide liveness.

This simple leader protocol embodies a linear leap in the communication complexity of the leader protocol, which HotStuff borrows from. As already mentioned above, a QC could be captured in a single value using threshold signatures, hence a leader can collect and disseminate the highest One-Chain with linear communication complexity. However, crucially, due to the extra QC step, HotStuff does not require the leader to wait the maximal network delay.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Implementation

6. 实现

HotStuff是用于构建高效SMR系统的实用协议。得益于其简洁性,我们可以轻松地把算法3改写成事件驱动风格的描述,这种形式几乎就是原型实现的代码骨架。

HotStuff is a practical protocol for building efficient SMR systems. Because of its simplicity, we can easily turn Algorithm 3 into an event-driven-style specification that is almost like the code skeleton for a prototype implementation.

如算法4所示,通过把活跃性机制从协议主体中抽取出来、放入一个名为Pacemaker的模块,代码得到了进一步的简化和泛化。下一个leader不再总是需要在generic阶段末尾等到genericQC之后才开始它的“统治”,这部分逻辑被委托给了Pacemaker:稳定的leader可以跳过这一步,并跨多个高度流水线化地发出提案。此外,在维护最高genericQC和lockedQC时,我们放宽了直接父节点的约束,但仍保留有效节点中的QC必须引用其祖先这一要求。正确性证明与Chained HotStuff类似,我们同样将其推迟到[50]的附录中。

As shown in Algorithm 4, the code is further simplified and generalized by extracting the liveness mechanism from the body into a module named Pacemaker. Instead of the next leader always waiting for a genericQC at the end of the generic phase before starting its reign, this logic is delegated to the Pacemaker. A stable leader can skip this step and streamline proposals across multiple heights. Additionally, we relax the direct parent constraint for maintaining the highest genericQC and lockedQC, while still preserving the requirement that the QC in a valid node always refers to its ancestor. The proof of correctness is similar to Chained HotStuff and we also defer it to the appendix of [50].

数据结构 每个副本u都维护以下核心状态变量:

  • V[\cdot] 将节点映射到投票
  • vheight 最近投票节点的高度
  • b_{lock} 被锁定的节点(和lockedQC相似)
  • b_{exec} 最后执行的节点
  • qc_{high} 由Pacemaker维护的最高的已知QC(与genericQC相似)
  • b_{leaf} 由Pacemaker维护的叶节点

此外,副本还保存一个常量b_0,即所有正确副本公认的同一个创世节点,用于系统的启动。b_0包含一个硬编码的、指向自身的QC;b_{lock}、b_{exec}、b_{leaf}都初始化为b_0,而qc_{high}则包含b_0的QC。

Data structures. Each replica u keeps track of the following main state variables:

  • V[\cdot] mapping from a node to its votes.
  • vheight height of last voted node.
  • b_{lock} locked node (similar to lockedQC).
  • b_{exec} last executed node.
  • qc_{high} highest known QC (similar to genericQC) kept by a Pacemaker.
  • b_{leaf} leaf node kept by a Pacemaker.

It also keeps a constant b_0 , the same genesis node known by all correct replicas. To bootstrap, b_0 contains a hardcoded QC for itself, b_{lock} , b_{exec} , b_{leaf} are all initialized to b_0 , and qc_{high} contains the QC for b_0 .
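
下面给出一个示意性的C++草图,展示上述状态变量的一种可能组织方式。其中的类型定义与字段命名只是为了说明而做的假设,并非论文原型的真实代码:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

struct Node;
struct QC {
    std::string type;                 // 阶段类型(Chained 版本中统一为 generic)
    uint64_t viewNumber = 0;
    std::shared_ptr<Node> node;       // QC 所认证的节点
    std::vector<uint8_t> sig;         // 合并后的阈值签名
};
struct Node {
    std::vector<uint8_t> cmd;         // 客户端命令(或其批量)
    std::shared_ptr<Node> parent;     // 父节点链接
    std::shared_ptr<QC>   justify;    // 节点中携带的 QC
    uint64_t height = 0;
};

struct ReplicaState {
    std::map<const Node*, std::set<int>> V;   // V[.]:节点 -> 投票者集合
    uint64_t vheight = 0;                     // 最近一次投票的节点高度
    std::shared_ptr<Node> b_lock;             // 被锁定的节点(类似 lockedQC)
    std::shared_ptr<Node> b_exec;             // 最后执行的节点
    std::shared_ptr<QC>   qc_high;            // Pacemaker 维护的最高已知 QC(类似 genericQC)
    std::shared_ptr<Node> b_leaf;             // Pacemaker 维护的叶节点

    // 启动:b_lock、b_exec、b_leaf 都初始化为创世节点 b0,qc_high 为 b0 自带的硬编码 QC
    void bootstrap(const std::shared_ptr<Node>& b0) {
        b_lock = b_exec = b_leaf = b0;
        qc_high = b0->justify;
        vheight = b0->height;
    }
};
```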

起搏器(Pacemaker) Pacemaker是保证GST之后能够取得进展的机制,它通过以下两个要素来实现。

第一个要素是“同步”,即让所有正确的副本以及一个唯一的leader在足够长的时间内进入同一个高度(视图)。文献[25, 20, 15]中常用的同步机制,是让副本在更高的高度上成倍增加其等待的Δ数量,直到取得进展为止。确定性地选取leader的常见做法是轮换leader方案:所有正确的副本保存同一份预先确定的leader调度表,并在当前leader被撤换时轮换到调度表中的下一位。

Pacemaker. A Pacemaker is a mechanism that guarantees progress after GST. It achieves this through two ingredients.

The first one is “synchronization”, bringing all correct replicas, and a unique leader, into a common height for a sufficiently long period. The usual synchronization mechanism in the literature [25, 20, 15] is for replicas to increase the count of ∆’s they spend at larger heights, until progress is being made. A common way to deterministically elect a leader is to use a rotating leader scheme in which all correct replicas keep a predefined leader schedule and rotate to the next one when the leader is demoted.

其次,Pacemaker需要为leader提供一种选择提案的方式,以保证所选提案能得到正确副本的支持。如算法5所示,在视图更换之后,新leader在onReceiveNewView中收集各副本通过onNextSyncView发送的new-view消息,从中发现最高的QC,以满足onReceiveProposal中活跃性条件的第二部分(算法4第18行)。而在同一视图内,连任的leader会把新节点链接到自己最近提出的叶节点之后,此时不需要new-view消息。基于某些特定于应用的启发式规则(例如等到先前提出的节点获得QC之后),当前leader调用onBeat提出一个携带待执行命令的新节点。

Second, a Pacemaker needs to provide the leader with a way to choose a proposal that will be supported by correct replicas. As shown in Algorithm 5, after a view change, in onReceiveNewView, the new leader collects new-view messages sent by replicas through onNextSyncView to discover the highest QC to satisfy the second part of the condition in onReceiveProposal for liveness (Line 18 of Algorithm 4). During the same view, however, the incumbent leader will chain the new node to the end of the leaf last proposed by itself, where no new-view message is needed. Based on some application-specific heuristics (to wait until the previously proposed node gets a QC, for example), the current leader invokes onBeat to propose a new node carrying the command to be executed.
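
作为补充,下面是一个极简的Pacemaker接口草图(C++)。onNextSyncView、onReceiveNewView、onBeat等名字取自论文的算法5,但这里的类型与实现细节均为说明性的假设:

```cpp
#include <cstdint>
#include <vector>

struct QC { uint64_t viewNumber = 0; /* 其余字段略 */ };

class Pacemaker {
public:
    // 副本在视图超时后调用:把自己已知的最高 QC 发给下一任 leader(new-view 消息)
    virtual void onNextSyncView() = 0;

    // 新 leader 收集 new-view 消息,记录其中最高的 QC
    void onReceiveNewView(const QC& qc) {
        if (qc.viewNumber > qc_high_.viewNumber) qc_high_ = qc;
    }

    // 由应用层启发式触发(例如上一个提案拿到 QC 之后):提出一个携带命令的新节点
    virtual void onBeat(const std::vector<uint8_t>& cmd) = 0;

    const QC& highQC() const { return qc_high_; }
    virtual ~Pacemaker() = default;

private:
    QC qc_high_;   // Pacemaker 维护的最高已知 QC
};
```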

值得注意的是,即使一个糟糕的Pacemaker任意地调用onPropose、随意选择父节点和QC,并且无论调度延迟如何,安全性也始终得到保证。因此,仅由算法4保证的安全性与算法5的任何可能的活跃性实现是完全解耦的。

It is worth noting that even if a bad Pacemaker invokes onPropose arbitrarily, or selects a parent and a QC capriciously, and against any scheduling delays, safety is always guaranteed. Therefore, safety guaranteed by Algorithm 4 alone is entirely decoupled from liveness by any potential instantiation of Algorithm 5.

两阶段HotStuff变体 为了进一步说明HotStuff框架的灵活性,算法6给出了HotStuff的两阶段变体。只有update过程受到影响:做出提交决定需要Two-Chain,而One-Chain决定锁定。如前所述(第4.4节),这个两阶段变体失去了乐观响应性,变得与Tendermint/Casper类似。好处是阶段更少,而活跃性可以通过在Pacemaker中加入基于最大网络延迟的等待来解决。进一步的讨论见第7.3节。

Two-phase HotStuff variant. To further demonstrate the flexibility of the HotStuff framework, Algorithm 6 shows the two-phase variant of HotStuff. Only the update procedure is affected, a Two-Chain is required for reaching a commit decision, and a One-Chain determines the lock. As discussed above (Section 4.4), this two-phase variant loses Optimistic Responsiveness, and is similar to Tendermint/Casper. The benefit is fewer phases, while liveness may be addressed by incorporating in Pacemaker a wait based on maximum network delay. See Section 7.3 for further discussion.
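
为了直观对比,下面用C++勾勒两阶段变体中update的提交/锁定规则(对应算法6的思路):One-Chain即加锁,直接的Two-Chain即可提交。类型、字段与函数命名均为假设,且省略了祖先检查等细节:

```cpp
#include <cstdint>
#include <functional>
#include <memory>

struct Node;
struct QC { std::shared_ptr<Node> node; };                     // QC 所认证的节点
struct Node { std::shared_ptr<Node> parent; std::shared_ptr<QC> justify; uint64_t height = 0; };

void updateTwoPhase(std::shared_ptr<QC>& lockedQC,
                    const std::shared_ptr<Node>& b_star,
                    const std::function<void(const Node&)>& onCommit) {
    auto b1 = b_star->justify ? b_star->justify->node : nullptr;   // b'(One-Chain)
    if (!b1) return;
    lockedQC = b_star->justify;                                    // 投票形成 One-Chain 即锁定

    auto b = b1->justify ? b1->justify->node : nullptr;            // b(Two-Chain)
    if (!b) return;
    if (b_star->parent == b1 && b1->parent == b)                   // 直接的 Two-Chain
        onCommit(*b);                                              // 即可做出提交决定
}
```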

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Chained HotStuff

5. Chained HotStuff

在上一节介绍的Basic HotStuff中,leader提交一个提案需要三个阶段。这些阶段除了从副本收集投票之外并没有做其他“有用”的工作,而且彼此非常相似。在Chained HotStuff中,我们在提升Basic HotStuff协议效用的同时,也大幅简化了它。核心思想是在每个prepare阶段都更换视图,使每个提案都有自己的视图。这减少了消息类型的数量,并允许决策的流水线化。类似的减少消息类型的思路在Casper的文章中也有提及。

It takes three phases for a Basic HotStuff leader to commit a proposal. These phases are not doing “useful” work except collecting votes from replicas, and they are all very similar. In Chained HotStuff, we improve the Basic HotStuff protocol utility while at the same time considerably simplifying it. The idea is to change the view on every prepare phase, so each proposal has its own view. This reduces the number of message types and allows for pipelining of decisions. A similar approach for message type reduction was suggested in Casper [1].

更具体地说,在Chained HotStuff中,某个视图里prepare阶段的投票由该视图的leader收集成一个genericQC。随后,genericQC被转交给下一个视图的leader,实质上是把下一阶段(原本的pre-commit阶段)的职责委托给了下一任leader。然而,下一任leader实际上并不单独执行pre-commit阶段,而是发起一个新的prepare阶段,加入自己的提案。视图v+1的prepare阶段同时充当视图v的pre-commit阶段;视图v+2的prepare阶段则同时充当视图v+1的pre-commit阶段和视图v的commit阶段。这种流水线化之所以可行,是因为Basic HotStuff的所有阶段都具有相同的结构。

More specifically, in Chained HotStuff the votes over a prepare phase are collected in a view by the leader into a genericQC. Then the genericQC is relayed to the leader of the next view, essentially delegating responsibility for the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually carry a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare phase for view v + 1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v + 2 simultaneously serves as the pre-commit phase for view v + 1 and as the commit phase for view v. This is possible because all the phases have identical structure.

图1描绘了把Basic HotStuff协议各阶段嵌入Chained HotStuff提案链中形成流水线的示例。在Chained HotStuff中,视图v_1、v_2、v_3对在v_1中提出的cmd 1而言,分别充当Basic HotStuff的prepare、pre-commit、commit阶段,该命令在v_4结束时被提交。视图v_2、v_3、v_4则充当在v_2中提出的cmd 2的三个Basic HotStuff阶段,该命令在v_5结束时被提交。在这些阶段中产生的其他提案以类似方式继续流水线化,用虚线框表示。图1中,单箭头表示节点b的b.parent字段,双箭头表示b.justify.node。

The pipeline of Basic HotStuff protocol phases embedded in a chain of Chained HotStuff proposals is depicted in Figure 1. Views v_1 , v_2 , v_3 of Chained HotStuff serve as the prepare, pre-commit, and commit Basic HotStuff phases for cmd 1 proposed in v_1 . This command becomes committed by the end of v_4 . Views v_2 , v_3 , v_4 serve as the three Basic HotStuff phases for cmd 2 proposed in v_2 , and it becomes committed by the end of v_5 . Additional proposals generated in these phases continue the pipeline similarly, and are denoted by dashed boxes. In Figure 1, a single arrow denotes the b.parent field for a node b, and a double arrow denotes b.justify.node.

图1 Chained HotStuff是Basic HotStuff的流水线版本,QC可以同时处于多个阶段。

因此,Chained HotStuff实际上只有两种消息类型:new-view消息和generic阶段的generic消息。同一个generic QC在所有逻辑上流水线化的阶段中发挥作用。接下来我们解释流水线中处理锁定(lock)与提交(commit)的机制,它们只出现在Basic HotStuff的commit和decide阶段。

Hence, there are only two types of messages in Chained HotStuff, a new-view message and generic-phase generic message. The generic QC functions in all logically pipelined phases. We next explain the mechanisms in the pipeline to take care of locking and committing, which occur only in the commit and decide phases of Basic HotStuff.

虚拟节点 某个视图viewNumber的leader所使用的genericQC,可能不会直接引用前一个视图(viewNumber-1)的提案。原因是前一个视图的leader未能获得QC,要么是因为存在冲突的提案,要么是因为发生了良性崩溃。为了简化树结构,createLeaf用空白节点扩展genericQC.node,直到达到提议视图的高度(即节点分支上父链接的数量),从而使视图编号与节点高度相等。这样一来,嵌入在节点b中的QC可能不会引用b的父节点,即b.justify.node可能不等于b.parent(图2中的最后一个节点)。

Dummy nodes. The genericQC used by a leader in some view viewNumber may not directly reference the proposal of the preceding view (viewNumber −1). The reason is that the leader of a preceding view fails to obtain a QC, either because there are conflicting proposals, or due to a benign crash. To simplify the tree structure, createLeaf extends genericQC.node with blank nodes up to the height (the number of parent links on a node’s branch) of the proposing view, so view-numbers are equated with node heights. As a result, the QC embedded in a node b may not refer to its parent, i.e., b.justify.node may not equal b.parent (the last node in Figure 2).
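
下面是一个示意性的C++草图,展示createLeaf如何用空白(dummy)节点把genericQC.node垫高到提议视图的高度,使视图编号与节点高度相等。类型与字段名为说明性假设,并非实现的真实代码:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct Node {
    std::vector<uint8_t> cmd;          // 空 cmd 表示空白(dummy)节点
    std::shared_ptr<Node> parent;
    uint64_t height = 0;               // 节点高度 == 提议它的视图编号
};

std::shared_ptr<Node> createLeaf(std::shared_ptr<Node> parent,
                                 std::vector<uint8_t> cmd,
                                 uint64_t proposingView) {
    // 先补齐空白节点,直到父链高度达到 proposingView - 1
    while (parent->height + 1 < proposingView) {
        auto dummy = std::make_shared<Node>();
        dummy->parent = parent;
        dummy->height = parent->height + 1;
        parent = dummy;
    }
    // 再创建真正携带命令的叶节点,其高度等于提议视图
    auto leaf = std::make_shared<Node>();
    leaf->cmd = std::move(cmd);
    leaf->parent = parent;
    leaf->height = proposingView;
    return leaf;
}
```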

图2 节点v4,v5,v6组成了Three-Chain,节点v8没有成为一个合法的One-Chain。

一链(One-Chain)、二链(Two-Chain)和三链(Three-Chain) 当节点b^*携带的QC引用其直接父节点时,即b^∗.justify.node = b^∗.parent,我们称它形成了一个One-Chain。记b''=b^∗.justify.node。如果在形成One-Chain之外还满足b''.justify.node = b''.parent,那么节点b^∗形成Two-Chain。如果b''形成Two-Chain,那么b^*形成Three-Chain。

One-Chain, Two-Chain, and Three-Chain. When a node b^∗ carries a QC that refers to a direct parent, i.e., b^∗ .justify.node = b^∗ .parent, we say that it forms a One-Chain. Denote by b'' = b^∗ .justify.node. Node b^∗ forms a Two-Chain, if in addition to forming a One-Chain, b''.justify.node = b''.parent. It forms a Three-Chain, if b'' forms a Two-Chain.

沿着链b = b'.justify.node、b' = b''.justify.node、b'' = b^∗.justify.node来看,祖先空缺可能出现在其中任何一个节点处。这些情况对应于Basic HotStuff的leader未能完成三个阶段中的某一个,并被nextView中断而进入下一个视图。

Looking at chain b = b'.justify.node, b'= b''.justify.node, b'' = b^∗ .justify.node, ancestry gaps might occur at any one of the nodes. These situations are similar to a leader of Basic HotStuff failing to complete any one of three phases, and getting interrupted to the next view by nextView.

如果b^*形成One-Chain,则b''的prepare阶段已经成功。因此,当副本对b^∗投票时,它应该记录genericQC \leftarrow b^∗.justify。需要注意的是,即使One-Chain不是直接的,只要它高于当前的genericQC,更新genericQC也是安全的。在第6节描述的实现代码中,这种情况下我们确实会更新genericQC。

If b^∗ forms a One-Chain, the prepare phase of b'' has succeeded. Hence, when a replica votes for b^∗ , it should remember genericQC \leftarrow b^∗.justify. We remark that it is safe to update genericQC even when a One-Chain is not direct, so long as it is higher than the current genericQC. In the implementation code described in Section 6, we indeed update genericQC in this case.

如果b^*形成Two-Chain,则b'的pre-commit阶段已经成功。因此,副本应更新lockedQC \leftarrow b''.justify。同样需要注意,即使Two-Chain不是直接的,也可以更新该锁,安全性不会被破坏;实际上,第6节的实现代码中正是这样做的。

If b^* forms a Two-Chain, then the pre-commit phase of b' has succeeded. The replica should therefore update lockedQC \leftarrow b''.justify. Again, we remark that the lock can be updated even when a Two-Chain is not direct—safety will not break—and indeed, this is given in the implementation code in Section 6.

最后,如果b^*形成Three-Chain,则b的commit阶段成功,并且b成为已提交的决策。

Finally, if b^∗ forms a Three-Chain, the commit phase of b has succeeded, and b becomes a committed decision.
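
下面用一段示意性的C++代码勾勒上述三条规则合在一起的update逻辑:遇到One-Chain就更新genericQC,遇到Two-Chain就更新锁,只有三个链接都为直接父链接的Three-Chain才触发提交。高度比较等细节为说明性假设,并非论文算法3的逐行对应:

```cpp
#include <cstdint>
#include <functional>
#include <memory>

struct Node;
struct QC { std::shared_ptr<Node> node; };   // QC 所认证(引用)的节点
struct Node {
    std::shared_ptr<Node> parent;            // 父链接
    std::shared_ptr<QC>   justify;           // 节点携带的 QC
    uint64_t height = 0;
};

struct ChainState {
    std::shared_ptr<QC> genericQC;           // 最高已知 QC
    std::shared_ptr<QC> lockedQC;            // 锁
};

void update(ChainState& s, const std::shared_ptr<Node>& b_star,
            const std::function<void(const Node&)>& onCommit) {
    auto qc2 = b_star->justify;                                  // 指向 b''
    if (!qc2 || !qc2->node) return;
    auto b2 = qc2->node;
    // One-Chain:即使不是直接链接,只要更高就可以安全地更新 genericQC(见正文)
    if (!s.genericQC || !s.genericQC->node || b2->height > s.genericQC->node->height)
        s.genericQC = qc2;

    auto qc1 = b2->justify;                                      // 指向 b'
    if (!qc1 || !qc1->node) return;
    auto b1 = qc1->node;
    // Two-Chain:同样,即使不是直接链接也可以更新锁,安全性不会被破坏
    if (!s.lockedQC || !s.lockedQC->node || b1->height > s.lockedQC->node->height)
        s.lockedQC = qc1;

    auto qc0 = b1->justify;                                      // 指向 b
    if (!qc0 || !qc0->node) return;
    auto b = qc0->node;
    // Three-Chain:当三个链接都是直接父链接时,b 成为已提交的决策
    if (b_star->parent == b2 && b2->parent == b1 && b1->parent == b)
        onCommit(*b);
}
```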

算法3给出了Chained HotStuff的伪代码。附录A中定理5给出的安全性证明与Basic HotStuff的证明类似。我们要求有效节点中的QC引用其祖先。为简便起见,我们假设该约束始终成立,并在代码中省略了相应的检查。

Algorithm 3 shows the pseudocode for Chained HotStuff. The proof of safety given by Theorem 5 in Appendix A is similar to the one for Basic HotStuff. We require the QC in a valid node refers to its ancestor. For brevity, we assume the constraint always holds and omit checking in the code.

算法3 Chained HotStuff协议

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Basic HotStuff

这是《HotStuff:BFT Consensus in the Lens of Blockchain》论文翻译的第四篇,包含了论文的Basic HotStuff章节。


4. Basic HotStuff协议

HotStuff解决的是状态机复制(SMR)问题。SMR的核心是一个对客户端不断增长的命令请求日志达成决定的协议,一组状态机副本按照一致的顺序应用这些命令。客户端向所有副本发送命令请求,并等待其中f+1个副本的响应。在大部分讨论中我们省略客户端这一角色,有关客户端请求的编号与去重等问题,请参考标准文献的处理方式。

HotStuff solves the State Machine Replication (SMR) problem. At the core of SMR is a protocol for deciding on a growing log of command requests by clients. A group of state-machine replicas apply commands in sequence order consistently. A client sends a command request to all replicas, and waits for responses from (f + 1) of them. For the most part, we omit the client from the discussion, and defer to the standard literature for issues regarding numbering and de-duplication of client requests.

算法2给出了Basic HotStuff的解决方案。该协议在一系列编号单调递增的视图(view)中运行,每个viewNumber都有一个所有副本公认的唯一leader。每个副本都在本地存储一棵待定命令树作为其数据结构。每个树节点包含一条客户端提议的命令(或一批命令)、与协议相关的元数据,以及一个父节点链接。一个节点引出的分支,是指从该节点沿父链接一直走到树根的整条路径。在协议运行过程中,一条单调增长的分支会被提交(committed)。为了提交分支,提出该分支的视图leader必须在prepare、pre-commit、commit三个阶段分别收集n-f个副本的投票。

【由于这里介绍的是一个简化版模型,Basic HotStuff里面不存在leader节点的选取细节,leader节点可以看做在运行之初被公认了,后面的文章会提到,leader的选取由一个独立的pacemaker算法实现。同时,视图可以理解为账本变动的最小单位,相同视图中账本内容完全相同,而记账则体现在视图的变化。】

The Basic HotStuff solution is presented in Algorithm 2. The protocol works in a succession of views numbered with monotonically increasing view numbers. Each viewNumber has a unique dedicated leader known to all. Each replica stores a tree of pending commands as its local data structure. Each tree node contains a proposed command (or a batch of them), metadata associated with the protocol, and a parent link. The branch led by a given node is the path from the node all the way to the tree root by visiting parent links. During the protocol, a monotonically growing branch becomes committed. To become committed, the leader of a particular view proposing the branch must collect votes from a quorum of (n - f) replicas in three phases, prepare, pre-commit, and commit.

协议的一个关键组成部分,是针对某个leader提案收集到的n-f个投票,称为仲裁证书(quorum certificate,简称QC)。QC与一个特定的树节点及视图编号相关联。tcombine工具使用阈值签名方案,把n-f个签名投票表示为单个身份验证器。

A key ingredient in the protocol is a collection of (n - f) votes over a leader proposal, referred to as a quorum certificate (or “QC” in short). The QC is associated with a particular node and a view number. The tcombine utility employs a threshold signature scheme to generate a representation of (n - f) signed votes as a single authenticator.
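
下面是一个示意性的C++片段,展示leader如何把(n-f)个投票合并成一个QC。其中tcombine只是一个占位实现(把消息与各部分签名简单拼接),真实系统中应替换为正文所述阈值签名方案的合并操作;Vote、QC等结构的字段亦为假设:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct PartialSig { int replicaId; std::vector<uint8_t> bytes; };

struct Vote {                       // 一张针对 <type, viewNumber, node> 的签名投票
    std::string type;               // "prepare" / "pre-commit" / "commit"
    uint64_t viewNumber;
    std::string nodeDigest;         // 被投票节点的哈希摘要
    PartialSig partialSig;
};

struct QC {                         // 仲裁证书:同一三元组 + 合并后的单个签名
    std::string type;
    uint64_t viewNumber;
    std::string nodeDigest;
    std::vector<uint8_t> sig;
};

// 占位的 tcombine:仅作演示,真实实现应调用阈值签名库
std::vector<uint8_t> tcombine(const std::string& msg, const std::vector<PartialSig>& parts) {
    std::vector<uint8_t> out(msg.begin(), msg.end());
    for (const auto& p : parts) out.insert(out.end(), p.bytes.begin(), p.bytes.end());
    return out;
}

// leader 收到足够投票后组装 QC;若票数不足 (n - f) 则返回空
std::optional<QC> makeQC(const std::vector<Vote>& votes, int n, int f) {
    if ((int)votes.size() < n - f) return std::nullopt;
    std::vector<PartialSig> parts;
    for (const auto& v : votes) parts.push_back(v.partialSig);   // 假设已校验均指向同一三元组
    const Vote& v0 = votes.front();
    std::string msg = v0.type + "|" + std::to_string(v0.viewNumber) + "|" + v0.nodeDigest;
    return QC{v0.type, v0.viewNumber, v0.nodeDigest, tcombine(msg, parts)};
}
```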

下面将首先描述Basic HotStuff协议的各阶段执行逻辑,紧接着是基于伪代码(算法2)更精细的描述。最后用对Basic HotStuff的安全性、活跃性、复杂度分析结束本节。

Below we give an operational description of the protocol logic by phases, followed by a precise specification in Algorithm 2, and conclude the section with safety, liveness, and complexity arguments.

4.1 阶段

准备(prepare)阶段 新leader的协议从收集n-f个副本的new-view消息开始。副本在进入新的viewNumber时(包括第一个视图)会向leader发送new-view消息,并在其中捎带该副本收到过的最高prepareQC(若没有则为\perp),详见下文。

leader处理这些消息,从中选出视图最高的prepareQC,记为highQC。由于highQC是这n-f个副本所报告的最高者,不存在比它更高的视图曾达成提交决定,因此由highQC.node引出的分支是安全的。

连任的leader可以省略通过收集new-view消息来选择安全分支这一步,直接把自己已知的最高prepareQC用作highQC。我们把这一优化推迟到第6节,本节只描述一个统一的leader协议。请注意,与类PBFT协议不同,把这一步包含在leader协议中十分直接,而且无论何种情况,它带来的开销都与协议其他阶段一样是线性的。

leader使用createLeaf方法用一个新的提案扩展highQC.node的尾部。该方法创建一个新的叶节点作为子节点,并在子节点中嵌入父节点的摘要。然后,leader节点使用一个prepare类型消息将新节点发送到所有其他副本。该消息同时包含了highQC以支持后续的安全性验证。

从leader收到当前视图的prepare消息后,副本r使用safeNode谓词判断是否接受它。如果接受,副本就向leader发送一个针对该提案、带有部分签名(由tsign_r生成)的prepare投票。

prepare phase. The protocol for a new leader starts by collecting new-view messages from (n - f) replicas. The new-view message is sent by a replica as it transitions into viewNumber (including the first view) and carries the highest prepareQC that the replica received (⊥ if none), as described below.

The leader processes these messages in order to select a branch that has the highest preceding view in which a prepareQC was formed. The leader selects the prepareQC with the highest view, denoted highQC, among the new-view messages. Because highQC is the highest among (n - f) replicas, no higher view could have reached a commit decision. The branch led by highQC.node is therefore safe.

Collecting new-view messages to select a safe branch may be omitted by an incumbent leader, who may simply select its own highest prepareQC as highQC. We defer this optimization to Section 6 and only describe a single, unified leader protocol in this section. Note that, different from PBFT-like protocols, including this step in the leader protocol is straightforward, and it incurs the same, linear overhead as all the other phases of the protocol, regardless of the situation.

The leader uses the createLeaf method to extend the tail of highQC.node with a new proposal. The method creates a new leaf node as a child and embeds a digest of the parent in the child node. The leader then sends the new node in a prepare message to all other replicas. The proposal carries highQC for safety justification.

Upon receiving the prepare message for the current view from the leader, replica r uses the safeNode predicate to determine whether to accept it. If it is accepted, the replica sends a prepare vote with a partial signature (produced by tsign_r ) for the proposal to the leader.
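
下面是一段示意性的C++骨架,串起新leader在prepare阶段的主要步骤:从(n-f)条new-view消息中挑出视图最高的prepareQC作为highQC,用createLeaf在其后扩展新提案,并连同highQC一起放入prepare消息广播。所有类型与函数名均为说明性假设:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

struct Node;
struct QC { uint64_t viewNumber = 0; std::shared_ptr<Node> node; };
struct Node { std::shared_ptr<Node> parent; std::vector<uint8_t> cmd; };
struct NewViewMsg { std::shared_ptr<QC> prepareQC; };    // 空指针表示 ⊥
struct PrepareMsg { std::shared_ptr<Node> node; std::shared_ptr<QC> justify; };

// createLeaf:以 parent 为父节点创建携带命令的新叶节点(父摘要嵌入等细节从略)
std::shared_ptr<Node> createLeaf(std::shared_ptr<Node> parent, std::vector<uint8_t> cmd) {
    auto leaf = std::make_shared<Node>();
    leaf->parent = std::move(parent);
    leaf->cmd = std::move(cmd);
    return leaf;
}

// 新 leader 的 prepare 阶段:从 (n-f) 条 new-view 消息中选出最高的 prepareQC 作为 highQC
PrepareMsg leaderPrepare(const std::vector<NewViewMsg>& newViews, std::vector<uint8_t> cmd) {
    std::shared_ptr<QC> highQC;
    for (const auto& m : newViews)
        if (m.prepareQC && (!highQC || m.prepareQC->viewNumber > highQC->viewNumber))
            highQC = m.prepareQC;
    // 若所有 prepareQC 都为 ⊥,真实协议会落在创世节点 b0 上,这里从略
    auto leaf = createLeaf(highQC ? highQC->node : nullptr, std::move(cmd));
    return PrepareMsg{leaf, highQC};   // 携带 highQC 的 prepare 消息随后广播给所有副本
}
```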

安全节点(safeNode)谓词 safeNode谓词是协议的核心组成部分。它检查携带QC依据m.justify的提案消息m,并判断m.node是否可以安全接受。接受提案的安全规则是:m.node所在分支从当前锁定的节点lockedQC.node扩展而来。另一方面,活跃性规则是:若m.justify的视图高于当前的lockedQC,副本就接受m。只要这两条规则中任意一条成立,该谓词就返回真。

【我的理解是,safety rule指的是leader在lockedQC.node后面加了一个command节点,对于副本来说,接受这个节点毫无坏处,甚至可以和leader保持同步。liveness rule指的是,当前副本发现自己锁定的节点已经过时了,leader提案节点的父节点已经是获得n-f个投票的QC了,自己永远卡在一个古老的状态毫无意义,所以也会支持leader。safeNode谓词为false意味着副本和leader在lockedQC上有冲突,且位于同一个视图,这个时候,副本不给leader投票。】

safeNode predicate. The safeNode predicate is a core ingredient of the protocol. It examines a proposal message m carrying a QC justification m.justify, and determines whether m.node is safe to accept. The safety rule to accept a proposal is the branch of m.node extends from the currently locked node lockedQC.node. On the other hand, the liveness rule is the replica will accept m if m.justify has a higher view than the current lockedQC. The predicate is true as long as either one of two rules holds.
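
下面用C++写出safeNode谓词的一个最小草图:extendsFrom沿父链接回溯判断分支扩展关系(安全规则),视图比较对应活跃性规则。类型与字段名均为假设:

```cpp
#include <cstdint>
#include <memory>

struct Node;
struct QC { uint64_t viewNumber = 0; std::shared_ptr<Node> node; };
struct Node { std::shared_ptr<Node> parent; };

// 沿父链接回溯,判断 branch 所在分支是否从 ancestor 扩展而来
bool extendsFrom(const Node* branch, const Node* ancestor) {
    for (const Node* b = branch; b != nullptr; b = b->parent.get())
        if (b == ancestor) return true;
    return false;
}

// 对应算法 1:安全规则(第 26 行)或活跃性规则(第 27 行)任一成立即可接受
bool safeNode(const Node* node, const QC& justify, const QC& lockedQC) {
    bool safetyRule   = extendsFrom(node, lockedQC.node.get());
    bool livenessRule = justify.viewNumber > lockedQC.viewNumber;
    return safetyRule || livenessRule;
}
```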

预提交(pre-commit)阶段 当leader收到n-f个针对当前提案curProposal的prepare投票时,把它们合并为一个prepareQC。leader用pre-commit消息广播prepareQC。副本则回复leader一个pre-commit投票,其中带有对该提案摘要的签名。

pre-commit phase. When the leader receives (n - f) prepare votes for the current proposal curProposal, it combines them into a prepareQC. The leader broadcasts prepareQC in pre-commit messages. A replica responds to the leader with pre-commit vote having a signed digest of the proposal.

提交(commit)阶段 commit阶段与pre-commit阶段类似。当leader收到n-f个pre-commit投票时,把它们合并为一个precommitQC并在commit消息中广播;副本用commit投票予以回应。重要的是,此时副本会把自己的lockedQC设置为precommitQC(算法2第25行),从而锁定在precommitQC上。这对于在该提案将成为共识决定的情况下保护其安全性至关重要。

commit phase. The commit phase is similar to pre-commit phase. When the leader receives (n-f) pre-commit votes, it combines them into a precommitQC and broadcasts it in commit messages; replicas respond to it with a commit vote. Importantly, a replica becomes locked on the precommitQC at this point by setting its lockedQC to precommitQC (Line 25 of Algorithm 2). This is crucial to guard the safety of the proposal in case it becomes a consensus decision.

决定(decide)阶段 当leader收到n-f个commit投票时,把它们合并为一个commitQC。一旦leader得到commitQC,就通过decide消息把它发送给所有其他副本。收到decide消息后,副本把commitQC所认证的提案视为已提交的决定,并执行已提交分支中的命令。随后副本递增viewNumber并开始下一个视图。

decide phase. When the leader receives (n - f) commit votes, it combines them into a commitQC. Once the leader has assembled a commitQC, it sends it in a decide message to all other replicas. Upon receiving a decide message, a replica considers the proposal embodied in the commitQC a committed decision, and executes the commands in the committed branch. The replica increments viewNumber and starts the next view.

下一个视图(nextView)中断 在所有阶段中,副本都会在viewNumber视图处等待由nextView(viewNumber)函数确定的超时时间段。如果nextView(viewNumber)中断等待,副本也会增加viewNumber并开始下一个视图。

nextView interrupt. In all phases, a replica waits for a message at view viewNumber for a timeout period, determined by an auxiliary nextView(viewNumber) utility. If nextView(viewNumber) interrupts waiting, the replica also increments viewNumber and starts the next view.

4.2 数据结构

消息(message) 协议中的消息m由一组固定字段组成,这些字段由算法1中的Msg()工具填充。m会被自动打上curView标记,即发送者当前的视图编号。每条消息都有一个类型m.type \in \{new-view, prepare, pre-commit, commit, decide\}。m.node包含一个被提议的节点(即被提议分支的叶节点)。另有一个可选字段m.justify:leader总是用这个字段携带各阶段的QC,副本则在new-view消息中用它携带自己最高的prepareQC。以副本角色发送的每条消息都包含发送者对三元组\langle m.type, m.viewNumber, m.node \rangle的部分签名m.partialSig,该签名在voteMsg()工具中被加入。

Messages. A message m in the protocol has a fixed set of fields that are populated using the Msg() utility shown in Algorithm 1. m is automatically stamped with curView, the sender’s current view number. Each message has a type m.type ∈ {new-view, prepare, pre-commit, commit, decide}. m.node contains a proposed node (the leaf node of a proposed branch). There is an optional field m.justify. The leader always uses this field to carry the QC for the different phases. Replicas use it in new-view messages to carry the highest prepareQC. Each message sent in a replica role contains a partial signature m.partialSig by the sender over the tuple \langle m.type, m.viewNumber, m.node \rangle, which is added in the voteMsg() utility.
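
下面给出消息结构及Msg()/voteMsg()组装方式的一个示意性C++草图。字段名取自上文,tsign为占位实现(真实系统应由副本私钥生成部分签名):

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

enum class MsgType { NewView, Prepare, PreCommit, Commit, Decide };

struct QC { MsgType type; uint64_t viewNumber; std::string nodeDigest; std::vector<uint8_t> sig; };

struct Msg {
    MsgType type;
    uint64_t viewNumber;              // Msg() 自动填入发送者的 curView
    std::string nodeDigest;           // m.node:被提议分支的叶节点(这里以摘要代替)
    std::optional<QC> justify;        // 可选:leader 携带各阶段 QC,new-view 携带最高 prepareQC
    std::vector<uint8_t> partialSig;  // 副本角色发送时由 voteMsg() 填入的部分签名
};

// 占位的 tsign:仅作演示,真实实现由副本私钥对三元组做部分签名
std::vector<uint8_t> tsign(int replicaId, const std::string& tuple) {
    std::vector<uint8_t> out(tuple.begin(), tuple.end());
    out.push_back(static_cast<uint8_t>(replicaId));
    return out;
}

Msg makeMsg(MsgType type, uint64_t curView, std::string nodeDigest, std::optional<QC> qc) {
    return Msg{type, curView, std::move(nodeDigest), std::move(qc), {}};
}

Msg voteMsg(int replicaId, MsgType type, uint64_t curView, std::string nodeDigest,
            std::optional<QC> qc = std::nullopt) {
    Msg m = makeMsg(type, curView, std::move(nodeDigest), std::move(qc));
    m.partialSig = tsign(replicaId, std::to_string(static_cast<int>(m.type)) + "|" +
                                    std::to_string(m.viewNumber) + "|" + m.nodeDigest);
    return m;
}
```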

仲裁证书(Quorum certificates) 一个仲裁证书是建立在三元组\langle type, viewNumber, node \rangle之上的,并且组合了该三元组在n-f个副本上的签名。给定一个仲裁证书 qc,我们使用qc.type, qc.viewNumber, qc.node来指示原三元组中的项。

Quorum certificates. A Quorum Certificate (QC) over a tuple \langle type, viewNumber, node \rangle is a data type that combines a collection of signatures for the same tuple signed by (n-f) replicas. Given a QC qc, we use qc.type, qc.viewNumber, qc.node to refer to the corresponding fields of the tuple.

树(Tree)和分支(Branches) 每条命令都被包装在一个节点中,节点还包含一个父链接,该链接可以是父节点的哈希摘要(我们在伪代码中省略了这一实现细节)。在协议期间,副本只有在某个节点引出的分支已存在于其本地树中之后,才会交付与该节点相关的消息。实际上,落后的副本可以通过从其他副本获取缺失的节点来赶上进度;为简洁起见,这些细节在伪代码中同样被省略。如果两条分支互相都不是对方的扩展,则这两条分支冲突。如果两个节点引出的分支冲突,则这两个节点冲突。

Tree and branches. Each command is wrapped in a node that additionally contains a parent link which could be a hash digest of the parent node. We omit the implementation details from the pseudocode. During the protocol, a replica delivers a message only after the branch led by the node is already in its local tree. In practice, a recipient who falls behind can catch up by fetching missing nodes from other replicas. For brevity, these details are also omitted from the pseudocode. Two branches are conflicting if neither one is an extension of the other. Two nodes are conflicting if the branches led by them are conflicting.

记账变量 副本使用额外的局部变量来记录协议状态:
i) viewNumber,最初为1,通过完成决策或nextView中断实现递增;
ii) lockedQC,最初为⊥,存储副本投票提交的最高QC;
iii) prepareQC,最初为⊥,存储副本投过pre-commit票的最高QC。
此外,为了增量地执行一个提交的命令日志,副本维护其分支已被执行的最高节点。为了简洁起见,下面省略这一点。

Bookkeeping variables. A replica uses additional local variables for bookkeeping the protocol state: (i) a viewNumber, initially 1 and incremented either by finishing a decision or by a nextView interrupt; (ii) a locked quorum certificate lockedQC, initially ⊥, storing the highest QC for which a replica voted commit; and (iii) a prepareQC, initially ⊥, storing the highest QC for which a replica voted pre-commit. Additionally, in order to incrementally execute a committed log of commands, the replica maintains the highest node whose branch has been executed. This is omitted below for brevity.

4.3 协议规范

算法2中给出的协议被描述为逐视图迭代的循环。在每个视图中,副本根据其扮演的角色依次执行各阶段,这些角色被描述为一系列“as”块(注意,一个副本可以有多个角色,例如leader同时也是一个普通副本)。不同角色的as块可以并发执行,而每个as块自身的执行是原子的。nextView中断会中止任意as块中的所有操作,并跳转到Finally块。

The protocol given in Algorithm 2 is described as an iterated view-by-view loop. In each view, a replica performs phases in succession based on its role, described as a succession of “as” blocks. A replica can have more than one role. For example, a leader is also a (normal) replica. Execution of as blocks across roles can proceed concurrently. The execution of each as block is atomic. A nextView interrupt aborts all operations in any as block, and jumps to the “Finally” block.

4.4 安全性、活跃性以及复杂度

安全性(Safety). 我们定义仲裁证书 qc是合法的当且仅当tverify(\langle qc.type, qc.viewNumber, qc.node \rangle, qc.sig)为真。

Safety. We first define a quorum certificate qc to be valid if tverify(\langle qc.type, qc.viewNumber, qc.node \rangle, qc.sig) is true.

引理1. 对于任意合法的qc_1、qc_2,如果qc_1.type = qc_2.type且qc_1.node与qc_2.node冲突,那么qc_1.viewNumber \neq qc_2.viewNumber。

Lemma 1. For any valid qc_1, qc_2 in which qc_1.type = qc_2.type and qc_1.node conflicts with qc_2.node, we have qc_1.viewNumber \neq qc_2.viewNumber.

证明. 反证法。假设qc_1.viewNumber = qc_2.viewNumber = v。由于一个合法的QC只能由n - f = 2f + 1个针对它的投票(部分签名)构成,那么一定存在某个正确的副本在视图v的同一阶段投了两次票。这是不可能的,因为伪代码只允许在每个视图的每个阶段投一次票。

Proof. To show a contradiction, suppose qc_1.viewNumber = qc_2.viewNumber = v. Because a valid QC can be formed only with n - f = 2f + 1 votes (i.e., partial signatures) for it, there must be a correct replica who voted twice in the same phase of v. This is impossible because the pseudocode allows voting only once for each phase in each view.

定理2 如果w和b是相互冲突的节点,那么它们不可能分别被某个正确的副本提交(commit)。

Theorem 2. If w and b are conflicting nodes, then they cannot be both committed, each by a correct replica.

证明 我们用反证法证明这个重要的结论。令qc_1表示一个合法的commitQC(即qc_1.type = commit),满足qc_1.node = w;令qc_2表示另一个合法的commitQC,满足qc_2.node = b。记v_1 = qc_1.viewNumber,v_2 = qc_2.viewNumber。由引理1可知v_1 \neq v_2,不失一般性,假设v_1 < v_2。

现在,我们用v_s表示比v_1高的视图中最低的一个,使得存在一个合法的prepareQC,记为qc_s(即qc_s.type = prepare),满足qc_s.viewNumber = v_s,且qc_s.node与w冲突。形式化地,我们对任意prepareQC定义如下谓词:

E(prepareQC):=(v_1 < prepareQC.viewNumber \leq v_2 ) \wedge (prepareQC.node\ conflicts\ with\ w).

我们现在可以按如下形式定义第一个“切换点”qc_s(可以把qc_s看作冲突开始出现的位置):

qc_s := argmin_{prepareQC} \{prepareQC.viewNumber | prepareQC\ is\ valid \wedge E(prepareQC)\} .

注意,根据假设,这样的qc_s必定存在;例如,qc_s可以是在视图v_2中形成的prepareQC。

在那些发送过部分签名tsign_r(\langle qc_1.type, qc_1.viewNumber, qc_1.node \rangle)的正确副本中,令r是第一个同时也贡献了tsign_r(\langle qc_s.type, qc_s.viewNumber, qc_s.node\rangle)的副本;这样的r必定存在,否则qc_1.sig和qc_s.sig之中必有一个无法被构造出来。在视图v_1期间,副本r在算法2第25行把它的锁lockedQC更新为w上的precommitQC。由于v_s的最小性,副本r在由w引出的分支上持有的锁在qc_s形成之前不会改变;否则,由于第17行先于第25行执行,r必定已经见过某个视图更低的prepareQC,这与最小性矛盾。现在考虑副本r在视图v_s的prepare阶段对safeNode的调用,其中消息m携带m.node = qc_s.node。根据假设,m.node与lockedQC.node冲突,因此算法1第26行的析取条件为假;此外,若m.justify.viewNumber > v_1则会违反v_s的最小性,因此算法1第27行的析取条件也为假。于是safeNode必然返回false,r无法在视图v_s中为这条冲突分支投prepare票,矛盾,证毕。

Proof. We prove this important theorem by contradiction. Let qc_1 denote a valid commitQC (i.e., qc_1.type = commit) such that qc_1.node = w, and qc_2 denote a valid commitQC such that qc_2.node = b. Denote v_1 = qc_1.viewNumber and v_2 = qc_2.viewNumber. By Lemma 1, v_1 \neq v_2 . W.l.o.g. assume v_1 < v_2 .

We will now denote by v_s the lowest view higher than v_1 for which there is a valid prepareQC, qc_s (i.e., qc_s.type = prepare) where qc_s.viewNumber = v_s , and qc_s.node conflicts with w. Formally, we define the following predicate for any prepareQC:

E(prepareQC):=(v_1 < prepareQC.viewNumber \leq v_2 ) \wedge (prepareQC.node\ conflicts\ with\ w).

We can now set the first switching point qc_s :

qc_s := argmin_{prepareQC} \{prepareQC.viewNumber | prepareQC\ is\ valid \wedge E(prepareQC)\} .

Note that, by assumption such a qc_s must exist; for example, qc_s could be the prepareQC formed in view v_2 .

Of the correct replicas that sent a partial result tsign_r(\langle qc_1.type, qc_1.viewNumber, qc_1.node \rangle), let r be the first that contributed tsign_r(\langle qc_s.type, qc_s.viewNumber, qc_s.node \rangle); such an r must exist since otherwise, one of qc_1.sig and qc_s.sig could not have been created. During view v_1 , replica r updates its lock lockedQC to a precommitQC on w at Line 25 of Algorithm 2. Due to the minimality of v_s , the lock that replica r has on the branch led by w is not changed before qc_s is formed. Otherwise r must have seen some other prepareQC with lower view because Line 17 comes before Line 25, contradicting the minimality. Now consider the invocation of safeNode in the prepare phase of view v_s by replica r, with a message m carrying m.node = qc_s.node. By assumption, m.node conflicts with lockedQC.node, and so the disjunct at Line 26 of Algorithm 1 is false. Moreover, m.justify.viewNumber > v_1 would violate the minimality of v_s , and so the disjunct in Line 27 of Algorithm 1 is also false. Thus, safeNode must return false and r cannot cast a prepare vote on the conflicting branch in view v_s , a contradiction.

活跃性(Liveness). 前文中有两个函数尚未定义:leader和nextView。它们的定义不影响协议的安全性,但与活跃性有关。在给出候选定义之前,我们先证明:GST之后存在一个有界的时间段T_f,使得如果所有正确副本在T_f期间都停留在视图v中,且视图v的leader是正确的,那么就能达成一个决定。在下文中,我们称qc_1与qc_2匹配(match),当且仅当qc_1和qc_2都是合法的,且qc_1.node = qc_2.node、qc_1.viewNumber = qc_2.viewNumber。

Liveness. There are two functions left undefined in the previous section: leader and nextView. Their definition will not affect safety of the protocol, though they do matter to liveness. Before giving candidate definitions for them, we first show that after GST, there is a bounded duration T_f such that if all correct replicas remain in view v during T_f and the leader for view v is correct, then a decision is reached. Below, we say that qc_1 and qc_2 match if qc_1 and qc_2 are valid, qc_1.node = qc_2.node, and qc_1.viewNumber = qc_2.viewNumber.

引理3 如果某个正确的副本被锁定,且lockedQC = precommitQC,那么至少有f+1个正确的副本曾为某个与lockedQC匹配的prepareQC投过票。

Lemma 3. If a correct replica is locked such that lockedQC = precommitQC, then at least f+1 correct replicas voted for some prepareQC matching lockedQC.

证明 假设副本r锁定在precommitQC上。那么在prepare阶段,曾有n-f张投票投给与之匹配的prepareQC(算法2第10行),其中至少f+1张来自正确的副本。

Proof. Suppose replica r is locked on precommitQC. Then, (n - f) votes were cast for the matching prepareQC in the prepare phase (Line 10 of Algorithm 2), out of which at least f + 1 were from correct replicas.

定理4 在GST之后,存在一个有界的时间段T_f,使得如果所有正确副本在T_f期间都停留在视图v中,且视图v的leader是正确的,那么就能达成一个决定。

Theorem 4. After GST, there exists a bounded time period T_f such that if all correct replicas remain in view v during T_f and the leader for view v is correct, then a decision is reached.

证明 进入新视图后,leader先收集n-f条new-view消息并计算出highQC,再广播prepare消息。假设在所有副本(包括leader自己)中,所保存的最高的锁是lockedQC = precommitQC^*。由引理3可知,至少有f+1个正确副本曾为与precommitQC^*匹配的某个prepareQC^*投过票,并且已经在各自的new-view消息中把它发给了leader。因此,leader一定能从这些new-view消息中的至少一条里得知一个匹配的prepareQC^*,并把它用作自己prepare消息中的highQC。根据假设,所有正确副本在该视图中保持同步,且leader无故障。因此,所有正确副本都会在prepare阶段投票,因为在safeNode中,算法1第27行的条件得到满足(即使消息中的节点与某个副本过时的lockedQC.node冲突,从而第26行的条件不满足)。随后,leader为该视图聚合出合法的prepareQC之后,所有副本都会在接下来的各个阶段投票,从而产生一个新的决定。在GST之后,完成这些阶段所需的时间T_f是有界的。

Proof. Starting in a new view, the leader collects (n - f) new-view messages and calculates its highQC before broadcasting a prepare message. Suppose among all replicas (including the leader itself), the highest kept lock is lockedQC = precommitQC^∗ . By Lemma 3, we know there are at least f+1 correct replicas that voted for a prepareQC^∗ matching precommitQC^∗ , and have already sent them to the leader in their new-view messages. Thus, the leader must learn a matching prepareQC^∗ in at least one of these new-view messages and use it as highQC in its prepare message. By the assumption, all correct replicas are synchronized in their view and the leader is nonfaulty. Therefore, all correct replicas will vote in the prepare phase, since in safeNode, the condition on Line 27 of Algorithm 1 is satisfied (even if the node in the message conflicts with a replica’s stale lockedQC.node, and so Line 26 is not). Then, after the leader assembles a valid prepareQC for this view, all replicas will vote in all the following phases, leading to a new decision. After GST, the duration T_f for these phases to complete is of bounded length.

该协议是乐观响应的,因为其中没有显式的“等待\Delta”步骤,并且在三阶段范式的帮助下,safeNode中的逻辑析取可以用来覆盖过时的锁。

The protocol is Optimistically Responsive because there is no explicit “wait-for-∆” step, and the logical disjunction in safeNode is used to override a stale lock with the help of the three-phase paradigm.

我们现在为leader和nextView给出简单的构造,以确保GST之后最终会到达这样一个视图:该视图的leader是正确的,且所有正确副本都在该视图中停留T_f时间。对于leader而言,只需给出某个从视图编号到副本的确定性映射,并最终在所有副本之间轮换即可。nextView的一个可行方案是利用指数回退(exponential back-off)机制维护一个超时间隔:每进入一个视图就设置一个计时器;若计时器到期仍未做出任何决定,副本就把间隔加倍,并调用nextView推进视图。由于间隔每次都翻倍,所有正确副本的等待区间最终会有至少T_f的共同重叠,在此期间leader就能够推动出一个决定。

We now provide simple constructions for leader and nextView that suffice to ensure that after GST, eventually a view will be reached in which the leader is correct and all correct replicas remain in this view for T_f time. It suffices for leader to return some deterministic mapping from view number to a replica, eventually rotating through all replicas. A possible solution for nextView is to utilize an exponential back-off mechanism that maintains a timeout interval. Then a timer is set upon entering each view. When the timer goes off without making any decision, the replica doubles the interval and calls nextView to advance the view. Since the interval is doubled at each time, the waiting intervals of all correct replicas will eventually have at least T_f overlap in common, during which the leader could drive a decision.
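
下面是上述指数回退思路的一个玩具级C++草图(计时器加轮换leader)。接口与命名均为假设,仅用于说明:

```cpp
#include <chrono>
#include <cstdint>

class ViewTimer {
public:
    explicit ViewTimer(std::chrono::milliseconds base) : base_(base), interval_(base) {}

    void onEnterView() { deadline_ = Clock::now() + interval_; }   // 进入新视图时启动计时

    void onDecide() { interval_ = base_; }                         // 做出决定后重置回退间隔

    // 超时且未做出决定:间隔翻倍,推进到下一个视图(对应 nextView)
    bool maybeNextView(uint64_t& viewNumber) {
        if (Clock::now() < deadline_) return false;
        interval_ *= 2;      // 间隔每次翻倍,最终所有正确副本的等待区间会有至少 T_f 的重叠
        ++viewNumber;
        onEnterView();
        return true;
    }

    // 确定性的轮换 leader:从视图编号到副本编号的简单映射
    static int leader(uint64_t viewNumber, int n) {
        return static_cast<int>(viewNumber % static_cast<uint64_t>(n));
    }

private:
    using Clock = std::chrono::steady_clock;
    std::chrono::milliseconds base_, interval_;
    Clock::time_point deadline_{};
};
```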

两阶段HotStuff的活跃性问题 现在我们简要展示“两阶段”HotStuff的一个永不做出决定的场景。这解释了Casper和Tendermint中引入同步延迟、从而放弃(乐观)响应性的必要性。

在HotStuff的两阶段变体中,我们省略pre-commit阶段,直接进入commit阶段。副本在为一个prepareQC投票时即被锁定。假设在视图v中,leader提出了提案b。prepare阶段完成后,某个副本r_v为相应的prepareQC(记为qc,满足qc.node = b)投了票,于是r_v锁定在qc上。而一次异步的网络调度使其余副本在没有收到qc的情况下直接进入下一个视图v+1。

现在,我们把下面这个单视图的脚本无限重复。进入视图v+1时,只有副本r_v持有系统中最高的prepareQC(即qc)。新leader l从除r_v之外的2f+1个副本那里收集new-view消息,其中最高的prepareQC(记为qc')的视图为v-1,且b' = qc'.node与b冲突。随后l提出扩展自b'的提案b'',2f个诚实副本对其投票响应,但r_v拒绝了它,因为r_v已锁定在qc上,而b''与b冲突且qc'低于qc。最终,这2f个副本放弃等待并进入下一个视图。恰在此时,一个故障副本回应了l的提案,于是l拼出了一个prepareQC(v+1, b''),而某个副本(记为r_{v+1})为它投票并锁定在它上面。

Livelessness with two-phases. We now briefly demonstrate an infinite non-deciding scenario for a “two-phase” HotStuff. This explains the necessity for introducing a synchronous delay in Casper and Tendermint, and hence for abandoning (Optimistic) Responsiveness.

In the two-phase HotStuff variant, we omit the pre-commit phase and proceed directly to commit. A replica becomes locked when it votes on a prepareQC. Suppose, in view v, a leader proposes b. It completes the prepare phase, and some replica r_v votes for the prepareQC, say qc, such that qc.node = b. Hence, r_v becomes locked on qc. An asynchronous network scheduling causes the rest of the replicas to move to view v + 1 without receiving qc.

We now repeat ad infinitum the following single-view transcript. We start view v + 1 with only r_v holding the highest prepareQC (i.e. qc) in the system. The new leader l collects new-view messages from 2f + 1 replicas excluding r_v . The highest prepareQC among these, qc' , has view v-1 and b' = qc'.node conflicts with b. l then proposes b'' which extends b' , to which 2f honest replicas respond with a vote, but r_v rejects it because it is locked on qc, b'' conflicts with b and qc' is lower than qc. Eventually, 2f replicas give up and move to the next view. Just then, a faulty replica responds to l’s proposal, l then puts together a prepareQC(v+1, b'') and one replica, say r_{v+1} , votes for it and becomes locked on it.

复杂度 在HotStuff的每个阶段,只有leader向所有副本广播,而每个副本只需用一个部分签名向发送者回复一次以认证其投票。在leader的消息中,QC由先前收集到的n-f张投票的证明构成,可以编码为单个阈值签名;在副本的回复中,该副本的部分签名是唯一的身份验证器。因此,每个阶段总共接收O(n)个身份验证器(authenticator)。由于阶段数量是常数,每个视图的总体复杂度为O(n)。

Complexity. In each phase of HotStuff, only the leader broadcasts to all replicas while the replicas respond to the sender once with a partial signature to certify the vote. In the leader’s message, the QC consists of a proof of (n-f) votes collected previously, which can be encoded by a single threshold signature. In a replica’s response, the partial signature from that replica is the only authenticator. Therefore, in each phase, there are O(n) authenticators received in total. As there is a constant number of phases, the overall complexity per view is O(n).

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Model

这是《HotStuff:BFT Consensus in the Lens of Blockchain》论文翻译的第三篇,包含了论文的模型部分。


3. 模型

我们考虑一个由固定的n = 3f + 1个副本组成的系统,副本由i \in [n]索引,其中[n] = \{1, 2, \dots, n\}。一个至多包含f = |F|个副本的集合F \subset [n]是拜占庭故障的,其余副本都是正确的。我们通常把拜占庭副本视为由一个敌手(adversary)协调,敌手掌握这些副本持有的全部内部状态(包括它们的密钥,见下文)。

We consider a system consisting of a fixed set of n = 3f + 1 replicas, indexed by i ∈ [n] where [n] = {1, . . . , n}. A set F ⊂ [n] of up to f = |F| replicas are Byzantine faulty, and the remaining ones are correct. We will often refer to the Byzantine replicas as being coordinated by an adversary, which learns all internal state held by these replicas (including their cryptographic keys, see below).

网络通信是点对点、经过认证且可靠的:一个正确的副本收到来自另一个正确副本的消息,当且仅当后者确实向前者发送过该消息。当我们说“广播”时,指的是广播者(若其正确)向包括自身在内的所有副本发送相同的点对点消息。我们采用Dwork等人的部分同步模型:存在一个已知的界限\Delta和一个未知的全局稳定时间GST,在GST之后,两个正确副本之间的所有消息都能在\Delta时间内到达。我们的协议始终保证安全性,并保证在GST之后的有限时间内取得进展(在GST之前保证进展是不可能的)。在实践中,只要系统在GST之后保持足够长时间的稳定(即消息都能在\Delta时间内到达),我们的协议就能取得进展;假设系统永远保持稳定只是为了简化讨论。

Network communication is point-to-point, authenticated and reliable: one correct replica receives a message from another correct replica if and only if the latter sent that message to the former. When we refer to a “broadcast”, it involves the broadcaster, if correct, sending the same point-to-point messages to all replicas, including itself. We adopt the partial synchrony model of Dwork et al. [25], where there is a known bound ∆ and an unknown Global Stabilization Time (GST), such that after GST, all transmissions between two correct replicas arrive within time ∆. Our protocol will ensure safety always, and will guarantee progress within a bounded duration after GST. (Guaranteeing progress before GST is impossible [27].) In practice, our protocol will guarantee progress if the system remains stable (i.e., if messages arrive within ∆ time) for sufficiently long after GST, though assuming that it does so forever simplifies discussion.

加密原语 HotStuff使用了阈值签名方案。在一个(k, n)阈值签名方案中,所有副本共同持有一个公钥,而n个副本又各自持有一个不同的私钥。第i个副本可以用它的私钥为消息m生成一个部分签名\rho_i \leftarrow tsign_i(m)。当|I|=k且每个\rho_i \leftarrow tsign_i(m)时,部分签名\{\rho_i\}_{i \in I}可以用来生成m的数字签名\sigma \leftarrow tcombine(m, \{\rho_i\}_{i\in I})。任何副本都可以用公钥和函数tverify验证该签名。我们要求:若对每个i \in I(|I| = k)有\rho_i \leftarrow tsign_i(m),且\sigma \leftarrow tcombine(m, \{\rho_i\}_{i\in I}),则tverify(m, \sigma)返回真。反之,即使允许敌手以谕示(oracle)方式访问\{tsign_i(\cdot)\}_{i \in [n]\setminus F},只要它向其中严格少于k-f个谕示查询过tsign_i(m),它能伪造出通过验证的签名的概率就可以忽略不计。在本文中,我们使用k=2f+1作为阈值;同样,我们通常在协议描述中将tverify的调用隐去不写。

Cryptographic primitives. HotStuff makes use of threshold signatures [48, 18, 14]. In a (k, n)-threshold signature scheme, there is a single public key held by all replicas, and each of the n replicas holds a distinct private key. The i-th replica can use its private key to contribute a partial signature \rho_i \leftarrow tsign_i(m) on message m. Partial signatures \{\rho_i\}_{i \in I}, where |I|=k and each \rho_i \leftarrow tsign_i(m), can be used to produce a digital signature \sigma \leftarrow tcombine(m, \{\rho_i\}_{i\in I}) on m. Any other replica can verify the signature using the public key and the function tverify. We require that if \rho_i \leftarrow tsign_i(m) for each i \in I, |I| = k, and if \sigma \leftarrow tcombine(m, \{\rho_i\}_{i\in I}), then tverify(m, \sigma) returns true. However, given oracle access to the signing oracles \{tsign_i(\cdot)\}_{i \in [n]\setminus F}, an adversary who queries tsign_i(m) on strictly fewer than k-f of these oracles has negligible probability of producing a signature \sigma for the message m (i.e., such that tverify(m, \sigma) returns true). Throughout this paper, we use a threshold of k = 2f + 1. Again, we will typically leave invocations of tverify implicit in our protocol descriptions.
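下面给出一个(k, n)门限签名接口的玩具式示意(假设性质的代码,并非真正的门限密码学实现:真实系统会使用BLS或RSA门限签名,而这里的“部分签名”只是每个副本用各自密钥计算的HMAC,“合并”也只是把k个部分签名打包),用来说明tsign、tcombine、tverify三个接口如何配合——

import hashlib
import hmac

N, F = 4, 1
K = 2 * F + 1                                          # 阈值 k = 2f + 1
KEYS = [hashlib.sha256(f"replica-{i}".encode()).digest() for i in range(N)]

def tsign(i, m):
    """部分签名 rho_i <- tsign_i(m)(这里用HMAC代替真实的门限签名份额)。"""
    return hmac.new(KEYS[i], m, hashlib.sha256).digest()

def tcombine(m, partials):
    """sigma <- tcombine(m, {rho_i}):这里简单地打包k个部分签名。"""
    assert len(partials) >= K
    return dict(list(partials.items())[:K])

def tverify(m, sigma):
    """当且仅当sigma包含来自k个不同副本的有效部分签名时返回真。"""
    return len(sigma) >= K and all(
        hmac.compare_digest(rho, tsign(i, m)) for i, rho in sigma.items())

msg = b"prepare: view=7, node=h(b)"
qc = tcombine(msg, {i: tsign(i, msg) for i in range(K)})
assert tverify(msg, qc)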

我们还需要一个加密哈希函数h(也称消息摘要函数),它将任意长度的输入映射为固定长度的输出。该哈希函数必须是抗碰撞的,非正式地说,敌手找到两个输入m与m'、使得h(m)=h(m')的概率可以忽略不计。因此,h(m)可以作为协议中消息m的唯一标识符。

We also require a cryptographic hash function h (also called a message digest function), which maps an arbitrary length input to a fixed-length output. The hash function must be collision resistant [46], which informally requires that the probability of an adversary producing inputs m and m' such that h(m) = h(m') is negligible. As such, h(m) can serve as an identifier for a unique input m in the protocol.
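一个很小的示例(示意性质,用SHA-256充当h):任意长度的消息被映射成固定长度的摘要,协议中即可用h(m)作为消息或节点的标识符——

import hashlib

def h(m: bytes) -> str:
    return hashlib.sha256(m).hexdigest()

cmd = b"client command: transfer 10 coins"
print(h(cmd))             # 固定长度(256位)的摘要
assert h(cmd) == h(cmd)   # 相同输入总是得到相同的标识符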

复杂性度量 我们关心的复杂度度量是认证复杂度(authenticator complexity),具体指的是:在GST之后,为达成一次共识决策,所有副本i \in [n]从其他副本处收到的验证器(authenticator)数量之和(同样地,在GST之前,最坏情况下可能根本无法达成共识)。这里,验证器或者是一个部分签名,或者是一个完整签名。认证复杂度是一个有意义的通信复杂度度量,原因有三点:第一,与比特复杂度类似、与消息复杂度不同,认证复杂度隐藏了传输拓扑上不必要的细节,例如,分装在n条消息中的n个验证器与装在1条消息中的n个验证器开销相同。第二,与比特复杂度相比,认证复杂度更适合刻画像我们这样反复达成共识的协议的开销:每次共识决策(以及通往该决策的每个视图)都由一个单调递增的计数器标识,该计数器会无限增长,因此携带它的协议的比特复杂度无法被界定。第三,在实践中,生成或验证数字签名、生成或合并部分签名的密码学操作通常是使用这些操作的协议中计算最密集的部分,因此认证复杂度也能反映协议的计算负担。

Complexity measure. The complexity measure we care about is authenticator complexity, which specifically is the sum, over all replicas i \in [n], of the number of authenticators received by replica i in the protocol to reach a consensus decision after GST. (Again, before GST, a consensus decision might not be reached at all in the worst case [27].) Here, an authenticator is either a partial signature or a signature. Authenticator complexity is a useful measure of communication complexity for several reasons. First, like bit complexity and unlike message complexity, it hides unnecessary details about the transmission topology. For example, n messages carrying one authenticator count the same as one message carrying n authenticators. Second, authenticator complexity is better suited than bit complexity for capturing costs in protocols like ours that reach consensus repeatedly, where each consensus decision (or each view proposed on the way to that consensus decision) is identified by a monotonically increasing counter. That is, because such a counter increases indefinitely, the bit complexity of a protocol that sends such a counter cannot be bounded. Third, since in practice, cryptographic operations to produce or verify digital signatures and to produce or combine partial signatures are typically the most computationally intensive operations in protocols that use them, the authenticator complexity provides insight into the computational burden of the protocol, as well.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Related work

这是《HotStuff:BFT Consensus in the Lens of Blockchain》论文翻译的第二篇,包含了论文的相关工作部分。


2. 相关工作

在拜占庭故障下达成共识的问题,由Lamport等人形式化为“拜占庭将军问题”,“拜占庭故障”这一术语也是他们创造的。该问题的第一个同步解决方案由Pease等人给出,后经Dolev和Strong改进;改进后的协议具有O(n^3)的通信复杂度,Dolev和Reischuk证明了这一复杂度是最优的。Katz和Koo给出了一个使用随机性、基于leader的同步协议,其期望轮数为常数,容错能力为(n-1)/2。

Reaching consensus in face of Byzantine failures was formulated as the Byzantine Generals Problem by Lamport et al. [37], who also coined the term “Byzantine failures”. The first synchronous solution was given by Pease et al. [43], and later improved by Dolev and Strong [24]. The improved protocol has O(n^3) communication complexity, which was shown optimal by Dolev and Reischuk [23]. A leader-based synchronous protocol that uses randomness was given by Katz and Koo [32], showing an expected constant-round solution with (n-1)/2 resilience.

同时,在异步场景下,Fischer等人证明了即使只有单个故障,该问题也无法被确定性地解决。此外,Ben-Or证明了任意异步解决方案的容错上界为(n-1)/3。为规避上述不可能性,人们提出了两种思路。一种依赖随机性,最初由Ben-Or给出:各进程独立地随机抛硬币,直到它们恰好收敛到一致为止。后续工作使用密码学方法共享一个不可预测的“硬币”,将复杂度降低到期望常数轮以及O(n^3)的通信开销。

Meanwhile, in the asynchronous settings, Fischer et al. [27] showed that the problem is unsolvable deterministically in asynchronous setting in face of a single failure. Furthermore, an (n-1)/3 resilience bound for any asynchronous solution was proven by Ben-Or [12]. Two approaches were devised to circumvent the impossibility. One relies on randomness, initially shown by Ben-Or [12], using independently random coin flips by processes until they happen to converge to consensus. Later works used cryptographic methods to share an unpredictable coin and drive complexities down to constant expected rounds, and O(n^3) communication [18].

第二种思路依赖部分同步,最早由Dwork、Lynch和Stockmeyer(DLS)给出。该协议在异步期间保证安全性,而在系统变为同步之后,DLS保证能够终止(达成决定)。一旦同步得以维持,DLS每次决策总共需要O(n^4)的通信开销和O(n)轮。

The second approach relies on partial synchrony, first shown by Dwork, Lynch, and Stockmeyer (DLS) [25]. This protocol preserves safety during asynchronous periods, and after the system becomes synchronous, DLS guarantees termination. Once synchrony is maintained, DLS incurs O(n^4) total communication and O(n) rounds per decision.

状态机复制(SMR)的核心依赖共识来对客户端请求排序,使正确的副本按同一顺序执行它们。SMR中对共识的反复需求促使Lamport设计了Paxos协议,该协议提供了一条高效的流水线:一个稳定的leader只需一轮往返和线性的通信开销即可推动决策。出于类似的考虑,Castro和Liskov开发了一个高效的、基于leader的拜占庭SMR协议,即PBFT。在PBFT中,稳定的leader每次决策需要O(n^2)的通信开销和两轮往返,而更换leader的协议则需要O(n^3)的通信开销。PBFT已被部署到多个系统中,其中包括BFT-SMaRt。Kotla等人在PBFT的基础上引入了一条乐观的线性路径,得到了Zyzzyva协议,该协议被用于Upright、Byzcoin等多个系统。乐观路径具有线性复杂度,但更换leader的协议仍然是O(n^3)。Abraham等人随后指出Zyzzyva存在安全性漏洞,并提出了修正方案。另一方面,为了进一步降低协议本身的复杂度,Song等人提出了Bosco,这是一个简单的单步协议,在乐观路径上延迟很低,但需要5f+1个副本。SBFT引入了一个O(n^2)通信复杂度的视图变换(view-change)协议,支持稳定leader在乐观情况下以线性开销、单轮往返完成决策。它利用两种方法降低通信复杂度:Reiter提出的基于收集者(collector)的通信范式,以及Cachin等人提出的通过门限密码(threshold cryptography)对协议投票进行签名合并的方法。

State machine replication relies on consensus at its core to order client requests so that correct replicas execute them in this order. The recurring need for consensus in SMR led Lamport to devise Paxos [36], a protocol that operates an efficient pipeline in which a stable leader drives decisions with linear communication and one roundtrip. A similar emphasis led Castro and Liskov [20, 21] to develop an efficient leader-based Byzantine SMR protocol named PBFT, whose stable leader requires O(n^2) communication and two round-trips per decision, and the leader replacement protocol incurs O(n^3) communication. PBFT has been deployed in several systems, including BFTSMaRt [13]. Kotla et al. introduced an optimistic linear path into PBFT in a protocol named Zyzzyva [34], which was utilized in several systems, e.g., Upright [22] and Byzcoin [33]. The optimistic path has linear complexity, while the leader replacement protocol remains O(n^3). Abraham et al. [4] later exposed a safety violation in Zyzzyva, and presented fixes [5, 30]. On the other hand, to also reduce the complexity of the protocol itself, Song et al. proposed Bosco [49], a simple one-step protocol with low latency on the optimistic path, requiring 5f + 1 replicas. SBFT [30] introduces an O(n^2) communication view-change protocol that supports a stable leader protocol with optimistically linear, one round-trip decisions. It reduces the communication complexity by harnessing two methods: a collector based communication paradigm by Reiter [45], and signature combining via threshold cryptography on protocol votes by Cachin et al. [18].

Ramasamy等人提出了一个基于leader、使用随机化的拜占庭SMR协议;Miller等人则提出了一个无leader的变种,称为HoneyBadgerBFT。这些随机化的拜占庭解决方案在其核心使用随机化的异步拜占庭共识,其已知最优的通信复杂度为O(n^3)(见上文),并通过批处理来摊销成本。然而,最近,基于本文HotStuff中的思想,与本文同期投递PODC’19的另一篇论文进一步将通信复杂度降低到了O(n^2)。

A leader-based Byzantine SMR protocol that employs randomization was presented by Ramasamy et al. [44], and a leaderless variant named HoneyBadgerBFT was developed by Miller et al. [39]. At their core, these randomized Byzantine solutions employ randomized asynchronous Byzantine consensus, whose best known communication complexity was O(n^3) (see above), amortizing the cost via batching. However, most recently, based on the idea in this HotStuff paper, a parallel submission to PODC’19 [8] further improves the communication complexity to O(n^2).

比特币的核心是一个称为中本聪共识(Nakamoto Consensus)的协议,这是一个同步协议,只提供概率意义上的安全性保证,且不具有最终性(finality)。该协议运行在无许可(permissionless)模型中,即参与者是未知的,其容错能力通过工作量证明来维持。正如上文所述,近期的区块链解决方案以各种方式将工作量证明与经典BFT协议混合。在这些混合方案及其他方案中处理轮换leader的需求,为HotStuff提供了设计动机。

Bitcoin’s core is a protocol known as Nakamoto Consensus [40], a synchronous protocol with only probabilistic safety guarantee and no finality (see analysis in [28, 41, 6]). It operates in a permissionless model where participants are unknown, and resilience is kept via Proof-of-Work. As described above, recent blockchain solutions hybridize Proof-of-Work solutions with classical BFT solutions in various ways [26, 33, 7, 17, 29, 31, 42]. The need to address rotating leaders in these hybrid solutions and others provide the motivation behind HotStuff.

【论文翻译】HotStuff:BFT Consensus in the Lens of Blockchain – Introduction

这是《HotStuff:BFT Consensus in the Lens of Blockchain》论文翻译的第一篇,包含了论文的摘要和简介部分。


摘要

在这篇文章中,我们将介绍HotStuff,一个适用于部分同步(partially synchronous)模型、基于leader的拜占庭容错复制协议(Byzantine fault-tolerant replication protocol, BFT)。一旦网络通信变为同步,HotStuff就能让一个正确的leader以实际(而非最大)网络延迟的节奏推动协议达成共识——这一性质称为响应性(responsiveness)——并且通信复杂度与副本(replica)数量成线性关系。据我们所知,HotStuff是第一个同时具备这些性质的部分同步BFT复制协议。HotStuff构建在一个新颖的框架之上,该框架在经典BFT基础与区块链之间架起了桥梁,使得其他已知协议(DLS、PBFT、Tendermint、Casper)以及我们的协议都可以在同一框架下表达。

【副本的概念是由状态机复制引入的。全异步指的是消息传输延迟不存在上限;半同步(部分同步)指的是存在已知上限\Delta,但它只在某个未知的全局稳定时间GST之后才有保证;全同步指的是已知上限始终成立】

We present HotStuff, a leader-based Byzantine fault-tolerant replication protocol for the partially synchronous model. Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) network delay—a property called responsiveness—and with communication complexity that is linear in the number of replicas. To our knowledge, HotStuff is the first partially synchronous BFT replication protocol exhibiting these combined properties. HotStuff is built around a novel framework that forms a bridge between classical BFT foundations and blockchains. It allows the expression of other known protocols (DLS, PBFT, Tendermint, Casper), and ours, in a common framework.

在一个副本超过100个的网络中,我们部署的HotStuff取得了与BFT-SMaRt相当的吞吐量与延迟,并且在leader更替(failover)时只有线性的通信开销(相比之下,BFT-SMaRt为三次方级别,即O(n^3))。

Our deployment of HotStuff over a network with over 100 replicas achieves throughput and latency comparable to that of BFT-SMaRt, while enjoying linear communication footprint during leader failover (vs. cubic with BFTSMaRt).

1. 简介

拜占庭容错(Byzantine fault tolerance, BFT)指的是一个计算系统在其组件出现任意(即拜占庭)故障时,仍能执行对系统运转至关重要的操作的能力。从状态机复制(State Machine Replication, SMR)的角度看,整个系统提供一个被复制的服务,其状态被镜像到n个确定性的副本上。BFT SMR协议用来保证未出错的副本对客户端提交的指令的执行顺序达成一致,即使有f个拜占庭副本从中作梗;这进而保证n-f个未出错的副本会以相同方式执行指令,并对每条指令产生相同的响应。与通常的做法一致,我们在此考虑部分同步通信模型(Partially Synchronous Communication Model),即消息传输存在一个已知的上限\Delta,但它只在某个未知的全局稳定时间(Global Stabilization Time, GST)之后才成立。在该模型中,要让未出错的副本对相同的指令顺序达成一致,需要n \geq 3f+1,并且只有在GST之后才能确定性地保证取得进展。

【这里的f意思是可以允许的“恶意”拜占庭节点的个数,具体推导细节可以看本文Model章节的Cryptographic primitives部分】

Byzantine fault tolerance (BFT) refers to the ability of a computing system to endure arbitrary (i.e., Byzantine) failures of its components while taking actions critical to the system’s operation. In the context of state machine replication (SMR) [35, 47], the system as a whole provides a replicated service whose state is mirrored across n deterministic replicas. A BFT SMR protocol is used to ensure that non-faulty replicas agree on an order of execution for client-initiated service commands, despite the efforts of f Byzantine replicas. This, in turn, ensures that the n−f non-faulty replicas will run commands identically and so produce the same response for each command. As is common, we are concerned here with the partially synchronous communication model [25], whereby a known bound ∆ on message transmission holds after some unknown global stabilization time (GST). In this model, n \geq 3f + 1 is required for non-faulty replicas to agree on the same commands in the same order (e.g., [12]) and progress can be ensured deterministically only after GST [27].

在BFT SMR协议最初被构想出来时,典型的目标系统规模是n=4或n=7个副本,并部署在本地网络中。然而,区块链的出现让BFT重新受到关注,也要求解决方案能够扩展到大得多的n。与比特币所依赖的无许可(permissionless)区块链相反,所谓许可制(permissioned)区块链由一组固定的副本共同维护一个有序的指令账本,换句话说,这些副本实现的就是SMR。尽管是许可制的,这类系统设想的副本数量仍可达数百甚至数千个。此外,把它们部署到广域网络还要求把\Delta设置得足以容纳更大的通信延迟波动。

【区块链往往看重是怎么能够得到一个高吞吐量的账本,现在比特币那种10分钟一次记账频率事实上是完全不达标的】

When BFT SMR protocols were originally conceived, a typical target system size was n = 4 or n = 7, deployed on a local-area network. However, the renewed interest in Byzantine fault-tolerance brought about by its application to blockchains now demands solutions that can scale to much larger n. In contrast to permissionless blockchains such as the one that supports Bitcoin, for example, so-called permissioned blockchains involve a fixed set of replicas that collectively maintain an ordered ledger of commands or, in other words, that support SMR. Despite their permissioned nature, numbers of replicas in the hundreds or even thousands are envisioned (e.g., [42, 30]). Additionally, their deployment to wide-area networks requires setting \Delta to accommodate higher variability in communication delays.

扩展性挑战 PBFT的引入标志着部分同步模型中第一个实用的BFT复制解决方案的诞生,此后大量BFT方案都围绕其核心的两阶段范式构建。从实用角度看,一个稳定的leader只需两轮消息交换就能推动一次共识决策:第一阶段通过形成一个由(n-f)张投票组成的仲裁证书(Quorum Certificate, QC)来保证提案的唯一性;第二阶段则保证下一任leader能够说服各副本为一个安全的提案投票。

The scaling challenge. Since the introduction of PBFT [20], the first practical BFT replication solution in the partial synchrony model, numerous BFT solutions were built around its core two-phase paradigm. The practical aspect is that a stable leader can drive a consensus decision in just two rounds of message exchanges. The first phase guarantees proposal uniqueness through the formation of a quorum certificate (QC) consisting of (n−f) votes. The second phase guarantees that the next leader can convince replicas to vote for a safe proposal.

新leader收集信息并向副本提出提案的过程称为视图变换(view change),它是复制协议的核心所在。不幸的是,基于两阶段范式的视图变换远非简单,而且容易出错,即使在中等规模的系统中也会带来可观的通信开销。它要求新leader转发来自(n-f)个副本的信息,每个副本都报告自己已知的最高QC。即使只统计认证器(数字签名或消息认证码)的数量,在PBFT中传达一个新提案的通信开销也达到O(n^3)个认证器;那些通过阈值数字签名(threshold digital signatures)把多个认证器合并为一个的变体,仍然需要发送O(n^2)个认证器。如果在达成单次共识决策之前发生O(n)次视图变换,那么PBFT传输的认证器总数为O(n^4),即使使用阈值签名也有O(n^3)。这一扩展性挑战不仅困扰着PBFT,也困扰着此后开发的许多协议,例如Prime、Zyzzyva、Upright、BFT-SMaRt、700BFT和SBFT。

The algorithm for a new leader to collect information and propose it to replicas—called a view-change—is the epicenter of replication. Unfortunately, view-change based on the two-phase paradigm is far from simple [38], is bug-prone [4], and incurs a significant communication penalty for even moderate system sizes. It requires the new leader to relay information from (n-f) replicas, each reporting its own highest known QC. Even counting just authenticators (digital signatures or message authentication codes), conveying a new proposal has a communication footprint of O(n^3) authenticators in PBFT, and variants that combine multiple authenticators into one via threshold digital signatures (e.g., [18, 30]) still send O(n^2) authenticators. The total number of authenticators transmitted if O(n) view-changes occur before a single consensus decision is reached is O(n^4) in PBFT, and even with threshold signatures is O(n^3). This scaling challenge plagues not only PBFT, but many other protocols developed since then, e.g., Prime [9], Zyzzyva [34], Upright [22], BFT-SMaRt [13], 700BFT [11], and SBFT [30].
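为了直观感受这一差距,下面是一个忽略常数因子、只看数量级的粗略估算(n=100为假设的示例值):

n = 100:\quad n^2 = 10^4,\qquad n^3 = 10^6,\qquad n^4 = 10^8

即:PBFT单次视图变换约为10^6个认证器,使用门限签名的变体约为10^4个;若在达成一次决策前发生O(n)次视图变换,则两者分别累计约10^8与10^6个。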

HotStuff围绕一个三阶段核心展开,允许新leader直接选取它所知道的最高QC。它引入了第二个阶段,让副本在该阶段投票之后还能“改变主意”,而完全不需要leader提供证明。这缓解了上述通信复杂度,同时大大简化了leader更替协议。最后,由于各个阶段(几乎)完全同质化,HotStuff很容易流水线化,并支持频繁地轮换leader。

HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. It introduces a second phase that allows replicas to “change their mind” after voting in the phase, without requiring a leader proof at all. This alleviates the above complexity, and at the same time considerably simplifies the leader replacement protocol. Last, having (almost) canonized all the phases, it is very easy to pipeline HotStuff, and to frequently rotate leaders.

据我们所知,只有Tendermint和Casper等区块链领域的BFT协议采用了如此简单的leader机制。然而,这些系统都围绕一个同步内核构建:提案按照预先确定的间隔提出,该间隔必须容纳消息在广域点对点gossip网络中传播所需的最坏情况时间。这样做,它们就放弃了大多数实用BFT SMR解决方案(包括上面列举的那些)的一个标志性特性,即乐观响应性(Optimistic Responsiveness)。非正式地,响应性(responsiveness)要求一个非故障的leader一旦被指定,就能在仅取决于实际消息延迟的时间内推动协议达成共识,而与消息传输延迟的任何已知上限无关。更适合我们模型的是乐观响应性,它放宽了要求,只在有利(且希望是常见)的情况下——这里指GST到达之后——要求响应性。无论乐观与否,Tendermint/Casper这类设计都与响应性无缘。问题的症结在于,可能存在一个持有最高QC的诚实副本,而leader对此一无所知。我们可以构造出让这一问题无限阻碍进展的场景(第4.4节将给出一个详细的活性反例)。事实上,若未能在关键协议步骤中纳入必要的延迟,系统可能会彻底失去活性,这在一些现有部署的报告中已有记录。

To our knowledge, only BFT protocols in the blockchain arena like Tendermint [15, 16] and Casper [17] follow such a simple leader regime. However, these systems are built around a synchronous core, wherein proposals are made in pre-determined intervals that must accommodate the worst-case time it takes to propagate messages over a wide-area peer-to-peer gossip network. In doing so, they forego a hallmark of most practical BFT SMR solutions (including those listed above), namely optimistic responsiveness [42]. Informally, responsiveness requires that a non-faulty leader, once designated, can drive the protocol to consensus in time depending only on the actual message delays, independent of any known upper bound on message transmission delays [10]. More appropriate for our model is optimistic responsiveness, which requires responsiveness only in beneficial (and hopefully common) circumstances—here, after GST is reached. Optimistic or not, responsiveness is precluded with designs such as Tendermint/Casper. The crux of the difficulty is that there may exist an honest replica that has the highest QC, but the leader does not know about it. One can build scenarios where this prevents progress ad infinitum (see Section 4.4 for a detailed liveless scenario). Indeed, failing to incorporate necessary delays at crucial protocol steps can result in losing liveness outright, as has been reported in several existing deployments, e.g., see [3, 2, 19].

我们的贡献点 据我们所知,HotStuff是第一个同时具备下列两个特性的BFT SMR协议——

  • 线性视图变换:在GST之后,任何正确的leader一旦被指定,只需发送O(n)个认证器(authenticator)即可推动一次共识决策,其中包括leader刚被更换的情况。因此,在GST之后,即使leader接连发生故障(级联故障),达成共识的通信成本在最坏情况下也只是O(n^2)个认证器。
  • 乐观响应性:在GST之后,任何正确的leader一旦被指定,只需等待最先到达的n-f个响应,就能保证构造出一个能够取得进展的提案,其中同样包括leader刚被更换的情况。

Our contributions. To our knowledge, we present the first BFT SMR protocol, called HotStuff, to achieve the following two properties:

  • Linear View Change: After GST, any correct leader, once designated, sends only O(n) authenticators to drive a consensus decision. This includes the case where a leader is replaced. Consequently, communication costs to reach consensus after GST is O(n^2) authenticators in the worst case of cascading leader failures.
  • Optimistic Responsiveness: After GST, any correct leader, once designated, needs to wait just for the first n-f responses to guarantee that it can create a proposal that will make progress. This includes the case where a leader is replaced.

HotStuff的另一个特征是,一个新的leader节点尝试达成共识的协议通信开销不大于当前leader节点。这意味着HotStuff支持频繁的leader节点更替,这样更能够保证链的质量,在区块链领域是非常重要的。

Another feature of HotStuff is that the costs for a new leader to drive the protocol to consensus is no greater than that for the current leader. As such, HotStuff supports frequent succession of leaders, which has been argued is useful in blockchain contexts for ensuring chain quality [28].

HotStuff通过为每个视图增加一个额外的阶段来实现上述特性,以很小的延迟代价换来了大幅简化的leader更替协议。这个额外阶段只引入实际网络延迟的开销,而实际延迟在实践中通常远小于\Delta。因此,我们预计这一额外延迟远小于以往协议为实现线性视图变换而放弃响应性所付出的代价。此外,得益于第5节介绍的高效流水线,吞吐量并不会受到影响。

HotStuff achieves these properties by adding another phase to each view, a small price to latency in return for considerably simplifying the leader replacement protocol. This exchange incurs only the actual network delays, which are typically far smaller than \Delta in practice. As such, we expect this added latency to be much smaller than that incurred by previous protocols that forgo responsiveness to achieve linear view-change. Furthermore, throughput is not affected due to the efficient pipeline we introduce in Section 5.

在理论贡献之外,HotStuff还为理解一般意义上的BFT复制、以及在实践中实例化协议提供了洞察(见第6节):

  • 一个基于节点图(graph of nodes)的BFT复制框架:安全性通过投票规则和提交规则来规定,活跃性则由一个向图中追加新节点的Pacemaker单独提供。
  • 在该框架下对若干已知协议(DLS、PBFT、Tendermint和Casper)以及我们的新协议HotStuff的统一刻画。

In addition to the theoretical contribution, HotStuff also provides insights in understanding BFT replication in general and instantiating the protocol in practice (see Section 6):

  • A framework for BFT replication over graphs of nodes. Safety is specified via voting and commit graph rules.
  • Liveness is specified separately via a Pacemaker that extends the graph with new nodes.
  • A casting of several known protocols (DLS, PBFT, Tendermint, and Casper) and one new (ours, HotStuff), in this framework.

HotStuff的另一个优点是极其简单,这部分归功于其机制上的精简:协议只有两类消息,以及一组简单的规则来决定副本如何处理每条消息。安全性通过基于节点图的投票规则和提交规则来规定;实现活跃性所需的机制则封装在Pacemaker中,与安全性机制干净地分离。同时,它的表达能力很强:只需少量改动,就能在该框架中表示若干已知协议(DLS、PBFT、Tendermint和Casper)。这种灵活性部分来源于它在节点图上的运作方式,从而在经典BFT基础与现代区块链之间架起了一座桥梁。

HotStuff has the additional benefit of being remarkably simple, owing in part to its economy of mechanism: There are only two message types and simple rules to determine how a replica treats each. Safety is specified via voting and commit rules over graphs of nodes. The mechanisms needed to achieve liveness are encapsulated within a Pacemaker, cleanly separated from the mechanisms needed for safety. At the same time, it is expressive in that it allows the representation of several known protocols (DLS, PBFT, Tendermint, and Casper) as minor variations. In part this flexibility derives from its operation over a graph of nodes, in a way that forms a bridge between classical BFT foundations and modern blockchains.
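下面用一段示意性的Python草图勾勒这两类消息与Pacemaker接口的大致形态(字段名与方法名是我们为了说明而假设的,并非论文中的伪代码)——

from dataclasses import dataclass
from typing import Optional

@dataclass
class QC:
    """仲裁证书:对(视图, 节点)的(n-f)张投票合并出的门限签名。"""
    view: int
    node_hash: str
    sig: bytes

@dataclass
class Proposal:
    """第一类消息:leader向副本广播的提案,携带其所知的最高QC。"""
    view: int
    node_hash: str
    justify: Optional[QC]

@dataclass
class Vote:
    """第二类消息:副本发回的投票,附带一个部分签名。"""
    view: int
    node_hash: str
    partial_sig: bytes

class Pacemaker:
    """活跃性机制的接口草图,与安全性规则分离。"""
    def current_view(self) -> int: ...
    def leader_of(self, view: int) -> int: ...
    def advance_view(self) -> None: ...   # 超时或完成决策后推进视图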

我们描述了HotStuff的一个原型实现并进行了初步评估。在一个拥有一百多个副本的网络上,HotStuff取得的吞吐量与延迟可以媲美、有时甚至超过BFT-SMaRt等成熟系统,而这些系统的代码复杂度远高于HotStuff。我们进一步演示了,在频繁更换leader的情况下,HotStuff的通信开销保持不变,而BFT-SMaRt的开销则随副本数量呈二次方增长。

We describe a prototype implementation and a preliminary evaluation of HotStuff. Deployed over a network with over a hundred replicas, HotStuff achieves throughput and latency comparable to, and sometimes exceeding, those of mature systems such as BFT-SMaRt, whose code complexity far exceeds that of HotStuff. We further demonstrate that the communication footprint of HotStuff remains constant in face of frequent leader replacements, whereas BFT-SMaRt grows quadratically with the number of replicas.

表1 GST后各协议的表现对比

D-Cube: Dense-Block Detection in Terabyte-Scale Tensors 阅读笔记

背景

密集块

本文直接略过了Dense-Block的含义,这里稍微提一提[1]——

假设你拥有一个点评网站的评分数据集,里面记录了“某个时刻,某个用户,对某个页面,进行了点评”。怎么找到数据集里面的虚假点评呢?

研究发现,“点评者相同、点评时间相同、点评页面相同”,意味着这些点评可能是一个群聚虚假点评!

如果把所有的数据,放到一个K维的立方体中,上述虚假点评会组成一个“块”,块内的权值比其他位置更为“密集”。这就是密集块(Dense-Block)的含义。

计算机储存的层次结构

计算机内存访问速度快、容量较小;与之相反计算机磁盘访问速度慢、容量较大。这意味着,对于小数据,我们的程序都会将他放在内存中以获得较快的访问速度。但是倘若数据量增大,那么数据就不得不被放在更低层级的储存结构上(如硬盘),而访问速度会变慢约1000倍之多。

简介

在欺诈检测领域,Dense-Block检测被证明非常有效。但迄今为止,已有的Dense-Block检测方法都默认数据集足够小、可以完整放入内存;一旦数据量变大,这类算法就会产生大量磁盘IO,变得非常低效。

本文提出了D-Cube,一个基于磁盘的最密集块检测算法,该算法以最小化磁盘IO为目标进行优化,并支持Hadoop的MapReduce框架进行分布式运算。D-Cube有如下的特征——

  • 储存高效:与同类算法相比,D-Cube在相同数据集上使用的内存减小到约1/1600,能够处理的数据规模则扩大到1000倍。
  • 快速:相比于State-Of-Art方法,D-Cube在真实世界张量上有约5倍加速,在人工合成张量上有约12倍加速。
  • 准确率可靠:具有可证明的近似保证,并且在真实世界张量上与State-Of-Art方法给出的结果高度相似。
  • 有效性:能够在TCP-Dump数据集中以高准确率检测出网络攻击。

问题描述

指标定义

该问题的形式定义相对比较复杂,完整的定义可以参见原文。

以下图为例,我们首先有一个关系R(A_1, \dots, A_N, X)的元组集合,其中A_1, A_2, \dots, A_N是N个维度属性,X是对应的非负指标(measure)。例如在Wikipedia revision history数据集中,R(user, page, date, count)的一个元组表示用户u在日期d修订了页面p共c次。显然,关系R可以表示成一个N维张量(立方体),如图b所示。进一步地,从R的每个维度各取出一个标签子集,就构成一个块B。对于B和R,我们都把其质量M定义为其中所有元组的指标值t[X]之和。

上图是Dense-Block计算的例子:a)描述了一个关系R;b)描述了R的张量(Tensor)表示,以及其中的一个块B(alice, 1),其质量M(B(alice, 1)) = 9。

块密度

研究者发现,定义块密度对于异常检测非常有用,以下是两种块密度的表示方法——

算术块密度

\rho_{ari}(B,R) = \rho_{ari}(M_B, \{|B_n|\}_{n=1}^{N}, M_R, \{|R_n|\}_{n=1}^{N}) = \frac{M_B}{\frac{1}{N}\sum_{n=1}^{N}|B_n|}

几何块密度

\rho_{geo}(B,R) = \rho_{geo}(M_B, \{|B_n|\}_{n=1}^{N}, M_R, \{|R_n|\}_{n=1}^{N}) = \frac{M_B}{\left(\prod_{n=1}^{N}|B_n|\right)^{\frac{1}{N}}}

需要注意的是,这里的块密度并不是块内数值的简单加权平均,而是块的总质量除以块各维度大小的平均值(算术密度取算术平均,几何密度取几何平均)。
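以一个小例子来体会这两种密度(块的大小为假设的数值,仅作说明):设N=2,M_B=9,|B_1|=1,|B_2|=3,则

\rho_{ari}(B,R) = \frac{9}{\frac{1}{2}(1+3)} = 4.5, \qquad \rho_{geo}(B,R) = \frac{9}{(1\times 3)^{\frac{1}{2}}} \approx 5.2

可见同一个块在两种度量下数值不同,但都随“质量大、尺寸小”而增大。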

最后定义块的可疑性(Suspiciousness)

(1)   \begin{align*}\rho_{susp}(B,R) &= \rho_{susp}(M_B, \{|B_n|\}_{n=1}^{N}, M_R, \{|R_n|\}_{n=1}^{N}) \\&= M_B \left(\log\frac{M_B}{M_R}-1\right) + M_R \prod_{n=1}^{N} \frac{|B_n|}{|R_n|} - M_B \log\left(\prod_{n=1}^{N}\frac{|B_n|}{|R_n|}\right)\end{align*}

问题定义

大规模前k密集块检测问题
给定:a)一个无法全部放入内存的大规模关系R,b)一个计算块密度的函数\rho
求解:k个互不相同且密度尽可能大的块。

方法

首先是D-Cube的算法框架:该框架以迭代方式运行,每一轮调用find\_single\_block得到一个密集块,然后把落在该块中的元组从数据集中删除(见下方的代码草图)——

  • 计算R的质量M_R作为后续find_single_block调用的参数
  • 调用find_single_block 得到一个密集块B
  • 令R←R – B
  • 进入下一个迭代

可以看出,D-Cube作为外层框架,会多次调用块选择算法find\_single\_block。

D-Cube算法框架,通过迭代,每次找到一个块,并从R中删除掉,直到成功选取k个块为止。
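下面是外层框架的一个极简内存版草图(示意性质:真实的D-Cube基于磁盘文件,这里仅演示控制流程;元组表示为((a_1, ..., a_N), x),find_single_block见后文的草图)——

def mass(tuples):
    """块/关系的质量:所有元组指标值之和。"""
    return sum(x for _, x in tuples)

def dcube(tuples, k, find_single_block):
    """反复调用find_single_block取出k个密集块,并把块内元组从R中删除。"""
    R = list(tuples)
    blocks = []
    for _ in range(k):
        if not R:
            break
        B = find_single_block(R)        # 每个维度一个标签集合
        blocks.append(B)
        R = [(t, x) for t, x in R       # R <- R - B:删除完全落在块内的元组
             if not all(t[n] in B[n] for n in range(len(B)))]
    return blocks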

接下来是重点:如何在R中找到一个密度\rho尽量大的块B。这一步同样使用迭代法:先令B为整个R,每一轮挑选一个维度i,计算该维度上每个标签对应的质量,删掉那些拖低整体密度的标签,并在收缩过程中记录密度最高的块,直到所有标签都被删完为止,最后返回记录到的最优块。大体思路如下(列表之后附一段简化的代码草图)——

  • 选择一个维度i(如标签数量最大的维度)
  • 遍历维度i的所有标签,计算每个标签对应的质量Mass
  • 一定会有一部分标签的Mass小于平均值,那么这些标签一定拖了总体密度的后腿,可尝试删去(实现细节有所不同)
  • 如此反复,直到各个维度的标签都处理完毕,返回收缩过程中记录到的密度最高的块。
find_single_block,该函数目的在于从R中选取一个尽量密集的块。
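下面是find_single_block的一个简化的内存版草图(示意性质:真实算法基于磁盘文件,且对标签的删除顺序与密度快照的维护更精细,这里只保留“删掉质量不高于平均值的标签、记录密度最高的快照”这一核心想法)——

from collections import defaultdict

def mass(tuples):
    return sum(x for _, x in tuples)

def arithmetic_density(m_b, sizes):
    avg = sum(sizes) / len(sizes)
    return m_b / avg if avg > 0 else 0.0

def find_single_block(R):
    N = len(R[0][0])
    B = [set(t[n] for t, _ in R) for n in range(N)]   # 初始块 = 整个R
    cur = list(R)
    best_B = [set(s) for s in B]
    best_d = arithmetic_density(mass(cur), [len(s) for s in B])
    while any(B):
        n = max(range(N), key=lambda d: len(B[d]))    # 选标签数最多的维度
        masses = defaultdict(float)
        for t, x in cur:
            masses[t[n]] += x
        avg = mass(cur) / len(B[n])
        drop = {a for a in B[n] if masses[a] <= avg} or set(B[n])
        B[n] -= drop                                  # 删除质量不高于平均值的标签
        cur = [(t, x) for t, x in cur if t[n] in B[n]]
        d = arithmetic_density(mass(cur), [len(s) for s in B])
        if d > best_d:
            best_d, best_B = d, [set(s) for s in B]
    return best_B

把它和前面的外层草图组合,即dcube(R, k, find_single_block),就能得到k个块。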

最后,维度的选择也有若干策略,但它们本质上只决定了贪心删除的顺序,对算法的整体结构没有太大影响。

效率优化

注意到所有数据都是以文件的形式储存的,这意味着

  • 每次访问数据,都需要从文件中把数据给读出来。
  • 每次修改数据,都需要先从文件中读取数据,将修改后的数据写入临时文件,最后将临时文件移动覆盖原文件。

文章提到了两种效率优化的方式——

合并磁盘访问步骤:磁盘IO的总量可以通过合并多个步骤的磁盘访问来降低。其核心思路是:如果某一步需要完整扫描一遍磁盘数据,而相邻的一步又要读写同一份数据,那么可以把两次遍历合并成一次,省掉其中最耗时的一遍完整磁盘扫描。该优化能将磁盘访问量减至原来的30%。

使用内存缓存:将访问最频繁的张量数据缓存到内存中,能带来约3倍的提速。

以D-Cube伪代码中的line 5和line 8为例(合并方式见下面的步骤列表及其后的代码草图)——

line5:计算R中所有权值的和;Line8:依据计算出来的R中的块B,将B的所有标签从R中删除
  • 在执行line8的时候
    • 从磁盘文件[disk_value_current]中读入权值数据
    • 依照Block的指示,将那些存在于Block的权重置0
    • 置0的同时,将权重写入文件[disk_value_temp]
  • 顺便执行下一轮迭代的line5
    • 写回数据过程中,顺便对新的权重求和
  • 使用[disk_value_temp]覆盖[disk_value_current]完成数据更新
  • 额外的,对于数据前面num_of_buffer条数据,直接缓存到内存中。
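下面是这一合并思路的示意代码(文件名[disk_value_current]、[disk_value_temp]以及“每行一条a_1,...,a_N,x记录”的文件格式都是为了演示而做的假设):在把块内条目置0并写回的同一趟扫描里,顺便累加出下一轮line 5需要的总质量,从而省掉一次完整扫描——

import os

def rewrite_and_sum(current_path, temp_path, in_block):
    """in_block(attrs)判断某条记录是否落在待删除的块B内。"""
    next_mass = 0.0
    with open(current_path) as src, open(temp_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            *attrs, x = line.rstrip("\n").split(",")
            x = 0.0 if in_block(attrs) else float(x)   # line 8:块内权值置0
            next_mass += x                             # 顺便完成下一轮的line 5
            dst.write(",".join(attrs + [str(x)]) + "\n")
    os.replace(temp_path, current_path)                # 用临时文件覆盖原文件
    return next_mass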

复杂度分析

具体分析从略,可以证明最坏情况下的时间复杂度为O(kN^2|R|L),其中——

  • k表示要寻找的密集块个数
  • N表示数据的维度数
  • |R|表示数据集大小(元组个数)
  • L = \max_{n \in \{1,\dots,N\}}|R_n|,即各维度标签数的最大值

除最坏情况外,在实际数据上的实验表明,运行时间与k、N、|R|近似线性相关,与L呈次线性相关。

准确率分析

可以证明,算法返回的块的密度至少是最优块密度的1/N(换言之,最优密度至多是返回块密度的N倍),即——

设B^*为最大化\rho_{ari}(B,R)的块,\tilde{B}为算法的返回值,则有\rho_{ari}(\tilde{B},R) \geq \frac{1}{N} \rho_{ari}(B^*,R),证明略。

MapReduce实现

MapReduce框架适用于数据储存在分布式文件系统中的情况。实际上,本文基本是把最经典的MapReduce模型直接套用进来,创新点在于“用了”而非“如何用”。

例如,要计算R中每一个标签对应的质量(列表之后附一段示意代码)——

  • [MAP]将所有需要计算的标签分发到所有分布式Client节点
  • [PROCESS]分布式节点计算每一个标签的质量
  • [REDUCE]将计算的结果汇总到Master节点
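下面用一个单机的玩具模拟来说明这类作业的形态(示意性质:这里假设map阶段对每个元组产出((维度, 标签), 权值)键值对,reduce阶段按键求和;真实实现运行在Hadoop上、数据存放在分布式文件系统中)——

from collections import defaultdict

def map_phase(tuples):
    for attrs, x in tuples:
        for n, label in enumerate(attrs):
            yield (n, label), x          # 产出((维度, 标签), 权值)

def reduce_phase(pairs):
    masses = defaultdict(float)
    for key, x in pairs:
        masses[key] += x                 # 按(维度, 标签)求和
    return masses

R = [(("alice", "page1"), 3.0), (("alice", "page2"), 6.0), (("bob", "page1"), 1.0)]
print(dict(reduce_phase(map_phase(R))))  # 每个(维度, 标签)的质量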

参考文献

[1] M-Zoom: Fast dense-block detection in tensors with quality guarantees.
[原文] Shin, K., Hooi, B., Kim, J., & Faloutsos, C. (2017, February). D-cube: Dense-block detection in terabyte-scale tensors. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (pp. 681-689).