We consider a system consisting of a fixed set of n = 3f + 1 replicas, indexed by i ∈ [n] where [n] = {1, . . . , n}. A set F ⊂ [n] of up to f = |F| replicas are Byzantine faulty, and the remaining ones are correct. We will often refer to the Byzantine replicas as being coordinated by an adversary, which learns all internal state held by these replicas (including their cryptographic keys, see below).
Network communication is point-to-point, authenticated and reliable: one correct replica receives a message from another correct replica if and only if the latter sent that message to the former. When we refer to a “broadcast”, it involves the broadcaster, if correct, sending the same point-to-point messages to all replicas, including itself. We adopt the partial synchrony model of Dwork et al. [25], where there is a known bound ∆ and an unknown Global Stabilization Time (GST), such that after GST, all transmissions between two correct replicas arrive within time ∆. Our protocol will ensure safety always, and will guarantee progress within a bounded duration after GST. (Guaranteeing progress before GST is impossible [27].) In practice, our protocol will guarantee progress if the system remains stable (i.e., if messages arrive within ∆ time) for sufficiently long after GST, though assuming that it does so forever simplifies discussion.
Cryptographic primitives. HotStuff makes use of threshold signatures [48, 18, 14]. In a (k, n)-threshold signature scheme, there is a single public key held by all replicas, and each of the n replicas holds a distinct private key. The i-th replica can use its private key to contribute a partial signature ρ_i ← tsign_i(m) on message m. Partial signatures {ρ_i}_{i∈I}, where |I| = k and each ρ_i ← tsign_i(m), can be used to produce a digital signature σ ← tcombine(m, {ρ_i}_{i∈I}) on m. Any other replica can verify the signature using the public key and the function tverify. We require that if ρ_i ← tsign_i(m) for each i ∈ I, |I| = k, and if σ ← tcombine(m, {ρ_i}_{i∈I}), then tverify(m, σ) returns true. However, given access to the signing oracles {tsign_i(·)}_{i∈[n]\F}, an adversary who queries tsign_i(m) on strictly fewer than k − f of these oracles has negligible probability of producing a signature σ for the message m (i.e., such that tverify(m, σ) returns true). Throughout this paper, we use a threshold of k = 2f + 1. Again, we will typically leave invocations of tverify implicit in our protocol descriptions.
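To make this interface concrete, the following minimal Python sketch mirrors the tsign/tcombine/tverify API shape assumed above; the names, types, and the stand-in “signature” values are all hypothetical, and no real cryptographic library is implied.

from typing import List

N, F = 4, 1    # n = 3f + 1 replicas
K = 2 * F + 1  # threshold k = 2f + 1, as used throughout the paper

class PartialSig:
    """Placeholder for a partial signature contributed by one replica."""
    def __init__(self, signer: int, msg: bytes):
        self.signer, self.msg = signer, msg

def tsign(i: int, msg: bytes) -> PartialSig:
    """Replica i contributes a partial signature on msg (no real crypto)."""
    return PartialSig(i, msg)

def tcombine(msg: bytes, parts: List[PartialSig]):
    """Combine partial signatures from k distinct signers into one signature."""
    signers = {p.signer for p in parts if p.msg == msg}
    if len(signers) < K:
        raise ValueError("need partial signatures from k distinct replicas")
    return ("SIG", msg)

def tverify(msg: bytes, sig) -> bool:
    """Any replica checks the combined signature under the single public key."""
    return sig == ("SIG", msg)

assert tverify(b"m", tcombine(b"m", [tsign(i, b"m") for i in range(K)]))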
We also require a cryptographic hash function h (also called a message digest function), which maps an arbitrary-length input to a fixed-length output. The hash function must be collision resistant [46], which informally requires that the probability of an adversary producing inputs m and m′ such that h(m) = h(m′) is negligible. As such, h(m) can serve as an identifier for a unique input m in the protocol.
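For illustration, any standard collision-resistant hash can play the role of h; below, SHA-256 from Python’s hashlib serves as a stand-in for deriving fixed-length identifiers.

import hashlib

def node_id(m: bytes) -> bytes:
    """h(m): a fixed-length digest that can identify the input m."""
    return hashlib.sha256(m).digest()

assert len(node_id(b"arbitrary-length input")) == 32  # fixed-length output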
Complexity measure. The complexity measure we care about is authenticator complexity, which specifically is the sum, over all replicas i ∈ [n], of the number of authenticators received by replica i in the protocol to reach a consensus decision after GST. (Again, before GST, a consensus decision might not be reached at all in the worst case [27].) Here, an authenticator is either a partial signature or a signature. Authenticator complexity is a useful measure of communication complexity for several reasons. First, like bit complexity and unlike message complexity, it hides unnecessary details about the transmission topology. For example, n messages carrying one authenticator count the same as one message carrying n authenticators. Second, authenticator complexity is better suited than bit complexity for capturing costs in protocols like ours that reach consensus repeatedly, where each consensus decision (or each view proposed on the way to that consensus decision) is identified by a monotonically increasing counter. That is, because such a counter increases indefinitely, the bit complexity of a protocol that sends such a counter cannot be bounded. Third, since in practice, cryptographic operations to produce or verify digital signatures and to produce or combine partial signatures are typically the most computationally intensive operations in protocols that use them, the authenticator complexity provides insight into the computational burden of the protocol, as well.
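As a toy illustration of the measure (the message representation below is made up), the tally depends only on how many authenticators each replica receives, not on how they are packed into messages:

from typing import Iterable, Tuple

# Each received message is modeled as (receiving replica, #authenticators carried).
def authenticator_complexity(received: Iterable[Tuple[int, int]]) -> int:
    """Sum, over all replicas i, of the authenticators received by replica i."""
    return sum(count for _receiver, count in received)

n = 4
scatter = [(i, 1) for i in range(n)]  # n messages, one authenticator each
bundle = [(0, n)]                     # one message carrying n authenticators
assert authenticator_complexity(scatter) == authenticator_complexity(bundle) == n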
Reaching consensus in the face of Byzantine failures was formulated as the Byzantine Generals Problem by Lamport et al. [37], who also coined the term “Byzantine failures”. The first synchronous solution was given by Pease et al. [43], and later improved by Dolev and Strong [24]. The improved protocol has O(n³) communication complexity, which was shown optimal by Dolev and Reischuk [23]. A leader-based synchronous protocol that uses randomness was given by Katz and Koo [32], showing an expected constant-round solution with (n − 1)/2 resilience.
Meanwhile, in the asynchronous setting, Fischer et al. [27] showed that the problem is unsolvable deterministically in the face of even a single failure. Furthermore, an (n − 1)/3 resilience bound for any asynchronous solution was proven by Ben-Or [12]. Two approaches were devised to circumvent this impossibility. One relies on randomness, initially shown by Ben-Or [12], using independently random coin flips by processes until they happen to converge to consensus. Later works used cryptographic methods to share an unpredictable coin and drive complexities down to constant expected rounds and O(n³) communication [18].
The second approach relies on partial synchrony, first shown by Dwork, Lynch, and Stockmeyer (DLS) [25]. This protocol preserves safety during asynchronous periods, and after the system becomes synchronous, DLS guarantees termination. Once synchrony is maintained, DLS incurs O(n⁴) total communication and O(n) rounds per decision.
State machine replication relies on consensus at its core to order client requests so that correct replicas execute them in this order. The recurring need for consensus in SMR led Lamport to devise Paxos [36], a protocol that operates an efficient pipeline in which a stable leader drives decisions with linear communication and one round-trip. A similar emphasis led Castro and Liskov [20, 21] to develop an efficient leader-based Byzantine SMR protocol named PBFT, whose stable leader requires O(n²) communication and two round-trips per decision, and whose leader replacement protocol incurs O(n³) communication. PBFT has been deployed in several systems, including BFT-SMaRt [13]. Kotla et al. introduced an optimistic linear path into PBFT in a protocol named Zyzzyva [34], which was utilized in several systems, e.g., Upright [22] and Byzcoin [33]. The optimistic path has linear complexity, while the leader replacement protocol remains O(n³). Abraham et al. [4] later exposed a safety violation in Zyzzyva and presented fixes [5, 30]. On the other hand, to also reduce the complexity of the protocol itself, Song et al. proposed Bosco [49], a simple one-step protocol with low latency on the optimistic path, requiring 5f + 1 replicas. SBFT [30] introduces an O(n²) communication view-change protocol that supports a stable leader protocol with optimistically linear, one round-trip decisions. It reduces the communication complexity by harnessing two methods: a collector-based communication paradigm by Reiter [45], and signature combining via threshold cryptography on protocol votes by Cachin et al. [18].
A leader-based Byzantine SMR protocol that employs randomization was presented by Ramasamy et al. [44], and a leaderless variant named HoneyBadgerBFT was developed by Miller et al. [39]. At their core, these randomized Byzantine solutions employ randomized asynchronous Byzantine consensus, whose best known communication complexity was O(n³) (see above), amortizing the cost via batching. However, most recently, based on the idea in this HotStuff paper, a parallel submission to PODC’19 [8] further improves the communication complexity to O(n²).
Bitcoin’s core is a protocol known as Nakamoto Consensus [40], a synchronous protocol with only probabilistic safety guarantees and no finality (see analyses in [28, 41, 6]). It operates in a permissionless model where participants are unknown, and resilience is maintained via Proof-of-Work. As described above, recent blockchain solutions hybridize Proof-of-Work with classical BFT solutions in various ways [26, 33, 7, 17, 29, 31, 42]. The need to address rotating leaders in these hybrid solutions and others provides the motivation behind HotStuff.
We present HotStuff, a leader-based Byzantine fault-tolerant replication protocol for the partially synchronous model. Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) network delay—a property called responsiveness—and with communication complexity that is linear in the number of replicas. To our knowledge, HotStuff is the first partially synchronous BFT replication protocol exhibiting these combined properties. HotStuff is built around a novel framework that forms a bridge between classical BFT foundations and blockchains. It allows the expression of other known protocols (DLS, PBFT, Tendermint, Casper), and ours, in a common framework.
Our deployment of HotStuff over a network with over 100 replicas achieves throughput and latency comparable to those of BFT-SMaRt, while enjoying a linear communication footprint during leader failover (vs. cubic with BFT-SMaRt).
Introduction
Byzantine fault tolerance (BFT) refers to the ability of a computing system to endure arbitrary (i.e., Byzantine) failures of its components while taking actions critical to the system’s operation. In the context of state machine replication (SMR) [35, 47], the system as a whole provides a replicated service whose state is mirrored across n deterministic replicas. A BFT SMR protocol is used to ensure that non-faulty replicas agree on an order of execution for client-initiated service commands, despite the efforts of f Byzantine replicas. This, in turn, ensures that the n − f non-faulty replicas will run commands identically and so produce the same response for each command. As is common, we are concerned here with the partially synchronous communication model [25], whereby a known bound ∆ on message transmission holds after some unknown global stabilization time (GST). In this model, n ≥ 3f + 1 is required for non-faulty replicas to agree on the same commands in the same order (e.g., [12]), and progress can be ensured deterministically only after GST [27].
When BFT SMR protocols were originally conceived, a typical target system size was n = 4 or n = 7, deployed on a local-area network. However, the renewed interest in Byzantine fault tolerance brought about by its application to blockchains now demands solutions that can scale to much larger n. In contrast to permissionless blockchains such as the one that supports Bitcoin, for example, so-called permissioned blockchains involve a fixed set of replicas that collectively maintain an ordered ledger of commands or, in other words, that support SMR. Despite their permissioned nature, numbers of replicas in the hundreds or even thousands are envisioned (e.g., [42, 30]). Additionally, their deployment to wide-area networks requires setting ∆ to accommodate higher variability in communication delays.
The scaling challenge. Since the introduction of PBFT [20], the first practical BFT replication solution in the partial synchrony model, numerous BFT solutions have been built around its core two-phase paradigm. The practical appeal is that a stable leader can drive a consensus decision in just two rounds of message exchanges. The first phase guarantees proposal uniqueness through the formation of a quorum certificate (QC) consisting of n − f votes. The second phase guarantees that the next leader can convince replicas to vote for a safe proposal.
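To make the QC notion concrete, the following sketch (hypothetical types; vote signatures elided) forms a certificate from n − f matching votes:

from typing import List, Optional

N, F = 4, 1
QUORUM = N - F  # a QC aggregates n - f matching votes

class Vote:
    def __init__(self, voter: int, view: int, digest: bytes):
        self.voter, self.view, self.digest = voter, view, digest

def form_qc(votes: List[Vote]) -> Optional[dict]:
    """Return a QC if n - f distinct replicas voted for the same (view, digest)."""
    targets = {(v.view, v.digest) for v in votes}
    voters = {v.voter for v in votes}
    if len(targets) == 1 and len(voters) >= QUORUM:
        view, digest = targets.pop()
        return {"view": view, "digest": digest, "voters": voters}
    return None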
The algorithm for a new leader to collect information and propose it to replicas—called a view-change—is the epicenter of replication. Unfortunately, view-change based on the two-phase paradigm is far from simple [38], is bug-prone [4], and incurs a significant communication penalty for even moderate system sizes. It requires the new leader to relay information from n − f replicas, each reporting its own highest known QC. Even counting just authenticators (digital signatures or message authentication codes), conveying a new proposal has a communication footprint of O(n³) authenticators in PBFT, and variants that combine multiple authenticators into one via threshold digital signatures (e.g., [18, 30]) still send O(n²) authenticators. The total number of authenticators transmitted if O(n) view-changes occur before a single consensus decision is reached is O(n⁴) in PBFT, and even with threshold signatures is O(n³). This scaling challenge plagues not only PBFT, but many other protocols developed since then, e.g., Prime [9], Zyzzyva [34], Upright [22], BFT-SMaRt [13], 700BFT [11], and SBFT [30].
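These growth rates can be sanity-checked by counting authenticators directly (a back-of-the-envelope sketch; constants are dropped and only the asymptotics are meaningful):

def pbft_proposal_footprint(n: int) -> int:
    # The new leader relays ~n reports, each carrying a QC of ~n partial
    # signatures, to each of n replicas: O(n^3) authenticators received.
    return n * n * n

def threshold_proposal_footprint(n: int) -> int:
    # Threshold signatures collapse each reported QC into one authenticator,
    # but ~n reports still reach each of n replicas: O(n^2).
    return n * n

def worst_case_per_decision(n: int, proposal_footprint) -> int:
    # Up to O(n) view-changes may precede a single decision:
    # O(n^4) for PBFT, O(n^3) with threshold signatures.
    return n * proposal_footprint(n)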
HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. It introduces a second phase that lets replicas “change their mind” after voting in the first phase, without requiring a leader proof at all. This alleviates the above complexity and at the same time considerably simplifies the leader replacement protocol. Last, because all phases are (almost) identical, it is very easy to pipeline HotStuff and to frequently rotate leaders.
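Schematically, one view under a correct HotStuff leader then proceeds as follows (a heavily simplified, hypothetical sketch; collect_quorum and all names are illustrative stand-ins, not the protocol specification developed later in the paper):

def collect_quorum(phase: str, item) -> dict:
    """Stand-in for broadcasting `item` and awaiting n - f matching votes."""
    return {"phase": phase, "certifies": item}

def leader_view(highest_qc) -> dict:
    # The new leader simply extends the highest QC it knows of; no proof
    # of that QC's "highness" is demanded from the replicas.
    proposal = ("node", highest_qc)
    prepare_qc = collect_quorum("prepare", proposal)      # phase 1: uniqueness
    locked_qc = collect_quorum("pre-commit", prepare_qc)  # phase 2: voters may
                                                          # still safely change their mind
    commit_qc = collect_quorum("commit", locked_qc)       # phase 3: then decide
    return commit_qc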
To our knowledge, only BFT protocols in the blockchain arena, such as Tendermint [15, 16] and Casper [17], follow such a simple leader regime. However, these systems are built around a synchronous core, wherein proposals are made in pre-determined intervals that must accommodate the worst-case time it takes to propagate messages over a wide-area peer-to-peer gossip network. In doing so, they forgo a hallmark of most practical BFT SMR solutions (including those listed above), namely optimistic responsiveness [42]. Informally, responsiveness requires that a non-faulty leader, once designated, can drive the protocol to consensus in time depending only on the actual message delays, independent of any known upper bound on message transmission delays [10]. More appropriate for our model is optimistic responsiveness, which requires responsiveness only in beneficial (and hopefully common) circumstances—here, after GST is reached. Optimistic or not, responsiveness is precluded by designs such as Tendermint/Casper. The crux of the difficulty is that there may exist an honest replica that has the highest QC, but the leader does not know about it. One can build scenarios where this prevents progress ad infinitum (see Section 4.4 for a detailed liveness-violation scenario). Indeed, failing to incorporate necessary delays at crucial protocol steps can result in losing liveness outright, as has been reported in several existing deployments, e.g., see [3, 2, 19].
Our contributions. To our knowledge, we present the first BFT SMR protocol, called HotStuff, to achieve the following two properties:
Linear View Change: After GST, any correct leader, once designated, sends only O(n) authenticators to drive a consensus decision. This includes the case where a leader is replaced. Consequently, the communication cost to reach consensus after GST is O(n²) authenticators in the worst case of cascading leader failures.
Optimistic Responsiveness: After GST, any correct leader, once designated, needs to wait just for the first n − f responses to guarantee that it can create a proposal that will make progress. This includes the case where a leader is replaced.
Another feature of HotStuff is that the cost for a new leader to drive the protocol to consensus is no greater than that for the current leader. As such, HotStuff supports frequent succession of leaders, which has been argued to be useful in blockchain contexts for ensuring chain quality [28].
HotStuff achieves these properties by adding another phase to each view, a small price in latency in return for considerably simplifying the leader replacement protocol. This exchange incurs only the actual network delays, which are typically far smaller than ∆ in practice. As such, we expect this added latency to be much smaller than that incurred by previous protocols that forgo responsiveness to achieve a linear view-change. Furthermore, throughput is not affected, due to the efficient pipeline we introduce in Section 5.
In addition to its theoretical contribution, HotStuff also provides insight into BFT replication in general and into instantiating the protocol in practice (see Section 6):
A framework for BFT replication over graphs of nodes. Safety is specified via voting and commit graph rules.
Liveness is specified separately via a Pacemaker that extends the graph with new nodes.
A casting of several known protocols (DLS, PBFT, Tendermint, and Casper) and one new protocol (ours, HotStuff) in this framework.
HotStuff has the additional benefit of being remarkably simple, owing in part to its economy of mechanism: There are only two message types and simple rules to determine how a replica treats each. Safety is specified via voting and commit rules over graphs of nodes. The mechanisms needed to achieve liveness are encapsulated within a Pacemaker, cleanly separated from the mechanisms needed for safety. At the same time, it is expressive in that it allows the representation of several known protocols (DLS, PBFT, Tendermint, and Casper) as minor variations. In part this flexibility derives from its operation over a graph of nodes, in a way that forms a bridge between classical BFT foundations and modern blockchains.
We describe a prototype implementation and a preliminary evaluation of HotStuff. Deployed over a network with over a hundred replicas, HotStuff achieves throughput and latency comparable to, and sometimes exceeding, those of mature systems such as BFT-SMaRt, whose code complexity far exceeds that of HotStuff. We further demonstrate that the communication footprint of HotStuff remains constant in the face of frequent leader replacements, whereas that of BFT-SMaRt grows quadratically with the number of replicas.