October 2019

Volume 34 Number 10

[Blockchain]

Exploring Blockchain Consensus

By Erik Zhang, John deVadoss | October 2019

Blockchain platforms have led to incredible advances in designing and developing decentralized applications and systems, and have been applied toward domains ranging from cryptocurrencies to enterprise supply chains. While the applications are vast, they’re all based on a core set of design patterns that advance the state of the art in the theory and practice of distributed systems.

A blockchain is a monotonically increasing list of records (or blocks) that are linked together using cryptographic techniques. Blocks consist of valid transactions that are hashed and encoded into a Merkle tree, and each block contains a cryptographic hash of the previous block in the chain. This ensures the integrity of the blockchain and enables relatively inexpensive verification and independent audit of the transactions in each block and across the chain. A blockchain is intrinsically public and tamper-evident: altering an existing block would invalidate the hash stored in every block that follows it.
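
To make that linkage concrete, here’s a small, hedged sketch; it doesn’t mirror any production blockchain’s header format, and the BlockLink type and its method are purely illustrative. It shows how each block’s hash covers the previous block’s hash, so that changing an earlier block changes every hash downstream.

using System.Security.Cryptography;
using System.Text;

// Toy illustration of the hash linkage between blocks: the hash of each
// block covers the previous block's hash, so tampering with an earlier
// block invalidates every block that follows it.
public static class BlockLink
{
  public static byte[] ComputeBlockHash(byte[] previousBlockHash, string transactionData)
  {
    using (SHA256 sha256 = SHA256.Create())
    {
      byte[] payload = Encoding.UTF8.GetBytes(transactionData);
      byte[] combined = new byte[previousBlockHash.Length + payload.Length];
      previousBlockHash.CopyTo(combined, 0);
      payload.CopyTo(combined, previousBlockHash.Length);
      return sha256.ComputeHash(combined);
    }
  }
}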

Core to the blockchain is the model of the ledger, an unalterable, append-only log of the transactions that take place across various entities. To maintain the integrity of the ledger, the various entities need a way to “agree” or to reach consensus on which set of incremental transactions (or blocks) are to be appended to the ledger.

The consensus problem is a well-known and fundamental computer science problem in the coordination and control of multi-entity systems. A simplistic approach is of course for all the entities to agree on a majority value. However, one or more faulty entities can skew the outcome, resulting in consensus that is unachievable or incorrect.

In this article, we explore the topic of consensus for blockchains, and share a practical, real-world implementation built on the .NET Core platform using C# that’s used by the NEO blockchain, the Binance Exchange and other organizations.

Let’s start by looking at blockchain platforms, which are programmable blockchains that enable developers to envision and build truly decentralized applications. These can span all manner of markets, including financial markets, gaming, enterprise consortiums, sports, health care networks, sovereign identities, real estate and other asset markets and more. Blockchain platforms such as Ethereum and NEO serve as decentralized application platforms that provide the foundation for a new application model for developers.

At their core, blockchain platforms are distributed systems, building on a foundation of theory and practice that spans decades of computer science research. While there are many recurring patterns and principles, blockchain platforms have revolutionized the theory of distributed systems in how we deal with trust. In the next section, we drill down further into distributed systems and their implementation using the well-known state machine model in computer science.

Distributed Systems and the State Machine Approach

Distributed systems share a core set of special characteristics, as explored in the field of theoretical computer science. These include:

Concurrency: Multiple activities across the distributed system may be executed simultaneously and independently of each other. This implies that there’s a need for coordination across the different flows of execution.

Independent failure modes: Multiple components across the distributed system may fail independently of each other.

No global time: Multiple flows of execution may be aligned with spatially independent local clocks. Even in the event that these clocks are initially synchronized, clock drift will eventually result. This implies that time and event ordering is a core challenge in distributed systems.

Communications delay: There’s an inherent lag in how events and their side effects propagate through the distributed system.

Inconsistent state: Concurrency, independent failure modes, and communications delays together imply that the view of any state will not be consistent throughout the distributed system.

Collectively, these characteristics require that distributed systems be designed to be fault-tolerant, in order to continue to operate in the event of one or more faults (or complete failure) of one or more subsystems.

The state machine approach is a general method for implementing a fault-tolerant distributed system by replicating services and coordinating client interactions across service replicas. A replicated state machine is deterministic: it consists of a set of state variables that encode its state, and transactions that can cause the machine to transition from one valid state to the next valid state. Each transaction is executed atomically and deterministically. Essentially, a replicated state machine is a distributed set of services where all the services start with the same initial state and then agree (that is, reach consensus) on each of the subsequent state transitions.
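
As a minimal sketch of the state machine model (the ReplicatedLedger and Transfer types below are hypothetical and exist only for illustration), consider a deterministic ledger where the next state depends solely on the current state and the transaction being applied:

using System.Collections.Generic;

// Hypothetical sketch of a deterministic state machine: every replica that
// applies the same ordered log of transactions ends in exactly the same state.
public sealed class ReplicatedLedger
{
  private readonly Dictionary<string, long> balances = new Dictionary<string, long>();

  // Applying a transaction is deterministic and atomic: it either performs
  // the full transfer or leaves the state untouched.
  public bool Apply(Transfer tx)
  {
    balances.TryGetValue(tx.From, out long fromBalance);
    if (fromBalance < tx.Amount)
      return false; // Invalid transition; state is unchanged.
    balances[tx.From] = fromBalance - tx.Amount;
    balances.TryGetValue(tx.To, out long toBalance);
    balances[tx.To] = toBalance + tx.Amount;
    return true;
  }
}

public sealed class Transfer
{
  public string From { get; set; }
  public string To { get; set; }
  public long Amount { get; set; }
}

Because Apply is deterministic, two replicas that agree on the order of transactions will agree on every subsequent state; the consensus problem is precisely the problem of agreeing on that order.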

Consensus Across Replicated State Machines

Formally, the goal of a consensus algorithm is to satisfy three key properties. These are:

Termination: All non-faulty services in the system eventually decide on some output value. This is often referred to as liveness.

Integrity: If all of the non-faulty services propose the same output value, then any non-faulty service must decide on the same output value. A weaker form of integrity is one where the output value must equal a value that was proposed by some non-faulty service (not necessarily all of them).

Agreement: All non-faulty services in the system eventually decide on the same output value. This is often referred to as safety.

Distributed systems theory has made tremendous leaps in the understanding of consensus algorithms, but consensus in a completely asynchronous distributed system has proven impossible to achieve in the presence of even a single faulty service. This is called the FLP impossibility, named after the researchers (Michael J. Fischer, Nancy Lynch and Michael S. Paterson) who proved this definitive bound on what’s possible to achieve with distributed processes in an asynchronous environment.

The FLP impossibility has spawned research spanning two innovative approaches. One set of algorithms relies on the so-called Nakamoto consensus. It applies an unconventional approach that relies on non-determinism to address the inherent scale challenges in attempting to generate consensus in a distributed system. The brilliance of the Nakamoto consensus is that rather than every service agreeing on a value, the algorithm focuses on all of the services agreeing on the probability of the value being correct. However, this results in probabilistic agreement—that is, the lack of deterministically finalizing a value at every state transition creates a situation where there’s no guarantee of true finality. This leads to the so-called forking scenario with respect to the distributed system. For this reason, we’ll ignore the Nakamoto consensus for the remainder of this article.

A second set of practical fault-tolerant consensus algorithms relies on some degree of synchrony in order to make progress. What this means is that some protocols are designed to work in unreliable networks (such as, say, the Internet) that drop messages and may introduce arbitrary delays, while other protocols are optimized for highly reliable network channels. These protocols are said to operate under different sets of synchrony assumptions. Synchrony assumptions may be explicit, or implicit by relying on leader election algorithms, for instance. Consensus algorithms that are based on leader election are often grouped into the Paxos family of algorithms.

Byzantine Fault Tolerant Consensus

Byzantine failures pose a challenge for leader-based consensus algorithms. These failures occur when components or sub-components of a distributed system fail, and there’s imperfect information about whether a component (or sub-component) has actually failed. Algorithmic proofs exist to demonstrate that a malicious leader can’t cause inconsistency, but distributed systems theory has yet to demonstrate that a malicious leader can’t prevent progress.

The so-called practical BFT (pBFT) algorithm by Castro and Liskov was the first attempt to describe an algorithm by which the system can detect lack of progress and choose a new leader. pBFT was devised to address the twin flaws in previous attempts—either the algorithm was too slow to be of practical use or synchrony had to be assumed to satisfy the “agreement” property.

The pBFT algorithm demonstrated that it could provide both liveness and safety as long as no more than (n - 1) / 3 services were faulty in the distributed system. pBFT cycles through a succession of “views,” with each view having one primary service acting as the leader and the remaining services acting as backups. At a conceptual level, the pBFT algorithm works as follows (a small sketch of the final reply-matching step appears after the list):

  1. The client sends a request to the primary (leader) service.
  2. The primary (leader) service broadcasts the request to all of the backup services.
  3. The primary and the backup services perform the work requested and then send back a reply to the client.
  4. The request is served successfully when the client receives m+1 responses from the different services spanning the distributed system with the same result, where m is the maximum number of faulty services allowed.
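
As a rough illustration of step 4 (none of the names below come from a real pBFT library; this is just the m + 1 threshold expressed in code):

using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating pBFT step 4: the client accepts a result
// only once m + 1 services have returned the same value, where m is the
// maximum number of faulty services tolerated.
public static class ReplyMatcher
{
  public static bool TryAccept(IEnumerable<string> replies, int m, out string result)
  {
    result = replies
      .GroupBy(r => r)
      .Where(g => g.Count() >= m + 1)
      .Select(g => g.Key)
      .FirstOrDefault();
    return result != null;
  }
}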

The primary (leader) service is changed during every view (round of consensus) and may be substituted if a predefined quantity of time has passed without the leader broadcasting a request to the backups. As long as the leader is non-faulty, pBFT works reasonably well; however, the process of replacing a faulty leader is highly inefficient.

pBFT improved on the existing theory, but in practice it’s not suitable for real-world scenarios due to its inherent scalability challenges and its inability to distinguish malicious behavior from transient communication faults.

Delegated Byzantine Fault Tolerance

Enter Delegated Byzantine Fault Tolerance (dBFT), which was proposed by Erik Zhang, founder of the NEO blockchain, in 2014. dBFT extends pBFT concepts to the state machine replication scenario, and provides the first practical, public access to fast single-block finality (about 15 seconds). dBFT is now in use by the NEO blockchain, the Binance Exchange and other major platforms globally.

The key innovation in Zhang’s proposal was to distinguish between consensus nodes (services that can participate in the consensus algorithm to propose new state changes and vote) and ordinary nodes (services that can execute the atomic transactions and transition state, but don’t take part in the consensus algorithm and can’t propose new state changes). In doing so, dBFT became the first practical BFT algorithm to function at scale, addressing the challenges inherent in pBFT.

The C# implementation of dBFT is available in the public domain (MIT License) on GitHub at bit.ly/2Zl1Sem.

The dBFT algorithm comprises three distinct phases: Pre-Prepare, Prepare and Persist. Before we explore each of these phases, let’s take a moment to clarify terminology and the algorithmic steps involved.

N: The number of active consensus nodes

f: The number of Byzantine (that is, malicious) nodes, with f being no more than (N - 1) / 3

v: The current view number (each view is a new round or attempt at consensus)

b: The proposed block of atomic transactions, the execution of which transitions the system to the next valid state

p: The index of the speaker, that is, the leader for this view, which proposes the block. The speaker and the remaining delegates together comprise the N consensus nodes.
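
The list above doesn’t spell out how the speaker index p is chosen. In NEO’s implementation the speaker rotates deterministically with the block height and view number (roughly p = (h - v) mod N), so every node can compute the same speaker without extra communication. The sketch below captures that idea with hypothetical names; it is not the production code.

// Hypothetical sketch of a rotating-speaker rule in the spirit of dBFT:
// the speaker index depends only on the block height and the view number.
public static class SpeakerSelection
{
  public static int GetSpeakerIndex(uint blockHeight, byte viewNumber, int consensusNodeCount)
  {
    int p = ((int)blockHeight - viewNumber) % consensusNodeCount;
    return p >= 0 ? p : p + consensusNodeCount;
  }
}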

At a conceptual level, dBFT comprises the following steps:

  1. A cryptographically signed transaction is “broadcast” by a client to the nodes in the distributed system.
  2. The N consensus nodes receive the transactions and collect them into their in-memory pools of transactions.
  3. For the current view, the unique speaker p packages the transactions from the memory pool into a new proposal for block b. Figure 1 illustrates the MakePrepareRequest method and Figure 2 illustrates the SendPrepareRequest method.
  4. The N-1 remaining delegates receive the new proposed block b, verify it and then broadcast the verification. For brevity, the code snippets for the remaining steps aren’t included inline and are on GitHub at the link provided earlier.
  5. Any consensus node, on receiving at least (N - f) verifications, reaches consensus and then publishes the new block (a minimal sketch of this threshold check appears after the list).
  6. Any node, on receiving a new block, removes the block’s transactions from its in-memory pool and, if it’s a consensus node, starts the next round of consensus.
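
Steps 4 and 5 hinge on the (N - f) threshold. The sketch below is not the NEO implementation (the QuorumTracker type and its members are hypothetical); it simply shows the counting logic that decides when enough distinct delegates have verified the proposal.

using System.Collections.Generic;

// Hypothetical sketch of the (N - f) quorum check from steps 4 and 5:
// a consensus node commits to publishing the block only after enough
// distinct nodes have verified the speaker's proposal.
public sealed class QuorumTracker
{
  private readonly HashSet<int> verifications = new HashSet<int>();
  private readonly int n; // Total consensus nodes.
  private readonly int f; // Maximum Byzantine nodes tolerated: (n - 1) / 3.

  public QuorumTracker(int consensusNodeCount)
  {
    n = consensusNodeCount;
    f = (consensusNodeCount - 1) / 3;
  }

  // Returns true once at least N - f distinct nodes have verified the block.
  public bool AddVerification(int nodeIndex)
  {
    verifications.Add(nodeIndex);
    return verifications.Count >= n - f;
  }
}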

Figure 1 The MakePrepareRequest Method

public ConsensusPayload MakePrepareRequest()
{
  // Generate a random nonce for the proposed block.
  byte[] buffer = new byte[sizeof(ulong)];
  random.NextBytes(buffer);
  // Take verified transactions from the memory pool, up to the per-block
  // limit defined by the native Policy contract.
  List<Transaction> transactions =
    Blockchain.Singleton.MemPool.GetSortedVerifiedTransactions()
    .Take((int)NativeContract.Policy.GetMaxTransactionsPerBlock(Snapshot))
    .ToList();
  TransactionHashes = transactions.Select(p => p.Hash).ToArray();
  Transactions = transactions.ToDictionary(p => p.Hash);
  // The block timestamp must be strictly greater than the previous header's.
  Block.Timestamp = Math.Max(TimeProvider.Current.UtcNow.ToTimestampMS(),
    PrevHeader.Timestamp + 1);
  Block.ConsensusData.Nonce = BitConverter.ToUInt64(buffer, 0);
  // Sign the prepare request and store it as this node's own preparation payload.
  return PreparationPayloads[MyIndex] = MakeSignedPayload(new PrepareRequest
  {
    Timestamp = Block.Timestamp,
    Nonce = Block.ConsensusData.Nonce,
    TransactionHashes = TransactionHashes
  });
}

Figure 2 The SendPrepareRequest Method

private void SendPrepareRequest()
{
  Log($"send prepare request: height={context.Block.Index} view={context.ViewNumber}");
  // Broadcast the signed prepare request to the other consensus nodes.
  localNode.Tell(new LocalNode.SendDirectly { Inventory = context.MakePrepareRequest() });
  // With a single validator there is nothing to wait for; check preparations immediately.
  if (context.Validators.Length == 1)
    CheckPreparations();
  // Announce the hashes of the proposed transactions so peers can fetch any they're missing.
  if (context.TransactionHashes.Length > 0)
  {
    foreach (InvPayload payload in InvPayload.CreateGroup(InventoryType.TX,
      context.TransactionHashes))
      localNode.Tell(Message.Create(MessageCommand.Inv, payload));
  }
  // Arm the view-change timer; the timeout grows exponentially with the view number.
  ChangeTimer(TimeSpan.FromMilliseconds((Blockchain.MillisecondsPerBlock <<
    (context.ViewNumber + 1)) - (context.ViewNumber == 0 ?
    Blockchain.MillisecondsPerBlock : 0)));
}

With all that settled, let’s return to the three phases of the dBFT algorithm. They are:

Pre-Prepare: The speaker for the view is responsible for broadcasting the Prepare-Request message to the delegates and for initiating the new proposed block of transactions.

Prepare: On receiving the Pre-Prepare message, the delegates broadcast the Prepare-Response message if the proposed block is successfully verified. On receiving (N - f) successful block verifications, the consensus nodes enter the next phase.

Persist: The nodes publish a new block and enter the next round of consensus.

Figure 3 The dBFT Consensus Algorithm Phases

In the event that consensus isn’t reached in a given round (view), the consensus nodes may propose a view change, selecting a new speaker (leader) and restarting the attempt at consensus; the view change takes effect after at least (N - f) nodes have proposed a change to the exact same view number. The wait time before proposing a view change increases exponentially in order to avoid frequent view changes and to ensure consensus within practical time bounds.
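
That exponential back-off is visible in the ChangeTimer call at the end of Figure 2, where the base block time is shifted left by the view number plus one. The helper below (a sketch, not part of the NEO codebase) reproduces that expression; with NEO’s roughly 15-second block time it yields about 15 seconds for view 0, 60 seconds for view 1, 120 seconds for view 2, and so on.

using System;

// Sketch of the view-change timeout schedule implied by Figure 2:
// timeout(v) = blockTime << (v + 1), minus one block time in view 0.
public static class ViewTimeouts
{
  public static TimeSpan ViewTimeout(uint millisecondsPerBlock, byte viewNumber)
  {
    long timeout = ((long)millisecondsPerBlock << (viewNumber + 1)) -
      (viewNumber == 0 ? millisecondsPerBlock : 0);
    return TimeSpan.FromMilliseconds(timeout);
  }
}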

The first version of dBFT was susceptible to a single-block fork due to network latency in certain edge cases. Because nodes could time out after sending a PrepareResponse message, nodes could time out and change views at slightly different times. If only one consensus node didn’t time out and that node had already received 2f PrepareResponse messages, it would generate a valid block while the other consensus nodes moved on to the next view. There, those consensus nodes could in theory reach consensus and sign another block of transactions at the same height. While this scenario could transpire without blocking consensus, one or more nodes could accept the forked block and stall.

dBFT 2.0 addressed this by adding a new commit phase. To prevent potential stalling, dBFT 2.0 also augments the consensus algorithm with a recovery message implementation. This recovery mechanism has the additional benefit of significantly improving block times in cases where network latency degrades because the network is compromised.

Wrapping Up

Distributed systems have been fundamental in transforming the computing industry, and by extension how we conduct commerce globally and how we universally engage as a community. The emergence of blockchains has spurred developers to study and to scrutinize well-established principles and paradigms in distributed systems, and in doing so has catalyzed a wave of innovation that continues to break new ground in how developers build the next generation of software applications.

In this article we’ve focused on the topic of consensus and illustrated a pioneering new approach with the dBFT consensus algorithm. While we used C# to illustrate the dBFT algorithm for consensus, we’ve implemented the dBFT algorithm in C++, Python, TypeScript and Go, to name a few languages. We hope this article provides developers with a better understanding of blockchain consensus and enables them to use and to build on top of the pioneering dBFT algorithm.


Erik Zhang is founder and core developer of NEO, author of the dBFT consensus mechanism, an expert on blockchain technology and computer security, and a certified information systems auditor (CISA). Previously he worked at Shanda Games and huobi.com, where he specialized in information security and R&D in digital currency.

John deVadoss leads development for NEO Global Development in Seattle. Previously he built Microsoft Digital, .NET Patterns & Practices, and .NET Architecture Strategy. He also incubated Microsoft Azure when he was at Microsoft. Most recently, he launched two machine learning startups. deVadoss did his Ph.D. work in machine learning, specializing in recurrent neural networks.

Thanks to the following technical experts for reviewing this article: Chuan Lu, Harry Pierson

