You have decided to use clustering in designing or modifying an infrastructure tier to provide highly available services.
How should you design a highly available infrastructure tier that protects against loss of service due to the failure of a single server or the software that it hosts?
As you design a highly available infrastructure tier, consider the following forces:
The failure of hardware components, applications, or services can render an application unusable or unavailable. For example, suppose that a server delivering an application experiences a power supply failure. If this is the only server, or the only power supply in the server, a single point of failure exists and the application becomes unavailable.
Planned server downtime can affect application availability. For example, if you want to update the operating system on a database server for which there is no standby server, you might have to bring down the application to patch the server.
Monitoring and maintaining multiserver tiers increases demand on system and network resources.
An application using a failover cluster may need special coding to ensure that when a failure occurs, the failover process is transparent to the user and the application remains available. For example, placing timeouts and retries in code that saves data to a database helps a transaction complete even if a failover occurs partway through.
Install your application or service on multiple servers that are configured to take over for one another when a failure occurs. The process of one server taking over for a failed server is commonly known as failover. A failover cluster is a set of servers that are configured so that if one server becomes unavailable, another server automatically takes over for the failed server and continues processing. Each server in the cluster has at least one other server in the cluster identified as its standby server.
For a standby server to become the active server, it must somehow determine that the active server no longer functions. The system usually uses one of the following general types of heartbeat mechanisms to accomplish this:
Push heartbeats. For push heartbeats, the active server sends specified signals to the standby server at a well-defined interval. If the standby server does not receive a heartbeat signal over a certain time interval, it determines that the active server failed and takes the active role. For example, the active server sends a status message to the standby server every 30 seconds. Due to a memory leak, the active server eventually runs out of memory and then crashes. The standby server notes that it has not received any status messages for 90 seconds (three intervals) and takes over as the active server.
Pull heartbeats. For pull heartbeats, the standby server sends a request to the active server. If the active server does not respond, the standby server repeats the request a specific number of times. If the active server still does not respond, the standby server takes over as the active server. For example, the standby server may send a getCustomerDetails message to the active server every minute. Due to a memory leak, the active server eventually crashes. The standby server sends the getCustomerDetails request three times without receiving a response. At this time, the standby server takes over as the active server.
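The two heartbeat styles above can be sketched from the standby server's point of view. This is a minimal illustration, not the mechanism of any particular cluster product: the class `PushHeartbeatMonitor`, the function `pull_check`, and the `probe` callable are hypothetical names, and timestamps are passed in explicitly so the logic can be followed without real timers or a network.

```python
from dataclasses import dataclass


@dataclass
class PushHeartbeatMonitor:
    """Standby-side bookkeeping for push heartbeats (hypothetical helper)."""
    interval_s: float = 30.0   # expected gap between heartbeats from the active server
    missed_allowed: int = 3    # silent intervals tolerated before takeover
    last_seen_s: float = 0.0   # timestamp of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        # Called whenever a status message arrives from the active server.
        self.last_seen_s = now_s

    def active_has_failed(self, now_s: float) -> bool:
        # True once the silence has lasted for the allowed number of intervals.
        return now_s - self.last_seen_s >= self.interval_s * self.missed_allowed


def pull_check(probe, attempts: int = 3) -> bool:
    """Return True if the active server answers any of `attempts` requests.

    `probe` is an assumed callable that sends one request (for example,
    getCustomerDetails) and returns True on a valid response, or raises
    TimeoutError/ConnectionError when no response arrives.
    """
    for _ in range(attempts):
        try:
            if probe():
                return True
        except (TimeoutError, ConnectionError):
            pass  # no response; try again until attempts are exhausted
    return False
```

With the defaults, a standby that last heard from the active server at time 100 would still wait at time 160 but would take over at time 190, matching the 90-second example above. For pull heartbeats, the standby would take over when `pull_check` returns False.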
A cluster can use multiple levels of heartbeats. For example, a cluster can use push heartbeats at the server level and a set of pull heartbeats at the application level. In this configuration, the active server sends heartbeat messages to the standby server any time the server is up and connected to the network. These heartbeat messages are sent at relatively frequent intervals (for example, every 5 seconds), and the standby server may be programmed to take over as the active server if only two heartbeats are missed. This means that no more than 10 seconds will elapse before the standby server detects the failure of the active server and initiates the failover process.
Quite often, heartbeats are sent over dedicated communication channels so that network congestion and general network problems do not cause spurious failovers. Additionally, the standby server might send query messages to one or more key applications running on the active server and wait for a response within a specified timeout interval. If the standby server receives the correct response, it takes no further action. To minimize the performance impact on the active server, application-level queries are usually sent at relatively long intervals, such as every minute or longer. The standby server may be programmed to wait until it has sent at least five requests without a response before taking over as the active server. This means that up to 5 minutes could elapse before the standby server initiates the failover process.
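The detection times quoted for the two heartbeat levels follow from simple arithmetic: the interval multiplied by the number of missed heartbeats tolerated. The sketch below makes that explicit. The `HeartbeatLevel` class is a hypothetical name introduced for illustration, and the calculation ignores any per-request timeout, which would add to the application-level figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatLevel:
    """One heartbeat level in a cluster (hypothetical configuration record)."""
    name: str
    interval_s: float            # time between heartbeats or queries
    missed_before_failover: int  # consecutive misses that trigger failover

    def detection_window_s(self) -> float:
        # Approximate upper bound on the time needed to notice a failure.
        return self.interval_s * self.missed_before_failover


# Values taken from the discussion above.
server_level = HeartbeatLevel("server-level push", 5.0, 2)
application_level = HeartbeatLevel("application-level pull", 60.0, 5)

for level in (server_level, application_level):
    print(f"{level.name}: up to {level.detection_window_s():.0f} seconds")
```

The gap between the two windows (10 seconds versus 300 seconds) shows the trade-off: frequent server-level heartbeats detect crashes quickly, while infrequent application-level queries keep the load on the active server low at the cost of slower detection.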
Before the standby server can start processing transactions, it must synchronize its state with the state of the failed server. There are basically three different approaches to synchronization:
Transaction log. In the transaction log approach, the active server maintains a log of all changes to its state. Periodically, a synchronization utility processes this log to update the standby server's state to match the state of the active server. When the active server fails, the standby server must use the synchronization utility to process any additions to the transaction log since the last update. After the state is synchronized, the standby server becomes the active server and begins processing.
Hot standby. In hot standby, updates to the internal state of the active server are immediately copied to the standby server. Because the standby server's state is a clone of the active server's, the standby server can immediately become the active server and start processing transactions.
Shared storage. In shared storage, both servers maintain their state on a shared storage device such as a Storage Area Network or a dual-hosted disk array. Again, the failover can happen immediately because no state synchronization is required.
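Of the three approaches above, the transaction log is the one that requires catch-up work at failover time. The following is a minimal in-memory sketch of that idea, assuming the log remains readable by the standby server after the active server fails. The classes `ActiveServer` and `StandbyServer` and the key/value state model are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveServer:
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # append-only record of (key, value) changes

    def write(self, key, value):
        # Apply the change locally and record it in the transaction log.
        self.state[key] = value
        self.log.append((key, value))


@dataclass
class StandbyServer:
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries already replayed

    def synchronize(self, log):
        # Replay only the entries added since the last synchronization.
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)


active = ActiveServer()
standby = StandbyServer()

active.write("order-1", "new")
standby.synchronize(active.log)      # periodic synchronization

active.write("order-1", "shipped")   # changes made after the last synchronization
active.write("order-2", "new")

# The active server fails here; the standby catches up before taking over.
standby.synchronize(active.log)
```

After the final call, the standby's state matches the last state of the failed server. In hot standby, the equivalent of `synchronize` would run on every write, and with shared storage there would be a single copy of the state, so no replay step would be needed.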
Determining the Active Server
It is extremely important that only one active server exists for a given set of applications. If multiple servers behave as if they are the active server, data corruption and deadlock often result. The usual way to address this issue is by using some variant of the active token concept. The token, at its simplest level, is a flag that identifies a server as the active server for an application. Only one active token exists for each set of applications; therefore, only one server can own the token. When a server starts, it checks whether its partner owns the active token. If the partner does, the server starts as the standby server. If the server does not detect the active token, it takes ownership of the active token and starts as the active server. The failover process transfers the active token to the standby server when the standby server becomes active.
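The startup and failover rules for the active token can be sketched as follows. This is an illustration only: the `ActiveToken` class and `start_role` function are hypothetical, and the token is held in process memory behind a lock, whereas in a real cluster it would need to live somewhere that both servers can reach and update atomically.

```python
import threading


class ActiveToken:
    """A single-owner flag identifying the active server (hypothetical sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def try_acquire(self, server: str) -> bool:
        # Take the token only if nobody owns it; report whether `server` owns it now.
        with self._lock:
            if self._owner is None:
                self._owner = server
            return self._owner == server

    def transfer(self, from_server: str, to_server: str) -> bool:
        # Failover: hand the token over, but only from the current owner.
        with self._lock:
            if self._owner != from_server:
                return False
            self._owner = to_server
            return True

    def owner(self):
        with self._lock:
            return self._owner


def start_role(token: ActiveToken, server: str) -> str:
    """Decide the role of a server at startup based on token ownership."""
    return "active" if token.try_acquire(server) else "standby"


token = ActiveToken()
role_01 = start_role(token, "Database01")  # no owner yet, so this server becomes active
role_02 = start_role(token, "Database02")  # partner owns the token, so this one stands by
```

Checking and taking the token inside one locked section is the important design point: if the check and the acquisition were separate steps, two servers starting at the same time could both conclude that no token exists and both become active.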
In most cases, when a standby server becomes active, the change is transparent to the application or user that it is supporting. When a failure occurs during a transaction, however, the transaction may have to be retried for it to complete successfully. This makes it important to code the application so that the failover process remains transparent. One example of doing so is including timeouts with retries when committing data to a database.
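A retry wrapper of the kind described above might look like the following sketch. The function name `commit_with_retry` and its parameters are hypothetical, and `commit` stands for any callable that performs the database commit with its own timeout and raises `TimeoutError` or `ConnectionError` when the server does not respond. Real database drivers raise their own exception types, which would need to be substituted.

```python
import time


def commit_with_retry(commit, retries=3, delay_s=1.0, sleep=time.sleep):
    """Call `commit` until it succeeds or `retries` attempts have failed.

    `commit` is an assumed callable that raises TimeoutError or
    ConnectionError while the database is unreachable, for example
    during a failover. `sleep` is injectable so tests need not wait.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return commit()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < retries:
                sleep(delay_s)  # give the standby server time to take over
    raise last_error
```

The total retry window (attempts multiplied by timeout plus delay) should be longer than the time the cluster needs to detect the failure and complete the failover. Retrying is only safe when a repeated commit cannot apply the same change twice, so the operation should be idempotent or the failed attempt should be known to have rolled back.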
Additionally, most servers use Internet Protocol (IP) addresses to communicate; therefore, for a failover to succeed, the infrastructure must support transferring an IP address from one server to another. For example, the network switches must be able to handle the IP address transfer. If your systems infrastructure cannot support this, you may want to use a load-balanced cluster instead of a failover cluster. For more information, see the Load-Balanced Cluster pattern.
Scaling Failover Cluster Servers
Scalability in failover clusters is typically achieved by scaling up, that is, by adding more capability to an individual server within the cluster. It is important to understand that a failover cluster must be designed to handle the expected load and that individual servers should be sized so that they can accommodate expected growth in CPU, memory, and disk usage. Failover cluster servers are typically high-end multiprocessor servers that are configured for high availability by using multiple redundant subsystems. If the resource requirements of your solution exceed the limits of the servers in the cluster, the cluster will be extremely difficult to scale.
To help you better understand how to use failover clustering to achieve high availability, the following discussion walks through refactoring an already implemented basic solution, which contains a single system (single point of failure), into a highly available solution.
Initially, an organization might start with a basic solution architecture such as the one outlined in Figure 1. Although the solution might meet initial availability expectations, certain factors such as an increase in the number of users or a need for less application downtime may force changes in the design.
Figure 1: Non-failover solution with single point of failure
In Figure 1, the data tier contains only a single database server (Database10) that services the application tier. If the database server or the software that it hosts fails, the application server will no longer be able to access the data it needs to service the client. This will make the application unavailable to the client.
Failover Cluster Solution
To increase the availability of the solution, the organization might decide to eliminate the potential single point of failure presented by the single database server in the data tier. The organization could do this by adding a server to the data tier and creating a failover cluster from the existing database server, the new server, and a shared storage device. In Figure 2, which illustrates this change, the cluster consists of the two servers connected to a shared storage array.
Figure 2: Solution with failover data tier
The first server (Database01) is the active server that handles all of the transactions. The second server (Database02), which is sitting idle, will handle transactions only if Database01 fails. The cluster exposes a virtual IP address and host name (Database10) to the network that clients and applications use.
Note: You can extend this design to include multiple active servers (more than the one shown) either with a single standby server shared among them or with each active server configured as a standby server for another active server.
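The two extended topologies in the note can be expressed as a mapping from each active server to its standby. The sketch below is illustrative only: the function names are hypothetical, and the server names Database03 and Standby01 are invented for the example and do not appear in the figures.

```python
def shared_standby(actives, standby):
    """Every active server fails over to the same standby server."""
    return {server: standby for server in actives}


def ring_standby(servers):
    """Each active server is also the standby for the previous server in the list."""
    if len(servers) < 2:
        raise ValueError("a ring needs at least two servers")
    return {
        server: servers[(index + 1) % len(servers)]
        for index, server in enumerate(servers)
    }


# Hypothetical server names, used only for illustration.
n_plus_one = shared_standby(["Database01", "Database02"], "Standby01")
ring = ring_standby(["Database01", "Database02", "Database03"])
```

In the shared-standby layout, one idle server covers several active servers, which lowers hardware cost but protects against only one failure at a time. In the ring layout, no server sits idle, but a server that takes over for its neighbor must have enough spare capacity to carry both workloads.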
The Failover Cluster pattern results in the following benefits and liabilities:
Benefits
Accommodates planned downtime. Failover clusters can allow for system downtime without affecting availability. This accommodates routine maintenance and upgrades.
Reduces unplanned downtime. Failover clusters reduce application downtime related to server and software failure by eliminating single points of failure at the system and application levels.
Liabilities
Can increase response times. Failover cluster designs can increase response times due to the increased load on the standby server or the need to update state information on or from multiple servers.
Increases equipment costs. The additional hardware that failover clusters require can easily double the cost of an infrastructure tier.
For more information, see the following related patterns:
Server Clustering. Server Clustering presents the concept of using virtual computing resources to enhance the scalability and eliminate the single points of failure that affect availability.
Load-Balanced Cluster. Load-balanced clusters can increase application performance for the current number of users by sharing workload across multiple servers.
Tiered Distribution. Tiered Distribution organizes the system infrastructure into a set of physical tiers to optimize server environments for specific operational requirements and system resource usage.