Disaster Recovery and High Availability for Azure Applications

Primary Authors: Michael McKeown, Cloud Solutions Architect (Aditi); Hanu Kommalapati, Principal Technology Evangelist (Microsoft)

Contributing Author: Jason Roth

Reviewers: Patrick Wickline, Dennis Mulder, Steve Howard, Tim Wieman, James Podgorski, Ryan Berry, Shweta Gupta, Harry Chen, Jim Blanchard, Andy Roberts, Christian Els

Introduction

This paper focuses on high availability for applications running in Azure. An overall strategy for high availability also includes the area of disaster recovery (DR). Planning for failures and disasters in the cloud requires you to recognize the failures quickly. You then implement a strategy that matches your tolerance for the application’s downtime. Additionally, you have to consider the extent of data loss the application can tolerate without causing adverse business consequences as it is restored.

When we ask customers if they are prepared for temporary and large-scale failures, most say they are. However, before you answer that question for yourself, does your company rehearse these failures? Do you test the recovery of databases to ensure you have the correct processes in place? Probably not. That’s because successful DR starts with lots of planning and architecting to implement these processes. Just like many other non-functional requirements, such as security, disaster recovery rarely gets the up-front analysis and time allocation it requires. Also, most customers don’t have the budget for geographically distributed datacenters with redundant capacity. Consequently, even mission critical applications are frequently excluded from proper DR planning.

Cloud platforms, such as Azure, provide geographically dispersed datacenters around the world. These platforms also provide capabilities that support availability and a variety of DR scenarios. Now, every mission critical cloud application can be given due consideration for disaster proofing of the system. Azure has resiliency and DR built in to many of its services. You must study these platform features carefully and supplement with application strategies.

This whitepaper outlines the necessary architecture steps you must take to disaster-proof an Azure deployment. Then you can implement the larger business continuity process. A business continuity plan is a roadmap for continuing operations under adverse conditions. This could be a failure with technology, such as a downed service, or a natural disaster, such as a storm or power outage. Application resiliency for disasters is only a subset of the larger DR process as described in this NIST document: Contingency Planning Guide for Information Technology Systems.

The following sections define different levels of failures, techniques to deal with them, and architectures that support these techniques. This information provides input to your DR processes and procedures to ensure your DR strategy works correctly and efficiently.

Characteristics of Resilient Cloud Applications

A well architected application can withstand capability failures at a tactical level and can also tolerate strategic system-wide failures at the datacenter level. The following sections define the terminology referenced throughout the document to describe various aspects of resilient cloud services.

High Availability

A highly available cloud application implements strategies to absorb outages of its dependencies, such as the managed services offered by the cloud platform. Despite possible failures of the cloud platform’s capabilities, this approach permits the application to continue to exhibit the expected functional and non-functional systemic characteristics. This is covered in depth in the paper Failsafe: Guidance for Resilient Cloud Architectures.

When you implement the application, you must consider the probability of a capability outage. Additionally, consider the impact an outage will have on the application from the business perspective before diving deep into the implementation strategies. Without due consideration to the business impact and the probability of hitting the risk condition, the implementation can be expensive and potentially unnecessary.

Consider an automotive analogy for high availability. Even quality parts and superior engineering do not prevent occasional failures. For example, when your car gets a flat tire, the car still runs, but it is operating with degraded functionality. If you planned for this potential occurrence, you can use one of those thin-rimmed spare tires until you reach a repair shop. Although the spare tire does not permit fast speeds, you can still operate the vehicle until you replace the tire. Similarly, a cloud service that plans for potential loss of capabilities can prevent a relatively minor problem from bringing down the entire application. This is true even if the cloud service must run with degraded functionality.

There are a few key characteristics of highly available cloud services: availability, scalability, and fault tolerance. Although these characteristics are interrelated, it is important to understand each and how they contribute to the overall availability of the solution.

Availability

An available application considers the availability of its underlying infrastructure and dependent services. Available applications remove single points of failure through redundancy and resilient design. When we talk about availability in Azure, it is important to understand the concept of the effective availability of the platform. Effective availability considers the Service Level Agreements (SLA) of each dependent service and their cumulative effect on the total system availability.

System availability is the percentage of a time window during which the system is able to operate. For example, the availability SLA of at least two instances of a web or worker role in Azure is 99.95%. It does not measure the performance or functionality of the services running on those roles. However, the effective availability of your cloud service is also affected by the SLAs of the other dependent services. The more moving parts within the system, the more care you must take to ensure the application can resiliently meet the availability requirements of its end users.

Consider the following SLAs for a cloud service that depends on three Azure services: Compute, Azure SQL Database, and Azure Storage.

Azure Service    SLA       Potential Minutes Downtime/Month (30 days)
Compute          99.95%    21.6
SQL Database     99.90%    43.2
Storage          99.90%    43.2

You must plan for all services to potentially go down at different times. In this simplified example, the total number of minutes per month that the application could be down is 108 minutes (21.6 + 43.2 + 43.2). A 30-day month has a total of 43,200 minutes, so 108 minutes is 0.25% of the month. This gives you an effective availability of 99.75% for the cloud service.

However, the availability techniques described in this paper can improve on this figure. For example, if you design your application to continue running when the SQL Database is unavailable, you can remove that service from the equation. This might mean that the application runs with reduced capabilities, so there are also business requirements to consider. For a complete list of Azure SLAs, see Service Level Agreements.
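The effective availability arithmetic can be reproduced in a few lines. The following Python sketch is illustrative only: it assumes the worst case used in this example, in which outages of the dependent services never overlap, and the function names are hypothetical.

```python
# SLA percentages taken from the table in this section.
slas = {"Compute": 99.95, "SQL Database": 99.90, "Storage": 99.90}

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def downtime_minutes(sla_percent):
    """Potential minutes of downtime per 30-day month for one SLA."""
    return MINUTES_PER_MONTH * (100.0 - sla_percent) / 100.0


def effective_availability(sla_percents):
    """Worst case: outages never overlap, so downtime minutes add up."""
    total_down = sum(downtime_minutes(s) for s in sla_percents)
    return 100.0 * (1.0 - total_down / MINUTES_PER_MONTH)


total = sum(downtime_minutes(s) for s in slas.values())
print(round(total, 1))                                  # 108.0
print(round(effective_availability(slas.values()), 2))  # 99.75

# Designing the application to keep running without SQL Database
# removes that service from the equation.
without_sql = [v for name, v in slas.items() if name != "SQL Database"]
print(round(effective_availability(without_sql), 2))    # 99.85
```

Under the same assumption, removing SQL Database as a hard dependency raises the effective availability from 99.75% to 99.85%.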

Scalability

Scalability directly affects availability: an application that fails under increased load is no longer available. Scalable applications are able to meet increased demand with consistent results in acceptable time windows. When a system is scalable, it scales horizontally or vertically to manage increases in load while maintaining consistent performance. In basic terms, horizontal scaling adds more machines of the same size (processor, memory, bandwidth) while vertical scaling increases the size of the existing machines. For Azure, you have vertical scaling options for selecting various machine sizes for compute. However, changing the machine size requires a re-deployment. Therefore, the most flexible solutions are designed for horizontal scaling. This is especially true for compute because you can easily increase the number of running instances of any web or worker role through the Azure Web portal, PowerShell scripts, or code. The additional instances then handle the increased traffic. Base the decision to add instances on increases in specific monitored metrics. When the service scales this way, user-facing performance does not suffer a noticeable drop under load. Typically, the web and worker roles store any state externally. This allows for flexible load balancing and graceful handling of any changes to instance counts. Horizontal scaling also works well with services, such as Azure Storage, which do not provide tiered options for vertical scaling.

Cloud deployments should be seen as a collection of scale-units. This allows the application to be elastic in servicing the throughput needs of end users. The scale units are easier to visualize at the web and application server level. This is because Azure already provides stateless compute nodes through web and worker roles. Adding more compute scale-units to the deployment will not cause any application state management side effects because compute scale-units are stateless. A storage scale-unit is responsible for managing a partition of data (structured or unstructured). Examples of storage scale-units include Azure Table partition, Blob container, and SQL Database. Even the usage of multiple Azure Storage accounts has a direct impact on the application scalability. You must design a highly scalable cloud service to incorporate multiple storage scale-units. For instance, if an application uses relational data, partition the data across several SQL Databases. Doing so allows the storage to keep up with the elastic compute scale-unit model. Similarly, Azure Storage allows data partitioning schemes that require deliberate designs to meet the throughput needs of the compute layer. For a list of best practices for designing scalable cloud services, see Best Practices for the Design of Large-Scale Services on Azure Cloud Services.
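One way to picture multiple storage scale-units is a function that maps each record's partition key to one of several databases. The sketch below is a minimal illustration, not a recommended scheme: the database names are hypothetical, and hashing the key modulo the database count is only one of several possible partitioning strategies.

```python
import hashlib

# Hypothetical database names; in practice these would be connection details.
DATABASES = ["orders-db-0", "orders-db-1", "orders-db-2", "orders-db-3"]


def pick_database(partition_key, databases=DATABASES):
    """Map a partition key (for example, a customer ID) to one database.

    A stable hash is used so that every compute instance routes the same
    key to the same database, regardless of process or machine.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(databases)
    return databases[index]


# Every instance computes the same target for the same key.
target = pick_database("customer-42")
```

A simple modulo scheme like this moves most keys to a different database when the number of databases changes, so a production design would need a plan for repartitioning.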

Fault Tolerance

Applications need to assume that every dependent cloud capability can and will go down at some point in time. A fault tolerant application detects and maneuvers around failed elements to continue and return the correct results within a specific timeframe. For transient error conditions, a fault tolerant system will employ a retry policy. For more serious faults, the application can detect problems and fail over to alternative hardware or contingency plans until the failure is corrected. A reliable application can properly manage the failure of one or more parts and continue operating properly. Fault tolerant applications can use one or more design strategies, such as redundancy, replication, or degraded functionality.

Disaster Recovery

A cloud deployment might cease to function due to a systemic outage of the dependent services or the underlying infrastructure. Under such conditions, a business continuity plan triggers the disaster recovery (DR) process. This process typically involves both operations personnel and automated procedures in order to reactivate the application at a functioning datacenter. This requires the transfer of application users, data, and services to the new datacenter. It also involves the use of backup media or ongoing replication.

Consider the previous analogy that compared high availability to the ability to recover from a flat tire through the use of a spare. In contrast, disaster recovery involves the steps taken after a car crash where the car is no longer operational. In that case, the best solution is to find an efficient way to change cars by calling a travel service or a friend. In this scenario, there is likely going to be a longer delay in getting back on the road. There is also more complexity in repairing and returning to the original vehicle. In the same way, disaster recovery to another datacenter is a complex task that typically involves some downtime and potential loss of data. To better understand and evaluate disaster recovery strategies, it is important to define two terms: recovery time objective (RTO) and recovery point objective (RPO).

RTO

The recovery time objective (RTO) is the maximum amount of time allocated for restoring application functionality. This is based on business requirements and is related to the importance of the application. Critical business applications require a low RTO.

RPO

The recovery point objective (RPO) is the acceptable time window of lost data due to the recovery process. For example, if the RPO is one hour, you must completely back up or replicate the data at least every hour. Once you bring up the application in an alternate datacenter, the backup data may be missing up to an hour of data. Like RTO, critical applications target a much smaller RPO.

Part 1: High Availability

A highly available application absorbs fluctuations in availability, load, and temporary failures in the dependent services and hardware. The application continues to operate at an acceptable user and systemic response level as defined by business requirements or application service level agreements.

Azure High Availability Features

Azure has many built-in platform features that support highly available applications. This section describes some of those key features. For a more comprehensive analysis of the platform, see Azure Business Continuity Technical Guidance.

The Azure Fabric Controller (FC) is responsible for provisioning and monitoring the condition of the Azure compute instances. The Fabric Controller checks the status of the hardware and software of the host and guest machine instances. When it detects a failure, it enforces SLAs by automatically relocating the VM instances. The concept of fault and upgrade domains further supports the compute SLA.

When multiple role instances are deployed, Azure deploys these instances to different fault domains. A fault domain boundary is basically a different hardware rack in the same datacenter. Fault domains reduce the probability that a localized hardware failure will interrupt the service of an application. You cannot manage the number of fault domains that are allocated to your worker or web roles. The Fabric Controller uses dedicated resources that are separate from Azure hosted applications. It has 100% uptime because it serves as the nucleus of the Azure system. It monitors and manages role instances across fault domains. The following diagram shows Azure shared resources that are deployed and managed by the FC across different fault domains.

Figure 1 Fault Domain Isolation (Simplified View)

Upgrade domains are similar to fault domains in function, but they support upgrades rather than failures. An upgrade domain is a logical unit of instance separation that determines which instances in a particular service will be upgraded at a point in time. By default, five upgrade domains are defined for your hosted service deployment. However, you can change that value in the service definition file. For example, suppose you have eight instances of your web role. With the default five upgrade domains, three upgrade domains contain two instances each and two upgrade domains contain one instance each. Azure defines the update sequence, but it is based on the number of upgrade domains. For more information on upgrade domains, see Update an Azure Service.
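To see how instances spread over upgrade domains, consider the sketch below. It assumes instances are placed round-robin across the domains, which is an assumption made for illustration; the actual placement is decided by the platform. The function name is hypothetical.

```python
from collections import Counter


def assign_upgrade_domains(instance_count, upgrade_domain_count=5):
    """Assumed round-robin placement: instance i goes to domain i % count."""
    return [i % upgrade_domain_count for i in range(instance_count)]


# Eight web role instances over the default five upgrade domains.
placement = assign_upgrade_domains(8)
per_domain = Counter(placement)
print(sorted(per_domain.items()))
# [(0, 2), (1, 2), (2, 2), (3, 1), (4, 1)]
```

Under this assumption, an upgrade that walks one domain at a time takes at most two of the eight instances offline at once.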

In addition to these platform features that support high compute availability, Azure embeds high availability features into its other services. For example, Azure Storage maintains three replicas of all blob, table, and queue data. It also allows the option of geo-replication to store backups of blobs and tables in a secondary datacenter. The Content Delivery Network (CDN) allows blobs to be cached around the world for both redundancy and scalability. Azure SQL Database maintains multiple replicas as well. In addition to the Azure Business Continuity Technical Guidance paper, see the Best Practices for the Design of Large-Scale Services on Azure Cloud Services paper. They provide a full discussion of the platform availability features.

Although Azure provides multiple features that support high availability, it is important to understand their limitations. For compute, Azure guarantees that your roles are available and running, but it does not know if your application is running or overloaded. For Azure SQL Database, data is replicated synchronously within the datacenter. These database replicas are not point-in-time backups. For Azure Storage, table and blob data is replicated by default to an alternate datacenter. However, you cannot access the replicas until Microsoft chooses to fail over to the alternate site. A datacenter failover typically only occurs in the case of a prolonged datacenter-wide outage, and there is no SLA for geo-failover time. It is also important to note that any data corruption quickly spreads to the replicas. For these reasons, you must supplement the platform availability features with application-specific availability features. These application availability features include the blob snapshot feature to create point-in-time backups of blob data.

The majority of this paper focuses on cloud services, which use a Platform as a Service (PaaS) model. However, there are also specific availability features for Azure Virtual Machines, which use an Infrastructure as a Service (IaaS) model. In order to achieve high availability with Virtual Machines, you must use availability sets. An availability set serves a similar function to fault and upgrade domains. Within an availability set, Azure positions the virtual machines in a way that prevents localized hardware faults and maintenance activities from bringing down all of the machines in that group. Availability sets are required to achieve the Azure SLA for the availability of Virtual Machines. The following diagram provides a representation of two availability sets that group web and SQL Server virtual machines respectively.

Figure 2 Availability Sets for Azure Virtual Machines

Note that in the previous diagram, SQL Server is installed and running on virtual machines. This is different from the previous discussion of Azure SQL Database, which provides database as a managed service.

Application Strategies for High Availability

Most application strategies for high availability involve either redundancy or the removal of hard dependencies between application components. Application design should support fault tolerance during sporadic downtime of Azure or third-party services. The following sections describe several application patterns for increasing availability of your cloud services.

Asynchronous Communication and Durable Queues

Consider asynchronous communication between loosely-coupled services to increase availability in Azure applications. In this pattern, write messages to either storage queues or Service Bus queues for later processing. When you write the message to the queue, control immediately returns to the sender of the message. Another tier of the application handles the message processing, typically implemented as a worker role. If the worker role goes down, the messages accumulate in the queue until the processing service is restored. As long as the queue is available, there is no direct dependency between the front-end sender and the message processor. This eliminates the requirement for synchronous service calls that can be a throughput bottleneck in distributed applications.
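The decoupling described above can be sketched with an in-memory queue. This is a minimal illustration only: Python's `queue.Queue` stands in for a durable storage queue or Service Bus queue, it is not durable itself, and the function names are hypothetical.

```python
import queue

# Stand-in for a durable queue; queue.Queue is in-memory only.
order_queue = queue.Queue()


def front_end_submit(order):
    """Enqueue the message and return at once; no back-end call is made."""
    order_queue.put(order)
    return "accepted"


def worker_drain(process):
    """Process everything currently queued; return the number handled."""
    handled = 0
    while True:
        try:
            message = order_queue.get_nowait()
        except queue.Empty:
            return handled
        process(message)
        handled += 1


# While the worker is down, the front end keeps accepting messages
# and they accumulate in the queue.
for order_id in (1, 2, 3):
    front_end_submit({"id": order_id})
backlog = order_queue.qsize()

# When the worker is restored, it works through the backlog.
processed = []
handled = worker_drain(processed.append)
```

The front end never waits on the worker, so a worker outage shows up as a growing backlog rather than as failed user requests.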

A variation of this uses Azure Storage (blobs, tables, queues) or Service Bus queues as a failover location for failed database calls. For example, a synchronous call within an application to another service (such as Azure SQL Database) fails repeatedly. You may be able to serialize that data into durable storage. At some later point when the service or database is back on-line, the application can re-submit the request from storage. The difference in this model is that the intermediate location is not a constant part of the application workflow. It is used only in failure scenarios.
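A minimal sketch of this failover variation follows. It assumes that any retries have already been exhausted before the request is parked, uses a plain list as a stand-in for durable storage, and relies on hypothetical names (`ServiceUnavailable`, `save_order`, `replay`) that are not part of any Azure library.

```python
import json


class ServiceUnavailable(Exception):
    """Hypothetical error for a back-end call that keeps failing."""


def save_order(order, write_to_database, spill_store):
    """Write to the database, or park the request in durable storage."""
    try:
        write_to_database(order)
        return "stored"
    except ServiceUnavailable:
        spill_store.append(json.dumps(order))  # serialize for later
        return "deferred"


def replay(spill_store, write_to_database):
    """Re-submit parked requests in order; stop at the first failure."""
    replayed = 0
    while spill_store:
        try:
            write_to_database(json.loads(spill_store[0]))
        except ServiceUnavailable:
            break  # still down; keep the remaining requests parked
        spill_store.pop(0)
        replayed += 1
    return replayed
```

A parked request is removed only after the write succeeds, so a second outage during replay does not lose it.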

In both scenarios, asynchronous communication and intermediate storage prevent a downed backend service from bringing the entire application down. Queues serve as a logical intermediary. For more guidance on choosing the correct queuing service, see Azure Queues and Azure Service Bus Queues - Compared and Contrasted.

Fault Detection and Retry Logic

A key point in highly available application design is to utilize retry logic within code to gracefully handle a service that is temporarily down. The Transient Fault Handling Application Block, developed by the Microsoft Patterns and Practices team, assists application developers in this process. The word “transient” means a temporary condition lasting only for a relatively short time. In the context of this paper, handling transient failures is part of developing a highly available application. Examples of transient conditions include intermittent network errors and lost database connections.

The Transient Fault Handling Application Block is a simplified way for you to handle failures within your code in a graceful manner. It allows you to improve the availability of your applications by adding robust transient fault handling logic. In most cases, retry logic handles the brief interruption and reconnects the sender and receiver after one or more failed attempts. A successful retry attempt typically goes unnoticed to application users.

There are three options for developers to manage their retry logic: incremental, fixed interval, and exponential. Incremental waits longer before each retry in an increasing linear fashion (for example, 1, 2, 3, and 4 seconds). Fixed interval waits the same amount of time between each retry (for example, 2 seconds). Exponential back-off also waits longer between retries, but the wait grows exponentially rather than linearly (for example, 2, 4, 8, and 16 seconds).
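The three schedules can be written as small functions that return the wait before each retry. This is a sketch in Python with hypothetical function and parameter names; it is not the API of the Transient Fault Handling Application Block.

```python
def incremental(retries, initial=1.0, increment=1.0):
    """Wait grows linearly: initial, initial + increment, and so on."""
    return [initial + increment * i for i in range(retries)]


def fixed_interval(retries, interval=2.0):
    """Wait is the same before every retry."""
    return [interval] * retries


def exponential(retries, base=2.0):
    """Wait multiplies by the base on every retry: base, base**2, and so on."""
    return [base ** (i + 1) for i in range(retries)]


print(incremental(4))     # [1.0, 2.0, 3.0, 4.0]
print(fixed_interval(4))  # [2.0, 2.0, 2.0, 2.0]
print(exponential(4))     # [2.0, 4.0, 8.0, 16.0]
```

Summing a schedule gives the worst-case time spent waiting before the operation is declared failed, which is useful when checking that retries fit within the application's response budget.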

The high-level strategy within your code is:

  1. Define your retry strategy and policy

  2. Try the operation that could result in a transient fault

  3. If transient fault occurs, invoke the retry policy

  4. If all retries fail, catch a final exception

Test your retry logic in simulated failures to ensure that retries on successive operations do not result in an unanticipated lengthy delay. Do this before deciding to fail the overall task.
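The four steps above, together with a simulated failure, can be sketched as follows. The names `TransientError` and `run_with_retry` are hypothetical, and the `sleep` parameter exists only so that a test can record the waits instead of actually pausing.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def run_with_retry(operation, delays, sleep=time.sleep):
    """Call operation(); after each transient fault, wait and try again.

    `delays` is the retry policy: one wait (in seconds) per retry.
    When the retries are exhausted, the final exception propagates.
    """
    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()  # last attempt; a failure here reaches the caller


# Simulated failure: the operation fails twice, then succeeds.
attempts = []
waited = []


def flaky_operation():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("simulated network error")
    return "ok"


result = run_with_retry(flaky_operation, delays=[1, 2, 3], sleep=waited.append)
```

Recording the waits rather than sleeping makes it cheap to verify the total delay a policy can add before the overall task is failed.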

Reference Data Pattern (High Availability)

Reference data is the read-only data of an application. This data provides the business context within which the application generates transactional data during the course of a business operation. Transactional data is a point-in-time function of the reference data. Therefore, its integrity depends on the snapshot of the reference data at the time of the transaction. This is a somewhat loose definition, but should suffice for our purpose here.

Reference data in the context of an application is necessary for the functioning of the application. The respective applications create and maintain reference data; Master Data Management systems often perform this function. These systems are responsible for the lifecycle of the reference data. Examples of reference data include product catalog, employee master, parts master, and equipment master. Reference data can also originate from outside the organization, for example, zip codes or tax rates. Strategies for increasing the availability of reference data are typically less difficult than those for transactional data. Reference data has the advantage of being mostly immutable.

You can make Azure web and worker roles that consume reference data autonomous at run time by deploying the reference data along with the application. If the size of the local storage allows such a deployment, this is an ideal state. Embedded databases (SQL, NoSQL) or XML files deployed to a local file system will help with the autonomy of Azure compute scale-units. However, you should have a mechanism to update the data in each role without requiring redeployment. To do this, place any updates to the reference data in a cloud storage endpoint (for example, Azure Blob storage or SQL Database). Add code to each role that downloads the data updates into the compute nodes at role startup. Alternatively, add code that allows an administrator to perform a forced download into the role instances. To increase availability, the roles should also contain a set of reference data in case storage is down. This enables the roles to start with a basic set of reference data until the storage resource becomes available for the updates.

Figure 3 Application high availability through autonomous compute nodes

One consideration for this pattern is the deployment and startup speed for your roles. If you are deploying or downloading large amounts of reference data on startup, this can increase the amount of time it takes to spin up new deployments or role instances. This might be an acceptable tradeoff for the autonomy of having the reference data immediately available on each role rather than depending on external storage services.
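The startup behavior described in this section, preferring updated data from storage but falling back to the copy deployed with the role, can be sketched as follows. The data values are placeholders, and `StorageUnavailable` and `load_reference_data` are hypothetical names, not part of any Azure SDK.

```python
# Minimal data set shipped inside the deployment package (placeholder values).
EMBEDDED_REFERENCE_DATA = {"version": 1, "products": ["basic-item"]}


class StorageUnavailable(Exception):
    """Hypothetical error raised when the storage endpoint cannot be reached."""


def load_reference_data(download_latest):
    """Prefer fresh data from storage; fall back to the embedded copy."""
    try:
        return download_latest(), "storage"
    except StorageUnavailable:
        return dict(EMBEDDED_REFERENCE_DATA), "embedded"


def storage_is_down():
    raise StorageUnavailable("simulated outage")


# At role startup during a storage outage, the role still gets usable data.
data, source = load_reference_data(storage_is_down)
```

Returning the source alongside the data lets the role schedule another download attempt once storage becomes available again.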

Transactional Data Pattern (High Availability)

Transactional data is the data generated by the application in a business context. Transactional data is a combination of the set of business processes the application implements and the reference data that supports these processes. Transactional data examples can include orders, advanced shipping notices, invoices, and CRM opportunities. The transactional data thus generated will be fed to external systems for record keeping or for further processing.

Keep in mind that reference data can change within the systems that are responsible for this data. For this reason, transactional data must save the point-in-time reference data context so that it has minimal external dependencies for its semantic consistency. For example, consider the removal of a product from the catalog a few months after an order was fulfilled. The best practice is to embed as much reference data context as feasible into the transaction. This preserves the semantics associated with the transaction even if the reference data were to change after the transaction is captured.
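As a small illustration of embedding reference data context, the sketch below copies the product details into the order at capture time. The catalog contents and field names are hypothetical.

```python
import copy

# Hypothetical product catalog (reference data).
catalog = {"SKU-1": {"name": "Widget", "unit_price": 10.0}}


def capture_order(catalog, sku, quantity):
    """Store a point-in-time copy of the product with the transaction."""
    product = copy.deepcopy(catalog[sku])
    return {
        "sku": sku,
        "quantity": quantity,
        "product_snapshot": product,
        "total": product["unit_price"] * quantity,
    }


order = capture_order(catalog, "SKU-1", 3)

# Months later the product is removed from the catalog.
del catalog["SKU-1"]

# The order still carries everything needed to interpret it.
print(order["product_snapshot"]["name"], order["total"])  # Widget 30.0
```

The deep copy matters: a reference to the live catalog entry would change, or disappear, along with the catalog.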

As mentioned previously, architectures that use loose coupling and asynchronous communication lend themselves to higher levels of availability. This holds true for transactional data as well, but the implementation is more complex. Traditional transactional notions typically rely on the database for guaranteeing the transaction. When you introduce intermediate layers, the application code must correctly handle the data at various layers to ensure sufficient consistency and durability.

The following sequence describes a workflow that separates the capture of transactional data from its processing:

  1. Web Compute Node: Present reference data.

  2. External Storage: Save intermediate transactional data.

  3. Web Compute Node: Complete the end-user transaction.

  4. Web Compute Node: Send the completed transactional data along with the reference data context to a temporary durable storage that is guaranteed to give predictable response.

  5. Web Compute Node: Signal end user the completion of the transaction.

  6. Background Compute Node: Extract the transactional data, post-process it if necessary, and send it to its final storage location in the current system.

The following diagram shows one possible implementation of this design in an Azure cloud service.

Figure 4 Application high availability through loose coupling

The dashed arrows in the above diagram indicate asynchronous processing. The front-end web role is not aware of this asynchronous processing, which eventually stores the transaction at its final destination in the current system. Due to the latency introduced by this asynchronous model, the transactional data is not immediately available for query. Therefore, each unit of the transactional data needs to be saved in a cache or user session to meet the immediate UI needs.
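The interaction between the durable queue, the final store, and the session cache can be sketched with in-memory stand-ins. All names here are hypothetical, and the Python containers only model the roles of the Azure components; they provide no durability.

```python
from collections import deque

durable_queue = deque()  # stand-in for the temporary durable storage
final_store = {}         # stand-in for the final storage location
session_cache = {}       # per-user cache for immediate UI needs


def complete_transaction(user, txn):
    """Web node: hand off the transaction and signal completion."""
    durable_queue.append(txn)                       # send to durable storage
    session_cache.setdefault(user, []).append(txn)  # keep a copy for the UI
    return "completed"                              # signal the end user


def background_process():
    """Background node: move queued transactions to the final store."""
    while durable_queue:
        txn = durable_queue.popleft()
        final_store[txn["id"]] = txn


def recent_transactions(user):
    """UI reads from the session cache, not from the final store."""
    return session_cache.get(user, [])


status = complete_transaction("alice", {"id": 101, "amount": 25})
visible_before = recent_transactions("alice")  # available immediately
stored_before = 101 in final_store             # not yet in the final store
background_process()
stored_after = 101 in final_store
```

The user sees the new transaction from the cache right away, even though the final store catches up only after the background node runs.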

Consequently, the web role is autonomous from the rest of the infrastructure. Its availability profile is a combination of the web role and the Azure queue and not the entire infrastructure. In addition to high availability, this approach allows the web role to scale horizontally, independent of the backend storage. This high availability model can have an impact on the economics of operations. Additional components like Azure queues and worker roles can impact monthly usage costs.

Note that the previous diagram shows one implementation of this decoupled approach to transactional data. There are many other possible implementations. The following list provides some alternative variations.

  • A worker role might be placed between the web role and the storage queue.

  • A Service Bus queue can be used instead of an Azure Storage queue.

  • The final destination might be Azure Storage or a different database provider.

  • Azure Caching can be used at the web layer to provide the immediate caching requirements following the transaction.

Scalability Patterns

In addition to the patterns discussed in this section, it is important to note that the scalability of the cloud service directly impacts availability. If increased load causes your service to be unresponsive, the user impression is that the application is down. Follow best practices for scalability based on your expected application load and future expectations. The highest scale involves many considerations, such as the use of single vs. multiple storage accounts, sharding across multiple databases, and caching strategies. For an in-depth look at these patterns, see Best Practices for the Design of Large-Scale Services on Azure Cloud Services.

Part 2: Disaster Recovery

While high availability is about temporary failure management, disaster recovery (DR) is about the catastrophic loss of application functionality. For example, consider the scenario where one or more datacenters go down. In this case, you need to have a plan to run your application or access your data outside of the datacenter. Execution of this plan involves people, processes, and supporting applications that allow the system to function. The business and technology owners, who define its disaster operational mode, determine the level of functionality for the service during a disaster. This can take many forms: completely unavailable, partially available (degraded functionality or delayed processing), or fully available.

Azure Disaster Recovery Features

As with availability considerations, Azure has platform features designed to support disaster recovery, which are described in Azure Business Continuity Technical Guidance. There is also a relationship between some of the availability features of Azure and disaster recovery. For example, the management of roles across fault domains increases the availability of an application. Without that management, an unhandled hardware failure would become a “disaster” scenario. So the correct application of many of the availability features and strategies should be seen as an important part of disaster-proofing your application. However, this section goes beyond general availability issues to more serious (and rarer) disaster events.

Multiple Datacenter Regions

Azure maintains datacenters in many different regions around the world. This supports several disaster recovery scenarios, such as the system-provided geo-replication of Azure Storage to secondary datacenters. It also means that you can easily and inexpensively deploy a cloud service to multiple locations around the world. Compare this with the cost and difficulty of running your own datacenters in multiple regions. Deploying data and services to multiple datacenters protects your application from major outages in a single datacenter.

Azure Traffic Manager

Once a datacenter-specific failure occurs, you must redirect traffic to services or deployments in another datacenter. This routing can be done manually, but it is more efficient to use an automated process. Azure Traffic Manager (WATM) is designed for this task. It allows you to automatically manage the failover of user traffic to another datacenter in case the primary datacenter fails. Because traffic management is an important part of the overall strategy, it is important to understand the basics of WATM.

In the diagram below, users connect to a URL specified for WATM (https://myATMURL.trafficmanager.net) that abstracts the actual site URLs (https://app1URL.cloudapp.net and https://app2URL.cloudapp.net). Based on how you configure the criteria for when to route users, they will be sent to the correct actual site when the policy dictates. The policy options are round-robin, performance, or failover. For the sake of this whitepaper we will only be concerned with the option of failover.

Figure 5 Routing using the Azure Traffic Manager

When configuring WATM you will provide a new Traffic Manager DNS prefix. This is the URL prefix you will provide to your users to access your service. WATM now abstracts load balancing one level up and not at the datacenter level. The Traffic Manager DNS maps to a CNAME for all the deployments it manages.

Within WATM, you specify the priority of the deployments that users will be routed to when failure occurs. The WATM monitors the endpoints of the deployments and notes when the primary deployment fails. At failure, WATM will analyze the prioritized list of deployments and route users to the next one on the list.
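The prioritized failover described above can be pictured with a small sketch. This is only an illustrative model of the idea, not the Traffic Manager implementation or its API; the `Deployment` class, its fields, and `pick_deployment` are hypothetical names introduced for this example, and the URLs are the placeholder site URLs used earlier in this section.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    """A hypothetical deployment endpoint tracked by the router."""
    url: str
    priority: int   # lower number is tried first
    healthy: bool   # result of the most recent endpoint probe


def pick_deployment(deployments: Sequence[Deployment]) -> Optional[Deployment]:
    """Return the healthy deployment with the best priority, or None."""
    for candidate in sorted(deployments, key=lambda d: d.priority):
        if candidate.healthy:
            return candidate
    return None


# The primary has failed its probe, so traffic goes to the next entry.
deployments = [
    Deployment("https://app1URL.cloudapp.net", priority=1, healthy=False),
    Deployment("https://app2URL.cloudapp.net", priority=2, healthy=True),
]
chosen = pick_deployment(deployments)
```

In this model the router never needs to know whether the second deployment was dormant or active before the failure; it only follows the priority list.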

While WATM decides where to route traffic in a failover, you decide whether your failover domain is dormant or active while NOT in failover mode. That choice is independent of Azure Traffic Manager. WATM detects a failure in the primary site and rolls over to the failover site, regardless of whether that site is currently serving users. More information about dormant or active failover domains can be found in later sections of this paper.

For more information on how Azure Traffic Manager works, see the Azure Traffic Manager documentation.

Common Azure Disaster Scenarios

The following sections cover several different types of disaster scenarios. Datacenter failure is not the only cause of application-wide failures. Poor design or administration errors can also lead to outages. It is important to consider the possible causes of a failure during both the design and testing phases of your recovery plan. A good plan takes advantage of Azure features and augments them with application-specific strategies. The chosen response is dictated by the importance of the application, the RPO, and the RTO.

Application Failure

As mentioned previously, Azure Fabric Controller automatically handles failures resulting from the underlying hardware or operating system software in the host virtual machine. Azure creates a new instance of the role on a functioning server and adds it to the load balancer rotation. If the number of role instances is greater than one, Azure shifts processing to the other running role instances while replacing the failed node.

There are serious application errors that happen independently of any hardware or operating system failures. The application could fail due to the catastrophic exceptions caused by bad logic or data integrity issues. You must incorporate enough telemetry into the code so that a monitoring system can detect failure conditions and notify an application administrator. The administrator with full knowledge of the disaster recovery processes can make a decision to invoke a failover process. Alternatively, the administrator could simply accept an availability outage to resolve the critical errors.
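One simple way to turn telemetry into a failure signal is a sliding-window error rate. The sketch below is a minimal illustration under assumed values; the `FailureMonitor` class, the window size, and the threshold are hypothetical and would need tuning for a real application and monitoring system.

```python
from collections import deque


class FailureMonitor:
    """Tracks recent request outcomes and flags a sustained failure rate."""

    def __init__(self, window: int = 20, threshold: float = 0.5) -> None:
        self.outcomes = deque(maxlen=window)  # most recent outcomes only
        self.threshold = threshold            # failure ratio that triggers an alert

    def record(self, success: bool) -> bool:
        """Record one outcome; return True when an administrator should be notified."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough samples to judge yet
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.threshold


monitor = FailureMonitor(window=4, threshold=0.5)
alerts = [monitor.record(ok) for ok in [True, True, False, False, False]]
```

The alert only informs the administrator; the decision to fail over or accept the outage remains a human one, as described above.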

Data Corruption

Azure automatically stores your Azure SQL Database and Azure Storage data three times redundantly within different fault domains in the same datacenter. If geo-replication is used, it is stored three additional times in a different datacenter. However, if your users or your application corrupts that data in the primary copy, it quickly replicates to the other copies. Unfortunately, this results in three copies of corrupt data.

To manage potential corruption of your data, you have two options. First, you can manage a custom backup strategy. You can store your backups in Azure or on-premises depending on your business requirements or governance regulations. Another option is to use the new Point in Time Restore database recovery option for SQL Database. For more information, see the section on Data Strategies for Disaster Recovery.

Network Outage

When parts of the Azure network are down, you may not be able to get to your application or data. If one or more role instances are unavailable due to network issues, Azure leverages the remaining available instances of your application. If your application can’t access its data because of an Azure network outage, you can potentially run in a degraded mode locally by using cached data. You need to architect the disaster recovery strategy for running in degraded mode in your application. For some applications, this might not be practical. Another option is to store data in an alternate location until connectivity is restored. If degraded mode is not an option, the remaining options are application downtime or failover to an alternate datacenter. The design of an application running in degraded mode is as much a business decision as a technical one. This is discussed further in the section on Degraded Application Functionality.

Failure of Dependent Service

Azure provides many services that can experience periodic downtime. Consider Azure Shared Caching as an example. This multitenant service provides caching capabilities to your application. It is important to consider what happens in your application if the dependent service is unavailable. In many ways, this scenario is similar to the network outage scenario. However, considering each service independently results in potential improvements to your overall plan.

For example, with caching, there is a relatively new alternative to the multitenant Shared Caching model. Azure Caching on roles provides caching to your application from within your cloud service deployment. (This is also the recommended way to use Caching going forward). While it has a limitation of only being accessible from within a single deployment, there are potential disaster recovery benefits. First, the service now runs on roles that are local to your deployment. Therefore, you are better able to monitor and manage the status of the cache as part of your overall management processes for the cloud service. However, this type of caching also exposes new features. One of the new features is high availability for cached data. This helps to preserve cached data in the event that a single node fails by maintaining duplicate copies on other nodes. Note that high availability decreases throughput and increases latency because of the updating of the secondary copy on writes. It also doubles the amount of memory used for each item, so plan for that. This specific example demonstrates that each dependent service might have capabilities that improve your overall availability and resistance to catastrophic failures.

With each dependent service, you should understand the implications of a total outage. In the Caching example, it might be possible to access the data directly from a database until you restore the Caching capabilities. This would be a degraded mode in terms of performance but would provide full functionality with regard to data.
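The Caching fallback can be sketched as follows. This is a minimal illustration, not a real Caching client: `CacheUnavailable`, `read_product`, and the callables passed in are hypothetical stand-ins for whatever cache and database access the application actually uses.

```python
from typing import Callable, Dict, Tuple


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the service is down."""


def read_product(product_id: str,
                 cache_get: Callable[[str], str],
                 db_get: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (value, degraded); degraded is True when the cache was bypassed."""
    try:
        return cache_get(product_id), False
    except CacheUnavailable:
        # Degraded mode: slower, but functionally complete.
        return db_get(product_id), True


database: Dict[str, str] = {"p1": "Widget"}


def broken_cache(key: str) -> str:
    """Simulates a total outage of the caching service."""
    raise CacheUnavailable("cache outage")


value, degraded = read_product("p1", broken_cache, database.__getitem__)
```

Returning the `degraded` flag lets the caller record the condition in telemetry. The sketch handles only a cache outage; ordinary cache misses would need their own handling.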

Datacenter Down

The previous failures have primarily been failures that can be managed within the same Azure datacenter. However, you must also prepare for the possibility that there is an outage of the entire datacenter. When a datacenter goes down, the locally redundant copies of your data are not available. If you have enabled Geo-replication, there are three additional copies of your blobs and tables in a datacenter in a different region. When Microsoft declares the datacenter lost, Azure remaps all of the DNS entries to the geo-replicated datacenter. Note that you do not have any control over this process, and it will only occur for datacenter-wide failures. Because of this, you must also rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the section on Data Strategies for Disaster Recovery.

Azure Down

In disaster planning, you must consider the entire range of possible disasters. One of the most severe outages would involve all Azure datacenters simultaneously. As with other outages, you might decide that you will take the risk of temporary downtime in that event. Widespread failures that span datacenters should be much rarer than isolated failures involving dependent services or single datacenters. However, for some mission critical applications, you might decide that there must be a backup plan for this scenario as well. The plan for this event could include failing over to services in an alternative cloud or to a hybrid on-premises and cloud solution.

Degraded Application Functionality

A well designed application typically uses a collection of modules that communicate with each other through the implementation of loosely coupled information interchange patterns. A DR-friendly application particularly requires separation of tasks at the module level. This is to prevent an outage of a dependent service from bringing down the entire application. For example, consider a web commerce application for Company Y; the following modules might constitute the application:

  • Product Catalog: allows the users to browse products

  • Shopping Cart: allows users to add/remove products in their shopping cart

  • Order Status: shows the shipping status of user orders

  • Order Submission: finalizes the shopping session by submitting the order with payment

  • Order Processing: validates the order for data integrity and performs quantity availability check

When a dependency of a module in this application goes down, how does the module function until that dependency recovers? A well architected system implements isolation boundaries through separation of tasks both at design time and run time. You can categorize every failure as recoverable or non-recoverable. Non-recoverable errors will take down the module, but you can mitigate a recoverable error through alternatives. As discussed in the high availability section, you can hide some problems from users by handling faults and taking alternate actions. During a more serious outage, the application might be completely unavailable. However, a third option is to continue servicing users in degraded mode.

For instance, if the database for hosting orders goes down, the Order Processing module loses its ability to process sales transactions. Depending on the architecture, it might be hard or impossible for the Order Submission and Order Processing parts of the application to continue. If the application is not designed to handle this scenario, the entire application might go offline.

However, in this same scenario, it is possible that the product data is stored in a different location. In that case, the Product Catalog module can still be used for viewing products. In degraded mode, the application continues to be available to users for available functionality like viewing the product catalog. Other parts of the application, however, are unavailable, such as ordering or inventory queries.

Another variation of degraded mode centers on performance rather than capabilities. For example, consider a scenario where the product catalog was being cached with Azure Caching. If Caching became unavailable, it is possible that the application could go directly to the server storage to retrieve product catalog information. But this access might be slower than the cached version. Because of this, the application performance is degraded until the Caching service is fully restored.

Deciding how much of an application will continue to function in degraded mode is both a business and a technical decision. The application must also decide how to inform the users of the temporary problems. In this example, the application could allow viewing products and even adding them to a shopping cart. However, when the user attempts to make a purchase, the application notifies the user that the sales module is down temporarily. It is not ideal for the customer, but it does prevent an application-wide outage.
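The Company Y example can be modeled by mapping each module to the services it depends on and computing which modules remain usable during an outage. The dependency names below (`catalog_store`, `session_store`, `orders_db`, `payment_gateway`) are assumptions made for this sketch; the paper does not specify where each module stores its data.

```python
from typing import Dict, Set

# Hypothetical mapping of modules to the backing services each one needs.
MODULE_DEPENDENCIES: Dict[str, Set[str]] = {
    "Product Catalog": {"catalog_store"},
    "Shopping Cart": {"catalog_store", "session_store"},
    "Order Status": {"orders_db"},
    "Order Submission": {"orders_db", "payment_gateway"},
    "Order Processing": {"orders_db"},
}


def available_modules(failed_services: Set[str]) -> Set[str]:
    """Return the modules whose dependencies are all still reachable."""
    return {
        module
        for module, needs in MODULE_DEPENDENCIES.items()
        if not (needs & failed_services)
    }


# With the orders database down, browsing and the cart keep working.
still_up = available_modules({"orders_db"})
```

An application could use a result like `still_up` to decide which pages to serve normally and which to replace with a notice that the feature is temporarily unavailable.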

Data Strategies for Disaster Recovery

Handling data correctly is the hardest area to get right in any disaster recovery plan. Restoring data is also the part of the recovery process that typically takes the most time. Different choices in degradation modes result in difficult challenges for data recovery from failure and consistency after failure. One of the factors is the need to restore or maintain a copy of the application’s data. You will use this data for referential and transactional purposes at a secondary site. An on-premises setting requires an expensive and lengthy planning process to implement a multi-datacenter DR strategy. Conveniently, most cloud providers, including Azure, readily allow the deployment of applications to multiple datacenters. These datacenters are geographically located in such a way that multi-datacenter outages should be extremely rare. The strategy for handling data across datacenters is one of the contributing factors for the success of any disaster recovery plan.

The following sections discuss disaster recovery techniques related to data backups, reference data, and transactional data.

Backup and Restore

Regular backups of application data can support some disaster recovery scenarios. Different storage resources require different techniques.

For the Basic, Standard, and Premium SQL Database tiers, you can take advantage of Point in Time Restore to recover your database. For more information, see Point in Time Restore for Azure SQL Database. Another option is to use Active Geo-Replication for SQL Database. This automatically replicates database changes to secondary databases in the same Azure region or even in a different Azure region. This provides a potential alternative to some of the more manual data synchronization techniques presented in this paper. For more information, see Active Geo-Replication for Azure SQL Database.

You can also use a more manual approach for backup and restore. Use the DATABASE COPY command to create a copy of the database. You must use this command to get a backup with transactional consistency. You can also leverage the import/export service of Azure SQL Database. This supports exporting databases to BACPAC files that are stored in Azure Blob storage. The built-in redundancy of Azure Storage creates two replicas of the backup file in the same datacenter. However, the frequency of running the backup process determines your RPO, which is the amount of data you might lose in disaster scenarios. For example, you perform a backup at the top of the hour, and disaster occurs two minutes before the top of the hour. You lose 58 minutes of data that happened after the last backup was performed. Also, to protect against a datacenter outage, you should copy the BACPAC files to an alternate datacenter. You then have the option of restoring those backups in the alternate datacenter. For more details, see Business Continuity in Azure SQL Database.
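The relationship between backup frequency and data loss is simple arithmetic, sketched below. The function names and the dates are illustrative assumptions; the calculation ignores the time a backup itself takes to complete.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Data written after the last backup and before the disaster is lost."""
    if disaster < last_backup:
        raise ValueError("disaster time precedes the last backup")
    return disaster - last_backup


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    """Worst-case loss equals the interval, so it must not exceed the RPO."""
    return backup_interval <= rpo_target


# Backup at 9:00, disaster at 9:58: 58 minutes of data are lost.
lost = data_loss_window(datetime(2014, 1, 1, 9, 0), datetime(2014, 1, 1, 9, 58))
hourly_ok = meets_rpo(timedelta(hours=1), rpo_target=timedelta(minutes=30))
```

Here an hourly backup cannot satisfy a 30-minute RPO, so either the backup frequency or the RPO would have to change.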

For Azure Storage, you can develop your own custom backup process or use one of many third-party backup tools. Note that there are additional complexities in most application designs where storage resources reference each other. For example, consider a SQL Database that has a column that links to a blob in Azure Storage. If the backups do not happen simultaneously, the database might have a pointer to a blob that was not backed up before the failure. The application or disaster recovery plan must implement processes to handle this inconsistency after a recovery.
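A post-recovery consistency check for this situation might look like the sketch below. It assumes the database rows and the list of restored blob names have already been read into memory; `find_dangling_references` and the sample paths are hypothetical and do not use any Azure SDK.

```python
from typing import Dict, Iterable, List


def find_dangling_references(rows: Dict[int, str],
                             restored_blobs: Iterable[str]) -> List[int]:
    """Return ids of rows whose blob pointer has no matching restored blob."""
    present = set(restored_blobs)
    return sorted(row_id for row_id, blob in rows.items() if blob not in present)


# Row 3 was written after the last blob backup, so its blob is missing.
rows = {1: "invoices/a.pdf", 2: "invoices/b.pdf", 3: "invoices/c.pdf"}
restored = ["invoices/a.pdf", "invoices/b.pdf"]
dangling = find_dangling_references(rows, restored)
```

What to do with the dangling rows (flag them, remove them, or ask users to upload again) is an application decision that belongs in the recovery plan.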

Reference Data Pattern (Disaster Recovery)

As mentioned previously, reference data is read-only data that supports application functionality. It typically does not change frequently. Although backup and restore is one method to handle datacenter outages, the RTO is relatively long. When you deploy the application to a secondary datacenter, there are some strategies that improve the RTO for reference data.

Because reference data changes infrequently, you can improve the RTO by maintaining a permanent copy of the reference data in the secondary datacenter. This eliminates the time required to restore backups in the event of a disaster. To meet the multi-datacenter DR requirements, you must deploy the application and the reference data together on multiple datacenters. As mentioned in Reference Data Pattern (High Availability), you can deploy reference data to the role itself, external storage, or a combination of both. The intra compute node reference data deployment model implicitly satisfies the DR requirements. Reference data deployment to SQL Database requires that you deploy a copy of the reference data to each datacenter. The same strategy applies to Azure Storage. You must deploy a copy of any reference data stored on Azure Storage to the primary and the secondary datacenters.

Figure 6 Reference data publication to both primary and secondary datacenters

As mentioned previously, you must implement your own application-specific backup routines for all data, including reference data. Geo-replicated copies across datacenters are only used in a datacenter-wide outage. To prevent extended downtime, deploy the mission critical parts of the application’s data to the secondary datacenter. For an example of this topology, see the Active/Passive model.

Transactional Data Pattern (Disaster Recovery)

Implementation of a fully functional disaster mode strategy requires asynchronous replication of the transactional data to the secondary datacenter. The practical time windows within which the replication can occur will determine the RPO characteristics of the application. You might still recover the data lost from the primary datacenter during the replication window. You may also be able to merge with the secondary datacenter later.

The following architecture examples provide some ideas on different ways of handling transactional data in a failover scenario. It is important to note that these examples are not exhaustive. For example, intermediate storage locations such as queues could be replaced with Azure SQL Database. The queues themselves could be either Azure Storage or Service Bus queues (see Azure Queues and Azure Service Bus Queues - Compared and Contrasted). Server storage destinations could also vary, such as Azure tables instead of SQL Database. In addition, there might be worker roles that are inserted as intermediaries in various steps. The important thing is not to emulate these architectures exactly, but to consider various alternatives in the recovery of transactional data and related modules.

Consider an application that uses Azure Storage queues to hold transactional data. This allows worker roles to process the transactional data to the server database in a decoupled architecture. As discussed, this requires the transactions to use some form of temporary caching if the front-end roles require the immediate query of that data. Depending on the level of data loss tolerance, you could choose to replicate the queues, the database, or all of the storage resources. With only database replication, if the primary datacenter goes down, you can still recover the data in the queues when the primary datacenter comes back. The following diagram shows an architecture where the server database is synchronized across datacenters.

Figure 7 Replicate transactional data in preparation for DR

The biggest challenge to implement the previous architecture is the replication strategy between datacenters. Azure provides a SQL Data Sync service for this type of replication. However, the service is still in preview and is not recommended for production environments. For more information, see Business Continuity in Azure SQL Database. For production applications, you must invest in a third-party solution or create your own replication logic in code. Depending on the architecture, the replication might be bi-directional, which is also more complex. One potential implementation could make use of the intermediate queue in the previous example. The worker role that processes the data to the final storage destination could make the change in both the primary and secondary datacenters. These are not trivial tasks, and complete guidance for replication code is beyond the scope of this paper. The important point is that a lot of your time and testing should focus on how you replicate your data to the secondary datacenter. Additional processing and testing should be done to ensure that the failover and recovery processes correctly handle any possible data inconsistencies or duplicate transactions.
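The idea of a worker role that writes each queued transaction to both datacenters, while tolerating duplicate deliveries, can be sketched as follows. This is a deliberately simplified, in-memory model: `Store`, `process_message`, and the message shape are hypothetical, and a real implementation would also need retries and handling for the case where one of the two writes fails.

```python
from typing import Dict, List


class Store:
    """In-memory stand-in for one datacenter's database (hypothetical)."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def apply(self, txn_id: str, payload: dict) -> bool:
        """Apply a transaction once; return False if it was already applied."""
        if txn_id in self.rows:
            return False
        self.rows[txn_id] = payload
        return True


def process_message(message: dict, primary: Store, secondary: Store) -> None:
    """Write one queued transaction to both datacenters, tolerating replays."""
    for store in (primary, secondary):
        store.apply(message["id"], message["payload"])


primary, secondary = Store(), Store()
queue: List[dict] = [
    {"id": "t1", "payload": {"amount": 10}},
    {"id": "t2", "payload": {"amount": 25}},
    {"id": "t1", "payload": {"amount": 10}},  # duplicate delivery
]
for msg in queue:
    process_message(msg, primary, secondary)
```

Keying each write on a transaction id makes the operation idempotent, which is one way to address the duplicate transactions mentioned above.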

Note

Most of this paper focuses on Platform as a Service. However, there are additional replication and availability options for hybrid applications that use Azure Virtual Machines. These hybrid applications use Infrastructure as a Service (IaaS) to host SQL Server on virtual machines in Azure. This allows traditional availability approaches in SQL Server, such as AlwaysOn Availability Groups or Log Shipping. Some techniques, such as AlwaysOn, only work between on-premises SQL Servers and Azure virtual machines. For more information, see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines.

Consider a second architecture that operates in degraded mode. The application on the secondary datacenter deactivates all nonessential functionality, such as reporting, BI, or draining queues. It only accepts the most important types of transactional workflows as defined by business requirements. The system captures the transactions and writes them to queues. The system may postpone processing the data during the initial stage of the outage. If the system on the primary datacenter reactivates within the expected time window, the worker roles in the primary datacenter can drain the queues. This process eliminates the need for database merging. If the primary datacenter outage goes beyond the tolerable window, the application can start processing the queues. In this scenario, the database on the secondary contains incremental transactional data that must be merged once the primary reactivates. The following diagram shows this strategy for temporarily storing transactional data until the primary datacenter is restored.

Figure 8 Degraded application mode for transaction capture only
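The decision logic of this capture-only strategy can be summarized in a small sketch. The `QueueAction` names and the `decide` function are hypothetical labels for the four situations described above; the tolerable window itself is a business-defined value.

```python
from datetime import timedelta
from enum import Enum


class QueueAction(Enum):
    HOLD = "capture only; postpone processing"
    DRAIN_ON_PRIMARY = "primary drains the queues; no merge needed"
    PROCESS_ON_SECONDARY = "secondary starts processing the queues"
    MERGE_INTO_PRIMARY = "merge the secondary's incremental data into the primary"


def decide(outage: timedelta, tolerable: timedelta, primary_up: bool) -> QueueAction:
    """Choose what to do with captured transactions at this point in the outage."""
    within_window = outage <= tolerable
    if primary_up:
        # Primary is back: a merge is needed only if the secondary already processed data.
        return QueueAction.DRAIN_ON_PRIMARY if within_window else QueueAction.MERGE_INTO_PRIMARY
    return QueueAction.HOLD if within_window else QueueAction.PROCESS_ON_SECONDARY


window = timedelta(hours=1)
early = decide(timedelta(minutes=10), window, primary_up=False)
late = decide(timedelta(hours=2), window, primary_up=False)
```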

For more discussion of data management techniques for resilient Azure applications, see Failsafe: Guidance for Resilient Cloud Architectures.

Deployment Topologies for Disaster Recovery

Prepare mission critical applications for the eventuality of the entire datacenter going down. You do this by incorporating a multi-datacenter deployment strategy into the operational planning. Multi-datacenter deployments might involve IT Pro processes to publish the application and reference data to the secondary datacenter after experiencing a disaster. If the application requires instant failover, the deployment process may involve an active/passive or an active/active setup. This type of deployment has existing instances of the application running in the alternate datacenter. As discussed, a routing tool such as the Azure Traffic Manager provides load balancing services at the DNS level. It can detect outages and route the users to different datacenters when needed.

Part of a successful Azure disaster recovery is architecting that recovery into the solution from the start. The cloud provides additional options for recovering from failures during a disaster that are not available in a traditional hosting provider. Specifically, you can dynamically and quickly allocate resources to a different datacenter. Therefore, you won’t pay a lot for idle resources while waiting for a failure to occur.

The following sections cover different deployment topologies for disaster recovery. Typically, there is a tradeoff in increased cost or complexity for additional availability.

Single-Region Deployment

A single-region deployment is not really a disaster recovery topology, but it is included here to contrast with the other architectures. Single-region deployments are common for applications in Azure. They are not, however, a serious contender for a disaster recovery plan. The following diagram depicts an application running in a single Azure datacenter. As discussed previously, the Azure Fabric Controller and the use of fault and upgrade domains increase availability of the application within the datacenter.

Figure 9 Single Region Deployment

Here it is apparent that the database is a single point of failure. Even though Azure replicates the data across different fault domains to internal replicas, this all occurs in the same datacenter. It cannot withstand a catastrophic failure. If the datacenter goes down, all of the fault domains go down, which includes all service instances and storage resources.

For all but the least critical applications, you must devise a plan to deploy your application across multiple datacenters in different regions. You should also consider RTO and cost constraints in considering which deployment topology to use.

Let’s take a look now at specific patterns to support failover across different datacenters. These examples all use two datacenters to describe the process.

Redeploy

In this pattern, only the primary datacenter has applications and databases running. The secondary datacenter is not set up for an automatic failover. So when disaster occurs, you must spin up all the parts of the service in the new datacenter. This includes uploading a cloud service to Azure, deploying the cloud services, restoring the data, and changing the DNS to reroute the traffic.

While this is the most affordable of the multi-region options, it has the worst RTO characteristics. In this model, the service package and database backups are stored either on-premises or in the blob storage of the secondary datacenter. However, you must deploy a new service and restore the data before it resumes operation. Even if you fully automate the data transfer from backup storage, spinning up the new database environment consumes a lot of time. Moving data from the backup disk storage to the empty database on the secondary datacenter is the most expensive part of restore. You must do this, however, to bring the new database to an operational state since it is not replicated.

The best approach is to store the service packages in Azure Blob storage in the secondary datacenter. This eliminates the need to upload the package to Azure, which is what happens when you deploy from an on-premises development machine. You can quickly deploy the service packages to a new cloud service from blob storage by using PowerShell scripts.

This option is only practical for non-critical applications that can tolerate a high RTO. For instance, this might work for an application that can be down for several hours, but should be running again within 24 hours.

Figure 10 Redeploy to a secondary Azure datacenter

Active/Passive

The Active/Passive pattern is the choice that many companies favor. This pattern provides improvements to the RTO with a relatively small increase in cost over the redeploy pattern. In this scenario, there is again a primary and a secondary Azure datacenter. All of the traffic goes to the active deployment on the primary datacenter. The secondary datacenter is better prepared for disaster recovery because the database is running on both datacenters. Additionally, there is a synchronization mechanism in place between them. This standby approach can involve two variations: a database-only approach or a complete deployment in the secondary datacenter.

In the first variation of the Active/Passive pattern, only the primary datacenter has a deployed cloud service application. However, unlike the redeploy pattern, both datacenters are synchronized with the contents of the database (see the section on Transactional Data Pattern (Disaster Recovery)). When a disaster occurs, there are fewer activation requirements. You start the application in the secondary datacenter, change connection strings to the new database, and change the DNS entries to reroute traffic.

Like the redeploy pattern, you should already have stored the service packages in Azure Blob storage in the secondary datacenter for faster deployment. Unlike the redeploy pattern, you don’t incur the majority of the overhead that database restore operations require. The database is ready and running. This saves a significant amount of time, making this an affordable and, therefore, the most popular DR pattern.

Figure 11 Active/Passive (Database Only)

In the second variation of the Active/Passive pattern, the primary and secondary datacenters have a full deployment. This deployment includes the cloud services and a synchronized database. However, only the primary datacenter is actively handling network requests from the users. The secondary datacenter becomes active only when the primary datacenter goes down. In that case, all new network requests route to the secondary region. Azure Traffic Manager can manage this failover automatically.

Failover occurs faster than in the database-only variation because the services are already deployed. This pattern provides a very low RTO; the secondary failover datacenter must be ready to go immediately after failure of the primary datacenter.

Along with quicker response, this pattern also has an additional advantage of pre-allocating and deploying backup services. You don’t have to worry about a datacenter not having the space to allocate new instances in a disaster. This is important if your secondary Azure datacenter is nearing capacity. There is no guarantee (SLA) that you will instantly be able to deploy a number of new cloud services in any datacenter.

For the fastest response time with this model, you must have similar scale (number of role instances) in the primary and secondary datacenters. Despite the advantages, paying for unused compute instances is costly, and this is often not the most prudent financial choice. Because of this, it is more common to use a slightly scaled-down version of cloud services on the secondary datacenter. Then you can quickly fail over and scale out the secondary deployment when necessary. You should automate the failover process so that, once the primary datacenter fails, you activate additional instances depending on the load. This could involve some type of automatic scaling mechanism, such as the AutoScale Preview or the Autoscaling Application Block. The following diagram shows the model where the primary and secondary datacenters contain a fully deployed cloud service in an Active/Passive pattern.

Figure 12 Active/Passive (Full Replica)

Active/Active

By now, you’re probably figuring out the evolution of the patterns – decreasing the RTO increases costs and complexity. The Active/Active solution actually breaks this tendency with regard to cost. In an Active/Active pattern, the cloud services and database are fully deployed in both datacenters. Unlike the Active/Passive model, both datacenters receive user traffic. This option yields the quickest recovery time. The services are already scaled to handle a portion of the load at each datacenter. The DNS is already enabled to use the secondary datacenter. There is additional complexity in determining how to route users to the appropriate datacenter. Round-robin scheduling might be possible. It is more likely that certain users would use a specific datacenter where the primary copy of their data resides.

In case of failover, simply disable DNS to the primary datacenter, which routes all traffic to the secondary datacenter. Even in this model, there are some variations. For example, the following diagram shows a model where the primary datacenter owns the master copy of the database. The cloud services in both datacenters write to that primary database. The secondary deployment can read from the primary or replicated database. Replication in this example happens one way.

Figure 13 Active/Active

There is a downside to the Active/Active architecture in the previous diagram. The second datacenter must access the database in the first datacenter because the master copy resides there. Performance significantly drops off when you access data from outside a datacenter. In cross-datacenter database calls, you should consider some type of batching strategy to improve the performance of these calls. For more information, see Batching Techniques for SQL Database Applications in Azure. An alternative architecture could involve each datacenter accessing its own database directly. In that model, some type of bidirectional replication would be required to synchronize the databases in each datacenter.
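As a rough illustration of the batching idea, the following Python sketch groups rows so that each cross-datacenter round trip carries many rows instead of one. The function names and the `send_batch` callable are assumptions made for this example; in a real application `send_batch` would wrap whatever bulk or multi-row write your data access layer provides.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, partially filled batch


def write_in_batches(rows: Iterable[T],
                     send_batch: Callable[[List[T]], None],
                     size: int = 100) -> int:
    """Send rows in batches and return the number of round trips used."""
    round_trips = 0
    for batch in batches(rows, size):
        send_batch(batch)  # one cross-datacenter call per batch
        round_trips += 1
    return round_trips
```

With a batch size of 100, writing 250 rows takes three calls instead of 250. The right batch size depends on row size and latency, so treat the default here as a placeholder to tune.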

In the Active/Active pattern, you might not need as many instances on the primary datacenter as you would in the Active/Passive pattern. If you have ten instances on the primary datacenter in an Active/Passive architecture, you might need only five in each datacenter in an Active/Active architecture. Both regions now share the load. This could be a cost savings over the Active/Passive pattern if you kept a warm standby on the passive datacenter with ten instances waiting for failover.
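The instance arithmetic in this comparison can be written down directly. The sketch below is only a back-of-the-envelope model in Python; the function names and the `standby_fraction` parameter are assumptions for illustration, and real costs also depend on instance size, storage, and data transfer.

```python
import math


def active_passive_total(peak_instances: int,
                         standby_fraction: float = 1.0) -> int:
    """Total instances when the primary carries the full load and the
    secondary keeps a warm standby sized as a fraction of the primary."""
    return peak_instances + math.ceil(peak_instances * standby_fraction)


def active_active_total(peak_instances: int, datacenters: int = 2) -> int:
    """Total instances when the load is shared evenly across datacenters."""
    return math.ceil(peak_instances / datacenters) * datacenters
```

For the example in the text, `active_passive_total(10)` gives 20 instances (ten active plus ten on warm standby), while `active_active_total(10)` gives 10 (five in each datacenter). A half-sized standby, `active_passive_total(10, 0.5)`, gives 15.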

Realize that until you restore the primary datacenter, the secondary datacenter may receive a sudden surge of new users. If there are 10,000 users on each datacenter when the primary datacenter fails, the secondary datacenter suddenly has to handle 20,000 users. Monitoring rules on the secondary datacenter must detect this increase and double the instances in the secondary datacenter. For more information on this, see the section on Failure Detection.
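A monitoring rule of this kind reduces to a capacity calculation. The following Python sketch assumes a fixed, known capacity per instance (`users_per_instance`), which is a simplification introduced here; real rules usually key off performance counters such as CPU or queue length rather than a raw user count.

```python
import math


def required_instances(active_users: int,
                       users_per_instance: int,
                       minimum: int = 1) -> int:
    """Instances needed to serve the observed number of users."""
    if users_per_instance < 1:
        raise ValueError("users_per_instance must be at least 1")
    return max(minimum, math.ceil(active_users / users_per_instance))


def scale_decision(current_instances: int,
                   active_users: int,
                   users_per_instance: int) -> int:
    """Instances to add (positive) or remove (negative)."""
    return required_instances(active_users, users_per_instance) - current_instances
```

If each instance is assumed to handle 2,000 users, five instances serve 10,000 users. When the load jumps to 20,000 users, `scale_decision(5, 20000, 2000)` returns 5, which doubles the deployment to ten instances.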

Hybrid On-Premises and Cloud Solutions

One additional strategy for disaster recovery is to architect a hybrid application that runs on-premises and in the cloud. Depending on the application, the primary datacenter might be either location. Consider the previous architectures and imagine the primary or secondary datacenters as an on-premises location.

There are some challenges in these hybrid architectures. First, most of this paper has addressed Platform as a Service (PaaS) architecture patterns. Typical PaaS applications in Azure rely on Azure-specific constructs such as roles, cloud services, and the Fabric Controller. To create an on-premises solution for this type of PaaS application would require a significantly different architecture. This might not be feasible from a management or cost perspective.

However, a hybrid solution for disaster recovery has fewer challenges for traditional architectures that have simply moved to the cloud. This is true of architectures that use Infrastructure as a Service (IaaS). IaaS applications use Virtual Machines in the cloud that can have direct on-premises equivalents. The use of virtual networks also allows you to connect machines in the cloud with on-premises network resources. This opens up several possibilities that are not possible with PaaS-only applications. For example, SQL Server can take advantage of disaster recovery solutions such as AlwaysOn Availability Groups and database mirroring. For details, see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines.

IaaS solutions also provide an easier path for on-premises applications to use Azure as the failover option. You might have a fully functioning application in an existing on-premises datacenter. However, what if you lack the resources to maintain a geographically separate datacenter for failover? You might decide to use Virtual Machines and Virtual Networks to get your application running in Azure. Define processes that synchronize data to the cloud. The Azure deployment then becomes the secondary datacenter to use for failover. The primary datacenter remains the on-premises application. For more information about IaaS architectures and capabilities, see Virtual Machines and Virtual Network.

Alternative Clouds

There are situations when even the robustness of Microsoft’s cloud might not be enough for your availability requirements. In the past year or so, there have been a few severe outages of various cloud platforms, including Amazon Web Services (AWS) and Azure. Even the best preparation and design for backup systems can fall short when your entire cloud platform is unavailable.

You’ll want to compare availability requirements with the cost and complexity of increased availability. Perform a risk analysis, and define the RTO and RPO for your solution. If your application cannot tolerate any downtime, it might make sense for you to consider using another cloud solution. Unless the entire Internet goes down simultaneously, another cloud solution, such as Rackspace or Amazon Web Services, will still be functioning in the rare event that Azure is completely down.

As with the hybrid scenario, the failover deployments in the previous DR architectures can also exist within another cloud solution. Alternative cloud DR sites should only be used for those solutions with an RTO that allows very little, if any, downtime. Note that a solution that uses a DR site outside of Azure will require more work to configure, develop, deploy, and maintain. It is also more difficult to implement best practices in a cross-cloud architecture. Although cloud platforms have similar high-level concepts, the APIs and architectures are different. Should you decide to split your DR among different platforms, it would make sense to architect abstraction layers in the design of the solution. If you do this, you won’t need to develop and maintain two different versions of the same application for different cloud platforms in case of disaster. As with the hybrid scenario, the use of Virtual Machines might be easier in these cases than cloud-specific PaaS designs.
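One way to picture such an abstraction layer is an interface that application code depends on, with one adapter per cloud platform behind it. The Python sketch below is illustrative only: `BlobStore`, `InMemoryStore`, and `save_order` are hypothetical names, and the in-memory class stands in for real provider adapters, which would call each platform's own storage SDK.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BlobStore(ABC):
    """Storage operations the application needs, independent of any cloud."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(BlobStore):
    """Stand-in for a provider-specific adapter (hypothetical)."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def save_order(store: BlobStore, order_id: str, payload: bytes) -> str:
    """Application code: written once, runs against any adapter."""
    key = f"orders/{order_id}"
    store.put(key, payload)
    return key
```

Because `save_order` only sees the `BlobStore` interface, the same application logic can run against the primary platform and the DR platform. Only the adapters differ, which keeps the platform-specific code small and easier to test.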

Automation

Some of the patterns we just discussed require quick activation of off-line deployments as well as restoration of specific parts of a system. Automation, or scripting, supports the ability to activate resources on-demand and deploy solutions rapidly. In this paper, DR-related automation is equated with Azure PowerShell, but the Service Management REST API is also an option. Developing scripts helps to manage the parts of DR that Azure does not transparently handle. This has the benefit of producing consistent results each time, which minimizes the chance of human error. Having pre-defined DR scripts also reduces the time to rebuild a system and its constituent parts in the midst of a disaster. You don’t want to try to manually figure out how to restore your site while it is down and losing money every minute.
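The structure of a scripted recovery procedure can be sketched independently of the scripting tool. The example below uses Python purely to show the shape of the idea (in practice these steps would be Azure PowerShell scripts or Service Management REST API calls, as noted above). The `run_runbook` function and the step names are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_runbook(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run recovery steps in order; stop at the first failure.

    Returns a (step name, outcome) record for every step attempted, so the
    same sequence produces the same audit trail on every run.
    """
    results: List[Tuple[str, str]] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # record the failure and stop
            results.append((name, f"failed: {exc}"))
            break
        results.append((name, "ok"))
    return results
```

Stopping at the first failed step is a design choice made for this sketch: later steps, such as redirecting traffic, usually should not run if an earlier step, such as restoring the database, did not succeed.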

Once you create the scripts, test them over and over from start to finish. After you verify their basic functionality, make sure that you test them in Disaster Simulation. This helps uncover flaws in the scripts or processes.

A best practice with automation is to create a repository of Azure DR PowerShell scripts. Clearly mark and categorize them for easy lookup. Designate one person to manage the repository and versioning of the scripts. Document them well with explanations of parameters and examples of script use. Also ensure that you keep this documentation in sync with your Azure deployments. This underscores the purpose of having one person in charge of all parts of the repository.

Failure Detection

To correctly handle problems with availability and disaster recovery, you must be able to detect and diagnose failures. You should do advanced server and deployment monitoring so you can quickly know when a system or its parts are suddenly down. Monitoring tools that look at the overall health of the cloud service and its dependencies can perform part of this work. One Microsoft tool is System Center 2012 R2 Operations Manager (SCOM). Other third-party tools, such as AzureWatch, can also provide monitoring capabilities. AzureWatch also allows you to automate scalability. Most monitoring solutions track key performance counters and service availability.
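A common way to turn availability probes into a failure signal is to require several consecutive failures before declaring a service down, so that a single dropped request does not trigger failover. The Python sketch below shows that rule in isolation; the class name and the default threshold of three are assumptions for this example, not settings from any of the tools named above.

```python
class FailureDetector:
    """Declare a service down after N consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, probe_succeeded: bool) -> bool:
        """Record one probe result; return True if the service is considered down."""
        if probe_succeeded:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A lower threshold detects outages faster but raises the chance of failing over on a transient error; the probe interval multiplied by the threshold gives the approximate detection delay to account for in the RTO.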

Although these tools are vital, they do not replace the need to plan for fault detection and reporting within a cloud service. You must plan to properly use Azure diagnostics. Custom performance counters or event log entries can also be part of the overall strategy. This provides more data during failures to quickly diagnose the problem and restore full capabilities. It also provides additional metrics for the monitoring tools to use to determine application health. For more information, see Collect Logging Data by Using Azure Diagnostics. For a discussion of how to plan for an overall “health model”, see Failsafe: Guidance for Resilient Cloud Architectures.

Disaster Simulation

Simulation testing involves creating small real-life situations on the actual work floor to observe how the team members react. Simulations also show how effective the solutions outlined in the recovery plan are. Carry out simulations in such a way that the scenarios created do not disrupt actual business while still feeling like “real” situations.

Consider architecting a type of “switchboard” in the application to manually simulate availability issues. For instance, through a soft switch, trigger database access exceptions for an ordering module by causing it to malfunction. Similar lightweight approaches can be taken for other modules at the network interface level.
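A switchboard of this kind can be as simple as a set of named fault flags that modules consult before doing real work. The Python sketch below is a minimal illustration; `Switchboard`, `DatabaseUnavailable`, `load_order`, and the switch name `orders.database` are all hypothetical names chosen for the example.

```python
from typing import Dict, Set


class DatabaseUnavailable(Exception):
    """Raised when the simulated database outage is switched on."""


class Switchboard:
    """Named soft switches for injecting faults during a simulation."""

    def __init__(self) -> None:
        self._faults: Set[str] = set()

    def enable(self, name: str) -> None:
        self._faults.add(name)

    def disable(self, name: str) -> None:
        self._faults.discard(name)

    def is_on(self, name: str) -> bool:
        return name in self._faults


def load_order(switchboard: Switchboard,
               orders: Dict[str, str],
               order_id: str) -> str:
    """Ordering-module read that honors the simulated outage switch."""
    if switchboard.is_on("orders.database"):
        raise DatabaseUnavailable("simulated outage for orders.database")
    return orders[order_id]
```

Calling `enable("orders.database")` makes every `load_order` call fail the way a real database outage would, and `disable` restores normal behavior immediately. That keeps the scenario fully controllable during an exercise.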

Any issues that were inadequately addressed are highlighted during simulation. The simulated scenarios must be completely controllable. This means that, even if the recovery plan seems to be failing, you can restore the situation back to normal without causing any significant damage. It’s also important that you inform higher-level management about when and how the simulation exercises will be executed. This plan should include information on the time or resources that may become unproductive while the simulation test is running. When subjecting your disaster recovery plan to a test, it is also important to define how success will be measured.

There are several other techniques that you can use to test disaster recovery plans. However, most of them are simply altered versions of these basic techniques. The main motive behind this testing is to evaluate how feasible and how workable the recovery plan is. Disaster recovery testing focuses on the details to discover holes in the basic recovery plan.

Checklist

Let’s summarize the key points that have been covered in this paper. This summary will act as a checklist of items you should consider for your own availability and disaster recovery planning. These are best practices that have been useful for customers seeking to get serious about implementing a successful solution: one that truly works, recovering in a timely and successful manner when system failure hits.

  1. Conduct a risk assessment for each application because each can have different requirements. Some applications are more critical than others and would justify the extra cost to architect them for disaster recovery.

  2. Use this information to define the RTO and RPO for each application.

  3. Design for failure, starting with the application architecture.

  4. Implement best practices for high availability, while balancing cost, complexity, and risk.

  5. Implement disaster recovery plans and processes.

    1. Consider failures that span the module level all the way to a complete cloud outage.

    2. Establish backup strategies for all reference and transactional data.

    3. Choose a multi-site disaster recovery architecture.

  6. Define a specific owner for disaster recovery processes, automation, and testing. The owner should manage and own the entire process.

  7. Document the processes so they are easily repeatable. Although there is one owner, multiple people should be able to understand and follow the processes in an emergency.

  8. Train the staff to implement the process.

  9. Use regular disaster simulations for both training and validation of the process.
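To make items 2 and 9 concrete, the results of a disaster simulation can be compared against each application's stated objectives. The Python sketch below is one possible shape for that check. The names are hypothetical, and it assumes that the worst-case data loss equals the interval between backups or replication runs.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryObjectives:
    application: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of data loss


def simulation_gaps(objectives: RecoveryObjectives,
                    measured_recovery: timedelta,
                    backup_interval: timedelta) -> List[str]:
    """Compare simulation results with the objectives; list any misses."""
    gaps: List[str] = []
    if measured_recovery > objectives.rto:
        gaps.append(f"{objectives.application}: recovery took "
                    f"{measured_recovery}, RTO is {objectives.rto}")
    # Assumption: everything written since the last backup could be lost.
    if backup_interval > objectives.rpo:
        gaps.append(f"{objectives.application}: backup interval "
                    f"{backup_interval} exceeds RPO {objectives.rpo}")
    return gaps
```

An empty list means the simulated recovery met both objectives. Any entries point to the part of the plan, recovery speed or backup frequency, that needs work before the next exercise.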

Summary

When hardware or software fails within Azure, the techniques and strategies for managing the failure are different from those used when failures occur in on-premises systems. The main reason for this is that cloud solutions typically have more dependencies on infrastructure that’s dispersed across the datacenter and managed as separate services. You must deal with partial failures using high availability techniques. To manage more severe failures, possibly due to a disaster event, use disaster recovery strategies.

Azure detects and handles many failures, but there are many types of failures that require application-specific strategies. You must actively prepare for and manage the failures of applications, services, and data.

When creating your application’s availability and disaster recovery plan, consider the business consequences of the application’s failure. Defining the processes, policies, and procedures to restore critical systems after a catastrophic event takes time, planning, and commitment. And once you establish the plans, you cannot stop there. You must regularly analyze, test, and continually improve the plans based on your application portfolio, business needs, and the technologies available to you. Azure provides both new capabilities and new challenges to creating robust applications that withstand failures.

See Also

Other Resources

Azure Business Continuity Technical Guidance
Business Continuity in Azure SQL Database
High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines
Failsafe: Guidance for Resilient Cloud Architectures
Best Practices for the Design of Large-Scale Services on Azure Cloud Services