Disaster Recovery and High Availability for Windows Azure Applications
Primary Authors: Michael McKeown, Cloud Solutions Architect (Aditi); Hanu Kommalapati, Principal Technology Evangelist (Microsoft)
Contributing Author: Jason Roth
Reviewers: Patrick Wickline, Dennis Mulder, Steve Howard, Tim Wieman, James Podgorski, Ryan Berry, Shweta Gupta, Harry Chen, Jim Blanchard, Andy Roberts, Christian Els
This paper focuses on high availability for applications running in Windows Azure. An overall strategy for high availability also includes the area of disaster recovery (DR). Planning for failures and disasters in the Cloud requires you to recognize failures quickly and implement a strategy that matches your tolerance for the application’s downtime. Additionally, you have to consider the extent of data loss the application can tolerate without adverse business consequences as it is restored.
When we ask customers if they are prepared for temporary and large-scale failures, most say they are. However, before you answer that question for yourself, consider: does your company rehearse these failures? Do you test the recovery of databases to ensure you have the correct processes in place? Chances are, probably not. That’s because successful DR starts with lots of planning and architecting to implement these processes. Just like many other non-functional requirements, such as security, disaster recovery rarely gets the up-front analysis and time allocation it requires. Also, most customers don’t have the budget for geographically distributed datacenters with redundant capacity. Consequently, even mission critical applications are frequently excluded from proper DR planning.
Cloud platforms, such as Windows Azure, provide geographically dispersed datacenters around the world. These platforms also provide capabilities that support availability and enable a variety of DR scenarios. Now, every mission critical Cloud application can be given due consideration for disaster proofing of the system. Windows Azure has resiliency and DR built into many of its services. These platform features must be studied carefully and supplemented with application strategies.
This whitepaper outlines the architecture steps necessary to disaster-proof a Windows Azure deployment so that the larger business continuity process can be implemented. A business continuity plan is a roadmap for continuing operations under adverse conditions. This could be a failure with technology, such as a downed service, or a natural disaster, such as a storm or power outage. Application resiliency for disasters is only a subset of the larger DR process as described in this NIST document: Contingency Planning Guide for Information Technology Systems.
The following sections define different levels of failures, techniques to deal with them, and architectures that support these techniques. This information provides input into your DR processes and procedures to ensure your DR strategy works correctly and efficiently.
Characteristics of Resilient Cloud Applications
A well architected application can withstand capability failures at a tactical level and can also tolerate strategic system-wide failures at the datacenter level. The following sections define the terminology referenced throughout the document to describe various aspects of resilient cloud services.
A highly available cloud application implements strategies to absorb outages of its dependencies, such as the managed services offered by the cloud platform. In spite of possible failures of the Cloud platform capabilities, this approach permits the application to continue to exhibit the expected functional and non-functional systemic characteristics, as defined by its designers. This is covered in depth in the paper Failsafe: Guidance for Resilient Cloud Architectures.
The implementation of the application needs to factor in the probability of a capability outage. It also needs to consider the business impact of that outage before diving deep into the implementation strategies. Without due consideration of the business impact and the probability of hitting the risk condition, the implementation can be expensive and potentially unnecessary.
Consider an automotive analogy for high availability. Even quality parts and superior engineering do not prevent occasional failures. For example, when your car gets a flat tire, the car still runs, but it is operating with degraded functionality. If you planned for this potential occurrence, you can use one of those thin-rimmed spare tires until you reach a repair shop. Although the spare tire does not permit fast speeds, you can still operate the vehicle until the tire is replaced. In the same way, a cloud service that plans for potential loss of capabilities can prevent a relatively minor problem from bringing down the entire application. This is true even if the cloud service must run with degraded functionality.
There are a few key characteristics of highly available cloud services: availability, scalability, and fault tolerance. Although these characteristics are interrelated, it is important to understand each and how they contribute to the overall availability of the solution.
An available application considers the availability of its underlying infrastructure and dependent services. Available applications remove single points of failure through redundancy and resilient design. When we talk about availability in Windows Azure, it is important to understand the concept of the effective availability of the platform. Effective availability considers the Service Level Agreements (SLA) of each dependent service and their cumulative effect on the total system availability.
System availability is the measure of what percentage of a time window the system will be able to operate. For example, the availability SLA of at least two instances of a web or worker role in Windows Azure is 99.95%. This percentage represents the amount of time that the roles are expected to be available (99.95%) out of the total time they could be available (100%). It does not measure the performance or functionality of the services running on those roles. However, the effective availability of your cloud service is also affected by the various SLAs of the other dependent services. The more moving parts within the system, the more care must be taken to ensure the application can resiliently meet the availability requirements of its end users.
Consider the following SLAs for a Windows Azure service that uses Windows Azure roles (Compute), Windows Azure SQL Database, and Windows Azure Storage.
|Windows Azure Service|SLA|Potential Minutes Downtime/Month (30 days)|
|---|---|---|
|Compute|99.95%|21.6 minutes|
|SQL Database|99.9%|43.2 minutes|
|Storage|99.9%|43.2 minutes|
You must plan for all services to potentially go down at different times. In this simplified example, the total number of minutes per month that the application could be down is 108 minutes. A 30-day month has a total of 43,200 minutes, so 108 minutes is 0.25% of that total. This gives you an effective availability of 99.75% for the cloud service.
However, using availability techniques described in this paper can improve this. For example, if you design your application to continue running when SQL Database is unavailable, you can remove that line from the equation. This might mean that the application runs with reduced capabilities, so there are also business requirements to consider. For a complete list of Windows Azure SLA’s, see Service Level Agreements.
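The effective availability arithmetic above can be expressed as a short calculation. The following sketch, written in Python purely for illustration, uses the SLA figures from this example and the worst-case assumption that each dependent service fails at a different time:

```python
# Worst-case effective availability from dependent-service SLAs.
# SLA values follow this paper's example; substitute your own.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

slas = {
    "Compute (two or more role instances)": 0.9995,
    "SQL Database": 0.999,
    "Storage": 0.999,
}

# Assume each service goes down at a different time, so the potential
# downtime windows add up.
total_downtime = sum((1 - sla) * MINUTES_PER_MONTH for sla in slas.values())
effective_availability = 1 - total_downtime / MINUTES_PER_MONTH

print(f"Potential downtime: {total_downtime:.0f} minutes/month")  # 108
print(f"Effective availability: {effective_availability:.2%}")    # 99.75%
```

Note how designing the application to tolerate an outage of one dependency (for example, SQL Database) removes that service’s term from the sum and raises the effective availability.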
Scalability directly affects availability; an application that fails under increased load is no longer available. Scalable applications are able to meet increased demand with consistent results in acceptable time windows. When a system is scalable, it scales horizontally or vertically to manage increases in load while maintaining consistent performance. In basic terms, horizontal scaling adds more machines of the same size while vertical scaling increases the size of the existing machines. In the case of Windows Azure, you have vertical scaling options for selecting various machine sizes for compute. But changing the machine size requires a re-deployment. Therefore, the most flexible solutions are designed for horizontal scaling. This is especially true for compute, because you can easily increase the number of running instances of any web or worker role to handle increased traffic through the Azure Web portal, PowerShell scripts, or code. Base scaling decisions on increases in specific monitored metrics, so that user-perceived performance does not suffer a noticeable drop under load. Typically, the web and worker roles store any state externally to allow for flexible load balancing and to gracefully handle any changes to instance counts. Horizontal scaling also works well with services, such as Windows Azure Storage, which do not provide tiered options for vertical scaling.
Cloud deployments should be seen as a collection of scale-units, which allows the application to be elastic in servicing the throughput needs of the end users. The scale units are easier to visualize at the web and application server level, as Windows Azure already provides stateless compute nodes through web and worker roles. Adding more compute scale-units to the deployment will not cause any application state management side effects, because compute scale-units are stateless. A storage scale-unit is responsible for managing a partition of data, either structured or unstructured. Examples of storage scale-units include Windows Azure Table partition, Blob container, and SQL Database. Even the usage of multiple Windows Azure Storage accounts has a direct impact on application scalability. A highly scalable cloud service needs to be designed to incorporate multiple storage scale-units. For instance, if an application uses relational data, the data needs to be partitioned across several SQL Databases so that the storage can keep up with the elastic compute scale-unit model. Similarly, Windows Azure Storage allows data partitioning schemes that require deliberate designs to meet the throughput needs of the compute layer. For a list of best practices for designing scalable cloud services, see Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services.
Applications need to assume that every dependent cloud capability can and will go down at some point in time. A fault tolerant application detects and maneuvers around failed elements to continue and return the correct results within a specific timeframe. For transient error conditions, a fault tolerant system will employ a retry policy. For more serious faults, the application is able to detect problems and fail over to alternative hardware or contingency plans until the failure is corrected. A reliable application is able to properly manage the failure of one or more parts and continue operating properly. Fault tolerant applications can use one or more design strategies, such as redundancy, replication, or degraded functionality.
A cloud deployment might cease to function due to a systemic outage of the dependent services or the underlying infrastructure. Under such conditions, a business continuity plan triggers the disaster recovery (DR) process. This process typically involves both operations personnel and automated procedures to reactivate the application at a functioning datacenter. It requires the transfer of application users, data, and services to the new datacenter through the use of backup media or ongoing replication.
Consider the previous analogy that compared high availability to the ability to recover from a flat tire through the use of a spare. By contrast, disaster recovery involves the steps taken after a car crash where the car is no longer operational. In that case, the best solution is to find an efficient way to change cars, perhaps by calling a travel service or a friend. In this scenario, there is likely going to be a longer delay in getting back on the road as well as more complexity in repairing and returning to the original vehicle. In the same way, disaster recovery to another datacenter is a complex task that typically involves some downtime and potential loss of data. To better understand and evaluate disaster recovery strategies, it is important to define two terms: recovery time objective (RTO) and recovery point objective (RPO).
The recovery time objective (RTO) is the maximum amount of time allocated for restoring application functionality. This is based on business requirements and is related to the importance of the application. Critical business applications require a low RTO.
The recovery point objective (RPO) is the acceptable time window of lost data due to the recovery process. For example, if the RPO is one hour, then the data must be completely backed up or replicated at least every hour. Once the application is brought up in an alternate datacenter, the backup data could be missing up to an hour of data. Like RTO, critical applications target a much smaller RPO.
Part 1: High Availability
A highly available application absorbs fluctuations in availability, load, and temporary failures in the dependent services and hardware. The application continues to operate at an acceptable user and systemic response level, as defined by business requirements or application service level agreements.
Windows Azure High Availability Features
Windows Azure has many built-in features of the platform that support highly available applications. This section describes some of those key features. For a more comprehensive analysis of the platform, see Business Continuity for Windows Azure.
The Windows Azure Fabric Controller is responsible for provisioning and monitoring the health of the Windows Azure compute instances. The Fabric Controller listens for heartbeats from the hardware and software of the host and guest machine instances. When a failure is detected, it enforces SLAs by automatically relocating the VM instances. The compute SLA is further supported by the concept of fault and upgrade domains.
When multiple role instances are deployed, Windows Azure deploys these instances to different fault domains. A fault domain boundary is basically a different hardware rack in the same datacenter. Fault domains reduce the probability that a localized hardware failure will interrupt the service of an application. You cannot manage the number of fault domains that are allocated to your worker or web roles. The Fabric Controller uses dedicated resources that are separate from those of Windows Azure-hosted applications, and it has 100% uptime because it serves as the nucleus of the Windows Azure system. It monitors and manages role instances across fault domains. The following diagram shows Windows Azure shared resources that are deployed and managed by the Fabric Controller across different fault domains.
Figure 1 Fault Domain Isolation (Simplified View)
Upgrade domains are similar to fault domains in function, but they support upgrades rather than failures. An upgrade domain is a logical unit of instance separation that determines which instances in a particular service will be upgraded at a point in time. By default, five upgrade domains are defined for your hosted service deployment, but you can change that value in the service definition file. For instance, if you have eight instances of your web role, you would have two instances in each of three upgrade domains and one instance in each of the remaining two. The update sequence is defined by Windows Azure based on the number of upgrade domains. For more information on upgrade domains, see Update a Windows Azure Service.
In addition to these platform features that support high compute availability, Windows Azure embeds high availability features into its other services. For example, Windows Azure Storage maintains three replicas of all blob, table, and queue data. It also allows the option of geo-replication to store backups of blobs and tables in a secondary datacenter. The Content Delivery Network (CDN) allows blobs to be cached around the world for both redundancy and scalability. Windows Azure SQL Database maintains multiple replicas as well. In addition to the Business Continuity for Windows Azure paper, see the Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services paper for a full discussion of the platform availability features.
Although Windows Azure provides multiple features that support high availability, it is important to understand their limitations. For compute, Windows Azure guarantees that your roles are available and running, but it does not know whether your application is running or overloaded. For Windows Azure SQL Database, data is replicated synchronously within the datacenter. These database replicas are not point-in-time backups. For Windows Azure Storage, table and blob data is replicated by default to an alternate datacenter, but you cannot access the replicas until Microsoft chooses to fail over to the alternate site. A datacenter failover typically only occurs in the case of a prolonged datacenter-wide outage, and there is no SLA for geo-failover time. It is also important to note that any data corruption quickly spreads to the replicas. For these reasons, you must supplement the platform availability features with application-specific availability features, such as the blob snapshot feature to create point-in-time backups of blob data.
The majority of this paper focuses on cloud services, which use a Platform as a Service (PaaS) model. But there are also specific availability features for Windows Azure Virtual Machines, which use an Infrastructure as a Service (IaaS) model. In order to achieve high availability with Virtual Machines, you must use availability sets. An availability set serves a similar function to fault and upgrade domains. Within an availability set, Windows Azure locates the virtual machines in a way that prevents localized hardware faults and maintenance activities from bringing down all of the machines in that group. Availability sets are required to achieve the Windows Azure SLA for the availability of Virtual Machines. For more information, see Manage the Availability of Virtual Machines. The following diagram provides a representation of two availability sets that group web and SQL Server virtual machines respectively.
Figure 2 Availability Sets for Windows Azure Virtual Machines
Note that in the previous diagram, SQL Server is installed and running on virtual machines. This is different from the previous discussion of Windows Azure SQL Database, which provides database as a managed service.
Application Strategies for High Availability
Most application strategies for high availability involve either redundancy or the removal of hard dependencies between application components. Application design should support fault tolerance during sporadic downtime of Windows Azure or third-party services. The following sections describe several application patterns for increasing availability of your cloud services.
Asynchronous Communication and Durable Queues
Consider asynchronous communication between loosely-coupled services to increase availability in Windows Azure applications. In this pattern, messages are written to either storage queues or Service Bus queues for later processing. When the message is written to the queue, control immediately returns to the sender of the message. The message processing is handled by another tier of the application, typically implemented as a worker role. If the worker role goes down, the messages accumulate in the queue until the processing service is restored. As long as the queue is available, there is no direct dependency between the front-end sender and the message processor. This eliminates the requirement for synchronous service calls, which can be a throughput bottleneck in distributed applications.
A variation on this uses Windows Azure Storage (blobs, tables, queues) or Service Bus queues as a failover location for failed database calls. If a synchronous call within an application to another service (such as Windows Azure SQL Database) fails repeatedly, you might be able to serialize that data into durable storage. At some later point when the service or database is back on-line, the application can re-submit the request from storage. The difference in this model is that the intermediate location is not a constant part of the application workflow. It is used only in failure scenarios.
In both cases, asynchronous communication and intermediate storage prevents a down backend service from bringing the entire application down. Queues serve as a logical intermediary. For more guidance on choosing the correct queuing service, see Windows Azure Queues and Windows Azure Service Bus Queues - Compared and Contrasted.
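The queue-based decoupling described above can be sketched as follows. This is a minimal illustration that uses Python’s in-memory queue.Queue as a stand-in for a durable Windows Azure Storage or Service Bus queue; the function names are illustrative and not part of any Azure API:

```python
import queue

# An in-memory queue.Queue stands in for a durable Windows Azure Storage
# or Service Bus queue; function names are illustrative.

work_queue = queue.Queue()

def front_end_submit(message):
    """Web role: write the message and return control immediately."""
    work_queue.put(message)
    return "accepted"  # the sender never blocks on processing

def worker_role_drain():
    """Worker role: process whatever accumulated while it was offline."""
    processed = []
    while not work_queue.empty():
        processed.append(work_queue.get())
    return processed

# Messages accumulate even while no worker is running...
for n in range(3):
    front_end_submit({"order_id": n})

# ...and are processed once the worker role is restored.
print(worker_role_drain())
```

With a durable queue in place of the in-memory one, a worker role outage delays processing but does not lose messages or block the front end.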
Fault Detection and Retry Logic
A key point in a highly available application design is to utilize retry logic within code to gracefully handle a service that is temporarily down. The Transient Fault Handling Application Block, developed by the Microsoft Patterns and Practices team, assists application developers in this process. The word “Transient” means a temporary condition lasting only for a relatively short time. In the context of this paper, handling transient failures is part of developing a highly available application. Examples of transient conditions include intermittent network errors and lost database connections.
The Transient Fault Handling Application Block is a simplified way for you to handle failures within your code in a graceful manner. It allows you to improve the availability of your applications by adding robust transient fault handling logic. In most cases retry logic handles the brief hiccup and reconnects the sender and receiver after one or more failed attempts. A successful retry attempt typically goes unnoticed to application users.
There are three options for developers to manage their retry logic: incremental, fixed interval, and exponential. Incremental waits longer before each retry in an increasing linear fashion (for example 1, 2, 3, and 4 seconds). Fixed interval waits the same amount of time between each retry (for example 2 seconds). Exponential back-off waits increasingly longer between retries, typically doubling the delay each time (for example 2, 4, 8, and 16 seconds).
The high-level strategy within your code is:
Define your retry strategy and policy
Try the operation that could result in a transient fault
If a transient fault occurs, the retry policy is invoked
If all retries fail, a final exception is caught
Test your retry logic in simulated failures to ensure that retries on successive operations do not result in an unanticipated lengthy delay before deciding to fail the overall task.
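The strategy above can be sketched in code. The Transient Fault Handling Application Block itself is a .NET library; the Python sketch below only illustrates the exponential back-off policy, with TransientError as a hypothetical stand-in for an intermittent network error or lost connection:

```python
import time

class TransientError(Exception):
    """Stand-in for an intermittent network error or lost connection."""

def retry(operation, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Run operation(), retrying transient faults with delays of
    base_delay, base_delay * 2, base_delay * 4, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # all retries failed: surface the final exception
            sleep(base_delay * (2 ** attempt))  # exponential back-off

# A flaky operation that succeeds on the third attempt.
attempts = {"count": 0}
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("connection lost")
    return "ok"

print(retry(flaky_call, sleep=lambda s: None))  # prints: ok
```

Injecting the sleep function, as shown, also makes it easy to test the policy in simulated failures without waiting out the real delays.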
Reference Data Pattern (High Availability)
Reference data is the read-only data of an application that provides a business context within which the application generates transactional data during the course of a business operation. Transactional data is a point-in-time function of the reference data and hence its integrity depends on the snapshot of the reference data at the capture time of the transaction. This is a somewhat loose definition but should suffice for our purpose here.
Reference data is necessary for the functioning of an application, but it is typically created and maintained by separate systems, often Master Data Management systems, which are responsible for the lifecycle of the reference data. Examples of reference data include product catalog, employee master, parts master, and equipment master. Reference data can also originate from outside the organization, such as zip codes or tax rates. Strategies for increasing the availability of reference data are typically less difficult than those for transactional data. Reference data has the advantage of being mostly immutable.
Windows Azure web and worker roles that consume reference data can be made autonomous at run time by deploying the reference data along with the application. If the size of the local storage allows such a deployment, this is an ideal state to be in. An embedded database (SQL or NoSQL) or even XML files deployed to the local file system will help with the autonomy of Windows Azure compute scale-units. However, you should have a mechanism to update the data in each role without requiring a redeployment. To do this, place any updates to the reference data in a Cloud storage endpoint (for example, Windows Azure Blob storage or SQL Database). Add code to each role that downloads the data updates into the compute nodes at role startup. Or add code that allows an administrator to perform a forced download into the role instances. To increase availability, the roles should also contain a set of reference data in case storage is down. This enables the roles to start with a basic set of reference data until the storage resource becomes available for the updates.
Figure 3 Application high availability through autonomous compute nodes
One consideration for this pattern is the deployment and startup speed for your roles. If you are deploying or downloading large amounts of reference data on startup, this can increase the amount of time it takes to spin up new deployments or role instances. This might be an acceptable tradeoff for the autonomy of having the reference data immediately available on each role rather than depending on external storage services.
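A minimal sketch of this startup sequence follows, in Python purely for illustration. The fetch_from_storage callable is a hypothetical stand-in for a call to Windows Azure Blob storage or SQL Database, and the embedded catalog represents the baseline reference data deployed with the role:

```python
# EMBEDDED_CATALOG represents baseline reference data deployed with the
# role; fetch_from_storage is a hypothetical storage call.

EMBEDDED_CATALOG = [{"sku": "A100", "name": "Widget"}]

def load_reference_data(fetch_from_storage):
    """Prefer fresh reference data from storage, but fall back to the
    embedded baseline so the role can still start."""
    try:
        return fetch_from_storage()
    except Exception:
        return EMBEDDED_CATALOG  # degraded but available

# Storage reachable: the role starts with the latest catalog.
fresh = load_reference_data(lambda: [{"sku": "A100", "name": "Widget v2"}])

# Storage down: the role still starts, using the embedded baseline.
def storage_down():
    raise ConnectionError("storage unavailable")

fallback = load_reference_data(storage_down)
print(fallback == EMBEDDED_CATALOG)  # True
```

The key design choice is that a storage outage degrades freshness rather than preventing role startup.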
Transactional Data Pattern (High Availability)
Transactional data is the data generated by the application in a business context that is identified by a combination of the set of business processes implemented by the application and the reference data that supports these processes. Examples of transactional data include orders, advanced shipping notices, invoices, and CRM opportunities. The transactional data thus generated will be fed to external systems for record keeping and/or for further processing.
Keep in mind that reference data can change within the systems that are responsible for this data. For this reason, transactional data must save the point-in-time reference data context so that it has minimal external dependencies for its semantic consistency. For example, consider the removal of a product from the catalog a few months after an order was fulfilled. The best practice is to embed as much reference data context as feasible into the transaction. This preserves the semantics associated with the transaction even if the reference data were to change after the transaction is captured.
As mentioned previously, architectures that use loose coupling and asynchronous communication lend themselves to higher levels of availability. This holds true for transactional data as well, but the implementation is more complex. Traditional transactional notions typically rely on the database for guaranteeing the transaction. When intermediate layers are introduced, the application code must correctly handle the data at various layers to ensure sufficient consistency and durability.
The following sequence describes a workflow that separates the capture of transactional data from its processing:
Web Compute Node: Present reference data.
External Storage: Save intermediate transactional data.
Web Compute Node: Complete the end-user transaction.
Web Compute Node: Send the completed transactional data along with the reference data context to a temporary durable storage that is guaranteed to give predictable response.
Web Compute Node: Signal the completion of the transaction to the end user.
Background Compute Node: Extract the transactional data, post-process it if necessary, and send it to its final destination in the current system.
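The sequence above can be sketched as follows. This Python illustration uses in-memory structures as stand-ins for the durable queue, the background worker role, and the final data store; all names are illustrative:

```python
import queue

# In-memory stand-ins: durable_queue for a Windows Azure queue,
# system_of_record for the final data store, session_cache for the
# immediate UI needs while processing is still pending.

durable_queue = queue.Queue()
system_of_record = []
session_cache = {}

def complete_transaction(user_id, order, product_snapshot):
    """Web role: embed the point-in-time reference data context with the
    transaction, write it to durable storage, and signal the user."""
    record = {"order": order, "product_snapshot": product_snapshot}
    durable_queue.put(record)        # predictable, durable write
    session_cache[user_id] = record  # data is not yet queryable downstream
    return "transaction complete"

def background_worker():
    """Worker role: extract queued transactions, post-process if needed,
    and store them at their final destination."""
    while not durable_queue.empty():
        system_of_record.append(durable_queue.get())

print(complete_transaction("u1", {"qty": 2}, {"sku": "A100", "price": 9.99}))
background_worker()
print(len(system_of_record))  # 1
```

Embedding the product snapshot with the order preserves the transaction’s semantics even if the catalog changes later, as discussed above.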
The following diagram shows one possible implementation of this design in a Windows Azure cloud service.
Figure 4 Application high availability through loose coupling
The dashed arrows in the above diagram indicate asynchronous processing of which the front-end web role is unaware. This processing eventually stores the transaction at its final destination within the current system. Due to the latency introduced by this asynchronous model, the transactional data is not immediately available for query. Therefore, each unit of transactional data needs to be saved in a cache or user session to meet the immediate UI needs.
Consequently, the web role is autonomous from the rest of the infrastructure; its availability profile is a combination of the web role and the Windows Azure queue rather than the entire infrastructure. In addition to high availability, this approach allows the web role to scale horizontally, independent of the backend storage. This high availability model can have an impact on the economics of operations, because additional components like Windows Azure queues and worker roles can affect monthly usage costs.
Note that the previous diagram shows one implementation of this decoupled approach to transactional data. There are many other possible implementations. The following list provides some alternative variations.
A worker role might be placed between the web role and the storage queue.
A Service Bus queue can be used instead of a Windows Azure Storage queue.
The final destination might be Windows Azure Storage or a different database provider.
Windows Azure Caching can be used at the web layer to provide the immediate caching requirements following the transaction.
In addition to the patterns discussed in this section, it is important to note that the scalability of the cloud service directly impacts availability. If increased load causes your service to be unresponsive, the user impression is that the application is down. Follow best practices for scalability based on your expected application load and future expectations. The highest scale involves many considerations, such as the use of single vs. multiple storage accounts, sharding across multiple databases, and caching strategies. For an in-depth look at these patterns, see Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services.
Part 2: Disaster Recovery
Unlike the temporary failure management discussed in the previous sections on high availability, disaster recovery (DR) revolves around the catastrophic loss of application functionality. For example, consider the scenario where one or more datacenters go down. In this case, you need a plan to run your application or access your data outside of the datacenter. Execution of this plan involves people, processes, and supporting applications that allow the system to function. The level of functionality for the service during a disaster is determined by business and technology owners who define its disaster operational mode. That mode can take many forms: completely unavailable, partially available (degraded functionality or delayed processing), or fully available.
Windows Azure Disaster Recovery Features
As with availability considerations, Windows Azure has many business continuity features designed to support disaster recovery. There is also a relationship between some of the availability features of Windows Azure and disaster recovery. For example, the management of roles across fault domains increases the availability of an application. But without that management, an unhandled hardware failure would become a “disaster” scenario. So the correct application of many of the availability features and strategies should be seen as an important part of disaster-proofing your application. However, this section goes beyond general availability issues to more serious (and rarer) disaster events.
Multiple Datacenter Regions
Windows Azure maintains datacenters in many different regions around the world. This enables several disaster recovery scenarios, such as the system-provided geo-replication of Windows Azure Storage to secondary datacenters. But it also means that you can easily and inexpensively deploy a cloud service to multiple locations around the world. Compare this with the cost and difficulty of running your own datacenters in multiple regions. Deploying data and services to multiple datacenters protects your application from major outages in a single datacenter.
Windows Azure Traffic Manager
Once a datacenter-specific failure occurs, you must redirect traffic to services or deployments in another datacenter. This routing can be done manually, but it is more efficient to use an automated process. Windows Azure Traffic Manager (WATM) is a preview service that is designed for this task. It enables you to automatically manage the failover of user traffic to another datacenter in case the primary datacenter fails. Because traffic management is an important part of the overall strategy, it is important to understand the basics of WATM.
In the diagram below, users connect to a URL specified for WATM (http://myATMURL.trafficmanager.net) that abstracts the actual site URLs (http://app1URL.cloudapp.net and http://app2URL.cloudapp.net). Users are routed to the appropriate actual site based upon the criteria you configure in the routing policy. The policy options are round-robin, performance, or failover. For the sake of this whitepaper we will only be concerned with the failover option.
Figure 5 Routing using the Windows Azure Traffic Manager
When configuring WATM you provide a new Traffic Manager DNS prefix. This is the URL prefix you give your users to access your service. With WATM, load balancing is abstracted one level up rather than handled at the datacenter level. The Traffic Manager DNS maps to a CNAME for all the deployments it manages.
Within WATM you specify the priority of the deployments that users will be routed to when failure occurs. The WATM monitors the endpoints of the deployments and notices when the primary deployment fails. Upon failure WATM will walk the prioritized list of deployments and route users to the next one on the list.
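This priority walk can be pictured with a small sketch. The endpoint names and probe function are hypothetical; in practice WATM performs the health monitoring and selection for you:

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, or None.

    `endpoints` is ordered by the failover priority you configure in WATM;
    `is_healthy` stands in for the endpoint monitoring WATM performs.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None

# Hypothetical deployments, highest priority first.
priority = ["http://app1URL.cloudapp.net", "http://app2URL.cloudapp.net"]

# Simulate the primary failing its health probe.
down = {"http://app1URL.cloudapp.net"}
target = pick_endpoint(priority, lambda e: e not in down)
print(target)  # → http://app2URL.cloudapp.net
```

If every deployment in the list is unhealthy, the sketch returns None; the real service keeps probing and restores routing when an endpoint recovers.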
A key point to understand about WATM is that while WATM decides where traffic goes in a failover (based upon the priority of failover sites you configure), you decide whether your failover domain is dormant or active while NOT in failover mode. That decision has nothing to do with Windows Azure Traffic Manager. When WATM detects a failure in the primary site, it fails over to the failover site regardless of whether that site is currently actively serving users. More information about dormant or active failover domains can be found in later sections of this paper.
For more information on how Windows Azure Traffic Manager works, please refer to the Windows Azure Traffic Manager documentation.
Common Windows Azure Disaster Scenarios
The following sections cover several different types of disaster scenarios. Datacenter failure is not the only cause of application-wide failures. Poor design or administration errors can also lead to outages. It is important to consider the possible causes of a failure during both design and testing phases of your recovery plan. A good plan takes advantage of Windows Azure features and augments them with application-specific strategies. The chosen response is dictated by the importance of the application, the RPO, and the RTO.
Application Failure
As mentioned previously, failures resulting from the underlying hardware or operating system software in the host virtual machine are automatically handled by the Windows Azure Fabric Controller. Windows Azure spawns a new instance of the role on a functioning server and adds it to the load balancer rotation. If the number of role instances specified in the service configuration is greater than one, Windows Azure shifts processing to the other running role instances while replacing the failed node. However, there are serious application errors that happen independently of any hardware or operating system failures. The application could fail due to catastrophic exceptions caused by either bad logic or data integrity issues. You must incorporate enough telemetry into the code so that a monitoring system can detect failure conditions and notify an application administrator. An administrator with full knowledge of the disaster recovery processes can then decide whether to invoke a failover process or simply accept an availability outage while the critical errors are resolved.
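As a sketch of the kind of telemetry-driven detection described above (the class, window size, threshold, and notification callback are all hypothetical choices, not a Windows Azure API):

```python
from collections import deque

class FailureMonitor:
    """Track recent health probes and alert when too many fail.

    A sketch of telemetry you would build into the application;
    the window size and failure threshold are illustrative only.
    """
    def __init__(self, notify, window=10, threshold=0.5):
        self.notify = notify          # e.g. page the application administrator
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, probe_ok):
        self.window.append(probe_ok)
        failures = self.window.count(False)
        # Only alert once a full window of probes has been observed.
        if len(self.window) == self.window.maxlen and \
           failures / len(self.window) >= self.threshold:
            self.notify(f"{failures}/{len(self.window)} probes failed")

alerts = []
monitor = FailureMonitor(alerts.append)
for ok in [True] * 5 + [False] * 5:
    monitor.record(ok)
print(alerts)  # one alert once the failure ratio reaches 50%
```

The monitor only raises the alarm; per the text above, the decision to fail over or ride out the outage remains with a human who knows the DR processes.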
Data Corruption
Windows Azure automatically stores your Windows Azure SQL Database and Windows Azure Storage data redundantly in three copies within different fault domains in the same datacenter. If geo-replication is used, three additional copies are stored in a different datacenter. However, if your users or your application corrupts the data in the primary copy, the corruption quickly replicates to the other copies. Unfortunately, this results in three copies of corrupt data.
To manage potential corruption of your data, you must manage your own backups in order to maintain transactional consistency. You can store your backups in Windows Azure or on-premises depending upon your level of comfort or governance regulations. For more information, see the section on Data Strategies for Disaster Recovery.
Network Outage
When parts of the Windows Azure network are down, you may not be able to get to your application or data. If one or more role instances are unavailable due to network issues, Windows Azure leverages the remaining available instances of your application. If your application can't access its data due to a Windows Azure network outage, you can potentially run in a degraded mode locally by using cached data. You need to architect the disaster recovery strategy for running in degraded mode into your application. For some applications this might not be practical. Another option is to store data in an alternate location until connectivity is restored. If degraded mode is not an option, the remaining options are application downtime or failover to an alternate datacenter. The design of an application running in degraded mode is as much a business decision as a technical one. This is discussed further in the section on Degraded Application Functionality.
Failure of Dependent Service
Windows Azure provides many services which can experience periodic downtime. Consider Windows Azure Shared Caching as an example. This multitenant service provides caching capabilities to your application. It is important to consider what happens in your application if the dependent service is unavailable. In many ways, this scenario is similar to the network outage scenario, but considering each service independently results in potential improvements to your overall plan.
For example, with caching, there is a relatively new alternative to the multitenant Shared Caching model. Windows Azure Caching on roles provides caching to your application from within your cloud service deployment (this is also the recommended way to use Caching going forward). While it has a limitation of only being accessible from within a single deployment, there are potential disaster recovery benefits. First, the service now runs on roles that are local to your deployment. So you are better able to monitor and manage the health of the cache as part of your overall management processes for the cloud service. But this type of caching also exposes new features. One of the new features is high availability for cached data. This helps to preserve cached data in the event that a single node fails by maintaining duplicate copies on other nodes. Note that high availability decreases throughput and increases latency due to the updating of the secondary copy on writes. It also doubles the amount of memory used for each item, so plan for that. But this specific example demonstrates that each dependent service might have capabilities that improve your overall availability and resistance to catastrophic failures.
With each dependent service, you should understand the implications of a total outage. For example, in the example of Caching, it might be possible to access the data directly from a database until the Caching capabilities are restored. This would be a degraded mode in terms of performance but would provide full functionality with regard to data.
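A minimal sketch of that degraded read path might look like the following, using hypothetical `cache` and `database` stand-ins rather than the real Caching or SQL Database client APIs:

```python
class CacheUnavailableError(Exception):
    """Raised by the hypothetical cache client when the service is down."""

def get_product(product_id, cache, database):
    """Read-through lookup that degrades to the database if the cache is down.

    `cache` and `database` are stand-ins for Windows Azure Caching and the
    backend store; the fallback path is slower but fully functional.
    """
    try:
        value = cache.get(product_id)
        if value is not None:
            return value
    except CacheUnavailableError:
        pass  # degraded mode: skip the cache entirely
    value = database[product_id]
    try:
        cache.put(product_id, value)  # repopulate once the cache recovers
    except CacheUnavailableError:
        pass
    return value

class DownCache:
    """Simulates a total Caching outage."""
    def get(self, key): raise CacheUnavailableError()
    def put(self, key, value): raise CacheUnavailableError()

db = {"p1": "Widget"}
print(get_product("p1", DownCache(), db))  # → Widget
```

Note that the degraded path silently swallows cache errors; a production version should also emit telemetry so the outage is visible to monitoring.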
Datacenter Down
The previous failures are primarily failures that can be managed within the same Windows Azure datacenter. However, you must also prepare for the possibility of an outage of the entire datacenter. When a datacenter goes down, the locally redundant copies of your data are not available. If you have enabled geo-replication for your Windows Azure Storage account, there are three additional copies of your blobs and tables in a datacenter in a different region. When Microsoft declares the datacenter lost, all of the DNS entries are remapped to the geo-replicated datacenter. Note that you do not have any control over this process, and it will only occur for datacenter-wide failures. Because of this, you must also rely on other application-specific backup strategies to achieve the highest level of availability. For more information, see the section on Data Strategies for Disaster Recovery.
Windows Azure Down
In disaster planning, you must consider the entire range of possible disasters. One of the most severe outages would involve all Windows Azure datacenters simultaneously. As with other outages, you might decide that you will take the risk of temporary downtime in that event. Widespread failures that span datacenters should be much rarer than isolated failures involving dependent services or single datacenters. But for some mission critical applications, you might decide that there must be a backup plan for this scenario as well. The plan for this event could include failing over to services in an alternative cloud or to a hybrid on-premises and cloud solution.
Degraded Application Functionality
A well designed application typically uses a collection of modules that communicate with each other through loosely coupled information interchange patterns. A DR-friendly application especially requires separation of concerns at the module level, so that an outage of a dependent service does not bring down the entire application. For example, consider a web commerce application for Company Y. The following modules might constitute the application:
Product Catalog: allows users to browse products
Shopping Cart: allows users to add or remove products in their shopping cart
Order Status: shows the shipping status of user orders
Order Submission: finalizes the shopping session by submitting the order with payment
Order Processing: validates the order for data integrity and performs a quantity availability check
When a dependency of a module in this application goes down, how does the module function until that part recovers? A well architected system implements isolation boundaries through separation of concerns both at design time and run time. Every failure can be categorized as recoverable or non-recoverable. Non-recoverable errors will take down the module, while recoverable errors can be mitigated through alternatives. As discussed in the high availability section, some problems can be hidden from users by handling faults and taking alternate actions. During a more serious outage, the application might be completely unavailable. However, a third option is to continue servicing users in degraded mode.
For instance, if the database for hosting orders goes down, the Order Processing module loses its ability to process sales transactions. Depending on the architecture, it might be hard or impossible for the Order Submission and Order Processing parts of the application to continue. If the application is not designed to handle this scenario, the entire application might go offline.
However, in this same scenario, it is possible that the product data is stored in a different location. In that case, the Product Catalog module can still be used for viewing products. In degraded mode, the application continues to be available to users for available functionality, like viewing the product catalog. But other parts of the application are unavailable, such as ordering or inventory queries.
Another variation of degraded mode centers on performance rather than capabilities. For example, consider a scenario where the product catalog was being cached with Windows Azure Caching. If Caching became unavailable, it is possible that the application could go directly to the backend storage to retrieve product catalog information. But this access might be slower than the cached version. Because of this, the application performance is degraded until the Caching service is fully restored.
It is just as much of a business decision as a technical decision as to how much of an application will continue to function in degraded mode. The application must also decide how to inform the users of the temporary problems. In this example, the application could allow viewing products and even adding them to a shopping cart. However, when the user attempts to make a purchase, the application notifies the user that the sales module is down temporarily. It is not ideal for the customer, but it does prevent an application-wide outage.
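One way to reason about which modules can stay up is to map each module to the backing services it depends on. The following sketch uses the Company Y modules from above; the dependency names are invented for illustration:

```python
# Hypothetical dependency map for the Company Y example modules.
DEPENDENCIES = {
    "ProductCatalog": {"catalog-storage"},
    "ShoppingCart": {"cart-storage"},
    "OrderSubmission": {"orders-db", "payment-service"},
    "OrderProcessing": {"orders-db"},
    "OrderStatus": {"orders-db"},
}

def available_modules(failed_services):
    """Modules that can keep running when the given backing services are down."""
    return sorted(m for m, deps in DEPENDENCIES.items()
                  if not deps & failed_services)

# If the orders database goes down, the app runs in degraded mode:
# catalog browsing and the shopping cart stay up, ordering does not.
print(available_modules({"orders-db"}))  # → ['ProductCatalog', 'ShoppingCart']
```

Building such a map during design time makes the degraded-mode business discussion concrete: for each failed dependency, the list of surviving modules is exactly what you can promise users during the outage.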
Data Strategies for Disaster Recovery
Handling data correctly is the hardest area to get right in any disaster recovery plan. Restoring data is also the part of the recovery process that typically takes the most time. Different choices in degradation modes result in difficult challenges for data recovery from failure and consistency following failure. One of the controlling factors is the necessity of restoring or maintaining a copy of the application's data, both reference and transactional, at a secondary site. In an on-premises setting, implementing a multi-datacenter DR strategy requires an expensive and lengthy planning process. Conveniently, most Cloud providers, including Windows Azure, readily allow the deployment of applications to multiple datacenters. These datacenters are geographically distributed in such a way that multi-datacenter outages should be extremely rare. The strategy for handling data across datacenters is one of the determining factors for the success of any disaster recovery plan.
The following sections discuss disaster recovery techniques related to data backups, reference data, and transactional data.
Backup and Restore
Some disaster recovery scenarios can be supported by regular backups of application data. Different storage resources require different techniques.
For Windows Azure SQL Database, you can use the DATABASE COPY command to create a copy of the database. This command is required to get a backup with transactional consistency. Following the DATABASE COPY, you can leverage the import/export service of Windows Azure SQL Database. This supports exporting databases to BACPAC files that are stored in Windows Azure Blob storage. The built-in redundancy of Windows Azure Storage creates two replicas of the backup file in the same datacenter. However, the frequency of running the backup process determines your RPO, which is the amount of data you might lose in disaster scenarios. For example, if you perform a backup at the top of the hour and disaster occurs at HH:58 (two minutes before the top of the hour), you lose 58 minutes of data that happened after the last backup was performed. Also, to protect against a datacenter outage, you should copy the BACPAC files to an alternate datacenter. You then have the option of restoring those backups in the alternate datacenter. For more details, see Business Continuity in Windows Azure SQL Database.
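The relationship between backup frequency and RPO is simple arithmetic, shown here with the HH:58 example from above (a sketch, not a Windows Azure feature):

```python
def data_loss_minutes(last_backup_minute, failure_minute):
    """Data written between the last backup and the failure is lost.

    With periodic backups, the worst case approaches one full backup
    interval: a failure just before the next backup loses almost
    everything written since the previous one.
    """
    return failure_minute - last_backup_minute

# Hourly backups, disaster strikes at HH:58 as in the example above:
print(data_loss_minutes(0, 58))  # → 58 minutes of lost transactions
```

In other words, choosing the backup interval is choosing your RPO; halving the interval halves the worst-case loss, at the cost of more frequent export operations.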
For Windows Azure Storage you can either develop your own custom backup process or use one of many third-party backup tools. Note that there are additional complexities in most application designs where storage resources reference each other. For example, consider a SQL Database that has a column that links to a blob in Windows Azure Storage. If the backups do not happen at exactly the same moment, the database might have a pointer to a blob that was not itself backed up before the failure. The application or disaster recovery plan must implement processes to handle this inconsistency after a recovery.
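A recovery plan might include a reconciliation pass like the following sketch, which flags restored database rows whose blob pointers have no matching blob in the restored storage backup (the row and blob shapes are hypothetical):

```python
def find_dangling_blob_refs(restored_rows, restored_blobs):
    """Return rows whose blob pointer has no matching blob in the restored
    storage backup -- the inconsistency that arises when the database and
    blob backups are not taken at the same instant."""
    return [row for row in restored_rows
            if row["blob_url"] not in restored_blobs]

rows = [{"id": 1, "blob_url": "images/a.png"},
        {"id": 2, "blob_url": "images/b.png"}]
blobs = {"images/a.png"}  # b.png was created after the storage backup ran

print(find_dangling_blob_refs(rows, blobs))  # row 2 needs repair
```

What "repair" means is application-specific: the row might be deleted, the reference nulled out, or the blob regenerated; the important point is that the plan enumerates the inconsistencies rather than discovering them through user errors.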
Reference Data Pattern (Disaster Recovery)
As mentioned previously, reference data is read-only data that supports application functionality. It typically does not change frequently. Although backup and restore is one method to handle datacenter outages, the RTO is relatively long. When the application is deployed to a secondary datacenter, there are some strategies that improve the RTO for reference data.
Because reference data changes infrequently, you can improve the RTO by maintaining a permanent copy of the reference data in the secondary datacenter. This eliminates the time required to restore backups in the event of a disaster. To meet the multi-datacenter DR requirements, the application and the reference data must be deployed together to multiple datacenters. As mentioned previously in the Reference Data Pattern (High Availability), reference data can be deployed to the role itself, to external storage, or to a combination of both. The intra-compute-node reference data deployment model implicitly satisfies the DR requirements just as it does the application availability needs. Reference data deployment to SQL Database requires a copy of the reference data to be deployed to each datacenter. The same strategy applies to Windows Azure Storage: any reference data stored in Windows Azure Storage must be deployed to both the primary and secondary datacenters.
Figure 6 Reference data publication to both primary and secondary datacenters
As mentioned previously, you must implement your own application-specific backup routines for all data, including reference data. Geo-replicated copies across datacenters are only used in a datacenter-wide outage. To prevent extended downtime, the mission critical parts of the application’s data should be deployed to the secondary datacenter. For an example of this topology, see the Active/Passive model.
Transactional Data Pattern (Disaster Recovery)
Implementation of a fully functional disaster mode strategy requires asynchronous replication of the transactional data to the secondary datacenter. The practical time window within which the replication can occur determines the RPO characteristics of the application. The data that is lost during the replication window might still be recoverable from the primary datacenter and merged with the secondary later.
The following architecture examples provide some ideas on different ways of handling transactional data in a failover scenario. It is important to note that these examples are not exhaustive. For example, intermediate storage locations such as queues could be replaced with Windows Azure SQL Database. The queues themselves could be either Windows Azure Storage or Service Bus queues (see Windows Azure Queues and Windows Azure Service Bus Queues - Compared and Contrasted). Backend storage destinations could also vary, such as Windows Azure tables instead of SQL Database. In addition, there might be worker roles that are inserted as intermediaries in various steps. The important thing is not to emulate these architectures exactly but to use the examples to consider various alternatives in the recovery of transactional data and related modules.
Consider an application that uses Windows Azure Storage queues to hold transactional data. This allows worker roles to process the transactional data to the backend database in a decoupled architecture. As discussed, this requires the transactions to use some form of temporary caching if the front-end roles require the immediate query of that data. Depending on the level of data loss tolerance, one could choose to replicate the queues, the database, or all of the storage resources. With just the database replication, in the event of primary datacenter outage, the data in the queues can still be recovered when the primary datacenter comes back. The following diagram shows an architecture where the backend database is synchronized across datacenters.
Figure 7 Replicate transactional data in preparation for DR
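The decoupled capture-and-drain flow above can be sketched with in-memory stand-ins for the queue and database (the order shape is hypothetical; real worker roles would poll a Windows Azure Storage queue and write to SQL Database):

```python
from collections import deque

queue = deque()          # stand-in for a Windows Azure Storage queue
database = []            # stand-in for the backend database

def submit_order(order):
    """Front-end role: capture the transaction durably and return quickly."""
    queue.append(order)

def drain_queue():
    """Worker role: process queued transactions into the backend store."""
    while queue:
        database.append(queue.popleft())

submit_order({"id": 1, "item": "Widget"})
submit_order({"id": 2, "item": "Gadget"})
drain_queue()
print(len(database))  # → 2
```

The decoupling is what makes the DR choice flexible: because the queue is the system of record until the worker drains it, you can replicate the database alone and still recover the queued transactions when the primary datacenter returns.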
The biggest challenge in implementing the previous architecture is the replication strategy between datacenters. Although Windows Azure provides a SQL Data Sync service for this type of replication, the service is still in preview and is not recommended for production environments. For more information, see Data Sync Service and Disaster Recovery. For production applications, you must either invest in a third-party solution or create your own replication logic in code. Depending on the architecture, the replication might be bi-directional, which is also more complex. One potential implementation could make use of the intermediate queue in the previous example. The worker role that processes the data into the final storage destination could make the change in both the primary and secondary datacenters. These are not trivial tasks, and complete guidance for replication code is beyond the scope of this paper. The important point is that a lot of your time and testing should center on how you replicate your data to the secondary datacenter. Additional processing and testing should be done to ensure that the failover and recovery processes correctly handle any possible data inconsistencies or duplicate transactions.
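A dual-write step of the kind described above might be sketched as follows. This is deliberately simplified: real replication code must also handle ordering, conflicts, and retry of the failed side, and the store objects here are hypothetical stand-ins:

```python
def replicate(change, primary, secondary, pending_retries):
    """Apply a change to the stores in both datacenters.

    If the secondary is unreachable, park the change for retry so the
    primary is never blocked by cross-datacenter failures.
    """
    primary.append(change)
    try:
        secondary.append(change)
    except ConnectionError:
        pending_retries.append(change)  # re-send when the link recovers

primary_db, secondary_db, retries = [], [], []
replicate({"order": 1}, primary_db, secondary_db, retries)
print(primary_db == secondary_db)  # → True
```

Even this tiny sketch surfaces the hard questions called out in the text: what order retried changes are applied in, and how duplicates are detected if a change was actually applied before the connection error was reported.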
Note: Although most of this paper focuses on Platform as a Service, there are additional replication and availability options for hybrid applications that use Windows Azure Virtual Machines. These hybrid applications use Infrastructure as a Service (IaaS) to host SQL Server on virtual machines in Windows Azure. This enables traditional availability approaches in SQL Server, such as AlwaysOn Availability Groups or Log Shipping. Some techniques, such as AlwaysOn, only work between on-premises SQL Servers and Windows Azure virtual machines. For more information, see High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines.
Consider a second architecture that operates in degraded mode. The application in the secondary datacenter disables all non-essential functionality, such as reporting, BI, or draining queues. It only accepts the most important types of transactional workflows as defined by business requirements. The system captures the transactions and writes them to queues but may elect to postpone processing the data during the initial stage of the outage. If the system in the primary datacenter comes back within the expected time window, the queues can be drained by the worker roles in the primary datacenter, thereby eliminating the need for database merging. If the primary datacenter outage goes beyond the tolerable window, the application can start processing the queues. In this scenario, the database in the secondary datacenter contains incremental transactional data, which must be merged once the primary comes back up. The following diagram shows this strategy for temporarily storing transactional data until the primary datacenter is restored.
Figure 8 Degraded application mode for transaction capture only
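The merge step after the primary returns could resemble this sketch, which suppresses duplicate transactions by id (the transaction shape is hypothetical; real merging must also respect ordering and business rules):

```python
def merge_databases(primary, secondary_increment):
    """Merge transactions captured on the secondary during the outage
    back into the primary, skipping any ids the primary already has."""
    known = {txn["id"] for txn in primary}
    for txn in secondary_increment:
        if txn["id"] not in known:
            primary.append(txn)
            known.add(txn["id"])
    return primary

primary = [{"id": 1}, {"id": 2}]
secondary = [{"id": 2}, {"id": 3}]       # id 2 was captured in both places
print([t["id"] for t in merge_databases(primary, secondary)])  # → [1, 2, 3]
```

This assumes every transaction carries a globally unique id, which is itself a design decision: without one, de-duplicating transactions captured in both datacenters becomes far harder.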
For more discussion of data management techniques for resilient Windows Azure applications, see Failsafe: Guidance for Resilient Cloud Architectures.
Deployment Topologies for Disaster Recovery
Mission critical applications need to prepare for the eventuality of the entire datacenter going down by incorporating a multi-datacenter deployment strategy into the operational planning. Multi-datacenter deployments might involve IT Pro processes to publish the application and reference data to the secondary datacenter after experiencing a disaster. If the application requires instant failover, the deployment process may involve an active/passive or an active/active setup with existing instances of the application running in the alternate datacenter. As discussed, a routing tool such as the Windows Azure Traffic Manager preview provides load balancing services at the DNS level. It can detect outages and route the users to different datacenters when needed.
Part of a successful Windows Azure disaster recovery strategy is architecting it into the solution up front. The Cloud gives us additional options for recovering from failures during a disaster that are not available in a traditional hosting provider. Specifically, you can dynamically and quickly allocate resources in a different datacenter without paying a lot for them to sit idle waiting for a failure to occur.
The following sections cover different deployment topologies for disaster recovery. Typically there is a tradeoff in increased cost/complexity for additional availability.
Single-Region Deployment
A single-region deployment is not really a disaster recovery topology, but it serves as a contrast to the other architectures. Single-region deployments are common for applications in Windows Azure, but they do not seriously consider disaster recovery planning. The following diagram depicts an application running in a single Windows Azure datacenter. As discussed previously, the Windows Azure Fabric Controller and the use of fault and upgrade domains increase availability of the application within the datacenter.
Figure 9 Single Region Deployment
Here it is apparent that the database is a single point of failure. Even though Windows Azure replicates the data across different fault domains to internal replicas, this all occurs in the same datacenter. It cannot withstand a catastrophic failure. If the datacenter goes down, all of the fault domains go down, which includes all service instances and storage resources.
For all but the least critical applications, you must devise a plan to deploy your application across multiple datacenters in different regions. You should also consider RTO and cost constraints when deciding which deployment topology to use.
Let’s take a look now at specific patterns to support failover across different datacenters. These examples all use two datacenters to describe the process.
Redeploy Pattern
In this pattern only the primary datacenter has applications and databases running. The secondary datacenter is not set up for an automatic failover. So when disaster occurs, you must spin up all the parts of the service in the new datacenter. This includes uploading a cloud service package to Windows Azure, deploying the cloud services, restoring the data, and changing DNS to reroute the traffic.
While this is the most affordable of the multi-region options, it has the worst RTO characteristics. In this model, the service package and database backups are stored either on-premises or in the blob storage of the secondary datacenter. But a new service must be deployed and the data restored before it resumes operation. Even if you fully automate the data transfer from backup storage, spinning up the new database environment can still consume a large amount of time. Moving data from the backup storage into the empty database in the secondary datacenter is the most expensive part of the restore. But this must be done in order to bring the new database into an operational state, since it is not replicated.
The best approach is to store the service packages in Windows Azure Blob storage in the secondary datacenter. This eliminates the need to upload the package to Windows Azure, which is what happens when you deploy from a development machine on-premises. They can be quickly deployed into a new cloud service from blob storage by using PowerShell scripts.
This option is only practical for non-critical applications that can tolerate a high RTO. For instance, this might work for an application that can be down for several hours but should be running again within 24 hours.
Figure 10 Redeploy to a secondary Windows Azure datacenter
Active/Passive Pattern
The Active/Passive pattern is the choice favored by many companies due to the improvements to the RTO with a relatively small increase in cost over the redeploy pattern. In this scenario there is again a primary and a secondary Windows Azure datacenter. All of the traffic goes to the active deployment in the primary datacenter. The secondary datacenter is better prepared for disaster recovery because the database is running in both datacenters and there is a synchronization mechanism in place between them. This standby approach can involve two variations: a database-only approach or a complete deployment in the secondary datacenter.
In the first variation of the Active/Passive pattern, only the primary datacenter has a deployed cloud service application. However, unlike the redeploy pattern, both datacenters are synchronized on the contents of the database (see the section on Transactional Data Pattern (Disaster Recovery)). When a disaster occurs, the only requirement is to spin up the application in the secondary datacenter, change connection strings to the new database, and change the DNS entries to reroute traffic.
Like the redeploy pattern, the service packages should already be stored in Windows Azure Blob storage in the secondary datacenter for faster deployment. But unlike the redeployment pattern, you don’t incur the majority of the overhead that is required by database restore operations. The database is ready and running. This saves a significant amount of time, making this an affordable and, therefore, the most popular DR pattern.
Figure 11 Active/Passive (Database Only)
In the second variation of the Active/Passive pattern both the primary and secondary datacenters have a full deployment that includes both the cloud services and a synchronized database. However only the primary datacenter is actively handling network requests from the users. Only if the primary datacenter goes down does the secondary datacenter become active. In that case all new network requests are now routed to the secondary region. Windows Azure Traffic Manager can manage this failover automatically.
Failover occurs faster than the database-only variation, because the services are already deployed. This pattern is designed for a very low RTO where you need the secondary failover datacenter to be ready to go immediately after failure of the primary datacenter.
Along with quicker response, this also has an additional advantage of pre-allocating and deploying backup services. You don’t have to worry about a datacenter not having the space to allocate new instances in a disaster. This is important if your secondary Windows Azure datacenter is nearing capacity. There is no guarantee (SLA) that you will instantly be able to deploy a number of new cloud services in any datacenter.
For the fastest response time with this model, you must have close to the same scale (number of role instances) in both the primary and secondary datacenters. Yet paying for unused compute instances is costly, and this is often not the most prudent financial choice. Because of this, it is more common to use a slightly scaled-down version of the cloud services in the secondary datacenter. Then you can quickly fail over and scale out the secondary deployment when necessary. You should automate the failover process so that, once the failure of the primary datacenter is recognized, you spin up additional instances depending upon the load. This could involve some type of automatic scaling mechanism, such as the Autoscaling Application Block. The following diagram shows the model where both the primary and secondary datacenters contain a fully deployed cloud service in an Active/Passive pattern.
Figure 12 Active/Passive (Full Replica)
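The scale-out decision after failover reduces to simple capacity arithmetic, sketched below with hypothetical instance counts and per-instance capacity:

```python
import math

def failover_target_instances(standby_instances, primary_load,
                              standby_load, capacity_per_instance):
    """How many instances the scaled-down secondary needs once it must
    absorb the primary's traffic as well as its own."""
    total_load = primary_load + standby_load
    needed = math.ceil(total_load / capacity_per_instance)
    # Never scale below what the standby is already running.
    return max(needed, standby_instances)

# Hypothetical: the warm standby runs 2 instances, each instance
# handles about 1,000 concurrent users, and the failed primary
# carried 10,000 users.
print(failover_target_instances(2, 10_000, 0, 1_000))  # → 10
```

An automatic scaling mechanism such as the Autoscaling Application Block would drive this kind of calculation from observed load metrics rather than fixed numbers.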
By now you’re probably figuring out the evolution of the patterns – decreasing the RTO increases costs and complexity. The Active/Active solution actually breaks this tendency with regard to cost. In an Active/Active pattern, the cloud services and database are fully deployed in both datacenters. But unlike the Active/Passive model, they both receive user traffic. This option yields the quickest recovery time. The services are already scaled to handle a portion of the load at each datacenter, and the DNS is already enabled to use the secondary datacenter. There is additional complexity in determining how to route users to the appropriate datacenter. Although a round-robin might be possible, it is more likely that certain users would use a specific datacenter where the primary copy of their data resides.
All that needs to be done in the case of failover is to disable DNS to the primary datacenter, which routes all traffic to the secondary datacenter. Even in this model, there are some variations. For example, the following diagram shows a model where the primary datacenter owns the master copy of the database. The cloud services in both datacenters write to that primary database. The secondary deployment can read from either the primary or replicated database. Replication in this example is one-way.
Figure 13 Active/Active
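The DNS-level switch described above can be illustrated with a small sketch. Everything here is hypothetical (the endpoint table, the IP addresses, the function names); a real solution would update records in its DNS or traffic-management service.

```python
# Hypothetical sketch of the DNS-level failover switch in the
# Active/Active pattern: removing the primary endpoint from rotation
# makes all new traffic resolve to the secondary datacenter.

endpoints = {
    "primary":   {"ip": "203.0.113.10", "enabled": True},
    "secondary": {"ip": "203.0.113.20", "enabled": True},
}

def resolve() -> list:
    """Return the IPs currently served to clients (the round-robin pool)."""
    return [e["ip"] for e in endpoints.values() if e["enabled"]]

def fail_over_to_secondary() -> None:
    """Disable DNS to the primary so all traffic goes to the secondary."""
    endpoints["primary"]["enabled"] = False

fail_over_to_secondary()
print(resolve())  # ['203.0.113.20']
```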
There is a downside to the Active/Active architecture shown in the previous diagram. The second datacenter must access the database in the first datacenter, since the master copy resides there. Performance drops off significantly when you access data from outside a datacenter. For cross-datacenter database calls, you should consider some type of batching strategy to improve performance. For more information, see Batching Techniques for SQL Database Applications in Windows Azure. An alternative architecture could involve each datacenter accessing its own database directly. In that model, some type of bidirectional replication would be required to synchronize the databases in each datacenter.
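To make the batching idea concrete, the sketch below groups pending writes into chunks so the cross-datacenter round-trip latency is paid once per batch instead of once per row. This is an illustrative, generic sketch, not the specific techniques from the referenced batching paper.

```python
# Illustrative batching sketch: instead of one cross-datacenter round
# trip per row, group writes into chunks so latency is paid per batch.

def batches(rows, size):
    """Yield lists of at most `size` rows from the iterable `rows`."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Hypothetical numbers: 250 pending writes sent in batches of 100 cost
# 3 cross-datacenter round trips instead of 250.
calls = list(batches(range(250), size=100))
print(len(calls))  # 3
```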
A nice benefit of the Active/Active pattern is that you might not need as many instances in the primary datacenter as you would in the Active/Passive pattern. If you have ten instances on the primary datacenter in an Active/Passive architecture, you might need only five in each datacenter in an Active/Active architecture. Both regions now share the load. This could be a cost savings over the Active/Passive pattern if you kept a warm standby in the passive datacenter with ten instances waiting for failover.
Realize that until the primary datacenter is restored, the secondary datacenter may receive a sudden surge of new users. If each datacenter was serving 10,000 users at the time the primary datacenter fails, the secondary datacenter suddenly has to handle 20,000 users. Monitoring rules on the secondary datacenter need to detect this increase and double the instances in the secondary datacenter. For more information on this, see the section on Failure Detection.
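The surge-detection rule above can be sketched as a simple scaling decision. The thresholds and the doubling factor here are assumptions for illustration; real monitoring rules would be driven by performance counters and an autoscaling mechanism.

```python
# Sketch of a monitoring rule for the surviving datacenter: if the
# active user count jumps to roughly double the normal level (the
# failed datacenter's users have arrived), double the instance count.
# The threshold and 2x factor are illustrative assumptions.

def scale_decision(normal_users: int, current_users: int,
                   current_instances: int) -> int:
    """Return the new instance count for the secondary datacenter."""
    if current_users >= 2 * normal_users:
        return current_instances * 2  # absorb the failed datacenter's load
    return current_instances

# 10,000 users was normal; after the primary fails, the secondary
# datacenter sees 20,000 and grows from 5 to 10 instances.
print(scale_decision(10_000, 20_000, current_instances=5))  # 10
```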
Hybrid On-Premises and Cloud Solutions
One additional strategy for disaster recovery is to architect a hybrid application that runs both on-premises and the Cloud. Depending on the application, the primary datacenter might be either location. Consider the previous architectures and imagine either the primary or secondary datacenters as an on-premises location.
There are some challenges in these hybrid architectures. First, most of this paper has addressed Platform as a Service (PaaS) architecture patterns. Typical PaaS applications in Windows Azure rely on Windows Azure specific constructs such as roles, cloud services, and the Fabric Controller. To create an on-premises solution for this type of PaaS application would require a significantly different architecture. This might not be feasible from a management or cost perspective.
However, a hybrid solution for disaster recovery has fewer challenges for traditional architectures that have simply moved to the Cloud. This is true of architectures that use Infrastructure as a Service (IaaS). IaaS applications use Virtual Machines in the Cloud that can have direct on-premises equivalents. The use of virtual networks also enables you to connect machines in the Cloud with on-premises network resources. This opens up several options that are not possible with PaaS-only applications. For example, SQL Server can take advantage of disaster recovery solutions such as AlwaysOn Availability Groups and database mirroring. For details, see High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines.
IaaS solutions also provide an easier path for on-premises applications to use Windows Azure as the failover option. You might have a fully functioning application in an existing on-premises datacenter. But what if you lack the resources to maintain a geographically separate datacenter for failover? You might decide to use Virtual Machines and Virtual Networks to get your application running in Windows Azure. Define processes that synchronize data to the Cloud. The Windows Azure deployment then becomes the secondary datacenter to use for failover. The primary datacenter remains the on-premises application. For more information about IaaS architectures and capabilities, see Virtual Machines and Virtual Networks.
Alternative Cloud
There are situations when even the robustness of Microsoft’s Cloud might not meet your availability requirements. In the past year or so there have been a few severe outages of various Cloud platforms, including Amazon Web Services (AWS) and Windows Azure. Even the best preparation and design for backup systems fall short if your entire Cloud provider is unavailable during a disaster.
You want to weigh your availability requirements against the cost and complexity of increased availability. If, after doing a risk analysis and defining the RTO and RPO for your solution, you determine that the application cannot tolerate any downtime, it might make sense to consider using another Cloud solution as well. Unless the entire Internet goes down at once, there is a very high probability that another Cloud solution, such as Rackspace or Amazon Web Services, will still be functioning in the rare event that Windows Azure is completely down.
As with the hybrid scenario, the failover deployments in the previous DR architectures can also exist within another Cloud solution. Alternative Cloud DR sites should only be used for those solutions with an RTO that allows very little, if any, downtime. Be aware that a solution that uses a DR site outside of Windows Azure will require more work to configure, develop, deploy, and maintain. It is also more difficult to implement best practices in a cross-Cloud architecture. Although Cloud platforms have similar high-level concepts, the APIs and architectures are different. Should you decide to split your DR among different platforms, it would make sense to architect abstraction layers into the design of the solution. This avoids the need to develop and maintain two completely different versions of the same application to run on different Cloud platforms in case of disaster. As with the hybrid scenario, the use of Virtual Machines might be easier in these cases than Cloud-specific PaaS designs.
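The abstraction-layer idea can be sketched as follows: the application codes against a small interface, and each platform gets its own adapter, so failing over to another provider does not require a second version of the application. All class and method names here are hypothetical.

```python
# Hedged sketch of a cross-Cloud abstraction layer. The application
# depends only on the BlobStore interface; per-platform adapters
# (Azure, AWS, ...) implement it. Names are hypothetical.

from abc import ABC, abstractmethod

class BlobStore(ABC):
    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

class InMemoryBlobStore(BlobStore):
    """Stand-in for a real platform adapter, used here for illustration."""
    def __init__(self) -> None:
        self._blobs = {}

    def put(self, name: str, data: bytes) -> None:
        self._blobs[name] = data

    def get(self, name: str) -> bytes:
        return self._blobs[name]

def save_report(store: BlobStore) -> None:
    # Application logic is written once, against the interface only.
    store.put("report.txt", b"quarterly numbers")

store = InMemoryBlobStore()
save_report(store)
print(store.get("report.txt"))  # b'quarterly numbers'
```

The design choice is the classic ports-and-adapters trade-off: a thin interface adds indirection, but it confines platform differences to the adapters instead of spreading them through the application.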
Automation
Some of the patterns we just discussed require quick activation of off-line deployments as well as restoration of specific parts of a system. Automation, or scripting, supports the ability to spin up resources on demand and deploy solutions rapidly. In this paper, DR-related automation is equated with Windows Azure PowerShell scripting, but the Service Management API is also an option. Developing scripts helps to manage the parts of DR that Windows Azure does not transparently handle. This has the benefit of producing consistent results each time, which minimizes the chance of human error. Having pre-defined DR scripts also reduces the time to rebuild a system and its constituent parts in the midst of a disaster. You do not want to try to manually figure out how to restore your site while it is down and losing money every minute.
Once the scripts are created you should test them over and over from start to finish. After their basic functionality is verified, make sure that they are also tested in Disaster Simulation. This helps uncover flaws in the scripts or processes.
A best practice with automation is to create a repository of Windows Azure DR PowerShell scripts. Make sure they are clearly marked and categorized for easy lookup. Designate one person to manage the repository and versioning of the scripts. Ensure you document them well with explanations of parameters and examples of script use. And ensure this documentation is kept in sync with your Windows Azure deployments. This underscores the purpose of having one person in charge of all parts of the repository.
Failure Detection
In order to correctly handle problems with availability and disaster recovery, you must be able to detect and diagnose failures. You should do advanced server and deployment monitoring so you can quickly know when a system or its parts suddenly go down. Part of this work can be done with monitoring tools that look at the overall health of the Cloud Service and its dependencies. Two Microsoft tools are MetricsHub and System Center Operations Manager (SCOM). Other third-party tools, such as AzureWatch, can also provide monitoring capabilities. AzureWatch also allows you to automate scalability. Most monitoring solutions track key performance counters and service availability.
Although these tools are vital, they do not replace the need to plan for fault detection and reporting within a cloud service. You must plan to properly use Windows Azure diagnostics. Custom performance counters or event log entries can also be part of the overall strategy. This provides more data during failures to quickly diagnose the problem and restore full capabilities. It also provides additional metrics for the monitoring tools to use to determine application health. For more information, see Collect Logging Data by Using Windows Azure Diagnostics. For a discussion of how to plan for an overall “health model”, see Failsafe: Guidance for Resilient Cloud Architectures.
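One way to expose application-level health to these monitoring tools is a lightweight health probe that aggregates checks on the service's dependencies. The sketch below is illustrative only; the check names and report shape are assumptions, not a specific Azure diagnostics API.

```python
# Minimal sketch of an application-level health probe that a
# monitoring tool could poll. The individual checks are placeholders;
# real checks would run a lightweight query or read a known blob.

def check_database() -> bool:
    return True  # placeholder: attempt a cheap query against the database

def check_storage() -> bool:
    return True  # placeholder: read a known test blob from storage

def health_report() -> dict:
    """Aggregate dependency checks into a single health report."""
    checks = {"database": check_database(), "storage": check_storage()}
    checks["healthy"] = all(checks.values())
    return checks

print(health_report())
# {'database': True, 'storage': True, 'healthy': True}
```

A probe like this gives the monitoring layer a single endpoint to watch, while the per-dependency results provide the extra data needed to diagnose which part of the system failed.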
Disaster Simulation
Simulation testing involves creating small real-life situations on the actual work floor to observe how the team members react and how effective the solutions outlined in the recovery plan are. Simulation should be carried out in such a way that the scenarios created do not disrupt actual business while still feeling like “real” situations.
Consider architecting a type of “switchboard” in the application to manually simulate availability issues. For instance, through a soft switch, trigger database access exceptions for an ordering module to make it malfunction. Similar lightweight approaches can be taken for other modules at the network interface level.
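The “switchboard” idea can be sketched as a set of soft switches that force a module to fail on demand during a drill. The switch names and exception type below are hypothetical; a real system might store the flags in configuration so they can be flipped without a redeploy.

```python
# Sketch of a fault-injection "switchboard": soft switches that force
# a module to fail on demand during a disaster simulation. All names
# are hypothetical.

switches = {"ordering.db_access": False}

class SimulatedFaultError(Exception):
    """Raised when a fault switch is turned on for a code path."""

def guard(switch: str) -> None:
    """Raise if the named fault switch is currently on."""
    if switches.get(switch):
        raise SimulatedFaultError(f"fault injected: {switch}")

def place_order(order_id: int) -> str:
    guard("ordering.db_access")  # simulate a database access failure
    return f"order {order_id} saved"

switches["ordering.db_access"] = True  # flipped during the drill
try:
    place_order(42)
except SimulatedFaultError as e:
    print(e)  # fault injected: ordering.db_access
```

Because the switch is entirely under the operator's control, the simulated failure can be turned off instantly if the drill starts affecting real business, which is exactly the controllability requirement described below.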
Any issues that were inadequately addressed get highlighted during simulation. The simulated scenarios must be completely controllable so that, even if the recovery plan seems to be failing, the situation can be restored to normal without causing any significant damage. It’s also important that higher-level management is informed about when and how the simulation exercises will be executed, including the time or resources that may become unproductive while the simulation test is running. When subjecting your disaster recovery plan to a test, it is also important to define how success will be measured.
There are several other techniques that you can use to test disaster recovery plans. However, most of them are simply altered versions of these basic techniques. The main motive behind this testing is to evaluate how feasible and how workable the recovery plan is. Disaster recovery testing focuses on the details to discover holes in the basic recovery plan.
Checklist
Let’s summarize the key points that have been covered in this paper to give you a checklist of items you should consider in your own availability and disaster recovery planning. These are best practices that have helped customers implement solutions that recover successfully and in a timely manner when system failure hits.
Conduct a risk assessment for each application since each can have different requirements. Some applications are more critical than others and would justify the extra cost to architect them for disaster recovery.
Use this information to define the RTO and RPO for each application.
Design for failure, starting with the application architecture.
Implement best practices for high availability, while balancing cost, complexity, and risk.
Implement disaster recovery plans and processes.
Consider failures that span the module level all the way to a complete Cloud outage.
Establish backup strategies for all reference and transactional data.
Choose a multi-site disaster recovery architecture.
Define a specific owner for disaster recovery processes, automation, and testing. The owner should manage and own the entire process.
Document the processes so they are easily repeatable. Although there is one owner, multiple people should be able to understand and follow the processes in an emergency.
Train the staff to implement the process.
Use regular disaster simulations for both training and validation of the process.
Summary
When hardware or software fails within Windows Azure, the techniques and strategies for managing the failure differ from those for systems you manage on-premises. This is mainly because Cloud solutions typically have more dependencies on infrastructure that is dispersed across the datacenter and managed as separate services. Partial failures must be dealt with using high availability techniques. More severe failures, possibly due to a disaster event, are managed using disaster recovery strategies.
Windows Azure detects and handles many failures, but there are many types of failures that require application-specific strategies. You must actively prepare for and manage the failures of applications, services, and data.
Your application’s availability and disaster recovery plan should be made with respect to the business consequences of the application’s failure. Defining the processes, policies, and procedures to restore critical systems after a catastrophic event takes time, planning, and commitment. And once the plans are established, you cannot stop there. You must regularly analyze, test, and continually improve the plans based on your application portfolio, the needs of the business, and the technologies available to you. Windows Azure provides both new capabilities and new challenges to creating robust applications that withstand failures.
Other Resources
Business Continuity for Windows Azure
Business Continuity in Windows Azure SQL Database
High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines
Failsafe: Guidance for Resilient Cloud Architectures
Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services