Business Continuity for Windows Azure
Published: March 2012
Authors: Patrick Wickline, Adam Skewgar, Walter Myers III
Reviewers: Michael Hyatt, Brad Calder, Tina Stewart, Steve Young, Shont Miller, Vlad Romanenko, Sasha Nosov, Monilee Atkinson, Yousef Khalidi, Brian Goldfarb
IT managers spend considerable time and effort ensuring their data and applications are available when needed. The general approach to business continuity in the cloud is no different from that in any other environment: the same principles and goals apply, but the implementation and processes will differ.
Availability problems can be classified into three broad categories:
Failure of individual servers, devices, or network connectivity
Corruption, unwanted modification, or deletion of data
Widespread loss of facilities
This document explains how to think about and plan for availability in all three of these categories when using Windows Azure.
We start by describing some of the high availability services that Windows Azure provides and then describe some of the common backup and recovery mechanisms you may choose to employ. We then move on to describe the services that Windows Azure exposes to help you build highly-available applications and, finally, discuss strategies for designing applications that can withstand full site outages.
The fundamental concepts of preparedness and recovery are consistent with what you are probably already familiar with. However, Windows Azure mitigates some potential failures for you and provides the infrastructure and capabilities for you to design your own availability and recovery plans at a much lower cost than traditional solutions.
Security and availability are closely related topics. While we touch on security of the platform itself, this document is not intended to provide full coverage of that topic and we will not cover designing secure applications. For more information on these topics, please see the following:
A Cloud Platform Designed for High Availability
By deploying your application on Windows Azure you are taking advantage of many fault tolerance and secure infrastructure capabilities that you would otherwise have to design, acquire, implement, and manage. This section covers the things Windows Azure does for you without any additional expense or complexity to ensure that you are building on a platform you can trust.
World Class Data Centers in Multiple Geographies
A basic requirement for business continuity is the availability of well-managed datacenter infrastructure in diverse geographic locations. If individual data centers are not properly managed, the most robust application designs may be undermined by gross failures at the infrastructure level. With Windows Azure, you can take advantage of the extensive experience Microsoft has in designing, managing, and securing world-class data centers in diverse locations around the world.
These are the same data centers that run many of the world’s largest online services. These datacenters are designed and constructed with stringent levels of physical security and access control, power redundancy and efficiency, environmental control, and recoverability capabilities. The physical facilities have achieved broad industry compliance, including ISO 27001, SOC / SSAE 16 / SAS 70 Type II, and, within the United States, FISMA certification.
To ensure recoverability of the Windows Azure platform core software components, Microsoft has established an Enterprise Business Continuity Program based on Disaster Recovery Institute International (DRII) Professional Practice Statements and Business Continuity Institute (BCI) Good Practice Guidelines. This program also aligns to FISMA and ISO 27001 continuity control requirements. As part of this methodology, recovery exercises are performed on a regular basis, simulating disaster recovery scenarios. In the rare event that a system failure does occur, Microsoft uses an aggressive root-cause analysis process to deeply understand the cause. Implementation of improvements learned from outages is a top priority for the engineering organization. In addition, Microsoft provides post-mortems for every customer-impacting incident upon request.
Infrastructure Redundancy and Data Durability
The Windows Azure platform mitigates outages due to failures of individual devices, such as hard drives, network interface adapters, or even entire servers. Data durability for Windows Azure SQL Database and Windows Azure Storage (blobs, tables, and queues) is facilitated by maintaining multiple copies of all data on different drives located across fully independent physical storage sub-systems. Copies of data are continually scanned to detect and repair bit rot, an often overlooked threat to the integrity of stored data.
Compute availability is maintained by deploying roles on isolated groupings of hardware and network devices known as fault domains. The health of each compute instance is continually monitored, and roles are automatically relocated to new fault domains in the event of a failure. If you take advantage of the Windows Azure service model and deploy at least two instances of each role, your application can remain available while individual instances are relocated to new fault domains after a failure. This all happens transparently and without downtime, with no need for you to manage the underlying infrastructure.
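As a sketch, the instance count that makes this possible is set in the service configuration file. The fragment below is hypothetical (the service and role names are placeholders), but it follows the ServiceConfiguration.cscfg schema: with a count of two or more, the fabric controller can place the instances in separate fault domains.

```xml
<!-- Hypothetical ServiceConfiguration.cscfg fragment: two instances of a
     web role, allowing placement across separate fault domains. -->
<ServiceConfiguration serviceName="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebRole1">
    <Instances count="2" />
    <ConfigurationSettings />
  </Role>
</ServiceConfiguration>
```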
Furthermore, Windows Azure provides built-in network load balancing, automatic OS and service patching, and a deployment model that allows you to upgrade your application without downtime by using upgrade domains, a concept similar to fault domains, which ensures that only a portion of your service is updated at any time.
Data Backup and Recovery
The ability to restore application data in the event of corruption or unwanted modification or deletion is a fundamental requirement for many applications. The following sections discuss recommended approaches for implementing point-in-time backups for Windows Azure Blob and Table storage, and SQL Databases.
Blob and Table Storage Backup
While blobs and tables are highly durable, they always represent the current state of the data. Recovery from unwanted modification or deletion of data may require restoring data to a previous state. This can be achieved by taking advantage of the capabilities provided by Windows Azure to store and retain point-in-time copies.
For Windows Azure Blobs, you can perform point-in-time backups using the blob snapshot feature. For each snapshot, you are only charged for the storage required to store the differences within the blob since the last snapshot state. The snapshots are dependent on the existence of the original blob they are based on, so a copy operation to another blob or even another storage account is advisable to ensure that backup data is properly protected against accidental deletion. For Windows Azure Tables, you can make point-in-time copies to a different table or to Windows Azure Blobs. More detailed guidance and examples of performing application-level backups of tables and blobs can be found here:
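As an illustrative sketch (not official SDK code), the snapshot and protective copy described above map to the Windows Azure Storage REST API roughly as follows. Account, container, blob, and snapshot names are placeholders, and the required authentication and versioning headers are omitted; only the URL shapes are shown.

```python
# Sketch of the REST URLs involved in snapshotting a blob and copying a
# specific snapshot to another location. No request is sent here.

def snapshot_url(account, container, blob):
    """URL for the snapshot operation (PUT with comp=snapshot)."""
    return ("https://%s.blob.core.windows.net/%s/%s?comp=snapshot"
            % (account, container, blob))

def copy_source_header(account, container, blob, snapshot_time):
    """x-ms-copy-source value that names a specific snapshot, so the
    copied backup survives deletion of the base blob."""
    return ("https://%s.blob.core.windows.net/%s/%s?snapshot=%s"
            % (account, container, blob, snapshot_time))
```

A real backup job would issue the snapshot request, read the returned snapshot timestamp, and then start a Copy Blob operation to a different container or storage account using that timestamp.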
Windows Azure SQL Database Backup
Point-in-time backups for SQL Databases are achieved with the Windows Azure SQL Database Copy command. You can use this command to create a transactionally-consistent copy of a database on the same logical database server or to a different server. In either case, the database copy is fully functional and completely independent of the source database. Each copy you create represents a point-in-time recovery option. You can recover the database state completely by renaming the new database with the source database name. Alternatively, you can recover a specific subset of data from the new database by using Transact-SQL queries. For additional details about SQL Database backup, see Business Continuity in Windows Azure SQL Database.
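As a sketch of the Transact-SQL involved, the statements below show the shape of a database copy and a status check. The database and server names are placeholders; the statements would be executed against the master database of the target logical server.

```python
# Sketch: build the T-SQL for a point-in-time database copy and for
# monitoring its progress. Names are placeholders.

def copy_statement(source_db, copy_name, source_server=None):
    """CREATE DATABASE ... AS COPY OF starts an asynchronous,
    transactionally consistent copy. Cross-server copies qualify the
    source database with its logical server name."""
    source = source_db if source_server is None else "%s.%s" % (source_server, source_db)
    return "CREATE DATABASE [%s] AS COPY OF %s" % (copy_name, source)

def copy_status_query(copy_name):
    """Copy progress can be monitored in sys.dm_database_copies."""
    return ("SELECT percent_complete FROM sys.dm_database_copies "
            "WHERE partner_database = '%s'" % copy_name)
```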
Platform Services for Building Highly Available Solutions
On top of a secure and highly redundant infrastructure, there are Windows Azure services that act as building blocks for designing highly available deployments that span multiple data centers. These services can each be used on their own or in combination with each other, third-party services and application-specific logic to achieve the desired balance of availability and recovery goals.
Geo-Replication of Blob and Table Storage
All Windows Azure blob and table storage is replicated between paired data centers hundreds of miles apart within a specific geographic region (e.g., between North Central and South Central in the United States or between North and West in Europe). Geo-replication is designed to provide additional durability in case there is a major data center disaster. With this first version of geo-replication, Microsoft controls when failover occurs, and failover will be limited to major disasters in which the original primary location is deemed unrecoverable in a reasonable amount of time. In the future, Microsoft plans to provide the ability for customers to control failover of their storage accounts on an individual basis. Data is typically replicated within a few minutes, although the synchronization interval is not yet covered by an SLA. More information can be found in the blog post Introducing Geo-replication for Windows Azure Storage.
Windows Azure SQL Database Export to Blob Using the SQL Database Import/Export Service
Geo-replication of SQL Database data can be achieved by exporting the database to an Azure Storage blob using the SQL Database Import/Export service. This can be implemented in three ways:
Export to a blob using a storage account in a different data center.
Export to a blob using a storage account in the same data center (and rely on Windows Azure Storage geo-replication to the paired data center).
Import to your on-premises SQL Server.
For implementation details see the MSDN article Business Continuity in Windows Azure SQL Database.
Windows Azure Traffic Manager
Windows Azure Traffic Manager (currently a Community Technology Preview) allows you to load balance incoming traffic across multiple hosted Windows Azure services, whether they are running in the same datacenter or in different datacenters around the world. By effectively managing traffic, you can ensure high performance, availability, and resiliency of your applications. Traffic routing is governed by policies that you define, based on one of the following three criteria:
Performance – traffic is forwarded to the closest hosted service in terms of network latency
Round robin – traffic is distributed equally across all hosted services
Failover – traffic is sent to a primary service and, if this service is not online, to the next available service in a list
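The three policies above can be illustrated with a small sketch. This is not the Traffic Manager API; the endpoint names, latencies, and health flags are invented, and each endpoint is represented as a (name, latency_ms, healthy) tuple.

```python
# Illustrative sketch of how the three Traffic Manager policy types choose
# among hosted services. Endpoints are (name, latency_ms, healthy) tuples.
import itertools

def performance(endpoints):
    """Route to the healthy endpoint with the lowest network latency."""
    healthy = [e for e in endpoints if e[2]]
    return min(healthy, key=lambda e: e[1])[0] if healthy else None

def round_robin(endpoints):
    """Cycle through the healthy endpoints, distributing traffic equally."""
    healthy = [e[0] for e in endpoints if e[2]]
    return itertools.cycle(healthy) if healthy else None

def failover(endpoints):
    """Send traffic to the first service in the ordered list that is online."""
    for name, _latency, healthy in endpoints:
        if healthy:
            return name
    return None
```

In each case an unhealthy endpoint is skipped, which is the monitoring behavior described in the next paragraph.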
Traffic Manager constantly monitors hosted services to ensure they are online and will not route traffic to a service that is unavailable. Using Traffic Manager, you can create applications that are designed to respond to requests in multiple geographic locations and that can therefore survive entire site outages. Prior to the general availability of a production version of Traffic Manager, you can implement third-party DNS routing solutions such as those provided by Akamai or Level 3 Communications to build applications that are designed for high availability. For more information, see Windows Azure Traffic Manager.
Planning for Site Disasters
Windows Azure greatly reduces what you need to worry about to ensure high availability within a single data center. However, even the most well designed data center could be rendered inaccessible in the case of a true disaster. To plan for such disasters, you must think through both the technical and procedural steps required to provide the level of cross-site availability you require. Much of this is application specific and as the application owner you must make tradeoff decisions between availability and complexity or cost.
Types of Cross-Site Availability
There are some basic steps that all but the most trivial applications should take to ensure they can be deployed in a different data center in the event of a site disaster. For many applications, redeployment from scratch is an acceptable solution. Those that need a quicker and more predictable recovery require a second deployment ready and waiting in a second data center. For an even smaller set of applications, true multi-site high availability is required. We will look at each of these classes of applications in order from the least to the most complex and costly. Cross-site availability is a lot like insurance: you pay for protection that you hope you will never have to use. No two applications have precisely the same business requirements or technical design points, so these classes of applications are meant as general guidance that should be adapted to your specific needs.
For an application to become available in a secondary data center, three requirements must be met:
The customer’s application and dependent services must be deployed.
The necessary data must be available to the application (typically in the same data center).
Any external traffic must be routed to the application.
Each of these three requirements can be accomplished at the time of a disaster or ahead of time, and each can be accomplished manually or automatically. Every application is different, so it’s possible there are other application-specific requirements, such as availability of a dependent service. In the rest of this section, we’ll describe several common patterns for recovering availability in the case of a disaster.
Redeploy on Disaster
The simplest form of disaster recovery is to redeploy your application and dependent services, such as Windows Azure Cache and Windows Azure Service Bus, when a disaster occurs. Redeployment of applications is accomplished using the same method used when the application was originally created. This can be done manually via the Windows Azure Portal or can be automated using the Windows Azure Service Management interface.
To mitigate data loss, the redeployment strategy should be coupled with data backup or synchronization to make sure the data exists and is usable from a backup storage account or database. Because the data for the new deployment is in a different account, the application must be designed with configurable connection strings. This way, on redeploy, only the application’s configuration must be changed.
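The configurable connection strings mentioned above are typically kept in the service configuration rather than in code, so a redeployment to a backup account only requires a configuration change. The fragment below is a hypothetical sketch (setting name, account name, and key are placeholders) following the ServiceConfiguration.cscfg setting format.

```xml
<!-- Hypothetical cscfg setting: the storage connection string lives in
     configuration, so a redeploy can point at the backup account without
     rebuilding the application. -->
<ConfigurationSettings>
  <Setting name="StorageConnectionString"
           value="DefaultEndpointsProtocol=https;AccountName=backupaccount;AccountKey=PLACEHOLDER" />
</ConfigurationSettings>
```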
To meet the third requirement (traffic routing to the new deployment), a custom domain name must be used (not <name>.cloudapp.net). In the case of a disaster, the custom domain name can be configured to route to the new application name after the new deployment is completed. A CNAME record is sufficient to accomplish this, but you should be sure you know the process for making the necessary changes with your DNS provider.
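As a sketch, the DNS change involved might look like the zone entries below. The domain and hosted service names are placeholders; the point is that clients resolve the custom name, so repointing the CNAME redirects all traffic to the new deployment once the change propagates.

```
; Hypothetical zone entries. Before the disaster, the custom name points
; at the primary hosted service:
www.contoso.com.    IN  CNAME  contoso-primary.cloudapp.net.
; After redeployment, the same record is updated to the new deployment:
; www.contoso.com.  IN  CNAME  contoso-recovery.cloudapp.net.
```

Note that DNS time-to-live (TTL) values affect how quickly clients pick up the change, so this is worth rehearsing with your DNS provider before a disaster occurs.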
Because no compute resources are reserved for recovery, this is the least expensive solution. However, low cost comes at the expense of increased risk and increased recovery time. In the event of a large-scale disaster, resources in other data centers are allocated based on real-time availability.
Active/Passive Deployment
The redeployment strategy described previously takes time and carries risk. Some customers need faster and more predictable recovery and may want to reserve standby resources in an alternate data center. The active/passive pattern means keeping an always-ready (potentially smaller) deployment of the application in a secondary data center. At the time of a disaster, the secondary deployment can be activated and scaled.
Here, the first requirement (deployment of the application) is taken care of ahead of time. The second requirement (data availability) is typically handled by one of the data replication methods discussed above. The third requirement (routing traffic) is handled in the same way as with the redeploy pattern (a DNS change at the time of disaster). In this pattern, however, the process can be fully automated using Windows Azure Traffic Manager (CTP) or other similar DNS management solutions, which can be configured to reroute traffic to a secondary deployment only when the first deployment stops responding.
When using a warm standby, it is important to carefully think through data consistency implications at the time of the failover. You need to plan for any steps that must be taken before your application is brought online in an alternate location as well as how and when to return to the original deployment when it becomes available again. Because the application is not designed to function simultaneously in multiple data centers, returning service to the primary location should follow a similar procedure to the one you follow in the case of a failover.
Active/Active Deployment
To achieve full, multi-site high availability without downtime, another common pattern is the active/active deployment, so named because two deployments in two data centers are both active. This meets the first requirement (that the application is deployed). In this model, a full copy of the data is kept in both locations and continually synchronized, which meets the second requirement (that the data is available). The third requirement (traffic routed to the deployment) is always satisfied because both deployments handle incoming traffic at all times. As in the active/passive configuration, this can be accomplished with Windows Azure Traffic Manager (CTP) or similar solutions, but now using a performance policy, which routes each user to whichever data center provides the lowest latency, or a round-robin policy, which distributes load evenly across sites. This solution is the most difficult to implement, as it requires the application to be designed to handle simultaneous requests across multiple instances that reside in distinct data centers. However, it is more efficient than the active/passive solution in that all compute resources are utilized all the time.
Access Control Service (ACS)
Windows Azure Access Control Service (ACS) is a cloud-based service that provides an easy way of authenticating and authorizing users to gain access to your web applications and services while allowing the features of authentication and authorization to be factored out of your code. Instead of implementing an authentication system with user accounts that are specific to your application, you can let ACS orchestrate the authentication and much of the authorization of your users. ACS integrates with standards-based identity providers, including enterprise directories such as Active Directory, and web identities such as Windows Live ID, Google, Yahoo!, and Facebook.
Restoration of ACS Namespaces in the Event of a Disaster
Being built on the Windows Azure platform, ACS is a highly available system and resilient to failures within a single data center. However, in the event of a disaster that renders a full data center inoperable, there is the potential for data loss. Should such a disaster occur, all Windows Azure services will make best efforts to restore you to a pre-disaster state as quickly as possible.
The ability to restore an ACS subscription depends on the version of ACS that you subscribe to. ACS currently consists of two versions (version 1.0 and version 2.0) with different disaster recovery and data restoration capabilities. There are two ways to determine the version of an ACS namespace:
See the “ACS Version” column in the Service Bus, Access Control, & Caching section of the Windows Azure portal. This column is present for all Access Control and Service Bus namespaces, the latter of which includes a built-in ACS namespace.
You can also determine the ACS version by trying to connect to the ACS 2.0 management portal in a web browser. The URL of the ACS 2.0 management portal is https://<tenant>.accesscontrol.windows.net/, where <tenant> should be replaced by your actual ACS namespace name. If you can successfully access the portal or are prompted to sign in, then the ACS version is 2.0. If you get an HTTP status code 404 or the error message “Page cannot be found”, then the ACS version is 1.0.
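The version check above can be sketched as a small function. This is illustrative only: the namespace name is a placeholder, no HTTP request is sent here, and the status-to-version mapping simply encodes the rule described in the preceding paragraph.

```python
# Sketch of the ACS version check: probe the ACS 2.0 management portal URL
# and interpret the HTTP status code.

def management_portal_url(namespace):
    """ACS 2.0 management portal URL for a given namespace."""
    return "https://%s.accesscontrol.windows.net/" % namespace

def acs_version(status_code):
    """A reachable portal (or sign-in redirect) implies ACS 2.0;
    a 404 implies the namespace is on ACS 1.0."""
    if status_code == 404:
        return "1.0"
    if status_code in (200, 302):
        return "2.0"
    return "unknown"
```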
ACS 1.0 Namespaces
While benefiting from the high availability built into Windows Azure, ACS 1.0 namespaces are not recoverable in the event of a disaster. Should a disaster occur, Microsoft will work with ACS 1.0 customers to mitigate the loss of data; however, no guarantees are made about the ability to recover from a disaster.
In December 2011, Microsoft officially deprecated ACS version 1.0. All customers with ACS 1.0 namespaces are encouraged to migrate to ACS 2.0 in advance of December 20, 2012 to avoid potential service interruptions. For more information about migrating from ACS 1.0 to ACS 2.0, please see the Guidelines for Migrating an ACS 1.0 Namespace to an ACS 2.0 Namespace.
ACS 2.0 Namespaces
ACS 2.0 takes backups of all namespaces once per day and stores them in a secure offsite location. When ACS operations staff determine there has been an unrecoverable data loss at one of ACS’s regional data centers, ACS may attempt to recover customers’ subscriptions by restoring the most recent backup. Because backups are taken only once per day, up to 24 hours of data loss may occur.
ACS 2.0 customers concerned about the potential for data loss are encouraged to review a set of Windows Azure PowerShell Cmdlets available through the Microsoft-hosted CodePlex open source repository. These scripts allow administrators to manage their namespaces and to import and extract all relevant data. By using these scripts, ACS customers can develop custom backup and restore solutions that offer a higher level of data consistency than ACS currently provides.
In the event of a disaster, information describing the current status of all Windows Azure services globally will be posted on the Windows Azure Service Dashboard. The dashboard will be updated regularly with information about the disaster. If you want to receive notifications of interruptions to any of the services, you can subscribe to the service’s RSS feed on the Service Dashboard. In addition, you can contact customer support by visiting the Support options for Windows Azure web page and following the instructions to get technical support for your service(s).
Conclusion
Meeting your availability and recovery requirements using Windows Azure is a partnership between you and Microsoft. Windows Azure greatly reduces the things you need to deal with by providing application resiliency and data durability to survive individual server, device, and network connection failures. Windows Azure also provides many services for implementing backup and recovery strategies and for designing highly-available applications using world-class datacenters in diverse geographies. Ultimately, only you know your application requirements and architecture well enough to ensure that you have an availability and recovery plan that meets your requirements.
We hope this document helps you think through the issues and your options. We welcome your feedback on this document and your ideas for improving the business continuity capabilities we provide. Please send your comments to email@example.com.