Lately, I’ve been giving a lot of thought to the deployment of applications. It turns out that for applications, the matrix for fault tolerance and upgrade path gets a bit tricky—and even trickier when applications have a mix of services, a Web UI and back-end processes. Add in geographic distribution and the logistics become even more muddied.
In large IT organizations, a minimum deployment of any Web or application server often involves two servers that are geographically separated. This easily moves up to four servers if two servers are specified for the expected load and you have a mirror site with the same setup (of course, database and other supporting server infrastructure can push the number higher still). What if the company serves multiple locations, such as North America and Europe, the Middle East and Africa (EMEA)? Now the setup gets replicated to both sides of the Atlantic, turning what started as two Web servers into eight servers for geo failover and for staging assets closer to consumers.
Eventually, an application is deployed on all these servers and everything is running along smoothly—and then some cheeky developer creates new functionality and wants to update the deployment.
As you can imagine, it takes a good bit of planning to determine the order in which servers will drain connections, get updated and tested, and then be put back into the pool. Some folks spend late nights working through upgrade plans, and that’s even when there are no real problems.
Windows Azure doesn’t eliminate the need for an upgrade plan, but it does take much of the complexity out of upgrading by handling most of it as part of the fabric. In this column, I’m going to cover fault domains and upgrade domains, and write a little bit of code to apply an upgrade across the deployment.
Windows Azure includes the concepts of fault domains and upgrade domains, both of which are almost fully described by their names. Fault domains define a physical unit of deployment for an application and are typically allocated at the rack level. By placing fault domains in separate racks, you separate instances of application deployment to hardware enough that it’s unlikely all would fail at the same time. Further, a failure in one fault domain should not precipitate the failure of another. When you deploy a role with two configured instances, the fabric ensures the instances are brought up in two different fault domains. Unfortunately, with fault domains, you have no control over how many domains are used or how roles are allocated to them.
Upgrade domains are another matter. You have control over these domains and can perform incremental or rolling upgrades across a deployment by upgrading a group of instances at a time. Whereas fault domains are about the physical deployment of roles, upgrade domains relate to logical deployment. Because an upgrade domain is a logical grouping of roles, a single Web application could easily exist in five different upgrade domains divided into only two separate physical deployments (fault domains). In this case, to update the Web application, you might update all roles in group 0 (upgrade domain 0), then all roles in group 1 and so on. You can exercise finer-grained control by updating individual roles one at a time in each update domain.
In summary, an application that requires more than one instance will be split into at least two fault domains. To make upgrading a Web application across the whole farm easier, roles are combined into logical groupings that are updated at the same time.
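To make the two mappings concrete, here's a small sketch of my own (a simulation, not the actual fabric allocator, which you can't observe directly) that assigns instances round-robin to two fault domains and the default five update domains:

```csharp
using System;

class DomainAllocationSketch
{
    // Hypothetical round-robin allocator; the real fabric controls this internally.
    static void Main()
    {
        const int faultDomains = 2;    // typical minimum the fabric guarantees
        const int updateDomains = 5;   // Windows Azure default per service
        const int instanceCount = 8;

        for (int i = 0; i < instanceCount; i++)
        {
            int faultDomain = i % faultDomains;    // physical placement (rack)
            int updateDomain = i % updateDomains;  // logical upgrade group
            Console.WriteLine("Instance_{0}: FaultDomain={1} UpdateDomain={2}",
                i, faultDomain, updateDomain);
        }
    }
}
```

Note how every update domain ends up holding instances from different fault domains — that overlap is what keeps a rolling upgrade from colliding with a hardware failure.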
The Windows Azure Management Console shows an Update Domain column, but not a Fault Domain column (see Figure 1). (Note that upgrade domain and update domain are interchangeable terms. The documentation often refers to upgrade domains, but in the API it’s called an update domain.)
Figure 1 The Windows Azure Management Console
In Figure 1 you can see that the numbers for my four deployments run from 0 to 3. By default, Windows Azure uses five update domains for each service and assigns them in round-robin style. You can change this in the service definition file by assigning the desired number of upgrade domains to the upgradeDomainCount attribute of the ServiceDefinition element. You'll find links to the schemas for Web and Worker roles at msdn.microsoft.com/library/ee758711. To force a WebRole to use only three upgrade domains, for example, you set upgradeDomainCount in the service definition file:
<ServiceDefinition name="<service-name>" upgradeDomainCount="3"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="<web-role-name>" vmsize="[ExtraSmall|Small|Medium|Large|ExtraLarge]">
    ...
  </WebRole>
</ServiceDefinition>
This is important, because the number of update domains ultimately affects your plan and execution. Unfortunately, there’s no column that lets you see fault domain assignments. By writing a little code, however, you can pull back the curtain a bit on the deployment and see both update domain and fault domain assignments, as Figure 2 shows.
Figure 2 Finding Role Information
protected void GetRoleInfo()
{
  List<RoleInfo> roleInfos = new List<RoleInfo>();

  foreach (var role in RoleEnvironment.Roles)
  {
    foreach (RoleInstance roleInstance in role.Value.Instances)
    {
      RoleInfo info = new RoleInfo();
      info.RoleName = role.Value.Name;
      info.InstanceId = roleInstance.Id;
      info.FaultDomain = roleInstance.FaultDomain.ToString();
      info.UpgradeDomain = roleInstance.UpdateDomain.ToString();
      roleInfos.Add(info);
    }
  }

  GridView1.DataSource = roleInfos;
  GridView1.DataBind();
}
This code doesn't show the small RoleInfo class I defined to store the relevant information. And unfortunately, though I have this nice nested loop that goes through the roles and the instances, the API returns data only for the specific instance running the code. Thus, the code produces a small grid with just the current WebRole information in it (see Figure 3), without any other instance information.
Figure 3 Current WebRole Information
This code provides a quick look at the current WebRole’s fault and upgrade domains, but you’ll need to use the Get Deployment REST URI to get more comprehensive data. It returns the deployment XML, which contains, among other things, elements for <Configuration/> and for each of the <RoleInstances />. Once you’ve fetched the configuration, you can change it and put it back. Take a look at my October 2010 column (msdn.microsoft.com/magazine/gg232759) for examples that show many of the same operations that would be involved here.
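As a starting point, the sketch below builds the Get Deployment URI and request. The URI format and the x-ms-version header come from the Service Management API; the subscription ID, service name and certificate are placeholders you'd substitute with your own values:

```csharp
using System;
using System.Net;

class GetDeploymentSketch
{
    // Builds the Get Deployment URI; all three arguments are placeholders
    // for your own subscription ID, hosted service name and slot.
    public static string BuildGetDeploymentUri(string subscriptionId,
        string serviceName, string slot)
    {
        return String.Format(
            "https://management.core.windows.net/{0}/services/hostedservices/{1}/deploymentslots/{2}",
            subscriptionId, serviceName, slot);
    }

    static void Main()
    {
        string uri = BuildGetDeploymentUri("subscription-id-goes-here",
            "myservice", "production");

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "GET";
        request.Headers.Add("x-ms-version", "2011-10-01");
        // The management API authenticates with a client certificate
        // you've uploaded to the subscription:
        // request.ClientCertificates.Add(myManagementCertificate);

        // Calling request.GetResponse() returns the deployment XML,
        // including the role instance list with its domain assignments.
        Console.WriteLine(uri);
    }
}
```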
There are two basic strategies for updating a Windows Azure deployment: in-place upgrades and virtual IP (or VIP) swap. VIP swap is the simpler approach and allows for full testing of the new or updated application before opening the gates to the public. Moreover, the application can run at full capacity as soon as it’s live. Should there be issues when the swap is complete, you can quickly put the previous version back in place while the new deployment is being worked on.
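If you want to script the swap itself rather than click through the console, the Service Management API exposes a Swap Deployment operation. The sketch below builds the request body; the deployment names are placeholders, and the body is POSTed to the hosted service URI shown in the comment:

```csharp
using System;
using System.Text;

class VipSwapSketch
{
    // Builds the Swap Deployment request body; the two deployment
    // names are placeholders for your production and staged deployments.
    public static string BuildSwapBody(string productionName, string stagingName)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("<Swap xmlns=\"http://schemas.microsoft.com/windowsazure\">");
        sb.AppendFormat("<Production>{0}</Production>", productionName);
        sb.AppendFormat("<SourceDeployment>{0}</SourceDeployment>", stagingName);
        sb.Append("</Swap>");
        return sb.ToString();
    }

    static void Main()
    {
        // POST this body (with the x-ms-version header and a management
        // certificate) to:
        // https://management.core.windows.net/<subscription-id>/services/hostedservices/<service-name>
        Console.WriteLine(BuildSwapBody("currentDeployment", "stagedDeployment"));
    }
}
```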
You'll find a good reference describing what can and can't be done in each deployment model at bit.ly/x7lRO4; it calls out the points that might force a choice. Other than those points and some SDK version considerations, it's up to you to decide.
Swapping the VIP of the staging and production environments is a pretty good solution for many, if not most, cases when rolling out a new version. Sometimes it’s the only way to keep the site mostly available while making changes, though if you’re upgrading a large deployment, bringing up another full deployment can be cumbersome. There’s also a cost associated with deploying a complete copy—one compute hour charge for each deployed instance and then the additional compute hours for the two running copies.
In Web farms nowadays, updates are generally rolled out in one of two ways: by taking one server offline at a time, upgrading it and returning it to the farm pool; or by dividing the farm into segments, draining the connections on one segment, upgrading it and bringing it back online, then moving on to the next segment.
An in-place update works like the second pattern. However, the more upgrade domains used, the more the pattern resembles the first option. The upside of using a larger number of upgrade domains is that the site capacity decreases only by the size of the segment during the entire upgrade.
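A quick back-of-the-envelope sketch (my own simulation, with assumed instance counts) shows why: during a rolling upgrade, only the segment currently being walked is out of the pool, so capacity never dips below total minus one update domain's worth of instances:

```csharp
using System;

class RollingUpgradeSketch
{
    // Walks update domains in order; while a domain is being updated,
    // its instances are out of the load-balancer pool.
    static void Main()
    {
        const int updateDomains = 5;
        const int instancesPerDomain = 2;   // assumed even distribution
        int total = updateDomains * instancesPerDomain;

        for (int ud = 0; ud < updateDomains; ud++)
        {
            // Only one segment is down at any point in the walk.
            int available = total - instancesPerDomain;
            Console.WriteLine("Updating domain {0}: {1} of {2} instances serving traffic",
                ud, available, total);
        }
    }
}
```

With five update domains the site runs at 80 percent capacity throughout the walk; with only two, it drops to 50 percent.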
The primary challenge that traditional non-cloud deployments face is the same for cloud deployments: when you perform rolling upgrades, mixed versions of the application will be running. The instances might deliver different visuals, use different data and service connections, and so forth. This can lead to site errors or even undesirable user experiences, and it may be completely unacceptable for your business. Moreover, it puts a heavy burden on the development and test teams to make sure the application will run when there are multiple versions in play.
What do you do if you can’t use VIP swap and availability requirements preclude a delete and deploy? You might try using only two update domains and performing an in-place update, which keeps a single version of the application running during the deployment. The downside: half of your site’s capacity will be unavailable during the transition.
The grid in Figure 4 might help you consider which approach to employ in performing an upgrade.
Figure 4 Upgrade Decision Matrix
(Options compared in the matrix: VIP swap, in-place upgrade with 2 update domains, in-place upgrade with 3+ update domains.)
Nice advancements have been made in the ability to perform the upgrade both within the management console and via scripting. For small to midsize organizations with a relatively modest number of deployments, it’s easiest to manage the updates through the Windows Azure Management Console, shown in Figure 5.
Figure 5 Windows Azure Management Console
As you can see in the upper left corner of the screen, a Manual Upgrade is running. This requires clicking the Start button to initiate the process for each upgrade domain—that’s the manual part of it. Once the update is started, the console displays what’s going on in the instances in each domain, as shown in Figure 6.
Figure 6 Update Activity
The manual, push-button method works well for smaller deployments. For larger deployments or those where you want to automate the build-test-deploy process, you should choose a scripted approach. You can automate the process using the CSManage command-line tool, which you can download from bit.ly/A6uQRi. CSManage will initiate the upgrade and walk through the process of upgrading one update domain at a time from the command line. Though this is helpful, there’s a level of fine control that can only be accomplished using the REST API directly.
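For that finer control, the Service Management API's Walk Upgrade Domain operation lets you advance the upgrade one update domain at a time. The sketch below builds the URI and request body; the subscription, service and slot values are placeholders:

```csharp
using System;

class WalkUpgradeDomainSketch
{
    // Builds the Walk Upgrade Domain URI; arguments are placeholders.
    public static string BuildWalkUri(string subscriptionId,
        string serviceName, string slot)
    {
        return String.Format(
            "https://management.core.windows.net/{0}/services/hostedservices/{1}/deploymentslots/{2}/?comp=walkupgradedomain",
            subscriptionId, serviceName, slot);
    }

    // Builds the request body naming the update domain to walk next.
    public static string BuildWalkBody(int upgradeDomain)
    {
        return String.Format(
            "<WalkUpgradeDomain xmlns=\"http://schemas.microsoft.com/windowsazure\">" +
            "<UpgradeDomain>{0}</UpgradeDomain></WalkUpgradeDomain>",
            upgradeDomain);
    }

    static void Main()
    {
        // POST the body to the URI (x-ms-version header and management
        // certificate required), once per update domain, in whatever
        // order your upgrade plan calls for.
        Console.WriteLine(BuildWalkUri("subscription-id-goes-here",
            "myservice", "production"));
        Console.WriteLine(BuildWalkBody(0));
    }
}
```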
If for one reason or another you've decided not to walk the update domains from 0 to n and instead plan to use your own starting point or order, you'll need to look at the combination of update and fault domains. The grid in Figure 7 makes it obvious that if you were to update upgrade domain 1, and fault domain 0 faulted during the update, the site would be completely down. Normally the fabric's assignments cover this, though, and the grid shows that if the update happens in order, there will always be instances running in different fault domains.
Figure 7 Domain Matrix
The lesson here is to consider potential consequences during planning, and to not “fix” something that’s already working.
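You can check a custom order against that risk with a small function. This sketch (my own, assuming the round-robin assignment described earlier) asks whether any instance is still serving when one fault domain fails while one update domain is being walked:

```csharp
using System;
using System.Linq;

class DomainMatrixSketch
{
    // Given instances assigned round-robin to fault and update domains,
    // returns true if at least one instance is still serving when fault
    // domain 'failedFd' is down while update domain 'updatingUd' is updating.
    public static bool SiteStaysUp(int instanceCount, int faultDomains,
        int updateDomains, int failedFd, int updatingUd)
    {
        return Enumerable.Range(0, instanceCount)
            .Any(i => i % faultDomains != failedFd && i % updateDomains != updatingUd);
    }

    static void Main()
    {
        // 4 instances over 2 fault domains and only 2 update domains:
        // updating update domain 1 while fault domain 0 faults leaves
        // nothing serving.
        Console.WriteLine(SiteStaysUp(4, 2, 2, 0, 1));  // False
        // With the default five update domains, the overlap disappears.
        Console.WriteLine(SiteStaysUp(4, 2, 5, 0, 1));  // True
    }
}
```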
When you’re designing a Windows Azure application, you need to take deployment architecture into account. Windows Azure provides the functionality of the fabric to ensure that an application will not fault due to a single hardware failure, while providing an easy, automatic way to incrementally update the deployment. Still, support for an in-place update is something that has to be designed into the application—and the update that’s being pushed.
You can update a Windows Azure service using VIP swap, or with a two-upgrade-domain in-place plan when a full rolling in-place update can't be supported. Last, there are both UI and programmatic means to control deployment and updates, so you can perform a scheduled update or fold updates into a build-test-deploy process.
Joseph Fultz is a software architect at Hewlett-Packard Co., working as part of the HP.com Global IT group. Previously he was a software architect for Microsoft working with its top-tier enterprise and ISV customers defining architecture and designing solutions.
Thanks to the following technical expert for reviewing this article: Don Glover
Reader comments:

Andrew: Hi Joseph, thanks for the great article. In Figure 4 (the table) you have the following in the CONS section for VIP swap: "Hiccup in service at time of swap." What is the reason for this hiccup? Is it a cold start on the ASP.NET Web app, or a DNS issue, or a connection pooling issue? Thanks, Andrew

Reply: @Andrew - the problem is that any context that has been established (session, state data and so on) will be lost because of the switch of servers. It's feasible to implement a solution that mitigates the problem by storing such things in a shared cache, but that won't address any network session losses.

Another reader: You write, "By placing fault domains in separate racks, you separate instances of application deployment to hardware enough that it's unlikely all would fail at the same time. Further, a failure in one fault domain should not precipitate the failure of another." But that can increase latency. Affinity groups minimize the distance between resources, while fault domains spread them out.