How Virtualized Corporate Networks Raise the Premium on System Reliability


Phil Riccio

June 2010

 

Summary: Most application managers have heard of virtualization, but few of them understand its implications for application availability and performance. Even IT managers who make their living at the top of the IT stack need a good working knowledge of this critical new infrastructure technology.

 

Introduction

It is hard to argue that the action in IT is anywhere else but at the application level. Applications are the frontline of IT, where technology touches workers, customers, and partners. E-mail, messaging and database servers, and business services such as online transactions and credit-card authorization: These are where companies make their money—or where companies lose their money, if their applications fail for any reason.

Perhaps in contradiction of the previous point, the hottest technology movement of the moment—virtualization—is not an application play, but an infrastructure play, far down the stack from applications. Virtualization is about remaking the data center. So, why should managers who are in charge of, say, Microsoft Exchange Server/Office Outlook or Office SharePoint, care what is happening that far down the stack?

The answer is simple: Because what happens that far down the stack will ultimately affect application operation and management. Residing at the glamorous end of the IT stack does not justify ignorance of virtualization, and even a passing curiosity is not enough for application managers to do their jobs properly if their organizations are considering it. IT managers must be aware of the strengths and weaknesses of virtualization, because how they respond to it will determine whether their applications continue to meet service-level agreements by providing the necessary application availability.

Virtualization adds flexibility and efficiency to the IT infrastructure. Its potential to reduce the overall footprint of the data center by making better use of server resources benefits the entire IT organization by trimming costs and freeing up money and people to perform revenue-producing work. However, it also reverses a long trend that began with the client/server computing era and continued until recently: Virtualization reintroduces single points of failure into the IT infrastructure. Even when IT managers think about application availability in a virtualized environment, they often come to the wrong conclusion, which is that virtualization is its own availability solution. That conclusion reveals a risky misperception of the capabilities of virtualization and can lead to unacceptable levels of application downtime.

The consequences of downtime are high. IDC estimated the cost of an 11-hour IT outage at $1 million. Half of the medium-size businesses that were surveyed in a recent Gartner report revealed that their employees experience at least six hours of unplanned downtime per month, with corresponding losses in productivity. Large companies—those that have 2,500 users or more—report even higher losses: up to 16 hours of unplanned downtime per month.
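To relate those hours to the availability percentages discussed later in this article, a few lines of illustrative arithmetic (using only the survey figures quoted above and an assumed 30-day month) convert monthly downtime into an equivalent availability percentage:

    # Illustrative arithmetic: convert hours of unplanned downtime per month
    # into an equivalent availability percentage. The downtime figures are the
    # survey numbers quoted above; a 30-day (720-hour) month is assumed.

    HOURS_PER_MONTH = 30 * 24  # 720 hours

    def availability(downtime_hours_per_month):
        """Fraction of the month the systems were actually up, as a percentage."""
        return 100.0 * (1 - downtime_hours_per_month / HOURS_PER_MONTH)

    for label, downtime in [("medium-size business", 6), ("large company", 16)]:
        print(f"{label}: {downtime} h/month down -> {availability(downtime):.2f}% available")

    # medium-size business: 6 h/month down -> 99.17% available
    # large company: 16 h/month down -> 97.78% available

Six hours of downtime per month corresponds to roughly 99.2 percent availability, and 16 hours to roughly 97.8 percent, well below the uptime levels that the rest of this article treats as acceptable for critical applications.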

 

Virtual Reality Check

Server-based architectures reduced the single points of failure that existed in mainframe- and minicomputer-based architectures. In the data-center architectures that still dominate today, applications either have a dedicated server or share a server with (at most) one or two other applications. Workloads have been physically partitioned to safeguard security, performance, and availability. A single server failure can knock a few applications offline, but does not have broad consequences across the application infrastructure.

The potential loss of performance and availability from a hardware failure is much more serious in a virtualized environment. Virtualization software can contribute to availability higher up the stack, at the application level. A hardware failure, however, will take down all of the applications on the physical server, regardless of how many virtual machines are backing up the primary application servers. The reason is simple: The ability to perform "live migration" is meaningless when the server crashes, because there is nothing to move from or to; everything is down.

Even if the virtual environment includes backups for primary physical servers, failures can create substantial downtime and erode performance. In these high-availability environments, applications are unavailable while a primary server fails over to and restarts on a secondary machine, and all in-flight data is lost. Applications that run on secondary machines are also more likely to experience extra latency, because the secondary machines are probably already running their own primary applications, as well as the applications inherited from the failed server.

Transient errors are an additional availability and performance consideration in virtual environments. High-availability features that are built into some virtual environments often have the unintended effect of duplicating the error that caused a primary server to fail in the first place. The transient error migrates to the secondary server along with the processing load, where it can crash the secondary server in the same way that it crashed the primary server.

Even now, relatively early in the development of virtualization, companies are consolidating as many as 15 to 20 application servers on a single physical machine that is running multiple virtual machines. In addition to reducing the server footprint of the data center, this consolidation enables IT to provision extra processing capacity more easily to meet peak requirements. Credit-card companies, for example, can be expected to process the majority of their transactions in November and December, because of holiday/seasonal buying. Instead of building for that peak and leaving unused processing power on the table for most of the year, a virtualized infrastructure enables IT to provision a single pool of resources as fluidly as demand ebbs and flows. Virtualized environments can also take advantage of "availability on demand" by moving applications from standard servers to fault-tolerant virtualized servers via live migration. IT managers can do that when applications are needed during certain times of the week, month, or year.
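To make the capacity argument concrete, here is a small sketch comparing dedicated peak sizing with a shared, virtualized pool. The workload numbers are hypothetical, chosen only to illustrate the effect, not drawn from the article:

    # Hypothetical illustration of why pooling capacity beats per-application
    # peak sizing. Assume 15 applications, each averaging 1 CPU core of load
    # but peaking at 4 cores, and assume the peaks do not all occur at once.

    apps = 15
    avg_cores_per_app = 1
    peak_cores_per_app = 4
    concurrent_peaks = 4  # assumption: at most 4 applications peak simultaneously

    dedicated = apps * peak_cores_per_app  # every server sized for its own peak
    pooled = ((apps - concurrent_peaks) * avg_cores_per_app
              + concurrent_peaks * peak_cores_per_app)  # shared pool sized for the realistic worst case

    print(f"Dedicated peak sizing: {dedicated} cores")   # 60 cores
    print(f"Shared virtualized pool: {pooled} cores")    # 27 cores

Under these assumptions, the shared pool needs less than half the capacity of the one-server-per-application approach, which is exactly the unused processing power that consolidation reclaims.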

In the long run, the virtualized model is a better server architecture for supporting the application infrastructure than today's one-server-one-application model. However, it also increases the number of applications that are affected by each crash. Even if the individual applications are not critical on their own, losing them all at once to a server failure can be very disruptive and costly to an organization. A virtual architecture that includes fault-tolerant hardware avoids most of these pitfalls. Fault-tolerant servers are architected to maintain uninterrupted processing, even if one or more components fail. That eliminates failover time and, with it, the downtime and performance degradation that occur during failover.

Beyond the application-consolidation factor, there are other reasons for application managers to have a solid working knowledge of virtualization. In most cases, application managers cannot count on simply virtualizing their existing architectures, as though conventional- and virtual-server environments function in the same way. Virtualization is not an automatic fit for some applications, especially those that rely on real-time processing. An application that relies on "in-flight" data—data that has not been written to disk—cannot simply be ported over to a virtual environment and expected to deliver the service level that it did when it ran on a highly reliable conventional-server infrastructure. This is not a failure of virtualized architectures, which were not designed to provide ultra-high reliability. The problem is, rather, over-reliance on virtualized environments that have been built from off-the-shelf hardware without built-in reliability features.

 

Understanding the Limits

At the architectural level, a well-designed virtual-server environment is inherently more resilient than a conventional-server environment. That is primarily because IT managers can create multiple instances of applications on different virtual machines and across multiple physical machines, to ensure that they always have a backup at hand should the primary fail. This is one of the most valuable properties of virtualization; a similar scheme in a conventionally architected data center would be prohibitively expensive and unwieldy to manage.

The resilience of virtualization can be deceptive, however. When a virtual machine has to be restarted, application state, as well as any in-flight data, is lost. It can also take anywhere from 3 to 30 minutes to restart a virtual machine, which is more downtime than many critical applications can tolerate.
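A quick calculation shows how restart time alone erodes availability, before any in-flight data loss is even considered. The failure count below is an assumption made for the sake of the example, not a measured figure:

    # Illustrative only: how virtual-machine restart time translates into
    # annual availability, assuming 4 unplanned restarts per year.

    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    restarts_per_year = 4
    for restart_minutes in (3, 30):
        downtime = restarts_per_year * restart_minutes
        availability = 100.0 * (1 - downtime / MINUTES_PER_YEAR)
        print(f"{restart_minutes}-minute restarts: {downtime} min/year down "
              f"-> {availability:.4f}% available")

    # 3-minute restarts:  12 min/year down -> 99.9977% available
    # 30-minute restarts: 120 min/year down -> 99.9772% available

Even the optimistic case falls short of the 99.999 percent (roughly five minutes of downtime per year) that many critical applications require, and it says nothing about the work lost in flight at the moment of each failure.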

"The more workloads that you stack on a single machine, when something does go wrong, the more pain an organization is going to feel because all those applications have dropped on the floor," said industry analyst Dan Kuznetsky, vice president of research operations at The 451 Group. "Even when people are thinking about availability and reliability, they do not often think about how long it takes for a function to move from a failed environment to one that is working. Next, when they choose an availability regimen—a set of tools and processes—they often do not think about how long it will take for that resource to become available again, once the failure scenario has  started."

In-flight data loss is not the only drawback to relying on virtual machines as an application-availability solution. An error that causes downtime on a primary virtual machine can propagate to secondary virtual machines—crashing them as soon as they launch. Today's virtualization-software solutions do not isolate the cause of a failure or identify its root cause, whether in software or hardware, which increases the chances that failures will be replicated between virtual machines. Virtualization software also is not designed to deal with transient (temporary) hardware errors such as device-driver malfunctions, which can cause downtime and data corruption when they are left uncorrected.

The point of all of this is not that virtualized architectures are inherently flawed. They are not. But neither the architectures nor the software that spawns them is designed to deliver the ultra-high reliability that many critical applications require. The hardware infrastructure has been largely overlooked in the equation. Without proper attention, it can cause trouble all the way up the stack.

 

Building Blocks and Foundations

The second factor that application managers must understand is the importance of the actual hardware. Hardware is such a commodity in conventional data-center architectures that application managers stopped thinking about it long ago. Servers were like the furnace: out there somewhere, doing their job reliably without needing a lot of hand-holding. When server prices dropped enough for each application to have its own server, the primacy of the applications was ensured.

Virtualization reverses that dynamic, back to a variation on the mainframe model. With so many processing loads relying on fewer hardware devices, the quality and durability of each device are pivotal to maintaining application uptime. Commodity servers often cannot support peak processing loads from multiple virtual machines without slowing down or experiencing a failure.

Blade servers are often cited as the cure for this problem, but they are not. Blade servers assemble server resources in a more physically efficient form factor, but at their base they are still commodity servers. They share a common chassis, but they are essentially separate devices that have the same failover issues as stand-alone PC servers. They are not architected to communicate with each other automatically or to provide higher-than-standard reliability. "Standard reliability" is about 99.9 percent at best or, more commonly, 99.5 or 99 percent. Even 99.9 percent amounts to more than eight hours of unscheduled downtime per year. Most critical applications require at least 99.99 percent uptime, and many require 99.999 percent, which is approximately five minutes of unscheduled downtime per year.
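The downtime figures quoted above follow directly from the arithmetic; a short sketch makes the conversion from "nines" to hours and minutes explicit:

    # Convert availability percentages ("nines") into unscheduled downtime
    # per year. This reproduces the figures cited above: 99.9 percent is
    # more than eight hours per year, and 99.999 percent is about five minutes.

    HOURS_PER_YEAR = 365 * 24  # 8,760 hours

    for pct in (99.0, 99.5, 99.9, 99.99, 99.999):
        downtime_hours = HOURS_PER_YEAR * (1 - pct / 100.0)
        if downtime_hours >= 1:
            print(f"{pct}%: {downtime_hours:.1f} hours of downtime per year")
        else:
            print(f"{pct}%: {downtime_hours * 60:.1f} minutes of downtime per year")

    # 99.0%:   87.6 hours per year
    # 99.5%:   43.8 hours per year
    # 99.9%:    8.8 hours per year
    # 99.99%:  52.6 minutes per year
    # 99.999%:  5.3 minutes per year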

A virtualized server environment that supports critical applications with 99.999 percent application availability must include more than commodity servers. Even in a virtual environment, commodity servers cannot provide adequate uptime for critical applications.

Application managers need to know that their critical applications are implemented on a virtual architecture that includes fault-tolerant servers for maximum uptime.

Fault-tolerant servers are engineered "from the ground up" to have no single point of failure; they are essentially two servers in a single form factor, running a single instance of an operating system, with the application(s) licensed only once. A truly fault-tolerant architecture employs "lockstepping" to guard against unscheduled downtime. Unlike virtual machines and other availability solutions—server clusters, hot standbys, and storage-area networks, for example—a fault-tolerant server incurs no "failover" time if one of its processors crashes. A true fault-tolerant architecture eliminates failover; it does not "recover" from failure. The unit simply continues to process with no lag or loss in performance, while it automatically "calls home" to alert the network-operations center.
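The following toy sketch illustrates only the general idea behind lockstepped redundancy: run the same work on two replicas, compare the results, and keep running on the survivor if one replica fails. Real fault-tolerant servers implement this in hardware, at the instruction and memory level, which this fragment does not attempt to model; the function and data names are invented for the example.

    # Conceptual illustration only -- NOT how fault-tolerant hardware works
    # internally. Two "replicas" execute the same deterministic work; results
    # are compared, and if one replica fails, processing continues on the
    # survivor while the failure is reported ("calling home") for service.

    def process(order):
        # Stand-in for deterministic application work.
        return sum(item["qty"] * item["price"] for item in order["items"])

    def lockstep_process(order, replicas):
        results, failed = [], []
        for name, fn in replicas:
            try:
                results.append((name, fn(order)))
            except Exception as exc:      # a replica "component" has failed
                failed.append((name, exc))
        if failed:
            print(f"calling home: replica failure reported: {failed}")
        if not results:
            raise RuntimeError("all replicas failed")
        values = {value for _, value in results}
        if len(values) > 1:               # divergence check between replicas
            raise RuntimeError(f"replicas disagree: {results}")
        return values.pop()               # no failover pause: the survivor's result is used

    order = {"items": [{"qty": 2, "price": 9.5}, {"qty": 1, "price": 30.0}]}
    total = lockstep_process(order, [("replica-A", process), ("replica-B", process)])
    print(f"order total: {total}")        # 49.0, even if a single replica fails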

Fault-tolerant hardware has proven itself over three decades of use in industries such as financial services, in which it provides continuous operation of crucial functions such as stock trading and transaction processing. And what was once large, expensive, and proprietary is now based on industry-standard components—fitting in industry-standard racks, and running industry-standard operating systems at industry-standard prices. Fault-tolerant hardware fits seamlessly into virtual environments as simply another server, albeit with ultra-high reliability built in.

There are software products that attempt to emulate fault-tolerant functionality; so far, however, software has been unable to emulate the performance of hardware-based fault tolerance. Software-based solutions have more resource overhead. They cannot stave off the effects of transient errors or trace the root causes of errors. Trying to achieve full fault tolerance through software also means sacrificing performance; applications that run software fault-tolerance products are restricted to using only a single core of a multicore CPU in a multisocket server.

 

Know Thyself

The inevitable march of virtualization through the data center does not mean that application managers have to become infrastructure experts. However, they do need to understand the availability requirements of their application mix. Trusting the management tools that come with new virtualization environments without weighing the uptime needs of each application is asking for lost revenue and disgruntled users. Virtualization-management tools enable application managers to shift processing loads to new virtual machines. However, if the underlying hardware infrastructure fails, that critical advantage of virtualization is lost. You cannot manage virtual machines on a crashed server.

Many applications can endure as much as 30 minutes of downtime without a serious impact on the operations of the company. They are well suited to operate in a virtual environment without integrated fault tolerance. Inventory tracking, for example, can just catch up whenever the system comes back online.
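One simple way to capture this exercise is to record each application's tolerable downtime and whether it depends on in-flight data, and let those two facts drive the placement decision. The application names and thresholds below are hypothetical, used only to illustrate the decision:

    # Hypothetical sketch of the placement decision described above: record how
    # much downtime each application can tolerate and whether it depends on
    # in-flight data, then decide which ones need fault-tolerant hardware.

    applications = {
        # name: (tolerable downtime in minutes, relies on in-flight data?)
        "inventory-tracking": (30, False),
        "internal-reporting": (60, False),
        "order-processing":   (0,  True),
        "credit-card-auth":   (0,  True),
    }

    def placement(tolerable_minutes, in_flight_data):
        # Applications that cannot lose in-flight data or ride out a VM restart
        # (commonly 3-30 minutes, per the discussion above) need fault-tolerant
        # servers; the rest can run on standard virtualized hosts.
        if in_flight_data or tolerable_minutes < 30:
            return "fault-tolerant virtualized server"
        return "standard virtualized server"

    for name, (minutes, in_flight) in applications.items():
        print(f"{name}: {placement(minutes, in_flight)}")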

If the application relies on in-flight data, however, the virtual architecture almost certainly requires fault-tolerant hardware. No virtualization solution on the market today can deliver the 99.999 percent uptime that many critical applications require, and many cannot even guarantee 99.9 percent, which allows for hours of unscheduled downtime.

In its research report "Beating the Downtime Trend," HP documented a 56 percent increase in unplanned downtime since 2005. This is occurring at exactly the same time that businesses are increasingly depending on IT systems as direct revenue streams and customer-service channels. With the tide already flowing against them, application managers have to know what they are facing when their companies start to virtualize the data center. While it is true that the money is made at the application layer, it can be lost just as quickly if the bottom layer of the stack—the hardware infrastructure—is shaky.

 

Conclusion

An IT manager who is educated in virtualization and its nuances is one who fully understands the importance of infrastructure planning, which includes application availability and hardware uptime.

 

About the Author

Phil Riccio (phil.riccio@stratus.com) is a product manager at Maynard (MA)–based Stratus Technologies and is responsible for the company's overall virtualization and Linux marketing strategies. He has been in the high-tech industry since 1981. Prior to joining Stratus, he was director of business development at Mainline Information Systems—IBM's largest reseller—where his focus was on new and competitive opportunities for the IBM AIX and x86 platforms. Before Mainline, Riccio was North American marketing manager for enterprise-server products at Hewlett-Packard.