Testing for Availability
Testing for availability means running an application for a planned period of time, collecting failure events and repair times, and comparing the availability percentage to the original service level agreement.
Where reliability testing is about finding defects and reducing the number of failures, availability testing is primarily concerned with measuring and minimizing the actual repair time. That may seem odd at first, but take another look at the formula for calculating percentage availability: (MTBF / (MTBF + MTTR)) × 100, where MTBF is the mean time between failures and MTTR is the mean time to repair. Notice that as MTTR trends towards zero, the percentage availability trends towards 100%. This idea becomes the essential focus of availability testing: reduce and eliminate downtime.
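To make the effect of repair time concrete, the formula can be sketched in a few lines of Python. The function name and the sample figures (an MTBF of 1,000 hours and several MTTR values) are illustrative assumptions, not measurements from any real system.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Percentage availability: (MTBF / (MTBF + MTTR)) * 100."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100.0


if __name__ == "__main__":
    # Hypothetical figures: hold MTBF fixed and shrink the repair time.
    for mttr in (8.0, 1.0, 0.1, 0.0):
        pct = availability_percent(1000.0, mttr)
        print(f"MTBF 1000 h, MTTR {mttr} h -> {pct:.3f}% available")
```

With MTBF held constant, each reduction in MTTR moves the result closer to 100%, which is why the repair process itself is a test target.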
This idea of measuring repair time modifies the usual test emphasis in two ways. First, availability testing must target the entire range of events and procedural scenarios possible in the lifecycle of an application to see if all of the automated support processes and people-based procedures are really production ready. Second, when the application fails, the test clock is still running, and a knowledgeable recovery team had better be onsite fixing the problem.
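Because the test clock keeps running during every repair, the results of a test run can be summarized from a simple log of failure and restoration times. The following is a minimal sketch under stated assumptions: the Outage record, the summarize function, the 720-hour test window, and the 99.9% target are all hypothetical and stand in for whatever your own test log and service level agreement specify.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    failed_at: float    # hours since the start of the test run
    restored_at: float  # hours since the start of the test run


def summarize(test_hours: float, outages: list, sla_percent: float) -> dict:
    """Derive MTBF, MTTR, and availability from a log of outages."""
    if test_hours <= 0:
        raise ValueError("test_hours must be positive")
    downtime = sum(o.restored_at - o.failed_at for o in outages)
    uptime = test_hours - downtime
    failures = len(outages)
    # With no failures, MTBF is at least the whole test run and MTTR is zero.
    mtbf = uptime / failures if failures else uptime
    mttr = downtime / failures if failures else 0.0
    availability = uptime / test_hours * 100.0
    return {
        "mtbf_hours": mtbf,
        "mttr_hours": mttr,
        "availability_percent": availability,
        "meets_sla": availability >= sla_percent,
    }


if __name__ == "__main__":
    # Hypothetical 30-day run with two outages (0.5 h and 1.5 h of repair).
    log = [Outage(100.0, 100.5), Outage(400.0, 401.5)]
    print(summarize(720.0, log, sla_percent=99.9))
```

In this made-up run, two hours of total repair time is enough to miss a 99.9% target, even though the application failed only twice.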
The closer the testing is to real-world situations, the better the test confidence. Some organizations are reluctant to allocate fully configured server machines and isolated network environments to a long battery of availability testing. Just remember that a software defect found after deployment is generally far more expensive to fix than one found before deployment; a factor of ten is a commonly quoted rule of thumb.
This section contains a selection of testing concepts and recommendations that are especially relevant to creating available applications. Some other ideas for testing application reliability are at Testing for Reliability. For more complete information on the general subject of testing, search for "testing" in the MSDN Online Library (http://search.microsoft.com/us/dev/default.asp).
Test the Change Control Process
Applications are always evolving to fit new business requirements and improve behavior. Even mission-critical applications change over time. Because the change control process is a large source of downtime-causing errors, you had better test and validate all of the change control procedures. A business- or mission-critical application must not go into production until you can repeatedly perform error-free change control.
Consider this: if you haven't tested the full deployment and configuration process, how do you know if it will work at midnight when the application goes into production? Also, don't overlook the sometimes difficult problem of eventually removing your application without impairing the operation of other applications.
Test Catastrophic Failure
Before deploying your new application, make sure that the catastrophic recovery procedures you have created work as expected. Is the recovery team ready? It must be trained, equipped, and well rehearsed. How do you know it is ready unless you test it? It is one thing to perform data backups and write a disaster recovery plan, but if you cannot actually do disaster recovery, the plan is useless.
What you need to do is create outages of a catastrophic nature and test the recovery process. For example, pull the building's power plug at midnight and see how long it takes to recover. Catastrophic testing not only validates the correctness of the recovery procedures (and proves that the backup strategy works), but also provides a measure of confidence that the recovery response team is well prepared.
Test the Failover Technologies
Before deploying your new application, make sure that the failover technologies you have implemented work as expected. This test should include both servers and RAID disks. For example, pick a favorite piece of hardware, say a disk controller or drive, and loosen the card or pull the connection. Then watch and measure the recovery team as they locate and replace the failed hardware.
As another test idea, let the application run for a few hours, put a few dozen users on the system making client requests, and then pull the power plug on a front-end server and a back-end data store. Watch as the clustering failover technology restarts the failed application service on another server. The application should not only stay online, but every user process should be completed correctly. Again, observe and measure the recovery team as it identifies and replaces the server hardware.
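One way to measure what users actually experience during such a failover test is to probe the application at a fixed interval and reduce the results to outage windows afterward. The sketch below assumes a hypothetical list of (timestamp in seconds, success flag) samples collected by a probe of your own; the function name and sample data are illustrative only.

```python
def outage_windows(samples):
    """Return (start, end) pairs for each run of failed probes.

    samples: list of (timestamp_seconds, ok) tuples sorted by time.
    start is the time of the first failed probe in a run; end is the time
    of the first successful probe after it, or None if the application
    was still down when sampling stopped.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp                  # outage begins
        elif ok and start is not None:
            windows.append((start, timestamp)) # service restored
            start = None
    if start is not None:
        windows.append((start, None))          # never recovered during the test
    return windows


if __name__ == "__main__":
    # Hypothetical probe results taken every 5 seconds around a failover.
    probes = [(0, True), (5, True), (10, False), (15, False), (20, True), (25, True)]
    for begin, end in outage_windows(probes):
        print(f"outage from {begin}s to {end}s")
```

The measured window is only as precise as the probe interval, so choose an interval well below the failover time you expect to see.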
Test the Monitoring Technology
Since you are staging all these tests, you should analyze the Windows Management Instrumentation (WMI) data using the intended monitoring reports and make sure that you can plainly see resource consumption data and, especially, all of the test outages. If you have implemented a management console (perhaps using Application Center 2000), make sure you are getting the necessary failure, availability, and trend analysis data.
Test the Help Desk Procedures
For critical applications, the help desk must be fully trained and ready to handle customer inquiries and failure scenarios. How soon can the help desk identify a problem? Do help desk representatives clearly understand how to resolve the crisis? Test the escalation process by staging a serious failure while several people on the call list aren't home. Remember, the clock is running on all repair time scenarios.
Test for Resource Conflicts
Availability engineering requires in-depth consideration of an application's interactions with other system processes. You must look at how a particular service is provided, evaluate all the ways some other application might interfere with the intended service, test for conflicts, and possibly consider design alternatives.
Applications often run slowly because of competition for resources such as CPUs, memory, disk I/O, and network bandwidth. The slowdown occurs when an application service must wait for hardware to become available in order to complete a task, or when several background events occur simultaneously. Slowness affects perceived availability: a slow application is technically "available," but who wants to wait for it? It might as well have failed.
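The gap between technical and perceived availability can be illustrated by counting a request as successful only if it completes within an acceptable time. This is a minimal sketch: the function name, the two-second threshold, and the sample response times are assumptions chosen for illustration, and a failed request is represented as None.

```python
def perceived_availability(response_times, max_acceptable_seconds):
    """Percentage of requests answered within the acceptable time.

    response_times: list of response times in seconds; None marks a
    request that failed outright.
    """
    if not response_times:
        raise ValueError("at least one request is required")
    good = sum(
        1 for t in response_times
        if t is not None and t <= max_acceptable_seconds
    )
    return good / len(response_times) * 100.0


if __name__ == "__main__":
    # Hypothetical sample: one outright failure and one very slow response.
    times = [0.2, 0.3, 5.0, None, 0.4]
    answered = sum(1 for t in times if t is not None) / len(times) * 100.0
    print(f"technically available: {answered:.1f}%")
    print(f"perceived available:   {perceived_availability(times, 2.0):.1f}%")
```

Running the same calculation on measurements taken under competing workloads shows how much availability is lost to contention rather than to outright failure.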
Testing for resource conflicts should be conducted in a full, production-like target environment where transient workloads cause multiple applications to compete for resource allocation.