Designing Distributed Applications with Visual Studio .NET
Best Practices for Availability

The following best practices are recommended for creating high-availability applications.

Use Clustering

Despite the best software engineering, all applications fail eventually. Delivering application services in spite of failures is what availability engineering is all about. Clustering is a key technology for high availability because it provides instant failover of application services in the event of a failure.

Use Network Load Balancing

Network Load Balancing (NLB) enhances the availability of critical Web-based applications by detecting server failures and automatically redistributing traffic to the servers that are still running. With NLB you get two design benefits: high availability with minimal operational support, and incremental scalability because capacity is easy to add.
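
The redistribution behavior described above can be illustrated with a minimal sketch. It is not the algorithm that NLB itself uses; it is a simple round-robin dispatcher, and the class name `RoundRobinBalancer` and the server names are hypothetical. It only shows that traffic continues to flow to the remaining servers after one is marked as failed.

```python
from itertools import count


class RoundRobinBalancer:
    """Illustrative dispatcher; a stand-in for, not a model of, real NLB."""

    def __init__(self, servers):
        self.servers = list(servers)       # fixed ordering of known servers
        self.healthy = set(self.servers)   # servers currently accepting traffic
        self._counter = count()            # monotonically increasing request counter

    def mark_failed(self, server):
        self.healthy.discard(server)

    def mark_recovered(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the server that should handle the next request."""
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        return candidates[next(self._counter) % len(candidates)]


balancer = RoundRobinBalancer(["web1", "web2", "web3"])  # hypothetical hosts
before = [balancer.route() for _ in range(3)]
balancer.mark_failed("web2")
after = [balancer.route() for _ in range(4)]
print(before, after)
```

After `web2` is marked as failed, every subsequent request is routed to `web1` or `web3` without any change to the callers.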

Use Service Level Agreements

After an application is funded and the design specification is in progress, it is important to create a written agreement that defines the expected level of service. Such a service level agreement would include specifics such as "The application should run 24 x 7 with a yearly availability of 99.9%."
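
To make an availability percentage concrete, it helps to convert it into a downtime budget. The following sketch does that arithmetic. The function name is hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Under these assumptions, the 99.9% target in the example agreement allows about 8.76 hours of total downtime per year.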

Like reliability requirements, availability requirements also contain uncertainty. It is often difficult to balance business needs against the budget and still satisfy the customer. Over time, the business requirements and application usage will probably change, affecting the original availability assumptions.

Provide Vigilant Monitoring

Some companies do not keep application failure or resource consumption data. While there might be a few statistics on network outages or operating system failures, there is typically a lack of adequate information for isolating problems and analyzing trends. Continuous monitoring of operational workload and failure data is crucial to discovering trends and improving service.

For your application, the most important data comes from instrumenting it with Windows Management Instrumentation (WMI). The Microsoft platform also provides many additional tools and utilities to monitor the condition of the operating system, the hardware resources, supporting services, and the application. This includes the use of the Windows System Event Log to determine both planned and total availability. The same event log can help determine the mean time between failures (MTBF) for your application. For more information, see Monitoring Reliability and Availability of Windows 2000-based Server Systems (http://www.microsoft.com/WINDOWS2000/techinfo/administration/cluster/monitorrel.asp).
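
As a minimal sketch of how MTBF and availability might be derived from logged failure data, the following example assumes that each outage has already been extracted as a pair of timestamps (failure time, restore time). The function name and the sample outages are hypothetical, and reading the event log itself is out of scope here. The sketch uses the common definitions MTBF = uptime / failures, MTTR = downtime / failures, and availability = MTBF / (MTBF + MTTR).

```python
from datetime import datetime, timedelta


def availability_metrics(period_start, period_end, outages):
    """Return (mtbf, mttr, availability) for an observation period.

    outages is a list of (failure_time, restore_time) pairs that fall
    inside the period and do not overlap.
    """
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    failures = len(outages)
    if failures == 0:
        return None, None, 1.0  # no failures observed in this period
    mtbf = uptime / failures    # mean time between failures
    mttr = downtime / failures  # mean time to repair
    return mtbf, mttr, mtbf / (mtbf + mttr)


# Hypothetical 30-day period with a 2-hour and a 4-hour outage.
start = datetime(2002, 1, 1)
end = start + timedelta(days=30)
sample_outages = [
    (datetime(2002, 1, 5, 8, 0), datetime(2002, 1, 5, 10, 0)),
    (datetime(2002, 1, 20, 22, 0), datetime(2002, 1, 21, 2, 0)),
]
mtbf, mttr, availability = availability_metrics(start, end, sample_outages)
print(mtbf, mttr, f"{availability:.4%}")
```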

You should install monitoring tools and establish availability metrics for all business- and mission-critical systems. This will help you identify potential problems before they escalate and become failures. Typical resources that should be monitored include physical and logical disk, memory, CPU utilization, network traffic, server work queues, security errors, application service requests, transaction throughput, and data access requests.

You must continually investigate the source of problems and recommend software design and process changes to prevent them. Many companies make the mistake of deploying a new application expecting operations to repair and restart it when it breaks. This is a mistake because a typical operations fix is usually a shortcut to get the application back online. Operations rarely has either the skill or the budgetary support to solve the underlying problem.

Instead, assign some of your best software engineers to the full-time task of application failure analysis. Given the complexity of distributed applications, it is important to have skilled people engaged in identifying the root cause of an outage. Let them analyze both the application's technical infrastructure and your company's software lifecycle processes with a focus on recommending a continuous stream of procedural improvements and reengineering options.

Establish a Help Desk to Reduce Downtime

The help desk is responsible for gathering problem information, determining problem history, and starting immediate problem resolutions. In normal times, the help desk provides applications support to users within an organization. The help desk should also be centrally involved in every failure recovery scenario, handling first level support for application failure issues.

Every help desk should have a problem escalation plan defining who to call for various types of problems. A clearly defined escalation process can avoid confusion during a crisis. The problem escalation plan should include:

  • A call hierarchy in case the problem gets worse.
  • A designated contact for each application and type of operational failure.
  • Pagers and cell phones for important contacts.
  • A documented history of problems with descriptive details of the solution.

Test the Recovery Plan

Test the recovery plan several times. A well-documented plan is only paper until it is proven under rigorous test conditions. Practice recovery procedures using scheduled and "surprise" interruptions. Pull the plug, cut the power, disconnect a network cable. Did the application still run? Did the recovery team act swiftly to get the application running again? Are they well-practiced at fast install, configure, and start processes? If not, go back to the drawing board.

A recovery plan should schedule real outage recovery tests on a regular basis and specifically define how to meet the following requirements:

  • Detailed instructions for all catastrophic recovery procedures.
  • Checklist of all the operations needed for each front- and back-end server to become operational.
  • Mandatory on-site spare parts inventory.
  • Hard-copy recovery manuals. Remember: if the environment is down, you will not have access to electronic text.

Choose Good Infrastructure

Hardware

Standardized hardware is an essential requirement for achieving high availability. Standard hardware has the following advantages:

  • Simplified testing for hardware certification.
  • Replacement parts always match the production parts.
  • Reduction of parts inventory.
  • Reduced training requirements for support personnel.
  • Cloning tools, disk imaging, and setup processes are identical for all hardware targets.

You should only use hardware that is listed in the Windows 2000 Hardware Compatibility List. For more information, see Windows Hardware Compatibility List (http://www.microsoft.com/hcl/default.asp).

Software

Buy and install only software that is "Certified for Windows." The Windows 2000 certification program provides clear, concise guidelines to create and validate applications that must deliver high levels of availability. Insist on certification for all software that provides business- or mission-critical support.

Memory

To enhance memory error detection and correction, use only error-correcting code (ECC) memory.

Physical Data Storage

Use at least two servers running Windows 2000 Advanced Server and the Microsoft Windows Cluster service, with a Fibre Channel connection to a shared RAID array. With this approach, the failure of any one server is covered by the other server, and the RAID array maintains data availability in the event of a disk failure. Use the NTFS file system because it provides high security and data integrity. Use multiple disk controllers. Disk controllers don't fail often, but if you have only one controller and it fails, the drives attached to it become unavailable.

For a summary of the various RAID options and how to use them, see Planning Fault Tolerance and Avoidance (http://www.microsoft.com/TechNet/win2000/planning.asp).

Database

Use SQL Server 2000. With its high-speed, data-management, and analysis capabilities, SQL Server 2000 is the backbone for .NET applications and services.

If your company has established a data warehouse for decision support applications, you need a good way to replicate data from the operational database to the decision support database. It is recommended that you use the replication services provided by SQL Server 2000. This is not only more efficient than updating both databases synchronously, but it also improves availability in an interesting way. Because the replication process can queue the changes, your application can run without failure when the decision database is down.
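
The availability benefit of queued replication can be shown with a small model. This sketch does not use or imitate the SQL Server 2000 replication services; the class `QueuedReplicator` and its in-memory dictionaries are hypothetical stand-ins for the operational database, the decision support database, and the change queue.

```python
from collections import deque


class QueuedReplicator:
    """Toy model: writes succeed even while the decision support side is down."""

    def __init__(self):
        self.operational = {}        # stand-in for the operational database
        self.decision_support = {}   # stand-in for the decision support database
        self.pending = deque()       # queued changes awaiting replication
        self.decision_db_online = True

    def write(self, key, value):
        """Application write: touches only the operational store and the queue."""
        self.operational[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply queued changes in order; return how many were applied."""
        applied = 0
        while self.pending and self.decision_db_online:
            key, value = self.pending.popleft()
            self.decision_support[key] = value
            applied += 1
        return applied


repl = QueuedReplicator()
repl.decision_db_online = False   # decision support database is down
repl.write("order-1", "shipped")  # the application keeps running
repl.replicate()                  # nothing is applied yet
repl.decision_db_online = True    # database comes back
repl.replicate()                  # queued change is delivered
print(repl.decision_support)
```

A synchronous design would have to fail or block the write to `order-1`; here the change simply waits in the queue until the target is reachable again.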

Networking

Every server should have two network interface cards to provide isolated public and private connectivity. Use TCP/IP protocol throughout the installation. Only the front-end servers should have addresses that are publicly accessible. This architectural approach prevents direct public access to back-end servers containing data.

Implement multiple network paths to business- and mission-critical servers. Network components fail infrequently, but with more than one network path, the failure of a single component does not cut the server off from its clients.
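
The benefit of redundant paths can be estimated with a short calculation. The sketch below assumes that the paths fail independently of one another, which is an idealization: paths that share a switch, a power feed, or a cable duct do not meet it. The function name and the 99% figure are hypothetical.

```python
def combined_availability(path_availability: float, path_count: int) -> float:
    """Probability that at least one of several independent paths is up."""
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if path_count < 1:
        raise ValueError("at least one path is required")
    # The server is unreachable only if every path is down at the same time.
    return 1.0 - (1.0 - path_availability) ** path_count


for paths in (1, 2, 3):
    print(paths, f"{combined_availability(0.99, paths):.6f}")
```

With a single path that is available 99% of the time, adding a second independent path raises the combined figure to 99.99%. The improvement is large, but it is never a guarantee.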

Environment

Protection of the data center environment concerns things like physical security, fire prevention, backup power, and providing alternate offsite facilities. Not all servers that are part of a distributed application run in a locally controlled, clean environment. Temperature, humidity, and dust can all stop a server. If the server is in an office setting, it probably is all right. But if your application relies on an application service that is external to your own facilities, you might consider a design that can deal with a failed remote Web server.

Front-End Servers

Use Windows 2000 Advanced Server. Use Microsoft Cluster Service to provide scalable, highly available, failover capability. Use Network Load Balancing (NLB) to smooth the workload and automatically redistribute incoming traffic in the event of a server failure. All the front-end Web servers should deliver identical services and share the workload. As the traffic on the site increases, additional front-end servers can be easily added with no downtime. Use dedicated network cards so that the front-end servers can send periodic messages, called heartbeats, to each other to detect failed applications or servers.
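
Heartbeat-based failure detection can be sketched as follows. This is an illustration of the general technique, not the protocol used by the Cluster service or NLB. The class `HeartbeatMonitor`, the five-second timeout, and the server names are hypothetical. A server is treated as failed when no heartbeat has been received from it within the timeout.

```python
import time


class HeartbeatMonitor:
    """Tracks the most recent heartbeat received from each server."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock       # injectable so the logic can be tested
        self.last_seen = {}      # server name -> time of last heartbeat

    def record_heartbeat(self, server):
        self.last_seen[server] = self.clock()

    def failed_servers(self):
        """Return servers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Simulated clock so the example runs instantly.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.record_heartbeat("web1")
monitor.record_heartbeat("web2")
fake_now[0] = 4.0
monitor.record_heartbeat("web2")   # web1 has gone silent
fake_now[0] = 7.0
print(monitor.failed_servers())
```

The timeout is a tradeoff: a short value detects failures quickly but can misreport a busy server as failed, which is one reason the heartbeat traffic is kept on dedicated network cards.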

Back-End Servers

Use Windows 2000 Advanced Server. Use Microsoft Cluster Service to provide scalable, highly available, failover capability. Use Network Load Balancing (NLB) to smooth the workload and automatically redistribute incoming traffic in the event of a server failure. The back-end cluster provides two data services: SQL Server databases and miscellaneous shared file storage. All the back-end servers should deliver identical services and share the workload. As the traffic on the site increases, additional back-end servers can be easily added with no downtime. Use dedicated network cards so that the back-end servers can send periodic messages, called heartbeats, to each other to detect failed applications or servers.

Use Windows Server System

The Microsoft .NET distributed services platform makes it easy to build sophisticated Web applications as a collection of services. The facilities provided by the Microsoft .NET Platform are already tested to provide high reliability, availability, and scalability. As a starting point for your application, there are a number of existing .NET services that you might find useful in your application. These include:

  • Microsoft Application Center 2000
  • Microsoft SQL Server 2000
  • Microsoft BizTalk Server 2000
  • Microsoft Exchange Server 2000
  • Microsoft Commerce Server 2000
  • Microsoft Content Management Server 2001
  • Microsoft Internet Security and Acceleration Server 2000
  • Microsoft Windows 2000 Server
  • Microsoft Windows 2000 Advanced Server
  • Microsoft Windows 2000 Datacenter Server
  • Microsoft Host Integration Server 2000
  • Microsoft Mobile Information Server 2001
  • Microsoft SharePoint Portal Server 2001

For more information, see Windows Server System.

Synchronize All Clocks

Operating systems and applications running on separate servers must be time-synchronized. If you violate this simple premise, process and file-creation time stamps will be inconsistent and confusing to both automated and manual reconstruction processes. Be sure to maintain synchronized date and time information on every server.
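
One way to verify synchronization is to compare the time each server reports against a reference clock and flag any server that drifts beyond a tolerance. The sketch below assumes the reported times have already been collected. The function name, the one-second tolerance, and the server names are hypothetical.

```python
from datetime import datetime, timedelta


def servers_out_of_sync(reference, reported, tolerance=timedelta(seconds=1)):
    """Return the names of servers whose clock differs from the reference
    by more than the tolerance, in either direction."""
    return sorted(
        name
        for name, timestamp in reported.items()
        if abs(timestamp - reference) > tolerance
    )


reference_time = datetime(2002, 6, 1, 12, 0, 0)
reported_times = {
    "web1": datetime(2002, 6, 1, 12, 0, 0),
    "web2": datetime(2002, 6, 1, 11, 59, 45),  # 15 seconds slow
    "sql1": datetime(2002, 6, 1, 12, 0, 1),    # exactly at the tolerance
}
print(servers_out_of_sync(reference_time, reported_times))
```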

Use Data Backups

Every data center must anticipate the possibility of severe data corruption or loss. The consequence of data errors ranges from problems with customer authentication to damaged financial accounts and business community credibility.

Maintaining data integrity can be as simple as performing full database backups. One strategy for maintaining data integrity is to create a full backup of the primary database and then incrementally test the source computer for data corruption.

You can improve on this approach by combining the backup with application transaction logs. You can then run the Database Consistency Checker (DBCC) to detect and repair corruption.
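
The idea of combining a full backup with a transaction log can be sketched as a replay: start from the backup and apply each logged change in order. This is a conceptual model only. It does not represent SQL Server backup formats or DBCC, and the function name and the `("set" | "delete", key, value)` log format are hypothetical.

```python
def restore(full_backup: dict, transaction_log: list) -> dict:
    """Rebuild current state from a full backup plus the changes logged since."""
    state = dict(full_backup)  # copy so the backup itself is never modified
    for operation, key, value in transaction_log:
        if operation == "set":
            state[key] = value
        elif operation == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown log operation: {operation!r}")
    return state


backup = {"acct-1": 100, "acct-2": 250}
log = [
    ("set", "acct-1", 80),      # change made after the backup
    ("set", "acct-3", 40),      # new record
    ("delete", "acct-2", None), # record removed
]
print(restore(backup, log))
```

Without the log, a restore returns the data only to the moment of the last full backup; with it, the changes made since then can also be recovered.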

Because modern RAID technology provides a lot of data confidence, implementing full database backups depends on your assessment of risk and the consequences of data corruption. At a minimum, providing full backups will aid recovery from catastrophic system failure.

Review All Security Plans

Security is about making sure that application services are available only to qualified users. Security also means protecting all of the distributed components and resources that your application uses. For critical Internet applications, the security implementation must account for safe communication between your application, anonymous users, validated users, and the Web sites of your business partners.

Security for a Web application has several primary obligations:

  • Protection of front-end servers from unauthorized access.
  • Protection of data privacy and integrity on back-end servers.
  • Protection against intrusion (such as a denial of service attack).

Data exchanges over the Internet pose an interesting problem because such exchanges are often disconnected; that is, the application service does not hold session information or wait for a reply. This means that in order to maintain security, the application must explicitly correlate any returning data or request with a validated user. You can use SSL to protect against tampering with data at the transport level, and you can use digital signatures to verify the identity of the sender.
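
Correlating a returning request with a validated user can be sketched with a keyed hash. Note the substitution: this example uses an HMAC with a shared secret rather than a true public-key digital signature, because HMAC is available in the Python standard library. The function names, the key, and the payload are hypothetical.

```python
import hashlib
import hmac


def sign_request(secret_key: bytes, user_id: str, payload: bytes) -> str:
    """Bind a payload to a user by computing a keyed hash over both."""
    message = user_id.encode("utf-8") + b"\x00" + payload
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def verify_request(secret_key: bytes, user_id: str, payload: bytes,
                   signature: str) -> bool:
    """Return True only if the signature matches this user and payload."""
    expected = sign_request(secret_key, user_id, payload)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)


key = b"example-shared-secret"  # hypothetical; never hard-code real keys
tag = sign_request(key, "alice", b"order=42")
print(verify_request(key, "alice", b"order=42", tag))    # valid request
print(verify_request(key, "mallory", b"order=42", tag))  # wrong user
print(verify_request(key, "alice", b"order=43", tag))    # altered payload
```

Because the service holds no session state, the tag travels with the request and is checked on return; a request that names a different user or carries altered data fails verification.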

For more information on security, see Securing a Dot-Com Installation (http://www.microsoft.com/technet/ecommerce/SecComIn.asp) or search for "Site Security Planning" in MSDN Library Online.

Advocate Training and Certification

Staff expertise is an important part of availability engineering. Only a knowledgeable staff can design, deploy, and maintain a high-availability application. Trained, experienced people make better design and construction decisions. In addition to building a better application, a side bonus is that trained people are likely to deploy your application on time and within budget.

Develop a training plan that identifies how availability engineering skills will be acquired, and make a serious investment in people resources. Ask your staff to seek Microsoft Certification in the various software engineering disciplines. Using Microsoft Certified Professionals sets the standard for software engineering quality. For more information, see Microsoft Education Certification (http://www.microsoft.com/education/training/cert/default.asp).

Pay Attention to the Budget

Depending on how the application is used, very high availability may cost more than it is worth. Many applications simply do not require special engineering and redundant hardware. The historical decrease in hardware costs may have created the misconception that hardware redundancy is affordable, but the fact is hardware still costs money and is only part of an overall availability solution.

You must understand the business requirements and the associated cost-benefit factors and then select an appropriate availability level. Before you try to create an application that never fails, be sure somebody needs it.
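
The cost-benefit comparison can be made explicit by adding the expected cost of downtime to the cost of the solution that delivers a given availability level. Every figure in the following sketch is hypothetical, as are the function names; a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # leap years are ignored


def total_yearly_cost(availability_percent, downtime_cost_per_hour,
                      solution_cost_per_year):
    """Expected downtime cost plus the cost of achieving this availability."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * downtime_cost_per_hour + solution_cost_per_year


def cheapest_option(options, downtime_cost_per_hour):
    """Pick the (availability_percent, solution_cost_per_year) pair
    with the lowest total yearly cost."""
    return min(
        options,
        key=lambda o: total_yearly_cost(o[0], downtime_cost_per_hour, o[1]),
    )


# Hypothetical options: (availability %, yearly cost of the solution).
options = [(99.0, 10_000), (99.9, 50_000), (99.99, 250_000)]
print(cheapest_option(options, downtime_cost_per_hour=1_000))
print(cheapest_option(options, downtime_cost_per_hour=100))
```

With these sample figures, an application whose downtime costs 1,000 per hour is best served by the 99.9% option, while one whose downtime costs 100 per hour is best served by the 99.0% option. The most available design is not the cheapest in either case.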

See Also

Availability

© 2008 Microsoft Corporation. All rights reserved.