Server Performance and Scalability Killers


George V. Reilly
Microsoft Corporation

February 22, 1999

Contents

Introduction
Server vs. Desktop Applications
IIS Flexibility and Performance
The Ten Commandments of Killing Server Performance
Conclusion

Introduction

Server performance is an issue that many writers of desktop applications now need to address. One unexpected consequence of the success of Component Object Model (COM) and componentware is that application servers like Active Server Pages (ASP)—an extension of Microsoft Internet Information Server (IIS)—end up hosting code that was never written with server environments in mind. There are important differences between desktop and server environments that can have huge, unappreciated effects on performance.

Server vs. Desktop Applications

The factors that affect the performance of desktop applications are well known. Long instruction paths mean slower code, the cardinal sin of performance. Heavy resource usage means fatter applications that leave fewer resources for other applications in the system. Slow startup times infuriate users. Large working sets bog down machines with high page-fault rates, leaving them slow and unresponsive. Server applications are also subject to these influences, as well as others that I examine below:

  • Server applications typically deal with dozens, if not hundreds, of clients simultaneously. A desktop application is considered snappy if it can respond to the user in under one-tenth of a second. If an operation takes the full 100 milliseconds, the application can perform only ten operations per second. Most server applications are expected to provide much better throughput than ten requests/second. High-latency networks (latency = message transmission time) lengthen response times, leaving even less time for a server to satisfy requests.
  • Server applications often deal with huge datasets. Inefficient algorithms, especially those with quadratic (or worse) running times, will not scale to millions of items.
  • Server machines are more powerful than desktop machines. They have more memory, bigger disks, faster CPUs, and often multiple processors. But more is expected of them; desktop machines spend much of their time idling with only sporadic bursts of activity, while server machines are under constant load. Server machines are expensive and they are expected to perform well.
  • Server applications are expected to have uptimes measured in months. Performance must not degrade over time due to resource leaks or the build-up of cruft (data structures and statistics that need to be periodically pruned).
  • A multithreaded architecture is a requirement for most server applications. A single-threaded server that takes one request at a time, and then spends most of its time blocking on I/O, will have unacceptable performance. A pool of threads can work on several requests simultaneously, making use of otherwise idle processor cycles. Indeed, to fully exploit multiprocessor systems, server applications must be multithreaded. Unfortunately, multithreaded applications are harder to write, harder to debug, and harder to make perform well, especially on multiprocessor systems. The thing that makes it worthwhile is that when you get it right, your performance will far exceed that of an equivalent single-threaded application.

Single-threaded applications are relatively straightforward to understand: exactly one thing is happening at a time in the program. In a multithreaded application, concurrent activity leads to complex interactions, with hard-to-predict effects. In addition, these interactions, buggy or not, are often hard to reproduce. Desktop applications rarely have more than one thread, and even when they do, these threads are used for discrete background tasks, such as printing.

IIS Flexibility and Performance

Internet Information Server is an application server. In many ways it is like a virtual operating system, because many ASP and ISAPI applications execute in our process space.

IIS uses an I/O thread pool to process all incoming requests. Requests for static files (.htm, .jpg, etc.) are satisfied immediately, while requests for dynamic content are dispatched to the appropriate ISAPI extension DLL. The Active Server Pages extension uses a worker thread pool to execute ASP pages. Because ASP is COM-based, all sorts of components end up executing in our process. This is a mixed blessing. It's wonderful for developers, because it allows easy reuse of components, makes ASP marvelously flexible, and has made ASP and IIS a big success. However, this flexibility has caused us performance problems. Many of these components were written for the desktop, and many of the components created specifically for ASP were written by people who don't fully understand what it takes to write a high-performance server component.

The same is often true of ISAPI extensions and filters: they interact badly with other components and with different instances of themselves.

All of the following apply to IIS; most of them apply to other kinds of server applications too.

The Ten Commandments of Killing Server Performance

Following each of these "commandments" will effectively hamper the performance and scalability of your code. In other words, these commandments should be broken whenever possible! I explain why and how to break them in order to improve performance and scalability.

  1. Thou shalt allocate and free lots of objects.

    You should avoid excessive allocation, because memory allocation can be costly. Freeing blocks of memory can be even more costly, as most allocators attempt to coalesce adjacent blocks of freed memory into larger blocks. Until the release of Windows NT® 4.0 service pack 4, the system heap often performed badly in multithreaded processes. The heap was protected by a single global lock and scaled negatively on multiprocessor systems.
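
    One common way to break this commandment is to recycle objects on a per-class free list instead of returning them to the heap. What follows is a minimal sketch, not the IIS implementation; the Request class is a hypothetical name, the single lock is a simplification, and a production version would bound the list's length and might keep per-thread lists to avoid the lock entirely.

        #include <windows.h>
        #include <new>

        // Sketch of a per-class free list: operator delete pushes the
        // object's memory onto a list and operator new pops it off, so
        // steady-state request processing does no heap allocation.
        // Assumes Request is never subclassed and that
        // InitializeAllocator is called once at startup.
        class Request {
        public:
            void* operator new(size_t size) {
                EnterCriticalSection(&s_lock);
                Request* p = s_freeList;
                if (p != NULL)
                    s_freeList = p->m_next;        // reuse a recycled block
                LeaveCriticalSection(&s_lock);
                return (p != NULL) ? p : ::operator new(size);
            }

            void operator delete(void* pv) {
                if (pv == NULL)
                    return;
                Request* p = static_cast<Request*>(pv);
                EnterCriticalSection(&s_lock);
                p->m_next = s_freeList;            // recycle, don't free
                s_freeList = p;
                LeaveCriticalSection(&s_lock);
            }

            static void InitializeAllocator() {
                InitializeCriticalSection(&s_lock);
            }

        private:
            static Request*         s_freeList;    // recycled objects
            static CRITICAL_SECTION s_lock;        // protects s_freeList
            Request* m_next;                       // link, used only while recycled
            // ... request data ...
        };

        Request*         Request::s_freeList = NULL;
        CRITICAL_SECTION Request::s_lock;

    Note that the free list never shrinks in this sketch; a real server would bound it so that a burst of traffic doesn't permanently inflate the working set.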

  2. Thou shalt not think about processor cache.

    Most people know that hard page faults caused by the virtual memory subsystem are expensive and best avoided, but still believe that all other memory access patterns are equally good. This has not been true since the 80486 was introduced. Modern CPUs have become so much faster than RAM that they need at least two levels of memory cache. On Pentium-class machines, the fast L1 cache holds 8 KB of data and 8 KB of instructions, while the slower L2 cache holds several hundred KB of mixed code and data. A reference to a memory location found in the L1 cache costs one cycle, references to the L2 cache cost four to seven cycles, while references to main memory cost dozens of processor cycles. The latter figure will soon exceed 100 cycles. In many ways, the caches are like a small, fast, virtual memory system.

    The fundamental unit of memory as far as the caches are concerned is not a byte, but a cache line. Pentium cache lines are 32 bytes wide; Alpha cache lines are 64 bytes wide. This means that there are only 512 slots for code and data in the L1 cache. If data that is used together (temporal locality) is not stored together (spatial locality), it can lead to poor performance. Arrays have excellent spatial locality, while linked lists and other pointer-based data structures tend to have poor locality.

    Packing data into the same cache line usually helps performance, but it can also hurt performance on multiprocessor systems. The memory subsystem works hard to keep the caches coherent between processors. If a read-only datum, used by all of the processors, shares a cache line with a datum that is being updated frequently by one of the processors, the caches will spend a lot of time updating their copy of the cache line. This high-speed game of Ping-Pong is often referred to as "cache sloshing." If the read-only data were in a different cache line, this sloshing could be avoided.
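
    One way to avoid the sloshing is to pad each frequently written datum out to a full cache line. A minimal sketch, assuming a 32-byte Pentium line and a hypothetical per-thread statistics structure:

        #include <windows.h>

        #define CACHE_LINE 32   // Pentium line size; use 64 on Alpha

        // Each worker thread updates only its own slot. The padding puts
        // every slot in its own cache line, so one processor's writes
        // don't invalidate the line in every other processor's cache.
        struct PerThreadStats {
            LONG requestsServed;
            char pad[CACHE_LINE - sizeof(LONG)];
        };

        PerThreadStats g_stats[16];   // assumed: at most 16 worker threads

    For the padding to pay off, the array itself must also start on a cache-line boundary, which the compiler does not guarantee by default.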

    As a rule, optimizing code for space is more effective than optimizing it for speed. Smaller code fits in fewer pages, leading to a smaller working set and fewer page faults, and it also fits in fewer cache lines. However, certain core functions should be optimized for speed; use a profiler to identify them.

  3. Thou shalt never cache frequently used data.

    Software caching can be used by all kinds of applications. When a calculation is expensive, you store a copy of the result. This is the classic space-time tradeoff: Sacrifice some memory to save some time. If done well, it can be extremely effective.

    You must cache judiciously. If you cache the wrong data, you're wasting memory. If you cache too much, you'll have less memory for other operations. If you cache too little, the cache will be ineffective because you'll have to recalculate the data missed by the cache. If you cache time-sensitive data for too long, it will become stale. Servers generally care more about speed than space, so they cache more aggressively than desktop systems. Be sure to invalidate and flush unused cache entries periodically; otherwise you'll have working set problems.
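
    As an illustration, here is a minimal sketch of a time-limited cache; the ExpiringCache name and the string-to-string mapping are assumptions, and a real server cache would also bound its size and, per commandment 5, protect itself with a lock.

        #include <windows.h>
        #include <map>
        #include <string>

        // A tiny cache whose entries go stale after a fixed time-to-live.
        // Stale entries are treated as misses and flushed on contact.
        class ExpiringCache {
        public:
            explicit ExpiringCache(DWORD ttlMillis) : m_ttl(ttlMillis) {}

            bool Lookup(const std::string& key, std::string* pValue) {
                std::map<std::string, Entry>::iterator it = m_entries.find(key);
                if (it == m_entries.end())
                    return false;                    // never cached
                // Unsigned subtraction handles GetTickCount wraparound.
                if (GetTickCount() - it->second.when > m_ttl) {
                    m_entries.erase(it);             // stale: flush it
                    return false;
                }
                *pValue = it->second.value;
                return true;
            }

            void Insert(const std::string& key, const std::string& value) {
                Entry e = { value, GetTickCount() };
                m_entries[key] = e;
            }

        private:
            struct Entry { std::string value; DWORD when; };
            std::map<std::string, Entry> m_entries;
            DWORD m_ttl;
        };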

  4. Thou shalt create lots of threads. The more, the merrier.

    Tuning the number of threads active in a server is critical. If the threads are I/O-bound, they'll spend much of their time blocking, waiting for I/O to complete — and a blocked thread is a thread that's not doing useful work. Adding threads can increase throughput, but adding too many will decrease server performance, because context switching becomes a significant overhead. The rate of context switches should be kept low for three reasons: context switches are pure overhead and contribute nothing useful to the application's work; they use up valuable cycles; and worst of all, they leave the processor's caches full of stale data, which is expensive to replace.

    Much depends upon your threading architecture. One thread per client is almost never appropriate, as it does not scale well to large numbers of clients. Context switching becomes intolerable and Windows NT runs out of resources. A thread pool model, where a pool of worker threads consumes a queue of requests, works better. In the future, you will no longer need to write your own thread pool, because Windows 2000 provides suitable APIs, such as QueueUserWorkItem.
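
    As a preview, here is a minimal sketch of queuing work to the Windows 2000 system pool; ProcessRequest is a hypothetical handler, and the Sleep is a crude stand-in for a real shutdown protocol.

        #define _WIN32_WINNT 0x0500   // Windows 2000 APIs
        #include <windows.h>
        #include <stdio.h>

        // Runs on a thread supplied by the system's pool.
        static DWORD WINAPI ProcessRequest(PVOID context) {
            int requestId = (int)(INT_PTR)context;
            printf("request %d on thread %lu\n",
                   requestId, GetCurrentThreadId());
            return 0;
        }

        int main() {
            // Queue the requests; the pool decides how many threads run.
            for (int i = 0; i < 20; ++i) {
                if (!QueueUserWorkItem(ProcessRequest,
                                       (PVOID)(INT_PTR)i,
                                       WT_EXECUTEDEFAULT))
                    fprintf(stderr, "queue failed: %lu\n", GetLastError());
            }
            Sleep(1000);   // crude: let queued items finish before exiting
            return 0;
        }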

  5. Thou shalt use global locks for data structures.

    The easiest way to make data thread-safe is to surround it with one big lock. For simplicity, make everything use the same lock. There's a problem with this approach: serialization. Every thread that needs to manipulate the data is going to have to wait in line to acquire the lock. If a thread is blocked on a lock, it's not doing useful work. When the server is under light load, this is seldom a problem, because only one thread is likely to want the lock at a time. Under heavy load, a high-contention lock can become a huge problem.

    Consider an accident on a multilane freeway, where all the traffic has been diverted into one lane. If traffic is light, the diversion will have a negligible effect on the rate of traffic flow. When traffic is heavy, the traffic jam will stretch for miles as the cars slowly merge into the single lane.

    Several techniques can reduce lock contention.

    Don't be overprotective; that is, don't lock data when you don't have to. Hold locks for just as long as you need them, and no longer. It is important not to hold locks unnecessarily around large sections of code or in frequently executed code.

    Partition your data so that it's protected by a set of disjoint locks. For example, a symbol table could be partitioned on the first letter of the identifiers, so that modifying the value of a symbol whose name starts with Q does not interfere with reading the value of a symbol whose name starts with H.
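
    A sketch of such a partitioned symbol table follows; the 26-way split on the first letter and the map representation are illustrative choices, not a prescription. A thread modifying a Q symbol and a thread reading an H symbol acquire different locks and proceed in parallel.

        #include <windows.h>
        #include <map>
        #include <string>

        // A symbol table split into independently locked partitions.
        // Destructor (DeleteCriticalSection calls) omitted for brevity.
        class PartitionedSymbolTable {
        public:
            PartitionedSymbolTable() {
                for (int i = 0; i < NUM_PARTITIONS; ++i)
                    InitializeCriticalSection(&m_locks[i]);
            }

            void Set(const std::string& name, int value) {
                int i = Partition(name);
                EnterCriticalSection(&m_locks[i]);
                m_tables[i][name] = value;
                LeaveCriticalSection(&m_locks[i]);
            }

            bool Get(const std::string& name, int* pValue) {
                int i = Partition(name);
                EnterCriticalSection(&m_locks[i]);
                std::map<std::string, int>::iterator it = m_tables[i].find(name);
                bool found = (it != m_tables[i].end());
                if (found)
                    *pValue = it->second;
                LeaveCriticalSection(&m_locks[i]);
                return found;
            }

        private:
            enum { NUM_PARTITIONS = 26 };   // one partition per initial letter
            int Partition(const std::string& name) const {
                char c = name.empty() ? 'a' : (char)(name[0] | 0x20);
                return (c >= 'a' && c <= 'z') ? c - 'a' : 0;
            }
            std::map<std::string, int> m_tables[NUM_PARTITIONS];
            CRITICAL_SECTION           m_locks[NUM_PARTITIONS];
        };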

    Use the Interlocked family of APIs (InterlockedIncrement, InterlockedCompareExchangePointer, etc.) to atomically modify data without acquiring a lock.
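
    For a simple shared counter, that looks like this (g_requestsServed is an illustrative name):

        #include <windows.h>

        LONG g_requestsServed = 0;   // shared by all worker threads

        void OnRequestCompleted() {
            // Atomic increment; cheaper than acquiring and releasing a lock.
            InterlockedIncrement(&g_requestsServed);
        }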

    Use multi-reader/single-writer locks when the data isn't modified often. You'll get better concurrency, though the lock operations will be more expensive and you may risk starving writers.

    Use spin counts on critical sections. See the SetCriticalSectionSpinCount API, introduced in Windows NT 4.0 service pack 3.

    Use TryEnterCriticalSection and do some other useful work if you can't acquire the lock.
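
    These last two techniques combine naturally. A minimal sketch, assuming _WIN32_WINNT is set for the NT 4.0 service pack APIs and that your application has some way to defer the update (the else branch is only a placeholder):

        #define _WIN32_WINNT 0x0403   // NT 4.0 SP3 APIs
        #include <windows.h>

        CRITICAL_SECTION g_statsLock;

        void InitStatsLock() {
            // Spin briefly before sleeping; this pays off for short, hot
            // critical sections on multiprocessor machines.
            InitializeCriticalSectionAndSpinCount(&g_statsLock, 4000);
        }

        void UpdateStats() {
            if (TryEnterCriticalSection(&g_statsLock)) {
                // ... update the shared statistics ...
                LeaveCriticalSection(&g_statsLock);
            } else {
                // The lock is busy: queue the update for later and go do
                // other useful work instead of blocking here.
            }
        }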

    High contention leads to serialization, which leads to low CPU utilization, which encourages users to add more threads, which makes matters even worse.

  6. Thou shalt not pay attention to multiprocessor machines.

    It can be a nasty surprise to find that your code performs worse on a multiprocessor system than it does on a uniprocessor system. The natural expectation is that it will perform N times better on an N-way system. The reason for the poor performance is contention: lock contention, bus contention, and/or cache line contention. The processors are fighting over the ownership of shared resources instead of doing productive work.

    If you're at all serious about writing multithreaded applications, you should be stress-testing and performance-testing your application on multiprocessor boxes. Uniprocessor systems provide a coarse-grained illusion of concurrency by time-slicing the execution of threads. Multiprocessor boxes have true concurrency, and race conditions and contention are far more likely to arise.

  7. Thou shalt use blocking calls all the time; they are fun.

    Synchronous blocking calls to perform I/O are appropriate for most desktop applications, but they're a poor way to exploit the CPU(s) on a server. I/O takes millions of cycles to complete, cycles that could be put to good use. By using asynchronous I/O, you can achieve significantly higher client-request and I/O throughput, at the cost of additional complexity.

    If you have blocking calls or I/O operations that can take a long time, you should think about how many resources you want to commit. Do you want all the threads to be used, or only a limited set? In general, it is better to use a limited set of threads. Build a small thread pool and a queue, and use the queue to schedule work for the threads to complete the blocking calls. This will enable your application to leave other threads available to pick up and process new, non-blocking requests.
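
    Here is a hedged sketch of that queue-plus-pool arrangement; the queue depth, the thread count, and DoBlockingCall are all illustrative assumptions, and a production queue would also handle overflow and shutdown.

        #include <windows.h>

        #define NUM_BLOCKING_THREADS 4
        #define MAX_QUEUED           256

        void DoBlockingCall(PVOID item);   // assumed: the slow operation

        // Ring buffer of pending items: the semaphore counts queued items,
        // the critical section protects the buffer and its indices.
        static PVOID            g_items[MAX_QUEUED];
        static int              g_head = 0, g_tail = 0;
        static CRITICAL_SECTION g_queueLock;
        static HANDLE           g_itemsReady;

        void QueueBlockingWork(PVOID item) {
            EnterCriticalSection(&g_queueLock);
            g_items[g_tail] = item;                   // assumes no overflow
            g_tail = (g_tail + 1) % MAX_QUEUED;
            LeaveCriticalSection(&g_queueLock);
            ReleaseSemaphore(g_itemsReady, 1, NULL);  // wake one worker
        }

        static DWORD WINAPI BlockingWorker(PVOID) {
            for (;;) {
                WaitForSingleObject(g_itemsReady, INFINITE);
                EnterCriticalSection(&g_queueLock);
                PVOID item = g_items[g_head];
                g_head = (g_head + 1) % MAX_QUEUED;
                LeaveCriticalSection(&g_queueLock);
                DoBlockingCall(item);   // only these few threads ever block
            }
        }

        void StartBlockingPool() {
            InitializeCriticalSection(&g_queueLock);
            g_itemsReady = CreateSemaphore(NULL, 0, MAX_QUEUED, NULL);
            for (int i = 0; i < NUM_BLOCKING_THREADS; ++i) {
                HANDLE h = CreateThread(NULL, 0, BlockingWorker, NULL, 0, NULL);
                if (h != NULL)
                    CloseHandle(h);    // workers run for the process lifetime
            }
        }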

  8. Thou shalt not measure.

    When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science.

    - Lord Kelvin (William Thomson)

    Without measurements, you do not understand your application's performance. You are groping blindly, making half-informed guesses. You have not identified your performance problems and you cannot begin to make any improvements or do capacity planning.

    Measurement encompasses black-box measurement and profiling. Black-box measurement means gathering the numbers exposed by performance counters (memory usage, context switches, CPU utilization, etc.) and by external testing tools (throughput, response times, etc.). Profiling means compiling an instrumented version of your code and running it through various scenarios, gathering statistics on execution time and procedure-call frequency.
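
    The black-box side can be as simple as a timing harness around a scenario. A minimal sketch using the high-resolution counter; RunScenario is a hypothetical stand-in for the code under test:

        #include <windows.h>
        #include <stdio.h>

        void RunScenario();   // assumed: the operation being measured

        // Wall-clock time of one run, in milliseconds. A real harness would
        // run many iterations and report averages and percentiles.
        double TimeScenarioMillis(void (*scenario)()) {
            LARGE_INTEGER freq, start, stop;
            QueryPerformanceFrequency(&freq);
            QueryPerformanceCounter(&start);
            scenario();
            QueryPerformanceCounter(&stop);
            return (double)(stop.QuadPart - start.QuadPart)
                   * 1000.0 / (double)freq.QuadPart;
        }

        // Example: printf("%.3f ms\n", TimeScenarioMillis(RunScenario));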

    Measurement is of little use if it is not complemented by analysis. Measurement will tell you that you have problems and it may even help you isolate where those problems are, but it won't tell you why you have problems. Analyze the problems so that you can determine how to fix them properly. Address the underlying problems and not just the symptoms.

    When you make your changes, measure again. You need to learn if your changes were effective. The changes may unmask other performance problems, and the cycle of measure-analyze-fix-measure will begin anew. You must also measure regularly to catch performance regressions.

  9. Thou shalt use single-client, single-request testing.

    A common mistake for people writing ASP and ISAPI applications is to test them with a single browser. When they deploy their applications on the Internet, they discover that the applications cannot cope with high loads and that throughput and response times are miserable.

    Testing with a browser is necessary, but not sufficient. If the server does not respond quickly enough, you know you have a problem. But even if it is snappy when you're using a single browser, you don't know how well it copes with load. What happens if a dozen clients are making requests simultaneously? Or a hundred? What sort of throughput can your application sustain? What sort of response time does it provide? What are these numbers like when your application is under light load? Medium load? Heavy load? How well does your application scale if you run it on a multiprocessor machine? Stress testing your application is essential both to shake out bugs and to discover performance problems.

    Similar considerations about load testing apply to all server applications.

  10. Thou shalt not use real-world scenarios.

    It's easy to fall into the trap of tuning your application for a few specific, artificial scenarios (such as benchmarks). It's important to pick a broad spectrum of scenarios that correspond to real-world usage and optimize for a broad range of operations. If you don't find these scenarios, your users and reviewers certainly will, and they'll pan you for it.

Conclusion

We've learned a few things about killing server performance and scalability since we started developing IIS. Writing high-performance server applications is not easy. In addition to the traditional performance issues that arise when writing desktop applications, you must pay special attention to memory allocations, cache lines, caching data, thread proliferation, locking strategies, multiprocessor machines, blocking calls, measurement and analysis, multiclient testing, and real-world scenarios. These issues will make you or break you.

Good luck!

George V. Reilly works on the IIS development team, where he is responsible for the performance of IIS and ASP. He has been doing his best to write tight code since he discovered the BBC Micro in his native Dublin, Ireland in 1982. In retrospect, he has a hard time believing that he found enough time to co-write Beginning ATL COM Programming, Wrox Press, 1998.

This article was inspired by my colleague Murali R. Krishnan's list of Top Ten Killers of Server Performance.