From the August 2001 issue of MSDN Magazine

MSDN Magazine

Multiprocessor Optimizations: Fine-Tuning Concurrent Access to Large Data Collections

Ian Emmons
This article assumes you're familiar with Win32 and C++
Level of Difficulty     1   2   3 
Download the code for this article: Concurrent.exe (46KB)
SUMMARYApplication performance involves more than just speed. In a Web server environment, top performance also means ensuring that the maximum numbers of users can be served concurrently. This can be accomplished through efficient use of multiprocessor machines and thread management. This article presents techniques that can solve a number of concurrency problems. One approach, using thread management, controls access to a database on a per-thread basis, which protects the integrity of the data. In the article, reusable thread classes are built and presented. The classes are then tested and their performance in a live environment is examined.

Sooner or later, the success of a server application comes down to performance. But performance in a server application isn't quite the same as pure speed. You also have to consider concurrency—that is, how many clients you can serve at the same time. In fact, concurrency is often more important for a server than raw speed, especially on high-end hardware with many CPUs. A server application with low concurrency simply can't make use of multiple CPUs, no matter how fast it is.
      I know of no straightforward cookbook recipes for optimizing concurrency, but over the years I've developed a bag of tricks that help a great deal. One of those tricks is great for synchronizing access to the objects in a large collection, without sacrificing concurrency. This article will demonstrate my technique. First, I'll show a small multithreaded sample program that demonstrates the need for concurrent access to the objects in a large collection. Then I will develop a reusable class that solves the problem, and will compare the resulting performance numbers. As an added bonus, the sample program will include a handy set of classes (reusable, of course!) for creating and synchronizing threads.
      The sample code will run on Windows® 98 Second Edition, Windows NT® with Service Pack 6a, or Windows 2000 with Service Pack 1. (It may run on other versions, but that's all I've tested.) Also, the code requires Visual C++® 6.0 with Service Pack 4.
      Note that I use a somewhat different coding style than is the norm in the Windows programming community. In particular, I use the identifier prefixes k_ for constants and enums, g_ for globals, static globals, and static class members, and m_ for class members.

Building Blocks

      Before slicing into the meat of the sample program, I'll discuss some of the building blocks that went into it. The first piece is the Exception class (see Figure 1), which is mainly intended for converting Win32® API errors into C++ exceptions. There really isn't much to the class, so I won't drag you through all of the details. (If you're interested, download the full sample source code from the link at the top of this article.) One interesting feature of Exception is that the getErrorMsg method uses the Win32 API FormatMessage to fetch the textual error description corresponding to the Win32 error code. This makes tracking down Win32 API errors much faster. For example, instead of seeing merely the error code 120, it will tell you "This function is not supported on this system."
      The Timer class (see Figure 2) is used to time the sample program test runs. It's also an extremely simple class, even more so than Exception. It is entirely inline, so there is only a header file for it. Timer calls the Win32 API GetSystemTimeAsFileTime from its constructor to get a start time, and then once again from the stop method to get an end time. The getXxx methods convert the difference between the two into a variety of sensible units.
      The CritSec class wraps a critical section (see Figure 3). At first, some folks find this class to be so simple that they don't believe it's useful. First appearances can be deceptive, though. This class is important for three reasons. First, it makes the critical section object oriented, so that it looks right in a C++ program. This is extremely useful because modern C++ uses very different programming idioms than C does, so direct use of C functions forces the programmer to switch idioms frequently, leading to code that's difficult to read and hard to maintain. Nowhere is this more apparent than in the error handling surrounding thread synchronization objects. CritSec converts Win32 API errors into exceptions (using the Exception class), resulting in vastly cleaner code.
      Second, CritSec (in combination with LockKeeper, to be discussed shortly) makes critical sections exception-safe. In other words, you must be sure that no critical sections are left locked after an exception is thrown. This is extremely important for avoiding deadlocks while recovering from error conditions.
      Third, by wrapping critical sections in a class, you can make them platform-independent. A cross-platform implementation is beyond the scope of this article, but this same CritSec class interface can be implemented with the POSIX thread API on Unix.
      The LockKeeper template class (also in Figure 3) is used in combination with CritSec. Rather than call the acquire and release methods directly, I usually instantiate a LockKeeper<CritSec> instance on the stack. The constructor calls the acquire method for me, and the destructor calls the release method automatically when the LockKeeper goes out of scope. This eliminates a line of code, but more importantly it ensures that an exception thrown while the lock is held will automatically release the lock as the exception propagates out of the scope of the LockKeeper instance.
      Notice that because LockKeeper is a template, it can be used on any class where the acquire and release methods do not take arguments. This is very handy because in a real multithreaded app there will be many such classes—Mutex, Semaphore, and others (as you will see shortly).
      Unlike the previous building block classes, the Thread class is a bit difficult to wrap your head around (see Figure 4). For those of you who know Java, it is much like the Java Thread class, except that it doesn't support a separate Run interface—it requires a derivation directly from Thread.
      For those of you not familiar with Java's thread mechanism, here is the way you use the Thread class. First, you derive a class from Thread (say, MyThread) and override the pure virtual method run. Put the code that you want the new thread to execute in this method. Then you construct an object of type MyThread and call start. At this point, the new thread will begin execution in your run method. If you want another thread to wait until the MyThread thread exits, call wait. (Of course, you should never call the wait method from inside the run method.)
      One of the advantages of modeling a thread in this way is that it makes communication with the thread much easier. To pass information to the thread, all you need to do is store the information in data members of your MyThread class. I often do this in a constructor. To pass information out of your thread, just do the reverse—in your run method, store the information in data members of your MyThread class, and then implement getter methods to allow other threads to fetch the data.
      It is very important that any data members of your thread class be private, with getter and/or setter methods to access them. This is because these data members will most likely be accessed from multiple threads, so you need the control offered by encapsulation to guarantee proper thread synchronization.
      The implementation of the Thread class has a few tricky points (see Figure 5). First, notice the private and static entryPoint method:

  static unsigned __stdcall entryPoint(void* pArg);

This method is the function that is passed to the _beginthreadex API to start the thread running. You can't pass a nonstatic method (like run) because the calling convention of nonstatic methods is different from what _beginthreadex expects. (It doesn't know how to pass the this pointer.) However, a static method that is declared __stdcall will fit the bill nicely. The call to the _beginthreadex API (which is in the start method) passes a pointer to the entryPoint method and the this pointer. When the new thread starts executing in entryPoint, the pArg parameter will be the this pointer passed to _beginthreadex. Then the entryPoint method casts it to Thread* and invokes run. The magic of virtual methods takes over to route execution to your MyThread::run method.
      The entryPoint method has one other purpose. It provides last-ditch exception handlers for the most likely exception types. This ensures that you get a useful error message when something happens to go wrong.
      One other subtle aspect of the Thread class deserves mention. In the start method, you will notice that the thread is created in a suspended state, and then the thread ID and handle are assigned to class members. Finally, the thread is resumed. This suspension is necessary to prevent a race condition. Suppose that the new thread was created in the running state. If the new thread calls getId or getHandle early in the run method, then it may access those members before the assignments in the start method have been made.

Sample Server Program

      Enough preliminaries—on to the sample program. It is difficult to provide a realistic example of a server application that is also simple enough to present in a short article, so the sample shown here is a bit artificial. In fact, it isn't really a server application at all. However, I've attempted to mock up a believable scenario for a server application so you should be able to see applications to real-life scenarios without having to use too much imagination.
      The sample, ConcurrencyDemo (or CD for short), simulates a simple in-memory database. The DBEntry class (see Figure 6) represents an entry in the database, which consists of a numeric key and some text. Clients can either read the text associated with a given key, or they can update the text for a particular key. The number of objects is fixed at startup time.
      The CD sample contains a pool of worker threads. In real life, these would take requests from clients via some remote invocation mechanism, such as TCP, DCOM, IIOP, or (if you're right on the bleeding edge) SOAP. In the sample, the generation of requests is simulated rather than real.
      The CD sample takes the following command-line arguments: the number of entries in the database, the number of test iterations to perform, and the number of worker threads to use. If multiple numbers of threads are specified, then the program will run the test once with each number.
      When CD starts up, it first interprets its command-line arguments, then it creates the database (see Figure 7). Next, CD prints an output header. (If this doesn't look like legal C++ to you, then you haven't learned about a handy C preprocessor feature. The preprocessor concatenates all character string literals that are separated only by white space.) Finally, CD calls doTest twice for each number of threads specified on the command line. The first test synchronizes access to the database in the most obvious way. The second uses my optimized technique. In order to ensure useful error messages, the main function also catches the most likely exception types.
      The doTest function is shown in Figure 8. It starts by creating a thread pool, which is a vector of pointers to WorkerThread objects. Actually, to be more precise, it's a vector of auto_ptr<WorkerThread> objects, so that when the array is destroyed, all of the WorkerThread objects will be deleted also. I will discuss the WorkerThread class in detail in the next section.
      Next, doTest creates a timer to time the test, starts each thread running, and waits on each thread to be sure they have all finished the test. Finally, DoTest stops the timer and prints the result. Note that the output format is a little bit odd. It's in comma-separated value (CSV) format. This way I can redirect the output into a file with the .csv file extension and import it directly into Microsoft Excel. This makes creating pretty charts quick and easy.

The WorkerThread Class

      The WorkerThread class (see Figure 9 and Figure 10) is, obviously, derived from Thread. The constructor takes a bunch of input parameters and simply stores them in data members for later use. The run method uses the first three of these to control a loop. This may look a bit strange at first, but the point is to distribute the total number of iterations among all the threads as evenly as possible, even when the number of iterations is not evenly divisible by the number of threads. In such a case, some of the threads will iterate one more time than the others.
      In each iteration, the run method calls getRequest to fetch the parameters of the next request. In a real-life program, getRequest would get input from clients via some network transport mechanism. To keep things simple, I wrote getRequest to generate random requests. In order to prevent this from skewing the results, I had to use my own random number generator. This is because the compiler's random number generator uses a global seed, and therefore needs a critical section. This introduces an extra thread synchronization that masks the results I am trying to observe. My random number generator uses a thread-local seed (stored in the WorkerThread class), and therefore needs no synchronization.
      Depending on the type of request, the run method then calls either doUpdate or doRead. Each of these methods fetches a reference to the indicated database entry, locks it, performs the requested operation, and returns. Note that a LockKeeper object is used to lock the database entry directly. This is an example of how the LockKeeper template can be used to lock any object that has acquire and release methods that take no arguments. Later on I will take a close look at the acquire and release methods on the DBEntry class.
      Also note that the doUpdate and doRead methods call the simulateWork method. In a real-life server, updates and reads (especially updates) will incur some overhead. In order to make doUpdate and doRead take enough execution time to show the intended result, I needed to add the calls to simulateWork. On the machines I used for the benchmark, this added about 20 and 200 microseconds to the execution times of doRead and doUpdate, respectively. (A word of praise for the Visual C++ optimizer: I had to add an access of a global to simulateWork in order to prevent the method and calls to it from being entirely optimized away.)
      There is one subtlety in doRead and doUpdate that is tricky to spot. If you look at the CD code in general, wherever a string is passed to a method as an in parameter, the parameter type is a const string reference, and wherever a string is returned from a method, the return type is string. However, the setValue and getValue methods on DBEntry take and return a const char* instead. You might think that this is just a holdover from my years of C and C++ coding before the C++ Standard Library, but it is quite intentional. Two string objects can contain references to the same character array in order to optimize memory usage and reduce the time spent copying strings. If setValue took a const string reference and assigned it to the m_value member, then the member would merely contain a reference to the passed-in string. This could lead to two threads simultaneously accessing the same character array. Since the string class is unsynchronized, this can lead to heap corruption. My solution is to pass (or return) a const char* instead. This forces a string copy while the lock is held on the DBEntry object. Once a copy has been made, other threads can touch the DBEntry without harm.

The Problem and Possible Solutions

      Now that you have a basic understanding of what the sample program is doing, let's look at the problem: how can access to the database entries be synchronized? You can't allow the worker threads to get and set database entries willy-nilly, since the resulting race conditions would lead to incorrect results at best, and server crashes at worst.
      The simplest and most obvious solution is to use a single critical section. Any thread that wants to read or update an entry must first acquire the critical section. In many cases, this is the most appropriate solution. However, notice that it synchronizes the threads more than necessary. Two threads that are accessing different entries don't need to be synchronized at all, but the use of a single critical section synchronizes them anyway. If multiple threads often want to access different entries at the same time, then this will cause a bottleneck in your server and limit its concurrency.
      Another solution is to embed a critical section in the DBEntry itself. This solution has two nasty problems. First, it consumes a tremendous amount of system resources. Each critical section uses 24 bytes of memory, plus an event kernel object. Thus, if your database has 1,000,000 entries in it, this approach will consume 24MB of memory and 1,000,000 events just for synchronization purposes! This will not be popular with your customers.
      The other problem with this approach is that you can't use it to synchronize the deletion of an entry. (This isn't a problem in the sample, but it typically is in real life.) To delete an entry, you'd lock it and then delete it. Another thread that was waiting to read the object would then acquire a critical section residing in deleted memory. This is most decidedly unpopular with customers.
      My solution to this problem is encapsulated in the CritSecTable class (see Figure 11 and Figure 12). This class's interface looks almost identical to that of CritSec, except that the acquire and release methods take a void* argument.
      To use this class, you associate a CritSecTable instance with the large collection to which you want concurrent access (the in-memory database, in this case). When you want to lock or unlock an entry, you call acquire or release, respectively, on the CritSecTable object, passing the address of the entry as the void* parameter. See the implementation of the DBEntry class's acquire and release methods (shown in Figure 13) for an example. (I've implemented these methods so that they can use either a single critical section for the whole database or a CritSecTable, depending on the setting of the global flag g_useSingleCritSec.)
      Internally, CritSecTable maintains an array of CritSec instances (256 in my implementation). When you call acquire or release, it computes a hash from the void* argument and uses the hash to index into the array, calling acquire or release on that CritSec object. This means that for any given database entry, you will always use the same CritSec object, guaranteeing that you maintain the correct synchronization behavior.
      If two threads try to lock different entries simultaneously, the probability is low (one chance in 256) that one will block the other. Naturally, as the number of threads attempting concurrent access to the database goes up, so does the probability that one or more will be blocked. However, on Windows this is generally not a problem. Machines running Windows rarely have more than eight CPUs, and (as far as I know) the upper limit is 32. If your server application is using an I/O completion port (as it should), then the number of running threads in your server at any given time will be close to the number of CPUs. (Note: A discussion of I/O completion ports is beyond the scope of this article, but several excellent articles on the topic have appeared in MSDN Magazine in recent years. See "Windows Sockets 2.0: Write Scalable Winsock Apps Using Completion Ports" for one such reference.)
      I chose to make the CritSec array length 256 because it makes for a very simple but effective hash function. In this case the hash function simply XORs the bytes of the void* together into a single unsigned char. I've spent some time optimizing the hash function for the cases where a void* is four bytes and cases where it's eight bytes in length, as well as including the general case (just in case you happen to be running Windows on a system that has seven-byte pointers). On a 32-bit Intel system, the Visual C++ compiler reduces the hash method's 32 lines of C++ down to just six assembly language instructions.
      If your application will have a significantly larger number of threads trying to lock different entries at once, you might want a larger CritSec array, but you will need a more sophisticated hash function to pull it off.

Concurrency Results

      I ran the CD sample on a 500 MHz Pentium III machine with one CPU, and on a 550 MHz Pentium III Xeon machine with four CPUs. In each case, I ran two tests (one with a single critical section and the other with a CritSecTable) with increasing numbers of worker threads. Figure 14 shows the results for the single-CPU machine, with one to eight worker threads.

Figure 14 Results on a Single-CPU Machine
Figure 14 Results on a Single-CPU Machine

      When using a single critical section, the time to execute a fixed number of iterations increased dramatically as the number of threads increased. This is because the single critical section causes the threads to context switch excessively often, and the CPU spends a disproportionate amount of time in kernel mode executing the context switching code.
      You can confirm this on Windows NT or Windows 2000 with the Performance Monitor application. On Windows NT you can find this under Start | Programs | Administrative Tools. Choose Add to Chart from the Edit menu, set the performance object to System, and add these three traces to your chart: % Total Processor Time, % Total Privileged Time, and Context Switches/sec. On Windows 2000 it's the same idea, but the Redmondtonians have moved things around. The Performance Monitor is located in the Control Panel under Administrative Tools. Click the big plus on the toolbar, set the performance object to Processor, and add the traces % Processor Time and % Privileged Time. Then set the performance object to System and add Context Switches/sec.
      When you run ConcurrencyDemo, you will see the amount of privileged time and the number of context switches go way up during the single-critical section tests with a lot of worker threads. In contrast, the tests using the critical section table will context switch much less, resulting in the relatively constant execution times, as shown in Figure 14.

Figure 15 Results on a Multi-CPU Machine
Figure 15 Results on a Multi-CPU Machine

      The results on a four-CPU machine are more striking (see Figure 15). The trend with the single-critical section is relatively constant after an initial increase going from one to two threads. Notice the radically different behavior here from the single-critical section case on a one-CPU box. The reason the four-CPU box doesn't get swamped by context switching is that on a multi-CPU machine the critical sections have a spin count. This means that when a thread tries to acquire an already-locked critical section, the thread will "spin" in a loop for a while, checking to see if the lock has been released, before it goes to sleep. This is beneficial because going to sleep causes a context switch, which is more expensive than the cycles wasted spinning.
      When using a critical section table, the execution time drops off dramatically (by over 70 percent) as you go from one to four threads. This shows that all four CPUs are working in parallel to achieve maximum concurrency. (Actually, the theoretical maximum concurrency would show a 75 percent drop-off, but perfectly linear scaling is rare.) Notice that as you go beyond four threads, performance decreases as the context is switched more frequently. This shows the importance of using an I/O completion port to keep the number of running threads very close to the number of CPUs.

Conclusion

      The technique presented here for increasing concurrency has many applications. I first devised it in a server application that used C++ smart pointers to implement garbage collection via reference-counted objects. I protected the reference counts with a critical section table. (You might think that I could use the InterlockedXxx family of functions for this, but the actions of decrementing the reference count to zero and deleting the object must be atomic, which requires a critical section.)
      A critical section table can also be used to overcome one of the major limitations of the database in the CD sample, namely the inability to add or remove entries to or from the database in a thread-safe and highly concurrent manner. Creating a highly concurrent collection is a difficult task, and is beyond the scope of this article, but a critical section table is an important component in one solution to the problem that I have found.
      I hope that you will find your own uses for this technique in your server applications. I'd love to hear about them if you do.

For related articles see:
Write Scalable Winsock Apps Using Completion Ports
For background information see:
Advanced Windows by Jeffrey Richter (Microsoft Press, 1997)
Ian Emmons is a software architect at Persistence Software Inc. (https://www.persistence.com). He specializes in the design of high-performance server applications. He can be reached at ian@persistence.com.