Printer Friendly Version      Send     
Click to Rate and Give Feedback
Related Articles
This article presents an overview of the motivation behind new techniques that decompose problems into independent pieces for optimal use of parallel programming.

By David Callahan (October 2008)
We take a look at planned support for parallel programming for both managed and native code in the next version of Visual Studio.

By Stephen Toub and Hazim Shafi (October 2008)
Here we describe some of the more common challenges to concurrent programming and present advice for coping with them in your software.

By Joe Duffy (October 2008)
Here is an ASP.NET AJAX data-driven Web application that takes the best features from server- and client-side programming to deliver an efficient, user-friendly experience.

By Bertrand Le Roy (October 2008)
More ...
Articles by this Author
Jeffrey Richter shows you some additional cool features of his AsyncEnumerator class.

By Jeffrey Richter (August 2008)
Here Jeffrey Richter introduces his AsyncEnumerator class, which drives an iterator so that different thread pool threads can execute the same code at different times.

By Jeffrey Richter (June 2008)
Jeffrey Richter introduces his AsyncEnumerator class and explains how it harnesses some recent additions to the C# programming language that make working with the asynchronous programming model significantly easier.

By Jeffrey Richter (November 2007)
Jeff Richter uses the AsyncResult class to implement the CLR’s Asynchronous Programming Model to perform hardware device operations asynchronously.

By Jeffrey Richter (June 2007)


By Jeffrey Richter (March 2007)
SideShow Gadgets for Windows Vista are cool. Writing your own is even better. Find out how it's done.

By Jeffrey Richter (January 2007)


By Jeffrey Richter (November 2006)
What can a robot-programming toolkit do for you? Read on and find out.

By Jeffrey Richter (September 2006)
More ...
Popular Articles
See how to build a document-level Visual Studio Tools for Office customization and integrate it with a content type in SharePoint.

By Steve Fox (May 2008)
In this excerpt from his upcoming book, Laurence Moroney explains the basics of Silverlight animation and the animation tools available in Expression Blend.

By Lawrence Moroney (August 2008)
Learn how to create a workflow that uses InfoPath forms and other office documents for passing data to targeted activities and for use in Office documents.

By Rick Spiewak (June 2008)
One-time passwords offer solutions to dictionary attacks, phishing, interception, and lots of other security breaches. Here's how it all works.

By Dan Griffin (May 2008)
More ...
Read the Blog
Well designed code keeps things that have to change together as close together in the code as possible and allows unrelated things in the code to change independently, while minimizing duplication in the code. In the October 2008 issue of MSDN Magazine, Jeremy Miller shows you some design ...
Read more!
The process for ink capture and analysis on the Tablet PC is straightforward in managed code. To the uninitiated developer, however, creating unmanaged Tablet PC applications can be rather daunting. In the October 2008 issue of MSDN Magazine, Gus Class a quick introduction to the Tablet PC ...
Read more!
Multicore systems are becoming increasingly prevalent, but the majority of software today will not automatically take advantage of this additional processing ability. And multithreaded programming, for anything but the most trivial of systems, is incredibly difficult and error prone today. In the October 2008 issue of MSDN ...
Read more!
Concurrent programming is notoriously difficult, even for experts. You have all of the correctness and security challenges of sequential programs plus all of the difficulties of parallelism and concurrent access to shared resources. In the October 2008 issue of MSDN Magazine, David Callahan describes ...
Read more!
A major advantage of AJAX and Silverlight applications is that they can transparently and continuously interact with a back-end service. The problem is that they run over HTTP, which wasn't designed with security in mind. In the September 2008 issue of MSDN Magazine, Dino Esposito shows you ...
Read more!
Unhandled exception processing shouldn't be a mystery. It's actually quite useful since it gives a crashing application an opportunity to perform last-minute diagnostic logging about what went wrong. In the September 2008 issue of MSDN Magazine, Gaurav Khanna discusses how ...
Read more!
More ...
Concurrent Affairs
Performance-Conscious Thread Synchronization
Jeffrey Richter

Code download available at: ConcurrentAffairs0510.exe (117 KB)
Browse the Code Online
In my career, I have architected and implemented many thread synchronization techniques. This has provided me with a lot of experience that has shaped the way I now think about thread synchronization problems. In this new column about concurrency, I will discuss my way of thinking on the subject and offer some code that I think you'll find useful when dealing with thread synchronization problems yourself.
To really grasp the concepts in this column, you must have some understanding of thread synchronization already. In particular, you should have knowledge and experience with existing Windows® thread synchronization primitives such as the interlocked functions, mutexes, semaphores, events, and critical sections. It is also helpful to understand the .NET Framework Monitor and ReaderWriterLock classes of the System.Threading namespace.

Thread Synchronization Mind-Set
Thread synchronization is required when two or more threads might access a shared resource at the same time. A resource can be as simple as a block of memory or a single object, or it can be much more complex, like a collection object that contains thousands of objects inside it, each of which may contain other objects as well. Sometimes it is also necessary to treat multiple objects as a single resource for the purposes of thread synchronization. For example, your application might need to add or remove items to or from a queue and update a count of items in the queue in a thread-safe manner. In this example, the queue and the count data must always be manipulated together in an atomic fashion. For purposes of this text, I would consider the queue object and the count object as a single resource.
No programmer in his right mind wants to think about thread synchronization as he is coding his app. This is because thread synchronization has little to do with making the program do what its true intention is. In addition, writing thread synchronization code is difficult, and doing it incorrectly can lead to resources in inconsistent states causing unpredictable behavior. Furthermore, adding thread synchronization to your code makes the code run slower, hurting performance and reducing scalability. But, unfortunately, thread synchronization is a necessary part of life.

Performance
Thread synchronization code makes your application run slower. In the case where there is only one thread attempting to access a resource, this additional code is pure overhead. So how do you protect performance?
First, don't allow threads to transition from user mode to kernel mode. On Windows, this is a very expensive operation. Second, when a thread enters a wait state, it stops running. Obviously, a thread that stops running is exhibiting very bad performance. Fortunately, a thread can only ever enter a wait state if it calls into the kernel. So, if you follow the first rule (avoid calling into the kernel) then the second rule (don't let your thread enter a wait state) will follow automatically. Third, use as few threads as possible to accomplish your task. In an ideal world, there should never be more threads in existence than there are CPUs in your computer. So, on a machine with four CPUs, there really shouldn't be a need for more than four threads. Why? Because when more than four threads exist, the operating system schedules the threads to the CPUs in a round-robin fashion every few milliseconds. This context switching takes time and is more pure overhead. If the number of runnable threads is equal to or less than the number of CPUs then the operating system just runs the threads and no context-switching occurs; performance is at its best.
Now, we don't live in an ideal world, so we usually see the number of threads in existence far outnumber the number of CPUs in the machine. In fact, on my single-CPU machine I currently have 425 threads in existence—a far cry from the ideal. Windows gives each process its own thread in order to isolate processes from one another, so if one process enters an infinite loop, other processes can still execute. I currently have 45 processes running on my machine, which accounts for 45 threads out of the 425. Why the remaining 380 threads? Many processes also use multiple threads for isolation. For example, the common language runtime (CLR) has a finalizer thread that wants to run in a predictable manner regardless of what some other thread happens to do. There is no way for me to tell how many of the 380 threads are running on my machine for this reason but I'm sure it's quite a few.
The last reason why so many threads exist on my machine is because many (or dare I say most) programmers don't know how to use threads efficiently. Over the years, programmers have been taught how to program in a way that is actually detrimental to building high-performance, scalable applications. And operating systems like Windows have provided APIs, samples, and documentation that continue to perpetuate this. In fact, I have also been guilty of perpetuating this way of programming and I will continue doing it in this column, mostly because it is too late now to change. Changing would require too much re-architecting. However, in the third installment of this column, I will explain how to architect an application for the highest degree of performance and scalability, and this architecture will reduce the number of threads that your application thinks it needs.
If you find yourself asking any of the following questions, then you know that you have most likely architected your application incorrectly and you should re-architect it:
  • What is the maximum number of threads Windows supports?
  • What is the maximum number of threads that I can create in a single process?
  • What is the maximum number of threads that I can have in the thread pool?
  • Can I increase the maximum number of threads that I can have in the thread pool?
  • Should I have a 1-to-1 correspondence between users and threads? For example, I know a few server applications that use one thread per connected user or client.
Let's focus more on the performance of user-mode to kernel-mode transitions. I wrote some code (see Figure 1) that runs a series of tests that measure the performance of various operations.
Running these tests on my personal machine yielded the results shown in Figure 2. Notice that the operations that cause a user-mode-to-kernel-mode transition are very expensive. The time to call Sleep is more than 100 times longer than the time to call an empty method. And the time to call WaitOne (to test a kernel object and always immediately return since timeout is 0) is about 330 times longer than the time to call an empty method!
Now, the first two operations don't actually attempt to do anything that is related to thread synchronization. But, the third operation, Interlocked.Increment, does and because the Interlocked family of methods execute entirely in user mode, you can see that the Interlocked methods hold the key to performing really fast thread synchronization. Unfortunately, though, they are very limited in their capabilities.
The Interlocked methods are usually used to perform atomic, thread-safe operations on Int32 values. There are also Interlocked methods that manipulate Int64, Single, Double, and IntPtr values, as well as references to reference type objects, but these methods are rarely used by anyone and I will not spend time on them here. In fact, the use of Interlocked methods on 64-bit values (Int64 and Double) while running on a 32-bit operating system is strongly discouraged because the CLR offers no way to guarantee that the 64-bit value is aligned properly in memory (and the Interlocked methods require this on some CPUs). In fact, even when running on 64-bit operating systems it is best to avoid the Int64 and Double versions of the Interlocked methods entirely. However, IntPtr and 64-bit object references are guaranteed to be aligned properly and, therefore, you can call the Interlocked methods that work with the Object type successfully.
Here are the prototypes of the Interlocked methods that operate on Int32 values:
public static class Interlocked 
{
    public static Int32 Increment(ref Int32 location);
    public static Int32 Decrement(ref Int32 location);
    public static Int32 Add(ref Int32 location1, Int32 value);
    public static Int32 Exchange(ref Int32 location1, Int32 value);
    public static Int32 CompareExchange(ref Int32 location1, 
        Int32 value, Int32 comparand);
    ...
}
It is important to note that all of these Interlocked methods, except for Increment and Decrement, return the original value that was passed in. Increment and Decrement return the new value rather than the original value.
While the Interlocked methods are great when you want performance, they are limited in functionality. For example, it might be nice to have a way to perform a bit-wise AND, OR, or XOR of two Int32 values in an atomic way. It might also be nice to have methods that could do an atomic bit-test-and-set or bit-test-and-reset, for example. While there are no Interlocked methods for any of these operations, it is possible to build them yourself so that they work atomically. The code in Figure 3 is an example of an Interlocked AND method.Hyper-Threaded CPUs
Hyper-threading is a technology on Xeon, Pentium 4, and later CPUs. These processor chips have multiple "logical" CPUs, each of which can run a thread. Each thread has its own architectural state (set of registers) but all threads share main execution resources such as the CPU cache. When one thread is paused, the CPU automatically executes another thread. A pause can be a cache miss, branch misprediction, waiting for results of a previous instruction, and so on. Intel reports that hyperthreaded CPUs improve throughput somewhere between 10 percent to 30 percent. For more information about hyper-threaded CPUs, see Yaniv Pessach's article on the subject in the June 2005 issue of MSDN Magazine, available at Make It Snappy: Juice Up Your App with the Power of Hyper-Threading.
When executing spin loops on hyper-threaded CPUs, you need to force the current thread to pause so that the other thread has access to the chip's resources. The x86 architecture supports the PAUSE assembly language instruction. The PAUSE instruction ensures that a memory order violation is avoided, improving performance. In addition, the instruction reduces power consumption by placing a hiccup into what would be a very hot, tight loop. On x86, the PAUSE instruction is equivalent to a REP NOP instruction, which makes it compatible on earlier IA-32 CPUs that do not support hyper-threading. PAUSE causes a finite delay (0 on some CPUs). In the Win32 API, the x86 PAUSE instruction is executed by calling the YieldProcessor macro defined in WinNT.h. This macro exists so that you can write unmanaged code in a CPU architecture-independent way. Use of the macro expands the code inline, thus avoiding the overhead of a function call. In the .NET Framework, the Thread class's SpinWait method internally calls the Win32 YieldProcessor macro. In fact, SpinWait is effectively implemented like this on x86 platforms (you should never pass 0 to SpinWait, always pass 1 or greater):
void Thread.SpinWait(Int32 iterations) 
{
   for (Int32 i = 0; i < iterations; i++) YieldProcessor();
}

To atomically AND the low-order byte of an Int32 value, you'd call the method like this:
Int32 x;    // Some Int32 value that is accessed by multiple threads
...
InterlockedEx.And(ref x, 0x000000FF); // This call is in some method
This AND method simply saves the original value of the target (x) in startVal, then it calculates the new desired value (startVal & with) and saves the result in desiredVal. Then, it calls CompareExchange: if the target (x) is equal to startVal, then x gets the desiredVal. CompareExchange returns the value that was in x at the time the method is called. If the value in x was what it was when the method first got called, then the loop exits and the original value of x is returned to the caller. This is the most common case; it is expected that x will not change between the top of the loop and the call to CompareExchange and, therefore, the code in the loop will only execute once and the method will immediately return.
In fact, the only time the loop will execute more than once is if another thread just happens to modify x after the first thread has saved x's value and before it calls CompareExchange. If this contention happens, then the first thread loops around and just attempts to do the operation again. In most applications, it is very unlikely that this contention will actually occur and, therefore, it is very unlikely that the loop will ever actually loop. If contention does occur, then it is very unlikely that there will be contention the second or subsequent times through the loop.
Techniques like this can be used to write additional interlocked operations. An interlocked OR method can be implemented simply by changing the line of code in Figure 3 that performs the & operation to instead perform an | operation. You can also use this technique to build more powerful methods, such as an interlocked test-and-set and an interlocked max and min.

Spin-Wait Lock
It is possible to create a very simple and fast thread synchronization lock class using Interlocked's Exchange method. This type of lock is called a Spin Lock because the thread that is attempting to lock the resource logically associated with the spin lock spins in a loop until it is granted access. Figure 4 shows the code for my SpinLock class; Figure 5 shows some code that demonstrates how to define a class that uses a SpinWaitLock object in order to allow only one thread at a time access to a resource.
Let's say that a thread creates an instance of UseSomeResource and then calls the AccessResource method. AccessResource then attempts to enter the SpinWaitLock field, m_swl. SpinWaitLock's Enter method calls Exchange to atomically set the m_ResourceInUse field to 1 (c_LockIsOwned). If no thread was using this lock, then m_ResourceInUse was 0 (c_LockIsFree) and Exchange returns 0; then the while loop doesn't execute and Enter returns immediately. Wow, this is fast!
However, if the lock was in use, m_ResourceInUse was 1 and Exchange returns 1 and the while loop will execute. In the loop, I check to see how many CPUs are in the machine and if there is more than one CPU, the loop cycles and checks again to see if m_ResourceInUse is 0 or not. Hopefully, the thread that currently "owns" the resource is running on another CPU and the thread will complete its task and call Exit very soon. When a thread calls Exit, Exchange is called to set m_ResourceInUse back to 0. This allows another thread to "own" the resource.
If the number of CPUs is 1, then Enter calls SwitchToThread before looping around again. The reason is that on a single-CPU machine, it is likely that the other thread that "owns" the resource is not running at this time and it will not be able to give up the resource by calling Exit. This means that it is likely that looping will cause more looping and CPU time is now being wasted. The call to SwitchToThread causes the operating system to context-switch to another runnable thread. Hopefully, this will be the thread that owns the resource; the thread will complete its task and call Exit. Note that the Win32® SwitchToThread API is not wrapped with a managed equivalent and, therefore, I must interop out to the unmanaged API. SwitchToThread is ideal because it allows another thread to run for one time slice. This thread can even be a lower-priority thread. For more information see SwitchToThread in the Win32 SDK documentation.
Let me point out some other facts about the SpinWaitLock class. Spin locks are extremely fast. When there is no contention, Enter simply changes a value from 0 to 1 in a thread-safe way and returns; Exit sets the 1 back to a 0 and returns. When there is contention, StallThread is called.
The purpose of StallThread is to increase efficiency when contention on the SpinWaitLocks object is detected. On a multi-processor machine, the thread will call Thread.SpinWait, which causes the thread to remain in user mode; it will not transition to kernel mode and it will never enter a wait state. Thread's SpinWait method was added to support hyper-threaded CPUs. If your code is running on a machine with hyper-threaded CPUs, this method kicks the other thread so that it starts running (for more information about hyper-threaded CPUs, see the sidebar "Hyper-Threaded CPUs"). When there is contention on a single-processor machine, I do force the thread to transition to kernel mode (by calling SwitchToThread) for if I didn't, CPU time would be wasted as the thread spun without any hope of finding the lock released.
It's impossible to know for sure if spinning in user mode would yield better performance than transitioning to kernel mode, context-switching to another thread, and returning back to user mode. It depends on how much time the other thread needed to complete its task and this information cannot be determined. Furthermore, instrumenting your code in order to attempt to determine it would negatively impact performance, which is what you're trying to improve. There is just no way to guarantee a win with this. Also, there are many factors that impact this: CPU make and model, CPU speed (GHz), hyper-threading, dual-core, other threads running on the system, and so forth.
The SpinWaitLock class makes use of two static methods on the Thread class new to the Microsoft® .NET Framework 2.0: BeginCriticalRegion and EndCriticalRegion. These methods are important for any custom locking mechanism, especially when used in custom CLR hosts. For an in-depth look at these APIs, see Stephen Toub's article on reliability in the .NET Framework in this issue of MSDN®Magazine.
The SpinWaitLock class should only be used for tasks that are known to be short and of a finite period of time. If a wait is known to be long or is of unknown length, threads that want to own the resource should be in an efficient kernel mode wait state so that the CPU can run another thread (possibly the thread that currently owns the lock). And, if a thread is going to wait a long time, then the performance of the user-mode-to-kernel-mode transition doesn't matter anyway.
SpinWaitLocks are unfair. If multiple threads are calling Enter at the same time, the threads are not guaranteed to be serviced in a first-in-first-out (FIFO) order. In fact, if the lock is extremely hot (meaning it is accessed frequently), it is possible (but unlikely) that some threads may not get serviced at all. I will talk more about fairness in an upcoming column.

Send your questions and comments for Jeffrey to  mmsync@microsoft.com.


Jeffrey Richter is a cofounder of Wintellect, a training and consulting firm. He is the author of several books, including Applied Microsoft .NET Framework Programming (Microsoft Press, 2002). Jeffrey is also a contributing editor to MSDN Magazine and has been consulting with Microsoft since 1990.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.
Page view tracker