Make It Snappy

Juice Up Your App with the Power of Hyper-Threading

Yaniv Pessach

This article discusses:
  • The behavior of hyper-threaded CPUs
  • Performance on hyper-threaded computers
  • Optimizing your application for hyper-threading
  • CPU caches and affinity
  • Concurrency patterns suitable for hyper-threading
This article uses the following technologies:
C#, .NET Framework, Win32

Contents

The Importance of Hyper-Threading
The Windows View
Client Versus Server
Being Hyper
Two Heads Are Faster Than One
Hit and Miss
Locking
False Sharing and Cache Aliasing
Ah, the Competition
To Serve and Hyper-Thread
CPU Affinity
Producer/Consumer
Detecting Hyper-Threading
Conclusion

Hyper-threading technology improves CPU efficiency by allowing the processor to execute two instruction streams concurrently. This feature, found on newer Intel Pentium 4 processors, can typically improve the performance of apps by 20 to 30 percent, boosting some apps by up to 40 percent.

Unfortunately, other applications see no performance benefit, and a few can experience significant performance degradation (I've seen application degradation of as much as 20 percent), especially when violating recommended performance best practices such as the ones discussed in this article. Moreover, given the concurrency this feature introduces to single-processor computers, some multithreading bugs that may only manifest themselves on multiprocessor machines are likely to show themselves on a hyper-threaded single-CPU machine, as well.

In this article, I'll explore the hyper-threading technology and demonstrate how adding parallelism to your code can improve performance on hyper-threaded machines. I'll cover advanced optimizations for hyper-threading and will show you a few useful patterns. My code samples are in C#, but the same principles apply to both managed and unmanaged applications.

The Importance of Hyper-Threading

Hyper-threading technology is available on newer Pentium 4 CPUs, often on computers with clock speeds of 2.4 GHz and up, as well as on all Xeon CPUs with speeds of 2.2 GHz and up. This feature is enabled by default in most, but not all, machines that have the capability.

Before hyper-threading, the Pentium 4 had to execute one instruction stream at a time. The CPU maintained a single instruction pointer (IP), and all multitasking on a single-CPU machine relied on the operating system's ability to allocate timeslices of the CPU to threads. Hyper-threading enables the Pentium 4 to execute two instruction streams simultaneously; the CPU maintains two instruction pointers and machine states and invisibly switches execution between them.

When a machine instruction from a program is executed, one or more execution slots will be needed. An execution slot corresponds to an internal CPU resource, such as the floating point unit (FPU), and represents the CPU's ability to perform an operation, such as a floating point calculation.

The Pentium 4 CPU always knows which execution slots are available. Features such as Out-Of-Order Execution (where the CPU can try to execute instructions in a different order than they appear in the instruction stream) enhance performance by keeping more of the execution slots utilized, as opposed to waiting (known as stalling) until a slot becomes available.

By giving the CPU the freedom to switch between the two execution streams, more of the CPU's execution slots can be used. When one instruction stream is stalled on access to the main memory or needs an execution slot that is in use, the CPU can execute instructions from the second instruction stream while the memory for the first stream is being fetched or until the execution slot becomes available.

In a hyper-threaded CPU, two instruction streams are executed at the same time. In reality, the CPU executes a few instructions from the first instruction stream, then a few from the second instruction stream, switching based on an algorithm that tries to optimize execution slot availability and to minimize the impact of stalls while trying to distribute resources fairly between execution streams. Execution context (such as register values) is kept with the instruction stream. The net result of all of this is that a hyper-threaded CPU emulates two physical CPUs, each of which is referred to as a logical CPU.

Hyper-threading can result in better utilization of CPU resources and therefore better performance. The flip side is that those CPU resources are indeed shared, so when both streams of instructions require the same shared resource, performance can suffer.

In addition to shared resources such as the FPU and the arithmetic logic unit (ALU), space inside the CPU cache (divided into level 1 and level 2 caches, known as L1 and L2), which caches recently accessed RAM areas, is shared. The L2 cache space is not partitioned between the logical CPUs. Sharing the cache space can either help or hinder performance, as I'll discuss a little later.

If there is no second execution stream, all cache space and other resources are made available to the one execution stream. Because the multiple instruction stream mode carries a slight performance cost, the CPU only uses it when instructions are scheduled to execute on both logical CPUs. Also, on a computer with a single processor, enabling hyper-threading entails using the multiprocessor kernel, which handles synchronization better but comes at a small performance cost.

In the ideal case, each instruction stream would execute almost as fast as it would if the CPU were running only one instruction stream. In reality, however, it's rare that two instruction streams could operate without frequently requiring access to shared resources, and as a result each instruction stream runs noticeably slower than that ideal. If an application's work is split in two in an intelligent manner, however, hyper-threading can still result in improved overall application performance.

The Windows View

Under recent versions of Windows®, each execution stream is exposed as a logical CPU, so the operating system schedules thread execution on each logical CPU. This means that on a hyper-threaded machine with one physical CPU, and thus two logical CPUs, when two threads are executing they execute at the same time, each on a separate logical CPU.

To the user, the machine appears to have two CPUs. I will use the terms "logical1" and "logical2" as shorthand for the first and second logical CPU exposed by a specific physical CPU.

On Windows XP and later, the scheduler is hyper-threading-aware. When multiple threads are executing concurrently, they will be scheduled on both logical CPUs, rather than with alternating timeslices as is the case with a non-hyper-threaded single CPU. This can lead to noticeable performance improvements and increased responsiveness on hyper-threaded systems.

A dual-CPU machine with hyper-threading is more complex; the operating system first schedules threads on separate physical CPUs before scheduling them on logical CPUs belonging to the same physical CPU. This means that with only two threads running, each thread is assigned to its own physical processor rather than being hyper-threaded onto the same CPU. That said, the rest of this article focuses on single-CPU machines with hyper-threading enabled.

Client Versus Server

The key to optimizing applications for hyper-threading is ensuring that both logical CPUs are performing useful operations. To improve performance further, I'll discuss ways to optimize the division of labor between threads.

The discussion here is relevant both to client and server applications. In a server, sometimes the straightforward division of labor makes the most sense. For example, if you are developing an ASP.NET application or a Web service expecting many requests per second, the usual multithreaded design would ensure that multiple threads are executing at the same time. For this reason, existing server applications tend to show more of a performance increase when hyper-threading is enabled.

Applications that need speedy I/O (network and disk) access benefit from splitting operations into multiple threads or from using asynchronous programming models. For client or server applications, where each request is CPU-intensive, additional gains can be made by splitting work into multiple threads (this is contrary to applications running on a non-hyper-threaded single CPU computer, where splitting a CPU-intensive task up into multiple threads will yield little benefit and may actually hurt performance due to the overhead of context switching). Client applications where work is done in a separate thread from the UI thread (almost always a good idea for usability reasons) show even better responsiveness on hyper-threaded machines.

Being Hyper

Multithreaded applications are likely to show increased performance and responsiveness once hyper-threading is turned on. I'll discuss several approaches you can take to maximize performance. Each approach has downsides, however, so measuring your performance is necessary when evaluating whether an approach helps.

A good place to start understanding the potential performance gains of your code is by measuring execution time on both hyper-threaded and non-hyper-threaded machines (or you can test on the same machine provided that its BIOS allows you to turn off hyper-threading technology). A nice side effect of programming for hyper-threading is that most of it will improve application performance when run on a dual-processor machine, as well. I'll point out which approaches are specific to hyper-threading and will not work as well on a dual- or multi-processor machine. In general, applications that scale well to multiple processors would do well in a hyper-threaded environment.

Good places to start looking for CPU-intensive operations are pieces of code that render and draw graphics, encode or decode music or video, compress or encrypt data, or execute in tight loops. You can start by profiling your application in order to identify those CPU-intensive operations.

A useful tool is the Task Manager (TaskMgr.exe), shown in Figure 1. Under the Performance tab, you can see each logical CPU's utilization shown separately. You should ensure that Task Manager is configured to show one graph per CPU (View | CPU History | One Graph Per CPU).

Figure 1 Dual-Processor with Hyper-Threading Enabled

Similar information is available by checking the processor's "% Processor Time" performance counter either for all CPUs or for each logical CPU separately. PerfMon.exe is a handy tool to capture this information.
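
If you prefer to capture these numbers from code, the same counter is exposed through the System.Diagnostics.PerformanceCounter class. The following is a minimal sketch (not one of the article's original samples); it assumes the per-CPU counter instances are named "0", "1", and so on, which is the usual convention you can verify in PerfMon:

// Sketch: sample "% Processor Time" for each logical CPU.
// Assumes instance names "0", "1", ...; verify them in PerfMon.
using System;
using System.Diagnostics;
using System.Threading;

class CpuSampler
{
    static void Main()
    {
        int cpus = Environment.ProcessorCount;    // logical CPUs
        PerformanceCounter[] counters = new PerformanceCounter[cpus];
        for (int i = 0; i < cpus; i++)
        {
            counters[i] = new PerformanceCounter(
                "Processor", "% Processor Time", i.ToString());
            counters[i].NextValue();              // first sample is always 0
        }
        Thread.Sleep(1000);                       // let a full sample accumulate
        for (int i = 0; i < cpus; i++)
        {
            Console.WriteLine("CPU {0}: {1:F1}%", i, counters[i].NextValue());
        }
    }
}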

Sometimes, measuring only CPU utilization does not identify all CPU-intensive operations, since it may not show hot spots or short durations of CPU activity, but a CPU profiler can help identify those areas (TaskMgr and PerfMon tend to average out the CPU utilization over a period of a second, so they miss hot spots of shorter duration). The CLR profiler is useful for identifying methods that use excessive CPU, as are other profilers such as VTune, and the profiling tools that are available with the Visual Studio® 2005 Team System.

It's important to note that, for both Task Manager and PerfMon, CPU utilization percentages can be misleading. On a hyper-threaded single-processor machine, a CPU-bound operation executing in one thread will show a total utilization of only 50 percent. Due to the performance interdependencies between the logical CPUs, however, it is unlikely that utilizing both logical CPUs will result in double the performance.

For the rest of the article, I will assume two threads (T1 and T2) running on two logical CPUs (LC1 and LC2) on one physical CPU.

Two Heads Are Faster Than One

Suppose your client application performs some CPU-intensive operation, such as making complex calculations over a set of data. You notice that when the user initiates the operation, the calculation is running for three seconds, and then the results are available. Obviously, you want the wait to be as short as possible.

Checking the CPU utilization while the application is running is a good start, but profiling tools (such as the CLR profiler) can help you identify which part of the application is consuming the CPU and needs optimizing.

If the user is running on a hyper-threaded machine, one solution would be to split the work into two threads executing at the same time. This would cause both logical CPUs to be used. You will also need to wait until both threads are completed. Note that this approach assumes that the processing of each segment takes the same amount of time; if this assumption isn't fulfilled, the CPU hyper-threading capabilities will not be fully utilized in the latter part of the execution.

To be more generic, you can split CPU-intensive operations into multiple tasks and threads of execution, where the number of tasks and threads is the same as the number of logical processors (more threads are needed for operations involving I/O). For example, on a dual-processor computer with hyper-threading enabled, you might split a set of data to be processed into quarters, with a separate thread processing each part. Splitting up the work in this fashion decreases the need for locking, assuming that each task can be processed independently of the others. In general, lock contention degrades performance considerably on multiple CPU machines as well as on hyper-threaded machines, and as a result should be avoided whenever possible.
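
To make the idea concrete before the two-thread example in Figures 2 and 3, here is a rough sketch of computing one contiguous range per logical processor. The recordCount variable is a hypothetical element count, and each range would then be queued to the thread pool much as in Figure 3:

// Sketch: divide [0, recordCount) into one range per logical CPU.
// recordCount is a placeholder; queue each range as in Figure 3.
int workers = Environment.ProcessorCount;        // number of logical CPUs
int chunk = (recordCount + workers - 1) / workers;
for (int w = 0; w < workers; w++)
{
    int first = w * chunk;
    int last = Math.Min(first + chunk, recordCount) - 1;
    if (first > last) break;                     // fewer records than workers
    // queue a work item that processes elements first..last
}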

Figure 2 shows a code sample that doesn't make use of threading and that does some numerical processing on an array. One simple way to split this workload is to split the array in half and then process each half in separate threads running in the .NET ThreadPool, as shown in the code sample in Figure 3. The threads signal their respective ManualResetEvent when they complete their work, while the main thread waits for both events to be signaled (in the WaitHandle.WaitAll call).

Figure 3 Using the ThreadPool

public class ThreadWorkItem
{
    public int Result;
    public int FirstElement;
    public int LastElement;
    public ManualResetEvent ManualEvent;
}

void DoWork2(int firstElement, int lastElement, ref int result)
{
    ThreadWorkItem wi1 = new ThreadWorkItem();
    wi1.FirstElement = firstElement; 
    wi1.LastElement = (firstElement + lastElement)/2;
    wi1.ManualEvent = new ManualResetEvent(false);

    ThreadWorkItem wi2 = new ThreadWorkItem();
    wi2.FirstElement = wi1.LastElement+1;
    wi2.LastElement = lastElement;
    wi2.ManualEvent = new ManualResetEvent(false);

    ThreadPool.QueueUserWorkItem(new WaitCallback(DoWorkCallback), wi1);
    ThreadPool.QueueUserWorkItem(new WaitCallback(DoWorkCallback), wi2);

    WaitHandle.WaitAll(new ManualResetEvent[]{
        wi1.ManualEvent, wi2.ManualEvent});
    result = wi1.Result + wi2.Result;
}

void DoWorkCallback(object state)
{
    ThreadWorkItem wi = (ThreadWorkItem)state;
    DoWork(wi.FirstElement, wi.LastElement, out wi.Result);
    wi.ManualEvent.Set();
}

Figure 2 CPU-Intensive Code on One Thread

public int[] records;

void DoWork(int firstElement, int lastElement, out int result)
{
    result = 0; 
    for (int i= firstElement;i<=lastElement;i++) 
    {
        int tempvaluei = records[i];
        for (int j=0; j<1000; j++)
        {
            tempvaluei = (tempvaluei * 3 / 2)  + 
                ((tempvaluei * tempvaluei) % 70) + 1; 
        }
        result += tempvaluei;
    }
}

The DoWork2 method will only be faster when the cost of the calculation is higher than the overhead of creating the WorkItems, waiting on the events, and all of the other processing required to execute these tasks asynchronously.
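
When in doubt, measure. A small timing harness like the following sketch, using the Stopwatch class (System.Diagnostics) from the .NET Framework 2.0, can show whether the split pays for its overhead at your data sizes. The array length here is an arbitrary assumption, and the code is assumed to live in the same class as DoWork and DoWork2:

// Sketch: time the single-threaded and two-thread versions.
// The array length is arbitrary; vary it on your own hardware.
records = new int[100000];
int result;

Stopwatch sw = Stopwatch.StartNew();
DoWork(0, records.Length - 1, out result);
sw.Stop();
Console.WriteLine("DoWork:  {0} ms", sw.ElapsedMilliseconds);

sw = Stopwatch.StartNew();
DoWork2(0, records.Length - 1, ref result);
sw.Stop();
Console.WriteLine("DoWork2: {0} ms", sw.ElapsedMilliseconds);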

Of course, other ways to split the work are possible. For this example, both threads could iterate over all the elements, each performing part of the calculation, rather than each thread performing the whole calculation for only half of the elements. Which solution proves faster often depends on the use of the CPU cache.

Splitting the work into two threads involves overhead: data management overhead and overhead associated with thread creation, management, and switching. Not every operation requires the execution (and coding) overhead that this approach requires. Before splitting a task into multiple threads, you should consider the cost of threads and synchronization.

Hit and Miss

One key element when optimizing your code for hyper-threading is making the best use of the shared CPU cache. Compared to the speed of today's CPUs, RAM is slow. For data-intensive operations, often the time it takes to access memory is responsible for much of the CPU time used. This is because most CPU instructions (such as an integer addition) take one cycle, while main memory access can take as many as a few hundred cycles.

To address this, the Pentium 4 has a two-level memory caching system. An L1 (level 1) data cache will typically hold 8KB, while an L2 cache (level 2) might hold 512KB, but cache sizes increase with CPU releases, and some newer CPUs include an L1 of 16KB and an L2 of 2MB. When the processor requires information in memory, if that data is already in the L1 cache (known as an L1 cache hit), access to it takes on the order of 4 cycles. If the information required is in the L2 cache, access takes on the order of 10 cycles. If the request must go all the way to RAM, this cache miss may result in the operation requiring 100 cycles or more. And remember, if the information is not in physical memory and has to be brought in (paged) from disk, the wait time is equivalent to millions of CPU cycles.

In the Pentium 4, caching of RAM takes place in chunks called cache lines. The cache line size on all hyper-threading-enabled CPUs is 64 bytes, and during memory reads, up to two cache lines are fetched. Placing items that are often referenced together nearby in memory (generally referred to as locality) usually improves performance because there's a better chance that the latter memory accesses will be satisfied by the cache.

As you probably know, the common language runtime (CLR) already improves locality of reference. If you allocate objects consecutively, they will be allocated in consecutive addresses on the managed heap, which makes accessing them in sequence faster. And when a garbage collection is performed, the garbage collector compacts its heaps in order to keep objects closer together. (The CLR is also hyper-threading and cache-size aware, and performs additional optimizations to take advantage of the specific hardware it is running on.) The locality of your data becomes even more important the faster CPUs get. It's a good habit to allocate data together that will be used together, especially if those objects have similar lifetimes.

Along these lines, performance of multithreaded applications running on hyper-threaded processors is often improved when multiple threads read from a common area of memory. If one thread loads information from main memory into the L2 cache, later requests by a second thread for that same information may be satisfied by the L2 cache. The information is not stored twice in the L2 cache, which frees up valuable cache space. (This usually doesn't happen with multiple physical processors, since caches are typically not shared between CPUs, so don't rely on this behavior on multiprocessor machines unless you ensure, as I'll show later, that both threads run on the same physical CPU.) Figure 4 shows an example that might benefit from these cache efficiencies when multiple threads are executing the DoWork method. The array used can be a common memory table, such as a translation table in this case.

Figure 4 Sharing Read-Only Data

public int[] records = ...;
public int[] arr = ...;

void DoWork(int firstElement, int lastElement, out int result)
{
    result = 0; 
    for (int i=firstElement; i<=lastElement; i++) 
    {
        int tempvaluei = records[i];
        for (int j=0; j<1000; j++)
        {
            tempvaluei = arr[(tempvaluei * 3 / 2 + 1) % arr.Length]; 
        }
        result += tempvaluei;
    }
}

Elements from the array are more likely to already be in the CPU cache when another thread has recently accessed it. However, caching easily calculated values in tables (such as when arr[i] = Math.Pow(r,i)) may actually be slower than just recalculating the value, so as always make sure to profile your applications.

Sharing writable data between threads is a different story. To maintain information consistency, access to the shared data needs to be protected with locks. If you don't use locks, you'll get synchronization bugs. Many synchronization bugs that manifest only rarely on a single logical CPU machine can happen much more frequently on a hyper-threaded machine or in a multiple processor machine. Consider the code sample in Figure 5, where two threads update a counter at the same time.

Figure 5 Synchronization Bugs

public class ThreadWork 
{
    public static int GlobalCounter;

    private ManualResetEvent manualEvent;
    private int countTo;

    public ThreadWork(ManualResetEvent manualEvent, int countTo)
    {
        this.manualEvent = manualEvent;
        this.countTo = countTo;
    }

    public void DoWork()
    {
        for(int i=0; i<countTo; i++)
        {
            ThreadWork.GlobalCounter++;
        }
        manualEvent.Set();
    }
}

class TestCounters
{
    public static void TestCounter()
    {
        int numThreads = 10;
        int countTo = 100000;
        ManualResetEvent[] events = new ManualResetEvent[numThreads];
        Thread[] threads = new Thread[numThreads];

        ThreadWork.GlobalCounter = 0;
        for (int j=0; j<numThreads; j++)
        {
            events[j] = new ManualResetEvent(false);
            ThreadWork tw = new ThreadWork(events[j], countTo);
            threads[j] = new Thread(new ThreadStart(tw.DoWork));
        }
        for (int j=0; j<numThreads; j++)
        {
            threads[j].Start();
        }

        WaitHandle.WaitAll(events);

        Console.WriteLine("Expected:{0} Actual:{1}",
            numThreads * countTo, ThreadWork.GlobalCounter);
    }
}

The operation ThreadWork.GlobalCounter++ is actually composed of three operations: fetching the value stored in GlobalCounter, incrementing it by one, and storing it back. If the second thread fetches GlobalCounter in the middle of the update operation, it would see the old value, which might result in GlobalCounter being incremented only once.

In a single-processor computer, threads do not actually execute concurrently; each gets a timeslice and continues execution until it issues a blocking operation (such as file I/O) or its allocated timeslice ends. It is less likely, then, that a thread switch would occur in the middle of the update operations. In a hyper-threaded machine (or in one with multiple physical CPUs), the operations can execute at the same time, so the likelihood of this happening is higher. For example, the code in Figure 5 produced the right results on my non-hyper-threaded machine, but got the values off by 20 percent on my hyper-threaded machine. That's not good.

An additional source of hard-to-detect bugs on hyper-threaded and multiprocessor machines is relying on thread priority for locking. Some programs use threads with different priorities and rely on the lower-priority thread not executing while the higher-priority thread is running. This assumption is unsafe even on a single-CPU system, since the higher-priority thread can block (when performing I/O, and periodically due to the starvation-prevention logic in the Windows scheduler). On a hyper-threaded system, however, the CPU executes the top two ready threads, not just one (on a dual-processor system, the top two ready threads each execute on a separate CPU). Also, on a hyper-threaded system, a lower-priority thread can compete for CPU resources with the higher-priority thread (both are executed as separate instruction streams, and the CPU is not aware of thread priority), so background threads that perform polling, lower-value operations, or "busy waiting" can have a negative impact on performance.

Locking

Pay special attention to how you implement locking when your code will be run on a hyper-threaded or multiple-CPU machine. A simple fix for the code in Figure 5 would be to use Interlocked.Increment(ref ThreadWork.GlobalCounter) rather than ThreadWork.GlobalCounter++. More complex structures would need more complex synchronization primitives such as Monitor, Mutex, and Semaphore.
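
Applied to the DoWork method from Figure 5, that fix is a one-line change (shown here only as a sketch; as discussed next, hoisting the shared update out of the loop entirely is faster still):

public void DoWork()
{
    for(int i=0; i<countTo; i++)
    {
        // Atomic read-modify-write; correct on hyper-threaded and
        // multiprocessor machines, but slower than a local increment.
        Interlocked.Increment(ref ThreadWork.GlobalCounter);
    }
    manualEvent.Set();
}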

Locks slow down execution in two ways: lock entry and exit take time, and if a lock results in a thread blocking, a thread context switch normally occurs. What's more, in a hyper-threaded system, while one of the threads running on the processor is blocked, the L2 cache may be underutilized. Minimizing the number of locks, as well as the time that locks are held, can improve performance. You can minimize locking by having each thread update a separate instance (as I did in Figure 3, keeping wi1.Result and wi2.Result separate) and only combining the results after the calculations are complete. Additionally, you can make it less likely that threads will block by minimizing the amount of time each lock is held.

If your code can be restructured with a minimum of shared writable data, or if updating shared data can be deferred, performance may benefit. Despite the use of the more expensive lock statement in Figure 6 (which the C# compiler transforms into calls to Monitor), removing the shared memory reference and the interlocked operation from inside the loop speeds up the code. Even though Interlocked.Increment is considerably faster than Monitor.Enter, it is much slower than incrementing a local variable, and it is slower on hyper-threaded and multiprocessor systems than on a single processor.

Figure 6 Implementing Locking

private object syncObj = new object();

public void DoWork()
{
    int counter = 0;
    for(int i = 0; i<countTo; i++)
    {
        counter++;
    }
    lock(syncObj)
    {    
        ThreadWork.GlobalCounter += counter;
    }
    manualEvent.Set();
}

False Sharing and Cache Aliasing

Avoid modifying data shared between threads. Maintaining a local copy of the results per thread and merging the results when thread execution is complete can help. This is especially helpful if the modified data is limited in size, since increasing the application's working set also impacts performance. Earlier I mentioned cache lines and locality. Assume the existence of the following class:

class OClass 
{ 
    public int Field1;
    public int Field2;
}

If one thread is reading data from an instance of OClass by referring to Field1, and another thread on the same CPU is writing to nearby memory by assigning to Field2, it is possible (if Field1 and Field2 are close enough) that Field1 and Field2 will share a cache line. This will degrade performance. In machines with multiple physical CPUs, this problem is even worse.

Another best practice is to avoid writing to memory addresses sharing a cache line from two threads at the same time. A common workaround is padding. If OClass had additional fields between Field1 and Field2, the cache sharing might not occur. This problem is often seen when accessing two instances that were allocated sequentially, because the CLR will place them in adjacent memory locations. This generally improves performance since objects allocated together are generally used together, and an adjacent object is likely to be in the cache after the first object is accessed. But when multiple threads try to access fields from those objects, they may share a cache line, which would slow performance—this is called "false sharing". Padding the class (by adding fields) can help, or you can ensure that those elements are not allocated one after the other.

That said, don't just go adding unused fields to your classes in hopes that it'll improve performance. In general, it won't, and it'll make for sloppy code that's hard to maintain. I mention these issues because it's important to understand what's happening in the system under the covers, and to know where to look when problems arise. If you choose, after measuring and profiling your application, to add unused fields to a specific class, consider using an explicit layout to ensure that the fields are not reordered by the compiler. Also consider the effect of garbage collection—if the objects O1, O2, and O3 were allocated in sequence, O1 and O3 started out not sharing a cache line. But if no references to O2 exist, O2 can be garbage collected and O1 will now be adjacent to O3.
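
If profiling does point at false sharing in a specific class, one way to add padding is with an explicit layout. The sketch below is illustrative only: the 64-byte offset is an assumption based on the cache line size discussed earlier, and the object's starting address is still not guaranteed to be aligned to a cache line boundary:

// Sketch: keep Field1 and Field2 at least one cache line apart.
// Requires System.Runtime.InteropServices; offsets are illustrative.
[StructLayout(LayoutKind.Explicit)]
class OClass
{
    [FieldOffset(0)]  public int Field1;
    [FieldOffset(64)] public int Field2;   // pushed past the 64-byte line
}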

Another consideration is a set of effects known as the aliasing problem. A cache capacity aliasing problem occurs due to the CPU cache architecture, known as an N-way set-associative cache. Data in separate memory areas (separated by multiples of 2KB in the case of an 8KB L1 cache, and multiples of 64KB in the case of a 512KB L2 cache, 128KB for a 1MB L2 cache, and so on) share a cache line location and compete for n cache spots (n is usually 2, 4, or 8). For example, data at the addresses 0x00010, 0x08010, 0x38010, and 0x90010 (addresses are in base 16) are separated by multiples of 32KB (0x8000), and would share an L2 cache location. This means that if more than n of those addresses are repeatedly accessed (by one or more threads), some of them will have to be read from main memory, reducing performance. Additional aliasing problems exist in all but the newest Pentium 4 processors when memory addresses are separated by multiples of 64KB.

This isn't just an issue for unmanaged code; it affects managed code as well. If you have an array of C# integers, each of which occupies 4 bytes, and the flow of your program has thread A accessing element [i] while thread B accesses element [i + 16K], the memory areas accessed will be 64KB apart. Avoid placing data referenced by concurrent threads 64KB (or a multiple of 64KB) apart in memory. There are some less common CPU resource issues as well; in general, for CPU-intensive operations, try limiting the number of concurrent threads sharing any data to 4 to 8. More threads can be used if the thread operations perform a significant amount of I/O or other blocking operations.

Since CPU cache entries are not tagged to one logical CPU, it is possible for a thread running on logical CPU 1 to access enough memory to kick the data needed for a second thread running on logical CPU 2 off the cache. If the threads in your application access large chunks of memory and perform little additional work (such as copying large blocks of memory or summing up the values in an array of integers) running multiple threads on the same physical CPU might not improve performance, and can even hurt performance. For such applications, measure the performance with and without hyper-threading enabled.

Ah, the Competition

With hyper-threading, not all CPU operations are created equal. As mentioned earlier, to optimize performance under hyper-threading you should minimize competition for shared CPU resources between the threads. The most commonly contended resource besides the CPU cache is the FPU, responsible for floating point calculations. It turns out that executing heavy floating point operations simultaneously on both logical CPUs reduces performance significantly.

If you modify the calculation in the code sample from Figure 2 to use floating point operations (shown in Figure 7), the gains from splitting the task over multiple threads will diminish. If the threads do little but floating point operations, the inefficient use of CPU resources can even result in a slowdown of up to 40 percent.

Figure 7 Floating Point Calculations

public float[] records;

void DoWork(int firstElement, int lastElement, out float result)
{
    result = 0; 
    for (int i=firstElement; i<=lastElement; i++) 
    {
        float tempvalue = records[i];
        for (int j=0; j<1000; j++)
        {
            tempvalue = (float)(Math.Sin(tempvalue) * 3 / 2) + 
                ((tempvalue * tempvalue) % 70) + 1; 
        }
        result += tempvalue;
    }
}

Note that the Pentium 4 has only one FPU, so there is little to gain in trying to schedule execution of floating point operations on both logical CPUs. Performance might be better if the floating point operations are kept to one thread per physical CPU.

To Serve and Hyper-Thread

Server applications that process numerous simultaneous requests will often perform each operation in a separate thread, utilizing all logical CPUs, often by queuing work items to a thread pool. As a result, many server applications (such as ASP.NET applications) may benefit from hyper-threading without any code changes. One area to consider when optimizing server apps for hyper-threaded machines is the shared data problem I discussed earlier.

With ASP.NET, consider the Web garden configuration. In the default configuration for Web gardens, a process that has affinity to a logical CPU maintains its own copy of the data and services a separate set of requests. Since the ASP.NET Web garden configuration is based on logical CPUs and not physical CPUs, it is likely that all logical CPUs will be utilized, but no shared caching between the logical CPUs on one physical CPU will be realized.

CPU Affinity

A thread is normally scheduled to execute on any available logical CPU in the system. Sometimes, though, you want more precise control over which CPU will execute a thread. You can get this through affinity.

The CPU affinity is a bitmask representing which CPUs a thread or a process is allowed to execute on. This information can be retrieved by P/Invoking to the Win32® API GetProcessAffinityMask:

[DllImport("kernel32.dll", SetLastError=true)]
static extern int GetProcessAffinityMask (int hProcess, 
    ref int lpProcessAffinityMask, ref int systemAffinityMask);

The returned value of the third parameter, systemAffinityMask, is a bitmask representing which logical CPUs are present on the machine. In all current versions of Windows, the logical CPUs are numbered 0, 1, 2, and so on.

On a single physical CPU machine without hyper-threading, only the lower bit in systemAffinityMask will be set, resulting in the value 1. On a single physical CPU machine with hyper-threading, or a dual physical CPU machine without hyper-threading, both lower bits will be set, and systemAffinityMask will be 1+2=3. A dual physical CPU machine with hyper-threading would show systemAffinityMask as 1+2+4+8=15.

When hyper-threading is enabled, the operating system will assign a number to each logical CPU. In a system with n physical CPUs, the first physical CPU will be exposed as logical CPU 0 and logical CPU n. The second physical CPU will be exposed as logical CPU 1 and logical CPU n+1. Physical CPUs [A, B] would then be exposed as logical CPUs [0(A), 1(B), 2(A), 3(B)]. Using the returned systemAffinityMask is a simple way to count the number of logical CPUs, simply by counting the number of set bits. (This information is also available from several other Windows API calls. However, using the process affinity mask would tell you how many logical CPUs are available to your process, and is preferable.)
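
Counting the set bits takes only a few lines. This helper is a sketch; you can pass it either the system or the process affinity mask:

// Sketch: count set bits to get the number of logical CPUs in a mask.
static int CountLogicalCpus(int affinityMask)
{
    uint mask = (uint)affinityMask;   // avoid sign extension while shifting
    int count = 0;
    while (mask != 0)
    {
        count += (int)(mask & 1);
        mask >>= 1;
    }
    return count;
}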

The sample shown in Figure 8 shows how to set a process affinity mask. By setting the process affinity mask to only one of the logical CPUs for each physical CPU, the application will behave as if hyper-threading was disabled.

Figure 8 Process Affinity Mask

public void SetProcessAffinityToPhysicalCPUForHyperthreadOnly(
    int processid)
{
    int res;
    int hProcess;
    int ProcAffinityMask = 0, SysAffinityMask = 0;
    hProcess  = OpenProcess(PROCESS_ALL_ACCESS, 0, processid);
    res = GetProcessAffinityMask(
        hProcess, ref ProcAffinityMask, ref SysAffinityMask);
    if (SysAffinityMask == 3) // 1 proc, 2 logical CPUs
        res = SetProcessAffinityMask(hProcess, 1);
    else if (SysAffinityMask == 15)  // dual proc, 4 logical CPUs
        res = SetProcessAffinityMask(hProcess, 3);
    res = CloseHandle(hProcess);
}

There is one caveat, however. The code in Figure 8 assumes that hyper-threading is enabled. The process referred to, and all processes spawned from it, will run only on the lower-numbered logical CPUs. This approach can be used to get an approximation of the performance of an application with hyper-threading disabled without having to reconfigure hyper-threading on the machine. It can also be useful for trying different performance solutions, but testing your application on both hyper-threaded and non-hyper-threaded machines should still be considered a requirement.

If the process affinity mask (returned in processAffinityMask) differs from the system affinity mask, threads in the process can only run on the selected logical CPUs. The mask can be changed from the Task Manager user interface or by calling the SetProcessAffinityMask function. To change the process affinity mask in Task Manager, right-click a process on the Processes tab and select Set Affinity.
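
From managed code you can accomplish the same thing without P/Invoke through the System.Diagnostics.Process.ProcessorAffinity property. This short sketch is the managed equivalent of the single-CPU case in Figure 8:

// Sketch: restrict the current process to logical CPU 0 (mask = 1).
// ProcessorAffinity is an IntPtr bitmask, like the Win32 masks above.
using System;
using System.Diagnostics;

class AffinityExample
{
    static void Main()
    {
        Process current = Process.GetCurrentProcess();
        current.ProcessorAffinity = (IntPtr)1;    // logical CPU 0 only
        Console.WriteLine("Affinity: {0}", current.ProcessorAffinity);
    }
}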

Similarly, calling SetThreadAffinityMask can limit a thread to run only on certain CPUs. Never change the affinity of a ThreadPool thread since the thread will be reused by the ThreadPool. This approach would be inadvisable in hosted situations. Also, this assumes that the managed thread you're running on is tied to an underlying OS thread, which may not always be the case. For example, if your code is running as a stored procedure in SQL Server™ 2005, it's possible for a managed thread to be mapped to a fiber, and you should avoid changing or relying on anything to do with a specific OS thread.

If for some reason you do need to control which operation runs on which logical CPU, you could create as many threads as there are logical CPUs, limiting the first thread to the first logical CPU, the second thread to the second logical CPU, and so on. Keep in mind that when you set a thread affinity mask, the thread will not execute until a logical CPU in its mask is available. To prevent performance and behavior bugs in this situation, it is advisable to use a "soft" affinity instead by using P/Invoke to call the SetThreadIdealProcessor function.

When setting a thread ideal processor, the operating system schedules the thread on the ideal processor as soon as it is available, but will execute it on any other available CPU when the ideal processor is not available.
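
The declaration and a minimal use of the function look like the following sketch; error handling is omitted, and the choice of logical CPU is up to you:

// Sketch: suggest, but do not force, a preferred logical CPU for the
// current thread. Returns the previous ideal processor, or 0xFFFFFFFF
// on failure.
[DllImport("kernel32.dll")]
static extern IntPtr GetCurrentThread();

[DllImport("kernel32.dll")]
static extern uint SetThreadIdealProcessor(IntPtr hThread, uint dwIdealProcessor);

void PreferLogicalCpu(uint logicalCpu)
{
    uint previous = SetThreadIdealProcessor(GetCurrentThread(), logicalCpu);
}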

Producer/Consumer

A very effective and highly scalable way to use caching is a variation on the producer/consumer threads pattern. In this pattern, a "producer" thread performs a transformation on some piece of data, stores the transformed data in a buffer, and then processes the next piece of data. A second thread (the consumer) processes data in the buffer and creates a final output.

Using this variation, the memory that the consumer thread is processing is likely to be in the CPU cache already if the consumer and producer run on the same physical CPU. The producer and consumer threads can run concurrently on separate logical CPUs. To scale well on multiprocessor machines and when either producer or consumer uses I/O, multiple producer and consumer threads can be used; performance would benefit if producers and consumers of the same data reside on the same physical CPU. Using this variation also helps when the workload cannot be accurately divided in advance (for example, when some data elements take more processing, and you do not know which elements would do so before assigning them to threads).

This pattern is also useful for processing streams, and it can be used in situations where running multiple threads on separate segments of the data is inefficient, as when the consumers' (or the producers') calculations rely heavily on floating point operations. Implementing the producer/consumer pattern requires more code than can be covered in this article.
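
A production implementation is indeed more involved, but the basic hand-off can be sketched in a few lines. This skeleton assumes a simple locked Queue as the shared buffer, one producer and one consumer, and hypothetical Transform and Consume methods; shutdown and error handling are omitted:

// Skeleton only: a hand-off between one producer and one consumer thread.
// Transform and Consume are placeholders for your own processing.
Queue buffer = new Queue();                  // System.Collections
object bufferLock = new object();

void Producer()
{
    for (int i = 0; i < 1000; i++)
    {
        int transformed = Transform(i);      // first-stage processing
        lock (bufferLock)
        {
            buffer.Enqueue(transformed);
            Monitor.Pulse(bufferLock);       // wake a waiting consumer
        }
    }
}

void Consumer()
{
    for (int i = 0; i < 1000; i++)
    {
        int item;
        lock (bufferLock)
        {
            while (buffer.Count == 0)
                Monitor.Wait(bufferLock);    // releases the lock while waiting
            item = (int)buffer.Dequeue();
        }
        Consume(item);                       // second-stage processing
    }
}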

Detecting Hyper-Threading

It is generally not necessary to detect hyper-threading since scaling an application based on the number of logical processors usually works. In the cases where detection is needed, doing so from managed code is difficult.

On Windows Server™ 2003, this information is exposed through P/Invoking GetLogicalProcessorInformation; however, this API is not available on Windows XP. You can get the information in Windows XP by using the CPUID assembler instruction directly, but I won't go into the details of this approach here.

Conclusion

Hyper-threading is the first step in introducing multiple, interdependent logical CPUs to desktop machines. I've presented various ways to optimize code for hyper-threaded machines by splitting execution into multiple threads. Future trends in CPU design will make multithreading an even more useful tool in improving application performance for both server and desktop applications (see Herb Sutter's article on concurrency at The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software). Every developer would benefit from adding multithreading to their toolbox.

Yaniv Pessach is a 13-year industry veteran and a software design engineer at Microsoft for Indigo Web services, where he focuses on network programming and code performance. Reach Yaniv at his personal Web site, www.yanivpessach.com.