Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool
By Erika Fuentes | September 2010
The CLR ThreadPool in the latest release (CLR 4.0) has seen several major changes since CLR 2.0. The recent shift in technology trends, such as the widespread use of manycore architectures and the resulting desire to parallelize existing applications or write new parallel code, has been one of the biggest motivating factors in the improvement of the CLR ThreadPool.
In the December 2008 MSDN Magazine CLR Inside Out column, “Thread Management in the CLR” (msdn.microsoft.com/magazine/dd252943), I discussed some of the motivations and associated issues such as concurrency control and noise. Now I’ll describe how we’ve addressed these in the CLR 4.0 ThreadPool, the associated implementation choices and how these can impact its behavior. Also, I’ll focus on the approach taken toward automating concurrency control in the current CLR 4.0 ThreadPool (hereafter referred to only as ThreadPool for convenience). I will also give a brief outline of the ThreadPool architecture. This article covers implementation details subject to change in future versions. However, those readers designing and writing new concurrent applications who are interested in improving old applications by taking advantage of concurrency, or making use of ASP.NET or Parallel Extension technologies (all in the context of CLR 4.0), may find this material useful to understand and take advantage of current ThreadPool behavior.
Overview of the ThreadPool
A thread pool is meant to provide key services such as thread management, abstractions for different types of concurrency and throttling of concurrent operations. By providing these services, a thread pool takes away some burden from the user to do it manually. For the inexperienced user, it’s convenient not having to learn and deal with the details of a multi-threaded environment. For the more experienced user, having a reliable threading system means that she can focus on improving different aspects of the application. The ThreadPool provides these services for managed applications and support for portability across platforms, running certain Microsoft .NET Framework applications on the Mac OS, for example.
There are different types of concurrency that can be related to various parts of the system. The most relevant are: CPU parallelism, I/O parallelism; timers and synchronization; and load balancing and resource utilization. We can briefly outline the architecture of the ThreadPool in terms of the different aspects of concurrency (for more details on the ThreadPool architecture and related APIs usage, see “The CLR’s Thread Pool” (msdn.microsoft.com/magazine/cc164139). Specifically, it’s worth mentioning that there are two independent implementations of the ThreadPool: one deals with CPU parallelism and is referred to as the worker ThreadPool; the other deals with I/O Parallelism and can be dubbed I/O ThreadPool. The next section will focus on CPU parallelism and the associated implementation work in the ThreadPool—in particular, on strategies for throttling concurrency.
The Worker ThreadPool Designed to provide services at the level of CPU parallelism, the worker ThreadPool takes advantage of multi-core architectures. There are two main considerations for CPU parallelism: dispatching work quickly and optimally; and throttling the degree of parallelism. For the former, the ThreadPool implementation makes use of strategies such as lock-free queues to avoid contention and work-stealing for load balancing, areas that are out of the scope of this discussion (see msdn.microsoft.com/magazine/cc163340 for more insight on these topics). The latter—throttling the degree of parallelism—entails concurrency control to prevent resource contention from slowing down overall throughput.
CPU parallelism can be particularly challenging because it involves many parameters, such as determining how many work items can be run simultaneously at any given time. Another problem is the number of cores and how to tune for different types of workloads. For example, having one thread per CPU is optimal (in theory), but if the workload is constantly blocking, then CPU time is wasted because more threads could be used to execute more work. The size and type of workload is actually another parameter. For example, in the case of blocking workloads, it’s extremely difficult to determine the number of threads that optimizes overall throughput because it’s hard to determine when a request will be completed (or perhaps even how often it will arrive—this is closely related to I/O blocking). The API associated to this ThreadPool is QueueUserWorkItem, which queues a method (the work item) for execution (see msdn.microsoft.com/library/system.threading.threadpool.queueuserworkitem). It’s recommended for applications that have work that could potentially be run in parallel (with other things). The work is handed to the ThreadPool, which automatically “figures out” when to run it. This facility takes away from the programmer the need to worry about how and when to create threads; however, it isn’t the most efficient solution for all scenarios.
The I/O ThreadPool This part of the ThreadPool implementation, related to I/O parallelism, handles blocking workloads (that is, I/O requests that take a relatively long time to service) or asynchronous I/O. In asynchronous calls, threads aren’t blocked and can continue to do other work while the request is serviced. This ThreadPool takes care of the coordination between the requests and the threads. The I/O ThreadPool—like the worker ThreadPool—has an algorithm for throttling concurrency; it manages the number of threads based on the completion rate of asynchronous operations. But this algorithm is quite different from the one in the worker ThreadPool and it’s out of the scope of this document.
Concurrency in the ThreadPool
Dealing with concurrency is a difficult but necessary task that directly impacts the overall performance of a system. How the system throttles concurrency directly impacts other tasks such as synchronization, resource utilization and load balancing (and vice versa).
The concept of “concurrency control,” or more appropriately, “throttling concurrency,” refers to the number of threads that are allowed to do work at a particular time in the ThreadPool; it’s a policy to decide how many threads can be run simultaneously without hurting performance. Concurrency control in our discussion is only in regard to the worker ThreadPool. As opposed to what may be intuitive, concurrency control is about throttling and reducing the number of work items that can be run in parallel in order to improve the worker ThreadPool throughput (that is, controlling the degree of concurrency is preventing work from running).
The concurrency control algorithm in the ThreadPool automatically chooses the level of concurrency; it decides for the user how many threads are necessary to keep the performance generally optimal. The implementation of this algorithm is one of the most complex and interesting parts of the ThreadPool. There are various approaches to optimize the performance of the ThreadPool in the context of concurrency level (in other words, determine the “right” number of threads running at the same time). In the next section, I will discuss some of those approaches that have been considered or used in the CLR.
The Evolution of Concurrency Control in the ThreadPool
One of the first approaches taken was to optimize based on the observed CPU utilization, and adding threads to maximize it—running as much work as possible to keep the CPU busy. Using CPU utilization as a metric is useful when dealing with long or variable workloads. However, this approach wasn’t appropriate because the criteria to evaluate the metric can be misleading. Consider, for example, an application where lots of memory paging is happening. The observed CPU utilization would be low, and adding more threads in such a situation would result in more memory being used, which consequently results in even lower CPU utilization. Another problem with this approach is that in scenarios where there’s a lot of contention, the CPU time is really being spent doing synchronization, not doing actual work, so adding more threads would just make the situation worse.
Another idea was to just let the OS take care of the concurrency level. In fact, this is what the I/O ThreadPool does, but the worker ThreadPool has required a higher level of abstraction to provide more portability and more efficient resource management. This approach may work for some scenarios, but the programmer still needs to know how to do throttling to avoid over saturation of resources (for example, if thousands of threads are created, resource contention can become a problem and adding threads will actually make things worse). Furthermore, this implies that the programmer still has to worry about concurrency, which defeats the purpose of having a thread pool.
A more recent approach was to include the concept of throughput, measured as completion of work items per unit of time, as a metric used to tune performance. In this case, when the CPU utilization is low, threads are added to see if this improves the throughput. If it does, more threads are added; if it didn’t help, threads are removed. This approach is more sensible than previous ones because it takes into account how much work is being completed, rather than just how the resources are being used. Unfortunately, throughput is affected by many factors other than just the number of active threads (for example, work item size), which makes it challenging to tune.
Control Theory for Throttling Concurrency
To overcome some of the limitations of previous implementations, new ideas were introduced with CLR 4.0. The first methodology considered, from the control theory area, was the Hill Climbing (HC) algorithm. This technique is an auto-tuning approach based on an input-output feedback loop. The system output is monitored and measured at small time intervals to see what effects the controlled input had, and that information is fed back into the algorithm to further tune the input. Looking at the input and output as variables, the system is modeled as a function in terms of these variables. The goal is then to optimize the measured output.
In the context of the worker ThreadPool system, the input is the number of threads that are executing work concurrently (or concurrency level), and the output is the throughput (see Figure 1).
Figure 1 ThreadPool Feedback Loop
We observe and measure over time the changes in throughput as a result of adding or removing threads, then decide whether to add or remove more threads based on the observed throughput degradation or improvement. Figure 2 illustrates the idea.
Figure 2 Throughput Modeled as a Function of Concurrency Level
Having throughput as a (polynomial) function of the concurrency level, the algorithm adds threads until the maximum of the function is reached (about 20 in this example). At that point, a decrease in throughput will be seen and the algorithm will remove threads. Over each time interval a sample of throughput measurements is taken and “averaged.” This is then used to make a decision for the next time interval . It’s understandable that if the measurements are noisy, statistical information isn’t representative of the actual situation, unless perhaps it’s taken over a large interval of time. It’s hard to tell whether an improvement was a result of the change in concurrency level or due to another factor such as workload fluctuations.
Adaptive approaches in real-life systems are complicated; in our case its use was particularly problematic because of the difficulty in detecting small variations or extracting changes from a very noisy environment over a short time. The first problem observed with this approach is that the modeled function (see the black trend in Figure 2) isn’t a static target in real-world situations (see blue dots also in the graph), so measuring small changes is hard. The next issue, perhaps more concerning, is that noise (variations in measurements caused by the system environment, such as certain OS activity, garbage collection and more) makes it difficult to tell if there’s a relationship between the input and the output, that is, to tell if the throughput isn’t just a function of the number of threads. In fact, in the ThreadPool, the throughput constitutes only a small part of what the real observed output is—most of it is noise. For example, take an application whose workload uses many threads. Adding just a few more won’t make a difference in the output; an improvement observed in a time interval may not even be related to the change in concurrency level (Figure 3 helps to illustrate this issue).
Figure 3 Example of Noise in the ThreadPool
In Figure 3, on the x-axis is time; on the y-axis, throughput and concurrency level measurements are over-imposed. The top graph illustrates that in some workloads, even if the number of threads (red) is kept constant, there may be observed changes in the throughput (blue). In this example, the fluctuations are noise. The bottom graph is another example where the general increase in the throughput is observed over time even in the presence of noise. However, the number of threads was kept constant so the improvement in throughput is due to a different parameter in the system.
In the next section, I’ll discuss an approach to dealing with the noise.
Bringing in Signal Processing
Signal processing is used in many areas of engineering to reduce noise in signals; the idea is to find the input signal’s pattern in the output signal. This theory can be applied in the context of the ThreadPool if we treat the input (concurrency level) and output (throughput) of the concurrency control algorithm as signals. If we input a purposely modified concurrency level as a “wave” with known period and amplitude, and look for that original wave pattern in the output, we can discern what is noise from the actual effect of input on the throughput. Figure 4 illustrates this idea.
Figure 4 Determining Factors in ThreadPool Output
Consider, for a moment, the system as a black box that generates output given an input. Figure 4 shows a simplified example of the HC input and output (green); below that is an example of how the input and output would look as waves using a filtering technique (black).
Instead of feeding a flat, constant input, we introduce a signal and then try to find it in the noisy output. This effect can be achieved by using techniques such as the band pass filter or match filter, generally used for extracting waves from other waves or finding very specific signals in the output. This also means that by introducing changes to the input, the algorithm is making decisions at every point based on the last small piece of input data.
The particular algorithm used in the ThreadPool uses a discrete Fourier transform, a methodology that gives information such as the magnitude and phase of a wave. This information can then be used to see if and how the input affected the output. The graph in Figure 5 shows an example of the ThreadPool behavior using this methodology on a workload running more than 600 seconds.
Figure 5 Measuring How Input Affects Output
In the Figure 5 example, the known pattern of the input wave (phase, frequency and amplitude) can be traced in the output. The chart illustrates the behavior of the concurrency algorithm using filtering on a sample workload. The red trend corresponds to the input and the blue corresponds to the output. We vary the thread count up and down over time, but this doesn’t mean we’re creating or destroying threads; we keep them around.
Although the scale of the number of threads is different from that of the throughput, we can see how it’s possible to map how the input affects the output. The number of threads is constantly changed by at least one in a time slice, but this doesn’t mean that a thread is being created or destroyed. Instead, threads are kept “alive” in the pool, but they aren’t actively doing work.
As opposed to the initial approach using HC, where the goal was to model the throughput curve and base the decision on that calculation, the improved methodology just determines whether or not a change in the input helped to improve the output. Intuitively, there is increased confidence that the changes that we artificially introduce are having the effect observed on the output (the maximum number of threads observed so far to be introduced in the signal is 20, which is quite reasonable—especially for scenarios that have many threads). One of the drawbacks of the approach that uses signal processing is that due to the artificial wave pattern introduced, the optimal concurrency level will always be off by at least one thread. Also, adjustments to the concurrency level happen relatively slowly (faster algorithms are based on CPU utilization metrics) because it’s necessary to gather enough data to make the model stable. And the speed will depend on the length of work items.
This approach isn’t perfect and works better for some workloads than for others; however, it’s considerably better than previous methodologies. The types of workloads for which our algorithm works the best are those with relatively short individual work items, because the shorter the work item, the faster the algorithm is allowed to adapt. For example, it works pretty well with work item durations less than 250ms, but it’s better when the durations are less than 10ms.
Concurrency Management—We’ll Do It for You
Wrapping up, the ThreadPool provides services that help the programmer focus on things other than concurrency management. In order to deliver such functionality, the ThreadPool implementation has incorporated high-end engineering algorithms that can automate many decisions for the user. One example is the concurrency control algorithm, which has evolved based on the technology and the needs expressed from different scenarios, such as a need to measure useful progress of work execution.
The purpose of the concurrency control algorithm in CLR 4.0 is to automatically decide how many work items can be run concurrently in an efficient manner, hence optimizing the throughput of the ThreadPool. This algorithm is difficult to tune because of noise and parameters such as the type of workload; it also depends on the assumption that every work item is a useful piece of work. The current design and behavior has been heavily influenced by ASP.NET and Parallel Framework scenarios, for which it has good performance. In general, the ThreadPool can do a good job executing the work efficiently. However, the user should be aware that there may be unexpected behavior for some workloads, or if, for example, there are multiple ThreadPools running at the same time.
Erika Fuentes, Ph.D., is a software development engineer in Test on the CLR team, where she works in the Performance Team with particular focus on the Core Operating System area in Threading. She has written several academic publications about scientific computing, adaptive systems and statistics.
Thanks to the following technical experts for reviewing this article: Eric Eilebrecht and Mohamed Abd El Aziz