Parallel Loops

Article
09/01/2010

Use the Parallel Loop pattern when you need to perform the same independent operation for each element of a collection or for a fixed number of iterations. The steps of a loop are independent if they don't write to memory locations or files that are read by other steps.

The syntax of a parallel loop is very similar to the for and foreach loops you already know, but the parallel loop runs faster on a computer that has available cores. Another difference is that, unlike a sequential loop, the order of execution isn't defined for a parallel loop. Steps often take place at the same time, in parallel. Sometimes, two steps take place in the opposite order than they would if the loop were sequential. The only guarantee is that all of the loop's iterations will have run by the time the loop finishes.

It's easy to change a sequential loop into a parallel loop. However, it's also easy to use a parallel loop when you shouldn't. This is because it can be hard to tell if the steps are actually independent of each other. It takes practice to learn how to recognize when one step is dependent on another step. Sometimes, using this pattern on a loop with dependent steps causes the program to behave in a completely unexpected way, and perhaps to stop responding. Other times, it introduces a subtle bug that only appears once in a million runs. In other words, the word "independent" is a key part of the definition of this pattern, and one that this chapter explains in detail.

For parallel loops, the degree of parallelism doesn't need to be specified by your code. Instead, the run-time environment executes the steps of the loop at the same time on as many cores as it can. The loop works correctly no matter how many cores are available. If there is only one core, the performance is close to (perhaps within a few percentage points of) the sequential equivalent. If there are multiple cores, performance improves; in many cases, performance improves proportionately with the number of cores.

The Basics

The .NET Framework includes both parallel For and parallel ForEach loops and is also implemented in the Parallel LINQ (PLINQ) query language. Use the Parallel.For method to iterate over a range of integer indices and the Parallel.ForEach method to iterate over user-provided values. Use PLINQ if you prefer a high-level, declarative style for describing loops or if you want to take advantage of PLINQ's convenience and flexibility.

Parallel For Loops

Here's an example of a sequential for loop in C#.

int n = ...
for (int i = 0; i < n; i++)
{
   // ... 
}

To take advantage of multiple cores, replace the for keyword with a call to the Parallel.For method and convert the body of the loop into a lambda expression.

Note

Don't forget that the steps of the loop body must be independent of one another if you want to use a parallel loop. The steps must not communicate by writing to shared variables.

int n = ...
Parallel.For(0, n, i =>
{
   // ... 
});

Parallel.For is a static method with overloaded versions. Here's the signature of the version of Parallel.For that's used in the example.

Parallel.For(int fromInclusive, 
             int toExclusive, 
             Action<int> body);

In the example, the first two arguments specify the iteration limits. The first argument is the lowest index of the loop. The second argument is the exclusive upper bound, or the largest index plus one. The third argument is an action that's invoked once per iteration. The action takes the iteration's index as its argument and executes the loop body once for each index.

Note

The Parallel.For method does not guarantee any particular order of execution. Unlike a sequential loop, some higher-valued indices may be processed before some lower-valued indices.

The Parallel.For method has additional overloaded versions. These are covered in the section, "Variations," later in this chapter and in Chapter 4, "Parallel Aggregation."

The example includes a lambda expression in the form args => body as the third argument to the Parallel.For invocation. Lambda expressions are unnamed methods that can capture variables from their enclosing scope. Of course, the body parameter could also be an instance of a delegate type, an anonymous method (using the delegate keyword) or an ordinary named method. In other words, you don't have to use lambda expressions if you don't want to. Examples in this book use lambda expressions because they keep the code within the body of the loop, and they are easier to read when the number of lines of code is small.

Note

If you're unfamiliar with the syntax for lambda expressions, see "Further Reading" at the end of this chapter. After you use lambda expressions, you'll wonder how you ever lived without them.

Parallel ForEach

Here's an example of a sequential foreach loop in C#.

IEnumerable<MyObject> myEnumerable = ...

foreach (var obj in myEnumerable)
{
   // ... 
}

To take advantage of multiple cores, replace the foreach keyword with a call to the Parallel.ForEach method.

IEnumerable<MyObject> myEnumerable = ...

Parallel.ForEach(myEnumerable, obj =>
{
   // ... 
});

Parallel.ForEach is a static method with overloaded versions. Here's the signature of the version of Parallel.ForEach that was used in the example.

ForEach<TSource>(IEnumerable<TSource> source, 
                 Action<TSource> body);

In the example, the first argument is an object that implements the IEnumerable<MyObject> interface. The second argument is a method that's invoked for each element of the input collection.

The Parallel.ForEach method does not guarantee the order of execution. Unlike a sequential ForEach loop, the incoming values aren't always processed in order.

Note

Don't forget that iterations need to be independent. The loop body must only make updates to fields of the particular instance that's passed to it.

The Parallel.ForEach method has additional overloaded versions. These are covered in the section, "Variations," later in this chapter and in Chapter 4, "Parallel Aggregation."

Parallel LINQ (PLINQ)

The Language Integrated Query (LINQ) feature of the .NET Framework includes a parallel version named PLINQ (Parallel LINQ). There are many options and variations for expressing PLINQ queries but almost all LINQ-to-Objects expressions can easily be converted to their parallel counterpart by adding a call to the AsParallel extension method. Here's an example that shows both the LINQ and PLINQ versions.

IEnumerable<MyObject> source = ...

// LINQ
var query1 = from i in source select Normalize(i);

// PLINQ
var query2 = from i in source.AsParallel() 
                                       select Normalize(i);

This code example creates two queries that transform values of the enumerable object source. The PLINQ version uses multiple cores if they're available.

You can also use PLINQ's ForAll extension method in cases where you want to iterate over the input values but you don't want to select output values to return. This is shown in the following code.

IEnumerable<MyObject> myEnumerable = ...

myEnumerable.AsParallel().ForAll(obj => DoWork(obj));

The ForAll extension method is the PLINQ equivalent of Parallel.ForEach.

Note

It's important to use PLINQ's ForAll extension method instead of giving a PLINQ query as an argument to the Parallel.ForEach method. For more information, see the section, "Mixing the Parallel Class and PLINQ," later in this chapter.

What to Expect

By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the number of available cores. In typical scenarios, the more cores you have, the faster your loop executes, until you reach the point of diminishing returns that Amdahl's Law predicts. How much faster depends on the kind of work your loop does.

Note

You must choose the correct granularity. Too many small parallel loops can reach a point of over-decomposition where the multicore speedup is more than offset by the parallel loop's overhead.

The .NET implementation of the Parallel Loop pattern ensures that exceptions that are thrown during the execution of a loop body are not lost. For both the Parallel.For and Parallel.ForEach methods as well as for PLINQ, exceptions are collected into an AggregateException object and rethrown in the context of the calling thread. All exceptions are propagated back to you. To learn more about exception handling for parallel loops, see the section, "Variations," later in this chapter.

Parallel loops have many variations. There are 12 overloaded methods for Parallel.For and 20 overloaded methods for Parallel.ForEach. PLINQ has close to 200 extension methods. Although there are many overloaded versions of For and ForEach, you can think of the overloads as providing optional configuration options. Two examples are a maximum degree of parallelism and hooks for external cancellation. These options allow the loop body to monitor the progress of other steps (for example, to see if exceptions are pending) and to manage task-local state. They are sometimes needed in advanced scenarios. To learn about the most important cases, see the section, "Variations," later in this chapter.

If you convert a sequential loop to a parallel loop and then find that your program does not behave as expected, the mostly likely problem is that the loop's steps are not independent. Here are some common examples of dependent loop bodies:

Note

Check carefully for dependencies between loop iterations! Not noticing dependencies between steps is by far the most common mistake you'll make with parallel loops.

Writing to shared variables. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values. Here is an example, where total is shared across iterations.
```
for(int i = 1; i < n; i++)
  total += data[i];
```
If you encounter this situation, see Chapter 4, "Parallel Aggregation."

Shared variables come in many flavors. Any variable that is declared outside of the scope of the loop body is a shared variable. Shared references to types such as classes or arrays will implicitly allow all fields or array elements to be shared. Parameters that are declared using the keyword ref result in shared variables. Even reading and writing files can have the same effect as shared variables.
Using properties of an object model. If the object being processed by a loop body exposes properties, you need to know whether those properties refer to shared state or state that's local to the object itself. For example, a property named Parent is likely to refer to global state. Here's an example.
```
for(int i = 0; i < n; i++)
  SomeObject[i].Parent.Update();
```
In this example, it's likely that the loop iterations are not independent. For all values of i, SomeObject[i].Parent is a reference to a single shared object.

Note

You must be extremely cautious when getting data from properties and methods. Large object models are known for sharing mutable state in unbelievably devious ways.

Referencing data types that are not thread safe. If the body of the parallel loop uses a data type that is not thread safe, the loop body is not independent (there is an implicit dependency on the thread context). An example of this case, along with a solution, is shown in "Using Task-Local State in a Loop Body" in the section, "Variations," later in this chapter.
Loop-carried dependence. If the body of a parallel for loop performs arithmetic on the loop index, there is likely to be a dependency that is known as loop-carried dependence. This is shown in the following code example. The loop body references data[i] and data[i – 1]. If Parallel.For is used here, there's no guarantee that the loop body that updates data[i – 1] has executed before the loop for data[i].
```
for(int i = 1; i < N; i++)
  data[i] = data[i] + data[i - 1]; 
```
Sometimes, it's possible to use a parallel algorithm in cases of loop-carried dependence, but this is outside the scope of this book. Your best bet is to look elsewhere in your program for opportunities for parallelism or to analyze your algorithm and see if it matches some of the advanced parallel patterns that occur in scientific computing. Parallel scan and parallel dynamic programming are examples of these patterns.

Note

Arithmetic on loop index variables, especially addition or subtraction, usually indicates loop-carried dependence.

When you look for opportunities for parallelism, profiling your application is a way to deepen your understanding of where your application spends its time; however, profiling is not a substitute for understanding your application's structure and algorithms. For example, profiling doesn't tell you whether loop bodies are independent.

Note

Don't expect miracles from profiling—it can't analyze your algorithms for you. Only you can do that.

An Example

Here's an example of when to use a parallel loop. Fabrikam Shipping extends credit to its commercial accounts. It uses customer credit trends to identify accounts that might pose a credit risk. Each customer account includes a history of past balance-due amounts. Fabrikam has noticed that customers who don't pay their bills often have histories of steadily increasing balances over a period of several months before they default.

To identify at-risk accounts, Fabrikam uses statistical trend analysis to calculate a projected credit balance for each account. If the analysis predicts that a customer account will exceed its credit limit within three months, the account is flagged for manual review by one of Fabrikam's credit analysts.

In the application, a top-level loop iterates over customers in the account repository. The body of the loop fits a trend line to the balance history, extrapolates the projected balance, compares it to the credit limit, and assigns the warning flag if necessary.

An important aspect of this application is that each customer's credit status can be independently calculated. The credit status of one customer doesn't depend on the credit status of any other customer. Because the operations are independent, making the credit analysis application run faster is simply a matter of replacing a sequential foreach loop with a parallel loop.

The complete source code for this example is online at http://parallelpatterns.codeplex.com in the Chapter2\CreditReview project.

Sequential Credit Review Example

Here's the sequential version of the credit analysis operation.

static void UpdatePredictionsSequential(
                                   AccountRepository accounts)
{
  foreach (Account account in accounts.AllAccounts)
  {
    Trend trend = SampleUtilities.Fit(account.Balance);
    double prediction = trend.Predict(
                     account.Balance.Length + NumberOfMonths); 
    account.SeqPrediction = prediction;
    account.SeqWarning = prediction < account.Overdraft;
  }
}

The UpdatePredictionsSequential method processes each account from the application's account repository. The Fit method is a utility function that uses the statistical least squares method to create a trend line from an array of numbers. The Fit method is a pure function. This means that it doesn't modify any state.

The prediction is a three-month projection based on the trend. If a prediction is more negative than the overdraft limit (credit balances are negative numbers in the accounting system), the account is flagged for review.

Credit Review Example Using Parallel.ForEach

The parallel version of the credit scoring analysis is very similar to the sequential version.

static void UpdatePredictionsParallel(AccountRepository accounts)
{
  Parallel.ForEach(accounts.AllAccounts, account =>
  {
    Trend trend = SampleUtilities.Fit(account.Balance);
    double prediction = trend.Predict(
                       account.Balance.Length + NumberOfMonths);
    account.ParPrediction = prediction;
    account.ParWarning = prediction < account.Overdraft;
  });
}

The UpdatePredictionsParallel method is identical to the UpdatePredictionsSequential method, except that the Parallel.ForEach method replaces the foreach operator.

Credit Review Example with PLINQ

You can also use PLINQ to express a parallel loop. Here's an example.

static void UpdatePredictionsPlinq(AccountRepository accounts)
{            
  accounts.AllAccounts
  .AsParallel()
  .ForAll(account =>
  {
    Trend trend = SampleUtilities.Fit(account.Balance);
    double prediction = trend.Predict(
                        account.Balance.Length + NumberOfMonths);
    account.PlinqPrediction = prediction;
    account.PlinqWarning = prediction < account.Overdraft;         
   });            
}

Using PLINQ is almost exactly like using LINQ-to-Objects. PLINQ provides a ParallelEnumerable class that defines extension methods for various types in a manner very similar to LINQ's Enumerable class. One of the methods of ParallelEnumerable is the AsParallel extension method.

The AsParallel extension method allows you to convert a sequential collection of type IEnumerable<T> into a ParallelQuery<T> object. Applying AsParallel to the accounts.AllAccounts collection returns an object of type ParallelQuery<AccountRecord>.

PLINQ's ParallelEnumerable class has close to 200 extension methods that provide parallel queries for ParallelQuery<T> objects. In addition to parallel implementations of LINQ methods, such as Select and Where, PLINQ provides a ForAll extension method that invokes a delegate method in parallel for every element.

In the PLINQ prediction example, the argument to ForAll is a lambda expression that performs the credit analysis for a specified account. The body is the same as in the sequential version.

Performance Comparison

Running the credit review example on a quad-core computer shows that the Parallel.ForEach and PLINQ versions run slightly less than four times as fast as the sequential version. Timing numbers vary; you may want to run the online samples on your own computer.

Variations

The credit analysis example shows a typical way to use parallel loops, but there can be variations. This section introduces some of the most important ones. You won't always need to use these variations, but you should be aware that they are available.

Breaking Out of Loops Early

Breaking out of loops is a familiar part of sequential iteration. It's less common in parallel loops, but you'll sometimes need to do it. Here's an example of the sequential case.

int n = ...
for (int i = 0; i < n; i++)
{
  // ... 
  if (/* stopping condition is true */)
    break;   
}

The situation is more complicated with parallel loops because more than one step may be active at the same time, and steps of a parallel loop are not necessarily executed in any predetermined order. Consequently, parallel loops have two ways to break or stop a loop instead of just one. Parallel break allows all steps with indices lower than the break index to run before terminating the loop. Parallel stop terminates the loop without allowing any new steps to begin.

Parallel Break

The Parallel.For method has an overload that provides a ParallelLoopState object as a second argument to the loop body. You can ask the loop to break by calling the Break method of the ParallelLoopState object. Here's an example.

int n = ...
Parallel.For(0, n, (i, loopState) =>
{
  // ... 
  if (/* stopping condition is true */)
  {
    loopState.Break();
    return;   
  }
});

This example uses an overloaded version of Parallel.For that passes a "loop state" object to each step. Here's the signature of the version of the Parallel.For method that was used in the example.

Parallel.For(int fromInclusive, 
             int toExclusive, 
             Action<int, ParallelLoopState> body);

The object that's passed to the loopState argument is an instance of the ParallelLoopState class that was created by the parallel loop for use within the loop body.

Calling the Break method of the ParallelLoopState object begins an orderly shutdown of the loop processing. Any steps that are running as of the call to Break will run to completion.

Note

Calling Break doesn't stop other steps that might have already started running.

You may want to check for a break condition in long-running loop bodies and exit that step immediately if a break was requested. If you don't do this, the step will continue to run until it finishes. To see if another step running in parallel has requested a break, retrieve the value of the parallel loop state's LowestBreakIteration property. If this returns a nullable long integer whose HasValue property is true, you know that a break has been requested. You can also read the ShouldExitCurrentIteration property of the loop state object, which checks for breaks as well as other stopping conditions.

During the processing of a call to the Break method, iterations with an index value less than the current index will be allowed to start (if they have not already started), but iterations with an index value greater than the current index will not be started. This ensures that all iterations below the break point will complete.

Note

Don't forget that all steps with an index value less than the step that invoked the Break method will be allowed to run normally, even after you call Break.

Because of parallel execution, it's possible that more than one step may call Break. In that case, the lowest index will be used to determine which steps will be allowed to start after the break occurred.

The Parallel.For and Parallel.ForEach methods return an object of type ParallelLoopResult. You can find out if a loop terminated with a break by examining the values of two of the loop result properties. If the IsCompleted property is false and the LowestBreakIteration property returns an object whose HasValue property is true, you know that the loop terminated by a call to the Break method. You can query for the specific index with the loop result's LowestBreakIteration property. Here's an example.

int n = ...
var result = new double[n];

var loopResult = Parallel.For(0, n, (i, loopState) =>
{
   if (/* break condition is true */)
   {
      loopState.Break();
      return;
   }
   result[i] = DoWork(i);
});

if (!loopResult.IsCompleted && 
        loopResult.LowestBreakIteration.HasValue)
{
   Console.WriteLine("Loop encountered a break at {0}", 
                      loopResult.LowestBreakIteration.Value);
}

The Break method ensures that data up to a particular iteration index value will be processed. Depending on how the iterations are scheduled, it may be possible that some steps with a higher index value than the one that called the Break method may have been started before the call to Break occurs.

Note

Be aware that some steps with index values higher than the step that called the Break method might be run. There's no way of predicting when or if this might happen.

The Parallel.ForEach method also supports the loop state Break method. The parallel loop assigns items a sequence number, starting from zero, as it pulls them from the enumerable input. This sequence number is used as the iteration index for the LowestBreakIteration property.

Parallel Stop

There are also situations, such as unordered searches, where you want the loop to stop as quickly as possible after the stopping condition is met. The difference between "break" and "stop" is that, with stop, no attempt is made to execute loop iterations less than the stopping index if they have not already run. To stop a loop in this way, call the ParallelLoopState class's Stop method instead of the Break method. Here is an example of parallel stop.

var n = ...
var loopResult = Parallel.For(0, n, (i, loopState) =>
{
   if (/* stopping condition is true */)
   {
      loopState.Stop();
      return;
   }
   result[i] = DoWork(i);
});

if (!loopResult.IsCompleted && 
         !loopResult.LowestBreakIteration.HasValue)
{
   Console.WriteLine(“Loop was stopped”);
}

When the Stop method is called, the index value of the iteration that caused the stop isn't available.

You cannot call both Break and Stop during the same parallel loop. You have to choose which of the two loop exit behaviors you want to use. If you call both Break and Stop in the same parallel loop, an exception will be thrown.

Parallel programs use Stop more often than Break. Processing all iterations with indices less than the stopping iteration is usually not necessary when the loop bodies are independent of each other. It's also true that Stop shuts down a loop more quickly than Break.

There's no Stop method for a PLINQ query, but you can use the WithCancellation extension method and then use cancellation as a way to stop PLINQ execution. For more information, see the next section, "External Loop Cancellation."

External Loop Cancellation

In some scenarios, you may want to cancel a parallel loop because of an external request. For example, you may need to respond to a request from a user interface to stop what you're doing.

In .NET, you use the CancellationTokenSource class to signal cancellation and the CancellationToken structure to detect and respond to a cancellation request. The structure allows you to find out if there is a pending cancellation request. The class lets you signal that cancellation should occur.

The Parallel.For and Parallel.ForEach methods include overloaded versions that accept parallel loop options as one of the arguments. You can specify a cancellation token as one of these options. If you provide a cancellation token as an option to a parallel loop, the loop will use that token to look for a cancellation request. Here's an example.

void DoLoop(CancellationTokenSource cts)
{
  int n = ...
  CancellationToken token = cts.Token;

  var options = new ParallelOptions 
                        { CancellationToken = token };

  try
  {
    Parallel.For(0, n, options, (i) =>
    {
      // ... 

      // ... optionally check to see if cancellation happened
      if (token.IsCancellationRequested)
      {
         // ... optionally exit this iteration early
         return;
      }
    });
  }
  catch (OperationCanceledException ex)
  {
     // ... handle the loop cancellation
  }
}

Here is the signature of the Parallel.For method that was used in the example.

Parallel.For(int fromInclusive, 
             int toExclusive, 
             ParallelOptions parallelOptions, 
             Action<int> body);

When the caller of the DoLoop method is ready to cancel, it invokes the Cancel method of the CancellationTokenSource class that was provided as an argument to the DoLoop method. The parallel loop will allow currently running iterations to complete and then throw an OperationCanceledException. No new iterations will start after cancellation begins.

If external cancellation has been signaled and your loop has called either the Break or the Stop method of the ParallelLoopState object, a race occurs to see which will be recognized first. The parallel loop will either throw an OperationCanceledException or it will terminate using the mechanism for Break and Stop that is described in the section, "Breaking Out of Loops Early," earlier in this chapter.

You can use the WithCancellation extension method to add external cancellation capabilities to a PLINQ query.

Exception Handling

If the body of a parallel loop throws an unhandled exception, the parallel loop no longer begins any new steps. By default, iterations that are executing at the time of the exception, other than the iteration that threw the exception, will complete. After they finish, the parallel loop will throw an exception in the context of the thread that invoked it. Long-running iterations may want to test to see whether an exception is pending in another iteration. They can do this with the ParallelLoopState class's IsExceptional property. This property returns true if an exception is pending.

Because more than one exception may occur during parallel execution, exceptions are grouped using an exception type known as an aggregate exception. The AggregateException class has an InnerExceptions property that contains a collection of all the exceptions that occurred during the execution of the parallel loop. Because the loop runs in parallel, there may be more than one exception.

Exceptions take priority over external cancellations and terminations of a loop initiated by calling the Break or Stop methods of the ParallelLoopState object.

For a code example and more information about handling aggregate exceptions, see the section, "Exception Handling," in Chapter 3, "Parallel Tasks."

Special Handling of Small Loop Bodies

If the body of the loop performs only a small amount of work, you may find that you achieve better performance by partitioning the iterations into larger units of work. The reason for this is that there are two types of overhead that are introduced when processing a loop: the cost of managing worker threads and the cost of invoking a delegate method. In most situations, these costs are negligible, but with very small loop bodies they can be significant.

The parallel extensions of .NET Framework 4 include support for custom partitioning. A Partitioner object divides the indices into non-overlapping intervals named ranges. With partitioners, each parallel loop step handles a range of indices instead of individual indices. By grouping iterations into ranges, you can avoid some of the overhead of a normal parallel loop. Here's an example.

int n = ...
double[] result = new double[n];
Parallel.ForEach(Partitioner.Create(0, n), 
    (range) =>
    {
       for (int i = range.Item1; i < range.Item2; i++)
       {
         // very small, equally sized blocks of work
         result[i] = (double)(i * i); 
       }
    });

Here's the signature of the Parallel.For method that was used in the example.

Parallel.ForEach<TSource>(
     Partitioner<TSource> source, 
     Action<TSource> body);

In this example, you can think of the result of the Partitioner.Create method as an object that acts like an instance of IEnumerable<Tuple<int, int>> (the actual type is Partitioner<Tuple<int, int>>). In other words, Create gives you access to a collection of tuples (unnamed records) with two integer field values. Each tuple represents a range of index values that should be processed in a single iteration of the parallel loop. Each iteration of the parallel loop contains a nested sequential for loop that processes each index in the range.

The partition-based syntax for parallel loops is more complicated than the syntax for other parallel loops in .NET, and when the amount of work in each iteration is large (or of uneven size across iterations), it may not result in better performance. Generally, you would only use the more complicated syntax after profiling or in the case where loop bodies are extremely small and the number of iterations large.

The number of ranges that will be created by a Partitioner object depends on the number of cores in your computer. The default number of ranges is approximately three times the number of those cores.

If you know how big you want your ranges to be, you can use an overloaded version of the Partitioner.Create method that allows you to specify the size of each range. Here's an example.

double[] result = new double[1000000];
Parallel.ForEach(Partitioner.Create(0, 1000000, 50000), 
    (range) =>
    {
       for (int i = range.Item1; i < range.Item2; i++)
       {
         // small, equally sized blocks of work
         result[i] = (double)(i * i); 
       }
    });

In this example, each range will span 50,000 index values. In other words, for a million iterations, the system will use twenty parallel iterations (1,000,000 / 50,000). These iterations will be spread out among all the available cores.

Custom partitioning is an extension point in the API for parallel loops. You can implement your own partitioning strategies. For more information about this topic, see the section, "Further Reading," at the end of this chapter.

Controlling the Degree of Parallelism

Although you usually let the system manage how iterations of a parallel loop are mapped to your computer's cores, in some cases, you may want additional control.

You'll see this variation of the Parallel Loop pattern in a variety of circumstances. Reducing the degree of parallelism is often used in performance testing to simulate less capable hardware. Increasing the degree of parallelism to a number larger than the number of cores can be appropriate when iterations of your loop spend a lot of time waiting for I/O operations to complete.

The term degree of parallelism can be used in two senses. In the simplest case, it refers to the number of cores that are used to process iterations simultaneously. However, .NET also uses this term to refer to the number of tasks that can be used simultaneously by the parallel loop. For example, the MaxDegreeOfParallelism property of the ParallelOptions object refers to the maximum number of worker tasks that will be scheduled at any one time by a parallel loop.

For efficient use of hardware resources, the number of tasks is often greater than the number of available cores. For example, parallel loops may use additional tasks in cases where there are blocking I/O operations that do not require processor resources to run. The degree of parallelism is automatically managed by the underlying components of the system; the implementation of the Parallel class, the default task scheduler, and the .NET thread pool all play a role in optimizing throughput under a wide range of conditions.

You can limit the maximum number of tasks used concurrently by specifying the MaxDegreeOfParallelism property of a ParallelOptions object. Here is an example.

var n = ...
var options = new ParallelOptions() 
                      { MaxDegreeOfParallelism = 2};
Parallel.For(0, n, options, i =>
{
  // ... 
});

In the preceding code example, the parallel loop will run using at most two tasks at any one time. Here's the signature of the Parallel.For method that was used in the example.

Parallel.For(int fromInclusive, 
             int toExclusive, 
             ParallelOptions parallelOptions, 
             Action<int> body);

You can also configure the maximum number of worker threads for PLINQ queries by setting the WithDegreeOfParallelism property of a ParallelQuery<T> object. Here's an example.

IEnumerable<T> myCollection = // ...

myCollection.AsParallel()
            .WithDegreeOfParallelism(8)
            .ForAll(obj => /* ... */);

The query in the code example will run with a maximum of eight tasks at any one time.

If you specify a larger degree of parallelism, you may also want to use the ThreadPool class's SetMinThreads method so that these threads are created without delay. If you don't do this, the thread pool's thread injection algorithm may limit how quickly threads can be added to the pool of worker threads that is used by the parallel loop. It may take more time than you want to create the required number of threads.

Using Task-Local State in a Loop Body

Occasionally, you'll need to maintain thread-local state during the execution of a parallel loop. For example, you might want to use a parallel loop to initialize each element of a large array with random values. The .NET Framework Random class does not support multi-threaded access. Therefore, you need a separate instance of the random number generator for each thread.

Note

You must use task-local state for loop bodies that make calls to methods that are not thread safe.

Here's an example that uses one of the overloads of the Parallel.ForEach method. The example uses a Partitioner object to decompose the work into relatively large pieces, because the amount of work performed by each step is small, and there are a large number of steps.

int numberOfSteps = 10000000;
double[] result = new double[numberOfSteps];

Parallel.ForEach(

     Partitioner.Create(0, numberOfSteps), 

     new ParallelOptions(),

     () => { return new Random(MakeRandomSeed()); },

     (range, loopState, random) =>
     {
       for (int i = range.Item1; i < range.Item2; i++)
         result[i] = random.NextDouble();
       return random;
     },

     _ => {});

Here's the signature of the overload of version of the Parallel.ForEach method that was used in the example.

ForEach<TSource, TLocal>(
   OrderablePartitioner<TSource> source, 
   ParallelOptions parallelOptions, 
   Func<TLocal> localInit, 
   Func<TSource, ParallelLoopState, TLocal, TLocal> body, 
   Action<TLocal> localFinally)

The Parallel.ForEach loop will create a new instance of the Random class for each of its worker tasks. This instance will be passed as an argument to each partitioned iteration. Each partitioned iteration is responsible for returning the next value of the thread-local state. In this example, the returned value is always the same object that was passed in.

The default constructor of the Random class uses the system's clock to generate a random seed. It's possible that the default constructor will use the same seed if the constructor is called in twice in short succession. Therefore, the code makes sure that a distinct seed value is used for each new Random object by passing a unique seed value as an argument to the constructor.

Note

Calling the default Random constructor twice in short succession may use the same random seed. Provide your own random seed to prevent duplicate random sequences.

This example is designed to show the use of task-local state. In the Parallel class, thread-local state is provided by task-local state. A task is guaranteed to execute on just one thread throughout its entire run.

Generating pseudorandom numbers using the Random class in parallel can result in overlapping sequences of numbers on different threads. If your application really requires statistically robust pseudorandom values, you should consider using the RNGCryptoServiceProvider class or a third-party library. For more information about generating random numbers, see the section, "Further Reading," at the end of this chapter.

Note

The Random class isn't the right random generator for all simulation scenarios. It's used here only as an example of a class that isn't thread safe.

Using a Custom Task Scheduler for a Parallel Loop

You can substitute custom task scheduling logic for the default task scheduler that uses ThreadPool worker threads. For example, you can do this to enable a maximum degree of parallelism that operates across multiple loops instead of over only a single loop, which is the default. Additionally, you can use a custom task scheduler if you want to use a dedicated set of threads to process a parallel loop instead of pulling worker threads from the shared pool. These are just a few of the advanced scenarios where a custom task scheduler might be needed.

To specify a custom task scheduler, set the TaskScheduler property of a ParallelOptions object. Here's an example.

int n = ...
TaskScheduler myScheduler = ...
var options = new ParallelOptions() 
                      { TaskScheduler = myScheduler };
Parallel.For(0, n, options, i =>
{
  // ... 
});

Here is the signature of the Parallel.For that was used in the code example.

Parallel.For(int fromInclusive, 
             int toExclusive, 
             ParallelOptions parallelOptions, 
             Action<int> body);

For more information about tasks and task schedulers, see the section, "Design Notes," in Chapter3, "Parallel Tasks."

It isn't possible to specify a custom task scheduler for PLINQ queries.

Anti-Patterns

Anti-patterns are cautionary tales. They highlight issues that need to be carefully considered as well as problem areas.

Step Size Other Than One

If you need to use a step size other than one, transform the loop index in the body of the loop or use parallel foreach with the values you want. Also, be aware that a step size other than one often indicates a data dependency. Carefully analyze such a computation before you convert it into a parallel loop.

In sequential programs, you rarely iterate from high to low values unless the order matters. If your loop has a negative step size, the ordering of the loop is probably significant, and the iterations may not be independent.

Note

Beware! A step size other than one can indicate a loop dependency.

Here's a helpful technique. Before you convert a sequential loop to a parallel loop, you can temporarily reverse the order of the indices (for example, so that they go from high to low). If your sequential code still runs correctly, that's a bit of evidence that the steps of the loop are independent. (The test isn't foolproof; you still need to understand the algorithm.)

Hidden Loop Body Dependencies

Incorrect analysis of loop dependencies is a frequent source of software defects. Be careful that all parallel loop bodies do not contain hidden dependencies. This is a mistake that's easy to make.

The case of trying to share an instance of a class such as Random or DbConnection, which are not thread safe, across parallel iterations is an example of a subtle dependency.

When loop bodies are not fully independent of each other, it may still be possible to use parallel loops. However, in these cases, you must make sure that all shared variables are protected and synchronized, and you must understand the performance characteristics of any synchronization you add. Adding synchronization can greatly reduce the performance of a parallel program, but forgetting to add necessary synchronization can result in a program with bugs that are catastrophic and difficult to reproduce.

If the loop body is not independent, for example, when you use an iteration to calculate a sum, you may need to apply the variation on a parallel loop that's described in Chapter 4, "Parallel Aggregation."

Small Loop Bodies with Few Iterations

You'll probably not get performance improvements if you use a parallel loop for very small loop bodies with only a limited number of data elements to process. In this case, the overhead required by the parallel loop itself will dominate the calculation. Simply changing every sequential for loop to Parallel.For will not necessarily produce good results.

Processor Oversubscription and Undersubscription

Arbitrarily increasing the degree of parallelism puts you at risk of processor oversubscription, a situation that occurs when there are many more compute-intensive worker threads than there are cores. Tests have shown that, in general, the optimum number of worker threads for a parallel task equals the number of available cores divided by the average fraction of core utilization per task. For example, with four cores and 50 percent average core utilization per task, you need eight worker threads for maximum throughput. Increasing the number of worker threads above this number begins to incur extra cost from additional context switching without any improvement in processor utilization.

On the other hand, arbitrarily restricting the degree of parallelism can result in processor undersubscription. Having too few tasks misses an opportunity to effectively use the available processor cores. One situation where you may want to restrict the degree of parallelism is if you know that other tasks in your application are also running in parallel.

In most cases, the built-in load balancing algorithms in the .NET Framework are the most effective way to manage these tradeoffs. They coordinate resources among parallel loops and other tasks that are running concurrently.

Mixing the Parallel Class and PLINQ

PLINQ queries are represented by instances of the ParallelQuery<T> class. This class implements the IEnumerable<T> interface, so it's possible to use a PLINQ query as the source collection for a Parallel.ForEach loop, but it's not recommended. Instead, use PLINQ's ForAll extension method for the ParallelQuery<T> instance. PLINQ's ForAll extension method performs the same kind of iteration as Parallel.ForEach, but it avoids unnecessary merge and repartition steps that would otherwise occur.

Note

Use ForAll if you need to iterate over a PLINQ result. Don't use ForEach in this case.

Here's an example of how to use the ForAll extension method.

var q = from d in data.AsParallel() ... select F(d);

q.ForAll(item =>
{
  // ... process item
});

Duplicates in the Input Enumeration

If you're using the Parallel.ForEach method, duplicate objects in the enumeration often indicate an unsafe race condition in your code. If an object (that is, an instance of a class) appears more than once in the input to the loop, it's possible that two parallel threads could try to update that object at the same time.

Note

Don't allow duplicate instances in parallel loops. If an object appears more than once in the input to a loop, it's possible that two parallel threads could update the object at the same time.

Design Notes

Here are some fine points about using parallel loops.

Adaptive Partitioning

Parallel loops in .NET use an adaptive partitioning technique where the size of units of work increases over time. Adaptive partitioning is able to meet the needs of both small and large input ranges.

Adaptive Concurrency

The Parallel class and PLINQ work on slightly different threading models in the .NET Framework 4. PLINQ uses a fixed number of tasks to execute a query; by default, it creates the same number of tasks as there are logical cores in the computer, or it uses the value passed to the WithDegreeOfParallelism method if one was specified.

Conversely, by default, the Parallel.ForEach and Parallel.For methods can use a variable number of tasks. That's why, for example, the ParallelOptions class has a MaxDegreeOfParallelism property instead of a "MinDegreeOfParallelism" property. The idea is that the system can use fewer threads than requested to process a loop.

The .NET thread pool adapts dynamically to changing workloads by allowing the number of worker threads for parallel tasks to change over time. At run time, the system observes whether increasing the number of threads improves or degrades overall throughput and adjusts the number of worker threads accordingly.

Be careful if you use parallel loops with individual steps that take several seconds or more. This can occur with I/O-bound workloads as well as lengthy calculations. If the loops take a long time, you may experience an unbounded growth of worker threads due to a heuristic for preventing thread starvation that's used by the .NET ThreadPool class's thread injection logic. This heuristic steadily increases the number of worker threads when work items of the current pool run for long periods of time. The motivation is to add more threads in cases where everything in the thread pool is blocked. Unfortunately, if work is actually proceeding, more threads may not necessarily be what you want. The .NET Framework can't distinguish between these two situations.

Note

If the individual steps of your loop take a long time, you may see more worker threads than you intend.

You can limit the number of threads with the ParallelOptions class if you observe, through profiling, that the number of threads chosen by the system is too large. You can also limit the number of threads with the SetMaxThreads method of the ThreadPool class. Generally, using ParallelOptions is preferred because SetMaxThreads has a global, process-wide effect that you may not want.

Support for Nested Loops and Server Applications

The .NET Framework Parallel Extensions allow you to nest parallel loops. In this situation, the run-time environment coordinates the use of processor resources. You'll find that with nested parallel loops, each loop uses fewer threads than in a non-nested loop.

A related problem is the handling of parallel loops in server applications. The Parallel class attempts to deal with multiple AppDomains in server applications in exactly the same manner that nested loops are handled. If the server application is already using all the available thread pool threads to process other ASP.NET requests, a parallel loop will only run on the thread that invoked it. If the workload decreases and other threads become available and there's no other ASP.NET work to be processed, the loop will start using additional threads.

The Parallel Loop pattern is the basis of the parallel aggregation pattern, which is the subject of Chapter 4, "Parallel Aggregation."

Exercises

Which of the following problems could be solved using the parallel loop techniques taught in this chapter?
1. Sorting an in-memory array of numbers with a million elements
2. Putting the words in each line read from a text file in alphabetical order
3. Adding together all the numbers in one collection, to obtain a single sum
4. Adding numbers from two collections pair-wise, to obtain a collection of sums
5. Counting the total number of occurrences of each word in a collection of text files
6. Finding the word that occurs most frequently in each file in a collection of text files
Choose a suitable problem from Exercise 1. Code three solutions, using a sequential loop, a parallel loop, and PLINQ.
In the credit review example, what is the type of account? What is the type of account.AllAccounts? What is the type of account.AllAccounts.AsParallel()?
Is it possible to use a PLINQ query as the source for a Parallel.ForEach loop? Is this recommended? Explain your answers.
Do a performance analysis of the credit review example code on the CodePlex site http://parallelpatterns.codeplex.com. Use command line options to independently vary the number of iterations (the number of accounts) and the amount of work done in the body of each iteration (the number of months in the credit history). Record the execution times reported by the program for all three versions, using several different combinations of numbers of accounts and months. Repeat the tests on different computers with different numbers of cores and with different execution loads (from other applications).
Modify the credit review example from CodePlex so that you can set the MaxDegreeOfParallelism property. Observe the effect of different values of this property on the execution time when running on computers with different numbers of cores.

Parallel Loops

The Basics

Parallel For Loops

Parallel ForEach

Parallel LINQ (PLINQ)

What to Expect

An Example

Sequential Credit Review Example

Credit Review Example Using Parallel.ForEach

Credit Review Example with PLINQ

Performance Comparison

Variations

Breaking Out of Loops Early

Parallel Break

Parallel Stop

External Loop Cancellation

Exception Handling

Special Handling of Small Loop Bodies

Controlling the Degree of Parallelism

Using Task-Local State in a Loop Body

Using a Custom Task Scheduler for a Parallel Loop

Anti-Patterns

Step Size Other Than One

Hidden Loop Body Dependencies

Small Loop Bodies with Few Iterations

Processor Oversubscription and Undersubscription

Mixing the Parallel Class and PLINQ

Duplicates in the Input Enumeration

Design Notes

Adaptive Partitioning

Adaptive Concurrency

Support for Nested Loops and Server Applications

Related Patterns

Exercises

Further Reading

Additional resources