2: Parallel Loops

Article
03/09/2012

On this page:
The Basics \| Parallel for Loops \| parallel_for_each \| What to Expect \| An Example \| Sequential Credit Review Example \| Credit Review Example Using parallel_for_each \| Performance Comparison \| Variations \| Breaking out of Loops Early \| Exception Handling \| Special Handling of Small Loop Bodies \| Controlling the Degree of Parallelism \| Anti-Patterns \| Hidden Loop Body Dependencies \| Small Loop Bodies with Few Iterations \| Duplicates in the Input Enumeration \| Scheduling Interactions with Cooperative Blocking \| Related Patterns \| Exercises \| Further Reading

Use the Parallel Loop pattern when you need to perform the same independent operation for each element of a collection or for a fixed number of iterations. The steps of a loop are independent if they don't write to memory locations or files that are read by other steps.

The syntax of a parallel loop is very similar to the for and for_each loops you already know, but the parallel loop completes faster on a computer that has available cores. Another difference is that, unlike a sequential loop, the order of execution isn’t defined for a parallel loop. Steps often take place at the same time, in parallel. Sometimes, two steps take place in the opposite order than they would if the loop were sequential. The only guarantee is that all of the loop’s iterations will have run by the time the loop finishes.

It's easy to change a sequential loop into a parallel loop. However, it's also easy to use a parallel loop when you shouldn't. This is because it can be hard to tell if the steps are actually independent of each other. It takes practice to learn how to recognize when one step is dependent on another step. Sometimes, using this pattern on a loop with dependent steps causes the program to behave in a completely unexpected way, and perhaps to stop responding. Other times, it introduces a subtle bug that only appears once in a million runs. In other words, the word "independent" is a key part of the definition of the Parallel Loop pattern, and one that this chapter explains in detail.

The Parallel Loop pattern independently applies an operation to multiple data elements. It's an example of data parallelism.

For parallel loops, the degree of parallelism doesn't need to be specified by your code. Instead, the run-time environment executes the steps of the loop at the same time on as many cores as it can. The loop works correctly no matter how many cores are available. If there is only one core and assuming the work performed by each iteration is not too small, then the performance is close to (perhaps within a few percentage points of) the sequential equivalent. If there are multiple cores, performance improves; in many cases, performance improves proportionately with the number of cores.

The Basics

The Parallel Patterns Library (PPL) includes both parallel for and parallel for_each loops. Use the parallel_for function to iterate over a range of integer indices and the parallel_for_each function to iterate over user-provided values.

To make for and for_each loops with independent iterations run faster on multicore computers, use their parallel counterparts.

Parallel for Loops

Here's an example of a sequential for loop in C++.

vector<double> results = ...
int workload = ...
size_t n = results.size(); 

for (size_t i = 0; i < n; ++i)

{

  results[i] = DoWork(i, workLoad);

}

To take advantage of multiple cores, replace the for keyword with a call to the parallel_for function and convert the body of the loop into a lambda expression.

Note

Don’t forget that the steps of the loop bodymust be independent of one another if you want to use a parallel loop. The steps must not communicate by writing to shared variables.

vector<double> results = ...
int workload = ...
size_t n = results.size(); 

parallel_for(0u, n,   
  [&results, workLoad](size_t i) 
  {
    results[i] = DoWork(i, workLoad); 
  });

Note

The example includes a lambda expression in the form [captured variables](args){body}. You may need to brush up on the syntax of lambda expressions in C++ before reading further.

The parallel_for function uses multiple cores if they're available to operate over the index range.

The parallel_for function has overloaded versions. Here's the signature of the version of parallel_for that's used in the example.

template <typename _Index_type, typename _Function>
void parallel_for(_Index_type _First, 
                  _Index_type _Last, 
                  const _Function& _Func);

In the example, the first two arguments specify the iteration limits. The first argument is the lowest index of the loop. The second argument is the exclusive upper bound, or the largest index plus one. The third argument is a function that's invoked once per iteration. The function takes the iteration's index as its argument and executes the loop body once for each index.

Note

The parallel_for method does not guarantee any particular order of execution. Unlike a sequential loop, some higher-valued indices may be processed before some lower-valued indices.

The parallel_for method has an additional overloaded version. It is covered in the "Variations" section later in this chapter.

The example includes a lambda expression in the form [captured variables] (args) {body} as the third argument to the parallel_for invocation. Lambda expressions denote function objects that can capture variables from their enclosing scope. Of course, the _Func parameter could also be a pointer to a function declared elsewhere. You don’t have to use lambda expressions.

Note

If you’re unfamiliar with the syntax for lambda expressions, see "Further Reading" at the end of this chapter. Once you use lambda expressions, you’ll wonder how you ever lived without them.

parallel_for_each

Here's an example of a sequential for_each loop in C++ that uses the conventions of the Standard Template Library (STL).

vector<size_t> inputs = ...   
int workload = ...

for_each(inputs.cbegin(), inputs.cend(), 
  [workLoad](size_t i)
  {
    DoWork(i, workLoad);
  });

To take advantage of multiple cores, replace the for_each keyword with a call to the parallel_for_each method.

vector<size_t> inputs = ...   
int workload = ...

parallel_for_each(inputs.cbegin(), inputs.cend(), 
  [workLoad](size_t i)
  {
    DoWork(i, workLoad);
  });

parallel_for_each runs the loop body for each element in a collection.

The parallel_for_each function is very similar in syntax to the std::for_each function. The first argument is an iterator that references the position of the first element in the range to be operated on. The second argument is an iterator that references the position one past the final element in the range. The third argument is a function object that's invoked for each element of the input range.

The parallel_for_each method does not guarantee the order of execution. Unlike a sequential for_each loop, the incoming values aren't always processed in order.

Note

Don’t forget that iterations need to be independent. The loop body must only make updates to fields of the particular instance that's passed to it.

What to Expect

By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the number of available cores. In typical scenarios, the more cores you have the faster your loop executes, until you reach the point of diminishing returns that Amdahl’s Law predicts. How much faster depends on the kind of work your loop does. (See Chapter 1 for a discussion of Amdahl's Law.)

Adding cores makes your loop complete faster; however, there's always an upper limit.

Note

You must choose the right granularity. Too many small parallel loops can reach a point of over-decomposition where the multicore speedup is more than offset by the parallel loop’s overhead.

If an exception is thrown during the execution of one of the iterations of a parallel_for or parallel_for_each function, that exception will be rethrown in the context of the calling thread. To learn more about exception handling for parallel loops, see the "Variations" section later in this chapter.

Robust exception handling is an important aspect of parallel loop processing.

If you convert a sequential loop to a parallel loop and then find that your program does not behave as expected, the most likely problem is that the loop's steps are not independent. Here are some common examples of dependent loop bodies:

Note

Check carefully for dependencies between loop iterations! Not noticing dependencies between steps is by far the most common mistake you’ll make with parallel loops.

Writing to shared variables. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values. Here is an example, where total is shared across iterations.
```
for(int i = 1; i < n; i++)
  total += data[i];
```
If you encounter this situation, see Chapter 4, "Parallel Aggregation."

Shared variables come in many flavors. Any variable that is declared outside of the scope of the loop body is a shared variable. Shared references to types such as classes or arrays will implicitly allow all fields or array elements to be shared. Parameters that are passed by reference or by pointer result in shared variables, as do variables captured by reference in a lambda expression.
Using data accessors of an object model. If the object being processed by a loop body exposes data accessors, you need to know whether they refer to shared state or state that's local to the object itself. For example, an accessor method named GetParent is likely to refer to global state. Here's an example.
```
for(int i = 0; i < n; i++)
  SomeObject[i].GetParent().Update();
```
In this example, it's likely that the loop iterations are not independent. It’s possible that, for all values of i, SomeObject[i].GetParent() is a reference to a single shared object.

Note

You must be extremely cautious when getting data from accessors. Large object models are known for sharing mutable state in unbelievably convoluted ways.

Referencing data types or functions that are not thread safe. If the body of the parallel loop uses a data type or function that is not thread safe, the loop body is not independent because there's an implicit dependency on the thread context.
Loop-carried dependence. If the body of a parallel_for loop performs arithmetic on a loop-indexed variable, there is likely to be a dependency that is known as loop-carried dependence. This is shown in the following code example. The loop body references data[i] and data[i – 1]. If parallel_for is used here, then there's no guarantee that the loop body that updates data[i – 1] has executed before the loop body for data[i].
```
for(int i = 1; i < N; i++)
  data[i] = data[i] + data[i - 1]; 
```
It's sometimes possible to use a parallel algorithm in cases of loop-carried dependence, but this is outside the scope of this book. Your best options are to look elsewhere in your program for opportunities for parallelism or to analyze your algorithm and see if it matches some of the advanced parallel patterns that occur in scientific computing. Parallel scan and parallel dynamic programming are examples of these patterns.

Note

Arithmetic on loop index variables, especially addition or subtraction, usually indicates loop-carried dependence.

When you look for opportunities for parallelism, profiling your application is a way to deepen your understanding of where your application spends its time; however, profiling is not a substitute for understanding your application's structure and algorithms. For example, profiling doesn't tell you whether loop bodies are independent.

Note

Don’t expect miracles from profiling—it can’t analyze your algorithms for you. Only you can do that.

An Example

Here's an example of when to use a parallel loop. Fabrikam Shipping extends credit to its commercial accounts. It uses customer credit trends to identify accounts that might pose a credit risk. Each customer account includes a history of past balance-due amounts. Fabrikam has noticed that customers who don't pay their bills often have histories of steadily increasing balances over a period of several months before they default.

To identify at-risk accounts, Fabrikam uses statistical trend analysis to calculate a projected credit balance for each account. If the analysis predicts that a customer account will exceed its credit limit within three months, the account is flagged for manual review by one of Fabrikam's credit analysts.

In the application, a top-level loop iterates over customers in the account repository. The body of the loop fits a trend line to the balance history, extrapolates the projected balance, compares it to the credit limit, and assigns the warning flag if necessary.

An important aspect of this application is that each customer's credit status can be calculated independently. The credit status of one customer doesn't depend on the credit status of any other customer. Because the operations are independent, making the credit analysis application run faster is simply a matter of replacing a sequential for_each loop with a parallel loop.

The complete source code for this example is online at http://parallelpatternscpp.codeplex.com in the Chapter2\CreditReview project.

Sequential Credit Review Example

Here's the sequential version of the credit analysis operation.

void UpdatePredictionsSequential(AccountRepository& accounts)
{
  for_each(accounts.begin(), accounts.end(), 
  [](AccountRepository::value_type& record)
  {
    Account& account = record.second;
    Trend trend = Fit(account.Balances());
    double prediction = PredictIntercept(trend, 
         (account.Balances().size() + g_predictionWindow)); 
    account.SeqPrediction() = prediction;
    account.SeqWarning() = prediction < account.GetOverdraft();
   });
}

The UpdatePredictionsSequential method processes each account from the application's account repository. The Fit method is a utility function that uses the statistical least squares method to create a trend line from an array of numbers. The Fit method is a pure function. This means that it doesn't modify any state.

The prediction is a three-month projection based on the trend. If a prediction is more negative than the overdraft limit (credit balances are negative numbers in the accounting system), the account is flagged for review.

Credit Review Example Using parallel_for_each

The parallel version of the credit scoring analysis is very similar to the sequential version.

void UpdatePredictionsParallel(AccountRepository& accounts)

{

  parallel_for_each(accounts.begin(), accounts.end(),
    []

    (AccountRepository::value_type& record)

    {

      Account& account = record.second;

      Trend trend = Fit(account.Balances());

      double prediction = PredictIntercept(trend, 

        (account.Balances().size() + g_predictionWindow));

      account.ParPrediction() = prediction;
      account.ParWarning() = prediction < account.GetOverdraft();
  });
}

The UpdatePredictionsParallel method is identical to the UpdatePredictionsSequential method, except that the parallel_for_each function replaces the for_each operator.

Performance Comparison

Running the credit review example on a quad-core computer shows that the parallel_for_each version runs slightly less than four times as fast as the sequential version. Timing numbers vary; you may want to run the online samples on your own computer.

Variations

The credit analysis example shows a typical way to use parallel loops, but there can be variations. This section introduces some of the most important ones. You won’t always need to use these variations, but you should be aware that they are available.

Breaking out of Loops Early

Breaking out of loops is a familiar part of sequential iteration. It’s less common in parallel loops, but you'll sometimes need to do it. Here's an example of the sequential case.

int n = ...
for (int i = 0; i < n; i++)
{
  // ... 
  if (/* stopping condition is true */)
    break;   
}

The situation is more complicated with parallel loops because more than one step may be active at the same time, and steps of a parallel loop are not necessarily executed in any predetermined order. However, you can break out of a parallel loop by canceling the task group that contains it. Task groups are described in Chapter 3, "Parallel Tasks."

Use the task_group::cancel method to break out of a parallel loop early.

Here's an example of how to break out of a parallel loop early.

vector<double> results = ...
int workLoad = ...
task_group tg;
size_t fillTo = results.size() - 5 ;
fill(results.begin(), results.end(), -1.0);

task_group_status status = tg.run_and_wait([&]
  {
     parallel_for(0u, results.size(), [&](size_t i)
     {
       if (i > fillTo)
         tg.cancel();
       else
         results[i] = DoWork(i, workLoad);
     });
   });

The example code shows that if you want to break out of a parallel loop early, you need to create a task group object and execute the parallel loop within that task group. When you want to break out of the loop, you invoke the task group’s cancel method.

You should keep in mind that parallel loops may execute steps out of order. Unlike breaking from a sequential loop, canceling the task group of a parallel loop cannot guarantee that higher-indexed iterations won’t have had a chance to run before the cancel operation takes effect.

Note

Don’t forget that parallel loops can execute steps out of order. Canceling a parallel loop doesn’t ensure that iterations with higher-valued indices won’t run.

Exception Handling

If the body of a parallel loop throws an unhandled exception, the parallel loop no longer begins any new steps. By default, iterations that are executing at the time of the exception, other than the iteration that threw the exception, will complete. After they finish, the parallel loop will throw an exception in the context of the thread that invoked it.

Because the loop runs in parallel, there may be more than one exception. If more than one exception has occurred, the parallel loop will nondeterministically choose one of the exceptions to throw. The remaining exceptions will not be externally observable.

Throwing an unhandled exception prevents new iterations from starting.

Here's an example of how to handle exceptions from a parallel loop.

vector<double> results = ...
    
try
{
  size_t n = results.size(); 
  parallel_for(0u, n, [&results](size_t i) 
  {
    results[i] = DoWork(i, 10); // throws exception
  });
}
catch (ParallelForExampleException e)
{
  printf("Exception caught as expected.\n");
}

Special Handling of Small Loop Bodies

If the body of the loop performs only a small amount of work, you may find that you achieve better performance by partitioning the iterations into larger units of work. The reason for this is that there are two types of overhead that are introduced when processing a loop: the cost of managing worker threads and the cost of invoking the function object. In most situations, these costs are negligible, but with very small loop bodies they can be significant.

Consider using partitioning strategies when you have many iterations that each perform a small amount of work.

An overloaded version of parallel_for allows you to specify a step size for the indices. Iterating with step sizes greater than one lets you embed a sequential loop within your parallel loop. Each iteration of the outer (parallel) loop handles a range of indices instead of individual indices. By grouping iterations into ranges, you can avoid some of the overhead of a normal parallel loop. Here's an example.

size_t size = results.size();
size_t rangeSize = size / (GetProcessorCount() * 10);
rangeSize = max(1, rangeSize);


parallel_for(0u, size, rangeSize, 
  [&results, size, rangeSize, workLoad](size_t i) 

  {

    for (size_t j = 0; (j < rangeSize) && (i + j < size); ++j)

        results[i + j] = DoWork(i + j, workLoad);

  });

Partitioning your data into ranges results in more complicated application logic than using an ordinary parallel_for function without partitioning. When the amount of work in each iteration is large (or of uneven size across iterations), partitioning may not result in better performance. Generally, you would only use the more complicated syntax after profiling or in the case where loop bodies are extremely small and the number of iterations large.

The number of ranges that you use will normally depend on the number of cores in your computer. A good default number of ranges is approximately three to ten times the number of cores.

Another approach for handling smaller loop bodies is to use the parallel_for_fixed or parallel_for_each_fixed functions that are provided in the Concurrency Runtime sample pack. By default, the parallel_for and parallel_for_each functions perform dynamic load balancing. When the amount of work for each item is small, the cost of load balancing can become significant. The sample pack’s parallel_for_fixed and parallel_for_each_fixed functions do not perform load balancing so they may outperform parallel_for and parallel_for_each when the loop bodies are small.

Another difference is that the parallel_for and parallel_for_each functions check for cancellation with each iteration. In contrast, the parallel_for_fixed and parallel_for_each_fixed functions do not check for cancellation within their subranges.

Controlling the Degree of Parallelism

Although you usually let the system manage how iterations of a parallel loop are mapped to your computer's cores, in some cases you may want additional control.

You’ll see this variation of the Parallel Loop pattern in a variety of circumstances. Reducing the degree of parallelism is often done in performance testing to simulate less capable hardware. Increasing the degree of parallelism to a number larger than the number of cores can be appropriate when iterations of your loop spend a lot of time waiting for I/O operations to complete.

You can control the maximum number of active threads used concurrently by a parallel loop.

The term degree of parallelism refers to the number of cores that are used to process iterations simultaneously. The degree of parallelism is automatically managed by the underlying components of the system. The implementation of the parallel_for and parallel_for_each functions, the Concurrency Runtime’s task scheduler, and the operating system’s thread scheduler all play a role in optimizing throughput under a wide range of conditions. You can’t control the degree of parallelism directly, but you can influence it by controlling the number of threads that are simultaneously executed by a parallel loop. To do this, you need to set the MinConcurrency and MaxConcurrency policies of the SchedulerPolicy class. For more information about setting these policies, see Appendix A, "The Task Scheduler and Resource Manager."

Anti-Patterns

Anti-patterns are cautionary tales. They highlight issues that need to be carefully considered as well as problem areas. Here are some issues to think about when you implement a parallel loop.

Hidden Loop Body Dependencies

Incorrect analysis of loop dependencies is a frequent source of software defects. Be careful that all parallel loop bodies do not contain hidden dependencies. This is a mistake that's easy to make.

The case of trying to share an instance of a class which is not thread safe across parallel iterations is an example of a subtle dependency. You should also be careful when you share state by using reference variables from the enclosing lexical scope in a lambda expression.

When loop bodies are not fully independent of each other, it may still be possible to use parallel loops. However, in these cases, you must make sure that all shared variables are protected and synchronized, and you must understand the performance characteristics of any synchronization you add. Adding synchronization can greatly reduce the performance of a parallel program, but forgetting to add necessary synchronization can result in a program with bugs that are catastrophic and difficult to reproduce.

If the loop body is not independent—for example, when you use an iteration to calculate a sum—you may need to apply the variation on a parallel loop that's described in Chapter 4, "Parallel Aggregation."

Small Loop Bodies with Few Iterations

You probably won't get performance improvements if you use a parallel loop for very small loop bodies with only a limited number of data elements to process. In this case, the overhead required by the parallel loop itself will dominate the calculation. Simply changing every sequential for loop to parallel_for will not necessarily produce good results.

Duplicates in the Input Enumeration

If you're using the parallel_for_each function, duplicate references or pointers to objects in the enumeration often indicate an unsafe race condition in your code. If an object reference (that is, a pointer or reference to an instance of a class) appears more than once in the input to the loop, then it’s possible that two parallel threads could try to update that object at the same time.

Note

Don’t allow duplicate instances in parallel loops. If an object appears more than once in the input to a loop, then it’s possible that two parallel threads could update the object at the same time.

Scheduling Interactions with Cooperative Blocking

If you perform a cooperative blocking operation in every iteration of a parallel loop, the task scheduler may create more threads than you intend. Cooperative blocking operations should be performed infrequently within the parallel loop.

The Parallel Loop pattern is the basis of the parallel aggregation pattern, which is the subject of Chapter 4, "Parallel Aggregation."

Exercises

Which of the following problems could be solved using the parallel loop techniques taught in this chapter?
1. Sorting an in-memory array of numbers with a million elements.
2. Putting the words in each line that's read from a text file in alphabetical order.
3. Adding together all the numbers in one collection to obtain a single sum.
4. Adding numbers from two collections pair-wise to obtain a collection of sums.
5. Counting the total number of occurrences of each word in a collection of text files.
6. Finding the word that occurs most frequently in each file in a collection of text files.
Choose a suitable problem from Exercise 1. Code two solutions, using a sequential loop and a parallel loop.
Do a performance analysis of the credit review example code on the CodePlex site http://parallelpatternscpp.codeplex.com. Use command line options to independently vary the number of iterations (the number of accounts) and the amount of work done in the body of each iteration (the number of months in the credit history). Record the execution times reported by the program for all three versions, using several different combinations of numbers of accounts and months. Repeat the tests on different computers with different numbers of cores and with different execution loads (from other applications).

Share via

2: Parallel Loops

The Basics

Parallel for Loops

parallel_for_each

What to Expect

An Example

Sequential Credit Review Example

Credit Review Example Using parallel_for_each

Performance Comparison

Variations

Breaking out of Loops Early

Exception Handling

Special Handling of Small Loop Bodies

Controlling the Degree of Parallelism

Anti-Patterns

Hidden Loop Body Dependencies

Small Loop Bodies with Few Iterations

Duplicates in the Input Enumeration

Scheduling Interactions with Cooperative Blocking

Exercises

Further Reading

Additional resources

Share via

2: Parallel Loops

The Basics

Parallel for Loops

parallel_for_each

What to Expect

An Example

Sequential Credit Review Example

Credit Review Example Using parallel_for_each

Performance Comparison

Variations

Breaking out of Loops Early

Exception Handling

Special Handling of Small Loop Bodies

Controlling the Degree of Parallelism

Anti-Patterns

Hidden Loop Body Dependencies

Small Loop Bodies with Few Iterations

Duplicates in the Input Enumeration

Scheduling Interactions with Cooperative Blocking

Related Patterns

Exercises

Further Reading

Additional resources