Dynamic Task Parallelism

Article
09/01/2010

This chapter covers situations where tasks are dynamically added to the work queue as the computation proceeds. This is known as dynamic task parallelism. A simple example of dynamic task parallelism occurs in cases where the sequential version of an algorithm includes recursion.

Dynamic task parallelism is also known as recursive decomposition or "divide and conquer."

Dynamic task parallelism applies to problems that are solved by first solving smaller, related problems. For example, when you count the number of nodes in a data structure that represents a binary tree, you can count the nodes in the left and right subtrees and then add the results.

Applications that use data structures such as trees and graphs are typical examples of dynamic task parallelism. It's also used for applications that have geographic or geometric aspects, where the problem can be partitioned spatially.

The Basics

The following code shows a binary tree.

public class Tree<T>
{
  public T       Data  { get; set; }
  public Tree<T> Left  { get; set; }
  public Tree<T> Right { get; set; }
}

If you want to perform an action on each data value in the tree, you need to visit each node. This is known as walking the tree, which is a naturally recursive operation. Here's an example that uses sequential code.

static void SequentialWalk<T>(Tree<T> tree, Action<T> action)
{
  if (tree == null) return;
  action(tree.Data);
  SequentialWalk(tree.Left, action);
  SequentialWalk(tree.Right, action);
}

You can also use parallel tasks to walk the tree.

static void ParallelWalk<T>(Tree<T> tree, Action<T> action)
{
  if (tree == null) return;
  var t1 = Task.Factory.StartNew(
             () => action(tree.Data));
  var t2 = Task.Factory.StartNew(
             () => ParallelWalk(tree.Left, action));
  var t3 = Task.Factory.StartNew(
             () => ParallelWalk(tree.Right, action)); 

  Task.WaitAll(t1, t2, t3);
}

You don't need to create a new task to walk the right side of the tree. If you want, you can use the current task; however, using tasks for both the left and right walks makes sure that exceptions that are thrown by either will be observed. For more information about how to observe unhandled task exceptions, see the section, "Handling Exceptions," in Chapter 3, "Parallel Tasks."

When you use dynamic task parallelism to perform a tree walk, you no longer visit nodes in a fully predictable order. If you need to visit nodes in a sequential order, such as with a preorder, inorder, or postorder traversal, you may want to consider the Pipeline pattern that's described in Chapter 7, "Pipelines."

In this example, the number of tasks is three times the number of nodes in the tree, which could be a large number. The Task Parallel Library (TPL) is designed to handle this type of situation, but you may want to read the section, "Design Notes," later in this chapter for some performance tips.

An Example

An example of dynamic task parallelism is when you sort a list with an algorithm such as QuickSort. This algorithm first divides an unsorted array of values, and then it orders and recombines the pieces. Here's a sequential implementation.

static void SequentialQuickSort(int[] array, int from, int to)
{
  if (to - from <= Threshold)
  {
    InsertionSort(array, from, to);
  }
  else
  {
    int pivot = from + (to - from) / 2;
    pivot = Partition(array, from, to, pivot);
    SequentialQuickSort(array, from, pivot);
    SequentialQuickSort(array, pivot + 1, to);
  }
}

This method sorts array in place, instead of returning a sorted array. The from and to arguments identify the segment that will be sorted. The code includes an optimization. It's not efficient to use the recursive algorithm on short segments, so the method calls the non-recursive InsertionSort method on segments that are not longer than Threshold, which is set in a global variable. This optimization applies equally to the sequential and parallel versions.

If the segment is longer than Threshold, the recursive algorithm is used. The element in the middle of the segment (at pivot) is chosen. The Partition method moves all the array elements not greater than the element at pivot to the segment that precedes pivot, and leaves the greater elements in the segment that follows pivot (pivot itself may be moved). Then the method recursively calls QuickSort on both segments.

The following code shows a parallel implementation.

static void ParallelQuickSort(int[] array, int from, 
                              int to, int depthRemaining)
{
  if (to - from <= Threshold)
  {
    InsertionSort(array, from, to);
  }
  else
  {
    int pivot = from + (to - from) / 2; 
    pivot = Partition(array, from, to, pivot);
    if (depthRemaining > 0)
    {
      Parallel.Invoke(
        () => ParallelQuickSort(array, from, pivot, 
                       depthRemaining - 1),
        () => ParallelQuickSort(array, pivot + 1, to, 
                       depthRemaining - 1));
    }
    else
    {
      ParallelQuickSort(array, from, pivot, 0);
      ParallelQuickSort(array, pivot + 1, to, 0);
    }
  }
}

This version uses Parallel.Invoke to execute the recursive calls in tasks that can run in parallel. Tasks are created dynamically with each recursive call; if the array is large, many tasks might be created.

The parallel version includes an additional optimization. It's generally not useful to create many more tasks than there are processors to run them. So, the ParallelQuickSort method includes an additional argument to limit task creation. The depthRemaining argument is decremented on each recursive call, and tasks are created only when this argument exceeds zero. The following code shows how to calculate an appropriate depth (the depthRemaining argument) from the number of processors.

public static void ParallelQuickSort(int[] array)
{
  ParallelQuickSort(array, 0, array.Length, 
      (int) Math.Log(Environment.ProcessorCount, 2) + 4);
}

One relevant factor in selecting the number of tasks is how similar the predicted run times of the tasks will be. In the case of QuickSort, the duration of the tasks may vary a great deal because the pivot points depend on the unsorted data. They don't necessarily result in segments of equal size (in fact, the sizes could vary widely). To compensate for the uneven sizes of the tasks, the formula in this example that calculates the depthRemaining argument produces more tasks than cores. The formula limits the number of tasks to approximately sixteen times the number of cores. This is because the number of tasks can be no larger than 2 ^ depthRemaining. If you substitute depthRemaining = log2(NCores) + 4 and simplify the expression, you see that the number of tasks is 16 x NCores. (Recall that for any value a, 2 ^ (a + 4) is the same as 16 times 2^a and that if a = log2(b), 2^a = b.)

Variations

Dynamic task parallelism has several variations.

Parallel While-Not-Empty

The examples shown so far in this chapter use techniques that are the parallel analogs of sequential depth-first traversal. There are also parallel algorithms for other types of traversals. These techniques rely on concurrent collections to keep track of the remaining work to be done. Here's an example.

public static void ParallelWhileNotEmpty<T>(
  IEnumerable<T> initialValues, 
  Action<T, Action<T>> body)
{
  var from = new ConcurrentQueue<T>(initialValues);  
  while (!from.IsEmpty)
  { 
    var to = new ConcurrentQueue<T>();
    Action<T> addMethod = to.Enqueue; 
    Parallel.ForEach(from, v => body(v, addMethod));        
    from = to;
  }
}

This method shows how you can use Parallel.ForEach to process an initial collection of values. While processing the values, additional values to process may be discovered. The additional values are placed in the to queue. After the first batch of values processes, the method starts processing the additional values, which may again result in more values to process. This process repeats until no additional values can be produced.

A method that walks a binary tree can use the ParallelWhileNotEmpty method.

static void ParallelWalk4<T>(Tree<T> tree, Action<T> action)
{
  if (tree == null) return;
  ParallelWhileNotEmpty(new[] { tree }, (item, adder) =>
  {
    if (item.Left != null) adder(item.Left);
    if (item.Right != null) adder(item.Right);
    action(item.Data);
  });
}

An example of a problem that would be an appropriate use of the ParallelWhileNotEmpty method is a website link checking tool. The walk task loads the initial page and searches it for links. Each link is checked and removed from the list, and additional links to unchecked pages from the same site are added to the list. Eventually, there are no more unchecked links and the application stops.

Task Chaining with Parent/Child Tasks

The TPL includes a task creation option named AttachedToParent. This option is used most frequently in code that uses the dynamic task parallelism pattern. The purpose of the AttachedToParent option is to link a subtask to the task that created it. In this case, the subtask is known as a child task and the task that created the child task is known as a parent task.

You use the AttachedToParent option in two situations. The first is when you want to link the status of a parent task to the status of its child tasks. The second is when you want to use the Microsoft® Visual Studio® development system debugger to see the parent/child relationships. The order of execution is not changed by using the AttachedToParent option.

The following code shows an example.

static void ParallelWalk2<T>(Tree<T> tree, Action<T> action)
{
  if (tree == null) return;
  var t1 = Task.Factory.StartNew(
              () => action(tree.Data),                                        
              TaskCreationOptions.AttachedToParent);
  var t2 = Task.Factory.StartNew(
              () => ParallelWalk2(tree.Left, action),                             
              TaskCreationOptions.AttachedToParent);
  var t3 = Task.Factory.StartNew(
              () => ParallelWalk2(tree.Right, action),
              TaskCreationOptions.AttachedToParent);
  Task.WaitAll(t1, t2, t3);
}

The AttachedToParent option affects the behavior of the parent task. If a parent task with at least one running child task finishes running for any reason, its Status property becomes WaitingForChildrenToComplete. Only when all of its attached children are no longer running will the parent task's Status property transition to one of the three final states. Figure 1 illustrates this.

Figure 1

Life cycle of a task that has attached children

Exceptions from attached children are observed in the parent task.

To access the parent/child view in the Visual Studio debugger, right-click the row headings in the Parallel Tasks window, and then click Parent Child View.

Design Notes

The behavior of the Microsoft .NET Framework default task scheduler was described in Chapter 3, "Parallel Tasks." The default task scheduler uses fast local task queues for each of its worker threads as well as a work stealing algorithm for load balancing. These features are of particular importance to the performance of applications that use dynamic task parallelism.

The default task scheduler's thread injection heuristics aren't appropriate for every case of dynamic task parallelism. In particular, if you have long-running, processor-intensive tasks, profiling your application may show processor oversubscription (too many worker threads). If this happens, consider decomposing your problem into more tasks with shorter durations. For example, in the parallel quick sort example, you can adjust the initial value of the depthRemaining parameter to create more or fewer tasks.

You can also write a custom thread scheduler. For more information about task scheduling, see Chapter 3, "Parallel Tasks."

Exercises

The sample code on CodePlex assigns a particular default value for the Threshold segment length. At this point, the QuickSort methods switch to the non-recursive InsertionSort algorithm. Use the command line argument to assign different values for the Threshold value, and then observe the execution times for the sequential version to sort different array sizes. What do you expect to see? What's the best value for Threshold on your system?
Use the command line argument to vary the array size, and then observe the execution time as a function of array size for the sequential and parallel versions. What do you expect? Can you explain your observations?
Suggest other measures, besides the number of cores, to limit the number of tasks.