August 2011

Volume 26 Number 08

Parallel Programming - The Past, Present and Future of Parallelizing .NET Applications

By Stephen Toub | August 2011

The Ghost of Parallelism Past

Direct thread manipulation has historically been the way developers attempted to achieve responsive client applications, parallelized algorithms and scalable servers. Yet such techniques have also been the way developers historically achieved deadlocks, livelocks, lock convoys, two-step dances, race conditions, oversubscription and a host of other undesirable warts on applications. Since its inception, the Microsoft .NET Framework has provided a myriad of lower-level tools for building concurrent applications, including an entire namespace dedicated to the endeavor: System.Threading. 
With approximately 50 types in this namespace in the .NET Framework 3.5 core assemblies (including such types as Thread, ThreadPool, Timer, Monitor, ManualResetEvent, ReaderWriterLock and Interlocked), no one should be able to accuse the .NET Framework of being light on threading support. And yet, I will accuse previous versions of the .NET Framework of being light on the real support developers everywhere need to successfully build scalable and highly parallelized applications. This is a problem I’m thankful and excited to say has been rectified in the .NET Framework 4, and it’s continuing to see a significant amount of investment for future .NET Framework versions.

Some may question the value of a rich subsystem in a managed language for writing parallel code. After all, parallelism and concurrency are about performance, and developers interested in performance should seek out native languages that provide pedal-to-the-metal access to the hardware and full control over every bit twiddle, cache line manipulation and interlocked operation … right? I fear for the state of our industry if that is indeed the case. Managed languages like C#, Visual Basic and F# exist to provide all developers—mere mortals and superheroes alike—with a safe, productive environment in which to rapidly develop powerful and efficient code. Developers are provided with thousands upon thousands of prebuilt library classes, along with languages ripe with all of the modern services we’ve come to expect, and still manage to achieve impressive performance numbers on all but the most number-crunching and floating-point-intensive workloads. All of this is to say that managed languages and their associated frameworks have deep-seated support for building high-performing concurrent applications, so that developers on modern hardware can have their cake and eat it too.

I’ve always felt that patterns are a great way to learn, so for the topic at hand it’s only right that we start our exploration by looking at a pattern. For the “embarrassingly parallel” or “delightfully parallel” pattern, one of the most commonly needed fork-join constructs is a parallel loop, which is intended to process every independent iteration of a loop in parallel. It’s instructive to see how such processing could be done using the lower-level primitives mentioned earlier, and for that we’ll walk through the basic implementation of a naïve parallel loop in C#. Consider a typical for loop:

for (int i=0; i<N; i++) {
  ... // Process i here
}

We can use threads directly to effect the parallelization of this loop, as shown in Figure 1.

Figure 1 Parallelizing a For Loop

int lowerBound = 0, upperBound = N;
int numThreads = Environment.ProcessorCount;
int chunkSize = (upperBound - lowerBound) / numThreads;

var threads = new Thread[numThreads];
for (int t = 0; t < threads.Length; t++) {
  int start = (chunkSize * t) + lowerBound;
  int end = t < threads.Length - 1 ? start + chunkSize : upperBound;
  threads[t] = new Thread(delegate() {
    for (int i = start; i < end; i++) {
      ... // Process i here
    }
  });
}

foreach (Thread t in threads) t.Start(); // fork
foreach (Thread t in threads) t.Join();  // join

Of course, there are a myriad of problems with this parallelization approach. We’re spinning up new threads dedicated to the loop, which not only add overhead (particularly if the loop body does only a trivial amount of work) but can also lead to significant oversubscription in a process that’s doing other work concurrently. We’re using static partitioning to divide the work among the threads, which could lead to significant load imbalance if the workload is not evenly distributed across the iteration space (not to mention that if the number of iterations isn’t evenly divided by the number of utilized threads, the last thread is burdened with the overflow). Arguably worst of all, however, is that the developer is forced to write this code in the first place. Every algorithm we attempt to parallelize will require similar code—code that’s brittle at best.

The problem exemplified by the previous code is further amplified when we recognize that parallel loops are just one pattern in the multitudes that exist in parallel programs. Forcing developers to express all such parallel patterns at this low level of coding does not make for a good programming model, and does not set up for success the masses of developers in the world who need to be able to utilize massively parallel hardware.

The Ghost of Parallelism Present

Enter the .NET Framework 4. This release of the .NET Framework was augmented with a multitude of features to make it significantly easier for developers to express parallelism in their applications, and to have that parallelism executed efficiently. This goes well beyond parallel loops, but we’ll begin there nonetheless.

The System.Threading namespace was enhanced in the .NET Framework 4 with a new sub namespace: System.Threading.Tasks. This namespace includes a new type, Parallel, that exposes an abundance of static methods for implementing parallel loops and structured fork-join patterns. As an example of its usage, consider the previous for loop:

for (int i=0; i<N; i++) {
  ... // Process i here
}

With the Parallel class, you can parallelize it simply as follows:

Parallel.For(0, N, i => {
  ... // Process i here
});

Here, the developer is still responsible for ensuring each iteration of the loop is in fact independent, but beyond that, the Parallel.For construct handles all aspects of this loop’s parallelization. It handles dynamically partitioning the input range across all underlying threads involved in the computation, while still keeping partitioning overhead close to that of static partitioning implementations. It handles scaling up and scaling down the number of threads involved in the computation dynamically in order to find the optimal number of threads for a given workload (which is not always equal to the number of hardware threads, contrary to popular belief). It provides exception-handling capabilities not present in my naïve implementation previously shown, and so on. Most importantly, it keeps the developer from having to think about parallelism at the lower-level OS abstraction of threads, and from needing to continually code delicate solutions for partitioning workloads, offloading to multiple cores and joining the results efficiently. Instead, it enables the developer to spend his time focusing on what’s important: the business logic that makes the developer’s work profitable.
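To make the exception-handling point concrete, here’s a minimal sketch (ProcessItem is a hypothetical stand-in for the loop body): exceptions thrown from individual iterations are collected by the loop and surfaced together as a single AggregateException once the loop has stopped:

try {
  Parallel.For(0, N, i => {
    ProcessItem(i); // hypothetical per-iteration work that may throw
  });
}
catch (AggregateException ae) {
  // Failures from any iterations are gathered and rethrown together.
  foreach (Exception inner in ae.InnerExceptions)
    Console.WriteLine(inner.Message);
}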

Parallel.For also provides facilities for developers who require more fine-grained control over a loop’s operation. Through an options bag provided to the For method, developers can control the underlying scheduler the loop runs on, the maximum degree of parallelism to be employed, and the cancellation token used by an entity external to the loop to request the loop’s polite termination at the loop’s earliest convenience:

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.For(0, N, options, i => {
  ... // Process i here
});

This customization capability highlights one of the goals of this parallelization effort within the .NET Framework: To make it significantly easier for developers to take advantage of parallelism without complicating the programming, but at the same time give more advanced developers the knobs they need to fine-tune the processing and execution. In this vein, additional tweaks are supported. Other overloads of Parallel.For enable developers to break out of the loop early:

Parallel.For(0, N, (i,loop) => {
  ... // Process i here
  if (SomeCondition()) loop.Break();
});
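In addition, these overloads return a ParallelLoopResult that reports whether and where the loop was broken. As a brief sketch, reusing the hypothetical SomeCondition and ProcessItem placeholders from above:

ParallelLoopResult result = Parallel.For(0, N, (i, loop) => {
  ProcessItem(i); // hypothetical per-iteration work
  if (SomeCondition()) loop.Break();
});

// IsCompleted is false if Break was requested; LowestBreakIteration
// reports the iteration from which Break was called.
if (!result.IsCompleted && result.LowestBreakIteration.HasValue)
  Console.WriteLine("Broke out at iteration " + result.LowestBreakIteration.Value);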

Still other overloads allow developers to flow state through iterations that end up running on the same underlying thread, enabling far more efficient implementations of algorithms such as reductions, for example:

static int SumComputations(int [] inputs, Func<int,int> computeFunc) {
  int total = 0;
  Parallel.For(0, inputs.Length, () => 0, (i, loop, partial) => {
    return partial + computeFunc(inputs[i]);
  },
  partial => Interlocked.Add(ref total, partial));
  return total;
}

The Parallel class provides support not just for integral ranges, but also for arbitrary IEnumerable<T> sources, the .NET Framework representation of an enumerable sequence: code may continually call MoveNext on an enumerator in order to retrieve the next Current value. This ability to consume arbitrary enumerables enables parallel processing of arbitrary data sets, regardless of their in-memory representation; data sources may even be materialized on demand, and paged in as MoveNext calls reach not-yet-materialized sections of the source data:

IEnumerable<string> lines = File.ReadLines("data.txt");
Parallel.ForEach(lines, line => {
  ... // Process line here
});

As is the case with Parallel.For, Parallel.ForEach sports a multitude of customization capabilities, and it offers even greater control than Parallel.For does. For example, ForEach lets a developer customize how the input data set is partitioned. This is done through a set of partitioning-focused abstract classes that enable parallelization constructs to request a fixed or variable number of partitions, allowing the partitioner to hand out those partition abstractions over the input data set and to assign data to those partitions statically or dynamically as appropriate:

Graph<T> graph = ...;
Partitioner<T> data = new GraphPartitioner<T>(graph);
Parallel.ForEach(data, vertex => {
  ... // Process vertex here
});
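GraphPartitioner here is a hypothetical custom partitioner written for the shape of the graph data. For many workloads, though, the built-in factory methods on System.Collections.Concurrent.Partitioner suffice. As a sketch, range partitioning hands each worker a chunk of indices (a Tuple<int,int>) to process with an ordinary inner loop, amortizing the per-iteration delegate invocation cost:

// Partitioner.Create(0, N) yields chunked ranges over [0, N).
Parallel.ForEach(Partitioner.Create(0, N), range => {
  for (int i = range.Item1; i < range.Item2; i++) {
    ProcessItem(i); // hypothetical per-element work
  }
});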

Parallel.For and Parallel.ForEach are complemented on the Parallel class by an Invoke method that accepts an arbitrary number of actions to be invoked with as much parallelism as the underlying system can muster. This classic fork-join construct makes it easy to parallelize recursive divide-and-conquer algorithms, such as the commonly used example of QuickSort:

static void QuickSort<T>(T [] data, int lower, int upper) {
  if (upper - lower < THRESHOLD) {
    Array.Sort(data, index:lower, length:upper-lower);
  }
  else {
    int pivotPos = Partition(data, lower, upper);
    Parallel.Invoke(
      () => QuickSort(data, lower, pivotPos),
      () => QuickSort(data, pivotPos, upper));
  }
}

While a great step forward, the Parallel class only scratches the surface of the functionality available. One of the more monumental parallelization strides taken in the .NET Framework 4 was the introduction of Parallel LINQ, lovingly referred to as PLINQ (pronounced “Pee-link”). LINQ, or Language Integrated Query, was introduced to the .NET Framework in version 3.5. LINQ is really two things: a description of a set of operators exposed as methods for manipulating data sets, and contextual keywords in both C# and Visual Basic for expressing these queries directly in language. Many of the operators included in LINQ are based on the equivalent operations known for years to the database community, including Select, SelectMany, Where, Join, GroupBy and approximately 50 others. The .NET Framework Standard Query Operators API defines the pattern for these methods, but it doesn’t define which exact data sets these operations should target, nor exactly how these operations should be implemented. Various “LINQ providers” then implement this pattern for a multitude of different data sources and target environments (in-memory collections, SQL databases, object/relational mapping systems, HPC Server compute clusters, temporal and streaming data sources and more). One of the most commonly used providers is called LINQ to Objects, and it provides the full suite of LINQ operators implemented on top of IEnumerable<T>. This enables the implementation of queries in C# and Visual Basic, such as the following snippet that reads all data from a file line by line, filtering down to just those lines that contain the word “secret” and encrypting them. The end result is an enumerable of byte arrays:

IEnumerable<byte[]> encryptedLines = 
  from line in File.ReadLines("data.txt")
  where line.Contains("secret")
  select DataEncryptor.Encrypt(line);

For computationally intensive queries, or even just for queries that involve a lot of long-latency I/O, PLINQ provides automatic parallelization capabilities, implementing the full LINQ operator set utilizing end-to-end parallel algorithms. Thus, the previous query can be parallelized simply by the developer appending “.AsParallel()” to the data source:

IEnumerable<byte[]> encryptedLines = 
  from line in File.ReadLines("data.txt").AsParallel()
  where line.Contains("secret")
  select DataEncryptor.Encrypt(line);

As with the Parallel class, this model is opt-in to force the developer to evaluate the ramifications of running the computation in parallel. Once that choice has been made, however, the system handles the lower-level details of the actual parallelization, partitioning, thread throttling and the like. Also, as with Parallel, these PLINQ queries are customizable in a variety of ways. A developer can control how partitioning is achieved, how much parallelism is actually employed, tradeoffs between synchronization and latency, and more:

IEnumerable<byte[]> encryptedLines = 
  from line in new OneAtATimePartitioner<string>(
    File.ReadLines("data.txt"))
    .AsParallel()
    .AsOrdered()
    .WithCancellation(someExternalToken)
    .WithDegreeOfParallelism(4)
    .WithMergeOptions(ParallelMergeOptions.NotBuffered)
  where line.Contains("secret")
  select DataEncryptor.Encrypt(line);

These powerful and higher-level programming models for loops and queries are built on top of an equally powerful but lower-level set of task-based APIs, centering around the Task and Task<TResult> types in the System.Threading.Tasks namespace. In effect, the parallel loops and query engines are task generators, relying on the underlying task infrastructure to map the parallelism expressed to the resources available in the underlying system. At its core, Task is a representation for a unit of work—or, more generally, for a unit of asynchrony, a work item that may be spawned and later joined with through various means. Task provides Wait, WaitAll and WaitAny methods that allow for synchronously blocking forward progress until the target task (or tasks) has completed, or until additional constraints supplied to overloads of these methods have been met (for instance, a timeout or cancellation token). Task supports polling for completion through its IsCompleted property, and, more generally, polling for changes in its lifecycle processing through its Status property. Arguably most importantly, it provides the ContinueWith, ContinueWhenAll and ContinueWhenAny methods, which enable the creation of tasks that will be scheduled only when a specific set of antecedent tasks have completed. This continuation support allows a myriad of scenarios to be implemented easily, enabling dependencies to be expressed between computations such that the system can schedule work based on those dependencies becoming satisfied:

Task t1 = Task.Factory.StartNew(() => BuildProject(1));
Task t2 = Task.Factory.StartNew(() => BuildProject(2));
Task t3 = Task.Factory.StartNew(() => BuildProject(3));
Task t4 = Task.Factory.ContinueWhenAll(
  new [] { t1, t2 }, _ => BuildProject(4));
Task t5 = Task.Factory.ContinueWhenAll(
  new [] { t2, t3 }, _ => BuildProject(5));
Task t6 = Task.Factory.ContinueWhenAll(
  new [] { t4, t5 }, _ => BuildProject(6));
t6.ContinueWith(_ => Console.WriteLine("Solution build completed."));

The Task<TResult> class derived from Task enables results to be passed out from the completed operation, providing the .NET Framework with a core “future” implementation:

int SumTree<T>(Node<T> root, Func<T,int> computeFunc) {
  if (root == null) return 0;
  Task<int> left  = Task.Factory.StartNew(() => SumTree(root.Left, computeFunc));
  Task<int> right = Task.Factory.StartNew(() => SumTree(root.Right, computeFunc));
  return computeFunc(root.Data) + left.Result + right.Result;
}

Under all of these models (loops, queries and tasks alike) the .NET Framework employs work-stealing techniques to provide for more efficient processing of specialized workloads, and by default it employs hill-climbing heuristics to vary the number of employed threads over time in order to find the optimal processing level. Heuristics are also built into pieces of these components to automatically fall back to sequential processing if the system believes that any parallelization attempt would result in slower-than-sequential result times—though, as with the other defaults discussed previously, these heuristics may also be overridden.
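As one example of overriding those defaults, PLINQ exposes a WithExecutionMode operator that can force parallel execution even for query shapes the system would otherwise choose to run sequentially (the data source and the Compute projection in this sketch are hypothetical):

var results = data.AsParallel()
  .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
  .Select(item => Compute(item)) // hypothetical expensive per-item work
  .ToArray();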

Task<TResult> need not represent only compute-bound operations. It may also be used to represent arbitrary asynchronous operations. Consider the .NET Framework System.IO.Stream class, which provides a Read method to extract data from the stream:

NetworkStream source = ...;
byte [] buffer = new byte[0x1000];
int numBytesRead = source.Read(buffer, 0, buffer.Length);

This Read operation is synchronous and blocking, such that the thread making the Read call may not be used for other work until the I/O-based Read operation completes. To allow better scalability, the Stream class provides an asynchronous counterpart to the Read method in the form of two methods: BeginRead and EndRead. These methods follow a pattern available in the .NET Framework since its inception, a pattern known as the APM, or Asynchronous Programming Model. The following is an asynchronous version of the previous code example:

NetworkStream source = ...;
byte [] buffer = new byte[0x1000];
source.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult iar) {
  int numBytesRead = source.EndRead(iar);
}, null);

This approach, however, leads to poor composability. The TaskCompletionSource<TResult> type fixes this by enabling such an asynchronous read operation to be exposed as a task:

public static Task<int> ReadAsync(
  this Stream source, byte [] buffer, int offset, int count) 
{
  var tcs = new TaskCompletionSource<int>();
  source.BeginRead(buffer, offset, count, iar => {
    try { tcs.SetResult(source.EndRead(iar)); }
    catch(Exception exc) { tcs.SetException(exc); }
  }, null);
  return tcs.Task;
}

This allows for multiple asynchronous operations to be composed just as in the compute-bound examples. The following example concurrently reads from all of the source Streams, writing out to the console only when all of the operations have completed:

NetworkStream [] sources = ...;
byte [][] buffers = ...;
Task.Factory.ContinueWhenAll(
  (from i in Enumerable.Range(0, sources.Length)
   select sources[i].ReadAsync(buffers[i], 0, buffers[i].Length))
  .ToArray(), 
  _ => Console.WriteLine("All reads completed"));

Beyond mechanisms for launching parallelized and concurrent processing, the .NET Framework 4 also provides primitives for further coordinating work between tasks and threads. This includes a set of thread-safe and scalable collection types that largely eliminate the need for developers to manually synchronize access to shared collections. ConcurrentQueue<T> provides a thread-safe, lock-free, first-in-first-out collection that may be used concurrently by any number of producers and any number of consumers. In addition, it supports snapshot semantics for concurrent enumerators, so that code may examine the state of the queue at a moment in time even as other threads bash away at the instance. ConcurrentStack<T> is similar, instead providing last-in-first-out semantics. ConcurrentDictionary<TKey,TValue> uses lock-free and fine-grained locking techniques to provide a thread-safe dictionary, which also supports any number of concurrent readers, writers and enumerators. It also provides several atomic implementations of multistep operations, such as GetOrAdd and AddOrUpdate. Another type, ConcurrentBag<T>, provides an unordered collection using work-stealing queues.
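As a small hypothetical sketch (the word-counting scenario is merely illustrative), AddOrUpdate lets many threads maintain shared per-key counts with no explicit locking:

var counts = new ConcurrentDictionary<string, int>();
Parallel.ForEach(File.ReadLines("data.txt"), line => {
  foreach (string word in line.Split(' ')) {
    // Atomically insert a count of 1, or increment the existing count.
    counts.AddOrUpdate(word, 1, (key, existing) => existing + 1);
  }
});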

The .NET Framework doesn’t stop at collection types. Lazy<T> provides lazy initialization of a variable, using configurable approaches to achieve thread safety. ThreadLocal<T> provides per-thread, per-instance data that can also be lazily initialized on first access. The Barrier type enables phased operation so that multiple tasks or threads can proceed through an algorithm in lockstep. The list continues, and all stem from a single guiding principle: Developers shouldn’t need to focus on the lower-level and rudimentary aspects of their algorithm’s parallelization, and instead allow the .NET Framework to handle the mechanics and efficiency details for them.
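A brief sketch of the first two (the expensive configuration load is a hypothetical placeholder): Lazy<T> defers and synchronizes one-time initialization, while ThreadLocal<T> gives each thread its own lazily constructed copy:

// Only one thread runs the factory, no matter how many race to read Value.
var config = new Lazy<AppConfig>(
  () => LoadConfiguration(), // hypothetical expensive one-time load
  LazyThreadSafetyMode.ExecutionAndPublication);

// Each thread that touches Value gets its own independently initialized buffer.
var scratch = new ThreadLocal<byte[]>(() => new byte[0x1000]);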

The Ghost of Parallelism Yet to Come

Unlike the Dickens counterpart, the future for parallelism and concurrency in the .NET Framework is exciting and something to look forward to, building upon the foundations laid in the .NET Framework 4. A focus of future versions of the .NET Framework, beyond improving the performance of the existing programming models, is expanding the set of higher-level models that exist in order to address more patterns of parallel workloads. One such enhancement is a new library for implementing parallel systems based on dataflow and for architecting applications with agent-based models. The new System.Threading.Tasks.Dataflow library provides a multitude of “dataflow blocks” that act as buffers, processors and propagators of data. Data may be posted to these blocks, and that data will be processed and automatically forwarded to any linked targets, based on the semantics of the source block. The dataflow library is also built on top of tasks, with the blocks spinning up tasks under the covers to process and propagate data.

From a patterns perspective, the library is particularly good at handling dataflow networks that form chains of producers and consumers. Consider the need for data to be compressed and then encrypted and written out to a file, with the stream of data arriving and flowing through the application. This might be achieved by configuring a small network of dataflow blocks as follows:

static byte [] Compress(byte [] data) { ... }
static byte [] Encrypt(byte [] data) { ... }
static void AppendToFile(byte [] data) { ... }
...
var compressor = new TransformBlock<byte[],byte[]>(Compress);
var encryptor = new TransformBlock<byte[],byte[]>(Encrypt);
var saver = new ActionBlock<byte[]>(AppendToFile);
compressor.LinkTo(encryptor);
encryptor.LinkTo(saver);
...
// As data arrives
compressor.Post(byteArray);

Beyond the dataflow library, however, arguably the most important feature coming for parallelism and concurrency in the .NET Framework is first-class language support in C# and Visual Basic for producing and asynchronously awaiting tasks. These languages are being augmented with state-machine-based rewrite capabilities that allow for all of the languages’ sequential control flow constructs to be utilized while at the same time being able to asynchronously wait for tasks to complete (F# in Visual Studio 2010 supports a related form of asynchrony as part of its asynchronous workflows feature, a feature that also integrates with tasks). Take a look at the following method, which synchronously copies data from one Stream to another, returning the number of bytes copied:

static long CopyStreamToStream(Stream src, Stream dst) {
  long numCopied = 0;
  byte [] buffer = new byte[0x1000];
  int numRead;
  while((numRead = src.Read(buffer,0,buffer.Length)) > 0) {
    dst.Write(buffer, 0, numRead);
    numCopied += numRead;
  }
  return numCopied;
}

Implementing this function, including its conditionals and loops, with support like the BeginRead/EndRead methods on Stream shown earlier, results in a nightmare of callbacks and logic that’s error-prone and extremely difficult to debug. Instead, consider the approach of using a ReadAsync method like the one shown earlier, which returns a Task<int>, and a corresponding WriteAsync method, which returns a Task. Using the new C# functionality, we can rewrite the previous method as follows:

static async Task<long> CopyStreamToStreamAsync(Stream src, Stream dst) {
  long numCopied = 0;
  byte [] buffer = new byte[0x1000];
  int numRead;
  while((numRead = await src.ReadAsync(buffer,0,buffer.Length)) > 0) {
    await dst.WriteAsync(buffer, 0, numRead);
    numCopied += numRead;
  }
  return numCopied;
}

Notice the few minor modifications to turn the synchronous method into an asynchronous method. The function is now annotated as “async” to inform the compiler that it should perform a rewrite of the function. With that, any time an “await” operation is requested on a Task or Task<TResult>, the rest of the function’s execution is, in effect, hooked up to that task as a continuation: until the task completes, this method call won’t be occupying a thread. The Read method call has been converted into a ReadAsync call, such that the await contextual keyword may be used to signify the yielding point where the remainder should be turned into a continuation; it’s the same for WriteAsync. When this asynchronous method eventually completes, the returned long value will be lifted into a Task<long> that was returned to the initial caller of CopyStreamToStreamAsync, using a mechanism like the one shown earlier with TaskCompletionSource<TResult>. I can now use the return value from CopyStreamToStreamAsync as I would any Task, waiting on it, hooking up a continuation to it, composing with it over other tasks or even awaiting it. With functionality like ContinueWhenAll and WaitAll, I can initiate and later join with multiple asynchronous operations in order to achieve higher levels of concurrency and improve the overall throughput of my application.
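As a hedged sketch of that kind of composition (CopyBothAsync is a hypothetical helper, and it uses the Task.WhenAll combinator that also appears later in Figure 3), two copies can be started concurrently and then jointly awaited:

static async Task<long> CopyBothAsync(
  Stream srcA, Stream dstA, Stream srcB, Stream dstB)
{
  // Start both copies so they run concurrently, then await their joint completion.
  Task<long> copyA = CopyStreamToStreamAsync(srcA, dstA);
  Task<long> copyB = CopyStreamToStreamAsync(srcB, dstB);
  long[] copied = await Task.WhenAll(copyA, copyB);
  return copied[0] + copied[1];
}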

This language support for asynchrony improves not only I/O-bound but also CPU-bound operations, and in particular the ability of a developer to build responsive client applications (ones that don’t tie up the UI thread and leave the application in a non-responsive state) while still getting the benefits of massively parallel processing. It has long been cumbersome for developers to move off of a UI thread, perform any processing, and then move back to the UI thread to update UI elements and interact with a user. The language support for asynchrony interacts with key components of the .NET Framework to, by default, automatically bring operations back to their original context when an await operation completes (for instance, if an await is issued from the UI thread, the continuation that’s hooked up will continue execution back on the UI thread). This means that a task can be launched to run compute-intensive work in the background, and the developer can simply await it to retrieve the results and store them into UI elements, like so:

async void button1_Click(object sender, EventArgs e) {
  string filePath = txtFilePath.Text;
  txtOutput.Text = await Task.Factory.StartNew(() => {
    return ProcessFile(filePath);
  });
}

That background task may itself spin off multiple tasks in order to parallelize the background computation, such as by using a PLINQ query:

async void button1_Click(object sender, EventArgs e) {
  string filePath = txtFilePath.Text;
  txtOutput.Text = await Task.Factory.StartNew(() => {
    return File.ReadLines(filePath).AsParallel()
      .SelectMany(line => ParseWords(line))
      .Distinct()
      .Count()
      .ToString();
  });
}

The language support may also be used in combination with the dataflow library to ease the natural expression of asynchronous producer/consumer scenarios. Consider the desire to implement a set of throttled producers, each of which is generating some data to be sent off to a number of consumers. Synchronously, this might be done using a type like BlockingCollection<T> (see Figure 2), which was introduced as part of the .NET Framework 4.

Figure 2 Using a BlockingCollection

static BlockingCollection<Datum> s_data = 
  new BlockingCollection<Datum>(boundedCapacity:100);
...
static void Producer() {
  for(int i=0; i<N; i++) {
    Datum d = GenerateData();
    s_data.Add(d);
  }
  s_data.CompleteAdding();
}

static void Consumer() {
  foreach(Datum d in s_data.GetConsumingEnumerable()) {
    Process(d);
  }
}
...
var workers = new Task[3];
workers[0] = Task.Factory.StartNew(Producer);
workers[1] = Task.Factory.StartNew(Consumer);
workers[2] = Task.Factory.StartNew(Consumer);
Task.WaitAll(workers);

This is a fine pattern, as long as it meets the application’s goals that both the producers and consumers block threads. If that’s unacceptable, you can write an asynchronous counterpart, utilizing another of the dataflow blocks, BufferBlock<T>, and the ability to asynchronously send to and receive from a block, as shown in Figure 3.

Figure 3 Using a BufferBlock

static BufferBlock<Datum> s_data = new BufferBlock<Datum>(
  new DataflowBlockOptions { BoundedCapacity=100 });
...
static async Task ProducerAsync() {
  for(int i=0; i<N; i++) {
    Datum d = GenerateData();
    await s_data.SendAsync(d);
  }
  s_data.Complete();
}

static async Task ConsumerAsync() {
  Datum d;
  while(await s_data.OutputAvailableAsync()) {
    while(s_data.TryReceive(out d)) {
      Process(d);
    }
  }
}
...
var workers = new Task[3];
workers[0] = ProducerAsync();
workers[1] = ConsumerAsync();
workers[2] = ConsumerAsync();
await Task.WhenAll(workers);

Here, the SendAsync and OutputAvailableAsync methods both return tasks, enabling the compiler to hook up continuations and allowing the whole process to run asynchronously.

A Scrooge No More

Parallel programming has long been the domain of expert developers, uniquely qualified individuals well-versed in the art of scaling code to multiple cores. These experts are the result of years of training and on-the-job experiences. They’re highly valued—and they’re scarce. In our brave new world of multicore and manycore everywhere, this model of leaving parallelism purely to the experts is no longer sufficient. Regardless of whether an application or component is destined to be a publicly available software package, is intended purely for in-house use, or is just a tool to enable a more important job to be completed, parallelism is now something every developer must at least consider, and something that the millions upon millions of developers that utilize managed languages must be able to take advantage of, even if it’s through components that themselves encapsulate parallelism. Parallel programming models like those exposed in the .NET Framework 4, and like those coming in future versions of the .NET Framework, are necessary to enable this beautiful future.


Stephen Toub is a principal architect on the Parallel Computing Platform team at Microsoft.

Thanks to the following technical experts for reviewing this article: Joe Hoag and Danny Shih