Your CPU meter shows a problem. One core is running at 100 percent, but all the other cores are idle. Your application is CPU-bound, but you are using only a fraction of the computing power of your multicore system. Is there a way to get better performance?
The answer, in a nutshell, is parallel programming. Where you once would have written the kind of sequential code that is familiar to all programmers, you now find that this no longer meets your performance goals. To use your system's CPU resources efficiently, you need to split your application into pieces that can run at the same time.
Parallel programming uses multiple cores at the same time to improve your application's speed.
Of course, this is easier said than done. Parallel programming has a reputation for being the domain of experts and a minefield of subtle, hard-to-reproduce software defects. Everyone seems to have a favorite story about a parallel program that did not behave as expected because of a mysterious bug.
|Attention impatient readers:|
|you can skip ahead to the table of patterns and when they can be used. See "Selecting the Right Pattern" later in this chapter.|
These stories should inspire a healthy respect for the difficulty of the problems you will face in writing your own parallel programs. Fortunately, help has arrived. The Parallel Patterns Library (PPL) and the Asynchronous Agents Library introduce a new programming model for parallelism that significantly simplifies the job. Behind the scenes are sophisticated algorithms that dynamically distribute computations on multicore architectures. In addition, Microsoft® Visual Studio® 2010 development system includes debugging and analysis tools to support the new parallel programming model.
Writing parallel programs has the reputation of being hard, but help has arrived.
Proven design patterns are another source of help. This guide introduces you to the most important and frequently used patterns of parallel programming and provides executable code samples for them, using PPL. When thinking about where to begin, a good place to start is to review the patterns in this book. See if your problem has any attributes that match the six patterns presented in the following chapters. If it does, delve more deeply into the relevant pattern or patterns and study the sample code.
Most parallel programs conform to these patterns, and it's very likely you'll be successful in finding a match to your particular problem. If you can't use these patterns, you've probably encountered one of the more difficult cases, and you'll need to hire an expert or consult the academic literature.
The code examples for this guide are online at http://parallelpatternscpp.codeplex.com/.
The Importance of Potential Parallelism
The patterns in this book are ways to express potential parallelism. This means that your program is written so that it runs faster when parallel hardware is available and roughly the same as an equivalent sequential program when it's not. If you correctly structure your code, the run-time environment can automatically adapt to the workload on a particular computer. This is why the patterns in this book only express potential parallelism. They do not guarantee parallel execution in every situation. Expressing potential parallelism is a central organizing principle behind PPL’s programming model. It deserves some explanation.
Declaring the potential parallelism of your program allows the execution environment to run the program on all available cores, whether one or many.
Some parallel applications can be written for specific hardware. For example, creators of programs for a console gaming platform have detailed knowledge about the hardware resources that will be available at run time. They know the number of cores and the details of the memory architecture in advance. The game can be written to exploit the exact level of parallelism provided by the platform. Complete knowledge of the hardware environment is also a characteristic of some embedded applications, such as industrial process control. The life cycle of such programs matches the life cycle of the specific hardware they were designed to use.
In contrast, when you write programs that run on general-purpose computing platforms, such as desktop workstations and servers, there is less predictability about the hardware features. You may not always know how many cores will be available. You also may be unable to predict what other software could be running at the same time as your application.
|Don't hard code the degree of parallelism in an application. You can't always predict how many cores will be available at run time.|
Even if you initially know your application's environment, it can change over time. In the past, programmers assumed that their applications would automatically run faster on later generations of hardware. You could rely on this assumption because processor clock speeds kept increasing. With multicore processors, clock speeds on newer hardware are not increasing as much as they did in the past. Instead, the trend in processor design is toward more cores. If you want your application to benefit from hardware advances in the multicore world, you need to adapt your programming model. You should expect that the programs you write today will run on computers with many more cores within a few years. Focusing on potential parallelism helps to "future proof" your program.
Hardware trends predict more cores instead of faster clock speeds.
Finally, you must plan for these contingencies in a way that does not penalize users who might not have access to the latest hardware. You want your parallel application to run as fast on a single-core computer as an application that was written using only sequential code. In other words, you want scalable performance from one to many cores. Allowing your application to adapt to varying hardware capabilities, both now and in the future, is the motivation for potential parallelism.
A well-written parallel program runs at approximately the same speed as a sequential program when there is only one core available.
An example of potential parallelism is the parallel loop pattern described in Chapter 2, "Parallel Loops." If you have a for loop that performs a million independent iterations, it makes sense to divide those iterations among the available cores and do the work in parallel. It's easy to see that how you divide the work should depend on the number of cores. For many common scenarios, the speed of the loop will be approximately proportional to the number of cores.
Decomposition, Coordination, and Scalable Sharing
The patterns in this book contain some common themes. You'll see that the process of designing and implementing a parallel application involves three aspects: methods for decomposing the work into discrete units known as tasks, ways of coordinating these tasks as they run in parallel, and scalable techniques for sharing the data needed to perform the tasks.
The patterns described in this guide are design patterns. You can apply them when you design and implement your algorithms and when you think about the overall structure of your application. Although the example applications are small, the principles they demonstrate apply equally well to the architectures of large applications.
Tasks are sequential operations that work together to perform a larger operation. When you think about how to structure a parallel program, it's important to identify tasks at a level of granularity that results in efficient use of hardware resources. If the chosen granularity is too fine, the overhead of managing tasks will dominate. If it's too coarse, opportunities for parallelism may be lost because cores that could otherwise be used remain idle. In general, tasks should be as large as possible, but they should remain independent of each other, and there should be enough tasks to keep the cores busy. You may also need to consider the heuristics that will be used for task scheduling. Meeting all these goals sometimes involves design tradeoffs. Decomposing a problem into tasks requires a good understanding of the algorithmic and structural aspects of your application.
Tasks are sequential units of work. Tasks should be large, independent, and numerous enough to keep all cores busy.
An example of these guidelines at work can be seen in a parallel ray tracing application. A ray tracer constructs a synthetic image by simulating the path of each ray of light in a scene. The individual ray simulations are a good level of granularity for parallelism. Breaking the tasks into smaller units, for example, by trying to decompose the ray simulation itself into independent tasks, only adds overhead, because the number of ray simulations is already large enough to keep all cores occupied. If your tasks vary greatly in duration, you generally want more of them in order to fill in the gaps.
Another advantage to grouping work into larger and fewer tasks is that larger tasks are often more independent of each other than are smaller tasks. Larger tasks are less likely than smaller tasks to share local variables or fields. Unfortunately, in applications that rely on large mutable object graphs, such as applications that expose a large object model with many public classes, methods, and properties, the opposite may be true. In these cases, the larger the task, the more chance there is for unexpected sharing of data or other side effects.
The overall goal is to decompose the problem into independent tasks that do not share data, while providing a sufficient number of tasks to occupy the number of cores available. When considering the number of cores, you should take into account that future generations of hardware will have more cores.
|Keep in mind that tasks are not threads. Tasks and threads take very different approaches to scheduling. Tasks are much more compatible with the concept of potential parallelism than threads are. While a new thread immediately introduces additional concurrency to your application, a new task introduces only the potential for additional concurrency. A task's potential for additional concurrency will be realized only when there are enough available cores.|
It's often possible that more than one task can run at the same time. Tasks that are independent of one another can run in parallel, while some tasks can begin only after other tasks complete. The order of execution and the degree of parallelism are constrained by the application's underlying algorithms. Constraints can arise from control flow (the steps of the algorithm) or data flow (the availability of inputs and outputs).
Various mechanisms for coordinating tasks are possible. The way tasks are coordinated depends on which parallel pattern you use. For example, the Pipeline pattern described in Chapter 7, "Pipelines," is distinguished by its use of messages to coordinate tasks. Regardless of the mechanism you choose for coordinating tasks, in order to have a successful design, you must understand the dependencies between tasks.
Scalable Sharing of Data
Tasks often need to share data. The problem is that when a program is running in parallel, different parts of the program may be racing against each other to perform updates on the same memory location. The result of such unintended data races can be catastrophic. The solution to the problem of data races includes techniques for synchronizing threads.
You may already be familiar with techniques that synchronize concurrent threads by blocking their execution in certain circumstances. Examples include locks, atomic compare-and-swap operations, and semaphores. All of these techniques have the effect of serializing access to shared resources. Although your first impulse for data sharing might be to add locks or other kinds of synchronization, adding synchronization reduces the parallelism of your application. Every form of synchronization is a form of serialization. Your tasks can end up contending over the locks instead of doing the work you want them to do. Programming with locks is also error-prone.
Fortunately, there are a number of techniques that allow data to be shared that don't degrade performance or make your program prone to error. These techniques include the use of immutable, read-only data, sending messages instead of updating shared variables, and introducing new steps in your algorithm that merge local versions of mutable state at appropriate checkpoints. Techniques for scalable sharing may involve changes to an existing algorithm.
Scalable sharing may involve changes to your algorithm.
Conventional object-oriented designs can have complex and highly interconnected in-memory graphs of object references. As a result, traditional object-oriented programming styles can be very difficult to adapt to scalable parallel execution. Your first impulse might be to consider all fields of a large, interconnected object graph as mutable shared state, and to wrap access to these fields in serializing locks whenever there is the possibility that they may be shared by multiple tasks. Unfortunately, this is not a scalable approach to sharing. Locks can often negatively affect the performance of all cores. Locks force cores to pause and communicate, which takes time, and they introduce serial regions in the code, which reduces the potential for parallelism. As the number of cores gets larger, the cost of lock contention can increase. As more and more tasks are added that share the same data, the overhead associated with locks can dominate the computation.
Adding synchronization (locks) can reduce the scalability of your application.
In addition to performance problems, programs that rely on complex synchronization are prone to a variety of problems, including deadlock. Deadlock occurs when two or more tasks are waiting for each other to release a lock. Most of the horror stories about parallel programming are actually about the incorrect use of shared mutable state or locking protocols.
Nonetheless, synchronizing elements in an object graph plays a legitimate, if limited, role in scalable parallel programs. This book uses synchronization sparingly. You should, too. Locks can be thought of as the goto statements of parallel programming: they are error prone but necessary in certain situations, and they are best left, when possible, to compilers and libraries.
No one is advocating the removal, in the name of performance, of synchronization that's necessary for correctness. First and foremost, the code still needs to be correct. However, it's important to incorporate design principles into the design process that limit the need for synchronization. Don't add synchronization to your application as an afterthought.
It's common for developers to identify one problem area, parallelize the code to improve performance, and then repeat the process for the next bottleneck. This is a particularly tempting approach when you parallelize an existing sequential application. Although this may give you some initial improvements in performance, it has many pitfalls, such as those described in the previous section. As a result, traditional profile-and-optimize techniques may not produce the best results. A far better approach is to understand your problem or application and look for potential parallelism across the entire application as a whole. What you discover may lead you to adopt a different architecture or algorithm that better exposes the areas of potential parallelism in your application. Don't simply identify bottlenecks and parallelize them. Instead, prepare your program for parallel execution by making structural changes.
|Think in terms of data structures and algorithms; don't just identify bottlenecks.|
Techniques for decomposition, coordination, and scalable sharing are interrelated. There's a circular dependency. You need to consider all of these aspects together when choosing your approach for a particular application.
After reading the preceding description, you might complain that it all seems vague. How specifically do you divide your problem into tasks? Exactly what kinds of coordination techniques should you use?
Questions like these are best answered by the patterns described in this book. Patterns are a true shortcut to understanding. As you begin to see the design motivations behind the patterns, you will also develop your intuition about how the patterns and their variations can be applied to your own applications. The following section gives more details about how to select the right pattern.
Selecting the Right Pattern
To select the relevant pattern, use the following table.
Do you have sequential loops where there's no communication among the steps of each iteration?
The Parallel Loop pattern (Chapter 2)
Parallel loops apply an independent operation to multiple inputs simultaneously.
Do you have distinct operations with well-defined control dependencies? Are these operations largely free of serializing dependencies?
The Parallel Task pattern (Chapter 3)
Parallel tasks allow you to establish parallel control flow in the style of fork and join.
Do you need to summarize data by applying some kind of combination operator? Do you have loops with steps that are not fully independent?
The Parallel Aggregation pattern (Chapter 4)
Parallel aggregation introduces special steps in the algorithm for merging partial results. This pattern expresses a reduction operation and includes map/reduce as one of its variations.
Does the ordering of steps in your algorithm depend on data flow constraints?
The Futures pattern (Chapter 5)
Futures make the data flow dependencies between tasks explicit. This pattern is also referred to as the Task Graph pattern.
Does your algorithm divide the problem domain dynamically during the run? Do you operate on recursive data structures such as graphs?
The Dynamic Task Parallelism pattern (Chapter 6)
This pattern takes a divide-and-conquer approach and spawns new tasks on demand.
Does your application perform a sequence of operations repetitively? Does the input data have streaming characteristics? Does the order of processing matter?
The Pipeline pattern (Chapter 7)
Pipelines consist of components that are connected by queues, in the style of producers and consumers. All the components run in parallel even though the order of inputs is respected.
One way to become familiar with the possibilities of the six patterns is to read the first page or two of each chapter. This will give you an overview of approaches that have been proven to work in a wide variety of applications. Then go back and more deeply explore patterns that may apply in your situation.
A Word about Terminology
You'll often hear the words parallelism and concurrency used as synonyms. This book makes a distinction between the two terms.
Concurrency is a concept related to multitasking and asynchronous input-output (I/O). It usually refers to the existence of multiple threads of execution that may each get a slice of time to execute before being preempted by another thread, which also gets a slice of time. Concurrency is necessary in order for a program to react to external stimuli such as user input, devices, and sensors. Operating systems and games, by their very nature, are concurrent, even on one core.
With parallelism, concurrent threads execute at the same time on multiple cores. Parallel programming focuses on improving the performance of applications that use a lot of processor power and are not constantly interrupted when multiple cores are available.
The goals of concurrency and parallelism are distinct. The main goal of concurrency is to reduce latency by never allowing long periods of time to go by without at least some computation being performed by each unblocked thread. In other words, the goal of concurrency is to prevent thread starvation.
Concurrency is required operationally. For example, an operating system with a graphical user interface must support concurrency if more than one window at a time can update its display area on a single-core computer. Parallelism, on the other hand, is only about throughput. It's an optimization, not a functional requirement. Its goal is to maximize processor usage across all available cores; to do this, it uses scheduling algorithms that are not preemptive, such as algorithms that process queues or stacks of work to be done.
The Limits of Parallelism
A theoretical result known as Amdahl's law says that the amount of performance improvement that parallelism provides is limited by the amount of sequential processing in your application. This may, at first, seem counterintuitive.
Amdahl's law says that no matter how many cores you have, the maximum speed-up you can ever achieve is (1 / fraction of time spent in sequential processing). Figure 1 illustrates this.
For example, with 11 cores, the application runs slightly more than three times faster than it would if it were entirely sequential.
Even with fewer cores, you can see that the expected speed-up is not linear. Figure 2 illustrates this.
Figure 2 shows that as the number of cores (and overall application speed) increases, the percentage of time spent in the sequential part of the application increases. (The elapsed time spent in sequential processing is constant.) The illustration also shows why you might be satisfied with a 2x speed-up on a four-core computer for actual applications, as opposed to sample programs. The important question is always how scalable the application is. Scalability depends on the amount of time spent doing work that is inherently sequential in nature.
Another implication of Amdahl's law is that for some problems, you may want to create additional features in the parts of an application that are amenable to parallel execution. For example, a developer of a computer game might find that it's possible to make increasingly sophisticated graphics for newer multicore computers by using the parallel hardware, even if it's not as feasible to make the game logic (the artificial intelligence engine) run in parallel. Performance can influence the mix of application features.
The speed-up you can achieve in practice is usually somewhat worse than Amdahl's law would predict. As the number of cores increases, the overhead incurred by accessing shared memory also increases. Also, parallel algorithms may include overhead for coordination that would not be necessary for the sequential case. Profiling tools, such as the Visual Studio Concurrency Visualizer, can help you understand how effective your use of parallelism is.
In summary, because an application consists of parts that must run sequentially as well as parts that can run in parallel, the application overall will rarely see a linear increase in performance with a linear increase in the number of cores, even if certain parts of the application see a near linear speed-up. Understanding the structure of your application and its algorithms—that is, which parts of your application are suitable for parallel execution—is a step that can't be skipped when analyzing performance.
A Few Tips
Always try for the simplest approach. Here are some basic precepts:
- Whenever possible, stay at the highest possible level of abstraction and use constructs or a library that does the parallel work for you.
- Use your application server's inherent parallelism; for example, use the parallelism that is incorporated into a web server or database.
- Use an API to encapsulate parallelism, such as the Parallel Patterns Library. These libraries were written by experts and have been thoroughly tested; they help you avoid many of the common problems that arise in parallel programming.
- Consider the overall architecture of your application when thinking about how to parallelize it. It's tempting to simply look for the performance hotspots and focus on improving them. While this may produce some improvement, it does not necessarily give you the best results.
- Use patterns, such as the ones described in this book.
- Often, restructuring your algorithm (for example, to eliminate the need for shared data) is better than making low-level improvements to code that was originally designed to run serially.
- Don't share data among concurrent tasks unless absolutely necessary. If you do share data, use one of the containers provided by the API you are using, such as a shared queue.
- Use low-level primitives, such as threads and locks, only as a last resort. Raise the level of abstraction from threads to tasks in your applications.
- What are some of the tradeoffs between decomposing a problem into many small tasks and decomposing it into larger tasks?
- What is the maximum potential speed-up of a program that spends 10 percent of its time in sequential processing when you move it from one to four cores?
- What is the difference between parallelism and concurrency?
For More Information
If you are interested in better understanding the terminology used in the text, refer to the glossary at the end of this book.
The design patterns presented in this book are consistent with classifications of parallel patterns developed by groups in both industry and academia. In the terminology of these groups, the patterns in this book would be considered to be algorithm or implementation patterns. Classification approaches for parallel patterns can be found in the book by Mattson, et al. and at the Our Pattern Language (OPL) web site. This book attempts to be consistent with the terminology of these sources. In cases where this is not possible, an explanation appears in the text.
For a detailed discussion of parallelism on the Microsoft Windows® platform, see the book by Duffy.
Duffy, Joe. Concurrent Programming on Windows, Addison-Wesley, 2008.
Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.
OPL, Our Pattern Language for Parallel Programming ver2.0, 2010. http://parlab.eecs.berkeley.edu/wiki/patterns.
Last built: March 9, 2012