Exercise 1: Learning the Basics of the PPL (Background)

Figure 1

Developers seeking to execute their native serial applications on multiple cores must find a way to run them in parallel. One choice is to redesign the application around threads, which can be time-consuming, error-prone, and difficult to debug.

Another option is to use the Parallel Patterns Library to run loops in parallel without having to worry about the underlying threading mechanism. This approach is more appealing and should reduce development and testing time. Nonetheless, it does not make your code immune to concurrency problems; you must still pay careful attention to how your code shares data.

The Parallel Patterns Library

Figure 2

PPL stands for the Parallel Patterns Library, which provides a convenient interface for executing work in parallel. By using PPL, you can introduce parallelism without even having to manage a scheduler. Below are some of the most common patterns; a short sketch follows the list.

Task execution patterns:

parallel_invoke: To execute from 2 to 10 tasks in parallel.

parallel_for: To execute tasks in parallel over a range of integers.

parallel_for_each: To execute tasks in parallel over a collection of items in an STL container.

Synchronization patterns:

reader_writer_lock: A cooperative reader-writer lock that yields to other tasks instead of preempting.

critical_section: A cooperative mutual exclusion primitive that yields to other tasks instead of preempting.

Data sharing pattern:

combinable: A scalable object that gives each thread its own local copy, so processing can be done lock-free on the local copies, which are then combined once the parallel processing is done.
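
To make these patterns concrete, here is a minimal sketch (an illustrative assumption, not code from the original text) that uses parallel_invoke to run two independent counting tasks and combinable with parallel_for_each to build a sum without locks. It assumes Visual C++'s <ppl.h> header and the Concurrency namespace:

C++

#include <ppl.h>
#include <vector>
#include <iostream>

using namespace Concurrency;

int main()
{
    std::vector<int> values;
    for (int i = 1; i <= 1000; ++i)
        values.push_back(i);

    // parallel_invoke: run two independent tasks and wait for both.
    int evens = 0;
    int odds = 0;
    parallel_invoke(
        [&] { for (size_t i = 0; i < values.size(); ++i) if (values[i] % 2 == 0) ++evens; },
        [&] { for (size_t i = 0; i < values.size(); ++i) if (values[i] % 2 != 0) ++odds; });

    // combinable: each thread accumulates into its own lock-free
    // local copy; the copies are merged once the parallel work is done.
    combinable<int> partialSums;
    parallel_for_each(values.begin(), values.end(), [&](int v) {
        partialSums.local() += v;
    });
    int total = partialSums.combine([](int left, int right) { return left + right; });

    std::cout << evens << " evens, " << odds << " odds, sum " << total << std::endl;
    return 0;
}

Note that each parallel_invoke task writes only to its own variable, which is why no synchronization is needed in this sketch.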

A PPL Example

Figure 3

Take a look at the following code, which carries out a matrix multiplication:

C++

void MatrixMult(int size, double** m1, double** m2, double** result)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
}

The code is nothing out of this world; in fact, probably no one would multiply matrices this way in real life. The interesting thing about it, though, is that the outer loop is embarrassingly parallel: each iteration of the outer loop writes only to its own row of result, so the iterations are independent of one another.

If we were to parallelize this with raw threading APIs, it would require a lot of effort on our part: we would have to deal with partitioning the data and with data synchronization.

Here's the same scenario using the parallel_for algorithm from the Parallel Patterns Library.

C++

// Requires <ppl.h>; parallel_for lives in the Concurrency namespace.
#include <ppl.h>
using namespace Concurrency;

void MatrixMult(int size, double** m1, double** m2, double** result)
{
    parallel_for(0, size, 1, [&](int i) {
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < size; k++) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    });
}

One thing you'll notice is that we're using a new feature from C++0x called lambda support. The arguments following the parallel_for keyword specify iteration from zero to size in increment steps of 1, with the lambda supplying the loop body.

Once the code uses parallel_for, the Concurrency Runtime scheduler implicitly executes each iteration as an independent task.
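
The function's double** parameters make setup easy to get wrong, so here is a hypothetical driver (the matrix size and test values are illustrative assumptions, not from the original) showing one way to allocate the matrices, zero the result, and call MatrixMult:

C++

#include <iostream>

void MatrixMult(int size, double** m1, double** m2, double** result); // defined above

int main()
{
    const int size = 512; // illustrative size
    double** m1 = new double*[size];
    double** m2 = new double*[size];
    double** result = new double*[size];
    for (int i = 0; i < size; i++) {
        m1[i] = new double[size];
        m2[i] = new double[size];
        result[i] = new double[size](); // zero-initialized: MatrixMult accumulates with +=
        for (int j = 0; j < size; j++) {
            m1[i][j] = 1.0; // simple test values
            m2[i][j] = 2.0;
        }
    }

    MatrixMult(size, m1, m2, result);
    std::cout << result[0][0] << std::endl; // expect size * 1.0 * 2.0 = 1024

    for (int i = 0; i < size; i++) {
        delete[] m1[i];
        delete[] m2[i];
        delete[] result[i];
    }
    delete[] m1;
    delete[] m2;
    delete[] result;
    return 0;
}

Zeroing result before the call matters because the inner loop accumulates into it rather than assigning.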

Figure 4

That’s All I Have to Do to Parallelize my Code?

We all know that writing code can be a daunting task, and writing code that executes in parallel adds even more complexity: it can introduce new problems that did not exist in the serial version of your application.

Take a look at the following code snippet:

C++

for (int i = 1; i < count; ++i)
{
    if (IsPrime(i))
        ++totalPrimes;
}

The code works perfectly as is. It counts the primes below count by calling a function that returns whether the given parameter is prime; the totalPrimes variable is incremented every time a prime is found.

In order to parallelize this code using PPL, the code would look like this:

C++

parallel_for(1, count, 1, [&](int i) {
    if (IsPrime(i))
        ++totalPrimes;
});

If we were to run this code on a multi-core machine, we would discover two things:

  1. The program runs a lot faster than before.
  2. The program gives incorrect results, and they differ from run to run.

We’ll talk about data races later on, and you will understand why the parallel code is yielding incorrect results.
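
As a preview of where that discussion leads, one common fix is the combinable pattern from the list above. The following is a minimal sketch, not the article's solution; PrimeCount is a hypothetical wrapper added for illustration, and IsPrime is assumed to be the same function used in the snippets above:

C++

#include <ppl.h>

using namespace Concurrency;

bool IsPrime(int n); // same function as in the snippets above

int PrimeCount(int count) // hypothetical wrapper, for illustration only
{
    combinable<int> counts; // one lock-free counter per thread
    parallel_for(1, count, 1, [&](int i) {
        if (IsPrime(i))
            ++counts.local(); // no shared variable, so no data race
    });
    // Merge the per-thread counters once the loop has finished.
    return counts.combine([](int left, int right) { return left + right; });
}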

Let’s get started and parallelize some code with PPL!