Appendix B: Debugging and Profiling Parallel Applications

Article
03/09/2012


The Microsoft® Visual Studio® 2010 development system debugger includes two windows that assist with parallel programming: the Parallel Stacks window and the Parallel Tasks window. In addition, the Premium and Ultimate editions of Visual Studio 2010 include a profiling tool. This appendix gives examples of how to use these windows and the profiler to visualize the execution of a parallel program and to confirm that it's working as you expect. After you gain some experience at this, you'll be able to use these tools to help identify and fix problems.

The Parallel Tasks and Parallel Stacks Windows

In Visual Studio, open the parallel guide samples solution. Set the A-Dash project that is discussed in Chapter 5, "Futures," to be the startup project. Open AnalysisEngine.h and find the AnalysisEngine::DoAnalysisParallel method, which declares and configures the A-Dash workflow. Each future executes a different stage of the workflow and returns a result that is, in turn, passed to the next future in the workflow. Insert a breakpoint in the declaration of the future5 lambda.

Start the debugging process. You can either press F5 or click Start Debugging on the Debug menu. The A-Dash sample begins to run and displays its GUI. On the GUI, select the Parallel checkbox, and then click Calculate. When execution reaches the breakpoint, all tasks stop and the familiar Call Stack window appears. On the Debug menu, point to Windows, and then click Parallel Tasks. When execution first reaches the breakpoint, the Parallel Tasks window shows a task associated with each future that has been added to the workflow.

Figure 1 illustrates a case where multiple tasks are running. Recall that each task runs in a thread. The Parallel Tasks window shows the assignment of tasks to threads. The ID column identifies the task, while the Thread Assignment column shows the thread. If there is task inlining, more than one task can run in a thread, so it's possible that there will be more tasks than executing threads. The Status column indicates whether the task is running or is in a scheduled or waiting state. In some cases the debugger cannot detect that the task is waiting. In these instances, the task is shown as running. The Location column gives the name of the method that is currently being invoked. Place the cursor over the location field of each task to see a call stack pop-up window that displays only the stack frames that are part of the user’s code. To switch to a particular stack frame, double-click on it.

Double-click on a task listed in the Task column to switch the debugger to that task. On the Debug menu, point to Windows, and then click Call Stack to display the complete stack for the thread that is executing the task.

Figure 1

The Parallel Tasks window

On the Debug menu, point to Windows, and then click Parallel Stacks. In the Parallel Stacks window, from the drop-down menu in the upper-left corner, click Tasks. The window shows the call stack for each of the running or waiting tasks. This is illustrated in Figure 2.

Figure 2

The Parallel Stacks window

See the "Further Reading" section for references that discuss the Parallel Stacks window in more detail.

If you add additional breakpoints to the other futures defined in DoAnalysisParallel and press F5, the contents of the Parallel Tasks and Parallel Stacks windows change as the A-Dash workflow processes data. This is how the application should behave. However, these windows can also reveal unexpected behavior that can help you identify and fix performance problems and synchronization errors. For example, the Parallel Tasks and Parallel Stacks windows can help to identify common concurrency problems such as deadlocks. This behavior is demonstrated in the following code, taken from the ProfilerExamples sample.

void Deadlock()
{
    reader_writer_lock lock1;
    reader_writer_lock lock2;

    parallel_invoke(
        [&lock1, &lock2]() 
        { 
            for (int i = 0; ; i++)
            {
                lock1.lock();
                printf("Got lock 1 at %d\n", i);
                lock2.lock();
                printf("Got lock 2 at %d\n", i);
            }
        },
        [&lock1, &lock2]() 
        { 
            for (int i = 0; ; i++)
            {
                lock2.lock();
                printf("Got lock 2 at %d\n", i);
                lock1.lock();
                printf("Got lock 1 at %d\n", i);
            }
        }
    );
}

This code is a classic example of a deadlock. The first task acquires lock1 and then requests lock2, while the second task acquires lock2 and then requests lock1. Each task can end up holding the lock that the other one needs, which forms a cycle that eventually results in deadlock. At this point, the application stops making progress and there is no more new console output. Once the deadlock occurs, click the Break All option on the Debug menu, and open the Parallel Tasks window. You'll see something similar to Figure 3. Notice that the status of each task is Waiting instead of Running. Visual Studio also displays a warning dialog that says, "The process appears to be deadlocked (or is not running any user-mode code). All threads have been stopped." Use the Parallel Tasks window to examine each deadlocked task and its call stack to understand why your application is deadlocked. Place the cursor over the Status column to see what the task is waiting for. You can also examine the call stacks and identify the locks, or other wait conditions, that are causing the problem.

Figure 3

Parallel Tasks window showing deadlock
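Once the debugger has shown you which locks form the cycle, the usual fix is to make every task acquire them in a consistent way. The following is a minimal sketch of that idea, not code from the ProfilerExamples sample. It assumes a C++11 compiler and uses std::mutex, std::thread, and std::lock from the standard library in place of reader_writer_lock and parallel_invoke. The run_without_deadlock function and its counter are hypothetical names introduced for illustration.

```cpp
#include <mutex>
#include <thread>

// Hypothetical counterpart to Deadlock(): both workers need both locks,
// but std::lock acquires the pair with a deadlock-avoidance algorithm,
// so the order in which the workers name the locks no longer matters.
int run_without_deadlock(int iterations)
{
    std::mutex lock1;
    std::mutex lock2;
    int shared_counter = 0;   // only modified while both locks are held

    auto worker = [&](int n)
    {
        for (int i = 0; i < n; ++i)
        {
            // Create the guards without locking, then lock both together.
            std::unique_lock<std::mutex> first(lock1, std::defer_lock);
            std::unique_lock<std::mutex> second(lock2, std::defer_lock);
            std::lock(first, second);

            ++shared_counter;
            // Both locks are released when the guards go out of scope.
        }
    };

    std::thread a(worker, iterations);
    std::thread b(worker, iterations);
    a.join();
    b.join();
    return shared_counter;
}
```

Unlike the Deadlock example, this version also releases both locks at the end of every iteration, so each worker can complete its loop. Calling run_without_deadlock(1000) should return 2000.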

Breakpoints and Memory Allocation

Excessive memory copies can lead to significant performance degradation. Use the debugger to set breakpoints on your class's copy constructor and assignment operators to find unintentional copy and assignment operations.

For example, the ImagePipeline sample passes pointers of type shared_ptr<ImageInfo> along the pipeline rather than copies of the ImageInfo objects. These are too expensive to copy because they contain large bitmaps. However, ImageInfo contains an ImagePerformanceData object that is copied once per image. You can use the debugger to verify that no extra copies are being made. Here is how to do this.

Set a breakpoint inside the ImagePerformanceData assignment operator in ImagePerformanceData.h and then run the ImagePipeline example. You'll see that an ImagePerformanceData object is only assigned once, after it has passed through the pipeline. This occurs during the display phase, when the ImagePipelineDlg::OnPaint method creates a copy of the final ImagePerformanceData object so that the GUI thread has the latest data available to display.

Use the Hit Count feature to count the number of assignments. In the Breakpoints window, right-click the breakpoint and then click Hit Count on the shortcut menu. In the Hit Count dialog box, select the break when the hit count is equal to option from the When the breakpoint is hit list. Set the hit count number to 100. Run the sample. The debugger stops on the breakpoint after one hundred images have been processed. Because the number of hits equals the number of images processed so far, you know that only one copy was made for each image. If there were unintentional copies, you would reach the breakpoint after fewer images were processed.
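If you want the same count without stopping in the debugger, you can temporarily instrument the class. The following sketch assumes a C++11 compiler. PerfRecord is a hypothetical stand-in for a small per-image record such as ImagePerformanceData, and the copies counter is not part of the ImagePipeline sample.

```cpp
#include <atomic>

// Hypothetical stand-in for a small record that is copied once per image.
class PerfRecord
{
public:
    PerfRecord() = default;

    // Count every copy construction.
    PerfRecord(const PerfRecord& other) : value(other.value) { ++copies(); }

    // Count every copy assignment.
    PerfRecord& operator=(const PerfRecord& other)
    {
        value = other.value;
        ++copies();
        return *this;
    }

    // Total number of copies made so far. The counter is atomic because
    // pipeline stages may copy records from different threads.
    static std::atomic<long>& copies()
    {
        static std::atomic<long> count{0};
        return count;
    }

    int value = 0;
};
```

Compare PerfRecord::copies().load() with the number of items processed. If the counter is larger than expected, a breakpoint in the copy constructor or assignment operator shows where the extra copies come from.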

You can also declare the copy constructors and assignment operators as private to prevent objects from being unintentionally copied. The ImageInfo object has a private copy constructor and assignment operator for just this reason.
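With a C++11 compiler, deleting the copy operations serves the same purpose as declaring them private, and any attempt to copy is rejected at compile time. The following sketch illustrates the approach. LargeImage and next_stage are hypothetical names, not the ImageInfo class or a stage from the ImagePipeline sample.

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an object that is too expensive to copy.
class LargeImage
{
public:
    explicit LargeImage(std::size_t bytes) : pixels(bytes) {}

    // Any attempt to copy a LargeImage is now a compile-time error.
    LargeImage(const LargeImage&) = delete;
    LargeImage& operator=(const LargeImage&) = delete;

    std::size_t size() const { return pixels.size(); }

private:
    std::vector<unsigned char> pixels;
};

static_assert(!std::is_copy_constructible<LargeImage>::value,
              "LargeImage must not be copyable");

// A pipeline stage passes the pointer along. Only the shared_ptr is
// copied; the image itself is never duplicated.
std::shared_ptr<LargeImage> next_stage(std::shared_ptr<LargeImage> image)
{
    return image;
}
```

Because the stages exchange shared_ptr objects, every stage works with the same underlying image.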

The Concurrency Visualizer

The profiler included in the Visual Studio 2010 Premium and Ultimate editions includes the Concurrency Visualizer. This tool shows how parallel code uses resources as it runs: how many cores it uses, how threads are distributed among cores, and the activity of each thread. This information helps you to confirm that your parallel code is behaving as you intended, and it can help you to diagnose performance problems. This appendix uses the Concurrency Visualizer to profile the ImagePipeline sample from Chapter 7 on a computer with eight logical cores.

The Concurrency Visualizer has two stages: data collection and visualization. In the collection stage, you first enable data collection and then run your application. In the visualization stage, you examine the data you collected.

You first perform the data collection stage. To do this, you must run Visual Studio as an administrator because data collection uses kernel-level logging. Open the sample solution in Visual Studio. Click Start Performance Analysis on the Visual Studio Debug menu. The Performance Wizard begins. Click Concurrency, and then select Visualize the behavior of a multithreaded application. The next page of the wizard shows the solution that is currently open in Visual Studio. Select the project you want to profile, which is ImagePipeline. Click Next. The last page of the wizard asks if you want to begin profiling after the wizard finishes. This check box is selected by default. Click Finish. The Visual Studio profiler window appears and indicates that it's currently profiling. The ImagePipeline sample begins to run and opens its GUI window. To maximize processor utilization, select the Load Balanced option, and then click Start. In order to collect enough data to visualize, let the Images counter on the GUI reach at least a hundred. Then click Stop Profiling in the Visual Studio profiler window.

During data collection, the performance analyzer takes frequent data samples (known as snapshots) that record the state of your running parallel code. The analyzer also uses Event Tracing for Windows (ETW) to collect all context switches and some other relevant events. Each data collection run writes several data files, including a .vsp file. A single data collection run can write files that are hundreds of megabytes in size. Data collected during separate runs of the same program can differ because of uncontrolled factors such as other processes running on the same computer.

You can run the visualization stage whenever the data files are available. You don't need to run Visual Studio as an administrator to do this. There are several ways to begin visualization. Unless you've changed the default, visualization automatically starts as soon as data collection finishes. Alternatively, you can simply open any .vsp file in Visual Studio. If you select the first option, you'll see a summary report after the data is collected and analyzed. The summary report shows the different views that are available. These include a Threads view, a CPU Utilization view, and a Cores view.

Figure 4 shows the Threads view. Each task is executed in a thread. The Concurrency Visualizer shows the thread for each task (remember that there may be more than one task per thread because of inline tasks).

Figure 4

Threads view of the Concurrency Visualizer

The Concurrency Visualizer screens contain many details that may not be readable in this book's figures, which are reduced in size and are not in full color. The full color screen shots from this appendix are available on the CodePlex site at http://parallelpatternscpp.codeplex.com/.

The Threads View also contains a couple of other useful features. Clicking on different segments of an individual thread activity timeline in the upper part of the screen allows you to see the current stack for that activity segment. This allows you to associate code with individual activity segments. Clicking on the different activity types in the Visible Timeline Profile on the bottom left displays a Profile Report. This allows you to discover which functions in your application are involved in each activity type for the currently selected portion of the timeline. You can also click the Demystify feature to get further help on what different colors mean and on other features of the report.

Figure 5 illustrates the CPU Utilization view. The CPU Utilization view shows how many logical cores the entire application (all tasks) uses, as a function of time. On the computer used for this example, there are eight logical cores. Other processes not related to the application are also shown as an aggregated total named Other Processes. For the application process, there's a graph that shows how many logical cores it's using at each point in time. To make the processes easier to distinguish, the area under each process's graph appears in a different color (some colors may not be reproduced accurately in this figure). Some data points show a fraction rather than an integer such as 0, 1, or 2, because each point represents an average calculated over the sampling interval.

Figure 5

Detail of CPU Utilization view
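The fractional values in the CPU Utilization graph follow directly from the averaging. The following sketch shows the arithmetic. It is an illustration only, not the profiler's implementation, and average_cores_in_use is a hypothetical helper.

```cpp
#include <vector>

// Average number of logical cores in use over one sampling interval.
// Each element is the whole number of cores that were busy at one instant.
double average_cores_in_use(const std::vector<int>& busy_cores)
{
    if (busy_cores.empty())
        return 0.0;

    long total = 0;
    for (int n : busy_cores)
        total += n;

    return static_cast<double>(total) / busy_cores.size();
}
```

For example, if the application used 1, 2, 2, and 1 cores at four instants within an interval, the plotted value for that interval is 1.5.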

Figure 6 illustrates the Cores view. The Cores view shows how the application uses the available cores. There is a timeline for each core, with a color-coded band that indicates when each thread is running (a different color indicates each thread). In this example, between 0 and 2.5 on the time scale, the application is idle with little or no work running on any core. Between 2.5 and 12, the pipeline is filled and more tasks are eligible to run than there are cores. Several threads alternate on each core and the table beneath the graph shows that there is some context switching across cores.

Figure 6

Detail of Cores view

Figure 7 illustrates the Threads view. The Threads view shows how each thread spends its time. The upper part of the view is a timeline with color-coded bands that indicate different types of activity. For example, red indicates when the thread is synchronizing (waiting for something). In this example, the Threads view initially shows the Main Thread and Worker Thread 4708. Later, more threads are added to the thread pool as the pipeline starts processing images. Not all threads are visible in the view pictured here. (You can hide individual threads by right-clicking the view and then clicking Hide.) This view also shows that the main thread is active throughout; the green-brown color indicates user interface activity. Other threads show segments of green for execution, red for synchronization, and yellow for preempted threads.

Figure 7

Detail of Threads view

After 2.5 on the timeline, some pipeline threads execute frequently but others execute almost continuously. There are more pipeline threads than cores, so some pipeline threads must alternate between running and being preempted.

Scenario Markers

You can use the Scenario library to mark different phases of complex applications. The following code shows an example. (The Scenario library is a free download on the MSDN® Code Gallery website. For more information, see "Scenario Marker Support" on MSDN at https://msdn.microsoft.com/en-us/library/dd984115.aspx.)

#include "Scenario.h"

// ...

shared_ptr<Scenario> myScenario = shared_ptr<Scenario>(new Scenario());
myScenario->Begin(0, L"Main Calculation");

// Main Calculation Phase...

myScenario->End();

These markers will be displayed in all three views. They appear as a band across the top and bottom of the timeline. Move the mouse over the band to see the marker name. You can see an example of this later on in Figure 11. Don't use too many markers as they can easily overwhelm the visualization and make it hard to read. The tool may hide some markers to improve visibility. You can use the zoom feature to increase the magnification and see the hidden markers for a specific section of the view.
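If the marked phase can exit early or throw an exception, the call to End might be skipped. One way to guard against this is a small RAII wrapper. The following is a sketch only. ScopedMarker is a hypothetical helper that is not part of the Scenario library, and it assumes that the marker type provides Begin and End with the argument types used in the example above. RecordingMarker is a stand-in that lets the sketch run without the library.

```cpp
#include <string>
#include <vector>

// Calls Begin in the constructor and End in the destructor, so the marker
// is closed even if the code in between returns early or throws.
// Marker is assumed to provide Begin(int, const wchar_t*) and End().
template <typename Marker>
class ScopedMarker
{
public:
    ScopedMarker(Marker& marker, const wchar_t* name) : marker_(marker)
    {
        marker_.Begin(0, name);
    }

    ~ScopedMarker() { marker_.End(); }

    ScopedMarker(const ScopedMarker&) = delete;
    ScopedMarker& operator=(const ScopedMarker&) = delete;

private:
    Marker& marker_;
};

// Hypothetical stand-in for the Scenario class. It records the calls it
// receives instead of emitting profiler events.
struct RecordingMarker
{
    std::vector<std::wstring> events;

    void Begin(int, const wchar_t* name)
    {
        events.push_back(std::wstring(L"begin:") + name);
    }

    void End() { events.push_back(L"end"); }
};
```

To use the wrapper, declare a ScopedMarker at the start of the block that contains the phase you want to mark. The marker ends when the block exits.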

Visual Patterns

The patterns discussed in this book focus primarily on ways to express potential parallelism. However, there are other types of patterns that are useful in parallel development. The human mind is very good at recognizing visual patterns, and the Concurrency Visualizer takes advantage of this. You can learn to identify some common visual patterns that occur when an application has specific performance problems. This section describes visual patterns that will help you to recognize and fix oversubscription, lock contention, and load imbalances.

More examples of common patterns indicating poorly behaved parallel applications are discussed on MSDN. See the "Further Reading" section for more information. You can also access this content from a link in the Hints tab in the Threads view. As seen earlier, Figure 4 shows the link in this tab.

Oversubscription

Oversubscription occurs when there are more threads than logical processors to run them. Oversubscription can cause poor performance because of the high number of context switches, each of which takes some processing time and which can decrease the benefits provided by memory caches.

Oversubscription is easy to recognize in the Concurrency Visualizer because it causes large numbers of yellow regions in the profiler trace. Yellow means that a thread was preempted (the thread was switched out). When profiled, the following code yields a quintessential depiction of oversubscription.

void Oversubscription()
{
  task_group tasks;

  for (unsigned int p = 0; p < (GetProcessorCount() * 4); p++)
  {
    tasks.run([]()
    {
      // Oversubscribe in an exception safe manner
      scoped_oversubcription_token oversubscribe;
      // Do work 
      delay(1000000000);
    });
  }
  tasks.wait();
}

Figure 8 illustrates the Threads view from one run of this function on a system with eight logical cores. It produces a very distinct pattern.

Figure 8

Threads view that shows oversubscription
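For CPU-bound work, a simple way to avoid this pattern is to start no more workers than there are logical cores. The following sketch assumes a C++11 compiler and uses std::thread::hardware_concurrency rather than the GetProcessorCount helper from the sample. The worker_count function is a hypothetical name.

```cpp
#include <algorithm>
#include <thread>

// Number of workers to start for `requested` independent, CPU-bound jobs.
// The result never exceeds the number of logical cores.
unsigned int worker_count(
    unsigned int requested,
    unsigned int logical_cores = std::thread::hardware_concurrency())
{
    // hardware_concurrency() may return 0 when the value cannot be determined.
    if (logical_cores == 0)
        logical_cores = 1;

    return std::min(requested, logical_cores);
}
```

On a system with eight logical cores, worker_count(32) returns 8, so the 32 jobs would be shared among eight workers instead of running on 32 competing threads. Tasks that spend most of their time blocked are a different case, and deliberate oversubscription can be appropriate for them.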

Lock Contention and Serialization

Contention occurs when a thread attempts to acquire a lock that is held by another thread. In many cases, this results in the second thread blocking until the lock is released. The Threads view of the Concurrency Visualizer depicts blocking in red. Frequent blocking is often a sign of decreased performance. In extreme cases, an application can be fully serialized by one or more locks, even though multiple threads are being used. You can see this in Figure 9, where the narrow bright green areas representing execution only appear in one worker thread at a time. In such cases, the performance may be much worse than in a conventional, serial version of the application.

The following LockContention method produces a lock convoy, which leads to significant lock contention and serialization of the program even though multiple threads are in use. A lock convoy is a performance problem that occurs when multiple threads contend for a frequently shared resource.

void LockContention()
{
  task_group tasks;
  reader_writer_lock lock;

  for (unsigned int p = 0; p < GetProcessorCount(); p++)
  {
    tasks.run([&lock]() 
    {
      for (int i = 0; i < 10; i++)
      {
        // Do work
        delay(100000);

        // Do protected work
        lock.lock();
        delay(100000000);
        lock.unlock();
      }
    });
  }
  tasks.wait();
}

Figure 9 illustrates the pattern this code produced in the Threads view of the Concurrency Visualizer.

Figure 9

Threads view showing lock convoy

Clicking on one of the Synchronization (red) blocks highlights it and also indicates which thread blocked it. You can examine the stack of the unblocking thread by clicking the Unblocking stack tab. This allows you to see the call stack of the thread that unblocked the blocked thread. In this example, the maroon block on thread 1428 indicates that it was waiting for thread 7400, which was executing a call to the ThreadProxy::SwitchTo method. After thread 1428 is unblocked it continues to execute and is displayed as a green block in the Threads view. You can click on any red synchronization block to examine its unblocking stack, if one is available. When examining call stacks, remember that you can quickly view the source by double-clicking on the stack frame. This allows you to associate segments on the Threads view with the code that was executing during that segment.
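After you have identified the lock that serializes the application, a common remedy is to shorten the protected section so that most of the work happens outside the lock. The following sketch shows the idea with the C++11 standard library (std::mutex and std::thread) rather than task_group and reader_writer_lock. The function name and the summation workload are hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each worker accumulates into a private variable and takes the lock once,
// only to publish its result. Compare this with LockContention, where the
// long-running work is done while the lock is held.
long sum_with_short_critical_sections(unsigned int workers, int iterations)
{
    std::mutex lock;
    long total = 0;   // protected by lock

    std::vector<std::thread> threads;
    for (unsigned int w = 0; w < workers; ++w)
    {
        threads.emplace_back([&lock, &total, iterations]()
        {
            // Do the bulk of the work on private data, without the lock.
            long local = 0;
            for (int i = 0; i < iterations; ++i)
                local += i;

            // Hold the lock only long enough to merge the result.
            std::lock_guard<std::mutex> guard(lock);
            total += local;
        });
    }

    for (std::thread& t : threads)
        t.join();

    return total;
}
```

With four workers and ten iterations each, the function returns 180. In a profile of code structured this way, you would expect the red synchronization segments to be much shorter than the green execution segments.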

Load Imbalance

A load imbalance occurs when work is unevenly distributed across all the threads that are involved in a parallel operation. Load imbalances mean that the system is underutilized because some threads or cores are idle while others finish processing the operation. The visual pattern produced by a load imbalance is recognizable in several of the Concurrency Visualizer views. The following code creates a load imbalance.

void LoadImbalance()
{
  const int loadFactor = 20;

  parallel_for_fixed(0, 100000, [loadFactor](int i)
  {
    // Do work
    delay(i, loadFactor);
  });
}

Although most of the parallelism support in the PPL uses dynamic partitioning to apportion work to a pool of tasks, the parallel_for_fixed method included in concrt_extras.h, which is available at https://code.msdn.microsoft.com/concrtextras, uses fixed partitioning of iterations, without range stealing. The code example shown here, when run on a system with eight logical cores, causes elements [0, 12499] to be processed by one task, elements [12500, 24999] to be processed by another task, and so on, through the entire range. The body of the workload iterates from 0 to the current index value, which means that the amount of work to be done is proportional to the index. Workers that process lower ranges will have significantly less work to do than the workers that process the upper ranges. As a consequence, the tasks for each sub-range finish at different times. This load imbalance is an inefficient use of the processors. Figure 10, which is the CPU Utilization view in the Concurrency Visualizer, illustrates this.

Figure 10

CPU view that shows a load imbalance
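You can estimate the size of the imbalance before you profile. The following sketch reproduces the fixed partitioning arithmetic for 100,000 iterations and eight workers, and it assumes, as described above, that the cost of iteration i is proportional to i. Chunk and fixed_partition are hypothetical names and are not part of concrt_extras.h.

```cpp
#include <cstdint>
#include <vector>

struct Chunk
{
    int first;            // first index in the chunk (inclusive)
    int last;             // last index in the chunk (inclusive)
    std::uint64_t work;   // sum of the indices, proportional to run time
};

// Splits [0, count) into `workers` equal, contiguous chunks.
// Assumes that count is a multiple of workers, as it is in the example.
std::vector<Chunk> fixed_partition(int count, int workers)
{
    std::vector<Chunk> chunks;
    const int size = count / workers;

    for (int w = 0; w < workers; ++w)
    {
        Chunk c = { w * size, (w + 1) * size - 1, 0 };
        for (int i = c.first; i <= c.last; ++i)
            c.work += static_cast<std::uint64_t>(i);
        chunks.push_back(c);
    }
    return chunks;
}
```

For fixed_partition(100000, 8), the first chunk covers [0, 12499] with 78,118,750 units of work, and the last chunk covers [87500, 99999] with 1,171,868,750 units. The last worker therefore has about fifteen times as much to do as the first.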

When the LoadImbalance method begins to execute, all eight logical cores on the system are being used. However, after a period of time, usage drops as each core completes its work. This yields a stair-step pattern, as threads are dropped after they complete their portion of the work. The Threads view confirms this analysis. Figure 11 illustrates the Threads view.

Figure 11

Threads view that shows a load imbalance

The Threads view shows that after completing a portion of the work, the worker threads were idle while they waited for the Main Thread, 7136, to complete the remaining work. The example is a console application and, in this case, the runtime used the main thread to run one of the tasks.
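Dynamic partitioning avoids the stair-step pattern because a worker that finishes early takes on more work instead of becoming idle. The following sketch illustrates one simple form of dynamic partitioning, in which workers claim small blocks of indices from a shared atomic counter. This is not how the PPL implements parallel_for, which uses range stealing. The sketch assumes a C++11 compiler, and the function name and summation workload are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sums the indices in [0, count). Each worker repeatedly claims the next
// block of `block_size` indices until none remain, so fast workers simply
// claim more blocks than slow ones.
long long sum_indices_dynamically(int count, unsigned int workers, int block_size)
{
    std::atomic<int> next(0);          // first index of the next unclaimed block
    std::atomic<long long> total(0);

    auto worker = [&]()
    {
        for (;;)
        {
            const int begin = next.fetch_add(block_size);
            if (begin >= count)
                break;

            const int end = std::min(begin + block_size, count);
            long long local = 0;
            for (int i = begin; i < end; ++i)
                local += i;            // stand-in for the per-iteration work
            total += local;
        }
    };

    std::vector<std::thread> threads;
    for (unsigned int w = 0; w < workers; ++w)
        threads.emplace_back(worker);
    for (std::thread& t : threads)
        t.join();

    return total.load();
}
```

The block size is a trade-off. Smaller blocks balance the load more evenly, but they increase the number of updates to the shared counter.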