November 2010

Volume 25 Number 11

Task-Based Programming - Scalable Multithreaded Programming with Tasks

By Ron Fosner | November 2010

PCs are evolving away from faster and faster processors and toward more and more cores. That means increases in latent processing power are available at relatively low cost. But it also means programming these systems to take advantage of that latent processing power is more challenging. To use all of those multiple processors, you need to delve into the world of parallel processing.

There are many different ways to distribute your work across multiple cores. In the October issue of MSDN Magazine (msdn.microsoft.com/magazine/gg232758), I introduced you to some basic concepts of multithreaded programming and showed how to add threaded execution into your code with OpenMP and thread pools. I also demonstrated how to use tools in Visual Studio 2010 to measure core and thread utilization as a measure of how well a threading implementation improves the performance of your app.

In this article, I’ll look at a more sophisticated multithreading technique called task-based programming. Tasks let you spread your application’s work across some or all of the CPU cores that are available. With a bit of thoughtful programming you can minimize and even eliminate any data dependency or time-synchronization constraints.

Building on what you learned from my previous article, I’ll take you through a more sophisticated multithreaded application that uses tasks. Tasks enable the application to scale itself up to the number of cores available and the amount of work that needs to be accomplished.

A Mouse in a Maze

When I sat down to write this article, I tried to come up with a problem that was complicated to parallelize but where it was still easy to visualize what was occurring. I hit upon the idea of writing a solver for a 2D maze. While at first glance it may seem a bit trivial, it’s actually quite challenging to implement correctly—I know because it took me three tries to get it right.

Figure 1 shows a simple, serial maze solver in action. The solution to the maze is just a long, sinuous pathway with many branches that lead to dead ends. Not surprisingly, I called the solution algorithm a “mouse.” The mouse will observe its current cell and try to go forward to the next cell. If it hits a wall it will try to go left. If it can’t go left it will try to go right. If it can’t go in any of those directions, it will mark its current path as a dead end and back up.

Figure 1 Single-Threaded Maze

When the mouse moves into a new cell it makes note of any pathways that it didn’t take. For example, if a mouse is able to move forward, but it was also able to go left, then it will remember that cell and that direction. So as the mouse moves down a hallway it will take note of doorways on either side and push these on a stack. When the mouse reaches a dead end it pops one of these locations, backs up, and heads off in the saved direction. Eventually it will reach the endpoint through sheer persistence.

After a mouse has backed up, it’s necessary to prevent it from trying a direction it has already searched. I do this by marking a cell as visited when a mouse has successfully moved into a cell. So when a mouse attempts to move in a new direction, it first checks to see that there’s no wall. If there’s no wall, it checks to see if the cell that it’s considering moving into has been visited before. It will discard any movement into cells that have already been visited. This is illustrated by the grayed out pathways in Figure 1.
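
In code, one step of the serial mouse might look something like the following sketch. This is not the article’s actual code; the Maze, Direction and BranchPoint types and the helper functions are hypothetical stand-ins, but the shape matches the forward/left/right preference, the saved branch points and the visited check just described.

struct BranchPoint { int cell; Direction dir; };

void Mouse::step(Maze& maze) {
  // Prefer forward, then left, then right, relative to the current heading.
  Direction tries[3] = { forward(), left(), right() };
  int chosen = -1;
  for (int i = 0; i < 3; ++i) {
    int next = maze.neighbor(m_cell, tries[i]);
    if (maze.isOpen(m_cell, tries[i]) && !maze.visited(next)) {
      if (chosen < 0) {
        chosen = i;                          // first workable direction wins
      } else {
        BranchPoint bp = { m_cell, tries[i] }; // note the doorway not taken
        m_branches.push(bp);
      }
    }
  }
  if (chosen >= 0) {
    m_heading = tries[chosen];
    m_cell = maze.neighbor(m_cell, m_heading);
    maze.markVisited(m_cell);                // never retry this cell
  } else if (!m_branches.empty()) {
    // Dead end: pop a saved location and head off in the saved direction.
    BranchPoint b = m_branches.top();
    m_branches.pop();
    m_cell = b.cell;
    m_heading = b.dir;
  }
}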

The Mouse Solution

This solution is pretty easy to visualize and, as a result, it’s easy to understand the logic behind the search. It’s also somewhat hypnotic to watch—at least for a little while. A simplified flow chart for a serial mouse is shown in Figure 2.

Figure 2 Flow Chart of Mouse Actions

While it’s very simple to understand the problem conceptually, there are some unconstrained elements that make it challenging. First, you have no idea how long a mouse will run before it hits a dead end. Second, you have no idea how many branches it will discover along the way.

This problem is doubly interesting when you try running it on multiple threads. The most straightforward way to make this problem multi-core friendly is to make multiple mice and give each mouse its own thread—which is the approach I’ve taken. As an added bonus, this enhances the visualization because I can switch the active mouse color as a new thread takes over.

In fact, it was a bit more challenging than I originally anticipated. Once I had a working single-threaded version, I made the mistake of trying to adapt that version into a multithreaded one. This was my single biggest architectural mistake. I went through three different revisions before I stepped back and reworked the architecture to be task-friendly.

I won’t cover my failed efforts except to say that they all focused on trying to optimize performance by not copying data, minimizing memory use and streamlining access to the data shared by the various threads. Essentially, in my original design, I had a global stack that I could lock, and I’d push the locations of the untaken branches onto the stack as I ran across them. When a thread finished its work and went in search of more work to process, I would lock the stack (to prevent simultaneous access by another thread), pop off a location and direction, then unlock the stack. While this worked to some degree, it was clunky and forced me to consider adding new data to each mouse to keep track of the path taken, so that each mouse would have a path from the starting point to its current location.

When you find yourself adding in mid-state information in a multithreaded program to make up for some partial state you’re starting with, or special casing behavior, or in general doing something that’s not generic to your tasking code, then it’s time to rethink your design.

What I ended up doing was placing the current path information in each mouse and making this part of the mouse’s initialization information. When I reached a branch point, I created a mouse data structure and initialized it from the current mouse’s information—thus creating a clone mouse that, for example, goes left when the original goes right. The clone inherits the original’s memory of the path traveled so far. The only thing different in each mouse is a counter that keeps track of the number of mice created—this is how the mice get assigned a color.

I also made a single global copy of the maze that contains the individual cells’ state information and did not place any locks around the writing of that state information. This was a simplification I accepted as a tradeoff—a mouse will be working on a path by itself, and it always checks whether a cell is marked as visited before it moves into the cell.

Because there are no locks around the global cell data, it’s possible, though unlikely, that two mice might both start down a path at the same time. This could happen either through a duplicate stack entry or through a looped pathway where two mice run into each other. In either case, I accept the fact that a mouse might be happily running down a path, its thread might get suspended, and when it resumes it discovers that some other mouse has crossed its path. In this case the mouse just backs up as if it has hit a wall, because the mouse that crossed it is following a path successfully. The suspended mouse missed its chance and did some extra work.
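
Put together, a branch point in the multithreaded version might look something like the sketch below. Again, this is illustrative rather than the article’s actual code: MouseTask, g_maze and g_threadPool are hypothetical names (and cleanup of finished clones is omitted), but the shape matches the design just described: clone at a branch, no stack, no locks on the cell data, and back off quietly if another mouse got there first.

void MouseTask::doWork(void* /*context*/) {
  while (!g_maze.isExit(m_cell)) {
    Direction tries[3] = { forward(), left(), right() };
    int chosen = -1;
    for (int i = 0; i < 3; ++i) {
      int next = g_maze.neighbor(m_cell, tries[i]);
      if (g_maze.isOpen(m_cell, tries[i]) && !g_maze.visited(next)) {
        if (chosen < 0) {
          chosen = i;
        } else {
          // Branch point: clone myself, path history and all, and let the
          // clone explore the other doorway as a brand-new task.
          MouseTask* clone = new MouseTask(*this);
          clone->m_heading = tries[i];
          g_threadPool.submit(clone);
        }
      }
    }
    if (chosen < 0)
      return;  // dead end, or another mouse crossed our path first
    m_heading = tries[chosen];
    m_cell = g_maze.neighbor(m_cell, m_heading);
    g_maze.markVisited(m_cell);  // unsynchronized write, by design
    m_path.push_back(m_cell);    // the inherited memory of the path so far
  }
}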

If there were a lot more processing involved in marking the cells, then I might be more reluctant to accept that some useless work might get done. Eliminating locks on shared data simply means I have to make the algorithm a bit more robust. Designing it to handle such situations means there’s less room to make mistakes. The most likely source of errors in multithreaded programs is some form of locking mistake, such as a race condition or making assumptions about when data or counters get updated.

If you can make your algorithms robust enough to handle data that might be slightly out of date and able to recover from those situations, then you’re on your way to making a resilient multithreaded architecture.

Task Queues and Dependencies

Windows Vista introduced new scheduling algorithms and new primitives that are the underlying support for a number of Microsoft .NET Framework 4 features. One of these features is the Task Parallel Library (TPL), which provides quite a few of the more common parallel programming algorithms, including fork/join, parallel for, work stealing and task inlining. If you’re coding in unmanaged C++, you can take advantage of Intel Threading Building Blocks (TBB) or the Microsoft Parallel Patterns Library (PPL).

These libraries come with classes that provide multithreaded programming support for jobs and tasks. They also have many thread-safe container classes. These classes have been tested and optimized for performance, so unless you’ve got a deep-seated need to write a customized variation for some reason, you’ll be better off using some tested, robust code.

Because this is an introductory article on threading and tasks, and you might benefit from some insight into how these threading libraries work, I wrote my own set of wrappers around two of the new features in Windows Vista: the thread pool (ThreadPool) and the Slim Reader/Writer Lock (SRWLock). Both are low-cost mechanisms; the SRWLock is designed for making data thread-safe in situations where you’ve got one writer and many readers, and the data usually isn’t locked for a long time. Note that the goal of this article is to step you through how I’ve chosen to implement a thread pool that consumes tasks—tasks that can have dependencies. To illustrate the basic mechanisms I’ve taken some liberties with the code to make it easier to understand. The code works, but you’re better off choosing one of the threading libraries for any real implementation.
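
For reference, the raw SRWLock API itself is tiny; there’s no kernel object to create or destroy, and acquiring an uncontended lock is very cheap. A minimal usage sketch (not part of the maze code) looks like this:

#include <windows.h>

SRWLOCK g_lock = SRWLOCK_INIT;  // static initialization; InitializeSRWLock also works
int     g_sharedValue = 0;

void Reader() {
  AcquireSRWLockShared(&g_lock);      // many readers can hold the lock at once
  int copy = g_sharedValue;
  ReleaseSRWLockShared(&g_lock);
  // ... work with copy ...
}

void Writer(int newValue) {
  AcquireSRWLockExclusive(&g_lock);   // a single writer excludes everyone else
  g_sharedValue = newValue;
  ReleaseSRWLockExclusive(&g_lock);
}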

For my maze algorithm I chose to use the most generic of the multithreading algorithms: tasks that can have dependencies (implemented using SRWLock) and a task scheduler (using the OS’s ThreadPool). This is the most generic because basically a task is just some bit of work that needs to get done and the task scheduler takes care of communication with the OS to get the task running on a thread. They’re generic because it’s possible to create tasks out of any code that needs to get executed and throw it into a ThreadPool.

The challenge is creating tasks that do enough work to make the overhead of getting them scheduled worthwhile. If you’ve got some big monolithic tasks that need to get executed, then feel free to create some threads and execute the code on them. On the other hand, there are many applications where there are groups of tasks that need to get done, some serially, some not. Sometimes you’ll know beforehand how much work needs to get done; other times—particularly fetching or reacting to user input or some communication—you’re just polling for something that will only occasionally require extended processing. This is easily handled by the generic nature of ThreadPool and its associated task queue.

Customizing the ThreadPool

To give you a clear understanding of how to build a tasking system on top of the ThreadPool, I really only need to use three of its interfaces:

CreateThreadpool();
CloseThreadpool();
TrySubmitThreadpoolCallback();

The first two are just the bookkeeping functions. TrySubmitThreadpoolCallback basically takes a pointer to a function to execute plus some context variables. You call this function repeatedly to load up the thread pool with tasks to execute and it will serve them in a first-in-first-out (FIFO) manner (no guarantees here).
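
Strung together, and with a callback environment added so submissions go to your private pool rather than the process-wide default pool, the raw calls look roughly like this (error handling and a proper wait for the work to drain are omitted):

#include <windows.h>

VOID CALLBACK MyWork(PTP_CALLBACK_INSTANCE instance, void* context) {
  // ... do whatever the context describes ...
}

void Example() {
  PTP_POOL pool = CreateThreadpool(NULL);
  SetThreadpoolThreadMinimum(pool, 4);
  SetThreadpoolThreadMaximum(pool, 4);

  // Route submissions to this pool instead of the default pool.
  TP_CALLBACK_ENVIRON env;
  InitializeThreadpoolEnvironment(&env);
  SetThreadpoolCallbackPool(&env, pool);

  for (int i = 0; i < 100; ++i)
    TrySubmitThreadpoolCallback(MyWork, NULL /* context */, &env);

  // ... wait here until all the submitted work has finished ...

  DestroyThreadpoolEnvironment(&env);
  CloseThreadpool(pool);
}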

To make it work with my tasks, I wrote a short wrapper around ThreadPool that lets me customize the number of threads in the thread pool (see Figure 3). I also wrote a submit function that takes care of tracking the context variables associated with the task.

Figure 3 ThreadPool Wrapper

class ThreadPool {
  PTP_POOL m_Pool;
public:
  static VOID CALLBACK WorkCallback(
    PTP_CALLBACK_INSTANCE instance,
    void* Context);
  ThreadPool(void);
  ~ThreadPool(void);
  
  static unsigned GetHardwareThreadsCount();
  // create thread pool that has optimal 
  // number of threads for current hardware
  bool create() { 
    DWORD tc = GetHardwareThreadsCount(); 
    return create(tc,tc); 
  }
  bool create(
    unsigned int minThreads, 
    unsigned int maxThreads = 0);
  void release();
  bool submit(ITask* pWork);
};

The interesting thing to note is that you usually only want to create as many software threads as there are hardware threads. Sometimes this might be one less if you’ve got a main thread off doing something else. Note that the number of threads you create has nothing to do with the size of the task queue—it’s perfectly legitimate to create a ThreadPool with four threads and then submit hundreds of tasks. It’s probably not a good idea, however, to take a single serial job and break it up into thousands of tasks—that’s an indication your tasks are too finely grained. If you find yourself in this situation, then you’ll just have to create a task whose job is to schedule the next 100 tasks—or, if you’re using one of the tasking libraries, create a work-stealing, inlining or follow-on task.
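
For completeness, here’s one plausible way the wrapper’s sizing calls could map onto the OS. This is a sketch under the assumption that the pool is sized with the Vista thread pool APIs; it isn’t necessarily the article’s actual implementation, and submit would additionally need a callback environment (as shown earlier) so that TrySubmitThreadpoolCallback targets m_Pool rather than the default pool.

// Illustrative implementations for the Figure 3 wrapper (assumptions,
// not the article's code).
unsigned ThreadPool::GetHardwareThreadsCount() {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  return si.dwNumberOfProcessors;  // logical processors visible to the process
}

bool ThreadPool::create(unsigned int minThreads, unsigned int maxThreads) {
  m_Pool = CreateThreadpool(NULL);
  if (m_Pool == NULL)
    return false;
  if (maxThreads == 0)
    maxThreads = minThreads;       // default: a fixed-size pool
  SetThreadpoolThreadMaximum(m_Pool, maxThreads);
  return SetThreadpoolThreadMinimum(m_Pool, minThreads) != FALSE;
}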

My Task class (note the capital T, which I’ll use to denote my task wrapper class as shown in Figure 4) has the capability of being dependent upon other Tasks. Since the OS’s ThreadPool doesn’t have this capability, I’ll need to add it. Thus, when a Task starts executing on a thread, the first thing it does is check to make sure it has no outstanding dependencies. If it does, the thread executing the task blocks by waiting on the SRWLock. The Task will only get rescheduled when the SRWLock is freed.

Figure 4 Task Wrapper

class ITask {
protected:
  vector<ITask*>    m_dependentsList;      // Those waiting for me
  ThreadSafe_Int32  m_dependencysRemaining;// Those I'm waiting for
  // Used to block this Task while it still has outstanding dependencies
  SRWLOCK m_SRWLock; 
  CONDITION_VARIABLE m_Dependencies;
  void SignalDependentsImDone();
  ITask(); 
  virtual ~ITask();
public:  
  void blockIfDependenciesArePending();
  void isDependentUpon(ITask* pTask);
  unsigned int queryNumberDependentsRemaining();
  // A parent Task will call this
  void clearOneDependency();
  virtual void doWork(void*) = 0;
  virtual void* context() = 0;
};

Again, let me point out that this is not code I’d want to see in a non-academic application, but putting the block here lets you see exactly what’s happening. The OS will notice the block and schedule another Task. Eventually—unless there’s a programming error—the blocked task will unblock and get rescheduled to run.

Generally, it’s not a good idea to schedule a Task that will immediately get suspended. Because ThreadPool’s task queue is by and large FIFO, you’ll want to schedule tasks that have no dependencies first. If I were writing this for optimal performance rather than illustrative purposes, I’d add a layer that submitted only dependency-free tasks to the thread pool. I can get away without that layer here because blocked threads eventually get swapped out. In any case, you need some thread-safe way of signaling that a task is done, and SRWLocks work well for this. Incorporating them into my Task class is more natural than writing specialty code to handle each case.
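
Given the members declared in Figure 4, the blocking preamble might be implemented along these lines. This is a sketch based on my reading of the class, with the condition variable m_Dependencies assumed to guard the dependency count and ThreadSafe_Int32 assumed to compare like a plain integer:

void ITask::blockIfDependenciesArePending() {
  AcquireSRWLockExclusive(&m_SRWLock);
  // Sleeping atomically releases the lock and reacquires it on wake,
  // so the count is always examined while the lock is held.
  while (m_dependencysRemaining > 0)
    SleepConditionVariableSRW(&m_Dependencies, &m_SRWLock, INFINITE, 0);
  ReleaseSRWLockExclusive(&m_SRWLock);
}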

By design a Task can have any number of Task dependencies. Ordinarily, you want to reduce or eliminate any waiting if you can, and using a tool such as Visual Studio Task List or Intel Graphics Performance Analyzers will help you track these down. The implementation I present here is a very basic tasking system and should not be used for code that requires high performance. It’s good sandbox code for getting your multithreaded feet wet, but you should look toward TBB, TPL or PPL for more efficient code.

The ThreadPool will call the WorkCallback function, which executes some prefix code that will query the Task data structure, like so:

VOID CALLBACK ThreadPool::WorkCallback(
  PTP_CALLBACK_INSTANCE instance, void* pTask) {
  ITask * pCurrentTask = (ITask*) pTask;
  pCurrentTask->blockIfDependenciesArePending();
  pCurrentTask->doWork(pCurrentTask->context());
  pCurrentTask->SignalDependentsImDone();
}

The basic operation is:

  1. The ThreadPool loads up the WorkCallback from its internal task queue.
  2. The code queries the Task to see if there are any dependencies (parent dependencies). If dependencies are found, block execution.
  3. Once there are no dependencies, call doWork, which is the actual part of the Task code that’s unique per Task.
  4. Upon returning from doWork, clear any child dependencies on this Task.

The important thing to note is that there’s some preamble and postscript code residing in my ThreadPool class that checks and clears dependencies in the Task. This code runs on each thread, but each run has a unique Task associated with it. Once the preamble code has run, the actual Task work function gets called.
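
Pulling the pieces together, using the wrapper might look like the fragment below. MyTask is a hypothetical ITask subclass, not something from the article; note that the independent Tasks go into the pool first, per the earlier advice about the FIFO queue.

// Hypothetical usage: taskC must wait for taskA and taskB.
ThreadPool pool;
pool.create();                    // one software thread per hardware thread

MyTask taskA, taskB, taskC;
taskC.isDependentUpon(&taskA);
taskC.isDependentUpon(&taskB);

// Submit the independent Tasks first so a pool thread isn't immediately
// parked waiting on taskC's dependencies.
pool.submit(&taskA);
pool.submit(&taskB);
pool.submit(&taskC);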

Creating a Custom Task Class Wrapper

The basic job of a task is to provide some context and a function pointer to get the task executed by the thread pool. Because I wanted to create Tasks that were capable of having dependencies, I needed some code to handle the bookkeeping of tracking, blocking and unblocking dependencies (see Figure 4).

When a Task is created, you can tell it that it’s dependent upon other Tasks by passing it a pointer to each Task it has to wait for. A Task contains a counter of the number of Tasks it has to wait for, plus an array of pointers to the Tasks that are waiting for it.

When a Task has no remaining dependencies, its work function gets called. After the work function returns, the Task loops through all the Task pointers in its dependents array, calling clearOneDependency on each one, which decrements that dependent Task’s count of remaining dependencies. When a count drops to zero, the SRWLock is released, unblocking the thread that was executing the waiting Task, and its execution continues.
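
In code, the bookkeeping side might look roughly like this. As with the earlier fragment, it’s a sketch that matches the Figure 4 declarations rather than a copy of the article’s implementation:

void ITask::isDependentUpon(ITask* pTask) {
  // I wait on pTask; pTask remembers to notify me when it finishes.
  ++m_dependencysRemaining;
  pTask->m_dependentsList.push_back(this);
}

void ITask::clearOneDependency() {
  AcquireSRWLockExclusive(&m_SRWLock);
  bool nowReady = (--m_dependencysRemaining == 0);
  ReleaseSRWLockExclusive(&m_SRWLock);
  if (nowReady)
    WakeAllConditionVariable(&m_Dependencies);  // unblock the waiting thread
}

void ITask::SignalDependentsImDone() {
  for (size_t i = 0; i < m_dependentsList.size(); ++i)
    m_dependentsList[i]->clearOneDependency();
}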

That’s the basic overview of how I designed my Task and ThreadPool classes. I ended up designing it this way because the OS’s native thread pool doesn’t have quite this behavior and I wanted to give you some code to play with where you’re in control of the dependency mechanism. I originally had a much more complicated wrapper around ThreadPool that included a priority queue, but realized that I was unnecessarily complicating things and that a simple child-parent dependency relationship was all I needed. If you really want to take a look at customizing a thread scheduler, take a look at Joe Duffy’s article, “Building a Custom Thread Pool (Part 2): A Work Stealing Queue,” at tinyurl.com/36k4jcy.

I tend to write code pretty defensively. I write simple implementations that work, then refactor and increase functionality with frequent checks along the way to make sure I haven’t messed anything up. Unfortunately, I also tend to write code that passes around references to other variables—which is a bad thing in a multithreaded program if you’re not careful.

I was undone more than once when converting my single-threaded maze solution code to a multithreaded one. I finally had to go through and make sure I was passing copies of data when there was a chance that the value would be modified in a thread.

I also tried to be conservative by starting with a single-threaded version that only kept the current mouse’s path. That introduced the problem of keeping track of yet more state data. As mentioned earlier, I solved this by making clone mice that had all their parents’ data. I also chose to eliminate the prioritized ThreadPool wrapper and any locking on the global maze cell data. In all likelihood I introduced some additional work, but I also eliminated many of the sources of errors that could have occurred by greatly simplifying my code.

The ThreadPool wrapper and the Task class worked exactly as designed. I utilized these classes in some unit tests to make sure they exhibited the behavior I was expecting. I also instrumented them using the Intel Graphics Performance Analyzers tasking tool, which has a feature that lets you dynamically tag threads and examine which pieces of code are executing on a particular thread. This visualization of thread execution let me verify the threads were executing, blocking and being rescheduled, just as I expected.

Reworking the mice to be clones of their parents ended up greatly simplifying the bookkeeping required by the simulation, because each mouse was self-contained. The only shared data I ended up requiring was the global cell array, which indicates whether a cell has been visited. I can’t emphasize enough that having a good visualization of how your tasks are being scheduled is of paramount importance.

The Mouse Pool

I chose the maze problem because it illustrated a number of issues that can crop up in converting a single-threaded algorithm to a multithreaded one. The biggest surprise to me was that, once I bit the bullet and rewrote the algorithm to eliminate some of the bookkeeping I had been attempting to maintain, the maze-solving algorithm suddenly became much simpler. In fact, it became simpler than the single-threaded algorithm because there was no need to keep a stack of branch points—they were simply spawned off into a new mouse.

By design, each mouse was a clone of its parent, so each mouse had an inherited path tracing back to the spawn point. The mice didn’t know it, but I purposely wrote the maze-generation algorithms to try to select the furthest path possible. No sense in making it easy on them. The test program allows selection between a number of maze-generation algorithms, ranging from the original algorithm—which generates long corridors with occasional branches eventually leading to dead ends—to algorithms that are very branchy with short corridors. Presented with these different mazes, the difference in behavior from the single-threaded solution to the multithreaded solution can be quite dramatic.

When I applied a multithreaded approach to the original maze algorithm, I reduced the search duration by 48 percent on a four-CPU system. That’s because this algorithm produces a lot of long corridors, so there aren’t many opportunities to spawn off additional mice (see Figure 5).

Figure 5 Solving the Original Long-Maze Algorithm

Figure 6 shows a maze with many more branches and, with it, a lot more opportunities to spawn off mice and have them search simultaneously. With this short-path maze, the multithreaded solution reduces the time to find a solution by 95 percent because so many more tasks can run at once.

Figure 6 Multiple Mice Make It Easy to Solve the Maze in Much Less Time

This just serves to illustrate that some problems are more amenable to being broken apart than others. I feel compelled to point out that the maze program is designed to be visually interesting—that is, it lets you watch the progression of the mice and step through them. If I were interested in simply finding the solution to the maze in the shortest amount of time, the rendering and the mice would be decoupled—but then it wouldn’t be as fun to watch.

Loose Threads

One of the biggest problems I see when I help folks try to make their applications run faster is a hesitation to try multithreading. I understand that hesitation: when you add multithreading, you suddenly add a layer of complexity in an area most programmers don’t have a lot of experience with.

Unfortunately, when you shy away from multithreading, you end up leaving a good portion of computing power unutilized.

I’ve covered the basics of how a tasking system works and given you the basics of how to go about breaking up large jobs into tasks. However, the current approach, while good, is not the best practice for getting the maximum performance across current and future multi-core hardware. If you’re interested in getting further performance gains on any hardware that your application may be run on, then you’ll need to architect your application with this in mind.

The best way of getting maximum, scalable performance in your application is to take advantage of one of the existing parallel libraries and see the best way to fit your application’s needs into the various architectures that these libraries provide. Unmanaged, real-time or performance-critical applications are usually best served by using one of the interfaces provided in TBB, while managed apps have a wider variety of multithreading options in the .NET Framework 4. In either case, choosing one of these threading APIs will determine the overall structure of your application and how you design the tasks to work and coexist.

In a future article I’ll take a look at actual implementations that take advantage of these techniques, and demonstrate how to construct applications around these various threading libraries so you can design your own implementations to use them.

In any event, you’ve now got the basic knowledge to try out some basic threading approaches, and you should take a look at the threading libraries to start to figure out how best to design your future applications. While multithreading can be challenging, using one of these scalable libraries is the gateway to taking advantage of maximum performance of hardware, both current and future.


Ron Fosner has been optimizing high-performance applications and games on Windows for years and is starting to get the hang of it. He’s a graphics and optimization expert at Intel and is happiest when he sees all CPU cores running flat out. You can reach him at Ron@directx.com.

Thanks to the following technical experts for reviewing this article: Aaron Coday, Orion Granatir and Brad Werth