Export (0) Print
Expand All

Appendix A: The Task Scheduler and Resource Manager

The Parallel Patterns Library (PPL) and Asynchronous Agents Library rely on features of a lower-level component known as the Concurrency Runtime. The Concurrency Runtime contains a task scheduler and a resource manager. Both are documented on MSDN®, but in this appendix you’ll find an overview of their functionality and learn of some of the motivations that shaped their designs.

The Concurrency Runtime enables you to declare sources of potential parallelism in your applications. It is designed both to use existing parallel hardware architectures and to take advantage of future advances in those architectures. In other words, your applications will continue to run efficiently as platforms evolve. The task scheduler determines where and when to run your application’s tasks, and it uses cooperative scheduling to provide load balancing across cores. The resource manager prevents the different parts of your application, as well as any libraries they use, from contending for parallel computing resources. It also helps to ensure the best use of resources such as hardware caches. The problems addressed by the Concurrency Runtime are far from trivial. It uses advanced algorithms and benefits from many person-years of experience to optimize parallel performance on multicore architectures.

Figure 1 illustrates how the components in the Concurrency namespace relate to each other. (These components are shown in shaded blue.)

Gg663535.4DB08458F68D8C6212B0125A6EFE82F7(en-us,PandP.10).png

Figure 1

Relationships among libraries and run-time components

The following table lists the header files for each component of the Concurrency Runtime. The information in these files gives you a convenient way to learn the capabilities of a component.

Component

C++ Header File

Parallel Patterns Library

ppl.h

Asynchronous Agents Library

agent.h

Data structures

concurrent_vector.h

concurrent_queue.h

Task scheduler

concrt.h

Resource manager

concrtrm.h


The Concurrency Runtime is a user-mode layer that sits on top of the operating system. It can manage large numbers of cores, something that is not feasible for an application to do by itself. The Concurrency Runtime is part of the C++ runtime that is included in Microsoft® Visual Studio® 2010 development system. No additional libraries are required.

The Concurrency Runtime abstracts some of the operating system’s processor management APIs and provides implementations that automatically use the features of each version of the operating system. For example, on a 64-bit version of Microsoft Windows® 7 operating system, you can scale your application to 256 cores while automatically assigning tasks to cores in a way that respects Non-Uniform Memory Architecture (NUMA) boundaries. (This would not be easy to program.) This same application will run on Windows Vista®, which supports 64 cores.

Gg663535.note(en-us,PandP.10).gifNote:
All class names and global function names mentioned in this appendix are from the Concurrency namespace, unless otherwise specified.

Resource Manager

The resource manager allocates processor cores among the application’s task schedulers and ensures that related tasks execute as "locally" as possible. Local execution includes taking advantage of the memory access characteristics of NUMA nodes and hardware caches. The resource manager is a singleton instance of the ResourceManager class.

The resource manager is a singleton instance of the ResourceManager class. It allocates processor resources to the application’s task schedulers and helps execute tasks as "locally" as possible.

Most programmers won’t invoke the resource manager directly, but it’s helpful to understand how the resource manager works.

Why It’s Needed

The resource manager is especially helpful when there are multiple scheduler instances within a single application. In these situations, the resource manager allows task scheduling components, including the Scheduler class as well as components written by third parties, to coexist without contending for cores. For example, if your application has two schedulers, both with default scheduler policies, the resource manager initially divides the available cores on your computer equally between them. The resource manager makes the division of cores along NUMA node boundaries or processor packages, if possible.

The resource manager is also helpful when an application uses parallel libraries from more than one vendor. The resource manager allows these libraries to cooperatively share resources. Microsoft encourages vendors who write libraries for concurrency to build their components on top of the Concurrency Runtime so that all the libraries can take advantage of this feature.

The resource manager also provides dynamic resource management, which adjusts the level of concurrency across components based on core utilization.

How Resource Management Works

The main abstraction of the resource manager is a virtual processor object that is provided by the IVirtualProcessorRoot class. You can think of a virtual processor object as a token that grants a scheduler the right to start (or resume) one thread. The core that runs the thread is chosen based on a processor affinity mask that is specified by the virtual processor object. A processor affinity mask is a bit mask that specifies which core(s) the operating system can use to run a particular thread. Cores mean all hardware-supported execution resources, including hardware threads of simultaneous multithreading ("hyperthreading") architectures.

A virtual processor object grants a scheduler permission to start (or resume) one thread. The core that executes the thread is chosen based on a specific processor affinity mask.

You can think of a virtual processor object as similar to a concert ticket that has a section assignment but no specific seat number. When the ticket is eventually presented at the door it gives the bearer the right to sit in any seat in the designated section of the concert hall. Similarly, the virtual processor object is a "ticket" that will allow a worker thread to run on any core that meets the requirements of the processor affinity mask.

Gg663535.note(en-us,PandP.10).gifNote:
To manage threads, the resource manager provides instances of the IThreadProxy class, which a scheduling component should associate with objects that provide the IExecutionContext interface.

The following diagram shows how this works.

Gg663535.6D0CDA8E79723800AA8FEC6B44E7EB76(en-us,PandP.10).png

Figure 2

Virtual processor objects

Figure 2 shows that the cores of the computer are grouped into processor packages and NUMA nodes. The resource manager knows how the NUMA nodes and processor packages are laid out in the computer. For a particular process, several scheduler objects ask the resource manager for specific numbers of virtual processor objects that will be used by their worker threads. Some schedulers may want a higher degree of concurrency than other schedulers. Some schedulers want as much concurrency as possible.

A processor node is an abstraction created by the resource manager to represent NUMA nodes, processor packages or (potentially) other kinds of groupings.

The resource manager attempts to satisfy the requests of all scheduler objects, given the fact that there are a fixed number of cores on the computer. In the end, the resource manager gives some virtual processor objects to each scheduler. The virtual processor objects do not issue particular core IDs; instead they specify a processor node, which is an abstraction used by the resource manager to represent NUMA nodes, processor packages or other kinds of groupings of execution resources. In any case, each virtual processor object represents the right for one worker thread to run on that processor node, even though the particular core will be chosen later.

The resource manager maps processor nodes to a set of cores. After a scheduler receives its virtual processor objects, which worker thread will use a particular virtual processor object is still unknown. As it runs, the scheduler associates (and disassociates) worker threads with virtual processor objects. At any given moment, there is one virtual processor object per running thread. The virtual processor object uses the processor affinity mask to determine which cores can be selected by the operating system. The operating system decides the specific core that will be assigned to a worker thread. The worker thread runs on the chosen core.

After the thread begins to run, the virtual processor object can’t be reused with another thread unless a cooperative context switch occurs. A cooperative context switch happens when a worker thread becomes blocked as a result of a cooperative blocking operation. The blocked thread is disassociated from its virtual processor object, and another worker thread becomes associated with the virtual processor object and is allowed to run. If a scheduler wants more threads to run at the same time, the scheduler must ask the resource manager for additional virtual processor objects. The number of virtual processor objects assigned to a scheduler equals that scheduler’s level of concurrency, or the number of worker threads that can run at the same time. The scheduler is free to create as many threads as it wants, but it can only allow as many threads as it has virtual processor objects to run at any given time.

The net effect of the interaction between a scheduler and its virtual processor objects is to fix the scheduler’s level of concurrency and to cause its threads to run on specific cores.

Dynamic Resource Management

The resource manager uses dynamic resource management to help schedulers cooperate in their use of cores. At run time, the resource manager dynamically monitors the use of execution resources. It knows when virtual processor objects are idle and when they are busy. If it detects use patterns that indicate consistent underutilization of a core, the resource manager might reassign that core to another scheduler. Dynamic resource management only occurs when there are multiple schedulers.

Techniques such as reassigning cores allow the resource manager to place execution resources where they are most needed. As a consequence, you may notice that the allocation of virtual processor objects changes over time.

Oversubscribing Cores

If a set of cores has more virtual processor objects than the number of cores (where all hardware threads are considered to be cores), the cores are said to be oversubscribed. Normally, the default scheduling policy partitions the available cores without creating more virtual processor objects than there are cores on the computer. In other words, the resource manager avoids oversubscription. When cores are busy with compute-intensive operations, oversubscription results in contention for system caches and reduced throughput. However, there are exceptions to this rule.

For example, the resource manager deliberately uses oversubscription if the aggregate minimum level of concurrency required by all of the application’s schedulers exceeds the number of cores. Another situation is when the scheduling policy option TargetOversubscriptionFactor is set to a value greater than one. This policy allows a scheduler to ask for oversubscribed cores at startup.

A third example is when a scheduler requests additional concurrency (in the form of additional virtual processor objects) from the resource manager as the program runs. Finally, a special case arises when the resource manager’s dynamic resource management feature observes that a scheduler is underutilizing one of its cores. In this case the resource manager might temporarily oversubscribe the underutilized core with work from other schedulers without removing the corresponding virtual processor object from the first scheduler.

Querying the Environment

While it is unlikely that you will need to program directly against the resource manager, there are a few functions in the concrtrm.h header file that you may find useful. These functions retrieve information about the operating environment. The GetProcessorCount and GetProcessorNodeCount functions are particularly useful. They return the number of cores (counting all hardware threads as cores) and the number of processor nodes on your computer. If you see that the number of nodes is greater than one, you can deduce that you are on a machine that has some concept of locality for cores, such as a machine with NUMA or multiple processor packages. The exact meaning of a "processor node" is determined by the resource manager and depends on what kind of computer you have.

Kinds of Tasks

Before discussing the task scheduler it's important that you understand that there are two types of tasks in the Concurrency Runtime. The task scheduler provides its own task abstraction, known as lightweight task, which should not be confused with the tasks provided by PPL.

Lightweight Tasks

The primary interface to lightweight tasks is the ScheduleTask method of the ScheduleGroup class. The ScheduleTask method allows you to add a new, pending lightweight task to the queue of a particular schedule group. You can also invoke the CurrentScheduler::ScheduleTask static method if you want the runtime to choose a schedule group for you. Schedulers and schedule groups are described in the "Task Schedulers" section of this appendix.

Gg663535.note(en-us,PandP.10).gifNote:
Do not confuse lightweight tasks provided by the Concurrency Runtime with tasks in PPL. Most programmers will not create lightweight tasks directly.

You can use lightweight tasks when you don’t need any of the cancellation, exception handling, or task-wait features of PPL tasks. If you need to wait for a lightweight task to finish, you must implement the wait with lower-level synchronization primitives. If you want to handle exceptions, you need to use STL’s exception_ptr class and then rethrow captured exceptions at a time of your choosing.

The Concurrency Runtime’s interface to lightweight tasks is similar to the Windows thread pool in that it only uses function pointers for its work functions. However, the Concurrency Runtime’s sample pack includes, as a convenience, a functor-based way to schedule lightweight tasks that uses wrapper classes named task_scheduler and schedule_group. If you use lightweight tasks, you will probably find the sample pack’s interface to be a more convenient approach.

Most programmers won’t use lightweight tasks. Instead, they will use PPL to create tasks. Nonetheless, lightweight tasks can be useful for programming new control structures. Also, you can use lightweight tasks to migrate from threads to tasks. For example, if you have an existing application that uses calls to Windows APIs to create threads, and you want the application to use a task scheduler object as a better thread pool, then you may want to investigate lightweight tasks. However, it’s recommended that most programmers should use PPL for task-based applications.

Lightweight tasks are used internally by the messaging block and agent classes of the Asynchronous Agents Library.

The Asynchronous Agents Library uses lightweight tasks to implement the messaging block and agent classes that are described in Chapter 7, "Pipelines."

Tasks Created Using PPL

When you use any of the features of PPL to create tasks, such as the parallel_for, parallel_for_each or parallel_invoke functions, or the task_group::run or structured_task_group::run methods, you create a kind of task that is distinct from a lightweight task. This appendix refers to such tasks as PPL tasks.

Task Schedulers

The Concurrency Runtime’s task scheduler is similar to a thread pool. In fact, you can think of the task scheduler as a special type of thread pool that is very good at optimizing large numbers of fine-grained work requests. Unlike many thread pool implementations, the task scheduler does not automatically create a new worker thread when a request arrives and all existing threads are busy. Instead, the scheduler queues the new task and executes it when processor resources become available. The goal is to keep all of the processor cores as busy as possible while minimizing the number of context switches in the operating system and maximizing the effectiveness of hardware caches.

Task scheduling is implemented by the Scheduler class. Scheduler instances use worker threads that are associated with the virtual processor objects provided by the resource manager whenever the worker threads are running. There is at most one running thread per virtual processor object granted by the resource manager.

Instances of the Context class represent the threads known to a scheduler instance, along with additional per-thread data structures that are maintained by the scheduler for its own use.

Task schedulers represent a special kind of thread pool that is optimized for fine-grained tasks.

Managing Task Schedulers

The Concurrency Runtime provides one default scheduler per process. The default scheduler is created when you first make calls into the Concurrency Runtime. You can set scheduler policy for the current context by creating instances of the Scheduler class and attaching them to the currently executing thread.

Alternatively, you can invoke the SetDefaultSchedulerPolicy static method of the Scheduler class at the beginning of your application, before the default scheduler has been created, to specify how the default scheduler will work.

There is one default scheduler per process, but you can create additional schedulers per context. You can also set the scheduling policy of the default scheduler.

If your application uses multiple scheduler objects, you might want to construct and attach all of the schedulers at startup. Allocating schedulers at the outset avoids the overhead that occurs when schedulers are created while work is in progress. This overhead includes reallocating system resources and reassociating threads to processor nodes. For example, if you have one scheduler that does a parallel_for loop, it will, by default, use all cores on your machine. If halfway through the run of the parallel_for operation you add a scheduler for use by agent-based code, the first scheduler may be asked to reduce its concurrency as it runs. It’s more efficient to allocate the division of cores between the two schedulers at the start of the application.

Creating and Attaching a Task Scheduler

Use the factory method Scheduler::Create to instantiate a scheduler object. The method accepts a reference to a SchedulerPolicy object as its argument. This object contains configuration settings for your scheduler. After creating the scheduler, you attach it to the current context to activate it. The following code is an example of how to do this.

#include <concrt.h>
#include <concrtrm.h>
#include <stdio.h>
#include <windows.h>
#include <iostream>

using namespace ::Concurrency;
using namespace ::std;

int main()
{
  SchedulerPolicy myPolicy(2, MinConcurrency, 2, 
                              MaxConcurrency, 2);

  Scheduler* myScheduler = Scheduler::Create(myPolicy);
    
  cout << "My scheduler ID: " << myScheduler->Id() << endl;

  cout << "Default scheduler ID: " 
       << CurrentScheduler::Get()->Id() << endl;

  myScheduler->Attach();
    
  cout << "Current scheduler ID: " 
       <<  CurrentScheduler::Get()->Id() << endl;

The Scheduler::Attach method sets the scheduler that will be used by the current context. The new scheduler in this example alerts the resource manager that it requires a concurrency level of two. You must call the Attach method in the thread whose scheduler you want to replace. The CurrentScheduler class's Get method returns the current context’s currently attached scheduler object, or the default scheduler if no user-provided scheduler was previously attached to the current context.

As a convenience, you can use the Create method of the CurrentScheduler class to instantiate a new scheduler object and attach it to the current context with a single call.

Detaching a Task Scheduler

The Detach method of the CurrentScheduler class reverses the effect of a previous call to the Scheduler::Attach method. Attaching and detaching schedulers are stack-based operations. If you call Detach, the current scheduler is popped from the stack and the previous scheduler is restored as the current scheduler. The following code shows how to use the Detach method.

  CurrentScheduler::Detach();
    
  cout << "Current scheduler ID: " 
       << Concurrency::CurrentScheduler::Get()->Id() << endl;

Destroying a Task Scheduler

A scheduler object has reference/release semantics that are independent of the attach/detach process. Attach/detach logically contains a reference/release pair.

You can detach the most recently attached scheduler at any time. The runtime will not destroy the detached scheduler until all references to it have been released and all of its tasks have completed. If you need to be notified when a detached scheduler is eventually destroyed, create a Windows event and register it with the scheduler. The correct shutdown sequence for a scheduler object is shown in the following code.

HANDLE schedulerShutdownEvent = 
          CreateEvent(NULL, TRUE, FALSE, L"Shutdown Scheduler");
myScheduler->RegisterShutdownEvent(schedulerShutdownEvent);
myScheduler->Release();
WaitForSingleObject(schedulerShutdownEvent, INFINITE);

This code blocks the current context until all other users of the scheduler are also finished. Waiting for the scheduler to shut down might be necessary before unloading a DLL, for example.

Scenarios for Using Multiple Task Schedulers

Multiple task schedulers can be helpful when you need quality-of-service guarantees. They can also provide a better user experience by ensuring that the application is responsive even if long-running parallel workloads are executing. For example, you can use specific scheduler instances to reserve cores for high-priority activities such as audio processing.

Message passing is another example of a high-priority activity. If your cores are very busy and message delivery does not have high priority, tasks that depend on receiving data from messaging buffers might not be run, which could cause bugs. A solution is to send messages on a dedicated "message propagation" scheduler. For example, if you wanted to guarantee that a cancellation message gets processed immediately, you could create a new scheduler for it.

Another case is when you’re running on a multi-node host in a server farm. For performance reasons, you may not want parallel loops to span multiple processor nodes. Using multiple schedulers allows you to limit your parallel loops to a single processor node.

Multiple task schedulers can separate UI work from background processing. You can have a foreground scheduler in the UI that does some work when you click a button and a background scheduler for work that’s ongoing.

Implementing a Custom Scheduling Component

The Scheduler class is not extensible. You cannot use it as a base class. This means that you must use the runtime's Scheduler class for scheduling work that is created with PPL and the Asynchronous Agents Library.

However, if you are writing your own parallel programming library, you can create your own scheduling component by implementing the IScheduler and IExecutionContext interfaces.

The Scheduling Algorithm

The Scheduler class uses a queue-based approach to run tasks. New tasks wait in queues until the scheduler assigns processing resources to them in cooperation with the resource manager and the operating system. Queues emphasize overall throughput at the cost of scheduling fairness for individual tasks. The motivating idea is that a set of related tasks represents the decomposition of a larger problem. Therefore, solving the overall problem in the fastest possible way is the primary goal, even if some tasks have to wait longer than others to run.

Queue-based scheduling emphasizes overall throughput at the cost of "fairness" for individual tasks.

You can contrast the queue-based approach to scheduling with preemptive multitasking that gives time slices to all running threads. Multitasking emphasizes the responsiveness of each thread.

The material in this section can help you understand the performance characteristics that you'll observe when you use the runtime’s Scheduler class. Be aware that the scheduling algorithm described here represents an implementation choice. Future versions of the Concurrency Runtime might optimize task execution differently.

Gg663535.note(en-us,PandP.10).gifNote:
The behind-the-scenes behavior described in this section applies to Visual Studio 2010 SP1. There's no guarantee that future releases of the runtime won't behave differently.

Schedule Groups

A scheduler object has more than one queue of pending tasks. Pending tasks are tasks that haven't started to run. Internally, the Scheduler class uses instances of the ScheduleGroup helper class for its queues of pending tasks. For example, a scheduler instance may want separate queues based on the division of cores into processor nodes.

When you add a task to the scheduler, you normally allow the scheduler to choose a schedule group for that task.

The scheduler object maintains several kinds of queues for each of its schedule groups. This is illustrated in Figure 3.

Gg663535.FD5BCB9BF22C3A28B96C3E5A215DE38F(en-us,PandP.10).png

Figure 3

Data structures of schedule groups

Each schedule group contains one queue for lightweight tasks (LWTs). Each schedule group also contains multiple work-stealing queues of pending PPL tasks. There is one work-stealing queue of pending PPL tasks in the schedule group for each execution context. A work-stealing queue is a mutable list with a private end accessible to a single context and a public end that can be accessed by many contexts. Entries in a work-stealing queue can be added or removed from the private (or local) end with minimal synchronization. Entries can also be added or removed from the public (or shared) end but with higher synchronization costs. Only the context that the schedule group has associated with the work-stealing queue can add to and remove entries from the private end.

Each schedule group contains one main queue of runnable contexts and can also maintain a cache of runnable contexts for each virtual processor object. A runnable context is a thread that was previously interrupted by one of the cooperative blocking operations, but is now unblocked and ready to resume its work.

Adding Tasks

A lightweight task is implicitly created by messaging blocks' operations and by starting an agent instance. You can also use the ScheduleTask method of the Scheduler class, the CurrentScheduler class or the ScheduleGroup class to create a lightweight task. When you create a new lightweight task, the task is added to the schedule group you specify or to a schedule group chosen by the scheduler. The new task is added to the end of the schedule group’s queue of lightweight tasks.

When you create a PPL task, the new task is added to the private end of the work-stealing queue of the current context. The current context may correspond to a thread that is not one of the scheduler’s worker threads. In this case, the scheduler associates a work-stealing queue with the current context in one of its schedule groups.

Use the Attach and Detach methods that were described in the "Managing Task Schedulers" section of this appendix to control which scheduler is the current scheduler for a given context.

Running Tasks

A virtual processor object can become available in one of two ways. One way is that the thread that is currently running on the virtual processor object becomes blocked by one of the cooperative blocking operations. The other way is that the thread completes its current task. When a virtual processor object becomes ready to accept new work, the scheduler uses heuristics to prioritize the possible next steps.

As a general rule, the scheduler tries to resume runnable contexts in preference to starting new pending tasks. A runnable context is a worker thread that previously ran and became blocked by one of the cooperative blocking operations, but is now unblocked and ready to be resumed. The scheduler takes runnable contexts first from the virtual processor object’s local cache of runnable contexts in last in first out (LIFO) order and then from the queue of runnable contexts in the current schedule group in first in first out (FIFO) order. It may also look for runnable contexts in the caches of other virtual processor objects and schedule groups. The LIFO cache of runnable contexts improves the likelihood that the data in hardware caches will be relevant. The most recently unblocked context is resumed first, and it is run on the same virtual processor object as the operation that caused the context to become unblocked. You can configure the size of the cache of unblocked tasks with the LocalContextCacheSize schedule policy key.

If there are no runnable contexts, the scheduler looks for lightweight tasks in the schedule group’s queue of lightweight tasks. It takes the next available lightweight task in FIFO order. Lightweight tasks are usually used to implement message passing and are therefore considered to be of higher priority than PPL tasks.

If there are no lightweight tasks pending in the current schedule group, the scheduler looks for tasks from the public ends of the work-stealing queues of the current schedule group. It looks for tasks in work-stealing queues first within the queues associated with the current processor node, and then across processor nodes.

A special situation arises when a thread enters the task_group::wait or the structured_task_group::wait methods. Normally, wait is considered to be one of the cooperative blocking operations. However, if the call to wait requires the current thread to wait for pending tasks that are in the current context’s work-stealing queue, then the current context will not block immediately. Instead, PPL reuses the current context to run the locally queued tasks using an optimization known as inline execution. The pending tasks are taken from the private end of the current context's work-stealing queue in LIFO order.

The scheduler is unaware of inline execution because the thread that performs inline execution will not become cooperatively blocked by the wait method until there are no more tasks in the local work-stealing queue that would satisfy the wait condition. Of course, if inline execution has satisfied the wait condition, the wait method will return and the current thread will continue running its top-level task. For more information about inlining, see "Tasks That Are Run Inline" in this appendix.

The Scheduler class implements two variations of its scheduling algorithm, which can be selected by setting the SchedulingProtocol scheduling policy key. The two scheduling approaches are enhanced locality mode (the default) and forward progress mode.

Enhanced Locality Mode

In enhanced locality mode, the scheduler tries to execute related tasks in batches. The scheduler attempts to process all pending tasks of the first schedule group until the group contains no more tasks to run. Then, the scheduler moves on to the pending tasks of the next schedule group, and so on, eventually starting over with the first schedule group. This order is not strict; there are also heuristics to avoid thread starvation that can occur if a task that would unblock a currently blocked task is not allowed to run. These heuristics may periodically give preference to tasks of other schedule groups. This can occur if the current schedule group has many pending tasks.

Enhanced locality scheduling processes schedule groups in batches so that related tasks execute together.

In enhanced locality mode the runtime assumes that tasks in a schedule group share memory. In order to derive the most benefit from hardware caches, the scheduler attempts to run the tasks of a schedule group as close together chronologically as possible, even if it means that tasks that were added earlier to some other schedule group experience delays. For example, if tasks 1, 2, and 3 are created and assigned to schedule groups A, B, and A, respectively, and the scheduler starts processing schedule group A, then it is likely that task 3 will be run before task 2, even though task 3 was created after task 2. In other words, the scheduler attempts to run all tasks of schedule group A (which contains tasks 1 and 3) before proceeding to schedule group B (which contains task 2).

Due to the processor node affinity provided by the division into schedule groups, the tasks will be spatially close as well. When the next schedule group’s turn eventually comes, all of its tasks will be executed the same way.

Forward Progress Mode

In forward progress mode, the scheduler executes one pending task from each of its schedule groups in round-robin fashion. Unlike enhanced locality scheduling, there are no caches of runnable contexts per virtual processor object; however, runnable contexts are prioritized over pending tasks.

Forward progress scheduling rotates through schedule groups, executing one task from each.

With forward progress scheduling, the runtime assumes that keeping each schedule group from stalling matters more than running related tasks as a batch. This might occur, for example, in discrete event simulation scenarios where you want all schedule groups to progress by one step before updating a GUI. Forward progress scheduling is a less commonly used approach for task-based applications.

Task Execution Order

The scheduler does not make guarantees about the order in which queued tasks will be executed, and your application should make no assumptions about when a particular task will be allowed to run. In the current implementation, pending tasks are generally processed in FIFO order unless they are inlined. However, given the interaction of the various queues of pending tasks and the optional round-robin or batch-oriented processing of schedule groups, it’s not possible to predict the order of execution. If you need tasks to be run in a particular order, then you should use one of the task coordination techniques described in this book. Examples include the task_group::wait method and the receive function of messaging blocks.

Tasks That Are Run Inline

As was mentioned earlier in the section on "Running Tasks," it is possible that the scheduler will run a task in the thread context of another task that is waiting for that task to complete. For example, if you invoke a task group’s wait method from within a task context, the runtime knows that the current context will be idle until the tasks of that task group have completed. The runtime can therefore optimize its use of worker threads by reusing the blocked thread to run one or more of the pending tasks. Inlining is a good example of potential parallelism: when dependent tasks run on a machine with many cores, they run in parallel. On a machine with just one available core, they act like sequential function calls.

Inlining allows you to avoid deadlock due to thread starvation. Unlike other kinds of cooperative blocking operations, the task_group::wait function provides a hint to the scheduler about which pending tasks will unblock the current context. Inlined execution is a good way to elevate the scheduling priority of tasks that will unblock other tasks.

Only tasks that were created by the current context are eligible to be inlined when you invoke their task group’s wait method.

You can provide an explicit hint that inlining should occur by calling the task_group::run_and_wait method. This method creates a new task and immediately processes it with inline execution.

Using Contexts to Communicate with the Scheduler

The Context class allows you to communicate with the task scheduler. The context object that corresponds to the currently executing thread can be accessed by invoking the static method Context::CurrentContext.

The CurrentContext method can be called from application threads as well as from a task scheduler's worker threads. The behavior of the context object’s methods may differ, depending on whether the current context object corresponds to a worker thread or whether it is an application thread. For example, if you call a cooperative blocking operation from within an application thread (that is, a thread that is not one of the current scheduler's worker threads), the thread will block without allowing one of the scheduler's runnable contexts to resume as is the case when a worker thread blocks.

Gg663535.note(en-us,PandP.10).gifNote:
If you get the context object of a thread that is not one of the scheduler’s worker threads, be aware that the behavior of some context-dependent operations will be different than would be the case for the scheduler’s worker threads.

Debugging Information

You can get access to useful debugging information from contexts, such as the integer ID for the current context, the schedule group ID and the virtual processor object ID. The information is provided by the static methods Id, ScheduleGroupId, and VirtualProcessorId of the Context class.

Querying for Cancellation

You can detect that a task is being cancelled even if you don’t have a reference to its task group object by invoking the Context class’s static IsCurrentTaskCollectionCanceling method. Checking for cancellation is useful if you are about to start a long-running operation.

Interface to Cooperative Blocking

A task can communicate with the scheduler by invoking one of the cooperative blocking operations that were described in Chapter 3, "Parallel Tasks." See "Coordinating Tasks with Cooperative Blocking" in Chapter 3 for a list of the built-in operations.

The Context class provides an additional, lower-level interface to cooperative blocking through its Block and Unblock methods. PPL and the Asynchronous Agents Library use the Block and Unblock methods to implement all of their cooperative blocking operations.

Gg663535.note(en-us,PandP.10).gifNote:
Most programmers should use the higher-level built-in task coordination mechanisms of PPL and messaging blocks rather than implementing their own synchronization primitives.

You can use Block and Unblock to coordinate your tasks from within custom synchronization primitives. If you program your own parallel synchronization primitives with the Block and Unblock methods, you must be very careful not to disrupt the runtime’s internal use of these operations. Block and Unblock are not nesting operations, but the order of operations is flexible. If Unblock precedes Block, the Block operation is ignored. You can unintentionally interact with one of PPL’s internal operations by allowing an out-of-order Unblock call. For example, an out-of-order call to Unblock followed by a call to PPL’s critical_section::lock method can, in certain interleavings, cause a critical section not to be observed.

Block and Unblock are low-level methods. Most programmers will want to use the higher-level cooperative blocking operations.

Gg663535.note(en-us,PandP.10).gifNote:
If you use the Block and Unblock methods you must be very careful to avoid situations that could unexpectedly interact with PPL's internal use of the methods. Most programmers will not need to use Block and Unblock to implement their own synchronization primitives.

Waiting

The Concurrency namespace includes a global wait function that allows you to suspend the current context with the guarantee that the suspended task will not be resumed until a time interval that is provided as an argument to the wait function has passed. It is possible, due to the queue-oriented scheduling approach used by the task scheduler, that a task will be suspended for a time period that is much longer than the specified wait period.

The Yield method of the Context class allows a pending lightweight task to run, or if there are no lightweight tasks, it allows one of the runnable contexts to resume. If either of these two conditions is satisfied, then the current worker thread is added to the queue of runnable contexts; otherwise, the Yield method is a no-op. The Yield method is also a no-op when called from a thread that is not a worker thread of a scheduler.

Gg663535.note(en-us,PandP.10).gifNote:
Calling the Yield method in a tight loop while polling for some condition can cause poor performance. Instead, wait for an event.

The Caching Suballocator

You can allocate memory from a memory cache that is local to the current context. This is useful when a task needs to create many small temporary objects on the heap, and you expect that more than one task might create objects of the same size.

The synchronization overhead of coordinating memory allocation and deallocation for a global pool of memory can be a significant source of overhead for a parallel application. The Concurrency Runtime provides a thread-private caching suballocator that allows you to allocate and deallocate temporary working storage within a task and potentially to use fewer locks or memory barrier operations than you would if you used the global memory pool. Allocation requests only incur synchronization overhead at the beginning of the run when new memory must be added to the cache.

The caching suballocator is meant for situations where there is frequent allocation of temporary memory within tasks. Invoke its functions only from within a running task. The caching suballocator does not improve performance in all scenarios, and it does not free its memory. Refer to MSDN for more information about its use.

Long-Running I/O Tasks

By default, the scheduler assumes that its tasks are computationally intensive. Dynamic resource management compensates to some extent for I/O-intensive tasks that are not computationally intensive, but if you know that your task is a long-running I/O-intensive task that will use only a fraction of a processor's resources, you should give a hint to the scheduler that it can oversubscribe the current processor node by one additional running thread.

To do this, call the static method Context::Oversubscribe with the argument true. This tells the scheduler that it should request an additional virtual processor object from the resource manager while using the same processor affinity mask as the current thread is using. This operation increases the level of concurrency in the scheduler without increasing the number of cores that will be used by the scheduler.

When you are done with your long-running I/O-intensive task, call the Oversubscribe method with the argument false to reset the level of concurrency to the previous value. Be sure to call the Oversubscribe method in pairs, and to take exceptions into account. The sample pack includes the scoped_oversubcription_token helper class to automatically ensure that the Oversubscribe method is called in pairs. You should place calls to the Oversubscribe method within the body of the work function of your long-running I/O task.

Gg663535.note(en-us,PandP.10).gifNote:
Use Context::Oversubscribe to add concurrency to the current scheduler without consuming more cores.

Setting Scheduler Policy

There are a number of settings that you can control that will affect how the task scheduler does its job. These are documented on MSDN as values of the PolicyElementKey enumeration (and they are also found in the concrt.h header file), but here is a summary of the most important settings. See the section "Creating and Attaching a Task Scheduler" in this appendix for a code example of how to set policies.

Policy Key

Description

MinConcurrency

This is the minimum number of virtual processor objects that the scheduler requires at startup.

MaxConcurrency

This is the maximum number of virtual processor objects that the scheduler requires at startup. There is a special value, MaxExecutionResources, that indicates "as many as exist on the computer."

SchedulingProtocol

This indicates which of the two available scheduling algorithms should be used by the scheduler. Choose EnhanceScheduleGroupLocality (the default) or EnhanceForwardProgress.

TargetOversubscriptionFactor

You can use this setting if you know that your operations will ordinarily use only a fraction of the processor’s time, as is typical in I/O-bound programs. The runtime multiplies the number of virtual processor objects by the oversubscription factor.


Anti-Patterns

Here are a few things to watch out for.

Multiple Resource Managers

The resource manager is a singleton that works across one process. It does not coordinate processor resources across multiple operating-system processes. If your application uses multiple, concurrent processes, you may need to reduce the level of concurrency in each process for optimum efficiency.

It is possible, in some situations, for more than one resource manager instance to be created within a single process. If this happens, the resource manager instances will contend for resources. This situation arises if a library uses static linking. There will be one resource manager for each statically linked instance of the C++ runtime.

It is also possible to have multiple resource manager instances if you use more than one version of the C++ runtime within a single application. In this case, there will be one resource manager for each version of the runtime.

In general, you should avoid situations where more than one instance of the resource manager can exist.

Gg663535.note(en-us,PandP.10).gifNote:
Avoid static linking for your parallel applications.


Resource Management Overhead

The resource manager optimizes the use of processor resources by dynamically adjusting the concurrency levels of schedulers that are active in the application. In certain scenarios, you may find that dynamic resource management isn’t providing optimal results. In these cases you can effectively disable dynamic resource management by setting the minimum and maximum concurrency levels of your scheduler to the same value. You should understand the performance characteristics of your application before doing this. In most scenarios dynamic resource management will result in better overall throughput of your application.

Unintentional Oversubscription from Inlined Tasks

The "Running Tasks" section of this appendix describes how a scheduler may reuse a thread that is waiting for the tasks of a task group to complete to run one or more pending tasks. This is known as inlining.

Inline execution that occurs in a worker thread of a scheduler object has no effect on the level of concurrency of the scheduler. With or without inline execution, the number of worker threads that are allowed to run at the same time is limited by the number of virtual processor objects that have been allocated to the scheduler by the resource manager.

Unlike worker threads, application threads are not managed by a task scheduler instance. Application threads (that is, any threads that are not a scheduler's worker threads) do not count toward the limit on the number of concurrent running threads that is coordinated by the resource manager. Unless blocked by the operating system, application threads are always allowed to run.

If you call the task_group::run method from an application thread and subsequently call the wait method on the task group from that same thread, inline execution of the task you created may occur on the application thread. The application thread will be allowed to run regardless of the number of running worker threads in the scheduler. Therefore, inline execution in an application may increase the parallel operation’s overall level of concurrency by one running thread. If your scheduler had a maximum concurrency level of four virtual processor objects, your application might run five tasks concurrently: four on the worker threads of the scheduler, plus one on the application thread that entered the wait method.

Pending PPL tasks are only inlined if they reside in the local work-stealing queue of the thread that enters the wait method. Therefore, if you wanted to prevent inline execution of tasks in an application thread, you could a use a lightweight task to create the PPL task that would otherwise have been created in the application thread. The pending PPL task would then be placed in the work-stealing queue of whatever worker thread executed the lightweight task and would not be eligible for inline execution on the application thread.

The remarks about inlining in this section also apply to parallel_for, parallel_for_each, and parallel_invoke, as well as tasks created using structured task groups. Inline execution also occurs whenever you use the task_group::run_and_wait method.

Deadlock from Thread Starvation

Recall that there are two kinds of blocking operations: cooperative blocking operations proved by the Concurrency Runtime and "noncooperative" blocking operations provided by the operating system.

When a worker thread becomes blocked by a cooperative blocking operation, the scheduler will resume one of its runnable contexts or allow one of the pending tasks to start running. Note that if no idle worker thread is available to run the pending task, a new worker thread will be created. Creating the additional thread does not introduce additional concurrency. At any given time, the number of worker threads that are released to the OS for scheduling never exceeds the number of virtual processor objects allocated to it by the resource manager.

Allowing additional pending tasks to run when a task is cooperatively blocked increases the chance of running the task that would unblock the cooperatively blocked task. This can help avoid deadlock from thread starvation that can occur in systems with many dependent tasks and fixed levels of concurrency.

In contrast to cooperative blocking, noncooperative or OS-level blocking is opaque to PPL and the Concurrency Runtime, unless the blocking operation occurs on a User-Mode Scheduled (UMS) thread. (UMS threads are outside the scope of this book.) When a worker thread becomes blocked by an OS-level blocking operation, the scheduler still considers it to be a running thread. The scheduler does not resume one of its runnable contexts or start one of its pending tasks in response to an OS-level blocking operation. As a consequence, you may end up with deadlock due to thread starvation. For example, in an extreme case, if all worker threads of a scheduler are blocked by noncooperative blocking operations that need pending tasks to run in order to become unblocked, the scheduler is deadlocked.

It is therefore recommended that cooperative blocking operations be used as the primary mechanism of expressing dependencies among tasks.

Gg663535.note(en-us,PandP.10).gifNote:
The number of worker threads in a scheduler is not the scheduler’s level of concurrency. At any given time, the number of threads that are released to the OS for execution will not exceed the number of virtual processor objects that have been allocated to the scheduler by the resource manager.

Ignored Process Affinity Mask

As of Windows 7 you can set a process affinity mask that limits the cores on which the threads of your application may run. The version of the Concurrency Runtime that is shipped as part of Visual Studio 2010 SP1 does not recognize this user-selected process affinity mask. Instead, it uses all available cores up to the maximum specified level of concurrency for each scheduler.

This issue is expected to be resolved in a future version of the Concurrency Runtime.

References

For more information about the caching suballocator, see Alloc Function on MSDN at http://msdn.microsoft.com/en-us/library/dd492420.aspx.

For a general description of NUMA, see Introduction to NUMA on the MSDN Magazine blog at http://blogs.msdn.com/b/msdnmagazine/archive/2010/05/06/10009393.aspx.

For more information about the PolicyElementKey enumeration, see the entry on MSDN at http://msdn.microsoft.com/en-us/library/dd492562.



Last built: March 9, 2012

Show:
© 2014 Microsoft