Rahul V. Patil and Boby George
In order to satisfy the perpetual need for increased computing power, the hardware industry is steadily shifting toward multi- and many-core processor systems. Unlike the increased application performance attained by faster processors with higher clock speeds, performance improvements in many-core systems can only be achieved by writing efficient parallel programs.
Forms of parallelism have existed in the software industry for a long time. However, creating mainstream software applications that harness the full power of parallel hardware requires significant changes from the practices designed for sequential applications.
Testing parallel applications is not straightforward. For instance, concurrent bugs are difficult to detect due to the nondeterministic behavior exhibited by parallel applications. Even if these bugs are detected, it is difficult to reproduce them consistently. Further, after fixing a defect, it is difficult to ensure that the defect was truly rectified and not simply masked. In addition, parallelization can also introduce new performance bottlenecks that must be identified.
In this article, we'll look at testing techniques for parallel programs and present six helpful tools you can use to locate potentially serious defects. We'll begin with the following categories of concurrency bugs: race conditions, incorrect mutual exclusions, and memory reordering.
A race occurs when two or more threads of execution in a multithreaded program try to access the same shared data and at least one of the accesses is a write. Harmful race conditions introduce unpredictability and are often hard to detect. The consequences of a race condition might only become visible at a much later time or in a totally different part of the program. They are also incredibly hard to reproduce. Races are avoided by using synchronization techniques to correctly sequence operations between threads.
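As a minimal sketch of that last point, the following Python snippet has several threads increment a shared counter. The function name run_counter and its parameters are hypothetical, chosen only for illustration. With use_lock=True the read-modify-write is serialized by a lock, so the final value is predictable; with use_lock=False the result depends on how the threads happen to interleave.

```python
import threading

def run_counter(num_threads=4, increments=10_000, use_lock=True):
    """Increment a shared counter from several threads and return its final value."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                with lock:        # serialize the read-modify-write on the shared data
                    counter += 1
            else:
                counter += 1      # unsynchronized: concurrent updates may be lost

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock held around every update, run_counter(4, 10_000) always returns 40000. Without it, no particular value is guaranteed, which is exactly what makes such bugs hard to reproduce.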
Sometimes races may be safe and intentional. For example, there could be a global flag called done, for which there is only one writer but many readers. The writer thread sets the flag to tell all the threads to terminate safely. All reader threads may be running in a loop using while (!done), repeatedly reading the flag. Once a thread notices that the done flag is set, it will exit its while loop. In most cases, this is a benign race. As we will discuss later in this article, there are tools available for detecting race conditions. However, such tools might warn about races that are in fact benign; such spurious reports are known as false positives.
A deadlock occurs when two or more threads wait on each other, forming a cycle and preventing all of them from making any forward progress. Deadlocks can be introduced by the programmer while trying to avoid race conditions. For example, incorrect use of synchronization primitives such as acquiring locks in an incorrect order can result in two or more threads waiting for each other. Deadlocks can also occur in cases that don't use locking constructs; any kind of circular wait can result in a deadlock.
Take a look at the following example of a potential deadlock situation. Thread 1 does the following:
Acquire Lock A;
Acquire Lock B;
Release Lock B;
Release Lock A;
Thread 2 proceeds like so:
Acquire Lock B;
Acquire Lock A; // if thread 1 has already acquired lock
// A and is waiting to acquire lock B, then
// this is a deadlock
Release Lock A;
Release Lock B;
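One common way to rule out this kind of cycle is to make every thread acquire locks in the same global order. The sketch below is only an illustration under that assumption: LOCK_RANK, acquire_in_order, and release_all are hypothetical helpers, not part of any library. Thread 2 still asks for B and then A, but the helper reorders the acquisition so that A is always taken first.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# Hypothetical global ranking: locks with a lower rank are always acquired first.
LOCK_RANK = {id(lock_a): 0, id(lock_b): 1}

def acquire_in_order(*locks):
    """Acquire the given locks sorted by rank, regardless of the order requested."""
    ordered = sorted(locks, key=lambda lock: LOCK_RANK[id(lock)])
    for lock in ordered:
        lock.acquire()
    return ordered

def release_all(ordered):
    """Release locks in the reverse of the order in which they were acquired."""
    for lock in reversed(ordered):
        lock.release()

def worker(first, second, finished, name):
    for _ in range(1000):
        held = acquire_in_order(first, second)
        release_all(held)
    finished.append(name)

def run():
    finished = []
    t1 = threading.Thread(target=worker, args=(lock_a, lock_b, finished, "thread 1"))
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a, finished, "thread 2"))
    t1.start()
    t2.start()
    t1.join(timeout=10)
    t2.join(timeout=10)
    return sorted(finished)
```

Because no thread ever holds B while waiting for A, the circular wait cannot form and both workers run to completion.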
Starvation is an indefinite delay or permanent blocking of one or more runnable threads in a multithreaded application. Threads that are not being scheduled to run even though they are not blocking or waiting on anything else are said to be starving. Starvation is typically the result of scheduling rules and policies. For example, if one schedules a high-priority, non-blocking, continuously executing thread along with a low-priority thread, then, on a single-core CPU, the lower-priority thread will never get a chance to run. To help avoid such conditions, the Windows® scheduler occasionally boosts the priorities of starving threads.
Livelocks occur when threads are scheduled but are not making forward progress because they are continuously reacting to each other's state changes. A common analogy is two people approaching each other in a narrow hallway: each steps aside for the other, and each time they end up blocking the other's path. Such sidestepping prevents forward progress and results in a livelock. High CPU utilization with no sign of real work being done is a classic warning sign of a livelock. Livelocks are incredibly difficult to detect and diagnose.
For parallel applications, test data is defined by identifying the scalability aspects of the scenario. The scalability exhibited by the application really depends on the parallelizability of the algorithm used. Certain algorithms may produce super-linear speedup for certain problem domains (for example, a search through a non-uniform distribution). Other algorithms may not produce embarrassingly parallel speedup due to limited parallelizability, resulting in total runtime speed increases that aren't proportional to the number of CPUs. Still other algorithms (such as sorts) won't scale linearly across data set size. As you can see, it is important to understand how the performance of the application scales across the following categories:
The Number of Threads or CPU Cores. The typical user expectation is that the parallel application will scale linearly as the number of CPUs or threads increases.
Dataset Size. The typical user expectation is that the performance of the parallel application should not decrease when the size of the input dataset increases.
Execution Workload. As the execution progresses, the amount of work done by a thread may not remain constant. The workload can be constant, increasing, decreasing, random, or normally distributed, and these variations can impact the performance characteristics of the application.
The execution time is the most common metric used for reporting. Other reporting metrics include the speedup achieved in comparison with a sequential application and the speedup across the scalability categories. It's important to define the reporting metric and how it is going to be measured. Some good questions to start with include: What values have to be reported? How will they be reported? Will the instrumentation used to take the measurement affect the performance? Further, when possible, take measurements in parts so it is possible to understand the performance of the sub-parts of the system.
When the compiler is optimizing code, it may decide to reorder pieces of code to improve performance. However, this can cause unexpected behavior in threads monitoring changes to global state. For example, let's say you have two threads of execution: thread 1 and thread 2. Let's also assume two global variables, x and y, both initialized to 0. Thread 1 runs these lines:
x = 10;
y = 5;
Thread 2 executes the following:
if (y == 5)
Assert(x == 10); // holds only if x was written before y
Here, the optimizing compiler might observe that y will eventually be reassigned to 5 anyway, and it might reorder the code such that y gets assigned 5 before x gets assigned 10; from the perspective of the single thread executing these two assignments, the ordering doesn't matter. Hence, there is a chance that Thread 2's assert will get fired, since x might not equal 10.
Where necessary, one can avoid compiler memory reordering by using volatile variables and compiler intrinsics. It is also worth mentioning that with common optimization flags turned on, Thread 2 may never see an update to variables y or x. Here the user will have to use volatile variables.
Compiler reordering is not the only complication. Memory access optimizations such as caching and advanced fetching by speculative executions may cause memory operations executed by one thread on one CPU to be perceived to have happened in a different order by another thread on another CPU. This kind of memory reordering perception can only happen in a system with multiple CPUs or multiple CPU cores.
To clarify, suppose there are two threads of execution: thread 1 and thread 2. Further assume m and n are global, both initialized to 0, and a, b, c, d are local variables. Thread 1 (on processor 1) has these instructions:
m = 5; // A1
int a = m; // A2
int b = n; // A3
Thread 2 (on processor 2) has these:
n = 5; //A4
int c = n; //A5
int d = m; //A6
Logically (in a sequentially consistent memory world), b == d == 0 can never happen after all of these instructions have executed. However, hardware memory reordering can cause the writes to shared memory (A1 and A4) to be store-buffer forwarded, which can lead the different processors to observe the writes differently.
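The sequentially consistent half of this claim can be checked by brute force. The sketch below (function and variable names are illustrative assumptions) enumerates every interleaving of A1 through A6 that preserves each thread's program order, executes it against a shared state, and collects the resulting (b, d) pairs. It models only sequential consistency, so it says nothing about what real hardware with store buffers may do.

```python
from itertools import combinations

# Each instruction mutates a shared state dictionary.
THREAD_1 = [
    lambda s: s.update(m=5),        # A1: m = 5
    lambda s: s.update(a=s["m"]),   # A2: a = m
    lambda s: s.update(b=s["n"]),   # A3: b = n
]
THREAD_2 = [
    lambda s: s.update(n=5),        # A4: n = 5
    lambda s: s.update(c=s["n"]),   # A5: c = n
    lambda s: s.update(d=s["m"]),   # A6: d = m
]

def sequentially_consistent_outcomes():
    """Return every (b, d) pair reachable when instructions interleave in program order."""
    outcomes = set()
    total = len(THREAD_1) + len(THREAD_2)
    # Choose which positions of the merged order are taken by thread 1.
    for slots in combinations(range(total), len(THREAD_1)):
        state = {"m": 0, "n": 0}
        it1, it2 = iter(THREAD_1), iter(THREAD_2)
        for position in range(total):
            step = next(it1) if position in slots else next(it2)
            step(state)
        outcomes.add((state["b"], state["d"]))
    return outcomes
```

Across all 20 interleavings the only outcomes are (0, 5), (5, 0), and (5, 5). Observing b == 0 and d == 0 together on real hardware therefore indicates that the memory operations were not perceived in a sequentially consistent order.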
Testing a concurrent application entails testing for correctness, reliability, performance, and scalability. The correctness of the application is determined using techniques such as static analysis, dynamic analysis, and model checking. Performance and reliability are tested for by studying the overheads introduced by parallelism and stress testing. Scalability is tested by analyzing how well the application performs on systems of varying sizes.
Static analysis techniques analyze code without actually executing the program. Usually, static analysis is performed by looking at metadata from a compiled application or annotated source code. The analysis typically includes some formal inspection to ensure that the intent and assumptions of the programmer prevent incorrect behavior. PREfix and PREfast are two of the most popular static analysis tools for native applications, whereas FxCop is very popular for managed code.
There are advantages and disadvantages to static analysis. Among the advantages are the thorough and detailed coverage achieved, the confidence in the product design that such formal analysis helps build, and the precise error reports that make bugs easier to find and fix. However, to deal with concurrency bugs, static analysis requires a great deal of annotation specifying intent. Moreover, these annotations need to be correct themselves. Plus, tools that use static analysis tend to generate a lot of false positives and require a significant effort to minimize the false positives.
In dynamic analysis, bugs are detected by looking at the footprints of execution. There are two types of dynamic analysis: online and offline. Tools that use online dynamic analysis analyze a program while it is executing; tools that use offline dynamic analysis record traces and analyze them later to detect bugs. Dynamic analysis is convenient because it requires little-to-no extra work from the developer at the time of programming, yields fewer false positives, and makes it easier to address the more obvious problems. Since bugs are discovered only in the path of execution, the more frequently traversed paths yield the first set of bugs. This makes it cheaper to improve reliability.
But there are disadvantages, too. You can detect bugs only in the path of execution, and you need to rely on test cases for good code coverage. In fact, some tools can only catch a race if the race actually occurred in that run. Thus, if the tool does not report any bugs, you still can't be certain there aren't any. Most dynamic analysis tools rely on some sort of instrumentation that can modify the run-time behavior of the program, which is another drawback. Due to the complexity of the analysis, the performance of these tools tends to be very poor.
The final method is model checking, a method for verifying the correctness of a finite state concurrent system. It allows for formal deductive reasoning. A model checker tries to simulate race and deadlock conditions. Using model checking, you can formally prove the absence of races and deadlocks. This method provides superior confidence in design and architecture, provides very high coverage, and requires minimal external drivers.
Like other testing methods, model checking has some disadvantages. In most cases, it is very difficult to automatically extract the model from code (rare cases exist with limited use). Performing the model extraction manually is also quite tedious. There is also too much to verify due to state space explosion, a condition wherein the potential number of possible states is too huge to analyze. State space explosion can be controlled to some extent by applying reduction techniques which utilize complicated heuristics. Such heuristics can also have downsides, such as failing to detect bugs in long-running applications.
Finally, model checking requires disciplined design and implementation planning. Model checking may prove that the design is error free, but the implementation may still be incorrect. In practice, model checking is useful only for small, critical sections of a product. Other hybrid techniques include combining dynamic analysis and model checking. An example is CHESS, which you'll see shortly.
The lockset algorithm, used in both static and dynamic analysis tools, reports a potential race when shared memory is accessed by two or more threads without the threads holding a common lock. Fundamentally, the algorithm checks that for each shared memory variable v there is a non-empty set of locks C(v) that every thread holds while accessing the variable. Here's how the algorithm proceeds:
For each shared memory variable v, maintain a set C(v) of locks. Initially C(v) is the list of all locks.
Each thread also maintains two sets of locks: locks(t) indicating all the locks held and writeLocks(t) indicating all the write locks held.
On each read of v by thread t:
Set C(v) = C(v) intersection locks(t).
If C(v) is the empty set, then raise an error.
On each write of v by thread t:
Set C(v) = C(v) intersection writeLocks(t).
If C(v) is the empty set, then raise an error.
Essentially, as the application progresses, the C(v) of each variable shrinks. An error is raised if C(v) ever becomes empty (that is, if the intersection of the threads' locksets is empty at the time of accessing the shared memory).
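The steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used by any particular tool: the class name LocksetChecker, its methods, and the demo function are all hypothetical, and locks, threads, and variables are represented by plain strings.

```python
class LocksetChecker:
    """Minimal lockset sketch: tracks C(v), locks(t), and writeLocks(t)."""

    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)
        self.candidates = {}    # variable -> C(v)
        self.locks = {}         # thread -> locks(t)
        self.write_locks = {}   # thread -> writeLocks(t)
        self.errors = []        # (access kind, variable, thread)

    def acquire(self, thread, lock, write=True):
        self.locks.setdefault(thread, set()).add(lock)
        if write:
            self.write_locks.setdefault(thread, set()).add(lock)

    def release(self, thread, lock):
        self.locks.get(thread, set()).discard(lock)
        self.write_locks.get(thread, set()).discard(lock)

    def _refine(self, thread, var, held, kind):
        # C(v) starts as the set of all locks and only ever shrinks.
        remaining = self.candidates.get(var, self.all_locks) & frozenset(held)
        self.candidates[var] = remaining
        if not remaining:
            self.errors.append((kind, var, thread))

    def read(self, thread, var):
        self._refine(thread, var, self.locks.get(thread, ()), "read")

    def write(self, thread, var):
        self._refine(thread, var, self.write_locks.get(thread, ()), "write")


def demo(lock_for_t2):
    """Thread t1 guards v with lock A; thread t2 guards it with lock_for_t2."""
    checker = LocksetChecker({"A", "B"})
    checker.acquire("t1", "A")
    checker.write("t1", "v")
    checker.release("t1", "A")
    checker.acquire("t2", lock_for_t2)
    checker.write("t2", "v")
    checker.release("t2", lock_for_t2)
    return checker.errors
```

Calling demo("A") reports nothing because both threads hold a common lock, whereas demo("B") reports a potential race on v even though the two writes never overlapped in this run.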
Unfortunately, not all races reported by a lockset algorithm are real races. One can write race-free code without using locks either by applying clever programming tricks or using other synchronization primitives such as signaling. This makes finding genuine bugs really hard. Annotations and certain suppressions can help alleviate this problem.
Another algorithm for detecting races is the "happens-before" algorithm, which is based on partial ordering of events in distributed systems. Here is an overview of the algorithm that computes the partial order to determine what events happened before another event (events, in this context, are all instructions, including read/write and locks):
Within any single thread, events are ordered in the order of their occurrence.
Among threads, events are ordered according to the synchronization primitives' properties. For example, if lock(a) is being acquired by two threads, then the unlock by one thread is said to have happened-before the lock of another thread.
If two or more threads access a shared variable, and the accesses are not deterministically ordered by the "happens-before" relationship, then a race is said to have occurred.
The algorithm is illustrated in Figure 1.
Figure 1: Happens-Before Algorithm
Thread 1's unlock happened-before Thread 2's lock. Thus, the access to the shared variable can never happen at the same time and there is no race. One major drawback of the algorithm is that monitoring such relationships can be very expensive.
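One common way to track the happens-before relation is with vector clocks. The sketch below assumes that approach and is not taken from any of the tools discussed in this article; HappensBeforeDetector, replay, and the two event lists are hypothetical names. Each thread keeps a vector clock, an unlock publishes the thread's clock on the lock, and a later lock of the same lock merges it in.

```python
class HappensBeforeDetector:
    """Vector-clock sketch of happens-before race detection."""

    def __init__(self):
        self.clock = {}        # thread -> vector clock (thread -> counter)
        self.lock_clock = {}   # lock -> clock published by the last unlock
        self.accesses = {}     # variable -> [(thread, clock snapshot, is_write)]
        self.races = []        # (variable, earlier thread, later thread)

    def _vc(self, thread):
        return self.clock.setdefault(thread, {thread: 1})

    def _tick(self, thread):
        self._vc(thread)[thread] += 1

    def lock(self, thread, lk):
        vc = self._vc(thread)
        # The previous unlock of lk happened-before this lock: merge its clock.
        for other, count in self.lock_clock.get(lk, {}).items():
            vc[other] = max(vc.get(other, 0), count)
        self._tick(thread)

    def unlock(self, thread, lk):
        self.lock_clock[lk] = dict(self._vc(thread))
        self._tick(thread)

    @staticmethod
    def _ordered(earlier, later):
        return all(count <= later.get(t, 0) for t, count in earlier.items())

    def access(self, thread, var, is_write):
        now = dict(self._vc(thread))
        for other, then, was_write in self.accesses.get(var, []):
            conflicting = other != thread and (is_write or was_write)
            if conflicting and not self._ordered(then, now):
                self.races.append((var, other, thread))
        self.accesses.setdefault(var, []).append((thread, now, is_write))
        self._tick(thread)


def replay(events):
    """Feed (thread, operation, target) events to a fresh detector."""
    detector = HappensBeforeDetector()
    for thread, op, target in events:
        if op == "lock":
            detector.lock(thread, target)
        elif op == "unlock":
            detector.unlock(thread, target)
        else:  # "read" or "write"
            detector.access(thread, target, is_write=(op == "write"))
    return detector.races


# y is written by both threads without any lock; v is always guarded by L.
THREAD_1 = [("t1", "write", "y"), ("t1", "lock", "L"),
            ("t1", "write", "v"), ("t1", "unlock", "L")]
THREAD_2 = [("t2", "lock", "L"), ("t2", "write", "v"),
            ("t2", "unlock", "L"), ("t2", "write", "y")]
```

Replaying THREAD_1 followed by THREAD_2 reports no race, because the lock handoff happens to order the two writes to y. Replaying THREAD_2 followed by THREAD_1 reports the race on y. The same program thus passes or fails depending purely on the observed schedule.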
The larger issue is that the ability of this algorithm to detect races depends entirely on the scheduling order. The partial order is built only for this specific instance of the scheduling and may miss bugs for the same test on a different day. The execution order in Figure 2 will not report any races, but the execution order in Figure 3 will. Many races are known to have surfaced only several years after a product's release. So, this detection does not leave one with a comforting sense of complete coverage.
Figure 2: Races Will Not Be Reported
Figure 3: Races Will Be Detected
On the flip side, happens-before generates very few false positives. Most reported bugs are real. However, happens-before misses many errors (false negatives) and is very hard to implement efficiently.
By contrast, the lockset algorithm is very efficient and can detect more errors. However, it tends to generate too many false positives. There have been efforts to combine these algorithms to overcome the disadvantages of both approaches. Note: these race detection algorithms have evolved over several years, and sources for them can be found by searching the ACM/IEEE portals.
There are a number of concurrency testing tools on the market to help you deal with potential deadlocks, livelocks, hangs, and all the other issues you experience when running parallel transactions. Each tool we look at here will help in a particular area.
CHESS. Created by Microsoft Research, CHESS is a novel combination of model checking and dynamic analysis. It detects concurrency errors by systematically exploring thread schedules and interleavings. It is capable of finding race conditions, deadlocks, hangs, livelocks, and data corruption issues. To help with debugging, it also provides a fully repeatable execution. Like most model checking, the systematic exploration provides thorough coverage.
As a dynamic analysis tool, CHESS runs a regular unit test repeatedly on a specialized scheduler. On every repetition, it chooses a different scheduling order. As a model checker, it controls the specialized scheduler that is capable of creating specific thread interleavings. To control the state space explosion, CHESS applies partial-order reduction and a novel technique called iterative context bounding.
In iterative context bounding, instead of limiting the state space explosion by depth, CHESS limits the number of thread switches in a given execution. The thread itself can run any number of steps between thread switches, leaving the execution depth unbounded (a big win over traditional model checking). This is based on the empirical evidence that a small number of systematic thread switches is sufficient to expose most concurrency bugs.
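To get a feel for how much a bound on thread switches shrinks the search space, consider the toy model below. It is a simplification built on stated assumptions: two threads, every step is a possible switch point, and every change of running thread counts against the bound. The function names are hypothetical and the code does not reflect how CHESS itself is implemented.

```python
from itertools import combinations

def schedules(steps_a, steps_b):
    """Yield every interleaving of two threads as a string such as 'AABBAB'."""
    total = steps_a + steps_b
    for slots in combinations(range(total), steps_a):
        yield "".join("A" if i in slots else "B" for i in range(total))

def switches(schedule):
    """Count how many times the running thread changes in a schedule."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:]) if prev != cur)

def bounded_schedules(steps_a, steps_b, bound):
    """Return only the schedules that use at most `bound` thread switches."""
    return [s for s in schedules(steps_a, steps_b) if switches(s) <= bound]
```

For two threads of three steps each there are 20 interleavings in total, but only 2 of them use at most one switch and only 6 use at most two. Raising the bound step by step explores the cheap schedules first and defers the combinatorial bulk until later.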
CHESS can detect deadlocks and races (using the Goldilocks lockset algorithm explained at go.microsoft.com/fwlink/?LinkId=116877) but relies on programmer assertions for other state verification. It also expects all programs to terminate and assumes a fairness guarantee (forward progress) for all threads. Thus, if the program enters an endless loop, it reports a livelock.
From the testing viewpoint, you would begin by running CHESS with an iterative context bound of 2. Once there are no more bugs to be found, you would increase the bound to 3 and so on. Empirically, most of the bugs should be detected with a bound of 2 or 3. Hence, this can be a very efficient way to weed out bugs sooner.
For instrumentation, CHESS detours Win32® API synchronization calls so that it can control, and intentionally introduce, non-determinism. Additionally, it requires developer code to include a lot of asserts to ensure consistency of states (something good code should do anyway). However, like most dynamic analysis tools, it requires a good test suite for broad coverage. CHESS is a boon for all those developers and testers who had to rely solely on stress for better interleaved testing. It takes regular unit tests and methodically simulates interesting interleavings.
The Intel Thread Checker. This is a dynamic analysis tool for finding deadlocks (including potential deadlocks), stalls, data races, and incorrect uses of the native Windows synchronization APIs (see go.microsoft.com/fwlink/?LinkId=115727). The Thread Checker needs to instrument either the source code or the compiled binary to make every memory reference and every standard Win32 synchronization primitive observable. At run time, the instrumented binary provides sufficient information for the analyzer to construct a partial order of execution. The tool then performs a "happens-before" analysis on the partial order. Please refer to the "Race Detection Algorithms" sidebar for more information on happens-before analysis.
For performance and scalability reasons, instead of remembering all accesses to a shared variable, the tool only remembers recent accesses. This helps to improve the tool's efficiency while analyzing long-running applications. However, the side effect is that it will miss some bugs. It's a trade-off, and perhaps it's more important to find many bugs in long-running applications than to find all bugs in a very short-lived application.
The only other big drawback of the tool is that it can't account for synchronization via interlocked operations, such as those used in custom spin locks. However, for applications that use only standard synchronization primitives, this is probably one of the best-supported test tools available for concurrency testing native applications.
RacerX. This flow-sensitive static analysis tool is used for detecting races and deadlocks. It overcomes the requirement to tediously annotate the entire source code. In fact, the only annotation requirement is that users provide a table specifying APIs used to acquire and release locks. The locking primitives' attributes such as spinning, blocking, and re-entrancy can also be specified. This table will typically be very small, 30 entries at most. This tremendously reduces the burden of source annotation for large systems.
In the first phase, RacerX iterates over each source code file and builds a Control Flow Graph (CFG). The CFG has information related to function calls, shared memory, use of pointers, and other data. As it builds the CFG, it also refers back to the table of synchronization primitives and marks calls to these APIs.
Once the complete CFG is built, the analysis phase kicks in, which includes running the race checker and deadlock checker. Traversing the CFG can be lengthy, but appropriate reduction and caching techniques are utilized to minimize the impact. As the control flow is traversed, the lockset algorithm is used to catch potential races (refer to the "Race Detection Algorithms" sidebar for details about the algorithm). For deadlock analysis, it computes locking cycles every time locks are taken.
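The idea of computing locking cycles can be illustrated with a lock-order graph: whenever a thread acquires a lock while holding another, an edge is recorded from the held lock to the new one, and a cycle in the resulting graph signals a potential deadlock. The sketch below works on hand-written acquire/release traces rather than a CFG, and its function names are hypothetical. It is meant only to show the cycle check, not how RacerX is implemented.

```python
def lock_order_edges(traces):
    """traces maps a thread name to a list of ('acquire' | 'release', lock) pairs."""
    edges = set()
    for operations in traces.values():
        held = []
        for op, lock in operations:
            if op == "acquire":
                # Every lock already held is ordered before the new one.
                edges.update((h, lock) for h in held)
                held.append(lock)
            else:
                held.remove(lock)
    return edges

def find_cycle(edges):
    """Return one cycle in the lock-order graph as a list of locks, or None."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    path, finished = [], set()

    def dfs(node):
        if node in path:
            return path[path.index(node):] + [node]
        if node in finished:
            return None
        path.append(node)
        for successor in sorted(graph.get(node, ())):
            cycle = dfs(successor)
            if cycle:
                return cycle
        path.pop()
        finished.add(node)
        return None

    for start in sorted(graph):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None

# The two threads from the deadlock example earlier in the article.
DEADLOCK_TRACES = {
    "thread 1": [("acquire", "A"), ("acquire", "B"), ("release", "B"), ("release", "A")],
    "thread 2": [("acquire", "B"), ("acquire", "A"), ("release", "A"), ("release", "B")],
}
```

For DEADLOCK_TRACES the recorded edges are A to B and B to A, so find_cycle returns ['A', 'B', 'A']. A real checker would also need to discard cycles formed entirely within one thread or protected by a common outer lock, which this sketch does not attempt.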
The final phase involves post-processing all the reported errors in order to prioritize by importance and harmfulness of the error. As can be expected with static analysis, the authors spent a lot of time trying to reduce the false positives and present bugs with high confidence. The results of this tool have been impressive, and it seems very practical from a test engineering standpoint.
Chord. This is a flow-insensitive, context-sensitive static analysis tool for Java. Being flow-insensitive allows it to be far more scalable than other static tools, but at the cost of losing precision. It also takes into account the specific synchronization primitives available in Java. The algorithm used is very involved and would require the introduction of many concepts, so it is not described here.
KISS. Developed by Microsoft Research, this model checker tool is for concurrent C programs. Since state space explodes quickly in a concurrent system, KISS transforms a concurrent C program into a sequential program that simulates the execution of interleavings. A sequential model checker is then used to perform the analysis.
The application is instrumented with statements that convert the concurrent program to a sequential program, with KISS assuming the responsibility of controlling the non-determinism. The nondeterministic context switching is bounded using principles similar to those described for CHESS above. The programmer is supposed to introduce asserts which validate concurrency assumptions. The tool does not report false positives. The tool is a research prototype and has been used by the Windows driver team, which primarily uses C code.
Zing. This tool is a pure model checker meant for design verification of concurrent programs. Zing has its own custom language that is used to describe complex states and transitions, and it is fully capable of modeling concurrent state machines. Like other model checkers, Zing provides a comprehensive way to verify designs; it also helps build confidence in the quality of the design since you can verify assumptions and formally prove the presence or absence of certain conditions. It also contends with concurrent state space explosion by using innovative reduction techniques.
The model that Zing uses (to check for program correctness) has to be created either by hand or by translators. While some specific domain translators can be written, we have yet to come across any complete and successful translators for native or CLR applications. Without having translators, we believe Zing cannot be used in large software projects, except for verifying the correctness of critical subsections of the project (see go.microsoft.com/fwlink/?LinkId=115725).
Performance testing is an integral and essential part of the concurrency testing process. After all, one of the key motivations for developing parallel applications is to deliver superior performance in comparison to sequential applications. As postulated in Amdahl's law and Gustafson's law, the performance gain achieved by a parallel application is greatly influenced by its algorithm's parallelizability aspects, amount of sequential parts in the program, the parallelization overhead and the data/workload characteristics. Conducting performance testing allows stakeholders to understand and analyze the performance characteristics of parallel applications.
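Both laws are short enough to write down directly. In the sketch below, parallel_fraction is the share of the work that can run in parallel and workers is the number of processors; the function names are illustrative. Amdahl's law assumes a fixed problem size, while Gustafson's law assumes the problem grows with the number of workers and measures the parallel fraction on the parallel run.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup for a fixed-size problem: 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Scaled speedup when the problem grows with the machine: (1 - p) + p * n."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers
```

With 90 percent of the work parallelizable and 10 workers, Amdahl's law gives a speedup of about 5.26 and can never exceed 10 no matter how many workers are added, while Gustafson's law gives a scaled speedup of 9.1. Numbers like these can serve as the theoretical reference point when setting performance goals.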
The general methodology for conducting performance testing of parallel applications remains the same as that of performance testing sequential applications. The following section describes how some of the key performance testing steps are to be adapted for concurrent applications. For an understanding of the most helpful factors to measure, see the "Identifying Test Data and Reporting Metrics" sidebar.
The most important step in performance testing is defining the objectives or purpose of the test. For parallel applications, valid test objectives include understanding the parallelizability/scalability of the algorithm used, understanding the performance characteristics of various design alternatives, discovering the synchronizations and communication overheads, and validating that performance requirements are met.
It should be noted that the objective and scope of the performance testing could also vary based on who is conducting the test. For example, customers deploying a parallel application would perform performance testing to ensure that business needs are met, while the development team would be interested in conducting exhaustive performance testing to identify bottlenecks and to improve the parallelizability of the program. Further, the test objectives can vary throughout the software development cycle. During the design and implementation phase, the test objective might be to understand and improve the performance characteristics of the application, while during testing and release phases (stabilization phases), the objective might be to ensure that performance does not regress across builds.
Defining test scenarios and setting goals for them is another important step in the performance testing process. Development teams interested in understanding the performance characteristics of a parallel application should define three types of test scenarios—customer scenario tests, key performance indicator (KPI) tests, and micro benchmark (MBM) tests.
The relationships between these different types of performance tests can be considered as equivalent to that which exists between system tests, integration tests, and unit tests. In other words, a customer scenario test spawns a set of KPI tests and a KPI test spawns a set of MBM tests. Understanding the relationship between these different test types enables developers to understand performance relationships among various sub-parts of a parallel application. Gaining this understanding also permits better prioritization of performance bugs and, more importantly, allows the development team to respond to customer performance complaints with result-oriented suggestions.
Setting performance goals for any non-trivial application is hard, let alone for parallel applications. One approach in setting goals is to derive the goal by triangulating metrics obtained from an expected algorithmic or theoretical performance metric for each component and sub-component of the application, from comparative analysis (comparison candidates include similar competitor applications) and from profiling the existing implementation/application (if one exists).
As with test goals, every test scenario should define a set of acceptance criteria (set of values that validate the test results). For parallel test scenarios, the following variables can be used as acceptance criteria.
Variance in Test Results. High variance is indicative of an unstable application. The variance in test results can be understood by calculating the standard deviation of results.
CPU Utilization. Low CPU usage could be a sign of concurrency bugs such as deadlocks and synchronization overheads. Conversely, a high CPU utilization does not necessarily indicate success, as livelocks result in high CPU utilization.
Garbage Collections. For managed applications, too many memory management operations can hinder actual work execution and could point to faulty design or implementation.
Total Thread Execution Time. This enables you to approximate the total execution time if the program were to be executed sequentially. Comparing the total thread execution time to the elapsed time of a sequential implementation of the same algorithm is a way to understand the parallelization overheads (and to note whether the test is valid).
Total Memory Usage. Measuring total memory usage can be valuable in understanding the memory profile of the application. For managed applications, the garbage collector plays a significant role in altering the total memory usage of the application and should be taken into consideration while evaluating the memory usage.
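Two of these criteria, variance and total thread execution time, reduce to simple arithmetic on collected measurements. The following sketch shows one way to compute them; the function names and the 5 percent stability threshold are assumptions chosen for illustration, not recommended values.

```python
import statistics

def summarize_runs(elapsed_seconds, max_relative_stdev=0.05):
    """Summarize repeated measurements and flag runs whose spread is too large."""
    mean = statistics.mean(elapsed_seconds)
    stdev = statistics.stdev(elapsed_seconds)   # sample standard deviation
    relative = stdev / mean
    return {
        "mean": mean,
        "stdev": stdev,
        "relative_stdev": relative,
        "stable": relative <= max_relative_stdev,   # threshold is an assumption
    }

def parallel_overhead(total_thread_seconds, sequential_seconds):
    """Fraction of extra work done by all threads compared with a sequential run."""
    return total_thread_seconds / sequential_seconds - 1.0
```

For example, if the threads of a parallel run accumulate 12 seconds of execution time while the sequential implementation finishes in 10 seconds, parallel_overhead reports roughly 0.2, meaning about 20 percent of the work was spent on parallelization itself.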
Before you begin performance testing, there are three important steps you can take to ensure more relevant results. First, minimize the variance in the test environment (variance can be introduced by both the application and test environment). Variances in the test environment can be lowered by stopping unnecessary services and applications and reducing network interferences. Variance caused by the application can be reduced by performing warm-up iterations and collecting measurements from multiple iterations of the test.
Second, for managed applications, the untimely execution of the garbage collector can invalidate performance results. As a result, you should minimize the chances of the garbage collector getting invoked during performance measurements by forcibly invoking garbage collection (GC) before or after collecting the measurement and/or changing the GCLatencyMode to LowLatency (this is true only in the Microsoft® .NET Framework 3.5).
Third, always perform warm-up before collecting the metrics. Warming up by executing the test once before collecting the measurement allows for discounting onetime costs such as initialization of variables, JITing cost (in managed applications), and initial cache miss latency. Note that depending on performance scenarios, it might be important to measure the cold startup performance too. (For more information on improving application startup performance, see msdn2.microsoft.com/magazine/cc337892.aspx.)
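The same three precautions can be combined into a small measurement harness. The sketch below uses Python's gc and time modules as a stand-in for the managed-code facilities mentioned above, which is an assumption made purely for illustration. The measure function and its parameters are hypothetical.

```python
import gc
import time

def measure(workload, warmups=1, iterations=5):
    """Time a callable after warm-up runs, keeping the collector out of the timed region."""
    for _ in range(warmups):
        workload()                      # discard one-time costs such as initialization
    timings = []
    gc_was_enabled = gc.isenabled()
    for _ in range(iterations):
        gc.collect()                    # force a collection before the measurement
        gc.disable()                    # avoid an untimely collection while timing
        try:
            start = time.perf_counter()
            workload()
            timings.append(time.perf_counter() - start)
        finally:
            if gc_was_enabled:
                gc.enable()
    return timings
```

Collecting several timings rather than one also provides the data needed to judge the variance of the results.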
Software stress testing often refers to testing the application for robustness, availability, and graceful error handling. Robustness is usually tested by repeatedly invoking the same functionalities or combination of functionalities and ensuring correctness of execution. Availability is tested by executing the application under strenuous resource conditions such as low memory or high CPU load and ensuring that the application does not crash or that it fails gracefully. Graceful error handling is often tested for by introducing errors (fault injection) or by triggering multiple errors.
When conducting stress testing for concurrent applications, more focus should be given to testing the robustness and availability of the application. Due to the nondeterministic behavior of parallel applications, repeatedly invoking the same functionality or combination of functionalities could result in different code paths—which in turn might result in more discovered bugs. Further, testing for availability under strenuous resource conditions (such as creation of numerous threads or multiple threads entering deadlocks/livelocks) is especially important since the probability that a customer will encounter such scenarios is relatively high for parallel applications.
It should be noted that analyzing the root cause for stress-related concurrency bugs and trying to reproduce stress-related bugs in a consistent manner is extremely difficult. However, as the concurrency testing tools are still in a nascent state, stress testing is a very important supplement to functional testing.
In order to truly take advantage of all the power that parallel computing can offer, it is imperative that we evolve the current software development practices that were created for developing and testing sequential applications. It's difficult to develop and test concurrent applications due to a new class of bugs introduced by parallelism. This article outlined the various types of concurrent bugs, described strategies for finding them, and discussed how to adapt performance testing for concurrent scenarios.
Rahul V. Patil is an SDET Lead with the Parallel Computing Platform team at Microsoft and is responsible for leading the testing effort with the native concurrent framework.
Boby George is also an SDET with the Parallel Computing Platform team at Microsoft and is responsible for the performance testing effort with managed parallel computing framework.