Boosting the Performance of the Microsoft .NET Framework
Summary: Three technologies that directly improve Microsoft .NET Framework application performance on the Intel architecture are profiled: Just-In-Time compilation, garbage collection, and Hyper-Threading technology. (4 printed pages)
With the release of its new .NET Framework, Microsoft® introduced the common language runtime, a platform abstraction layer that provides an optimized environment for application execution.
The runtime is the execution engine for the Microsoft .NET Framework and is responsible for compiling managed code, verifying type safety, managing the thread pool, executing code, and handling memory management issues, among other things. The functionality embodied in this layer hides substantial complexity and allows the developer to focus on solving the task at hand. It is this layer that is aware of the target Intel® processor version and can best take advantage of the processor’s unique characteristics.
We will look at three technologies that directly improve .NET Framework application performance on the Intel architecture: Just-In-Time compilation, garbage collection, and Hyper-Threading technology.
Just-In-Time Means Run-It-Fast
A key component of the runtime that results in fast execution of .NET Framework applications is the Microsoft Just-In-Time compiler (JIT). The JIT takes the MSIL (Microsoft Intermediate Language) code produced by Microsoft Visual Studio® .NET, the Microsoft .NET Framework SDK, or other tools that target the .NET Framework, and uses it to generate native x86 instructions. Intel processors have evolved rapidly from the early Pentium® processor design to the latest multiprocessor-based systems using the Intel Xeon™ processor. When the JIT detects a genuine Intel processor, the code it produces takes advantage of Intel NetBurst™ architecture innovations and Hyper-Threading technology. The JIT compiler version 1.1, due to ship later this year, will also take advantage of Streaming SIMD Extensions 2 (SSE2).
Because the JIT operates at run time, it has access to information that is unavailable to a conventional ahead-of-time compiler, and it can use that information to apply increasingly sophisticated optimizations. The runtime effect of these JIT-only optimizations can be quite dramatic. Since the JIT knows which code paths are actually executing, it can dynamically remove levels of indirection and inline frequently called methods. These optimizations result in a more responsive .NET application due to fewer cache misses and reduced branching. Thanks to the raw speed of the latest Xeon and Pentium 4 processors and the Mobile Intel Pentium 4 processor-M, all of this can be accomplished without a perceptible delay in application execution. (These latest processors currently run at clock rates up to 2.4 GHz, and it's reasonable to expect that they'll go even faster in the future.)
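To make the inlining point concrete, here is a minimal sketch (the `Vector2` type and its `Dot` method are illustrative, not from the article). Small, non-virtual methods like this are prime inlining candidates, and marking a class `sealed` gives the JIT further license to remove indirection, since no derived type can ever override its behavior:

```csharp
using System;

// Illustrative type: a small method like Dot is a prime inlining candidate.
// At run time the JIT can replace the call with the method body itself,
// eliminating call overhead and branching.
sealed class Vector2
{
    public double X, Y;
    public Vector2(double x, double y) { X = x; Y = y; }

    // Because Vector2 is sealed, calls through Vector2 references can
    // never be overridden, so the JIT is free to bind and inline them.
    public double Dot(Vector2 other)
    {
        return X * other.X + Y * other.Y;
    }
}

class Program
{
    static void Main()
    {
        Vector2 a = new Vector2(1.0, 2.0);
        Vector2 b = new Vector2(3.0, 4.0);
        // In an optimized run the Dot call typically compiles down to
        // straight-line arithmetic with no call instruction at all.
        Console.WriteLine(a.Dot(b)); // prints 11
    }
}
```

The developer writes ordinary method calls; the decision to inline is made entirely by the JIT based on the code it observes executing.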
Intelligent Garbage Collection
The .NET Framework is unique in its ability to support code running in both managed and native (unmanaged) models, the latter being possible in C++. While both managed and native code can execute in the runtime environment, only managed code contains the information that allows the runtime to guarantee safe execution. (In essence, unmanaged code predates the .NET Framework; code written using the Microsoft Foundation Classes is one example. Managed code targets the .NET Framework specifically.) By writing code for the managed model, a developer is freed from the error-prone task of managing application memory. A common misconception is that code that relies on garbage collection will be slower than native C++ code with explicit object allocation and deallocation. That may be the case with other garbage collection implementations, but not Microsoft's: it is designed to incorporate information about processor cache size, dynamic workload, object locality, and multiprocessor configuration.
The runtime's garbage collector uses a generational, mark-and-compact algorithm that allows for excellent performance. It assumes that newly created objects tend to be small and accessed often, whereas older objects tend to be larger and accessed less frequently. The algorithm divides the managed heap into several regions, called generations, which let the collector spend as little time as possible on each collection. Generation 0 contains young, frequently accessed objects. Generations 1 and 2 hold older objects that have survived earlier collections and can be collected less often.
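The generation mechanism can be observed directly through the Framework's `GC` class. A minimal sketch (forcing collections with `GC.Collect` is for demonstration only; production code should leave collection to the runtime):

```csharp
using System;

class GenerationDemo
{
    static void Main()
    {
        object obj = new object();                  // freshly allocated objects start in generation 0
        Console.WriteLine(GC.GetGeneration(obj));   // prints 0

        GC.Collect();                               // obj survives one collection and is promoted
        Console.WriteLine(GC.GetGeneration(obj));   // prints 1

        GC.Collect();                               // a second survival promotes it to the oldest generation
        Console.WriteLine(GC.GetGeneration(obj));   // prints 2
    }
}
```

Each promotion moves the object into a generation that is examined less frequently, so long-lived objects stop contributing to the cost of the common, fast Generation 0 collections.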
This design improves utilization of the processor cache over what is possible with native memory management. A processor such as the Intel Xeon processor has up to 512 KB of L2 Advanced Transfer Cache that resides on-chip and is connected to the core via a 256-bit (32-byte) interface, which transfers data to the core at rates far greater than any system can replenish the cache. The garbage collector detects the L2 cache size and sizes the Generation 0 heap accordingly. This optimization greatly increases the probability that a referenced object will reside in cache, where the processor core can access it quickly.
In traditional malloc libraries, a new allocation is typically placed at the address of the last freed block of sufficient size. This behavior results in poor locality, as the objects needed by a method can be dispersed anywhere in the address space. The runtime improves on this by allocating objects in a contiguous block of memory, ordered by time of allocation. The result is excellent locality and fewer cache misses, because objects referenced within a method have a high probability of residing on the same page. Even with the Pentium 4 and Xeon processors' advanced prefetch capability, the cost of bringing an object in from off-chip memory is high. These extremely fast processors benefit greatly from keeping the pipeline full of user instructions and data.
Running in Parallel
In early 2002, Intel introduced an architectural innovation that results in even better performance of .NET Framework applications. With Hyper-Threading technology, one physical Xeon processor can be viewed as two logical processors, each with its own architectural state. The performance improvements due to this design arise from two factors: threads can be scheduled to execute simultaneously on the two logical processors of a physical processor, and on-chip execution resources are utilized more fully than when a single thread consumes them.
The runtime is designed to fully exploit this parallelism, with support for Hyper-Threading and multiprocessor systems. A developer writing a .NET Framework application needs no additional code to take advantage of these features, because the runtime manages the details through its thread pool. The pool can spawn new threads and block others to optimize CPU utilization. It also recycles threads when they finish, restarting them without the overhead of destroying and creating new ones. The thread pool is also aware of issues specific to managed code, such as garbage collection cycles, and adjusts its scheduling logic accordingly.
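Using the pool from application code is a one-line affair. A minimal sketch with `ThreadPool.QueueUserWorkItem` (the `Work` method and the event are illustrative):

```csharp
using System;
using System.Threading;

class PoolDemo
{
    static ManualResetEvent done = new ManualResetEvent(false);

    static void Work(object state)
    {
        // This runs on whichever pool thread the runtime selects.
        Console.WriteLine("Running on a pool thread");
        done.Set();
    }

    static void Main()
    {
        // Hand the work item to the runtime's thread pool; the pool
        // picks (or spawns) a thread and recycles it afterwards. No
        // explicit thread creation or teardown is needed.
        ThreadPool.QueueUserWorkItem(new WaitCallback(Work));
        done.WaitOne();   // wait for the work item to finish
    }
}
```

The application never decides which thread, or how many threads, to use; the pool sizes itself to the processor configuration it finds, including logical processors exposed by Hyper-Threading.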
.NET Framework applications designated for deployment on multiprocessor machines, such as those using the Intel Xeon MP processors, should make use of the server-optimized runtime implementation (MSCorSvr.dll). This version of the runtime has a modified garbage collector that splits the managed heap into several sections, with one per CPU. Each CPU has its own garbage-collection thread, so they can all participate simultaneously in the collection cycle.
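In the Framework releases current at the time of writing, the server build is selected by the hosting process loading MSCorSvr.dll rather than MSCorWks.dll. Later releases of the Framework (2.0 onward) also expose the choice declaratively in the application configuration file; a hedged sketch of that configuration, for readers on those versions:

```xml
<!-- app.exe.config: request the server garbage collector.
     Note: the <gcServer> element is recognized from .NET Framework 2.0 on;
     on earlier versions the server collector is chosen by the host process. -->
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```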
Getting a Jump on the Competition
.NET Framework developers interested in getting early availability and discounts on the latest Intel solutions and software tools should consider the Intel Early Access Program. This program jumpstarts a development team intending to build world-class .NET Framework solutions on an Intel platform. It also provides a great outlet for your work by offering co-marketing and sales development opportunities with Intel.