CLR Inside Out
CLR Optimizations In .NET Framework 3.5 SP1
A surprising amount of work went into the core of the Microsoft .NET Framework for the .NET Framework 3.5 SP1 released in August 2008. Here, I'll provide in-depth information about the changes that we on the CLR team made to the common language runtime (CLR) and the improvements you can expect by simply running your existing CLR 2.0-based applications against this latest service pack. Most of our effort was centered on improving performance, security, and deployment of applications targeting the .NET platform.
Before delving into the details, note that you can download the .NET Framework 3.5 SP1 from the Microsoft Download Center. It is also available through Windows Update. Also, take a look at Figure 1. It answers the perennial question: Which CLR version is inside which version of the .NET Framework?
Figure 1 .NET Versions and CLR Versions

.NET Framework Version      | Contains CLR Version
2.0 SP1, 3.0 SP1, 3.5       | 2.0 SP1
2.0 SP2, 3.0 SP2, 3.5 SP1   | 2.0 SP2
Startup Performance Improvements
Improving the startup performance—in particular, the cold startup time of managed applications—was a primary focus of .NET Framework 3.5 SP1.
Managed assemblies in the .NET Framework are largely precompiled via NGen (see "The Performance Benefits of NGen"), and the layout of the code and data in the NGen images has a strong impact on the startup performance of applications that use the framework. In particular, since cold startup time is typically bound by the number of pages of the image that need to be read from disk, any effort to reduce it translates into an effort to better pack the image so that only a small subset of its pages are read during startup.
To that end, the CLR uses a profile-driven engine to optimize the layout of NGen images for assemblies in the .NET Framework. The profile data is collected by running scenarios that we believe are representative of our framework's usage, and it is then included as a resource in the corresponding assembly when that assembly is built. During NGen, if such a resource is found in an assembly, it is used to pack the "hot" code and data (those that were accessed when the training scenarios were run) together, thereby moving the "cold" code and data into their own sections in the image.
Furthermore, .NET Framework 3.5 SP1 contains several improvements to our hot/cold splitting infrastructure, which in turn improve the locality of NGen images. In particular, basic blocks in methods that contain various kinds of control flow—switch statements or unconditional branches, for example—are now rearranged such that more cold blocks can be split off and moved to the cold part of the NGen images. We also implemented a better algorithm for merging profile data from multiple training scenarios—we now prioritize packing together data/code that is accessed by the greatest number of scenarios in a given training set. Finally, to further improve spatial locality for code that is executed together temporally, the executable code in the NGen image is partitioned into RunOnce, RunMany, and RunNever sections. These improvements to our profile data collection and usage were made for both 32-bit and 64-bit platforms. As a result of these changes, there was a greater than 50% (sometimes significantly higher) reduction in the number of cold pages touched when a training scenario was re-run on top of a trained .NET Framework.
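The merging and packing heuristics described above can be sketched roughly as follows. This is a deliberately simplified model, assuming toy block names, scenario names, and a tie-breaking rule of my own; it is not the CLR's actual data structure or algorithm:

```python
# Toy model of profile-guided hot/cold splitting across training scenarios.
# Each "block" records which training scenarios touched it; blocks touched by
# more scenarios are considered hotter and are packed first, so the hot pages
# at the front of the image cover as many startup paths as possible.

def pack_blocks(profiles):
    """profiles: dict mapping block name -> set of scenario names that touched it."""
    hot = [b for b, scenarios in profiles.items() if scenarios]
    cold = [b for b, scenarios in profiles.items() if not scenarios]
    # Prioritize blocks accessed by the greatest number of scenarios;
    # break ties alphabetically (an arbitrary choice for this sketch).
    hot.sort(key=lambda b: (-len(profiles[b]), b))
    cold.sort()
    return hot, cold

# Hypothetical profile data from three training scenarios.
profiles = {
    "String.Concat": {"wpf_start", "console_start", "asp_start"},
    "Console.WriteLine": {"console_start"},
    "XmlSerializer.ctor": set(),          # never touched during training
    "List.Add": {"wpf_start", "asp_start"},
}

hot, cold = pack_blocks(profiles)
print(hot)   # hottest blocks first, packed together
print(cold)  # split off into the cold section of the image
```

The real pipeline of course works on basic blocks and data structures inside the image, but the ordering principle (widest scenario coverage first) is the one described in the text.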
Unfortunately, our profile data-gathering framework is still internal-only. So while this work automatically benefits any managed application that uses the .NET Framework libraries at startup, applications that NGen their own assemblies could optimize their startup performance even further if they could train those assemblies the same way. We'd like to make this framework available at some point, so stay tuned!
Note that a somewhat unrelated change ("Strong Name Bypass") also addresses another NGen-related problem that was mentioned earlier in my article "The Performance Benefits of NGen," namely, time spent verifying strong name signatures.
Switching gears a little, .NET Framework 3.5 SP1 also contains improvements to the quality of the code generated by the 32-bit and 64-bit JIT compilers. In particular, the 32-bit JIT can now inline method calls that involve passing, returning, or operating on structs. Prior to this release, the 32-bit JIT simply gave up trying to inline such functions because structs aren't first-class citizens in the x86 JIT and are lowered into byref pointers very early in the compilation process. Once that happens, the JIT is no longer able to identify the structs and cannot do the sort of normal optimizations, such as copy propagation, that are performed on primitive types.
Thus in the following example, the JIT compiler was unable to determine that it simply needs to propagate "a" to the last statement and instead generates code for three sets of redundant copies (b=a; c=b; d=c;).
MyStruct a, b, c, d;
a = new MyStruct(1, 2);
b = a;
c = b;
d = c;
The goal of this work was therefore to allow inlining of functions that use structs and to enable subsequent optimizations such as copy propagation. The approach taken was to select structs that meet certain heuristics (such as no more than four fields, fields that are either primitive types or refs, and so on), "promote" the fields to local variables internally in the compiler, and take advantage of the fact that standard optimizations are already enabled for primitive types and refs. Thus, all operations on the promoted structs are changed to reflect operations on the individual "field locals." More details can be found on the blog JIT, NGen, and other Managed Code Generation Stuff. After tuning the inliner heuristic, we ended up with better code quality and lower NGen time, so this work was a win-win. Note that the inliner in the 64-bit JIT is different and wasn't changed dramatically in 3.5 SP1; however, we've made several improvements to it since, which should be available in the upcoming .NET Framework 4.0 release.
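To see why field promotion matters, here is a toy sketch of copy propagation over promoted fields. Python is used purely for illustration; the real JIT operates on its internal IR, and the two-field struct below is an assumed example:

```python
# Toy copy propagation over straight-line assignments. After struct fields are
# promoted to individual locals, each struct copy (b = a) becomes plain
# assignments between locals, and the chain b = a; c = b; d = c collapses.

def copy_propagate(stmts):
    """stmts: list of (dest, src) pairs; returns dest -> propagated value."""
    values = {}
    for dest, src in stmts:
        # If src is itself a known copy, propagate through it.
        values[dest] = values.get(src, src)
    return values

# 'a = new MyStruct(1, 2); b = a; c = b; d = c' after promoting the
# (hypothetical) fields x and y of MyStruct to locals:
stmts = [("a_x", "1"), ("a_y", "2"),
         ("b_x", "a_x"), ("b_y", "a_y"),
         ("c_x", "b_x"), ("c_y", "b_y"),
         ("d_x", "c_x"), ("d_y", "c_y")]

result = copy_propagate(stmts)
print(result["d_x"], result["d_y"])  # both resolve straight to the constants
```

Once the fields are plain locals, the compiler can see that d is just a, and the three redundant struct copies disappear, which is exactly what the promotion work enables.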
In addition to the inliner work, both JIT compilers in 3.5 SP1 are now better at propagating assertions (what is known to be true) in the generated code, which in turn results in better code quality because of eliminated null checks, TypeOf checks, comparisons against constants, comparisons between variables, and so on.
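A toy model of assertion propagation might look like the following sketch, where a fact established earlier on a straight-line path lets a later check be dropped. The mini "IR" opcodes here are invented for illustration and are not the JIT's real representation:

```python
# Toy assertion propagation: walk a straight-line instruction list and drop
# checks whose outcome is already known from earlier statements on the path.

def eliminate_checks(ops):
    known_nonnull = set()
    kept = []
    for op, var in ops:
        if op == "assign_new":            # x = new T() establishes x != null
            known_nonnull.add(var)
        elif op == "null_check":
            if var in known_nonnull:
                continue                  # provably redundant: eliminated
            known_nonnull.add(var)        # a surviving check establishes the fact
            kept.append((op, var))
        else:
            kept.append((op, var))
    return kept

ops = [("assign_new", "x"),
       ("null_check", "x"),   # eliminated: x was just allocated
       ("null_check", "y"),   # kept: nothing is known about y yet
       ("null_check", "y")]   # eliminated: the previous check proved y non-null
print(eliminate_checks(ops))
```

The same bookkeeping idea extends to TypeOf checks and comparisons against constants or other variables, as the paragraph above describes.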
The JIT compiler can now also straighten branches by predicting whether a branch is more likely to be taken or not. Thus, if a taken branch is expected to be more common than the fall-through code path, the fall-through blocks are moved further down, after the blocks in the branch target. (The condition of the branch is also reversed, of course.) Straightening branches in this manner helps branch prediction (forward branches are typically assumed to not be taken when there is no branch history) and improves cache locality. While the outcome of some branches can be predicted statically (exceptional code-paths are assumed to not be executed, for instance), we also use the previously mentioned profile data for the purposes of branch straightening. This work on its own resulted in a greater than 10% improvement in the throughput of some of our ASP.NET benchmarks.
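Conceptually, branch straightening boils down to a decision like this sketch. The counts and block names are hypothetical profile data, not the JIT's real representation:

```python
# Toy branch straightening: if profile data says the taken path of a
# conditional branch is hotter than the fall-through path, invert the
# condition and reorder the blocks so the hot successor falls through.

def straighten(cond, taken, fallthrough, taken_count, fall_count):
    """Return (condition, block laid out next, block moved further down)."""
    if taken_count > fall_count:
        # Invert the branch so the hot block becomes the fall-through.
        return "not (" + cond + ")", taken, fallthrough
    return cond, fallthrough, taken

# Hypothetical counts: the branch was taken 90 times, fell through 10 times.
cond, next_block, far_block = straighten("x == 0", "hot_path", "cold_path", 90, 10)
print(cond, next_block, far_block)
```

After the transformation, the common case runs straight through without a taken branch, which is what helps both the hardware branch predictor and cache locality.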
While this isn't a JIT optimization by any means, we did make another major change to the 32-bit JIT compiler in 3.5 SP1. For managed call-stacks, the EBP chain can now be walked (by a debugger or profiler) to determine what functions were called. This simply means that the JIT compiler no longer uses EBP as a scratch register and instead uses it exclusively to store the frame pointer. Note, however, that the JIT compiler doesn't guarantee that it will push or pop EBP to or from the stack for every function call (to avoid the performance cost of doing so for tiny methods). Thus, walking the EBP chain now gives you an ordered list of methods that were invoked, but there can be methods missing from the list. We did choose a conservative heuristic when determining which methods to generate EBP frames for, so it should be reasonably easy to determine the complete list given an understanding of the program's call graph. Of course, in debug mode we push or pop the EBP register for every function call. The primary motivation behind this work was to enable kernel-mode profilers that cannot safely use our managed stack-walking APIs to walk the stack when managed applications are being run.
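The EBP-chain walk, including the caveat about missing frames, can be modeled with a small sketch. The frame layout here is a simulation in a dictionary, not real stack memory:

```python
# Toy EBP-chain walk: model each stack frame as (saved_ebp, function_name),
# keyed by its frame address. Walking the chain yields an ordered call list,
# but a method compiled without an EBP frame simply doesn't appear in it.

def walk_ebp_chain(stack, ebp):
    calls = []
    while ebp is not None:
        saved_ebp, func = stack[ebp]
        calls.append(func)
        ebp = saved_ebp          # follow the saved frame pointer to the caller
    return calls

# Call chain Main -> A -> B -> C, but B was compiled without an EBP frame,
# so C's frame links straight past it to A's frame.
stack = {0x30: (None, "Main"), 0x20: (0x30, "A"), 0x10: (0x20, "C")}
print(walk_ebp_chain(stack, 0x10))  # B is missing from the walk
```

This mirrors the behavior described above: the walk gives an ordered list of invoked methods, with gaps where tiny methods skipped the EBP push/pop.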
NGen and ASLR
We made a significant overhaul of the infrastructure that generates and writes out NGen images in 3.5 SP1. It is now more efficient, consumes less memory, and produces images that are more compact. As a result, NGen itself is now more than twice as fast in some cases (when compared to a runtime that uses the old NGen infrastructure), and the cold startup time of some large applications, such as Expression Blend, improved by approximately 10%. The startup time improvement was a side effect of the fact that the data structures persisted in the NGen image became more compact. This work improved .NET Framework setup time as well, since a good portion of the installation time is spent generating NGen images for assemblies in the framework.
Starting with this service pack, we've also opted NGen images into using Address Space Layout Randomization (ASLR), a security feature that was added to Windows in Windows Vista (see "Address Space Layout Randomization in Windows Vista"). ASLR relies on the ability to efficiently load PE images at random locations in the virtual address space, and hence on the ability to efficiently fix up addresses referred to in the images themselves. Unlike unmanaged images, which contain direct references only to locations within the same image, NGen images used to also contain direct references to code and data in other assemblies' NGen images. (See the section on hardbinding in my previous article, "The Performance Benefits of NGen.") Such external references cannot easily be handled by the kernel's fix-up logic. Thus, one of the more significant efforts in .NET Framework 3.5 SP1 involved changing the format of NGen images to eliminate the use of such direct pointers without giving up the performance wins we'd obtained when we introduced hardbinding. Indirect references, of which managed code typically contains a lot, are expensive not only because of the additional level of indirection involved, but also because they need to be fixed up by default. (And writing to the indirection cells results in private pages that cannot be shared across processes.) These indirect references were strewn all over the NGen image (and resulted in lots of private pages); hence, much of the effort was centered on restructuring the image so that all the indirection cells were colocated and patched together, resulting in as few pages written to at run time as possible. In the end, a performance degradation was measurable in only a few warm startup and throughput scenarios, and even in those cases the impact was minimal.
One of the side effects of the ASLR work is that you no longer need to worry about choosing noncolliding base addresses for your NGen images (see "The Performance Benefits of NGen") when targeting Windows Vista or newer platforms. (It is still an issue on Windows XP and older platforms, however.) In another unrelated security change, all managed applications built with our 3.5 SP1 tool set are now opted into Data Execution Prevention (DEP).
The .NET Framework 3.5 Client Profile
In the .NET Framework 3.5 SP1, we also tried to address one of the other major problems we've had with the framework—namely, getting it deployed on end user machines. We decided to focus on providing a leaner framework package for certain kinds of applications that made use of only a subset of the functionality provided by the framework. Because the focus of this package was to enable developers to write client applications—primarily WPF-based applications—we call it the Client Profile. Note that the Client Profile is a true subset of the full .NET Framework.
Introducing a new way to split the framework in a service-pack release turned out to be a far more challenging task than we'd originally envisioned it to be. In particular, we already had a "horizontal slicing" of our product; we were now introducing a "vertical slicing."
This led to interesting scenarios that involved installing old frameworks on top of new partial frameworks, and vice versa. To trim down the set of possible scenarios, we made the following simplifications:
- The Client Profile can be installed only on machines that don't have any CLR 2.0-based framework installed (since installing the framework on "clean" Windows XP machines was the biggest pain point).
- Old versions of the framework cannot be installed on top of the Client Profile (thus the ASP.NET stack included in .NET Framework 2.0 cannot be installed on top of the new CLR included in the 3.5 SP1 Client Profile)—there just wasn't a good way for us to test all the resulting combinations of old/new binaries.
We encountered another interesting situation when testing some applications on top of the Client Profile. Some applications that targeted existing .NET Framework versions (2.0/3.0/3.5) seemed to install and run on top of the Client Profile (since they all contain the same compatible 2.0 CLR) and then failed when attempting to call APIs that weren't included in the Client Profile! While the particular application that made us think of this problem had an installer that wasn't looking for the documented install keys for the framework, we quickly realized that this would affect all xcopy-deployed applications as well. As a consequence, we decided to block applications from running on top of the Client Profile unless they were explicitly marked as capable of running on the subset. We then changed the runtime to ensure that non-Client applications failed gracefully, immediately, and with a diagnostic error message (requesting the end user to install the full .NET Framework 3.5 SP1) when they were executed on the Client Profile.
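For reference, an application declares that it can run on the Client Profile through the supportedRuntime element in its configuration file; to the best of my knowledge, the documented marker is the sku attribute, as sketched here:

```xml
<!-- app.exe.config: marks the application as capable of running on the
     Client Profile. The sku attribute carries the marker; v2.0.50727 is
     the standard CLR 2.0 version string. -->
<configuration>
  <startup>
    <supportedRuntime version="v2.0.50727" sku="Client"/>
  </startup>
</configuration>
```

Applications without this marker are the ones that fail fast with the diagnostic message described above when run on the Client Profile.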
A lot more information about targeting the Client Profile, the subset of functionality that it supports, and so on can be found in the ".NET Framework Client Profile" article on MSDN.
Before wrapping up, I'd like to mention another change we made to the runtime in 3.5 SP1, one that a lot of users have asked for in the past: the ability to launch managed applications from a local intranet share in full trust without having to manually tweak security settings. This is described in further detail in the .NET Security Blog article, "FullTrust on the LocalIntranet," but it seems worthwhile to mention here in case you didn't already know about this long-requested change.
This article was simply intended to give you an idea of the changes we made to the CLR in the latest service-pack release of the .NET Framework. Note that there were a lot of changes made to other parts of the .NET Framework as well (an overview can be found in Scott Guthrie's blog post). In conclusion, I hope you are now excited about trying out this release that's been a year in the making.
If you have any questions or comments, don't hesitate to send them our way.
Surupa Biswas is a program manager on the CLR team at Microsoft. She works on the runtime's back-end compiler and focuses primarily on pre-compilation technologies.