Article
08/10/2015

April 2012

Volume 27 Number 04

CLR - An Overview of Performance Improvements in .NET 4.5

By Ashwin Kamath | April 2012

This article discusses the prerelease version of the Microsoft .NET Framework 4.5. All related information is subject to change.

On the Microsoft .NET Framework team, we’ve always understood that improving performance is at least as valuable to developers as adding new runtime features and library APIs. The .NET Framework 4.5 includes significant investments in performance, benefiting all application scenarios. In addition, because .NET 4.5 is an update to .NET 4, even your .NET 4 applications can enjoy many of the performance improvements to existing .NET 4 features.

When it comes to enabling developers to deliver satisfying application experiences, startup time (see msdn.microsoft.com/magazine/cc337892), memory usage (see msdn.microsoft.com/magazine/dd882521), throughput and responsiveness really matter. We set goals on improving these metrics for the different application scenarios, and then we design changes to meet or exceed them. In this article, I’ll provide a high-level overview of some of the key performance improvements we made in the .NET Framework 4.5.

CLR

In this release, we focused on: exploiting multiple processor cores to improve performance, reducing latency in the garbage collector and improving the code quality of native images. Following are some of the key performance improvement features.

Multicore Just-in-Time (JIT): We continually monitor low-level hardware advancements and work with chip vendors to achieve the best hardware-assisted performance. In particular, we’ve had multicore chips in our performance labs since they became available and have made appropriate changes to exploit that particular hardware change; however, those changes benefited very few customers at first.

At this point, nearly every PC has at least two cores, so new features that require more than one core are immediately and broadly useful. Early in the development of .NET 4.5, we set out to determine if it was reasonable to use multiple processor cores to share the task of JIT compilation, specifically as part of application startup, to speed up the overall experience. As part of that investigation, we discovered that enough managed apps reach the minimum threshold of JIT-compiled methods to make the investment worthwhile.

The feature works by using a background thread to JIT compile the methods that are likely to be executed; on a multicore machine, that thread runs on another core, in parallel with the application. In the ideal case, the second core quickly gets ahead of the mainline execution of the application, so most methods are JIT compiled by the time they’re needed. In order to know which methods to compile, the feature generates profile data that keeps track of the methods that are executed and then is guided by this profile data on a later run. This requirement of generating profile data is the primary way in which you interact with this feature.

With a minimal addition of code, you can use this feature of the runtime to significantly improve the startup times of both client applications and Web sites. In particular, you need to make straightforward calls to two static methods, SetProfileRoot and StartProfile, on the ProfileOptimization class in the System.Runtime namespace. See the MSDN documentation for more information. Note that this feature is enabled by default for ASP.NET 4.5 applications and Silverlight 5 applications.
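As a minimal sketch of the two calls on ProfileOptimization, the following console program enables multicore JIT at startup. The profile folder name ("MyAppProfiles"), the profile file name ("Startup.profile") and the helper class are placeholders chosen for illustration; only SetProfileRoot and StartProfile come from the framework.

```csharp
using System;
using System.IO;
using System.Runtime;

static class MulticoreJitStartup
{
    // Points the runtime at a profile folder and starts recording or replaying a profile.
    // Both arguments are illustrative; pick names that suit your application.
    public static void Enable(string profileRoot, string profileName)
    {
        // The profile folder must exist before profile data can be written to it.
        Directory.CreateDirectory(profileRoot);

        // Folder in which the runtime stores profile files for this application.
        ProfileOptimization.SetProfileRoot(profileRoot);

        // First run: records which methods get JIT compiled.
        // Later runs: compiles those methods ahead of time on a background thread.
        ProfileOptimization.StartProfile(profileName);
    }

    static void Main()
    {
        // Hypothetical location under the temp folder, used only for this sketch.
        string root = Path.Combine(Path.GetTempPath(), "MyAppProfiles");
        Enable(root, "Startup.profile");

        Console.WriteLine("Application started.");
    }
}
```

Calling these methods as early as possible in Main gives the background thread the most time to get ahead of mainline execution.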

Optimized Native Images: For several releases, we have enabled you to precompile code to native images via a tool called Native Image Generation (NGen). The resulting native images typically provide significantly faster application startup than is seen with JIT compilation. In this release we introduced a supplemental tool called Managed Profile Guided Optimization (MPGO), which optimizes the layout of native images for even greater performance. MPGO uses a profile-guided optimization technology, very similar in concept to the multicore JIT described earlier. The profile data for the app captures a representative scenario or set of scenarios, and is used to reorder the layout of a native image such that methods and other data structures needed at startup are colocated densely within one part of the native image, resulting in a shorter startup time and a smaller working set (an application’s memory usage). In our own tests and experience, we typically see a benefit from MPGO with larger managed applications (for example, large interactive GUI applications), and we recommend its use largely along those lines.

The MPGO tool generates profile data for an intermediate language (IL) DLL and adds the profile as a resource to the IL DLL. The NGen tool is used to precompile IL DLLs after profiling, and it performs additional optimization due to the presence of the profile data. Figure 1 visualizes the process flow.

Figure 1 Process Flow with the MPGO Tool

Large Object Heap (LOH) Allocator: Many .NET developers requested a solution for the LOH fragmentation issue or a way to force compaction of the LOH. You can read more about how the LOH works in the June 2008 CLR Inside Out column by Maoni Stephens at msdn.microsoft.com/magazine/cc534993. To summarize, any object of size 85,000 bytes or more gets allocated on the LOH. Currently, the LOH isn’t compacted, because moving large objects would consume a lot of garbage collector time and is therefore an expensive proposition. When objects on the LOH get collected, they leave free spaces between the objects that survive the collection, which leads to the fragmentation issue.

To explain a bit more, the CLR makes a free list out of the dead objects, allowing them to be reused later to satisfy large object allocation requests; adjacent dead objects are made into one free object. Eventually a program can end up in a situation where these free memory fragments between live large objects aren’t large enough to make further object allocations in the LOH, and because compaction isn’t an option, we quickly run into problems. This leads to applications becoming unresponsive and eventually leads to out-of-memory exceptions.

In .NET 4.5 we’ve made some changes to make efficient use of memory fragments in the LOH, particularly concerning the way we manage the free list. The changes apply to both workstation and server garbage collection (GC). Please note that this doesn’t change the 85,000-byte limit for LOH objects.
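One way to observe the 85,000-byte threshold is to ask the garbage collector which generation it reports for a freshly allocated array. The sketch below assumes the default threshold; the class and method names are illustrative. Because the LOH is collected together with generation 2, a large array is reported as generation 2 right away, whereas a small array starts out in a younger generation.

```csharp
using System;

static class LohThresholdDemo
{
    // Returns the generation the GC reports for a newly allocated byte array.
    public static int GenerationOfNewArray(int length)
    {
        byte[] buffer = new byte[length];
        return GC.GetGeneration(buffer);
    }

    static void Main()
    {
        // Well below the threshold: allocated on the small object heap.
        Console.WriteLine("1,000-byte array:  generation " + GenerationOfNewArray(1000));

        // 85,000 bytes of data plus the array header: allocated on the LOH.
        Console.WriteLine("85,000-byte array: generation " + GenerationOfNewArray(85000));
    }
}
```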

Background GC for Server: In .NET 4 we enabled background GC for workstation GC. Since that time, we’ve seen a greater frequency of machines with upper-end heap sizes that range from a few gigabytes to tens of gigabytes. Even an optimized parallel collector such as ours can take seconds to collect such large heaps, thus blocking application threads for seconds. Background GC for server introduces support for concurrent collections to our server collector. It minimizes long blocking collections while continuing to maintain high application throughput.

If you’re using server GC, you don’t need to do anything to take advantage of this new feature; the server background GC will automatically happen. The high-level background GC characteristics are the same for client and server GC:

Only full GC (generation 2) can happen in the background.
Background GC doesn’t compact.
Foreground GC (generation 0/generation 1 GC) can happen during background GC. Server GC is done on dedicated server GC threads.
Full-blocking GC also happens on dedicated server GC threads.
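To check which collector a process is actually using, you can inspect the GCSettings class. The following sketch only reports the current settings; the class name is illustrative. Server GC itself is typically selected in the application configuration file (the gcServer element under runtime), not in code, and a latency mode of Interactive indicates that concurrent (background) collection is enabled.

```csharp
using System;
using System.Runtime;

static class GcModeReport
{
    // Summarizes the GC flavor and latency mode of the current process.
    public static string Describe()
    {
        return "Server GC: " + GCSettings.IsServerGC +
               ", latency mode: " + GCSettings.LatencyMode;
    }

    static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```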

Asynchronous Programming

A new asynchronous programming model was introduced as part of the Visual Studio Async CTP and is now an important part of .NET 4.5. These new language features in .NET 4.5 enable you to productively write asynchronous code. Two new language keywords in C# and Visual Basic called “async” and “await” enable this new model. .NET 4.5 has also been updated to support asynchronous applications that use these new keywords.
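As a small illustration of the two keywords, the sketch below awaits a simulated I/O operation. Task.Delay stands in for a real network or disk call, and the method name is made up for this example.

```csharp
using System;
using System.Threading.Tasks;

static class AsyncDemo
{
    // "async" lets the method use "await"; the caller gets a Task<int> back immediately.
    public static async Task<int> ComputeLengthAsync(string text)
    {
        // While the delay is pending, no thread is blocked waiting for it.
        await Task.Delay(100);
        return text.Length;
    }

    static void Main()
    {
        // Main can't be marked async in C# 5, so block once at the top level.
        int length = ComputeLengthAsync("hello").GetAwaiter().GetResult();
        Console.WriteLine(length); // prints 5
    }
}
```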

The Visual Studio Asynchronous Programming portal on MSDN (msdn.microsoft.com/vstudio/async) is a great resource for samples, white papers and talks on the new language features and support.

Parallel Computing Libraries

A number of improvements were made to the parallel computing libraries (PCLs) in .NET 4.5 to enhance the existing APIs.

Faster Lighter-Weight Tasks: The System.Threading.Tasks.Task and Task<TResult> classes were optimized to use less memory and execute faster in key scenarios. Specifically, cases related to creating Tasks and scheduling continuations saw performance improvements of up to 60 percent.
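The two operations mentioned here, creating a Task and scheduling a continuation on it, look like this in code. This is only a sketch of the pattern that benefits; the class and method names are illustrative, and the sketch makes no attempt to measure the improvement.

```csharp
using System;
using System.Threading.Tasks;

static class ContinuationDemo
{
    // Creates a task, then schedules a continuation that runs when it completes.
    public static Task<int> SquareThenAddOne(int x)
    {
        Task<int> square = Task.Run(() => x * x);        // task creation
        return square.ContinueWith(t => t.Result + 1);   // continuation scheduling
    }

    static void Main()
    {
        Console.WriteLine(SquareThenAddOne(6).Result); // prints 37
    }
}
```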

More PLINQ Queries Execute in Parallel: PLINQ falls back to sequential execution when it thinks that it would do more harm (make things slower) by parallelizing a query. These decisions are educated guesses and not always perfect, and in .NET 4.5, PLINQ will recognize more classes of queries that it can successfully parallelize.
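If you know that a particular query benefits from parallelism, you don’t have to rely on PLINQ’s guess: WithExecutionMode lets you override the sequential fallback. The sketch below uses a deliberately simple query with illustrative names; whether forcing parallelism helps depends on the work done per element.

```csharp
using System;
using System.Linq;

static class PlinqDemo
{
    // Sums the squares of 1..count, optionally overriding PLINQ's sequential fallback.
    public static long SumOfSquares(int count, bool forceParallel)
    {
        ParallelQuery<int> source = Enumerable.Range(1, count).AsParallel();

        if (forceParallel)
        {
            // Tells PLINQ to parallelize even when its heuristics would not.
            source = source.WithExecutionMode(ParallelExecutionMode.ForceParallelism);
        }

        return source.Select(n => (long)n * n).Sum();
    }

    static void Main()
    {
        Console.WriteLine(SumOfSquares(1000, false)); // prints 333833500
        Console.WriteLine(SumOfSquares(1000, true));  // prints 333833500
    }
}
```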

Faster Concurrent Collections: A number of tweaks were made to System.Collections.Concurrent.ConcurrentDictionary<TKey, TValue> to make it faster for certain scenarios.
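For readers who haven’t used the type, here is a minimal sketch of a typical ConcurrentDictionary scenario: several threads updating shared counters without explicit locks. The word-counting scenario and names are illustrative and aren’t taken from the team’s benchmarks.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class WordCountDemo
{
    // Counts occurrences of each word, with updates issued from multiple threads.
    public static ConcurrentDictionary<string, int> Count(string[] words)
    {
        var counts = new ConcurrentDictionary<string, int>();

        Parallel.ForEach(words, word =>
        {
            // Adds the key with value 1, or atomically replaces the existing value.
            counts.AddOrUpdate(word, 1, (key, current) => current + 1);
        });

        return counts;
    }

    static void Main()
    {
        var counts = Count(new[] { "gc", "jit", "gc", "gc", "jit", "ngen" });
        Console.WriteLine(counts["gc"]);   // prints 3
        Console.WriteLine(counts["jit"]);  // prints 2
        Console.WriteLine(counts["ngen"]); // prints 1
    }
}
```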

For more details on these changes, check out the Parallel Computing Platform team blog at blogs.msdn.com/b/pfxteam.

ADO.NET

Null Bit Compression Row Support: Null data is especially common for customers leveraging the SQL Server 2008 sparse columns feature, whose result sets can contain a large number of null columns. For this scenario, row null-bit-compression (SQLNBCROW token or, simply, NBCROW) was introduced. It reduces the space used by result set rows that have large numbers of columns by compressing multiple columns with NULL values into a bit mask. This significantly helps the tabular data stream (TDS) protocol’s compression of data where there are many nulled columns.
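The idea of a null bit mask can be sketched in a few lines. The code below is a conceptual illustration only: the class, the method names and the bit ordering are assumptions made for this example, and it doesn’t reproduce the actual TDS wire format. It shows how one bit per column can record which values are NULL, so that the null columns themselves need not be sent.

```csharp
using System;

static class NullBitmapSketch
{
    // Builds a bitmap with one bit per column; a set bit marks a NULL value.
    public static byte[] BuildNullBitmap(object[] row)
    {
        byte[] bitmap = new byte[(row.Length + 7) / 8]; // one byte per 8 columns, rounded up
        for (int i = 0; i < row.Length; i++)
        {
            if (row[i] == null)
            {
                bitmap[i / 8] |= (byte)(1 << (i % 8));
            }
        }
        return bitmap;
    }

    // Tests whether the given column is marked NULL in the bitmap.
    public static bool IsNull(byte[] bitmap, int column)
    {
        return (bitmap[column / 8] & (1 << (column % 8))) != 0;
    }

    static void Main()
    {
        // Nine columns, six of them NULL: the null information fits in two bytes.
        object[] row = { 1, null, "a", null, null, 5, null, null, null };
        byte[] bitmap = BuildNullBitmap(row);
        Console.WriteLine(bitmap.Length);      // prints 2
        Console.WriteLine(IsNull(bitmap, 1));  // prints True
        Console.WriteLine(IsNull(bitmap, 2));  // prints False
    }
}
```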

Entity Framework

Auto-Compiled LINQ Queries: When you write a LINQ to Entities query today, the Entity Framework walks over the expression tree generated by the C#/Visual Basic compiler and translates (or compiles) that into SQL, as shown in Figure 2.

Figure 2 A LINQ to Entities Query Translated into SQL

Compiling the expression tree into SQL involves some overhead, though, particularly for more complex queries. In previous versions of the Entity Framework, if you wanted to avoid having to pay this performance penalty every time a LINQ query was executed, you had to use the CompiledQuery class.

This new version of the Entity Framework supports a new feature called Auto-Compiled LINQ Queries. Now every LINQ to Entities query that you execute automatically gets compiled and placed in the Entity Framework query plan cache. Each additional time you run the query, the Entity Framework will find it in its query cache and won’t have to go through the whole compilation process again. You can read more about it at bit.ly/iCaM2b.
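The general compile-once, reuse-many-times idea behind a plan cache can be sketched without the Entity Framework at all. The following is a conceptual illustration under stated assumptions, not the Entity Framework’s implementation: it compiles an expression tree to a delegate instead of SQL, keys the cache on the expression’s text and uses a hypothetical PlanCacheSketch class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

static class PlanCacheSketch
{
    static readonly Dictionary<string, Func<int, int>> cache =
        new Dictionary<string, Func<int, int>>();

    // Number of times the expensive compilation step actually ran.
    public static int CompileCount { get; private set; }

    // Returns a cached delegate if an equivalent query was compiled before.
    public static Func<int, int> GetOrCompile(Expression<Func<int, int>> query)
    {
        string key = query.ToString(); // crude cache key, good enough for a sketch
        Func<int, int> compiled;
        if (!cache.TryGetValue(key, out compiled))
        {
            compiled = query.Compile(); // the costly step that is paid only once
            CompileCount++;
            cache[key] = compiled;
        }
        return compiled;
    }

    static void Main()
    {
        Console.WriteLine(GetOrCompile(x => x * 2)(21)); // compiles, prints 42
        Console.WriteLine(GetOrCompile(x => x * 2)(4));  // cache hit, prints 8
        Console.WriteLine(CompileCount);                 // prints 1
    }
}
```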

Windows Communication Foundation and Windows Workflow Foundation

The Windows Communication Foundation (WCF) and Windows Workflow Foundation (WF) team has also made a number of performance improvements in this release, such as:

TCP activation scalability improvements: Customers reported an issue with TCP activation such that when many concurrent users sent requests with constant reconnections, the TCP port sharing service didn’t scale well. This has been fixed in .NET 4.5.
Built-in GZip compression support for WCF HTTP/TCP: With this new compression, we expect up to a 5x compression ratio.
Recycling host when memory usage is high for WCF: When the memory usage is high (configurable knob), we use least recently used (LRU) logic to recycle WCF services.
HTTP async streaming support for WCF: We implemented this feature in .NET 4.5 and achieved the same throughput as that of synchronous streaming but with much better scalability.
Generation 0 fragmentation improvements for WCF TCP.
Optimized BufferManager for WCF for large objects: For large objects, better buffer pooling has been implemented to avoid high generation 2 GC costs.
WF validation improvement with expression caching: We expect up to 3x improvements for a core scenario of loading WF and executing it.
Implemented WCF/WF end-to-end Event Tracing for Windows (ETW): Although this isn’t a performance improvement feature, it helps customers on performance investigations.

You can find more details on the Workflow Team blog at blogs.msdn.com/b/workflowteam and in the MSDN Library article at bit.ly/n5VCtU.

ASP.NET

Improving site density (also defined as “per-site memory consumption”) and cold startup time of sites in the case of shared hosting have been two key performance goals for the ASP.NET team for .NET 4.5.

In shared hosting scenarios, many sites share the same machine. In such environments traffic is usually low. Data provided by some hosting companies shows that most of the time the request rate is below 1 request per second (rps), with occasional peaks of 2 rps or more. This means that many worker processes will probably be shut down after they’ve been idle for a long time (20 minutes by default in IIS 7 and later). Thus startup time becomes very important. In ASP.NET, that’s the time it takes a Web site to receive a request and respond to it when the worker process was down, as opposed to when the Web site was already compiled.

We implemented several features in this release to improve startup time for the shared hosting scenarios. The features used are:

Bin assemblies interning (sharing common assemblies): The ASP.NET shadow copy feature enables assemblies that are used in an application domain to be updated without unloading that AppDomain (necessary because the CLR locks assemblies that are being used). This is done by copying application assemblies to a separate location (either a default CLR-determined location or a user-specified one) and loading the assemblies from that location. This allows the original assembly to be updated while the shadow copy is locked. ASP.NET turns on this feature by default for Bin folder assemblies so that DLLs can continue to be updated while a site is up and running.
ASP.NET recognizes the Bin folder of a Web site as a special folder for compiled assemblies (DLLs) for custom ASP.NET controls, components or other code that needs to be referenced in an ASP.NET application and shared across various pages in the site. A compiled assembly in the Bin folder gets automatically referenced everywhere in the Web application. ASP.NET also detects the latest version of a specific DLL in the Bin folder for use by the Web site. Prepackaged applications intended to be used by ASP.NET sites typically install to the Bin folder rather than the Global Assembly Cache.
The ASP.NET and CLR teams have found that when many sites reside on the same server and use the same application, many of these shadow copy DLLs tend to be exactly the same. As these files are read from disk and loaded into memory, this leads to many redundant loads that increase startup time and memory consumption. To address this, we enabled the CLR to follow symbolic links, and then implemented identification of the common files and interning of them in a special location (to which the symbolic links point). ASP.NET automatically configures shadow copying for Bin DLLs to be on. Shared hosters can now set up their machines according to the ASP.NET guidelines for maximum performance benefit.
Multicore JIT: See related information in the previous “CLR” section. The ASP.NET team uses the multicore JIT feature to improve startup time by spreading the JIT compilation across processor cores. This is enabled by default in ASP.NET, so you can take advantage of this feature without any additional work. You can disable it using the following setting in your web.config file:

<configuration>
  <!-- ... -->
  <system.web>
    <compilation profileGuidedOptimizations="None" />
    <!-- ... -->
  </system.web>
</configuration>

Prefetcher: Prefetcher technology in Windows is very effective in reducing the disk read cost of paging during application startup. Prefetcher is now available (though not enabled by default) on Windows Server as well. To enable Prefetcher for high-density Web hosting, run the following set of commands at the command line:

sc config sysmain start= auto
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters" /v EnablePrefetcher /t REG_DWORD /d 2 /f
reg add "HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Prefetcher" /v MaxPrefetchFiles /t REG_DWORD /d 8192 /f
net start sysmain

You can then update the web.config file to use it in ASP.NET:

<configuration>
  <!-- ... -->
  <system.web>
    <compilation enablePrefetchOptimization="true" />
    <!-- ... -->
  </system.web>
</configuration>

Tuning GC for high-density Web hosting: GC can impact a site’s memory consumption, but it can be tuned to enable better performance. You can tune or configure GC for better CPU performance (that is, less frequent collections) or lower memory consumption (that is, more frequent collections to free up memory sooner). To enable the GC tuning, you can select the HighDensityWebHosting setting in the aspnet.config file in the Windows\Microsoft.NET\Framework\v4.0.30319 folder in order to achieve smaller memory consumption (working set) per site:

<configuration>
  <!-- ... -->
  <runtime>
    <performanceScenario value="HighDensityWebHosting" />
    <!-- ... -->
  </runtime>
</configuration>

More details on ASP.NET performance improvements can be found in the “Getting Started with the Next Version of ASP.NET” white paper at bit.ly/A66I7R.

Feedback Wanted

The list presented here isn’t exhaustive. There are more minor changes to improve performance that were omitted to keep the scope of this article limited to the major features. Apart from this, the .NET Framework performance teams have also been busy working on performance improvements specific to Windows 8 Managed Metro-style applications. Once you download and try the .NET Framework 4.5 and Visual Studio 11 beta for Windows 8, please let us know if you have any feedback or suggestions for the releases ahead.

Glossary of Terms

Shared Hosting: Also known as “shared Web hosting,” high-density Web hosting enables hundreds—if not thousands—of Web sites to run on the same server. By sharing hardware costs, each site can be maintained for a lower cost. This technique has considerably lowered the barrier to entry for Web site owners.

Cold Startup: Cold startup is the time taken to launch when your application wasn’t already present in memory. You can experience cold startup by starting an application after a system reboot. For large applications, cold startup can take several seconds because the required pages (code, static data, registry and so on) aren’t present in memory and expensive disk accesses are required to bring the pages into memory.

Warm Startup: Warm startup is the time taken to start an application that’s already present in memory. For example, if an application was launched a few seconds earlier, it’s likely that most of the pages are already loaded in memory and the OS will reuse them, saving expensive disk-access time. This is why an application is much faster to start up the second time you run it (or why a second .NET app starts faster than the first, because parts of .NET will already be loaded in memory).

Native Image Generation, or NGen: Refers to the process of precompiling Intermediate Language (IL) executables into machine code prior to execution time. This results in two primary performance benefits. First, it reduces application startup time by avoiding the need to compile code at run time. Second, it improves memory usage by allowing for code pages to be shared across multiple processes. There’s also a tool, NGen.exe, that creates native images and installs them into the Native Image Cache (NIC) on the local computer. The runtime loads native images when they’re available.

Profile Guided Optimization: Profile guided optimization has been proven to enhance the startup and execution times of native and managed applications. Windows provides the toolset and infrastructure to perform profile guided optimization for native assemblies, whereas the CLR provides the toolset and infrastructure to perform profile guided optimizations for managed assemblies (called Managed Profile Guided Optimization, or MPGO). These technologies are used by many teams within Microsoft to improve the performance of their applications. For example, the CLR performs profile guided optimization of native assemblies (C++ profile guided optimization) and managed assemblies (using MPGO).

Garbage Collector: The .NET runtime supports automatic memory management. It tracks every memory allocation made by the managed program and periodically calls a garbage collector that finds memory no longer in use and reuses it for new allocations. An important optimization the garbage collector performs is that it doesn’t search the whole heap every time, but rather partitions the heap into three generations (generation 0, generation 1 and generation 2). For more information on the garbage collector, please read the June 2009 CLR Inside Out column at msdn.microsoft.com/magazine/dd882521.

Compacting: In the context of garbage collection, when the heap reaches a state where it’s sufficiently fragmented, the garbage collector compacts the heap by moving live objects close to each other. The primary goal of compacting the heap is to make larger blocks of memory available in which to allocate more objects.

Ashwin Kamath is a program manager on the CLR team of .NET and drove the performance and reliability features for the .NET Framework 4.5. He’s currently working on diagnostics features for the Windows Phone development platform.

Thanks to the following technical experts for reviewing this article: Surupa Biswas, Eric Dettinger, Wenlong Dong, Layla Driscoll, Dave Hiniker, Piyush Joshi, Ashok Kamath, Richard Lander, Vance Morrison, Subramanian Ramaswamy, Jose Reyes, Danny Shih and Bill Wert.