March 2016

Volume 31 Number 3

[Compilers]

Managed Profile-Guided Optimization Using Background JIT

By Hadi Brais | March 2016

Some performance optimizations performed by a compiler are always good. That is, no matter which code actually gets executed at run time, the optimization will improve performance. Consider, for example, loop unrolling to enable vectorization. This optimization transforms a loop so that instead of performing a single operation in the body of the loop on a single set of operands (such as adding two integers stored in different arrays), the same operation is performed on multiple sets of operands simultaneously (adding four pairs of integers).
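To make this concrete, here's a hypothetical C# sketch of the transformation (assuming n is a multiple of 4; a real compiler performs it on the intermediate representation or the generated machine code, not on your source):

static void AddScalar(int[] a, int[] b, int[] c, int n) {
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i]; // One addition per iteration.
}

// Unrolled by four: the four independent additions can be mapped
// to a single SIMD instruction by a vectorizing compiler.
static void AddUnrolled(int[] a, int[] b, int[] c, int n) {
  for (int i = 0; i < n; i += 4) {
    c[i]     = a[i]     + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
    c[i + 2] = a[i + 2] + b[i + 2];
    c[i + 3] = a[i + 3] + b[i + 3];
  }
}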

On the other hand, there are extremely important optimizations that the compiler performs heuristically. That is, the compiler doesn't know for sure that these optimizations will actually benefit the code that gets executed at run time. The two most important optimizations in this category (probably the two most important optimizations, period) are register allocation and function inlining. You can help the compiler make better decisions when performing such optimizations by running the app one or more times, providing it with typical user input while recording which code got executed.

The information that has been collected about the execution of the app is called a profile. The compiler can then use this profile to make some of its optimizations more effective, sometimes resulting in significant speedups. This technique is called profile-guided optimization (PGO). You should use this technique when you've written readable, maintainable code, employed good algorithms, maximized data access locality, minimized contention on locks and turned on all possible compiler optimizations, but still aren't satisfied with the resulting performance. Generally speaking, PGO can be used to improve other characteristics of your code, not just performance. However, the technique discussed in this article can be used only to improve performance.

I discussed native PGO in the Microsoft Visual C++ compiler in detail in a previous article (msdn.com/magazine/mt422584). For those who read that article, I've got some great news: Using managed PGO is simpler. In particular, the feature I'm going to discuss in this article, namely background JIT (also called multicore JIT), is a lot simpler. However, this is an advanced article; the CLR team wrote an introductory blog post three years ago (bit.ly/1ZnIj9y). Background JIT is supported in the Microsoft .NET Framework 4.5 and all later versions.

There are three managed PGO techniques:

  • Compile managed code to binary code using Ngen.exe (a process known as preJIT) and then use Mpgo.exe to generate profiles representing common usage scenarios that can be used to optimize the performance of the binary code. This is similar to native PGO. I’ll refer to this technique as static MPGO.
  • The first time an Intermediate Language (IL) method is about to be JIT compiled, generate instrumented binary code that records information at run time regarding which parts of the method are getting executed. Then later use that in-memory profile to re-JIT compile the IL method to generate highly optimized binary code. This is also similar to native PGO except that everything happens at run time. I’ll refer to this technique as dynamic MPGO.
  • Use background JIT to hide as much as possible the overhead to JIT by intelligently JIT compiling IL methods before they actually get executed for the first time. Ideally, by the time a method is called for the first time, it would’ve already been JIT compiled and there would be no need to wait for the JIT compiler to compile the method.

Interestingly, all of these techniques were introduced in the .NET Framework 4.5 and are supported in all later versions. Static MPGO works only with native images generated by Ngen.exe. In contrast, dynamic MPGO works only with IL methods. Whenever possible, use Ngen.exe to generate native images and optimize them using static MPGO, because this technique is much simpler while giving respectable speedups. The third technique, background JIT, is very different from the first two because it reduces JIT compiling overhead rather than improving the performance of the generated binary code and, therefore, can be used together with either of the other two techniques. However, using background JIT alone can sometimes be very beneficial, improving the performance of app startup or a particular common usage scenario by up to 50 percent, which is great. This article focuses exclusively on background JIT. In the next section, I'll discuss the traditional way of JIT compiling IL methods and how it impacts performance. Then I'll discuss how background JIT works, why it works that way and how to properly use it.

Traditional JIT

You probably already have a basic idea of how the .NET JIT compiler works because there are many articles that discuss this process. However, I'd like to revisit the subject in a little more detail and accuracy (but not much) before I get to background JIT, so that you can easily follow the next section and understand the feature well.

Consider the example shown in Figure 1. T0 is the main thread. The green parts of the thread indicate that the thread is executing app code and it’s running at full speed. Let’s assume that T0 is executing in a method that has already been JIT compiled (the topmost green part) and the next instruction is to call the IL method M0. Because this is the first time M0 will get executed and because it’s represented in IL, it has to be compiled to binary code that the processor can execute. For this reason, when the call instruction is executed, a function known as the JIT IL stub is called. This function eventually calls the JIT compiler to JIT the IL code of M0 and returns the address of the generated binary code. This work has nothing to do with the app itself and is represented by the red part of T0 to indicate that it’s an overhead. Fortunately, the memory location that stores the address of the JIT IL stub will be patched with the address of the corresponding binary code so that future calls to the same function run at full speed.
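Conceptually, this stub-patching mechanism behaves like a lazily initialized function pointer. The following C# sketch is only an analogy (the LazyMethod type is hypothetical; the real patching happens at the machine-code level inside the CLR):

using System;

class LazyMethod {
  Action _entry; // Plays the role of the method's address slot.

  public LazyMethod() {
    _entry = JitStub; // Initially points to the "JIT IL stub."
  }

  void JitStub() {
    Action binaryCode = Compile(); // Expensive: invokes the JIT compiler.
    _entry = binaryCode;           // Patch the slot; later calls skip the stub.
    binaryCode();
  }

  Action Compile() {
    return () => Console.WriteLine("Executing generated binary code");
  }

  public void Call() {
    _entry(); // The first call pays the JIT cost; subsequent calls don't.
  }
}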

Figure 1 The Overhead of Traditional JIT When Executing Managed Code

Now, after returning from M0, some other code that has already been JIT compiled is executed and then the IL method M1 is called. Just like with M0, the JIT IL stub is called, which in turn calls the JIT compiler to compile the method and returns the address of the binary code. After returning from M1, some more binary code is executed and then two more threads, T1 and T2, start running. This is where things get interesting.

After executing methods that have already been JIT compiled, T1 and T2 are going to call the IL method M3, which has never been called before and, therefore, has to be JIT compiled. Internally, the JIT compiler maintains a list of all methods that are being JIT compiled. There's a list for each AppDomain and one for shared code. This list is protected by a lock, and every element is also protected by its own lock, so that multiple threads can safely engage in JIT compilation simultaneously. What happens in this case is that one thread, say T1, JIT compiles the method, spending time on work that has nothing to do with the app, while T2 does nothing but wait on a lock until the binary code of M3 becomes available. At the same time, T0 is compiling M2. When a thread finishes JIT compiling a method, it replaces the address of the JIT IL stub with the address of the binary code, releases the locks and executes the method. Note that T2 will eventually wake up and just execute M3.

The rest of the code that’s executed by these threads is shown in the green bars in Figure 1. This means that the app is running at full speed. Even when a new thread, T3, starts running, all the methods that it needs to execute have already been JIT compiled and, therefore, it also runs at full speed. The resulting performance becomes very close to native code performance.

Roughly speaking, the duration of each of these red segments mainly depends on the amount of time it takes to JIT the method, which in turn depends on how large and complex the method is. It can range from a few microseconds to tens of milliseconds (excluding the time to load any required assemblies or modules). If the startup of an app requires executing fewer than a hundred methods for the first time, it's not a big deal. But if it requires executing hundreds or thousands of methods for the first time, the impact of all the resulting red segments might be significant, especially when the time it takes to JIT a method is comparable to the time it takes to execute it, causing a double-digit percentage slowdown. For example, if an app requires executing a thousand different methods at startup with an average JIT time of 3 milliseconds, it would take 3 seconds to complete startup. That's a big deal. It's not good for business because your customers won't be satisfied.

Note that it's possible for more than one thread to JIT compile the same method at the same time. It's also possible that the first attempt to JIT fails but a second succeeds. Finally, it's possible for a method that has already been JIT compiled to be re-JIT compiled. However, all of these cases are beyond the scope of this article and you don't have to be aware of them when using background JIT.

Background JIT

The JIT compiling overhead discussed in the previous section can't be avoided or significantly reduced; you have to JIT IL methods to execute them. What you can do, however, is change the time at which this overhead is incurred. The key insight is that instead of waiting for an IL method to be called for the first time to JIT it, you can JIT that method earlier, so that by the time it's called, the binary code has already been generated. If you get it right, all threads shown in Figure 1 would be green and they would all run at full speed, as if you were executing an NGEN native image, or maybe better. But before you get there, two problems must be addressed.

The first problem: If you're going to JIT a method before it's needed, which thread will JIT it? It's not hard to see that the best way to solve this problem is to have a dedicated thread that runs in the background and JIT compiles methods that are likely to be executed, as quickly as possible. Of course, this works only if at least two cores are available (which is almost always the case), so that the JIT compiling overhead is hidden by the overlapping execution of the app's code.
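As an aside, you can approximate this idea manually with RuntimeHelpers.PrepareMethod, which forces a method to be JIT compiled. The following sketch (with a hypothetical Hot method) pre-compiles a method on a dedicated thread; background JIT does essentially this automatically, driven by a recorded profile:

using System;
using System.Runtime.CompilerServices;
using System.Threading;

class PreJitter {
  static void Main() {
    // Pre-JIT Hot on a dedicated background thread while Main keeps working.
    var worker = new Thread(() => {
      var hot = typeof(PreJitter).GetMethod("Hot");
      RuntimeHelpers.PrepareMethod(hot.MethodHandle); // Forces JIT compilation.
    });
    worker.IsBackground = true;
    worker.Start();
    // ... other startup work overlaps with the compilation above ...
    worker.Join();
    Hot(); // Already compiled; the first call pays no JIT cost.
  }

  [MethodImpl(MethodImplOptions.NoInlining)]
  public static void Hot() {
    Console.WriteLine("Executing the hot path");
  }
}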

The second problem: How do you know which method to JIT next, before it's called for the first time? Keep in mind that typically there are conditional method calls in every method, so you can't just JIT all methods that might be called, nor can you be too speculative in choosing which methods to JIT next; otherwise, the background JIT thread would very quickly fall behind the app threads. This is where profiles come into play. You first exercise the startup of the app and any common usage scenarios, recording which methods were JIT compiled and the order in which they were compiled for each scenario separately. Then you can publish the app together with the recorded profiles so that when it runs on the user's machine, the JIT compiling overhead is minimized with respect to wall clock time (which is how the user perceives time and performance). This feature is called background JIT and you can use it with very little effort on your side.

In the previous section, you saw how the JIT compiler can compile different methods in parallel on different threads, so technically, traditional JIT is already multicore. It's unfortunate and confusing that the MSDN documentation refers to the feature as multicore JIT, a name based on the requirement of at least two cores rather than on the feature's defining characteristic. I'm using the name "background JIT" because that's the one I'd like to see spread; PerfView has built-in support for the feature and also uses this name. ("Multicore JIT" was the name Microsoft used early in development.) In the rest of this section, I'll discuss what you have to do to apply this technique to your code and how it changes the traditional JIT model. I'll also show you how to use PerfView to measure the benefit of background JIT on your own apps.

To use background JIT, you need to tell the runtime where to put the profiles (one for each scenario that triggers a lot of JIT compilation). You also need to tell the runtime which profile to use so that it can read the profile and determine which methods to compile on the background thread. This, of course, has to be done sufficiently early, before the associated usage scenario starts.

To specify where to put the profiles, call the System.Runtime.ProfileOptimization.SetProfileRoot method defined in mscorlib.dll. This method looks like this:

public static void SetProfileRoot(string directoryPath);

The purpose of its only parameter, directoryPath, is to specify the folder from which all profiles will be read and to which they'll be written. Only the first call to this method in the same AppDomain takes effect; any other calls are ignored (however, the same path can be used by different AppDomains). Also, if the computer doesn't have at least two cores, any call to SetProfileRoot is ignored. The only thing this method does is store the specified directory in an internal variable so that it can be used whenever required later. This method is usually called by the executable (.EXE) of the process during initialization; shared libraries should not call it. You can call this method at any time while the app is running, but before any call to the ProfileOptimization.StartProfile method. This other method looks like this:

public static void StartProfile(string profile);

When the app is about to go through an execution path whose performance you'd like to optimize (such as startup), call this method and pass it the file name and extension of the profile. If the file doesn't exist, a profile is recorded and stored in a file with the specified name in the folder you've specified using SetProfileRoot. This process is called "profile recording." If the specified file exists and contains a valid background JIT profile, background JIT takes effect: a dedicated background thread JIT compiles methods chosen according to what the profile says. This process is called "profile playing." While the profile is being played, the behavior exhibited by the app is still recorded, and the input profile is overwritten with the new recording.

You can't play a profile without recording; that's just not currently supported. You can call StartProfile multiple times, specifying different profiles suitable for different execution paths. This method has no effect if it's called before the profile root has been initialized with SetProfileRoot. Also, neither method has any effect if the specified argument is invalid in any way. In fact, these methods don't throw exceptions or return error codes, so as not to impact the behavior of apps in any undesirable way. Both of them are thread-safe, just like every other static method in the framework.

For example, if you want to improve startup performance, call these two methods as a first step in the main function. If you want to improve the performance of a particular usage scenario, call StartProfile when the user is about to initiate that scenario and call SetProfileRoot any time earlier. Remember that everything happens per AppDomain.
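Here's a minimal sketch of that second pattern (the profile root, the profile file names and the OnOpenDocumentClicked handler are hypothetical):

using System.Runtime;

class App {
  static void Main() {
    // Set the profile root once, as early as possible.
    ProfileOptimization.SetProfileRoot(@"C:\MyApp\Profiles");
    // Optimize the startup path itself.
    ProfileOptimization.StartProfile("Startup.profile");
    // ... startup work ...
  }

  // Called just before a JIT-heavy usage scenario begins.
  static void OnOpenDocumentClicked() {
    ProfileOptimization.StartProfile("OpenDocument.profile");
    // ... open the document ...
  }
}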

That’s all you have to do to use background JIT in your code. It’s so simple that you can just try it without thinking too much about whether it will be useful or not. You can then measure the gained speedup to determine whether it’s worth keeping. If the speedup is at least 15 percent, you should keep it. Otherwise, it’s your call. Now I’ll explain in detail how it works.

Every time StartProfile is called, the following actions are performed in the context of the AppDomain in which the code is currently executing:

  1. All the contents of the file that contains the profile (if it exists) are copied into memory. The file is then closed.
  2. If this isn’t the first time StartProfile has been successfully called, there would already be a background JIT thread running. In this case, it’s terminated and a new background thread is created. Then the thread that called StartProfile returns to the caller.
  3. This step happens on the background JIT thread. The profile is parsed and the recorded methods are JIT compiled sequentially, in the order in which they were recorded, as fast as possible. This step constitutes the profile playing process.

That's it as far as the background thread is concerned. If it finishes JIT compiling all the recorded methods, it terminates silently. If anything goes wrong while parsing the profile or JIT compiling the methods, the thread also terminates silently. If an assembly or module that's required to JIT a method hasn't been loaded, it won't be loaded, and so the method won't be JIT compiled. Background JIT has been designed to change the behavior of the program as little as possible: When a module is loaded, its constructor is executed, and when a module can't be found, callbacks registered with the System.Reflection.Assembly.ModuleResolve event are called. If the background thread loaded a module earlier than it otherwise would be loaded, the behavior of these functions might change. The same applies to callbacks registered with the System.AppDomain.AssemblyLoad event. That's why the background JIT thread doesn't load the modules it needs; as a result, it might not be able to compile many of the recorded methods, leading to modest benefit.

You might be wondering: Why not create more than one background thread to JIT more methods? Well, first, these threads are compute-intensive, so they might compete with app threads. Second, more of these threads mean more thread synchronization contention. Third, it's quite possible that methods get JIT compiled but never called by any app thread. Conversely, a method might be called for the first time that isn't recorded in the profile at all, or before the background thread gets around to it. Due to these issues, having more than one background thread might not be very beneficial. However, the CLR team might do this in the future (especially if the restriction on loading modules can be relaxed). Now it's time to discuss what happens in the app threads, including the profile recording process.

Figure 2 shows the same example as Figure 1 except that background JIT is enabled. That is, there's a background thread JIT compiling the methods M0, M1, M3 and M2, in that order. Notice how this background thread is racing against the app threads T0, T1, T2 and T3. The background thread has to JIT every method before it's called for the first time by any thread to fulfill its purpose in life. The following discussion assumes that this is the case with M0, M1 and M3, but not quite with M2.

Figure 2 An Example Showing the Background JIT Optimization as Compared to Figure 1

When T0 is about to call M0, the background JIT thread has already JIT compiled it. However, the method address hasn't been patched yet and still points to the JIT IL stub. The background JIT thread could've patched it, but it doesn't, so that it can be determined later whether the method was actually called; the CLR team uses this information to evaluate background JIT. So the JIT IL stub gets called and it sees that the method has already been compiled on the background thread. The only thing it has to do is patch the address and execute the method. Notice how the JIT compiling overhead has been completely eliminated on this thread. M1 receives the same treatment when called on T0, and so does M3 when called on T1. But when T2 calls M3 (contrast this with Figure 1), the method address has already been patched by T1, so T2 directly calls the actual binary code of the method. Then T0 calls M2. However, the background JIT thread hasn't yet finished JIT compiling that method and, therefore, T0 waits on the method's JIT lock. When the method is JIT compiled, T0 wakes up and calls it.

I haven't yet discussed how methods get recorded in the profile. It's also quite possible that an app thread calls a method that the background JIT thread hasn't even begun JIT compiling (or will never JIT because it's not in the profile). The following algorithm describes the steps performed on an app thread when it calls a static or dynamic IL method that hasn't been JIT compiled yet (a simplified sketch of the locking pattern follows the list):

  1. Acquire the JIT list lock of the AppDomain in which the method exists.
  2. If the binary code has already been generated by some other app thread, release the JIT list lock and go to step 13.
  3. Add a new element to the list representing the JIT worker of the method if it doesn’t exist. If it already exists, its reference count is incremented.
  4. Release the JIT list lock.
  5. Acquire the JIT lock of the method.
  6. If the binary code has already been generated by some other app thread, go to step 11.
  7. If the method isn't supported by background JIT, skip this step. Currently, background JIT supports only statically emitted IL methods that are defined in assemblies that haven't been loaded with System.Reflection.Assembly.Load. If the method is supported, check whether it has already been JIT compiled by the background JIT thread. If so, record the method and go to step 9. Otherwise, go to the next step.
  8. JIT the method. The JIT compiler examines the IL of the method, determines all required types, makes sure that all required assemblies are loaded and all required type objects are created. If anything goes wrong, an exception is thrown. This step incurs most of the overhead.
  9. Replace the address of the JIT IL stub with the address of the actual binary code of the method.
  10. If the method was JIT compiled by an app thread rather than by the background JIT thread, there's an active background JIT recorder, and the method is supported by background JIT, then the method is recorded in an in-memory profile. The order in which methods were JIT compiled is maintained in the profile. Note that the generated binary code is not recorded.
  11. Release the method JIT lock.
  12. Safely decrease the reference count of the method using the list lock. If it becomes zero, the element is removed.
  13. Execute the method.
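The essence of these steps is double-checked locking: a global list lock held only briefly, a per-method lock held during compilation and reference counting to clean up list elements. Here's a simplified C# sketch of that pattern (the JitWorker and JitList types are hypothetical and omit details such as step 2's early-out and error handling; the real bookkeeping lives inside the CLR):

using System;
using System.Collections.Generic;

class JitWorker {
  public readonly object Lock = new object();
  public int RefCount;
  public IntPtr BinaryCode; // IntPtr.Zero until compiled.
}

class JitList {
  readonly object _listLock = new object();
  readonly Dictionary<string, JitWorker> _workers =
    new Dictionary<string, JitWorker>();

  public IntPtr GetOrCompile(string method, Func<IntPtr> compile) {
    JitWorker worker;
    lock (_listLock) {                               // Step 1.
      if (!_workers.TryGetValue(method, out worker))
        _workers[method] = worker = new JitWorker(); // Step 3.
      worker.RefCount++;                             // Step 3.
    }                                                // Step 4: list lock released.
    lock (worker.Lock) {                             // Step 5.
      if (worker.BinaryCode == IntPtr.Zero)          // Step 6.
        worker.BinaryCode = compile();               // Step 8: the expensive part.
    }                                                // Step 11: method lock released.
    lock (_listLock) {                               // Step 12.
      if (--worker.RefCount == 0)
        _workers.Remove(method);
    }
    return worker.BinaryCode;                        // Step 13: execute the method.
  }
}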

The background JIT recording process terminates when any of the following situations occur:

  • The AppDomain associated with the background JIT manager is unloaded for any reason.
  • StartProfile is called again in the same AppDomain.
  • The rate at which methods are JIT compiled in app threads becomes very small. This indicates that the app has reached a stable state where it rarely requires JIT compiling. Any methods that get JIT compiled after this point are not of interest to background JIT.
  • One of the recording limits has been reached. The maximum number of modules is 512, the maximum number of methods is 16,384 and the longest continuous duration of recording is one minute.

When the recording process terminates, the recorded in-memory profile is dumped to the specified file. In this way, the next time the app runs, it picks up the profile that reflects the behavior exhibited by the app during its last run. As I’ve mentioned before, profiles are always overwritten. If you want to retain the current profile, you must manually make a copy of it before calling StartProfile. The size of a profile typically doesn’t exceed a few dozen kilobytes.
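For example, here's a minimal sketch of preserving the previous profile (the paths and file names are hypothetical; it assumes using System.IO and using System.Runtime):

string root = @"C:\MyApp\Profiles";
string profile = Path.Combine(root, "Startup.profile");
if (File.Exists(profile))
  File.Copy(profile, profile + ".bak", true); // Keep the previous run's profile.
ProfileOptimization.SetProfileRoot(root);
ProfileOptimization.StartProfile("Startup.profile");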

Before closing this section, I'd like to talk about selecting profile roots. For client apps, you can specify either a user-specific directory or an app-relative directory, depending on whether you want different sets of profiles for different users or just one set of profiles for all users. For ASP.NET and Silverlight apps, you'll probably be using an app-relative directory. In fact, starting with ASP.NET 4.5 and Silverlight 4.5, background JIT is enabled by default and the profiles are stored next to the app. The runtime behaves as if you'd called SetProfileRoot and StartProfile in the main method, so you don't have to do anything to use the feature. You can still call StartProfile, though, as described earlier. You can turn off automatic background JIT by setting the profileGuidedOptimizations flag to None in the Web configuration file, as described in the .NET Blog post, "An Easy Solution for Improving App Launch Performance" (bit.ly/1ZnIj9y). This flag can take only one other value, namely All, which enables background JIT (the default).
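Based on that post, the flag is an attribute on the compilation element of Web.config; a sketch:

<configuration>
  <system.web>
    <!-- "All" (the default) enables automatic background JIT; "None" disables it. -->
    <compilation profileGuidedOptimizations="None" />
  </system.web>
</configuration>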

Background JIT in Action

Background JIT is an Event Tracing for Windows (ETW) provider. That is, it reports a number of events that are related to this feature to ETW consumers such as the Windows Performance Recorder and PerfView. These events enable you to diagnose any inefficiencies or failures that occurred in background JIT. In particular, you can determine how many methods were compiled on the background thread and the total JIT time of those methods. You can download PerfView from bit.ly/1PpJUpv (no installation required, just unzip and run). I’ll use the following simple code for demonstration:

using System;
using System.Runtime;
using System.Threading;

class Program {
  const int OneSecond = 1000;
  static void PrintHelloWorld() {
    Console.WriteLine("Hello, World!");
  }
  static void Main() {
    // Set up background JIT before doing anything else.
    ProfileOptimization.SetProfileRoot(@"C:\Users\Hadi\Desktop");
    ProfileOptimization.StartProfile("HelloWorld Profile");
    // Give the background JIT thread a head start (see below).
    Thread.Sleep(OneSecond);
    PrintHelloWorld();
  }
}

In the main function, SetProfileRoot and StartProfile are called to set up background JIT. The thread is then put to sleep for one second, after which a method, PrintHelloWorld, is called. This method simply calls Console.WriteLine and returns. Compile this code to an IL executable. Note that Console.WriteLine doesn't require JIT compiling because it has already been compiled using NGEN while installing the .NET Framework on your computer.

Use PerfView to launch and profile the executable (for more information on how to do that, refer to the .NET Blog post, “Improving Your App’s Performance with PerfView,” at bit.ly/1nabIYC, or the Channel 9 PerfView Tutorial at bit.ly/23fwp6r). Remember to check the Background JIT checkbox (required only in .NET Framework 4.5 and 4.5.1) to enable capturing events from this feature. Wait until PerfView finishes and then open the JITStats page (see Figure 3); PerfView will tell you that the process doesn’t use background JIT compilation. That’s because in the first run, a profile has to be generated.

Figure 3 The Location of JITStats in PerfView

So now that you've generated a background JIT profile, use PerfView to launch and profile the executable. This time, however, when you open the JITStats page, you'll see that one method, namely PrintHelloWorld, was JIT compiled on the background JIT thread and one method, namely Main, wasn't. It'll also tell you that about 92 percent of the JIT time spent compiling all IL methods occurred in app threads. The PerfView report will also show a list of all methods that were JIT compiled, the IL and binary size of each method, who JIT compiled the method and other information. You can also easily access the full set of information about the background JIT events. However, due to lack of space, I won't go into the details.

You might be wondering about the purpose of sleeping for about one second. This is necessary to have PrintHelloWorld JIT compiled on the background thread. Otherwise, it’s likely that the app thread will start compiling the method before the background thread. In other words, you have to call StartProfile early enough so that the background thread can stay ahead most of the time.

Wrapping Up

Background JIT is a profile-guided optimization supported in the .NET Framework 4.5 and later. This article discussed almost everything you need to know about the feature: why the optimization is needed, how it works and how to properly use it in your code. Use this feature when NGEN isn't convenient or possible. Because it's easy to use, you can just try it without thinking too much about whether it will benefit your app. If you're happy with the gained speedup, keep it; otherwise, you can easily remove it. Microsoft used background JIT to improve the startup performance of some of its apps. I hope you can effectively use it in your apps, as well, to achieve significant speedups in app startup and other JIT-intensive usage scenarios.


Hadi Brais is a doctorate scholar at the Indian Institute of Technology Delhi, researching compiler optimizations for the next-generation memory technology. He spends most of his time writing code in C/C++/C# and digging deep into runtimes, compiler frameworks and computer architectures. He blogs at hadibrais.wordpress.com. Reach him at hadi.b@live.com.

Thanks to the following Microsoft technical expert for reviewing this article: Vance Morrison