September 2015

Volume 30 Number 9

Compiler Optimizations - Streamline Code with Native Profile-Guided Optimization

By Hadi Brais | September 2015

The optimizations discussed in the first two articles in this series are essential to your application’s performance. Sometimes, though, the compiler makes optimization decisions that don’t actually improve the code’s performance or, even worse, degrade it.

This article covers an important technique called profile-guided optimization (PGO) that can help the compiler back end optimize code more efficiently. Experimental results show performance improvements from 5 percent to 35 percent. Also, when used carefully, this technique will never degrade your code’s performance.

This article builds on the first two parts (msdn.microsoft.com/magazine/dn904673 and msdn.microsoft.com/magazine/dn973015). If you’re new to the concept of PGO, I recommend you first read the Visual C++ Team Blog post at bit.ly/1fJn1DI.

Introducing PGO

One of the most important optimizations a compiler performs is function inlining. By default, the Visual C++ compiler inlines a function as long as the caller doesn’t grow too large. Many function calls get expanded this way, but expansion pays off only when the call happens frequently. Otherwise, it just increases the size of the code, wasting space in the instruction and unified caches and enlarging the app’s working set. How would the compiler know whether a call happens frequently? That may ultimately depend on the arguments passed to the function.

Most optimizations lack the reliable heuristics needed to make good decisions. I’ve seen many cases of bad register allocation that resulted in significant performance degradation. When compiling the code, all you can do is hope that the improvements and degradations from all optimizations add up to a net speed gain. That’s almost always the case, but it can produce an excessively large executable.

It would be nice to eliminate such degradations. If you could tell the compiler how the code is going to behave at run time, it could better optimize the code. The process of recording information on program behavior at run time is called profiling and the generated information is called a profile. You can provide one or more profiles to the compiler, which will use them to guide its optimizations. This is what the PGO technique entails.

You can use this technique on both native and managed code. However, the tools are different and so here I’ll discuss only native PGO and leave managed PGO for another article. The rest of this section discusses how to apply PGO to an app.

PGO is a great technique. Like everything else, though, it has a downside: it takes some time (depending on the app’s size) and effort. Fortunately, as you’ll see later, Microsoft provides tools that can substantially reduce the time required to apply PGO to an app. Applying PGO to an app involves three phases: the instrumentation build, training and the PGO build.

The Instrumentation Build

There are several ways to profile a running program. The Visual C++ compiler uses static binary instrumentation, which generates the most accurate profiles, but takes more time. Using instrumentation, the compiler inserts a small number of machine instructions at locations of interest in all functions of your code, as shown in Figure 1. These instructions record when the associated part of your code has been executed and include this information in the generated profile.

Figure 1 The Instrumentation Build of a Profile-Guided Optimization App

There are several steps to building an instrumented version of an app. First, you have to compile all source code files with the /GL switch to enable Whole Program Optimization (WPO). WPO isn’t technically required to instrument the program, but it helps make the generated profile much more useful. Only those files that have been compiled with /GL will be instrumented.

For the next phase to go as smoothly as possible, avoid using any compiler switches that result in extra code. For example, disable function inlining (/Ob0), disable security checks (/GS-) and remove runtime checks (no /RTC). This means you shouldn’t use the default Release and Debug configurations of Visual Studio. Optimize files not compiled with /GL for speed (/O2); for instrumented code, specify at least /Og.

Next, link the generated object files and the required static libraries with the /LTCG:PGI switch. This makes the linker perform three tasks. First, it tells the compiler back end to instrument the code. Second, it generates a PGO database (PGD) file, which will be used in the third phase to store all profiles. At this point, the PGD file doesn’t contain any profiles; it only holds information identifying the object files, so that when the PGD file is later used, the linker can detect whether they have changed. By default, the PGD file takes the name of the executable, but you can also specify a name using the optional /PGD linker switch. Third, the linker links the pgort.lib import library; the output executable depends on the PGO runtime DLL pgortXXX.dll, where XXX is the version of Visual Studio.
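As a rough sketch of this phase, the commands might look like the following. The file names here are hypothetical, and the exact set of switches depends on your project:

```shell
rem Phase 1 sketch: compile with WPO and without switches that add extra
rem code, then link with /LTCG:PGI to produce an instrumented app.exe
rem plus an empty app.pgd.
cl /c /GL /Og /Ob0 /GS- app.cpp
link /LTCG:PGI /PGD:app.pgd app.obj
```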

The end result of this phase is an executable (EXE or DLL) file bloated with instrumentation code, plus an empty PGD file to be filled and used in the third phase. A static library can be instrumented only if it’s linked into a project that’s being instrumented. Also, the same version of the compiler must produce all the CIL OBJ files; otherwise, the linker will emit an error.

Profiling Probes

Before moving on to the next phase, I’d like to discuss the code the compiler inserts to profile the code. This lets you estimate the overhead added to your program and understand the information being collected at run time.

To record a profile, the compiler inserts a number of probes in every function compiled with /GL. A probe is a small sequence of instructions (two to four instructions) consisting of a number of push instructions and one call instruction to a probe handler at the end. When necessary, a probe is wrapped by two function calls to save and restore all XMM registers. There are three types of probes:

  • Count probes: This is the most common type of probe. It counts the number of times a block of code executes by incrementing a counter every time it’s executed. It’s also the cheapest in terms of size and speed. Each counter is 8 bytes in size on x64 and 4 bytes on x86.
  • Entry probes: The compiler inserts an entry probe at the beginning of every function. The purpose of this probe is to tell the other probes in the same function to use the counters associated with that function. This is needed because probe handlers are shared between probes across functions. The entry probe of the main function initializes the PGO runtime. An entry probe is also a count probe. This is the slowest probe.
  • Value probes: These are inserted before every virtual function call and switch statement and are used to record a histogram of values. A value probe is also a count probe because it counts the number of times each value appears. This is the largest probe in size.
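To make this bookkeeping concrete, here’s a minimal C++ sketch of what entry and value probes conceptually record. The structures are hypothetical, for illustration only; they aren’t the PGO runtime’s actual data layout:

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for the per-function data PGO probes maintain.
struct FunctionProfile {
    uint64_t entry_count = 0;               // entry probe: times the function ran
    std::map<int, uint64_t> switch_values;  // value probe: histogram of observed values
};

// What an entry probe conceptually does: bump the function's counter.
inline void entry_probe(FunctionProfile& p) { ++p.entry_count; }

// What a value probe conceptually does: histogram the value. Because it
// counts how many times each value appears, it's also a count probe.
inline void value_probe(FunctionProfile& p, int value) { ++p.switch_values[value]; }
```

A value probe placed before a switch statement would feed the switch’s controlling variable into `value_probe`, giving the compiler the distribution of cases it needs in the PGO build.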

A function won’t be instrumented by any probe if it has only one basic block (a sequence of instructions with one entry and one exit). In fact, it might be inlined in spite of the /Ob0 switch. Besides the value probe, every switch statement causes the compiler to create a constant COMDAT section that describes it. The size of this section is approximately the number of cases multiplied by the size of the variable controlling the switch.

Every probe ends with a call to the probe handler. The entry probe of the main function creates a vector of probe-handler pointers (each 8 bytes on x64 and 4 bytes on x86), where each entry points to a distinct probe handler. In most cases, there will be only a few probe handlers. Probes are inserted in each function at the following locations:

  • An entry probe at the entry of the function
  • A count probe in every basic block ending with a call or ret instruction
  • A value probe just before every switch statement
  • A value probe just before every virtual function call

So the amount of memory overhead of the instrumented program is determined by the number of probes, the number of cases in all switch statements, the number of switch statements and the number of virtual function calls.

All probe handlers at some point advance a counter by one to record an execution of the corresponding block of code. The compiler uses the ADD instruction to increment the low 4 bytes of the counter by one and, on x64, the ADC instruction to add the carry to the high 4 bytes. These instructions aren’t thread-safe, which means that by default all probes aren’t thread-safe. If at least one of the functions might be executed by more than one thread at the same time, the results won’t be reliable. In this case, you can use the /pogosafemode linker switch, which causes the compiler to prefix these instructions with LOCK, making all probes thread-safe. Of course, this also makes them slower. Unfortunately, you can’t apply this feature selectively.
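The thread-safety issue can be illustrated in portable C++. A plain increment is a non-atomic read-modify-write that can lose updates under contention, while an atomic increment (roughly what a LOCK-prefixed ADD achieves on x86/x64) cannot. This is an analogy, not the PGO runtime’s actual code:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Like the default ADD/ADC probe increment: deliberately racy, not thread-safe.
uint64_t plain_counter = 0;
// Like a LOCK-prefixed increment under /pogosafemode: thread-safe but slower.
std::atomic<uint64_t> safe_counter{0};

void run_probes(int iterations) {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([iterations] {
            for (int i = 0; i < iterations; ++i) {
                ++plain_counter;  // increments can be lost under contention
                safe_counter.fetch_add(1, std::memory_order_relaxed);  // never lost
            }
        });
    for (auto& th : threads) th.join();
}
```

After four threads each run 100,000 iterations, `safe_counter` is exactly 400,000, while `plain_counter` may fall short. That shortfall is exactly the unreliability the article describes for multithreaded training runs without /pogosafemode.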

If your application consists of more than one project whose output is either an EXE or a DLL file for PGO, you must repeat the process for each.

The Training Phase

After that first phase, you have an instrumented version of the executable and a PGD file. The second phase is training, in which the executable will generate one or more profiles to store in a separate PGO Count (PGC) file. You’ll use these files in the third phase to optimize the code.

This is the most important phase because profile accuracy is crucial to the success of the whole process. For a profile to be useful, it has to reflect a common usage scenario of the program. The compiler will optimize the program assuming the exercised scenarios are common. If they’re not, your program might perform worse in the field. A profile generated from a common-usage scenario helps the compiler know the hot paths to optimize for speed and the cold paths to optimize for size, as shown in Figure 2.

Figure 2 The Training Phase of Creating a PGO App

The complexity of this phase depends on the number of usage scenarios and the nature of the program. Training is easy if the program doesn’t require any user input. If there are many usage scenarios, sequentially generating a profile for each scenario might not be the quickest way.

In the complex training scenario shown in Figure 2, pgosweep.exe is a command-line tool that lets you control the contents of the profile maintained by the PGO runtime when it runs. You can spawn more than one instance of the program and concurrently apply usage scenarios.

Imagine you have two instances running in processes X and Y. When a scenario is about to start on process X, call pgosweep and pass it the process ID and the /onlyzero switch. This causes the PGO runtime to clear the in-memory profile for that process only; without a process ID, the whole in-memory profile would be cleared. Then the scenario can start. You can initiate usage scenario two on process Y in a similar fashion.

The PGC file will be generated only when all running instances of the program terminate. However, if the program has a long startup time and you don’t want to run it for every scenario, you can force the runtime to generate a profile and clear the in-memory profile to prepare it for another scenario in the same run. Do this by running pgosweep.exe and passing the process ID, the executable file name and the PGC file name.
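Following the article’s description, a training session on a long-running instance might look like the sketch below. The process ID and file names are hypothetical, and the exact switch syntax may differ from your version of the tool, so check pgosweep’s own help output:

```shell
rem Clear the in-memory profile of the instance with PID 1234 before
rem its scenario starts:
pgosweep /onlyzero /pid:1234
rem ...exercise usage scenario one against that instance...
rem Flush its counts to a PGC file without terminating the process:
pgosweep /pid:1234 app.exe app-scenario1.pgc
```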

By default, the PGC file will be generated in the same directory as the executable. You can change this with the VCPROFILE_PATH environment variable, which has to be set before you run the first instance of the program.

I’ve discussed the data and instruction overhead of instrumenting code. In most cases, this overhead can be accommodated. By default, the memory consumption of the PGO runtime won’t exceed a certain threshold; if it turns out to require more memory, an error will occur. In this case, you can use the VCPROFILE_ALLOC_SCALE environment variable to raise that threshold.

The PGO Build

Once you’ve exercised all common usage scenarios, you’ll have a set of PGC files you can use to build the optimized version of the program. You can discard any PGC file you don’t want to use.

The first step in building the PGO version is to merge all the PGC files with a command-line tool called pgomgr.exe (you can also use it to edit a PGD file). To merge the PGC files into the PGD file generated in the first phase, run pgomgr, passing the /merge switch and the PGD file name. This merges all PGC files in the current directory whose names match the name of the specified PGD file followed by !# and a number. The compiler and linker can then use the resulting PGD file to optimize the code.

You can give a more common or important usage scenario extra weight with the pgomgr tool. To do this, pass the corresponding PGC file name and the /merge:n switch, where n is a positive integer indicating the number of copies of the PGC file to include in the PGD file. By default, n is one. This multiplicity biases the optimizations in that profile’s favor.
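The merge step might look like this (file names are hypothetical):

```shell
rem Merge every matching PGC file in the current directory into app.pgd:
pgomgr /merge app.pgd
rem Or weight one scenario's profile three times as heavily as the others:
pgomgr /merge:3 app-scenario1.pgc app.pgd
```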

The second step is to run the linker, passing the same set of object files as in phase one. This time, use the /LTCG:PGO switch. The linker looks for a PGD file with the same name as the executable in the current directory, verifies that the CIL OBJ files haven’t changed since the PGD file was generated in phase one, and then passes it to the compiler to optimize the code. This process is shown in Figure 3. You can use the /PGD linker switch to explicitly specify a PGD file. Don’t forget to enable function inlining for this phase.
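Phase three then reuses the phase-one object files (again with hypothetical names):

```shell
rem Relink the same objects, consuming the merged profile in app.pgd:
link /LTCG:PGO /PGD:app.pgd app.obj
```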

Figure 3 The PGO Build in Phase Three

Most of the compiler and linker optimizations will become profile-guided. The end result of this phase is a highly optimized executable in terms of size and speed. It’s a good idea now to measure your performance gains.

Maintain the Code Base

If you make any changes to any input files passed to the linker with the /LTCG:PGI switch, the linker will refuse to use the PGD file when /LTCG:PGO is specified. That’s because such changes can significantly affect the usefulness of the PGD file.

One option is to repeat the three phases discussed previously to generate another compatible PGD. However, if the changes were tiny (such as adding a small number of functions, calling a function a little less or a little more frequently, or maybe adding a feature that isn’t commonly used) then it isn’t practical to repeat the whole process. In this case, you can use the /LTCG:PGU switch instead of the /LTCG:PGO switch. This tells the linker to skip compatibility checks over the PGD file.

These tiny changes will accumulate over time. You’ll eventually reach a point where it’s beneficial to instrument the app again. You can determine when you’ve reached this point by looking at the compiler output when PGO building the code. It will tell you how much of the code base the PGD covers. If profile coverage drops to less than 80 percent (as shown in Figure 4), it’s a good idea to instrument the code again. However, this percentage is highly dependent on the nature of the application.

PGO in Action

PGO guides optimizations employed by the compiler and the linker. I’ll use the NBody simulator to demonstrate some of its benefits. You can download this application from bit.ly/1gpEaCY. You’ll also have to download and install the DirectX SDK at bit.ly/1LQnKge to compile the application.

First, I’ll compile the application in Release mode to compare it to the PGO version. To build the PGO version of the app, you can use the Profile-Guided Optimization menu item of the Build menu of Visual Studio.

You should also enable assembler output using the /FA[c] compiler switch (don’t use /FA[c]s for this demo). For this simple app, it’s sufficient to train the instrumented app once to generate one PGC file and use it to optimize the app. You’ll then have two executables: one blindly optimized and one PGO-optimized. Make sure you can access the final PGD file because you’ll need it later.

Now if you run both executables one after the other and compare the attained GFLOP counts, you’ll notice they achieve similar performance. Apparently, applying PGO to the app was a waste of time. Upon further investigation, it turns out the size of the app has been reduced from 531KB (for the blindly optimized app) to 472KB (for the PGO-based app), or about 11 percent. So applying PGO to this app reduced its size while maintaining the same performance. How did this happen?

Consider the 200-line function DXUTParseCommandLine from the DXUT/Core/DXUT.CPP file. By looking at the generated assembly code of the Release build, you can see the size of the binary code is approximately 2700 bytes. On the other hand, the size of the binary code in the PGO build is no more than 1650 bytes. You can see the reason for this difference from the assembly instruction that checks the condition of the following loop:

for( int iArg = iArgStart; iArg < nNumArgs; iArg++ ) { ... }

The blindly optimized build produced the following code:

0x044 jge block1
; Fall-through code executed when iArg < nNumArgs
; Lots of code in between
0x362 block1:
; iArg >= nNumArgs
; Lots of other code

On the other hand, the PGO build produced the following code:

0x043 jl   block1
; taken 0(0%), not-taken 1(100%)
block2:
; Fall-through code executed when iArg >= nNumArgs
0x05f ret  0
; Scenario dead code below
0x000 block1:
; Lots of other code executed when iArg < nNumArgs

Many users prefer to specify parameters in the GUI instead of passing them on the command line. So the common scenario here, as indicated by the profile information, is that the loop never iterates. Without a profile, it’s impossible for the compiler to know that, so it goes ahead and aggressively optimizes the code within the loop, expanding many functions and causing pointless code bloat. In the PGO build, you provided the compiler a profile saying the loop never executed. From this, the compiler understood there’s no point inlining any functions called from the body of the loop.

You can see another interesting difference from the assembly code snippets. In the blindly optimized executable, the branch that rarely executes is in the fall-through path of the conditional instruction. The branch that’s almost always executed is located 800 bytes away from the conditional instruction. This not only causes the processor branch predictor to fail, but also guarantees an instruction cache miss.

The PGO build avoided both of these problems by swapping the locations of the two branches. In fact, it moved the branch that rarely executes to a separate section in the executable, thereby improving working-set locality. This optimization is called dead code separation. It would have been impossible to perform without a profile. In frequently executed functions, even tiny differences in binary code can make a significant performance difference.

When building the PGO code, the compiler reports how many of the instrumented functions have been compiled for speed; it also shows this in the Visual Studio Output window. Typically, no more than 10 percent of the functions will be compiled for speed (think aggressive inlining), while the rest are compiled for size (think partial or no inlining).

Consider a slightly more interesting function, DXUTStaticWndProc, which is defined in the same file. The function’s control structure looks like this:

if (condition1) { /* Lots of code */ }
if (condition2) { /* Lots of code */ }
if (condition3) { /* Lots of code */ }
switch (variable) { /* Many cases with lots of code in each */ }
if-else statement
return

The blindly optimized code emits each code block in the same order as in the source code. However, the code in the PGO build has been cleverly rearranged based on the execution frequency of each block and the time at which each block was executed. The first two conditions were rarely executed, so to improve cache and memory utilization, the corresponding code blocks are now in a separate section. Also, the functions recognized as being on the hot path (such as DXUTIsWindowed) are now inlined:

if (condition1) { goto dead-code-section }
if (condition2) { goto dead-code-section }
if (condition3) { /* Lots of code */ }
{/* Frequently executed cases pulled outside the switch statement */}
if-else statement
return
switch(variable) { /* The rest of cases */ }

Most optimizations benefit from a reliable profile, and others become possible to perform only with one. Even if PGO doesn’t result in a noticeable performance improvement, it will certainly reduce the size of the generated executables and their overhead on the memory system.

PGO Databases

The benefits of the PGD profile go beyond guiding the compiler’s optimizations. While pgomgr.exe lets you merge multiple PGC files, it also serves another purpose: it offers three switches that let you view the contents of a PGD file to gain a full understanding of your code’s behavior with respect to the exercised scenarios. The first switch, /summary, tells the tool to emit a textual summary of the PGD file’s contents. The second, /detail, used in conjunction with the first, emits a detailed textual description of the profile. The last, /unique, tells the tool to undecorate function names (particularly useful for C++ code bases).
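For example, inspecting a PGD file (hypothetical file name) might look like this:

```shell
rem Print a summary of the PGD file's contents:
pgomgr /summary app.pgd
rem Add /detail for per-probe data and /unique to undecorate C++ names:
pgomgr /summary /detail /unique app.pgd
```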

Programmatic Control

There’s one last feature worth mentioning. The pgobootrun.h header file declares one function called PgoAutoSweep. You can call this function to programmatically generate a PGC file and clear the in-memory profile to prepare for the next PGC file. The function takes one argument of type char* referring to the name of the PGC file. To use this function, you have to link the pgobootrun.lib static library. Currently, this is the only programmatic support related to PGO.
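A sketch of how such a call might be placed is shown below. Building it for real requires MSVC with pgobootrun.h and pgobootrun.lib; here a no-op stub (and the hypothetical HAVE_PGOBOOTRUN macro) stands in so the shape of the code can be shown and compiled portably:

```cpp
#include <string>

// HAVE_PGOBOOTRUN is a hypothetical macro you'd define when building with
// the real library; the stub below merely records how often it was called.
#if defined(_MSC_VER) && defined(HAVE_PGOBOOTRUN)
#include <pgobootrun.h>
#else
static int sweeps_taken = 0;  // stub bookkeeping, for illustration only
static void PgoAutoSweep(const char* /*pgcName*/) { ++sweeps_taken; }
#endif

// Emit one PGC file per usage scenario from inside the program itself,
// clearing the in-memory profile before the next scenario runs.
void run_scenario(const std::string& name) {
    // ... exercise the scenario here ...
    PgoAutoSweep((name + ".pgc").c_str());
}
```

Calling `run_scenario` once per scenario would produce a separate, descriptively named PGC file for each, ready for the pgomgr merge step.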

Wrapping Up

PGO is an optimization technique that helps the compiler and linker make better optimization decisions by referring to a reliable profile whenever a trade-off of size versus speed requires resolution. Visual Studio provides visual access to this technique through the Build menu or the context menu of the project.

However, you can get a richer set of features using the PGO plug-in, which you can download from bit.ly/1Ntg4Be. It’s also well documented at bit.ly/1RLjPDi. Recall the coverage threshold from Figure 4; the easiest way to get that number is with the plug-in, as discussed in the documentation. If you prefer using the command-line tools, you can refer to the article at bit.ly/1QYT5nO for plenty of examples. If you have a native code base, now might be a good time to try this technique. When you do, feel free to let me know how it affected the size and speed of your application.

Figure 4 PGO Code Base Maintenance Cycle

Additional Resources

For more information on profile-guided optimization databases, check out the blog post from Hadi Brais.


Hadi Brais is a Ph.D. scholar at the Indian Institute of Technology Delhi (IITD), researching compiler optimizations for the next-generation memory technology. He spends most of his time writing code in C/C++/C# and digging deep into runtimes and compiler frameworks. He blogs at hadibrais.wordpress.com. Reach him at hadi.b@live.com.

Thanks to the following Microsoft technical expert for reviewing this article: Ankit Asthana