
Profile-Guided Optimization with Microsoft Visual C++ 2005


Kang Su Gatlin
Microsoft Corporation

March 2004

Applies to:
   Microsoft® Visual C++® 2005

Summary: Discusses profile-guided optimization in Microsoft Visual C++ 2005 (formerly known as Visual C++ "Whidbey"), a powerful new feature that will allow applications to be tuned for actual customer scenarios. Real-world performance gains of over 20% are not uncommon. (13 printed pages)

Contents

Introduction
How Traditional C++ Compilers Work
Whole Program Optimization with Link-Time Code Generation
Profile Guided Optimization
More PGO Tools
PGO and the Visual Studio IDE
Some Tips for PGO Use
Conclusion

Introduction

There are several reasons to program in C++, and one of the most important is the incredible performance you can obtain. With the release of Microsoft® Visual C++® 2005, you not only get great performance from all of the traditional methods of optimization, but also from a new technique that lets you get even more out of your application. In this article we show how to use profile-guided optimization (PGO) to achieve this performance.

How Traditional C++ Compilers Work

To get a full appreciation of profile-guided optimization, let's start with a discussion of how traditional compilers decide which optimizations to do.

Traditional compilers perform optimizations based on static source files. That is, they analyze the text of the source file, but use no knowledge about potential user input that is not directly obtainable from the source code. For example, consider the following function in a .cpp file:

int setArray(int a, int *array)
{
   int x;
   for(x = 0; x < a; ++x)
      array[x] = 0;
   return x;
}

From this file the compiler knows nothing about the potential values of "a" (other than that it must be an int), nor does it know anything about the typical alignment of the array.
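To see what run-time knowledge could buy, here is a hand-written sketch of the kind of value specialization a profile-aware compiler could perform. The function name and the assumption that profiling showed "a" is almost always 4 are both hypothetical, purely for illustration:

```cpp
#include <cassert>

// Hand-written sketch: a value-specialized setArray, assuming (hypothetically)
// that profiling showed 'a' is almost always 4.
int setArraySpecialized(int a, int *array)
{
    if (a == 4) {
        // Fast path for the common value: fully unrolled, no loop overhead.
        array[0] = 0; array[1] = 0; array[2] = 0; array[3] = 0;
        return 4;
    }
    // General fallback path for all other values of 'a'.
    int x;
    for (x = 0; x < a; ++x)
        array[x] = 0;
    return x;
}
```

The guarded fast path is the shape of transformation that the value probes discussed later make possible; without a profile, the compiler has no basis for picking 4 over any other value.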

This compiler/linker model is not particularly bad, but it misses two major opportunities for optimization: first, it doesn't exploit information that it could gain from analyzing all source files together; second, it does not make any optimizations based upon the expected or profiled behavior of the application. With whole program optimization (WPO) and profile-guided optimization (PGO), we'll do both of these things.

Whole Program Optimization with Link-Time Code Generation

With Visual C++ 7.0 and beyond, including all recent versions of the Itanium® compiler, Visual C++ has supported a mechanism known as link-time code generation (LTCG). I won't spend too much time in this section, as Matt Pietrek wrote a good article on LTCG in his Under the Hood column (May 2002), which is freely available from MSDN. But here are the basics...

LTCG is a technology that allows the compiler to effectively compile all the source files as a single translation unit. This is done in a two-step process:

  1. The compiler compiles the source file and emits the result as an intermediate language (IL) into the generated .obj file rather than a standard object file. It's worth noting that this IL is not the same thing as MSIL (which is used by the Microsoft® .NET Framework).
  2. When the linker is invoked with /LTCG, it actually invokes the backend to compile all the code that was compiled with WPO. All of the IL from the WPO .obj files is aggregated, and a call graph of the complete program can be generated. From this the compiler backend and linker compile the whole program and link it into an executable image.

With WPO the compiler now has more information about the structure of the entire program. Thus it can be more effective in performing certain types of optimizations. For example, when doing traditional compilation/linking, the compiler could not inline a function from source file foo.cpp to source file bar.cpp. When compiling bar.cpp, the compiler does not have any info about foo.cpp. With WPO the compiler now has both bar.cpp and foo.cpp (in IL form) available to it, and can make optimizations that ordinarily would not be possible (like cross translation unit inlining).
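The cross-translation-unit inlining just described can be sketched with a toy pair of functions (hypothetical names); in a real project these would live in separate files, shown here as one file for illustration:

```cpp
#include <cassert>

// --- foo.cpp would contain only this definition ---
int add_tax(int price)
{
    return price + price / 10;   // add 10% tax
}

// --- bar.cpp would contain only this caller ---
// Compiled separately, the compiler sees just a declaration of add_tax
// and must emit a real call. With /GL + /LTCG, the IL for both files is
// visible at link time, so add_tax can be inlined into checkout.
int checkout(int price)
{
    return add_tax(price);
}
```

The observable behavior is identical either way; LTCG simply removes the call overhead and opens up further optimization of the combined body.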

How do you compile an application to make use of LTCG? There are two steps:

  1. First compile the source code with the Whole Program Optimization compiler switch (/GL):
    cl.exe /GL /O2 /c test.cpp foo.cpp
    
    
  2. Then link all of the object files in the program with the /LTCG switch.
    link /LTCG test.obj foo.obj /out:test.exe
    
    

That's it. You can now run the generated executable, and usually it will be faster. We've seen a lot of benefit with LTCG, but it does not come for free: there are increased memory requirements at compile/link time, because the IL for all compilation units, potentially tens or hundreds of them, must be addressable at once. This can increase the memory needed to build a project, and can further increase the total build time.

Profile Guided Optimization

LTCG can obviously give you some performance benefit, but we have only just begun improving the performance of your application. Another new technology used in conjunction with LTCG can give an additional performance boost, and in many cases this boost can be very significant. This technology is called profile-guided optimization (PGO).

The idea behind PGO is simple: Generate profiles from running the executable/dll on real-world inputs, which are then used to assist the compiler in generating optimized code for the particular executable. (Note that PGO can be applied to optimizing unmanaged executables or DLLs, but not to .NET/managed images. For the rest of the article, I'll simply refer to the optimized image as an executable or application, although the information applies equally to DLLs.) Really that's about all there is to it, but there are details worth investigating.

There are three general phases to creating a PGO application:

  1. Compile into instrumented code.
  2. Train instrumented code.
  3. Re-compile into optimized code.

We explain each of these three phases in more depth below. See Figure 1 for a graphical representation of the process.

Figure 1. The PGO build process

Compile Instrumented Code

The first phase is instrumenting the executable. To do this, you first compile the source files with WPO (/GL). After this, take all of the object files from the application and link them with the /LTCG:PGINSTRUMENT switch (this can be abbreviated as /LTCG:PGI). Note that not all files need to be compiled with /GL for PGO to work on the application as a whole: PGO will instrument those files compiled with /GL and won't instrument those that aren't.

The instrumentation is done by strategically placing different types of probes in the code. The probes fall into two rough categories: those that collect flow information and those that collect value information. I won't go into detail as to how we decide which probes to use and where to place them, but we do go through painstaking effort to make efficient use of them. It's also worth noting that the instrumented code may not be as optimized as the same un-instrumented /O2 code, although we do as many optimizations as we can without interfering with the probes. So, with the combination of the instrumentation and the less-optimized code, expect your instrumented application to run slower. (The final optimized code, of course, is generated without the probes.)
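As a rough, hand-written illustration (not the compiler's actual instrumentation), a flow probe is conceptually just a counter attached to each branch arm; the function and counter names here are invented:

```cpp
#include <cassert>

// Hypothetical hand-written flow probes: one counter per branch arm.
static long long probe_taken = 0;
static long long probe_not_taken = 0;

int clamp(int v)
{
    if (v > 100) {
        ++probe_taken;       // probe: records how often this arm runs
        return 100;
    }
    ++probe_not_taken;       // probe: records the fall-through arm
    return v;
}
```

After a training run, the counter values tell the optimizer which arm is hot. The real instrumentation also places value probes, which record, for example, the distribution of values that v takes on.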

The result of linking /LTCG:PGI will be an executable or DLL and a PGO database file (.PGD). By default the PGD file takes the name of the generated executable, but the user can specify the name of the PGD file when linking with the /PGD:filename linker option.

Table 1 below lists the files that will be generated after each step given in the left column. Note that no files are removed.

Table 1. Generated files after each step

Step                                                              Files generated
At the start of compilation                                       MyApp.cpp, foo.cpp
After compiling with /c /GL                                       MyApp.obj, foo.obj
After linking with /LTCG:PGI /PGD:MyApp.pgd /out:MyApp.inst.exe   MyApp.inst.exe, MyApp.pgd
After training the instrumented application with three scenarios  MyApp1.pgc, MyApp2.pgc, MyApp3.pgc
After relinking with /LTCG:PGO /PGD:MyApp.pgd                     MyApp.opt.exe

Train Instrumented Code

After creating the instrumented executable, the next step is to train it. You do this by running the executable with scenarios that reflect how it will be used in real life. The output of each scenario run is a PGO count file (.PGC). These files take the same name as the .PGD file with a number appended on the end (starting with "1" and increasing with subsequent runs). A given .PGC file can be deleted if you decide that the particular scenario wasn't useful.

Compile Optimized Code

The last step is to relink the executable with the profile information collected from running the scenarios. This time when you link the application, you use the linker switch /LTCG:PGOPTIMIZE (or /LTCG:PGO). This uses the generated profile data to create an optimized executable. Prior to this optimization, the linker automatically invokes pgomgr, which by default merges into the .PGD file all of the .PGC files in the current directory whose names match the .PGD file.

  • Updating source code. It's important to note that if the source code of the application changed after the .PGD file was generated, then /LTCG:PGO will revert to simply doing an /LTCG build, and will not use any of the profile information. So what do you do if you've spent considerable time generating profiles from your instrumented code, and then realize that you need to make a small change to the code, but would like to reuse the profiles you've generated? In this case you can specify /LTCG:PGUPDATE (or /LTCG:PGU). PGUPDATE allows the linker to compile the modified source code while still using the original .PGD file.

What PGO Can Do

We now have an understanding of how to generate PGO applications, so the question is: what does PGO do for us? What optimizations does it enable? Here we give a partial list, and we expect the set of optimizations to expand as we find new optimizations and learn better heuristics.
  • Inlining. As described earlier, WPO gives the compiler the ability to find more inlining opportunities. With PGO this is supplemented with call-frequency information to help make the inlining decision. For example, examine the call graphs in Figures 2, 3, and 4 below.

    In Figure 2 we see that a, foo, and bat all call bar, which in turn calls baz.

    Figure 2. The original call graph of a program

    Figure 3. The measured call frequencies, obtained with PGO

    Figure 4. The optimized call-graph based on the profile obtained in Figure 3

  • Partial Inlining. Next is an optimization that is at least partially familiar to most programmers. In many hot functions, there exist paths of code within the function that are not so hot; some are downright cold. In Figure 5 below, we will inline the purple sections of code, but not the blue.

    Figure 5. A control flow graph, where the purple nodes get inlined, while the blue node does not

  • Cold Code Separation. Code blocks that are not executed during profiling (cold code) are moved to the end of the set of sections. Thus, the pages in the working set consist mostly of instructions that, according to the profile information, will actually be executed.

    Figure 6. Control flow graph showing how the optimized layout moves basic blocks together that are used more often, and cold basic blocks further away.

  • Size/Speed Optimization. Functions that are called more often can be optimized for speed while those that are called less frequently get optimized for size. This tends to be the right tradeoff.
  • Block Layout. In this optimization, we form the hottest paths through a function, and lay them out such that hot paths are spatially located closer together. This can increase the utilization of the instruction cache and decrease the working set size and number of pages used.
  • Virtual Call Speculation. Virtual calls can be expensive due to the indirection through the vtable to invoke the method. With PGO, the compiler can speculate at the call site of a virtual call and inline the method of the speculated type into the virtual call site; the data needed to make this decision is gathered with the instrumented application. In the optimized code, the guard around the inlined function is a check to ensure that the type of the object matches the speculated type.

    The following pseudocode shows a base class, two derived classes, and a function invoking a virtual function.

    class Base{
    ...
       virtual void call();
    };
    
    class Foo : public Base{
    ...
       void call();
    };
    
    class Bar : public Base{
    ...
       void call();
    };
    
    // This is the Func function before PGO has optimized it. 
    void Func(Base *A){
       ...
       while(true) {
          ...
          A->call();
          ...
       }
    }
    
    
    

    The code below shows the result of optimizing the above code, given that the dynamic type of "A" is almost always Foo.

    // This is the Func function after PGO has optimized it.
    void Func(Base *A){
       ...
       while(true) {
          ...
          if(type(A) == Foo) {
             // inlined body of A->call(), i.e. Foo::call()
          }
          else
             A->call();
          ...
       }
    }
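The guarded call above can also be written as real, compilable C++, using typeid as a stand-in for the cheaper internal type check the compiler actually emits. This is a hand-written sketch of the transformation, not the compiler's output, and the class names and counters are illustrative:

```cpp
#include <cassert>
#include <typeinfo>

static int direct_calls = 0;   // times the speculated fast path ran
static int virtual_calls = 0;  // times we fell back to virtual dispatch

struct Base {
    virtual ~Base() = default;
    virtual void call() {}
};
struct Foo : Base { void call() override {} };
struct Bar : Base { void call() override {} };

// Hand-written version of the guarded devirtualization PGO performs,
// assuming the profile showed that *A is almost always a Foo.
void Func(Base *A)
{
    if (typeid(*A) == typeid(Foo)) {
        ++direct_calls;
        // Non-virtual, qualified call on the speculated type: the
        // compiler is free to inline Foo::call here.
        static_cast<Foo *>(A)->Foo::call();
    } else {
        ++virtual_calls;
        A->call();   // fallback: ordinary virtual dispatch
    }
}
```

The fast path pays only a type comparison, while preserving correct behavior for any other derived type through the fallback branch.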
    
    

DLL Use

A short note on PGO and DLLs: you train/profile a DLL by running an executable that loads the DLL on a set of representative scenarios. You can even use different executables for different scenarios and merge all of the resulting counts into a single .PGD file. It is important to know that PGO technology currently does not support static libraries.

General Effectiveness

The current implementation of PGO has proven to be extremely effective in getting real-world performance. For example, we've seen 30%+ improvement on large real-world applications such as Microsoft® SQL Server, and 4-15% gains on the SPEC benchmarks (depending on the architecture). See Figure 7 for performance speedup of PGO over the best static compilation setting (Link Time Code Generation) using the SPEC benchmarks.

Figure 7. SPEC performance improvement with PGO over that of Link Time Code Generation, for all three platforms

More PGO Tools

PGO support in Microsoft® Visual C++ comes with a couple of tools to help the user do precisely what they need to do. This section describes each of these included tools.

  • pgomgr. pgomgr is a tool to do post-processing on .pgd files generated by PGO. (Note that for PSDK Itanium compiler users, pgomgr in Whidbey replaces pgmerge and part of pgopt from the old PSDK compiler. These two tools are no longer available, as their functionality has been subsumed.) The .PGC files need to be merged into the .PGD to be used for the optimize phase of PGO. The pgomgr tool does this merging. The syntax for this statement is:
    pgomgr [options] [Profile-Count paths] <Profile-Database>
    
    

    By default, if /LTCG:PGO is run in a directory with .PGC files, those .PGC files will be merged automatically if they match the .PGD file for the program being linked (so you needn't always invoke pgomgr yourself). The list of options is given here:

    • /?          Displays help
    • /help       Same as /?
    • /clear      Removes all merged data from the specified .PGD file
    • /detail     Displays verbose program statistics
    • /merge[:n]  Merges the given .PGC file(s), with an optional integer weight
    • /summary    Displays program statistics
    • /unique     Displays decorated function names
  • pgosweep. The pgosweep tool interrupts a running program that was built with PGO instrumentation, writes the current counts to a new .PGC file, and then clears the counts from the runtime data structures. This program has two main intended uses: first, using PGO on code that never terminates (for instance, an OS kernel); second, obtaining precise profile information about a certain part of the program. For example, you may not want to profile the night-time scenarios of an application, so you use pgosweep in those situations and then delete the .PGC files from that part of the scenario.

    The usage for pgosweep is:

       pgosweep <instrumented image> <.PGC file to be created>
    
    

PGO and the Visual Studio IDE

Command-line tools are great, but if you're working within the Microsoft® Visual Studio® Integrated Development Environment (IDE), you may want to leverage functionality such as PGO from within the IDE itself. Visual Studio 2005 (formerly known as Visual Studio "Whidbey") offers support for PGO through a set of menu items that allow the programmer to do an instrumented build, run scenarios, do an optimized build, or do an update build. The instrumented, optimized, and update builds all produce an .exe or .dll file as the output. The optimized build and the update build require a .pgd file to be available; this .pgd file can be generated by running the Run Profiling Scenario menu item.


Figure 8. A screen shot of PGO support in the Visual Studio 2005 IDE

Some Tips for PGO Use

Here are some basic tips that can improve your PGO experience.

  1. The scenarios used to generate the profile data should resemble the real-world scenarios the application will encounter when deployed. The scenarios are NOT an attempt at code coverage.
  2. Using scenarios to train with that are not representative of real-world use can result in code that performs worse than if PGO was not used.
  3. Name the optimized executable something different from the instrumented executable, for example, app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without redoing everything.
  4. To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
  5. If you have two scenarios that run for different amounts of time, but would like them to be weighted equally, you can use the weight switch (/merge:weight in pgomgr) on .PGC files to adjust them.
  6. You can use the speed switch to change the speed/size thresholds.
  7. Use the inline threshold switch with great caution. The values from 0-100 aren't linear.

Conclusion

In closing, the basic steps for generating a PGO application are:

  1. Compile with Whole Program Optimization (/GL) the files that you would like PGO to work on. You can selectively exclude files from whole program optimization and PGO by not compiling them with /GL.
  2. Link the application with Link Time Code Generation using /LTCG:PGINSTRUMENT. This generates an instrumented executable.
  3. Train the application with scenarios generating .pgc files.
  4. Re-compile the application (although you're invoking the linker) with /LTCG:PGOPTIMIZE. This will optimize the executable based on the profile data.

The end result is a program or library that is optimized for the real-world situations that your program or library will run under.

About the Author

Kang Su Gatlin is a Program Manager at Microsoft in the Visual C++ group. He received his PhD from UC San Diego. His focus is on high-performance computation and optimization—essentially he enjoys making code run fast.
