Under The Hood: Link-time Code Generation

MSDN Magazine

Link-time Code Generation
I

n all the hoopla surrounding the release of the Microsoft® .NET Framework and Visual Studio® .NET, a really slick new feature in Visual Studio .NET is often overlooked. This enhancement is a special goodie for those of you who still write unmanaged code in good old C++. The new feature goes by a couple of names: whole program optimization and link-time code generation. John Robbins covered these features briefly in the August 2001 installment of Bugslayer. This month, I'll take the time to really drill into the subject and show you just how cool it can be. To keep things easy to read, I'll refer to link-time code generation as LTCG throughout this column.
      To understand how LTCG improves matters, it's important to know all the players. When you build a plain old non-.NET program, three primary components that come into play: the front end, the back end, and the linker. The first of these is the compiler front end. In Visual C++®, the front end is comprised of C1.DLL and C1XX.DLL. The front end takes your source code, tokenizes it, parses it, verifies its syntactic correctness, and so on. The output of the front end is intermediate language (IL).
      The IL represents all the flow constructs, variable declarations, and so on in your code. However, it is a binary format and not easily read by humans. Comments, spacing, variable names, and so on are unimportant at this point. Theoretically, IL is not specific to any particular CPU. Instead, it is portable across any CPU that the compiler might target.
      The second player is the back end, which is also known as the code generator because it takes the IL from the front end and generates code targeted to the CPU. In Visual C++, C2.DLL is the back end. The back end is also where optimizations take place. This is a key point that I'll discuss later. The usual output from the back end is an .OBJ file that has processor-specific instructions, as well as any data that's declared. You can see the processor-specific code for yourself by running DUMPBIN /disasm on an .OBJ file.
      The final player is the linker. In Visual C++, the linker is LINK.EXE. The linker's job is to take all the OBJ files, along with any supplied .LIB files, and create an executable image. This usually consists of merging all the code from the contributing OBJs into a single section in the executable. Ditto for data. Of course, there are things like performing fixups that make the linker more than a simple merge utility. The rules that the linker must follow are pretty simple on a grand scale, but get pretty esoteric when you drill down a bit. Parts 1 and 2 of "Inside Windows: An In-Depth Look into the Win32 Portable Executable File Format" in the February 2002 and March 2002 issues describe some of the rules that the Microsoft linker follows.

But What About .NET?

      In the .NET universe, these players have slightly different roles. The end result of the compiler front end and back end isn't processor-specific code. It's IL, but not the same IL that the Visual C++ compiler uses when creating a classic executable. Instead, the IL for .NET goes into the executable file. At run time, the .NET JIT compiler takes the IL and generates real code in dynamically allocated memory. The fact that the actual code for a method can exist pretty much anywhere in memory is one reason why .NET has completely different debugging and profiling APIs than Win32®-based programs.
      The linker is still used in .NET for certain languages (in particular, Visual C++ with managed extensions). It performs much the same job as it does when creating a regular executable. The primary difference is that instead of working with actual code when merging together all the OBJ's code sections, it's working with IL.

The Normal Build Process

      In every version of Visual C++ up to Visual C++ .NET, and when using Visual C++ .NET without LTCG, the compiler front end invokes the back end to generate code and corresponding OBJ files. Next, the IDE or the make file invokes the linker, supplying it with the list of OBJ files and LIB files to process. (If you run CL from the command line without using the /c switch, it will invoke the linker for you.)
      It's important to note that in a normal build, the linker is working with actual CPU-specific code. While the linker is a pretty clever piece of code, it's not so smart that it will start making modifications to the code in the OBJ files supplied to it. Basically, the linker's hands are tied. It can only work with the code that's handed to it by the back end.
      You might be thinking, "why is this a bad thing?" After all, the Visual C++ back end can do some pretty nifty optimizations. But there are limits to what it can do, primarily because of boundary conditions and lack of knowledge.
      An example of a boundary condition is when one function calls another function in a different source file. Because the compiler has limited knowledge of what the other function looks like, it has to rely on calling conventions to ensure it's called correctly. A calling convention specifies that parameters will be passed at certain locations on the stack or in certain registers, the order in which parameters are passed, and who will clean up the stack if necessary. The three prevalent calling conventions are cdecl (the default), stdcall, and fastcall.
      While the use of calling conventions ensures a certain level of safety, they also potentially add to the amount of code generated. To ensure that the calling conventions are adhered to, the code generator may need to add additional instructions that might not be absolutely necessary.
      Another example of when lack of knowledge hinders the optimizer occurs with function inlining. Let's say in one source file you have a function like this:
  int DoubleTheValue( int x )
{
    return x * 2;
}

In a second source file, you call that function. Unfortunately, when generating the code for the second source file, the back end has no knowledge of the DoubleTheValue function, other than that it exists elsewhere. If the code generator knew that the DoubleTheValue function was so simple, it might just incorporate its logic into the function it was currently working on. This process is known as inlining a function.
      In the past, you could frequently get a function to be inlined by declaring it in a header file, and including that header file appropriately. This happened often with small C++ class methods such as accessor methods like get and set. However, if the back end hasn't seen the source code for a function during the current compilation run, there's no way it can inline it. Thus, functions in another source file, no matter how small, would never be inlined.
      To get around these limitations, Visual C++ .NET introduces link-time code generation. When building with LTCG, the compiler front end doesn't invoke the back end. Instead, it emits an OBJ file with IL in it. It bears repeating: this IL is not the same IL that the .NET runtime uses. While .NET IL is standardized and documented, the IL used with LTCG is undocumented and subject to change from version to version of the compiler.
      When the linker is invoked with IL-based OBJ files, it calls COM methods in the back end (C2.DLL) to generate the final, processor-specific code. When the linker invokes the back end, it gives the back end all of the IL from all the OBJs (as well as any pregenerated code like you might find in a .LIB file.) Because the code generator now has almost perfect knowledge of the code, it can be much more aggressive when it optimizes. The primary changes you'll notice from LTCG-generated code is that many more functions are inlined, and that many functions are called without using one of the standard calling conventions. I'll drill into some of these later.
      It's worth noting that the neither the linker nor the compiler back end got significantly smarter here. The major work in implementing LTCG was the juggling required to allow the back end to be called from either the front end or from the linker. The linker also was smartened up so that it could work with a combination of IL-based OBJ files and normal OBJ and LIB files already containing actual code.
      To be fair to other companies, Microsoft did not invent the concept of LTCG. One notable earlier implementation was the Intel C++ compiler for Windows®. The equivalent functionality is termed Inter Procedural Optimization or IPO (cute acronym, eh?).

Building With LTCG

      Hopefully by this point, you're excited enough to want to try LTCG with your own projects. To use LTCG from the Visual Studio .NET IDE is as simple as ensuring that the Whole Program Optimization option is set to Yes. This can be found in the project's Configuration Properties | General pane.
      To use LTCG from the command line, you need to inform both the compiler and linker what's going on. The compiler switch to enable link-time code generation is /GL. If you invoke the linker, make sure to add the /LTCG command-line option. If you forget to add /LTCG and give the linker an OBJ with IL in it, the linker restarts with /LTCG enabled.

Drilling into LTCG Optimizations

      Enough theory. Let's see firsthand what LTCG can do for you. Although there are a variety of optimizations enabled by LTCG, I'll focus on three here: cross-module inlining, custom calling conventions, and smaller thread local storage (TLS) access. Warning: in this section, I assume you can read x86 assembly language. It's hard to show the benefits of these optimizations if you don't know the basic x86 constructs.
      To see the effects of cross-module inlining, first examine the source files for inlining.cpp and inlining2.cpp in Figure 1 and Figure 2. This is a do-nothing program that illustrates how the back end can embed the code for Goober within the main program.
      To build the program in Visual C++ 6.0, use the command line:
  CL /O2 inlining.cpp inlining2.cpp
      Figure 3 is an annotated assembly listing that shows the code generated by Visual C++ 6.0. It's not much of a surprise. As you'd expect, function main makes a cdecl style call to the Goober function. Now, let's see how LTCG can improve things. Switching to Visual C++ .NET, add the /GL switch to the previous command line like so:
 CL /GL /O2 inlining.cpp inlining2.cpp 
      The result is much better code, as seen in Figure 4. Without inlining, four instructions are needed to make the call, including two PUSH instructions to put parameters on the stack. The Goober function itself takes seven instructions. The MOV EAX,[ESP+4] and RET instruction in Goober's code are unnecessary if Goober is inlined. With inlining, Goober's code in main only uses five instructions. The net effect of inlining here is a saving of six instructions.
      The next optimization that LTCG adds is custom calling conventions. Normally, all functions are either cdecl, stdcall, or fastcall. With custom calling conventions, the back end has enough knowledge that it can pass more values in registers, and less on the stack. This usually cuts code size and improves performance.
      To see the benefits of custom calling conventions, examine CCC.CPP in Figure 5. The foo function doesn't do anything significant. It does take four parameters (two integers and two pointers). Likewise, the main function doesn't do anything worthwhile, except to call the foo function.
      Incidentally, the compiler desperately wants to inline function foo inside main. However, in doing so, it would obscure the effects of custom calling conventions. Therefore, I cheated and used the __declspec(noinline) directive on foo to prevent it from being inlined. Build CCC.CPP from the command line like so:
  CL /GL /O2 CCC.CPP

      Figure 6 shows the annotated assembly code for functions foo and main. Again, I have preceded each instruction group by a comment that identifies which source line is responsible for it. The key point is that all four parameters were passed in registers (EAX, EDX, ESI, and EDI). Even the fastcall convention passes at most two parameters in registers.
      It's also interesting to note that the back end remembers subexpression calculations. In the line
  *k = i + l;

the value of i + l is calculated and stored in ECX.
      In the next line
  return i + *j + *k + l;

i and l are added again. The compiler remembers that it has already calculated this value previously, and reuses the value previously stored in ECX.
      The benefits of custom calling conventions don't come only from functions in other source modules. To use a custom calling convention for a function, the back end must see and generate all the calls to the function at the same time. This is the only way to ensure that the function is always called correctly. If there's even a chance that a function might be called from previously generated code, the compiler backs off and decides not to use a custom calling convention at all.
      The final performance improvement from LTCG that I'll look at it is called Small TLS Encoding. When you use __declspec(thread) variables, the code generator stores the variables at a fixed offset in each per-thread data area. Without LTCG, the code generator has no idea of how many __declspec(thread) variables there will be. As such, it must generate code that assumes the worst, and uses a four-byte offset to access the variable.
      With LTCG, the code generator has the opportunity to examine all __declspec(thread) variables, and note how often they're used. The code generator can put the smaller, more frequently used variables at the beginning of the per-thread data area and use a one-byte offset to access them.
      Figure 7 shows a simple do-nothing program that declares and uses a __declspec(thread) variable named i. Figure 8 shows the results of compiling as follows:
  CL /O2 tls.cpp

I won't attempt to explain the code gyrations in the generated code. It is important to note that the "mov dword ptr [ecx+4],eax" instruction uses a four-byte encoding for the +4 value. The whole routine takes 0x1B bytes.
      Adding the /GL option to enable LTCG generates much better code, as shown in Figure 9. Not only does the code use two fewer instructions, the storage instruction "mov dword ptr [ecx+8],eax" uses only one byte to encode the value +8. Here, the whole routine takes only 0x11 bytes for a code size savings of 37 percent.
      Microsoft claims that on some programs, LTCG boosts performance by 10 to 15 percent. A boost of 3 to 5 percent is more common for real world x86 programs. In my own informal test, I found a similar improvement on code size and about a 15 percent improvement in speed.

Details Behind LTCG Decision Making

      One of my very first questions about LTCG had to do with how it decides when to inline a function. Playing around with simple scenarios didn't yield a good answer, so I did the next best thing: I asked. Here's the general algorithm that the code generator uses when deciding whether to inline a function or not.
      Any function that may be inlined has a cost associated with it. If the cost is below a certain threshold, the inlining occurs. The thresholds when inlining for size and speed are different. The size of the application also affects the threshold when optimizing for speed. Smaller applications are more aggressively inlined than large ones, which means the threshold value is higher for small programs.
      The cost of a function begins with its initial size in non-inlined form. Various factors increase or decrease the cost. If parameters with constant values are passed to a function which uses those parameters in if or switch statements, the cost is reduced. In addition, the overhead of setting up for a function call (including pushing parameters) is subtracted from a function's cost. Functions that don't branch have their cost reduced, as linear code is more easily optimized.
      If compiling for size, the number of calls to a function affects its cost. If a function is only called once, its cost is heavily reduced. If it's called only a few times, its cost is moderately reduced. A function that's called all over the place will have its cost increased, as the inlining will likely increase the code size of the final program.
      When compiling for speed, the cost of a function goes up if the function contains large loops. Large loops aren't easily optimized. If the call is heavily nested inside if statements, the cost increases, as the benefits of inlining will only be seen when all right conditions are met. If the function call is inside nested loops, its cost is decreased, making it more likely to be inlined.
      After going through all these heuristics, a final cost is calculated. If it's below the target threshold, the inlining occurs. Of course, a function will never be inlined if the code generator can't determine all the call sites of a given function in advance. This topic itself raises another interesting question: how does the code generator know which functions it can safely inline or generate custom calling conventions for?
      The brief answer is that the first condition needed for effective use of LTCG is that the function exists in IL form, rather than in its final CPU-specific form. Thus, calls to functions in static libraries can't be optimized. There's also a set of boundary conditions that cause the code generator to back off and use standard rules.
      When the linker invokes the back end, the code generator creates a control flow graph for as much of the program as possible. The control flow graph is a big picture of what calls what, and what all the dependencies between functions are. Certain items cause code to go outside of the normal flow graph, and hence lose the ability to be optimized. The big killers here are DLLs. Any calls to or from a DLL go outside the visible flow graph, and thus aren't candidates for LTCG optimization. Standard conventions must apply to any code that is declared external, is exported or imported, or has its address taken. The same goes for variables.

Limitations on LTCG Use

      While LTCG is generally a good thing, there are a few potential pitfalls that might affect you. First, precompiled headers and LTCG are incompatible. This shouldn't be an issue for most users, since you typically only turn on LTCG in release builds, where compilation time usually isn't a problem.
      Next, the OBJ files produced when using LTCG aren't standard COFF format OBJs. Again, this isn't a problem for most people, but if you examine OBJ files with tools like dumpbin, you're simply out of luck—it won't work.
      Finally, there's the subject of libraries. It's possible to create .LIB files with code in its IL form. The linker will happily work with these .LIB files. Be aware that these libraries will be tied to a specific version of the compiler and linker. If you distribute these libraries, you'll need to update them if Microsoft changes the format of IL in a future release.

Got a question? Send questions and comments to hood@microsoft.com.
Matt Pietrek is an independent writer, consultant, and trainer. He was the lead architect for Compuware/NuMega's Bounds Checker product line for eight years and has authored three books on Windows system programming. His Web site, at http://www.wheaty.net, has a FAQ page and information on previous columns and articles.

From the May 2002 issue of MSDN Magazine