May 2015

Volume 30 Number 5


Compilers - What Every Programmer Should Know About Compiler Optimizations, Part 2

By Hadi Brais

Welcome to the second part of my series on compiler optimizations. In the first article (msdn.microsoft.com/magazine/dn904673), I discussed function inlining, loop unrolling, loop-invariant code motion, automatic vectorization and COMDAT optimizations. In this second article, I’m going to look at two other optimizations—register allocation and instruction scheduling. As always, I’ll be focusing on the Visual C++ compiler, with a brief discussion of how things work in the Microsoft .NET Framework. I’ll be using Visual Studio 2013 to compile the code. Let’s get started.

Register Allocation

Register allocation is the process of allocating a set of variables to the available registers so that they don’t have to be allocated in memory. This process is usually performed at the level of a whole function. However, especially when Link-Time Code Generation (/LTCG) is enabled, the process can be performed across functions, which may result in a more efficient allocation. (In this section, all variables are automatic—those whose lifetimes are determined syntactically—unless otherwise mentioned.)

Register allocation is a particularly important optimization. To understand why, consider how long it takes to access the different levels of the memory hierarchy. Accessing a register takes less than one processor cycle. Accessing the cache is a little slower and takes from a few to tens of cycles. Accessing main memory (DRAM) is even slower than that. Finally, accessing the hard drive is terribly slow and can take millions of cycles! Also, a memory access increases traffic to the shared caches and main memory. Register allocation reduces the number of memory accesses by utilizing the available registers as much as possible.

The compiler tries to allocate a register to each variable, ideally until all instructions involving that variable have been executed. If this isn’t possible, which is common for reasons I’ll discuss shortly, one or more variables have to be spilled to memory, meaning they must be loaded and stored repeatedly. Register pressure refers to the demand for registers at a given point in the code; when it exceeds the number of available registers, variables have to be spilled. Higher register pressure means more memory accesses, and more memory accesses can not only slow the program itself, but also bring the whole system to a crawl.

Modern x86 processors offer the following registers for compilers to allocate: eight 32-bit general-purpose registers, eight 80-bit floating-point registers and eight 128-bit vector registers. All x64 processors offer 16 64-bit general-purpose registers, eight 80-bit floating-point registers and at least 16 vector registers, each at least 128 bits wide. Modern 32-bit ARM processors offer 15 32-bit general-purpose registers and 32 64-bit floating-point registers. All 64-bit ARM processors offer 31 64-bit general-purpose registers and 32 128-bit registers shared by floating-point and vector (NEON) code. All of these are available for register allocation (and you can also add to the list the registers offered by the graphics card). When a local variable can’t be allocated to any of the available registers, it has to be allocated on the stack. This happens in almost every function for various reasons, as I’ll discuss. Consider the program in Figure 1. The program doesn’t do anything meaningful, but it serves as a good example to demonstrate register allocation.

Figure 1 Register Allocation Example Program

#include <stdio.h>
int main() {
  int n = 0, m;
  scanf_s("%d", &m);
  for (int i = 0; i < m; ++i){
    n += i;
  }
  for (int j = 0; j < m; ++j){
    n += j;
  }
  printf("%d", n);
  return 0;
}

Before it allocates the available registers to variables, the compiler first analyzes the use of all declared variables within the function (or across functions in the case of /LTCG) to determine which sets of variables are live at the same time and to estimate how many times each variable is accessed. Two variables that are never live at the same time can be allocated the same register. If there aren’t enough suitable registers for the variables that are live at the same time, some of them have to be spilled. The compiler tries to choose the least-accessed variables to spill in an attempt to minimize the total number of memory accesses. That’s the general idea. However, there are many special cases in which it’s possible to find a better allocation. Modern compilers are capable of devising a good allocation, but not the optimal one. It’s very, very hard for a mortal to do better, though.

With this in mind, I’ll compile the program in Figure 1 with optimizations enabled and see how the compiler will allocate local variables to registers. There are four variables to be allocated: n, m, i and j. I’ll assume the target in this case is the x86 platform. By examining the generated assembly code (/FA), I notice the variable n has been allocated to the register ESI, the variable m has been allocated to ECX, and both i and j have been allocated to EAX. Note how the compiler cleverly reused EAX for two variables because their lifetimes do not intersect. Also note that the compiler has reserved a space on the stack for m because its address is taken. On the x64 platform, the variable n will be allocated to the register EDI, the variable m will be allocated to EDX, i to EAX, and j to EBX. For some reason, the compiler didn’t allocate i and j to the same register this time.

Was that an optimal allocation? No. The problem is in using ESI and EDI. These are callee-saved registers, meaning the called function has to make sure that the values they hold at exit are the same as at entry. That’s why the compiler had to emit an instruction at the entry of the function to push ESI/EDI on the stack and another instruction at the exit to pop them from the stack. The compiler could’ve avoided this on both platforms by using a caller-saved register, such as EDX. Such deficiencies in the register allocation algorithm can sometimes be mitigated by function inlining. Many other optimizations can render the code amenable to a more efficient register allocation, such as dead code elimination, common subexpression elimination and instruction scheduling.

It’s actually common for variables to have disjoint lifetimes, so allocating the same register to all of them is very economical. But what if you run out of registers to accommodate them? You have to spill them. However, you can do that in a clever way: spill all of them to the same location on the stack. This optimization is called stack packing and it’s supported by Visual C++. Stack packing reduces the size of the stack frame and may improve the data cache hit ratio, resulting in better performance.
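
Here’s a minimal sketch (not from the article) of code that’s a candidate for stack packing. The two buffers are too large to keep in registers and have disjoint lifetimes, so the compiler may place them at the same stack offset, roughly halving the frame size; whether it actually does depends on its analysis (compile with /O2 and inspect the /FA listing):

#include <stdio.h>
int main() {
  {
    char first[256];                              /* Live only inside this block */
    scanf_s("%255s", first, (unsigned)sizeof(first));
    printf("%s\n", first);
  }
  {
    char second[256];                             /* Lifetime starts after first ends */
    scanf_s("%255s", second, (unsigned)sizeof(second));
    printf("%s\n", second);
  }
  return 0;
}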

Unfortunately, things are not that simple. From a theoretical perspective, (near) optimal register allocation can be achieved. In practice, however, there are many reasons why this may not be possible:

  • The available registers on the x86 and x64 platforms (mentioned earlier) and on any other modern platform (such as ARM) can’t be used arbitrarily; there are complex restrictions. Each instruction imposes restrictions on which registers can be used as its operands, so if you want to use an instruction, you have to pass it the required operands in the allowed registers. Also, some instructions store their results in predetermined registers, clobbering whatever values those registers held. There could be a different sequence of instructions that performs the same computation but allows a more efficient register allocation. The problems of instruction selection, instruction scheduling and register allocation are frightfully entangled.
  • Not all variables are of primitive types. It’s not uncommon to have automatic structures and arrays. Such variables can’t be considered for register allocation as a whole. However, their fields or elements can be allocated to registers individually. Current compilers aren’t that good at this yet.
  • The calling convention of a function imposes a fixed allocation for some arguments while rendering others ineligible for allocation irrespective of the availability of registers. More on this issue a bit later. Also, the notions of caller-saved and callee-saved registers make things trickier.
  • If the address of a variable is taken, the variable had better be stored in a location that has an address. A register doesn’t have an address, so the variable has to be stored in memory whether a register is available or not.

All of this might make it seem that current compilers are terrible at register allocation. However, they’re reasonably good at it and getting better, very slowly. Also, can you imagine yourself writing assembly code while keeping all of this in mind?

You can help the compiler potentially find a better allocation by enabling /LTCG when targeting x86 architectures. If you specify the /GL compiler switch, the generated OBJ files contain C Intermediate Language (CIL) code rather than machine code. Function calling conventions aren’t baked into the CIL code. If a particular function isn’t defined to be exported from the output executable, the compiler can violate its calling convention to improve its performance. This is possible because it can identify all the call sites of the function. Visual C++ takes advantage of this by making all arguments of the function eligible for register allocation regardless of the calling convention. Even if register allocation can’t be improved, the compiler will try to reorder parameters for a more economical alignment and even remove unused parameters. Without the /GL switch, the resulting OBJ files contain machine code in which calling conventions have already been fixed. If a machine-code OBJ file has a call site to a function in a CIL OBJ file, or if the address of the function is taken anywhere, or if it’s virtual, the compiler can no longer optimize its calling convention. Without /LTCG, all functions and methods have external linkage by default, so the compiler can’t apply this technique. However, if a function in an OBJ file has been explicitly defined with internal linkage, the compiler can apply the technique to it, but only within that OBJ file. This technique, referred to in the documentation as a custom calling convention, is important when targeting x86 architectures because the default calling convention, namely __cdecl, isn’t efficient. On the other hand, the default x64 calling convention (which resembles __fastcall) is efficient because the first four arguments are passed in registers. For this reason, custom calling conventions are applied only when targeting x86.
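
As a minimal sketch (not from the article), consider a function with internal linkage whose address is never taken. Because the compiler can see every call site within the translation unit (or across the whole program under /LTCG), it’s free to pass the arguments in registers even though __cdecl would pass them on the stack:

static int weighted_sum(int a, int b, int weight) {  /* Internal linkage */
  return a * weight + b * (100 - weight);
}

int use_it(int x, int y) {
  return weighted_sum(x, y, 75);  /* Only call site; the calling convention can be customized */
}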

Note that even if /LTCG is enabled, the calling convention of an exported function or method can’t be violated because it’s impossible for the compiler to locate all call sites, just as in all the cases previously mentioned.

The effectiveness of register allocation depends on the accuracy of the estimated number of accesses to the variables. Most functions contain conditional statements, jeopardizing the accuracy of these estimates. Profile-guided optimization can be used to fine-tune these estimates.

When /LTCG is enabled and the target platform is x64, the compiler performs interprocedural register allocation. This means it will consider the variables declared within a chain of functions and try to find a better allocation depending on the restrictions imposed by the code in each function. Otherwise, the compiler performs global register allocation in which each function is processed separately (“global” here means the whole function).

Both C and C++ offer the register keyword, enabling the programmer to provide a hint to the compiler regarding which variables to store in registers. In fact, the first version of C introduced this keyword and it was useful at that time (circa 1972) because no one knew how to perform register allocation effectively. (A FORTRAN IV compiler developed by IBM Corp. in the late ’60s for the S/360 series could perform simple register allocation, though. Most S/360 models offered 16 32-bit general-purpose registers and four 64-bit floating-point registers!) Also, just as with many other features of C, the register keyword made it easier to write C compilers. Nearly a decade later, C++ was created and it offered the register keyword because C was considered to be a subset of C++. (Unfortunately, there are many subtle differences.) Since the early ’80s, many effective register allocation algorithms have been implemented, so the existence of the keyword has created a lot of confusion to this day. Most production languages created since then don’t offer such a keyword (including C# and Visual Basic). The keyword has been deprecated since C++11, but not in the latest version of C, C11, and it should be used only for writing benchmarks. The Visual C++ compiler does respect this keyword, if possible. C doesn’t allow the address of a register variable to be taken. C++ does allow it, but then the compiler has to store the variable in an addressable location instead of in a register, violating its manually specified storage class.
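
As a minimal sketch (not from the article), here’s what using the keyword looks like; modern compilers, including Visual C++, are free to ignore the hint:

int sum_to(int m) {
  register int n = 0;               /* Hint: keep n in a register */
  for (register int i = 0; i < m; ++i)
    n += i;
  /* In C, taking &n would be a compile-time error; in C++ it's allowed,
     but doing so forces n into addressable memory. */
  return n;
}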

When targeting the CLR, the compiler has to emit Common Intermediate Language (CIL) code that models a stack machine. In this case, the compiler won’t perform register allocation (though if some of the emitted code is native, register allocation will be performed on it, of course) and will postpone it until runtime to be performed by the just-in-time (JIT) compiler (or the Visual C++ back end in the case of .NET Native compilation). RyuJIT, the JIT compiler that ships with the .NET Framework 4.5.1 and later, implements a pretty decent register allocation algorithm.

Instruction Scheduling

Register allocation and instruction scheduling are among the last optimizations performed by the compiler before it emits the binary.

All but the simplest instructions are executed in multiple stages, where each stage is handled by a specific unit of the processor. To utilize all of these units as much as possible, the processor issues multiple instructions in a pipelined fashion such that different instructions are executing at different stages at the same time. This can significantly improve performance. However, if one of these instructions isn’t ready for execution for some reason, the whole pipeline stalls. This can happen for many reasons, including waiting for another instruction to commit its result; waiting for data to come from memory or disk; or waiting for an I/O operation to complete.

Instruction scheduling is a technique that can mitigate this problem. There are two kinds of instruction scheduling:

  • Compiler-based: The compiler analyzes the instructions of a function to determine those instructions that might stall the pipeline. Then it tries to find a different order of the instructions to minimize the cost of the expected stalls while at the same time preserving the correctness of the program. This is called instruction reordering.
  • Hardware-based: The majority of modern x86, x64 and ARM processors are capable of looking ahead into the stream of instructions (micro-ops, to be accurate) and issuing those instructions whose operands and the required functional unit are available for execution. This is called out-of-order (OoOE or 3OE) or dynamic execution. The result is that the program is being executed in an order that’s different from the original.

There are other reasons that might cause the compiler to reorder certain instructions. For example, the compiler might reorder nested loops so that the code exhibits better locality of reference (this optimization is called loop interchange). Another example is reducing the cost of register spilling by placing instructions that use the same value loaded from memory next to each other, so the value is loaded only once. Yet another example is reducing data and instruction cache misses.
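
Here’s a minimal sketch (not from the article) of loop interchange. Both functions compute the same sum, but the second traverses the matrix row by row, matching its layout in memory, so it exhibits far better locality of reference; a compiler that performs loop interchange would effectively transform the first nest into the second:

#define N 1024
static double a[N][N];

double sum_column_major(void) {   /* Poor locality: each access strides over N doubles */
  double s = 0.0;
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      s += a[i][j];
  return s;
}

double sum_row_major(void) {      /* Good locality: consecutive accesses */
  double s = 0.0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      s += a[i][j];
  return s;
}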

As a programmer, you don’t have to know how a compiler or a processor performs instruction scheduling. However, you should be aware of the ramifications of this technique and how to handle them.

While instruction scheduling preserves the correctness of most programs, it may produce some non-intuitive and surprising results. Figure 2 shows an example in which instruction scheduling causes the compiler to emit code that produces a misleading result. To see this, compile the program as C code (/TC) in Release mode. You can set the target platform to either x86 or x64. Because you’re going to examine the resulting assembly code, specify /FA so the compiler emits an assembly listing.

Figure 2 Instruction Scheduling Example Program

#include <stdio.h>
#include <time.h>
__declspec(noinline) int compute(){
  /* Some code here */
  return 0;
}
int main() {
  time_t t0 = clock();
  /* Target location */
  int result = compute();
  time_t t1 = clock(); /* Function call to be moved */
  printf("Result (%d) computed in %lld ticks.", result, t1 - t0);
  return 0;
}

In this program, I want to measure the running time of the compute function. To do this, I usually wrap the call to the function with calls to a timing function such as clock. Then, by computing the difference between the values returned by clock, I get an estimate of the time the function took to execute. Note that the purpose of this code is not to show you the best way to measure the performance of a piece of code, but to demonstrate the hazards of instruction scheduling.

Because this is C code and because the program is very simple, it’s easy to understand the resulting assembly code. By looking at the assembly code and focusing on the call instructions, you’ll note that the second call to the clock function precedes the call to the compute function (it has been moved to the “target location”), making the measurement completely wrong.

Note that this reordering doesn’t violate the minimum requirements imposed by the standard on conforming implementations, so it’s legal.

But why would the compiler do that? The compiler thought that the second call to clock didn’t depend on the call to compute (indeed, to the compiler, these functions don’t affect each other at all). Also, after the first call to clock, it’s likely the instruction cache contains some of the instructions of that function and the data cache contains some of the data required by these instructions. Calling compute might cause these instructions and data to be overwritten, so the compiler reordered the code accordingly.

The Visual C++ compiler doesn’t offer a switch to turn instruction scheduling off while keeping all other optimizations on. Moreover, this problem can occur due to dynamic execution if the compute function was inlined. Depending on how the execution of the compute function is going and on how far a processor can look ahead, a 3OE processor might decide to begin executing the second call to clock before the compute function completes. Just as with the compiler, the majority of processors don’t let you turn off dynamic execution. But to be fair, it’s very unlikely that this problem will occur because of dynamic execution. How could you tell if it happened, anyway?

The Visual C++ compiler is actually very careful when performing this optimization. It’s so careful that there are many things that prevent it from reordering an instruction (such as a call instruction). I have noticed the following situations that cause the compiler not to move the clock function call to a particular location (the target location):

  • Calling an imported function from any of the functions being called between the location of the function call and the target location. As this code shows, calling any imported function from the compute function causes the compiler to not move the second call to clock:
__declspec(noinline) int compute(){
  int x;
  scanf_s("%d", &x); /* Calling an imported function */
  return x;
}
  • Calling an imported function between the call to compute and the second call to clock:
int main() {
  time_t t0 = clock();
  int result = compute();
  printf("%d", result); /* Calling an imported function */
  time_t t1 = clock();
  printf("Result (%d) computed in %lld.", result, t1 - t0);
  return 0;
}
  • Accessing any global or static variable from any of the functions being called between the location of the function call and the target location. This holds whether the variable is being read or written. The following shows that accessing a global variable from the compute function causes the compiler to not move the second call to clock:
int x = 0;
__declspec(noinline) int compute(){
  return x;
}
  • Marking t1 as volatile.
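To demonstrate the last point, here’s a minimal variation of the Figure 2 program (not from the original article); only the declaration of t1 changes:
int main() {
  time_t t0 = clock();
  int result = compute();
  volatile time_t t1 = clock(); /* Volatile write; the compiler won't hoist it above compute */
  printf("Result (%d) computed in %lld ticks.", result, t1 - t0);
  return 0;
}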

There are other situations that prevent the compiler from reordering instructions. It’s all about the C++ as-if rule, which says the compiler can transform a program that doesn’t contain undefined behavior any way it likes, as long as the observable behavior of the code is guaranteed to remain the same. Visual C++ not only adheres to this rule, but is also much more conservative than it has to be, in order to reduce the time it takes to compile the code. An imported function might cause side effects; library I/O functions and accesses to volatile variables certainly do.

Volatile, Restrict and /favor

Qualifying a variable with the volatile keyword affects both register allocation and instruction reordering. First, the variable won’t be allocated to any register. (Most instructions require some of their operands to be in registers, which means the variable will be loaded into a register, but only to execute some of the instructions that use it.) That is, reading or writing the variable always causes a memory access. Second, writing to a volatile variable has Release semantics, meaning that all memory accesses that occur syntactically before the write will happen before it. Third, reading from a volatile variable has Acquire semantics, meaning that all memory accesses that occur syntactically after the read will happen after it. But there’s a catch: these reordering guarantees are offered only when the /volatile:ms switch is specified. In contrast, the /volatile:iso switch tells the compiler to adhere to the language standard, which offers no such guarantees through this keyword. For ARM, /volatile:iso is the default; for other architectures, the default is /volatile:ms. Before C++11, the /volatile:ms switch was useful because the standard didn’t offer anything for multithreaded programs. However, starting with C11/C++11, using /volatile:ms is strongly discouraged because it makes your code non-portable; you should use atomics instead. It’s worth noting that if your program works correctly under /volatile:iso, it will work correctly under /volatile:ms. More important, however, if it works correctly under /volatile:ms, it may not work correctly under /volatile:iso, because the former provides stronger guarantees than the latter.
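
Here’s a minimal sketch (not from the article) of why these semantics matter; the two functions are meant to run on separate threads. Under /volatile:ms, the write to ready can’t be moved before the write to data, and the read of data can’t be moved before the read of ready; under /volatile:iso, neither guarantee exists and portable code should use C11/C++11 atomics instead:

#include <stdio.h>

int data;
volatile int ready = 0;

void producer(void) {     /* Runs on thread 1 */
  data = 42;              /* Ordinary write */
  ready = 1;              /* Volatile write: Release semantics under /volatile:ms */
}

void consumer(void) {     /* Runs on thread 2 */
  while (!ready) { }      /* Volatile read: Acquire semantics under /volatile:ms */
  printf("%d\n", data);   /* Sees 42 under /volatile:ms */
}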

The /volatile:ms switch implements Acquire and Release semantics. It’s not enough to maintain these at compile time; the compiler might (depending on the target platform) emit extra instructions (such as mfence and xchg) to tell a 3OE processor to maintain these semantics while executing the code. Therefore, volatile variables degrade performance not only because the variables aren’t cached in registers, but also because of the additional instructions that are being emitted.

The semantics of the volatile keyword according to the C# language specification are similar to those offered by the Visual C++ compiler with the /volatile:ms switch specified. There’s a difference, however. The volatile keyword in C# implements Sequentially Consistent (SC) Acq/Rel semantics, while C/C++ volatile under /volatile:ms implements pure Acq/Rel semantics. Remember that C/C++ volatile under /volatile:iso has no Acq/Rel semantics. The details are beyond the scope of this article. In general, memory fences may prevent the compiler from performing many optimizations across them.

It’s very important to understand that if the compiler didn’t offer such guarantees in the first place, then any corresponding guarantees offered by the processor are automatically void.

The __restrict keyword (or restrict) also affects the effectiveness of both register allocation and instruction scheduling. However, in contrast to volatile, restrict can significantly enhance these optimizations. Marking a pointer variable with this keyword in a scope promises that, within that scope, no other pointer created outside the scope is used to modify the object it points to. This keyword enables the compiler to confidently perform many optimizations involving pointers, including automatic vectorization and loop optimizations, and it can reduce the generated code size. You can think of the restrict keyword as a top-secret, high-tech, anti-anti-optimization weapon. It deserves a whole article by itself, so it won’t be discussed in detail here.
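
Here’s a minimal sketch (not from the article) of the keyword in action. Because the parameters are declared __restrict, the compiler knows dst and src don’t alias, so it can keep loaded values in registers and auto-vectorize the loop without emitting runtime overlap checks:

void scale(float * __restrict dst, const float * __restrict src, int n, float factor) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] * factor;   /* No aliasing, so this loop is safe to vectorize */
}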

If a variable is marked with both volatile and __restrict, the volatile keyword will take precedence when making decisions regarding how to optimize the code. In fact, the compiler can totally ignore restrict, but must respect volatile.

The /favor switch might enable the compiler to perform instruction scheduling that’s tuned to the specified architecture. It also can reduce the generated code size because the compiler might have the ability to not emit instructions that check whether a specific feature is supported by the processor. This in turn leads to improved instruction cache hit ratio and better performance. The default is /favor:blend, which results in code with good performance across x86 and x64 processors from Intel Corp. and AMD.
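
For example, when targeting x64 you might build with a command line like the following (a sketch; the file name is hypothetical) to tune the generated code for Intel processors:

cl /O2 /favor:INTEL64 /FA program.c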

Wrapping Up

I discussed two important optimizations performed by the Visual C++ compiler: register allocation and instruction scheduling.

Register allocation is the most important optimization performed by the compiler because accessing a register is much faster than accessing even the cache. Instruction scheduling is also important. However, recent processors have outstanding dynamic execution capabilities, making instruction scheduling less significant than it was before. Still, the compiler can see all instructions of a function, no matter how big it is, while a processor can only see a limited number of instructions. Also, the out-of-order execution hardware is quite power hungry because it’s always working as long as the core is working. Furthermore, x86 and x64 processors implement a memory model that’s stronger than the C11/C++11 memory model and prevents certain instruction reordering that could improve performance. So, compiler-based instruction scheduling is still extremely important for power-limited devices.

Several keywords and compiler switches can affect performance either negatively or positively, so make sure to use them appropriately to ensure your code runs as fast as possible and produces correct results. There are still many more optimizations to talk about—stay tuned!


Hadi Brais is a Ph.D. scholar at the Indian Institute of Technology Delhi (IITD), researching compiler optimizations for the next-generation memory technology. He spends most of his time writing code in C/C++/C# and digging deep into the CLR and CRT. He blogs at hadibrais.wordpress.com. Reach him at hadi.b@live.com.

Thanks to the following technical expert for reviewing this article: Jim Hogg (Microsoft Visual C++ Team)