Secure the Enterprise with 64-Bit Intel Architecture

03/28/2011

Intel Corporation

December 2003

Applies to:
Microsoft® Windows®

Summary: Learn how the characteristics of processor architecture may contribute to or mitigate the severity of the potential consequences of software defects that affect system security.

Introduction
Basics of Software Vulnerability to Attacks by Outside Code
The Impact of Hardware Conventions on Code/Address Separation
Conclusion
Additional Resources

Introduction

Software security has become a very important subject. A great deal of attention has been paid to identifying and analyzing coding practices that are associated with security breaches. What has almost escaped consideration, however, is how the underlying hardware contributes to exposing the security holes created by carelessly written software.

While more than 90 percent of all Microsoft Windows installations run on x86-based computers, the way software behaves on that hardware is not necessarily applicable to other architectures. Various factors, such as differences in software and hardware conventions for different CPU architectures, affect important characteristics of a system's behavior.

These differences can affect the degree of a system's vulnerability to security breaches. As a result, the outcome of an attack may differ considerably from one architecture to another. Such considerations bear more importance as the Windows operating system becomes available for architectures beyond x86, such as Itanium® 2-based hardware.

This article will show how the characteristics of processor architecture may contribute to or mitigate the severity of the potential consequences of software defects that affect system security. It will use the example of one of the most common of such defects, the buffer overrun.

Basics of Software Vulnerability to Attacks by Outside Code

It is safe to say that the vast majority of code (especially system-level code) for Windows is written in C and C++. These programming languages provide software developers with a great deal of flexibility, and C/C++ compilers can generate very high-performance code. These characteristics make C/C++ attractive for system-level programming.

As always, there is a trade-off associated with accepting that convenience, and it most notably comes in the form of a lack of built-in security against application-integrity violations. For instance, C and C++ do not provide automatic checking for array-bounds violation, nor do they enforce strict rules on type casts. This makes it relatively easy for developers to inadvertently introduce potential security breaches into code written in these languages.

One of the most common security defects in software written in C and C++ is the buffer overrun, which occurs when a block of data is copied into a memory buffer that is not long enough to hold the data. A simple example would be copying a string of 300 characters into a buffer that is only 256 bytes long.

The potential for this condition to exist can be widely spread in applications written in C and C++ because they do not have built-in security against buffer overruns. This potential is further exacerbated by a rich set of run-time routines supplied for these languages that implement unsafe memory operations. Code as simple as the example below can open up your system to buffer overruns:

void CopyString(char *Dest, char* Src)
{ 
    strcpy(Dest, Src);
}

If the Src string is longer than the buffer allocated for the Dest string, then the data block (if any) following the Dest string buffer will be overwritten with the spillover data.


Figure 1. Data overwritten with the spillover data
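The overrun described above comes from copying without knowing the size of the destination. As a minimal sketch (the name CopyStringBounded and its extra DestSize parameter are assumptions made for illustration and are not part of the original article), a variant that is told the capacity of Dest can refuse input that does not fit:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded variant of CopyString: the caller passes the
 * capacity of Dest, and the copy never writes past it.
 * Returns 0 if Src fit entirely, or -1 if it was rejected. */
int CopyStringBounded(char *Dest, size_t DestSize, const char *Src)
{
    size_t len;

    if (Dest == NULL || Src == NULL || DestSize == 0)
        return -1;

    len = strlen(Src);
    if (len >= DestSize) {      /* no room for the text plus '\0' */
        Dest[0] = '\0';         /* leave Dest as a valid empty string */
        return -1;
    }

    memcpy(Dest, Src, len + 1); /* copies the terminator as well */
    return 0;
}
```

Rejecting the input, rather than silently truncating it, is a design choice of this sketch; either way the block that follows the buffer is left untouched.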

A buffer overrun will almost certainly lead to data loss in your application, but it will not necessarily open the application to an intrusion. For an application to be open to an intrusion through a buffer overrun, three essential conditions must be met:

1. The data block following a block vulnerable to an overrun must contain a code address.
2. The address from that location must, at some later point, be used for control transfer inside the application.
3. The application must be able to reach that later point of control transfer after the vulnerable block is overrun (that is, it must not crash due to other effects of the overrun, such as general data corruption).

The first condition is the most critical of the three. If it is met, then by performing a buffer overrun on the preceding block with well-crafted data, the intruder gets a chance to overwrite the code-address location and to set it to a value of his choice. If it is set to the address at which he has managed to seed his code inside your application, and conditions 2 and 3 are also met, then your application will eventually jump to and execute his code.

Note that it can generally be presumed that the second condition is always true. That is, if there is a pointer to a code address in your application, then chances are good that it will be used at some point, because modern tools do a good job of removing dead code from executable binaries. It should also be presumed that the third condition holds at least in some cases: your application will be able to continue from the point of overrun and reach the point of control transfer through the overwritten address. In any case, the third condition is determined by internal application logic, and it is largely out of the programmer's control.

The first condition can be illustrated by the following example. Let us say that you have the following sequence of assembly instructions that perform a call:

mov      eax,dword ptr [esp+28h]   <  set up call parameters
push     eax
call     dword ptr ds:[0040E170h]  <  make a call to a target stored 
                                   <  in memory at an offset in the data 
                                   <  segment

If the call-target location in the data section follows a data block that is the destination of a copy operation, then the call-target location could be subject to an overwrite through a buffer overrun on that block.


Figure 2. The call-target location could be subject to an overwrite through a buffer overrun on that block.
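The arrangement in Figure 2 can be mimicked in C with a structure in which a copy destination is immediately followed by a function pointer. The sketch below is illustrative only: struct Request, Handler, BytesBeforeHandler, and SetName are hypothetical names, and the exact offset of the pointer depends on the compiler's padding rules.

```c
#include <stddef.h>
#include <string.h>

typedef int (*Handler)(int);

/* Hypothetical record: a copy destination immediately followed by a
 * code address, the arrangement described by condition 1. */
struct Request {
    char    Name[16];   /* destination of a copy operation */
    Handler OnDone;     /* code address used later for a call */
};

/* Number of bytes between the start of Name and the start of OnDone;
 * writing more than this many bytes into Name reaches OnDone. */
size_t BytesBeforeHandler(void)
{
    return offsetof(struct Request, OnDone);
}

/* Bounded setter: refuses input that would not fit in Name. */
int SetName(struct Request *Req, const char *Src)
{
    size_t len = strlen(Src);

    if (len >= sizeof Req->Name)
        return -1;
    memcpy(Req->Name, Src, len + 1);
    return 0;
}
```

An unchecked strcpy() into Name with a longer string would spill into OnDone, which is the kind of overwrite the call instruction above would later act on.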

Such proximity of code addresses to data could generally be minimized by software tools (such as compilers and linkers) by separating data blocks that are vulnerable to an overrun from code-address locations. The underlying hardware can impose limits, however, on how much freedom the tools can have to perform such separation.

The Impact of Hardware Conventions on Code/Address Separation

One of the most notable of such restrictions is the hardware convention for the CALL instruction on IA-32-based CPUs. The implementation of this instruction in the 32-bit Intel architecture does not provide an explicit way to specify where the return address for the callee is kept. Instead, the CALL instruction always pushes the return address on the stack. A RET (return) instruction in the callee then pops it from there to determine where to return.

To illustrate how calling and returning from a procedure works on IA-32 systems, consider the effect of the following call/return sequence on the memory state of a process running on such a system:

010028B0   push   eax   <  push arguments on the stack
010028B1   push   dword ptr ds:[10096D8h]
010028B7   push   dword ptr ds:[1008810h]
010028BD   call   dword ptr ds:[1001274h] <  make the call (will 
010028C3   test   eax,eax                 <  push the return address 
                                          <  (0x010028C3) on the stack)

Before the call instruction at 0x010028BD is executed, the application's memory stack could look like this:

0006FEC8  00000000
0006FECC  00000000
0006FED0  00000000
0006FED4  00000000
0006FED8  0006FF1C
0006FEDC  010028E4
0006FEE0  0006FEFC
0006FEE4  00040282    <  current ESP
0006FEE8  000502C7

After the call instruction has been executed and control is transferred to the callee, the return address can be found on the stack at the location the stack pointer (the ESP register) points to, and the stack would look like this:

0006FECC  00000000           < callee's future locals area
0006FED0  00000000           < callee's future locals area 
0006FED4  00000000           < callee's future locals area
0006FED8  0006FF1C           < callee's future locals area
0006FEDC  010028E4           < callee's future locals area
0006FEE0  010028C3           <  current ESP – points to procedure's
0006FEE4  00040282           <  return address
0006FEE8  000502C7

When the callee needs to return, it restores the stack to its position upon entry, cleaning up its own stack allocation (locals area), and executes a RET (return) instruction. The return instruction pops the address from the current stack location and transfers control there. There is no way on the 32-bit architecture to specify a return address explicitly.

..........
77D4971F   ret   0Ch   <  will return to the double word currently in 
                       <  memory at 0x0006FEE0 (the only parameter 
                       <  specifies by how much to move the stack pointer 
                       <  to remove the call arguments from it)
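The call/return sequence above can be restated as a toy model in C. This is only a sketch of the convention, not of how a processor is implemented: the Cpu structure, STACK_TOP, and the helper names are assumptions introduced here, and the model performs no range checking on addresses.

```c
#include <stdint.h>

/* Toy model of an IA-32 stack: 32-bit slots, ESP grows downward.
 * All names here are hypothetical. */
#define STACK_TOP   0x00070000u   /* assumed upper bound of the stack */
#define STACK_SLOTS 256u

typedef struct {
    uint32_t Slot[STACK_SLOTS]; /* Slot[i] models address STACK_TOP - 4*(i+1) */
    uint32_t Esp;
    uint32_t Eip;
} Cpu;

/* Maps a modeled stack address to its slot. The caller must keep Addr
 * inside the modeled range; the sketch does not check it. */
static uint32_t *SlotAt(Cpu *C, uint32_t Addr)
{
    return &C->Slot[(STACK_TOP - Addr) / 4u - 1u];
}

void Push(Cpu *C, uint32_t Value)
{
    C->Esp -= 4u;
    *SlotAt(C, C->Esp) = Value;
}

/* CALL: push the address of the next instruction, then jump. */
void Call(Cpu *C, uint32_t Target, uint32_t NextInstr)
{
    Push(C, NextInstr);
    C->Eip = Target;
}

/* RET n: pop the return address from memory, then drop n argument bytes. */
void Ret(Cpu *C, uint32_t ArgBytes)
{
    C->Eip = *SlotAt(C, C->Esp);
    C->Esp += 4u + ArgBytes;
}
```

With Esp set to 0x0006FEE4, Call(&c, target, 0x010028C3) leaves Esp at 0x0006FEE0 with 0x010028C3 stored there, and Ret(&c, 0x0C) then resumes at 0x010028C3 with Esp at 0x0006FEF0. The point of the model is that Ret() reads its target from memory, so whatever happens to be in that slot decides where execution continues.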

Clearly, this convention for passing the return address from the caller to the callee is inherently detrimental to system security. Firstly, the return address ends up being located on the stack right after all of the callee's local data. If the callee performs any unsafe memory operation on local buffers, then the return address location automatically becomes prone to overwrite, due to local buffer overruns, and vulnerability condition 1 is satisfied.

Additionally, the callee is, in most cases, guaranteed to return using the return address, thus satisfying condition 2. Finally, the modular nature of modern software tends to help make procedures short, which helps to produce condition 3 because smaller procedures will have less code between the points of overrun and return, lowering chances for a crash before the point of return is reached.

Consider what can happen in the case of a buffer overrun in the function below. The function is supposed to read a customer's nine-digit zip code from the console as a string and convert it to an integer, which it returns:

int GetZip()
{
    char sZipCode[11]; // shouldn't be more than 10 chars (5+'-'+4)
    int iZipCode;

    // get it from the console
    gets(sZipCode);

    // remove the '-'
    memcpy(sZipCode+5, sZipCode+6, 5); 

    // convert to int
    iZipCode = atoi(sZipCode); 

    return(iZipCode);
}

The VC6.0 optimizing compiler for IA-32 will translate the C code into the following sequence of assembly instructions:

GetZip:
00401010   sub         esp,0Ch     <  allocate stack space for
00401013   lea         eax,[esp]   < sZipCode
00401017   push         eax
00401018   call         _gets (004011c0)
0040101D   mov         ecx,dword ptr [esp+0Ah]
00401021   mov         dl,byte ptr [esp+0Eh]
00401025   lea         eax,[esp+4]
00401029   mov         dword ptr [esp+9],ecx
0040102D   push        eax
0040102E   mov         byte ptr [esp+11h],dl
00401032   call        _atoi (004010fb)
00401037   add         esp,14h
0040103A   ret                     <  return pulling the return address
                                   <  from the stack

The stack layout inside GetZip() would look like this:


Figure 3. The stack layout inside GetZip()

Let's look at the program state using the VC6.0 debugger. You will see the following picture just after entering the function (the instruction pointer eip at 0x00401013 shows that the program is stopped at the second assembly instruction generated for this function):


Figure 4. The program state using the VC6.0 debugger

The procedure has indeed allocated 12 (0xC) bytes on the stack for the sZipCode string (including padding for alignment) at address 0x12FF64, and the 12 bytes are followed by a double word at 0x12FF70 that contains the return address pushed by the call instruction:

0012FF60  00408034
0012FF64  00408058   <  sZipCode
0012FF68  00000040   <  sZipCode
0012FF6C  00408060   <  sZipCode
0012FF70  00401070   <  return address

If an intruder somehow manages to insert his code inside your application, then he can overrun the 12-byte buffer allocated on the stack at 0x12FF64 by supplying a sufficiently long input string to the gets() call. Because the return address is located on the stack right after the sZipCode buffer, the spillover portion of the string could overwrite the return address of GetZip() at 0x12FF70.

If the spillover data is crafted so that it puts the bytes that represent the address where the intruding code has been seeded (let us say it is 0x50607080) over the return address on the stack, then the return instruction will get that address when it pulls the return address from the stack, and it will transfer control there.

The picture below illustrates the memory state at the end of GetZip() after a successful overrun. The stack location at 0x12FF70 that contained the procedure's return address has been overwritten with the address of the intruding code (0x50607080). Executing the return instruction at 0x0040103a will transfer control to this address, and the intruding code will begin executing.


Figure 5. The memory state at the end of GetZip()
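The overwrite can be reproduced safely in a small simulation. The sketch below models the GetZip() frame as a byte array (12 bytes for sZipCode followed by the 4-byte return address); Frame, FrameInit, FrameReturnAddr, and SimulatedGets are hypothetical names, and the extra slack in the array exists only so that the simulated overrun stays within defined behavior.

```c
#include <stdint.h>
#include <string.h>

/* Toy model of the GetZip() frame described above: 12 bytes reserved
 * for sZipCode, followed by the 4-byte return address. Extra slack is
 * included so that the simulated overrun stays inside this array. */
enum { RET_OFFSET = 12, FRAME_BYTES = 32 };

typedef struct {
    uint8_t Bytes[FRAME_BYTES];
} Frame;

void FrameInit(Frame *F, uint32_t ReturnAddr)
{
    memset(F->Bytes, 0, sizeof F->Bytes);
    /* IA-32 is little-endian: least significant byte first. */
    F->Bytes[RET_OFFSET + 0] = (uint8_t)(ReturnAddr);
    F->Bytes[RET_OFFSET + 1] = (uint8_t)(ReturnAddr >> 8);
    F->Bytes[RET_OFFSET + 2] = (uint8_t)(ReturnAddr >> 16);
    F->Bytes[RET_OFFSET + 3] = (uint8_t)(ReturnAddr >> 24);
}

uint32_t FrameReturnAddr(const Frame *F)
{
    return (uint32_t)F->Bytes[RET_OFFSET]
         | (uint32_t)F->Bytes[RET_OFFSET + 1] << 8
         | (uint32_t)F->Bytes[RET_OFFSET + 2] << 16
         | (uint32_t)F->Bytes[RET_OFFSET + 3] << 24;
}

/* Stand-in for gets(): copies the input and its terminator starting at
 * the buffer, ignoring the 12-byte buffer size. Only the size of the
 * whole model limits it, which keeps this simulation well defined. */
void SimulatedGets(Frame *F, const char *Input)
{
    size_t n = strlen(Input) + 1;

    if (n > FRAME_BYTES)
        n = FRAME_BYTES;
    memcpy(F->Bytes, Input, n);
}
```

Feeding the model a well-formed zip code such as "12345-6789" leaves the stored return address alone, while twelve filler bytes followed by the bytes 80 70 60 50 replace it with 0x50607080, the intruder's address used in the text.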

The potential for the described condition, which is more narrowly defined as a stack-buffer overrun, or stack smashing, cannot be fully eliminated from IA-32-compatible systems. There is simply no alternative way to specify a return address for a procedure, other than putting it on the stack. Certain techniques can be employed by compilers to detect (but not prevent) an overwrite of the return-address location, but they generally result in code that is larger and slower, due to the overhead associated with stack-integrity checks.
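One common form of such a stack-integrity check places a known guard value (often called a canary) between the local buffers and the return address and verifies it before returning. The sketch below shows the idea on a simulated frame; it is not the code any particular compiler emits, and GuardedFrame, GUARD_VALUE, and the function names are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Simulated frame: 12-byte buffer, 4-byte guard, 4-byte return address. */
enum { CANARY_OFFSET = 12, RET_OFFSET = 16, FRAME_BYTES = 32 };

/* Hypothetical fixed guard; a real scheme would use an unpredictable
 * value chosen at program start. */
#define GUARD_VALUE 0x0BADC0DEu

typedef struct {
    uint8_t Bytes[FRAME_BYTES];
} GuardedFrame;

static void Store32(uint8_t *P, uint32_t V)
{
    P[0] = (uint8_t)V;
    P[1] = (uint8_t)(V >> 8);
    P[2] = (uint8_t)(V >> 16);
    P[3] = (uint8_t)(V >> 24);
}

static uint32_t Load32(const uint8_t *P)
{
    return (uint32_t)P[0] | (uint32_t)P[1] << 8
         | (uint32_t)P[2] << 16 | (uint32_t)P[3] << 24;
}

/* Prologue: write the guard just below the return address. */
void GuardedFrameInit(GuardedFrame *F, uint32_t ReturnAddr)
{
    memset(F->Bytes, 0, sizeof F->Bytes);
    Store32(F->Bytes + CANARY_OFFSET, GUARD_VALUE);
    Store32(F->Bytes + RET_OFFSET, ReturnAddr);
}

/* Unsafe copy into the buffer, limited only by the size of the model. */
void UncheckedCopy(GuardedFrame *F, const char *Input)
{
    size_t n = strlen(Input) + 1;

    if (n > FRAME_BYTES)
        n = FRAME_BYTES;
    memcpy(F->Bytes, Input, n);
}

/* Epilogue check: 1 if the guard is unchanged, 0 if it was overwritten. */
int GuardIntact(const GuardedFrame *F)
{
    return Load32(F->Bytes + CANARY_OFFSET) == GUARD_VALUE;
}
```

Because a linear overrun has to cross the guard to reach the return address, a changed guard signals the overwrite. The check runs on every return, which is the size and speed overhead mentioned above, and it only reports the damage after it has happened.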

Furthermore, the stack-buffer overrun is one of the easiest security defects to exploit. Unlike other types of buffer overruns, such as heap overruns, which may require fairly esoteric code/data patterns to be exploitable, stack-buffer overruns only require that some local buffers in a procedure are used in unsafe memory operations.
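For comparison, the unsafe operation in GetZip() is the gets() call, which has no way to know the size of sZipCode. A hedged sketch of a bounded alternative follows; ParseZip and GetZipChecked are hypothetical names, the code assumes int is at least 32 bits wide (as it is on Windows), and it uses only the standard fgets(), strcspn(), and isdigit() routines.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parses "ddddd-dddd" into a nine-digit integer.
 * Returns 0 on success and stores the value in *Zip, else -1. */
int ParseZip(const char *Text, int *Zip)
{
    int value = 0;
    size_t i;

    if (Text == NULL || Zip == NULL || strlen(Text) != 10 || Text[5] != '-')
        return -1;

    for (i = 0; i < 10; i++) {
        if (i == 5)
            continue;               /* skip the '-' */
        if (!isdigit((unsigned char)Text[i]))
            return -1;
        value = value * 10 + (Text[i] - '0');
    }
    *Zip = value;                   /* at most 999999999 */
    return 0;
}

/* Bounded replacement for GetZip(): fgets() never writes more than
 * sizeof Line bytes, so over-long input is rejected, not spilled. */
int GetZipChecked(FILE *In, int *Zip)
{
    char Line[16];
    size_t len;

    if (In == NULL || fgets(Line, sizeof Line, In) == NULL)
        return -1;

    len = strcspn(Line, "\r\n");    /* length without the line ending */
    Line[len] = '\0';
    return ParseZip(Line, Zip);
}
```

A call such as GetZipChecked(stdin, &zip) reads at most 15 characters, so an over-long line fails validation instead of reaching the return address. The sketch leaves any unread remainder of such a line in the input stream.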

However, it is important to understand that this convention for passing the return address on the stack is a property of the IA-32 architecture and its extensions (such as AMD64*), not of processors in general. The Itanium Processor Family (IPF) architecture employs a different convention for its BR.CALL instruction (the functional equivalent of the IA-32 CALL instruction). Instead of pushing the return address on the stack, the 64-bit Intel architecture requires the call instruction itself to name explicitly the branch register that receives the return address. A typical instruction sequence implementing a call on 64-bit Intel architecture will look like this:

000006FB7BE31B90 adds                r34 = 0x18,sp     <  load arguments
000006FB7BE31B94 adds                r33 = 0x10,sp
000006FB7BE31B98 nop.b                0 ;;
000006FB7BE31BA0 ld8                 r31 = [r43],8 ;;  <  load the
000006FB7BE31BA4 ld8                 gp = [r43]        <  target address
000006FB7BE31BA8 mov                 b7 = r31
000006FB7BE31BB0 nop.m               0
000006FB7BE31BB4 nop.m               0
000006FB7BE31BB8 br.call.dptk.many   b0 = b7 ;;        <  make the call,
000006FB7BE31BC0 adds                gp = 0, r40       <  will place the
                                                       <  return address
                                                       <  (0x6FB7BE31BC0)
                                                       <  in branch
                                                       <  register b0

Consequently, a return instruction in the callee (the BR.RET instruction on Itanium 2-based systems) requires an explicit specification of a branch register that contains the return link. The return instruction will load the return address from that register, rather than pulling it from the stack. The MSVC* Itanium compiler will, for instance, produce the following assembly code for the GetZip() function:

0000000000403820 alloc               r36 = ar.pfs,6,0,1,0 
0000000000403824 mov                 r35 = b0          <  if the return
0000000000403828 adds                sp = -0x10,sp     <  address needs
0000000000403830 adds                r37 = 0,gp ;;     <  saving, it
0000000000403834 ld8.nta             r3 = [sp]         <  is stored in a
0000000000403838 adds                r38 = 0x10,sp     <  local register
0000000000403840 adds                r34 = 0x16,sp
0000000000403844 adds                r32 = 0x17,sp 
0000000000403848 br.call.sptk.many   b0 = .gets ;; 
0000000000403850 adds                r33 = 0x15,sp 
0000000000403854 ld1                r30 = [r32],2 
0000000000403858 adds               r29 = 0,r34 
0000000000403860 adds                gp = 0,r37 
0000000000403864 adds                r38 = 0x10,sp 
0000000000403868 nop.b               0 ;; 
0000000000403870 ld1                r31 = [r34],2 ;; 
0000000000403874 st1                [r33] = r31,2 
0000000000403878 nop.i                0 
0000000000403880 st1                [r29] = r30,2 
0000000000403884 ld1                r28 = [r34],2 
0000000000403888 nop.b                0 ;; 
0000000000403890 ld1                r27 = [r32] 
0000000000403894 st1                [r33] = r28,2 
0000000000403898 nop.b              0 ;; 
00000000004038A0 st1                 [r29] = r27 
00000000004038A4 ld1                 r26 = [r34] 
00000000004038A8 nop.b                0 ;; 
00000000004038B0 st1                [r33] = r26 
00000000004038B4 nop.b                 0 
00000000004038B8 br.call.sptk.many   b0 = .atoi ;; 
00000000004038C0 adds                gp = 0,r37 
00000000004038C4 mov                b0 = r35,GetZip+0xA0 
00000000004038C8 adds                sp = 0x10,sp ;; 
00000000004038D0 nop.m                0 
00000000004038D4 mov.i                ar.pfs = r36 
00000000004038D8 br.ret.sptk.many    b0 ;;      <  the return address is
                                                <  taken from a branch
                                                <  register b0 (not from
                                                <  the stack)

Because the return address does not have to be present on the stack (or elsewhere in memory, for that matter), it is almost impossible for it to become the target of a buffer overrun on Itanium 2-based systems. In fact, even when the return branch register (b0 here) needs to be saved/restored inside GetZip(), it is generally not spilled into memory and then filled from there. Rather, it is saved in a local stack register (r35 in this example), a feature unique to the Itanium Processor Family, where it is not an easy overrun target either.

Let us see how the GetZip() function will behave on an Itanium 2-based system when provided with the same input data that caused the overwrite of the return address on a 32-bit system.

The picture below illustrates the program state at the entry into GetZip() (ip = 0x403820). The memory buffer for sZipCode[] has again been allocated on the stack at memory location 0x6fbffb8ffc0, and b0 contains the return address of 0x403970.


Figure 6. The program state at the entry into GetZip() (ip = 0x403820)

At the procedure's return point, the same data that corrupted the return address in the 32-bit version has been processed, and the memory window shows that the 11-byte buffer allocated to sZipCode (at 0x6fbffb8ffc0) has been overrun. The memory stack, in fact, looks exactly the same as it did on the 32-bit system. What is different, however, is the effect on the return address. On 64-bit Intel architecture, it is still intact in b0, and control will be transferred back to the caller, rather than to the intruding code. Application security on the Itanium 2-based system is not compromised by this overrun.


Figure 7. The procedure's return point
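The contrast with the 32-bit case can be restated with a toy model in which the return link is held in a register field that is separate from the bytes standing in for the memory stack. This is only a sketch of the convention, not of the hardware: IpfState and the function names are hypothetical, and the model ignores details such as how stacked registers are managed.

```c
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 32 };

/* Toy model: the return link lives in a register field (B0), not in
 * the byte array that stands for the memory stack. */
typedef struct {
    uint64_t B0;                    /* branch register holding the return address */
    uint8_t  Stack[STACK_BYTES];    /* memory stack; sZipCode starts at Stack[0]  */
} IpfState;

/* Models entry into the callee after br.call has set b0. */
void IpfEnter(IpfState *S, uint64_t ReturnAddr)
{
    S->B0 = ReturnAddr;
    memset(S->Stack, 0, sizeof S->Stack);
}

/* Unsafe copy into the buffer, limited only by the size of the model. */
void IpfUncheckedCopy(IpfState *S, const char *Input)
{
    size_t n = strlen(Input) + 1;

    if (n > STACK_BYTES)
        n = STACK_BYTES;
    memcpy(S->Stack, Input, n);
}

/* br.ret b0: the return target is read from the register. */
uint64_t IpfReturnTarget(const IpfState *S)
{
    return S->B0;
}
```

The same over-long input used in the 32-bit case still spills past the end of the buffer in Stack, but IpfReturnTarget() continues to report the address that was in B0 at entry (0x403970 in the example above), because the copy never touches the register field.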

Conclusion

Code security should always start at the application level, and nothing can improve it more than robust, well-designed code. It will certainly take a while, however, to identify and correct all security problems in the legacy code that is already in the field, and one must still consider the possibility of having to run code with security defects in it.

When faced with this fact, it is important to understand how your choice of CPU architecture can impact your application's vulnerability. Making an educated choice is especially important when considering the architecture for a server-class system. Due to the nature of applications running on servers, which typically hold and process large amounts of data and maintain multiple client connections, system security breakdowns in a server system can have far more significant consequences than an analogous breakdown on a desktop system.

While the choice of a system based on IA-32 architecture often represents a solution that is both economical and adequate in terms of performance and functionality, certain legacy features of this architecture make it susceptible to some of the most common software-security defects.

When security is of paramount importance, an Itanium 2-based system offers an edge, as its architectural features make it more resistant to the exploitation of such defects. This advantage could mean the difference between your entire installation succumbing to the next worm attack, or being able to withstand it gracefully.

Additional Resources

Intel, the world's largest chipmaker, also provides an array of value-added products and information to software developers:

The Early Access Program provides software vendors with Intel's latest technologies, helping member companies to improve product lines and grow market share.
Intel Developer Services offers free articles and training to help software developers maximize code performance and minimize time and effort.
Intel Software Development Products include Compilers, Performance Analyzers, Performance Libraries and Threading Tools.
Intel Solution Services is a worldwide consulting organization that helps solution providers, solution developers, and end customers develop cost-effective e-Business solutions to complex business problems.
IT@Intel, through a series of white papers, case studies, and other materials, describes the lessons it has learned in identifying, evaluating, and deploying new technologies.
The Intel Software College provides a one-stop shop at Intel for training developers on leading-edge software-development technologies. Training consists of online and instructor-led courses covering all Intel architectures, platforms, tools, and technologies.