
Visual C++ Optimization Overview

Visual Studio .NET 2003
 

Kate Gregory
Gregory Consulting

April 2004

Applies to:
Microsoft® Visual C++® .NET 2003
Microsoft® Visual C++® Toolkit 2003
Microsoft® Visual Studio® .NET

Summary: Demonstrates a few of the many code-optimization features offered by the Visual C++ 2003 compiler. (8 printed pages)

This article is part of a code sample included with the Visual C++ Toolkit 2003, available for download at http://msdn.microsoft.com/visualc/vctoolkit2003.

Contents

Whole Program Optimization
Optimize Code for the Intel Pentium 4 or AMD Athlon
Streaming SIMD Extensions 2
If You Have Visual Studio
Conclusion
Related Books


The Microsoft® Visual C++® Toolkit 2003 includes an optimizing C++ compiler. Most of the switches are fairly straightforward and have been in the Visual C++ product for many versions, although two are more recent and can produce dramatic speed improvements without any need to rewrite code. These are /GL, Whole Program Optimization, and /G7, which emits code optimized for the Pentium 4 or AMD Athlon. A related option is /arch:SSE2, which emits code targeted at the SSE2 registers and instructions.

The sample code runs three tests:

  1. Call functions that are inline candidates.
  2. Perform a large number of integer multiplications and additions.
  3. Perform a large number of floating point multiplications and additions.

Whole Program Optimization

The sample code defines two very similar functions, Add() and DisplayAdd(). DisplayAdd() writes to the screen and is unlikely to be inlined as a result:

void DisplayAdd(int a, int b)
{
   cout << a << " + " << b << " = " << a + b << endl;
   cout << "Return address from " << __FUNCTION__ 
        << " " << _ReturnAddress() << endl;
}

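_ReturnAddress is an intrinsic function that reports the address to which control will return when the current function exits. It's useful for identifying inlined functions: a function that has been inlined has no call of its own, so it reports the same return address as the function it was inlined into.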

Add() is declared in gl-g7.cpp, along with a global variable it sets:

void* inlineReturnAddress; // set in Add()
int Add(int a, int b); // implementation in module.cpp

The implementation is in module.cpp:

int Add(int a, int b)
{
   inlineReturnAddress = _ReturnAddress();
   return a+b;
}

To compile this program without Whole Program Optimization, use this command line:

cl /O2 /ML /EHsc GL-G7.cpp module.cpp 

To run Test 1, use this command:

gl-g7 1

You should see output similar to this (the numerical addresses will vary):

1 + 2 = 3
Return address from DisplayAdd 00401D0A
1 + 2 = 3
Return address from Add 00401D13
Return address from Test1 00402125

The return address from Add() is not the same as from Test1(): Add() was not inlined.

Now recompile with /GL:

cl /O2 /ML /EHsc /GL GL-G7.cpp module.cpp 

Run Test 1 again and you should see output like this:

1 + 2 = 3
Return address from DisplayAdd 00401242
1 + 2 = 3
Return address from Add 0040179F
Return address from Test1 0040179F

Now the return addresses from Add() and Test1() are the same: Add() was inlined within Test1() even though its code came from another file.

Optimize Code for the Intel Pentium 4 or AMD Athlon

The /G7 option is new in Microsoft® Visual Studio® .NET 2003; it produces code that is optimized for the Pentium 4 or the AMD Athlon by choosing different instructions than it otherwise would. The best performance improvement is seen in routines that multiply integers, especially multiplying an integer by a constant known at compile time.

Test 2 demonstrates the speed improvements that are possible:

#define INT_ARRAY_LEN 100000
int intarray[INT_ARRAY_LEN];
int intCalculate()
{
   int total = 0;

   for (int i = 1; i < INT_ARRAY_LEN; i++)
   {
         total += intarray[i-1]*7;
   }

   return total;
}

void Test2()
{
   int var1 = 2;
   int i;

   for (i = 0; i < INT_ARRAY_LEN; i++)
   {
         intarray[i] = i*5;
         var1 += 2;
   }

   LARGE_INTEGER start, end;
   LARGE_INTEGER freq;

   SetThreadAffinityMask(GetCurrentThread(), 1);
   QueryPerformanceFrequency(&freq);
   QueryPerformanceCounter(&start);
   double total = 0;

   for (i = 0; i < 100000; i++)
   {
      total += intCalculate();
   }

   QueryPerformanceCounter(&end);

   cout << "Total = " << total << endl;

   cout << (end.QuadPart - start.QuadPart)/(double)freq.QuadPart << " seconds" << endl;
}

This code uses some timing functions that are implemented in kernel32.dll, part of Microsoft® Windows®. These functions and the data types they use are defined in windows.h. To reduce dependencies in this sample, the functions are prototyped, and the data types are defined, within gl-g7.cpp. QueryPerformanceCounter saves a starting or ending position, and QueryPerformanceFrequency gets a value to divide into the difference between starting and ending, producing an elapsed time in seconds. The call to SetThreadAffinityMask reduces artifacts on multiprocessor machines.

This routine performs a lot of integer multiplication. To compile it without processor-specific instructions, use this command line:

cl /O2 /ML /EHsc GL-G7.cpp module.cpp 

To compile it for a Pentium 4 or AMD Athlon machine, use this command line:

cl /O2 /ML /EHsc /G7 GL-G7.cpp module.cpp 

To run Test 2, use this command line:

gl-g7 2

On a Pentium 4 or AMD Athlon machine, the /G7 version runs over 10% faster. This code can be run on a machine without the appropriate chip, but will be slightly slower than the version compiled without /G7.

Streaming SIMD Extensions 2

If you are sure you are building code for a computer that has SSE2 support, such as a Pentium 4 or AMD Athlon machine, you can use the /arch:SSE2 option. This produces code that will not run on chips without SSE2 support, but it can be faster, especially for routines with a lot of floating point arithmetic.

Test 3 performs a floating point calculation very similar to Test 2:

#define ARRAY_LEN 10000
double array[ARRAY_LEN];

double Calculate()
{
   double total = 0;

   for (int i = 1; i < ARRAY_LEN; i++)
   {
         total += array[i-1]*array[i];
   }

   return total;
}

void Test3()
{
   double var1 = 2;
   int i;

   for (i = 0; i < ARRAY_LEN; i++)
   {
         array[i] = var1;
         var1 += .012;
   }

   LARGE_INTEGER start, end;
   LARGE_INTEGER freq;

   SetThreadAffinityMask(GetCurrentThread(), 1);
   QueryPerformanceFrequency(&freq);
   QueryPerformanceCounter(&start);
   double total = 0;

   for (i = 0; i < 100000; i++)
   {
      total += Calculate();
   }

   QueryPerformanceCounter(&end);

   cout << "Total = " << total << endl;

   cout << (end.QuadPart - start.QuadPart)/(double)freq.QuadPart << " seconds" << endl;
}

To compile it without processor-specific instructions, use this command line:

cl /O2 /ML /EHsc GL-G7.cpp module.cpp 

To compile it for a Pentium 4 or AMD Athlon machine only, use this command line:

cl /O2 /ML /EHsc /G7 /arch:SSE2 GL-G7.cpp module.cpp 

To run Test 3, use this command line:

gl-g7 3

On a Pentium 4 or AMD Athlon machine, the /G7 /arch:SSE2 version runs about 10% faster. This code cannot be run on a machine without the appropriate chip.

If You Have Visual Studio

All these options are available on the Project Properties dialog:

Figure 1. General project properties

Figure 2. C/C++ optimization options

Figure 3. C/C++ code generation options

If you want to produce tailored versions for specific chips, you can create several configurations, each with a different combination of options.

Conclusion

Different programs respond to optimization in different ways. While the module-by-module optimization is good, adding whole program optimization can make a dramatic improvement. Since you don't need to change your code to use it, there's no reason not to.

If most of your user base, or all the performance-sensitive users, have Pentium 4 or AMD Athlon machines, use the /G7 option to produce faster code for these users, knowing that it will be slightly slower for the rest of your users. If you're going to create specific optimized versions for Pentium 4 or AMD Athlon machines, use the /arch:SSE2 option as well for maximum gain.

Related Books

Microsoft Visual C++ .NET 2003 Kick Start by Kate Gregory

 

About the author

Kate Gregory is a Microsoft Regional Director, a C++ MVP, and the author of Microsoft Visual C++ .NET 2003 Kick Start. She is a founding partner of Gregory Consulting, which provides consulting and development services throughout North America, specializing in software development, integration projects, technical writing, mentoring, and training with leading-edge technologies.
