Code Optimization with the DirectXMath Library
This topic describes optimization considerations and strategies with the DirectXMath Library.
- Use accessors sparingly
- Use correct compilation settings
- Use Est functions when appropriate
- Use Aligned Data Types and Operations
- Properly Align Allocations
- Avoid Operator Overloads When Possible
- Take Advantage of the Integer Floating Point Duality
- Prefer Template Forms
- Using DirectXMath with Direct3D
Vector-based operations use the SIMD instruction sets and these make use of special registers. Accessing individual components requires moving from the SIMD registers to the scalar ones and back again.
When possible, it is more efficient to initialize all of the components of an XMVECTOR at one time, instead of using a series of individual vector accessors.
For Windows x86 targets, enable /arch:SSE2. For all Windows targets, enable /fp:fast.
By default, compilation against the DirectXMath Library for Windows x86 targets is done with _XM_SSE_INTRINSICS_ defined. This means that all DirectXMath functionality will make use of SSE2 instructions. However, the same is not true for other code.
Code outside of DirectXMath is handled using compiler defaults. Without the /arch:SSE2 switch, the generated code may often use the less efficient x87 instructions.
We highly recommend that you always use the latest available version of the compiler.
Many functions have an equivalent estimation function ending in Est. These functions trade some accuracy for improved performance. Est functions are appropriate for non-critical calculations where accuracy can be sacrificed for speed. The exact loss of accuracy and gain in speed are platform-dependent.
The SIMD instruction sets on versions of Windows supporting SSE2 typically have aligned and unaligned versions of memory operations. The aligned operations are faster and should be preferred wherever possible.
The DirectXMath Library provides access to aligned and unaligned functionality through variant vector types, structures, and functions. The aligned variants are indicated by an "A" at the end of the name (for example, XMFLOAT4A and XMLoadFloat4A).
The aligned versions of the SSE intrinsics underlying the DirectXMath Library are faster than the unaligned.
For this reason, DirectXMath operations using XMVECTOR and XMMATRIX objects assume those objects are 16-byte aligned. This is automatic for stack-based allocations if code is compiled against the DirectXMath Library using the recommended Windows compiler settings (see Use Correct Compilation Settings). However, it is important to ensure that heap allocations containing XMVECTOR and XMMATRIX objects, or casts to these types, meet these alignment requirements.
While memory allocations on 64-bit Windows are 16-byte aligned, memory allocated on 32-bit versions of Windows is only 8-byte aligned by default. For information on controlling memory alignment, see _aligned_malloc.
When using aligned DirectXMath types with the Standard Template Library (STL), you will need to provide a custom allocator that ensures the 16-byte alignment. See the Visual C++ Team blog for an example of writing a custom allocator (instead of malloc/free you'll want to use _aligned_malloc and _aligned_free in your implementation).
Note Some STL templates modify the provided type's alignment. For example, make_shared<> adds some internal tracking information which may or may not respect the alignment of the provided user type, resulting in unaligned data members. In this case, you need to use unaligned types instead of aligned types. If you derive from existing classes, including many Windows Runtime objects, you can also modify the alignment of a class or structure.
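The custom allocator mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the allocator from the referenced blog post: Vec4A is a hypothetical stand-in for an aligned DirectXMath type, and the names Aligned16Allocator, AlignedAlloc16, AlignedFree16, and IsAligned16 are invented for this example. On Windows it uses _aligned_malloc and _aligned_free as suggested; on other platforms it assumes posix_memalign is available so the sketch can be built elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

#if defined(_WIN32)
#include <malloc.h>
// Windows: use the CRT aligned allocation routines named in the text.
inline void* AlignedAlloc16(std::size_t bytes) { return _aligned_malloc(bytes, 16); }
inline void AlignedFree16(void* p) { _aligned_free(p); }
#else
#include <stdlib.h>
// Assumption: a POSIX platform that provides posix_memalign.
inline void* AlignedAlloc16(std::size_t bytes) {
    void* p = nullptr;
    return posix_memalign(&p, 16, bytes) == 0 ? p : nullptr;
}
inline void AlignedFree16(void* p) { free(p); }
#endif

// Returns true when the address is a multiple of 16 bytes.
inline bool IsAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical stand-in for a 16-byte aligned type such as XMFLOAT4A.
struct alignas(16) Vec4A {
    float x, y, z, w;
};

// Minimal allocator that always hands out 16-byte aligned storage.
template <class T>
struct Aligned16Allocator {
    using value_type = T;

    Aligned16Allocator() noexcept = default;
    template <class U>
    Aligned16Allocator(const Aligned16Allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow in the byte count.
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = AlignedAlloc16(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        assert(IsAligned16(p));
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { AlignedFree16(p); }
};

// All instances are interchangeable because the allocator holds no state.
template <class T, class U>
bool operator==(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) noexcept { return false; }

// Container whose element storage is guaranteed to be 16-byte aligned.
using AlignedVec4Array = std::vector<Vec4A, Aligned16Allocator<Vec4A>>;
```

A container declared as AlignedVec4Array keeps its element storage 16-byte aligned across reallocations, which can be checked with IsAligned16 on the pointer returned by data().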
As a convenience feature, a number of types such as XMVECTOR and XMMATRIX have operator overloads for common arithmetic operations. Such operator overloads tend to create numerous temporary objects. We recommend that you avoid these operator overloads in performance sensitive code.
To support computations close to 0, the IEEE 754 floating-point standard includes support for gradual underflow. Gradual underflow is implemented through the use of denormalized values, and many hardware implementations are slow when handling denormals. An optimization to consider is to disable the handling of denormals for the vector operations used by DirectXMath.
Changing the handling of denormals is done by using the _controlfp_s routine on a per-thread basis, and can result in performance improvements. Use this code to change the handling of denormals:
#include <float.h>
unsigned int control_word;
_controlfp_s( &control_word, _DN_FLUSH, _MCW_DN );
Note On 64-bit versions of Windows, SSE instructions are used for all computations, not just the vector operations. Changing the denormal handling affects all floating-point computations in your program, not just the vector operations used by DirectXMath.
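To make the term "denormalized value" concrete, the following sketch shows one way to recognize such values in single precision. It uses only the C++ standard library, does not call DirectXMath, and does not change the floating-point control word. The helper names IsDenormal and HasDenormalBitPattern are hypothetical, and the sketch assumes the default floating-point mode, in which denormals are not flushed to zero.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");

// Classification via the standard library.
inline bool IsDenormal(float f) {
    return std::fpclassify(f) == FP_SUBNORMAL;
}

// The same test expressed on the raw bits: a denormal has an all-zero
// exponent field and a non-zero mantissa.
inline bool HasDenormalBitPattern(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;
    const std::uint32_t mantissa = bits & 0x007FFFFFu;
    return exponent == 0 && mantissa != 0;
}
```

For example, std::numeric_limits<float>::min() is the smallest normalized value, and halving it produces a denormal rather than zero. That behavior is the gradual underflow that flush-to-zero mode disables.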
DirectXMath supports vectors of four single-precision floating-point values or four 32-bit (signed or unsigned) integer values.
The instruction sets used to implement the DirectXMath Library can treat the same data as multiple different types (for example, treating the same vector as both floating-point and integer data), which makes certain optimizations possible. You can get these optimizations by using the integer vector initialization routines and bit-wise operators to manipulate floating-point values.
The binary format of single-precision floating-point numbers used by the DirectXMath Library completely conforms to the IEEE 754 standard:
SIGN    EXPONENT   MANTISSA
X       XXXXXXXX   XXXXXXXXXXXXXXXXXXXXXXX
1 bit   8 bits     23 bits
- Positive zero is 0
- Negative zero is 0x80000000
- Q_NAN is 0x7FC00000
- +INF is 0x7F800000
- -INF is 0xFF800000
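The bit patterns listed above can be checked, and put to use, with a few lines of scalar code. The sketch below uses only the C++ standard library, and the helper names (FloatBits, BitsToFloat, AbsViaMask, NegateViaXor) are hypothetical. It shows the idea for a single value; DirectXMath applies the same kind of bit-wise manipulation to all four lanes of a vector through its integer initialization and bit-wise routines.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559, "IEEE 754 float expected");
static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");

// Reinterpret the 32 bits of a float as an unsigned integer (no conversion).
inline std::uint32_t FloatBits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterpret a 32-bit pattern as a float.
inline float BitsToFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Absolute value computed by clearing the sign bit (bit 31).
inline float AbsViaMask(float f) {
    return BitsToFloat(FloatBits(f) & 0x7FFFFFFFu);
}

// Negation computed by flipping the sign bit.
inline float NegateViaXor(float f) {
    return BitsToFloat(FloatBits(f) ^ 0x80000000u);
}
```

Because the sign occupies a single bit, an absolute value or a negation reduces to one bit-wise AND or XOR with a constant mask, with no floating-point arithmetic involved.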
Template forms exist for XMVectorSwizzle, XMVectorPermute, XMVectorInsert, XMVectorShiftLeft, XMVectorRotateLeft, and XMVectorRotateRight. Using these instead of the general function form allows the compiler to create much more efficient implementations. For SSE, this often collapses down to one or two _mm_shuffle_ps operations. For ARM-NEON, the XMVectorSwizzle template can utilize a number of special cases rather than the more general VTBL swizzle/permute.
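The difference between the two forms can be illustrated without the library. In the sketch below, Float4 is a hypothetical stand-in for XMVECTOR, and SwizzleRuntime and SwizzleConstant are invented names, not DirectXMath functions. The point is only that the template form exposes the lane indices as compile-time constants, which lets them be validated at compile time and lets the compiler select a fixed shuffle.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a four-lane vector type.
using Float4 = std::array<float, 4>;

// General form: lane indices are run-time values, so they can only be
// checked at run time and each lane needs an indexed load.
inline Float4 SwizzleRuntime(const Float4& v, std::uint32_t e0, std::uint32_t e1,
                             std::uint32_t e2, std::uint32_t e3) {
    assert(e0 < 4 && e1 < 4 && e2 < 4 && e3 < 4);
    return Float4{{v[e0], v[e1], v[e2], v[e3]}};
}

// Template form: lane indices are compile-time constants, so invalid
// indices are rejected during compilation.
template <std::uint32_t E0, std::uint32_t E1, std::uint32_t E2, std::uint32_t E3>
inline Float4 SwizzleConstant(const Float4& v) {
    static_assert(E0 < 4 && E1 < 4 && E2 < 4 && E3 < 4, "lane index out of range");
    return Float4{{v[E0], v[E1], v[E2], v[E3]}};
}
```

With DirectXMath itself, the corresponding choice is between XMVectorSwizzle<3, 2, 1, 0>(v) and XMVectorSwizzle(v, 3, 2, 1, 0).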
A common use for DirectXMath is to perform graphics computations for use with Direct3D. With Direct3D 10.x and Direct3D 11.x, you can use the DirectXMath library in these direct ways:
- Use the Colors namespace constants directly in the ColorRGBA parameter in a call to the ID3D11DeviceContext::ClearRenderTargetView or ID3D10Device::ClearRenderTargetView method. For Direct3D 9, you must convert to the XMCOLOR type to use it as the Color parameter in a call to the IDirect3DDevice9::Clear method.
- Use the XMFLOAT4/XMVECTOR and XMFLOAT4X4/XMMATRIX types to set up constant buffer structures for reference by HLSL float4 or matrix/float4x4 types.
Note XMFLOAT4X4/XMMATRIX types are in row-major format. Therefore, if you use the /Zpc compiler switch (the D3DCOMPILE_PACK_MATRIX_COLUMN_MAJOR compile flag) or omit the row_major keyword when you declare the matrix type in HLSL, so that the HLSL default column-major packing applies, you must transpose the matrix when you set it into the constant buffer.
- With Direct3D 10.x and Direct3D 11.x, you can assume that the pointer returned by the Map method (for example, ID3D11DeviceContext::Map) in the pData member (D3D10_MAPPED_TEXTURE2D.pData, D3D10_MAPPED_TEXTURE3D.pData, or D3D11_MAPPED_SUBRESOURCE.pData) is 16-byte aligned if you use feature level 10_0 or higher or whenever you use D3D11_USAGE_STAGING resources.
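The transposition mentioned in the note on matrix layout can be pictured with plain arrays. The sketch below is a layout illustration only: Matrix4x4 and Transpose are hypothetical names, and no Direct3D or DirectXMath calls are made. It shows that transposing a row-major 4x4 matrix yields the same sequence of 16 floats that column-major storage of the original matrix would have.

```cpp
#include <array>
#include <cstddef>

// Sixteen contiguous floats; element (r, c) of a row-major matrix is
// stored at index r * 4 + c.
using Matrix4x4 = std::array<float, 16>;

// Returns the transpose: element (r, c) of the input lands at (c, r).
// Read as a flat array, the result is the column-major layout of the input.
inline Matrix4x4 Transpose(const Matrix4x4& m) {
    Matrix4x4 t{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            t[c * 4 + r] = m[r * 4 + c];
        }
    }
    return t;
}
```

With DirectXMath, the equivalent step would typically be a call to XMMatrixTranspose followed by XMStoreFloat4x4 into the constant buffer structure.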