Application Software Considerations for NUMA-Based Systems
Updated: March 5, 2003
The majority of Microsoft Windows compliant, high-end server platforms that will be developed over the next three years--that is, server platforms that can run a single instance of Windows on eight or more processors--will have a cache-coherent non-uniform memory architecture (NUMA).
This white paper is intended to help software vendors understand the NUMA architecture and learn how it might affect execution and performance of applications that will be supported on the Windows Server 2003 operating system. The specific goals of this paper are to:
As processor clock rates continue to increase, it becomes more difficult, and therefore more expensive, to provide the bandwidth and latency needed to support large numbers of processors on a single system bus. As a result, the trend has been to limit the number of processors per system bus to an optimal count, which helps to ensure that the system buses do not become a performance bottleneck and that the development cost remains acceptable. Most current Windows-compliant systems support four processors on each system bus.
High-end servers are designed to support more than one system bus. One design approach is to create a number of nodes where each node contains some processors, some memory, and, in some cases, an I/O subsystem. Figure 1 shows a typical node architecture.
Figure 1. Typical Four-Processor NUMA Node Architecture
Note: Although it is possible to create a node that has more than four or fewer than four processors, this paper assumes the typical case of a four-processor node.
All resources within a node are considered to be local to that node, and access to local memory from within the node is considered to be uniform. I/O might reside within the same node as the processors and memory, or it might reside in dedicated I/O nodes. The architecture of this node is similar to that of a classic four-way symmetric multi-processor (SMP) system. The main difference between a four-processor NUMA node and an SMP system is the cache-coherent system interconnect that is accessible outside the NUMA node.
To increase system capacity, additional nodes are connected using the high-speed cache-coherent system interconnect, as shown in Figure 2.
Figure 2. Two Four-Processor NUMA Nodes Connected as an Eight-Processor NUMA System
In Figure 2, all eight processors can access memory in both nodes coherently. For example:
It takes more time to access memory in another node than it takes to access local memory. This difference in memory access times is the origin of the name for these systems: non-uniform memory architecture (NUMA).
The ratio of the time taken to access near memory to the time taken to access far memory is referred to as the NUMA ratio. The higher the NUMA ratio value -- that is, the greater the disparity between the time it takes to access far memory as compared to near memory -- the greater the effect that NUMA characteristics may have on software performance. To ensure that optimal performance can be achieved on this type of system, the Hardware Design Guide for Microsoft Windows 2000 Server Version 3.0 recommends a far-to-near ratio no greater than 3:1.
To achieve the best performance on a single operating-system image that is running across multiple NUMA nodes, accesses over the system interconnect must be kept to a minimum. This can only be achieved by adding NUMA support features to the operating system itself, and in some cases to individual applications, as described in this paper.
NUMA Support in Windows Server 2003
The NUMA features described in this section are available in the 32-bit and 64-bit versions of Windows Server 2003, Enterprise Edition and Windows Server 2003, Datacenter Edition. To enable the operating system to provide NUMA enhancements, the hardware must pass a description of the physical topology of the system to the operating system.
In Windows Server 2003, the topology description is passed using a static Advanced Configuration and Power Interface (ACPI) Specification table called the Static Resource Affinity Table (SRAT). The SRAT is constructed by system firmware and is passed to the operating system at boot time as part of the ACPI data structures. Although this table is referenced in Version 2.0 of the ACPI Specification, it can be implemented in ACPI Specification Version 1.0b data structures. For more information on the SRAT, see the "Resources" section of this paper.
Note: On a NUMA system that is divided into several hardware partitions -- that is, where an instance of the operating system is running on each partition -- the SRAT table passed by the firmware describes the hardware resources owned by the particular partition on which the operating system image is running.
The SRAT uses the concept of proximity domains, introduced in ACPI 2.0. Processor and memory resources that are physically located in the same NUMA node can be grouped into the same proximity domain by using the SRAT. This feature allows the firmware to pass the NUMA node topology to the operating system generically. Although the SRAT is a static table, it can be used to indicate where memory might be hot-added in the future to support the Windows Server 2003 hot-add memory feature.
The operating system uses the information provided by the SRAT to support the following NUMA features:
Windows Server 2003 Use of the SRAT Information
The operating system uses information from the SRAT to ensure that, wherever possible, threads of execution and the memory that those threads use most heavily are physically located in the same node. This keeps most Windows applications running locally within a node, minimizing accesses across the system interconnect and helping to ensure optimal performance on NUMA-based systems.
Based on information from the SRAT, the operating system supports the following capabilities:
In some scenarios, a NUMA node has more threads or memory requests than its local resources can satisfy. In these cases, a thread may be scheduled on, or memory allocated from, a remote node.
Application management techniques can help prevent threads or memory from being scheduled or allocated to another node. For more information, see "Thread and Process Affinity and Scheduling" later in this paper.
NUMA Topology API
The NUMA features in the Windows Server 2003 operating system are automatically enabled when the operating system detects an SRAT describing a NUMA topology. Certain application features require knowledge of the system topology, for example, setting processor affinity for a thread. In a NUMA system, applications should call the NUMA topology API to ensure a NUMA-aware choice of processors is made. This helps to ensure optimal performance.
The NUMA topology API includes the following functions:
Process and thread affinity are discussed in more detail in "Thread and Process Affinity and Scheduling" later in this paper. For more information on the NUMA topology API and on process and thread affinity see the "Resources" section of this paper.
NUMA Effects on Windows Applications
Existing applications should run without modification on NUMA systems. Most applications perform optimally without modification on NUMA systems running Windows Server 2003 because of the automated NUMA features in the operating system. The following describes how the NUMA features in Windows Server 2003 might affect performance on different application architectures:
NUMA Effects on Large, Enterprise-Class Applications
You may need to investigate and possibly modify some application characteristics to achieve optimal performance on NUMA systems. Examples of these application characteristics are:
To improve the performance of applications with these characteristics, follow these general guidelines:
Many attributes of hardware and software on a NUMA system can affect the performance of software applications, from the hardware NUMA ratio to the thread affinity used. Covering all possible scenarios and providing accurate estimates of the effects of each is beyond the scope of this paper.
The following sections attempt to provide guidelines and suggest remedies for application features that may cause performance issues on NUMA systems. Three areas are discussed:
Thread and Process Affinity and Scheduling
Application Use of Process and Thread Processor Affinity APIs
Applications can override the default process affinity assignment by reassigning process or thread affinities:
An application that creates multiple processes and assigns the threads for each process to processors within the same node should perform optimally on a NUMA system.
Applications that spawn a large number of threads from a single process may not perform optimally when run on a NUMA system. This type of application might create a situation where threads with a shared memory space run across more than one NUMA node. Under this scenario, common data access from these threads is likely to cause increased remote memory accesses, which can degrade application performance.
A performance investigation into such applications running on a NUMA system is warranted. If performance issues are identified, you can optimize the performance of the application with one of the following NUMA optimizations:
The goal of both of these optimizations is to maximize the amount of data that resides locally to the threads that most frequently use it.
Administrator Use of Resource Management Tools
To help avoid this problem, you can use a resource management tool, such as Windows System Resource Manager (WSRM), to balance the application load across the processors on the system. You can also use a resource management tool to assign additional resources to a high-priority application.
One mechanism used by resource management tools is to set processor affinities for application processes and threads. Administrators must take the system topology into account when performing such actions on a NUMA system. The affinity settings used must ensure that data and the threads using the data are kept local to a NUMA node wherever possible. If multiple processors are being assigned to threads from the same process, those processors should be in the same NUMA node wherever possible. Other consequences that you should be aware of when using resource management tools to assign processor affinity on NUMA systems are:
Memory Partitioning and Allocation
In most software applications, the default NUMA assignment of processor affinities for all processes and threads minimizes the need to access memory across the system interconnect and optimizes application performance. In the few cases where the thread usage for a single process crosses multiple NUMA nodes and thread grouping is used as described earlier in this paper, you should investigate memory usage for potential effects on performance.
If threads are grouped based on memory usage, the operating system ensures that memory requested by the threads is allocated locally wherever possible. However, performance could degrade if all threads need access to global memory components, such as a control structure, data cache, or global lock.
You should investigate the effect of such a data access problem to understand how it affects performance before considering a change. If change is needed, you could update the application to partition any data components to provide, for example, a per-node data cache or control structure with distributed locks. Having the data cache local to the node reduces the requirement to access data over the system interconnect and improves application performance.
Where global application data locks are required, they should be distributed or queued to ensure that they don't lock out requests from remote nodes.
Another potential problem is an excess of memory allocation requests within a node, which can saturate the available per-node memory pool. In this scenario, further requests are satisfied with memory from other nodes, increasing system interconnect accesses and degrading performance.
A system administrator might be able to avoid this scenario by using a resource management tool to balance the application load across the nodes in the system, as described in the "Administrator Use of Resource Management Tools" section of this paper.
I/O Configuration for Optimal Performance
The NUMA features implemented in the Windows Server 2003 operating system address processor and memory locality issues. Features to enhance I/O performance on NUMA systems will be investigated for inclusion in a future release of the Windows operating system. In a NUMA system, the I/O subsystems might be within the processor/memory nodes, or they might be in separate I/O nodes that are connected to the system interconnect. This section suggests ways to configure I/O subsystems for optimal performance on NUMA systems:
Call to Action