NUMA Node Groups

[Figure 1]

Systems with multiple physical processors, or with processors that have multiple cores, furnish the operating system with multiple logical processors. A logical processor is one logical instruction execution engine from the perspective of the operating system, an application, or a driver. In effect, a logical processor is a hardware thread of execution.
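As a concrete illustration, the C sketch below (an assumption here is a Windows 7 / Server 2008 R2 or later SDK, where GetActiveProcessorCount is available) queries the logical processor count. Note that the older GetSystemInfo call reports only the processors in the calling thread's group on systems with more than 64 logical processors.

    /* Minimal sketch: counting logical processors, assuming a
       Windows 7 / Server 2008 R2 or later SDK. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        /* Logical processors, not sockets or cores; on systems with
           more than 64 logical processors this covers only the
           calling thread's processor group. */
        printf("Logical processors in this group: %lu\n",
               si.dwNumberOfProcessors);

        /* Group-aware count across the whole machine. */
        printf("Logical processors in all groups: %lu\n",
               GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
        return 0;
    }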

Windows Server 2008 R2 (and the x64 version of Windows 7) supports systems that have more than 64 logical processors. This support is based on the concept of a processor group: a static set of up to 64 logical processors that is treated as a single scheduling entity.

When the system starts, the operating system creates the processor groups and assigns logical processors to them. Systems with 64 or fewer logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system; for example, a system with 128 logical processors has two processor groups of 64 processors each.
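The group layout can be inspected at run time. The following C sketch, again assuming the group-aware APIs introduced with Windows 7 / Server 2008 R2, enumerates the processor groups and the logical processors in each.

    /* Minimal sketch: enumerating processor groups. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        WORD groups = GetActiveProcessorGroupCount();
        printf("Processor groups: %u\n", groups);

        for (WORD g = 0; g < groups; g++) {
            /* Each group holds at most 64 logical processors. */
            printf("  Group %u: %lu logical processors\n",
                   g, GetActiveProcessorCount(g));
        }
        return 0;
    }

On a 128-logical-processor system like the one described above, this would report two groups of 64 processors each.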

[Figure 2]

For better performance, the operating system takes physical locality into account when assigning logical processors to groups. If possible, all of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group. Physical processors that are close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.
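The node-to-group mapping can be observed with the group-aware ("Ex") NUMA functions. The following C sketch, a non-authoritative example assuming the same Windows 7 / Server 2008 R2 era APIs, reports the group and processor mask of each NUMA node; because a node is always a subset of a single group, one GROUP_AFFINITY structure describes it completely.

    /* Minimal sketch: mapping NUMA nodes to processor groups. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highestNode;
        if (!GetNumaHighestNodeNumber(&highestNode))
            return 1;

        for (ULONG node = 0; node <= highestNode; node++) {
            GROUP_AFFINITY affinity = {0};
            if (GetNumaNodeProcessorMaskEx((USHORT)node, &affinity)) {
                printf("Node %lu -> group %u, mask 0x%llx\n",
                       node, affinity.Group,
                       (unsigned long long)affinity.Mask);
            }
        }
        return 0;
    }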

For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper Supporting Systems That Have More Than 64 Processors at https://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx.

Scalable application design requires NUMA awareness from several perspectives. Herb Sutter describes this process as "Maximize Locality, Minimize Contention" (https://www.ddj.com/architect/208200273). Imagine, for example, the processor load required to service interrupts from modern 10 Gb/sec network cards. Ideally, interrupt processing and any Deferred Procedure Calls (DPCs) run on processors local to the network device. NUMA locality can be applied to processes, threads, devices, interrupts, and memory. See the Code Gallery entry at https://code.msdn.microsoft.com/64plusLP for a detailed walkthrough of the NUMA system APIs.
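As one illustration of the "maximize locality" advice, the C sketch below pins the current thread to the processors of a chosen NUMA node and allocates memory preferentially backed by that node. The node number 0 is a placeholder chosen for illustration; a real application would pick the node hosting the device or data of interest.

    /* Minimal sketch: node-local thread placement and allocation. */
    #define _WIN32_WINNT 0x0601
    #include <windows.h>

    int main(void)
    {
        const USHORT node = 0;        /* hypothetical node of interest */
        GROUP_AFFINITY affinity = {0};

        if (!GetNumaNodeProcessorMaskEx(node, &affinity))
            return 1;

        /* Restrict the current thread to the node's processors; the
           GROUP_AFFINITY names both the group and the mask within it. */
        if (!SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL))
            return 1;

        /* Request memory backed by physical pages on the same node. */
        SIZE_T size = 1 << 20;
        void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                       MEM_RESERVE | MEM_COMMIT,
                                       PAGE_READWRITE, node);
        if (buf == NULL)
            return 1;

        /* ... node-local work on buf ... */
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }

Pairing SetThreadGroupAffinity with VirtualAllocExNuma keeps both the computation and its working set on the same node, which is the essence of the locality guidance above.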

[Figure 3]