Kernel Enhancements for Windows XP
Updated: January 13, 2003
Microsoft has made many enhancements to the kernel of the Microsoft Windows XP and Windows Server 2003 operating systems. This article provides an overview of the new features and changes in the kernel for these versions of Windows, intended for system and peripheral designers, driver developers, and firmware developers who are creating products for these operating systems.
This article assumes that the reader is familiar with related concepts and issues for Windows 2000. For information, see the Windows Driver Development Kit (DDK; http://msdn.microsoft.com/en-us/windows/hardware/gg487428.aspx) and the Windows 2000 Resource Kit (available through MSDN Professional subscription or through Microsoft Press).
The information in this article applies for the Windows XP and Windows Server 2003 operating systems. Where any reference states "Windows XP," the information also applies for Windows Server 2003, unless differences are stated explicitly.
Microsoft has made substantial enhancements to the kernel at the core of Windows XP. Kernel improvements are significant because the kernel provides low-level operating system functions, including thread scheduling, interrupt and exception dispatching, multiprocessor synchronization, and a set of routines and basic objects used by the rest of the operating system to implement higher-level constructs.
This article describes the kernel improvements for Windows XP, which include:
The Windows XP kernel improvements provide new opportunities for independent software vendors (ISVs), independent hardware vendors (IHVs), and other value-added providers working with Windows 2000. Windows XP is compatible with Windows 2000 devices and drivers, while providing new APIs, enhancements, and other features that can be leveraged in future products and services.
As with Windows 2000, the registry plays a key role in the configuration and control of Windows XP. The registry, which resides on the disk as multiple files called hives, was originally designed as a repository for system configuration data. Although most people think of the registry as static data stored on the hard disk, it is also a window into various in-memory structures maintained by the Windows XP executive and kernel.
The registry code was redesigned for Windows XP to provide enhanced performance while remaining transparent to applications, which continue to use the existing registry programming interfaces. Windows XP registry enhancements provide performance improvements in the following areas:
The new registry implementation delivers two key benefits:
Windows XP supports larger registries than previous versions of the kernel, which were effectively limited to about 80 percent of the total size of paged pool. The new implementation is limited only by available system disk space.
Registry consumers had developed a tendency to use the registry more like a database, which increased demands on registry size. The original design of the registry kept all of the registry files in paged pool, which, in the 32-bit kernel, is effectively limited to approximately 160 MB because of the layout of the kernel virtual address space. A problem arose because, as larger registry consumers such as Terminal Services and COM appeared, a considerable amount of paged pool was used for the registry alone, potentially leaving too little memory for other kernel-mode components.
Windows XP solves this problem by moving the registry out of paged pool and using the cache manager to manage mapped views of the registry files. The views are mapped in 256K chunks into system cache space instead of paged pool.
Another issue that affected registry performance in earlier versions is the locality problem: related cells were spread throughout the registry files, so accessing certain information, such as the attributes of a key, could degenerate into page faults, which lowers performance.
The Windows XP registry uses an improved algorithm for allocating new cells that keeps related cells in closer proximity, such as on the same page or nearby pages, which solves the locality problem and reduces the page faults incurred when accessing related cells. A new hive structure member tracks freed cells instead of relying on a linked list of freed cells. When new cells are allocated, the freed-cell list and a vicinity argument are used to keep the allocation within the same bin of the hive.
Windows XP also improves the way the registry handles big data. In versions before Windows XP, an inefficient application that repeatedly grew a value by small increments created a sparse and wasteful registry file. Windows XP solves this problem with a big cell implementation in which cells larger than 16K are split into 16K chunks. This reduces fragmentation when the data length of a value is increased within a certain threshold.
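The arithmetic behind the big cell scheme can be sketched in user-mode C. This is a hypothetical model, not registry code; the only fact it encodes is the 16K chunk size stated above:

```c
#include <assert.h>
#include <stddef.h>

#define CELL_CHUNK (16 * 1024)  /* big cells are split into 16K pieces */

/* Number of 16K chunks needed to store a value of the given length.
 * Growing the value in place only touches the last chunk until a new
 * 16K boundary is crossed, which is what limits fragmentation. */
size_t big_cell_chunks(size_t data_length)
{
    if (data_length <= CELL_CHUNK)
        return 1;               /* small values remain a single cell */
    return (data_length + CELL_CHUNK - 1) / CELL_CHUNK;
}
```

Under this model, repeatedly growing a value by small increments reallocates at most the final chunk rather than the whole cell.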
Windows Support Enhancements
Numerous product support enhancements have been built into Windows XP and Windows Server 2003, including enhancements to the kernel that improve the debugger shipped with Windows XP and the DDK. These enhancements include:
Kernel Changes for Improved Debugging
The debuggers for Windows XP have been redesigned and include tools such as Windbg, Kd, and Cdb. Although the new debuggers also work with Windows NT 4.0 and Windows 2000, some features are available only when debugging Windows XP. There are also 64-bit versions of all of the debuggers for debugging Intel Itanium-based servers running Windows XP.
Kernel enhancements available only for debugging under Windows XP:
The Windows XP debugger and debugger documentation are available at http://msdn.microsoft.com/en-us/windows/hardware/gg463009.aspx.
Cross Session Debugging
Quit and Detach
Example: Using Quit and Detach
Example: Using QD and the Dump Command
To use qd and .dump:
Debugging over an IEEE 1394 port
Dynamic Control over Debug-child Flag
Improved kd Serial Bandwidth Usage
Loading Updated Driver Files through kd
Control over Whether Terminating a Debugger Also Terminates the Process Being Debugged
Built-in User Heap Leak Detection
Windows XP provides built-in user-mode heap-leak detection. Poorly written or miscoded applications can "leak" heap memory. In versions before Windows XP, when this situation arose, special tools were needed on the server to help identify the cause of the memory leak. User-mode heap allocation leaks can be detected in two ways:
Leak Detection when the Process Is Exiting
To enable leak detection when the process is exiting, set the registry key as follows:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ImageName] "ShutdownFlags"="3"
Using a Debugger Extension to Investigate Leaks
!heap -l usage sample:
In this example, the bug is in hOpenFile, which did not free the string returned by RtlDosPathNameToNtPathName_U. In all, 22 heap allocations were not freed when the application exited. The address of each allocation is in the entry column, and its size is in the size column. If the GlobalFlag for Stack Backtrace was enabled, then analysis could show the actual functions that allocated the heap memory. This can be done in several ways.
Note: In some cases false positives are seen, such as applications that keep encoded pointers or blocks passed as contexts to local procedure call (LPC) ports. In addition, this technique does not work for an application that implements its own heap, such as the Microsoft Office applications, or for paged heap enabled with the /full option.
Additional Heap Counters
Another important new feature in Windows XP is heap performance monitoring. Performance Monitor (Perfmon) can display about 20 heap-related counters: amount of memory reserved and committed per heap, number of blocks allocated and freed for three class sizes, average allocated and free time, lock contention, and others. Perfmon will display these counters when the following registry key is set:
Only the process heap and the heaps with higher usage are monitored.
The I/O subsystem consists of kernel components that provide an interface to hardware devices for applications and other mandatory system components. Windows XP enhances the I/O subsystem while retaining complete compatibility with drivers written for Windows 2000. This compatibility was essential because the I/O subsystem provides the interface to all devices, and too many changes to process I/O can break existing applications and drivers.
Enhancements were made by adding new APIs that are available to drivers written to take advantage of the new Windows XP functionality. For this reason, while existing Windows 2000 drivers will work with Windows XP, they must be updated to take advantage of the new I/O improvements, including the following:
New Cancel Queue
Rather than having each driver perform device queuing and handle the I/O request packet (IRP) cancellation race itself, Windows XP automates this process: drivers still handle IRP queuing, but they no longer have to handle IRP cancellation. Intelligence in the queuing process lets the I/O APIs, rather than the drivers, handle requests when the I/O is cancelled. A common problem with IRP cancellation in a driver was synchronizing the cancel lock, or the InterlockedExchange in the I/O Manager, with the driver's queue lock.
Windows XP abstracts the cancel logic in the APIs while allowing the driver to implement the queue and associated synchronization. The driver provides routines to insert and remove IRPs from a queue, and it provides a lock to be held while calling these routines. The driver ensures that the memory for the queue comes from the correct pool. When the driver actually wants to insert something into the queue, it does not call its insertion routine, but instead calls IoCsqInsertIrp().
To remove an IRP from the queue, the driver can either specify an IRP to be retrieved, or pass NULL, and the first IRP in the queue will be retrieved. Once the IRP has been retrieved, it cannot be canceled; it is expected that the driver will process the IRP and complete it quickly.
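The pattern can be modeled in user-mode C. This is a sketch of the idea only: the real kernel interfaces are the cancel-safe queue routines such as IoCsqInsertIrp() and IoCsqRemoveNextIrp(), while every name below (fake_irp, csq_insert, and so on) is invented for illustration. The single lock shared by insertion, removal, and cancellation is what closes the race: once an IRP has been removed under the lock, a canceler can no longer find it.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for an IRP; a singly linked queue entry with an id. */
struct fake_irp {
    struct fake_irp *next;
    int id;
    int canceled;
    int completed;
};

struct csq {
    pthread_mutex_t lock;   /* the "driver-supplied" queue lock */
    struct fake_irp *head;  /* the "driver-supplied" queue      */
};

/* Driver's insert routine runs under the shared lock. */
void csq_insert(struct csq *q, struct fake_irp *irp)
{
    pthread_mutex_lock(&q->lock);
    irp->next = q->head;
    q->head = irp;
    pthread_mutex_unlock(&q->lock);
}

/* Cancellation takes the same lock, so it cannot race with removal.
 * An IRP that has already been removed is simply not found. */
int csq_cancel(struct csq *q, int id)
{
    pthread_mutex_lock(&q->lock);
    for (struct fake_irp **pp = &q->head; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            struct fake_irp *victim = *pp;
            *pp = victim->next;
            victim->canceled = 1;
            pthread_mutex_unlock(&q->lock);
            return 1;
        }
    }
    pthread_mutex_unlock(&q->lock);
    return 0;               /* already removed: cancel is a no-op */
}

/* Remove a specific IRP (id >= 0) or the first one (id < 0). Once
 * removed, the IRP is owned by the driver and cannot be canceled. */
struct fake_irp *csq_remove_next(struct csq *q, int id)
{
    pthread_mutex_lock(&q->lock);
    for (struct fake_irp **pp = &q->head; *pp; pp = &(*pp)->next) {
        if (id < 0 || (*pp)->id == id) {
            struct fake_irp *irp = *pp;
            *pp = irp->next;
            pthread_mutex_unlock(&q->lock);
            return irp;
        }
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}
```

A real driver supplies equivalent insert and remove routines, plus the lock, when it initializes the cancel-safe queue, and thereafter calls only the IoCsq* functions.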
File System Filter Driver APIs
Several new APIs provide greater all-around reliability. Microsoft worked with third-party developers to test their filter drivers. When a driver crashed while attempting to perform illegal operations, Microsoft worked with the developers to determine the functionality they required, and provided APIs that let them accomplish the task without harming the rest of the system. These APIs are included in the Windows Installable File System (IFS) Kit for Windows XP.
Improved Low-Memory Performance
Windows XP is more resilient during periods of low memory because "must succeed" allocations are no longer permitted. Earlier versions of the kernel and drivers contained memory allocation requests that had to succeed even when the memory pool was low, and these allocations would crash the system if no memory was available. Two important I/O allocation routines used "must succeed": one for IRP allocation and the other for Memory Descriptor List (MDL) allocation. If memory could not be allocated when these routines were used, the system would blue screen. For Windows XP, kernel components and drivers are no longer allowed to request "must succeed" allocations; memory allocation routines simply fail if the pool is too low. These changes allow drivers and other components to take appropriate error actions, rather than an extreme action such as bug checking the machine.
Another improvement for low-memory conditions is I/O throttling. If the system cannot allocate memory, it throttles down to processing one page at a time, if necessary, using whatever resources are freely available. This allows the system to continue at a slower pace until more resources become available.
Three new entries have been added to the end of the DMA_OPERATIONS structure. These entries are accessible to any driver that uses IoGetDmaAdapter(). To safely check whether the new functionality exists, the driver should set the version field of the DEVICE_DESCRIPTION structure provided to IoGetDmaAdapter() to DEVICE_DESCRIPTION_VERSION2.
Current Hardware Abstraction Layers (HALs) that do not support the new interface will fail the operation because of the version number. HALs that do support this feature will recognize the new version and succeed the request, assuming all the other parameters are in order. The driver should access the new function pointers only when it has successfully obtained the adapter using DEVICE_DESCRIPTION_VERSION2.
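The negotiation described above can be sketched as follows. This is a user-mode model with invented names (the real structures are DEVICE_DESCRIPTION and DMA_OPERATIONS in the DDK); it shows only the version-check pattern:

```c
#include <stddef.h>

#define DESCRIPTION_VERSION1 1
#define DESCRIPTION_VERSION2 2

struct dma_ops {
    /* ...original entries would precede this... */
    void (*new_entry)(void);    /* valid only when version 2 was negotiated */
};

struct adapter {
    const struct dma_ops *ops;
    int negotiated_version;
};

/* A HAL that predates version 2 fails the request outright when it sees
 * an unknown version number; a newer HAL accepts it. */
struct adapter *get_dma_adapter(int requested_version, int hal_max_version,
                                struct adapter *storage,
                                const struct dma_ops *ops)
{
    if (requested_version > hal_max_version)
        return NULL;            /* old HAL: fail on the version number */
    storage->ops = ops;
    storage->negotiated_version = requested_version;
    return storage;
}
```

A driver following this pattern touches the new function pointers only after a version-2 request has succeeded.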
Windows XP includes a new component--the WebDAV redirector. The WebDAV redirector allows applications on Windows XP to connect to the Internet and to natively read and write data there. The WebDAV protocol is an extension to Hypertext Transfer Protocol (HTTP) that allows data to be written to HTTP targets such as the Microsoft MSN Web Communities. The WebDAV redirector provides file system-level access to these servers in the same way that the existing redirector provides access to SMB/CIFS servers.
One way to access a WebDAV share is to use the net use command, for example:
To connect to an MSN Community, use http://www.msnusers.com/yourcommunityname/files/ as the target. The credentials you need in this case are your Passport credentials; enter these details in the Connect Using Different User Name dialog if you are using a mapped network drive, or use the /u: switch with the net use command. For example:
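As a sketch, the commands might look like the following; the community name and Passport account shown are placeholders:

```
net use * http://www.msnusers.com/yourcommunityname/files/
net use * http://www.msnusers.com/yourcommunityname/files/ /u:yourname@hotmail.com
```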
The simplest ways to create a WebDAV share are:
System Restore is a combination of a file system filter driver and user-mode services that provide a way for users to unwind configuration operations and restore a system to an earlier configuration. System Restore includes a file system filter driver called Sr.sys, which helps to implement a copy-on-write process. System Restore is a feature only of Windows XP Home Edition and the 32-bit version of Windows XP Professional; it is not a feature of the Windows Server 2003 versions.
Volume Snapshot Service
A volume snapshot is a point-in-time copy of a volume. The snapshot is typically used by a backup application so that it can back up files in a consistent manner, even though the files may be changing during the backup. Windows XP includes a framework for orchestrating the timing of a snapshot, as well as a storage filter driver (not a file system filter driver) that uses a copy-on-write technique to create the snapshot.
One important new snapshot-related I/O control (IOCTL) that affects file systems is IOCTL_VOLSNAP_FLUSH_AND_HOLD_WRITES. Although it is an IOCTL, it is actually intended for interpretation by file systems. All file systems should pass the IOCTL down to the lower-level driver that is waiting to process it after the file system. The choice of an IOCTL instead of an FSCTL ensures that even legacy file system drivers will pass the request down.
This IOCTL is sent by the Volume Snapshot Service. When a file system such as NTFS receives the IOCTL, it should flush the volume and hold all file resources to make sure that nothing more gets dirty. When the IRP completes or is cancelled, the file system then releases the resources and returns.
Changes in Existing I/O Features
Windows XP includes several changes in existing I/O features, including:
Read-Only Kernel and HAL Pages
On many Windows XP-based systems, the kernel and HAL pages are marked read-only. This change affects drivers that attempt to patch system code, dispatch tables, or data structures. The change to a read-only kernel and HAL does not happen on all systems:
Windows XP includes several new filter driver APIs, including:
This routine returns a Multi-Sz list of volume path names for the given volume name. The returned lpcchReturnLength includes the extra trailing null character characteristic of a Multi-Sz, unless ERROR_MORE_DATA is returned. In that case, the list returned is as long as possible and may end with a partial volume path.
Windows XP provides improved memory management. The memory manager provides the system services to allocate and free virtual memory, share memory between processes, map files into memory, flush virtual pages to disk, retrieve information about a range of virtual pages, change the protection of virtual pages, and lock the virtual pages into memory. The memory manager also provides a number of services, such as allocating and de-allocating physical memory and locking pages in physical memory for DMA transfers, to other kernel-mode components inside the executive as well as to device drivers.
Memory management enhancements include the following:
Logical Prefetcher for Faster Boot and Application Launch
When a Windows XP-based system boots, data is saved about all logical disk read operations. On later boots, this information is used to prefetch those files in parallel with other boot operations. During boot and application launch, a Windows system demand-pages a sizable amount of data in small chunks (4K to 64K), seeking between files, directories, and metadata. The Logical Prefetcher, which is new for Windows XP, brings much of this data into the system cache with efficient asynchronous disk I/Os that minimize seeks. During boot, the Logical Prefetcher finishes most of the disk I/Os needed to start the system in parallel with device initialization delays, providing faster boot and logon performance.
Logical prefetching is accomplished by tracing frequently accessed pages in supported scenarios and efficiently bringing them into memory when the scenario is launched again. When a supported scenario is started, the transition page faults from mapped files are traced, recording which page of a file is accessed. When the scenario has completed (either the machine has booted or the application started), the trace is picked up by a user-mode maintenance service, the Task Scheduler. The information in the trace is used to update or create a prefetch-instructions file that specifies which pages from which files should be prefetched at the next launch.
The user-mode service determines which pages to prefetch by looking at how successful prefetching has been for that scenario in the past, and which pages were accessed in the last several launches of the scenario. When the scenario is run again, the kernel opens the prefetch instructions file and asynchronously queues paging I/O for all of the frequently accessed pages. The actual disk I/Os are sorted by the disk drivers to go up the disk once to load all pages that are not already in memory. This minimizes seeks, cuts down on disk time, and increases performance. The kernel also prefetches the file system metadata for the scenario, for example, MFT entries and directory files. Because prefetching is useful only when the required data is not in memory, the applications that are launched frequently are not traced and prefetched each time.
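The "one sweep up the disk" idea can be sketched in user-mode C. This is a hypothetical model with invented names; the real prefetcher works on file pages and MFT metadata as described above:

```c
#include <stdlib.h>
#include <stddef.h>

/* Model of the prefetcher's I/O sorting: queued page-in requests are
 * sorted by on-disk position so the head sweeps the disk once. */
struct page_io {
    unsigned long disk_offset;  /* position of the page on disk       */
    int already_resident;       /* pages already in memory are skipped */
};

static int by_offset(const void *a, const void *b)
{
    unsigned long x = ((const struct page_io *)a)->disk_offset;
    unsigned long y = ((const struct page_io *)b)->disk_offset;
    return (x > y) - (x < y);
}

/* Drop resident pages, then sort the remainder by disk offset.
 * Returns the number of I/Os actually issued. */
size_t plan_prefetch(struct page_io *ios, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!ios[i].already_resident)
            ios[kept++] = ios[i];
    qsort(ios, kept, sizeof *ios, by_offset);
    return kept;
}
```

Sorting by on-disk position turns scattered page-ins into a single sweep of the disk head, which is where the seek savings come from; skipping resident pages is why frequently launched applications gain little from repeated prefetching.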
Settings for Logical Prefetch
The prefetch settings are stored under the following registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters

RootDirPath (REG_SZ), default value: Prefetch

EnablePrefetcher (REG_DWORD):
0x00000001 = application launch prefetching
0x00000002 = boot prefetching
These flag values are ORed together, so if both were enabled, the setting would be 0x00000003. The setting takes effect immediately. In Windows Server 2003 editions and later versions, only boot prefetching is enabled by default; application prefetching can be enabled with the registry setting cited here. The system boot prefetch files are in the %systemroot%\Prefetch directory. Although these prefetch files can be opened using Notepad, they contain binary data that Notepad will not render meaningfully. If you are going to view these files, make them read-only or copy them to a different location before opening them.
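Assuming the standard .reg file format, enabling both prefetch flags could look like the following fragment; the key path and value are the ones given above:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters]
"EnablePrefetcher"=dword:00000003
```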
Scalability Improvements Due to Reduced Lock Contention
After significant analysis to identify the resource synchronizations incurring the highest contention, the memory management subsystem has gone through numerous changes for Windows XP to reduce Page Frame Number (PFN), address windowing, system PTE, and dispatcher lock contention. The changes removed numerous unnecessary locks and in many cases redesigned the type of locking done, improving both scalability and performance.
Improved Caching and Backup Due to Dynamic Paged Pool Usage
A major redesign of some internal Memory Manager structures causes substantially less paged pool to be consumed, allowing for greater caching capacity and faster response.
Paged pool is now only allocated while a view is active and even then, only for an amount proportional to the actual view size. When a view is unmapped, that pool is then immediately available for reclaiming if the system detects that overall pool usage is high. This can be done because the pool can be dynamically reallocated and repopulated later if the caller requests the view again. Paged pool used to be allocated for an amount proportional to the section (file) size, regardless of the actual views that were ever used.
Improved Server Scaling Due to Individual Page Charging
Improved Terminal Server and Network Server Scaling
The number of system Page Table Entries (PTEs) has been increased so that they can map a maximum of approximately 1.3 GB, of which just under 1 GB is virtually contiguous. This is approximately twice as many PTEs as were available in Windows 2000 and six times more than in Windows NT 4.0. The actual numbers are subject to system configuration, such as RAM, registry switches, and other factors. Currently, a system must:
On an x86-based system with 256 MB of RAM, these changes made it possible to allocate a 960-MB contiguous mapping. The new mapping is kept contiguous by not freeing this second, larger chunk of PTEs into the generally available pool until the initial (smaller) PTE pool cannot satisfy a request. Thus, the system will not fragment a big PTE region before a driver can access it. Drivers should only experience contention for this resource from another driver that wants a huge chunk, rather than from small random system or driver allocations that happen to occur first.
These extra PTEs also greatly enhance any configuration where PTE usage is critical, for example Terminal Server scenarios, network servers with many concurrent I/Os, and so on.
Support of Giant Drivers Due to Increased Number of System PTEs
Windows XP supports giant driver mappings. Although video drivers are the most obvious beneficiaries, this also enables other specialized drivers that support large amounts of dedicated RAM. Windows XP supports nearly a gigabyte of virtually contiguous space for a driver, compared to about 220 MB for Windows 2000 and about 100 MB for Windows NT 4.0.
Support of Direct Execution from ROM
Windows XP supports executing applications directly from ROM. This enables products like Windows NT Embedded to ship on ROMs, allowing manufacturers to ship systems without disk drives for specific markets.
Windows XP brings improvements to the Power Manager, while continuing to support legacy drivers. The Power Manager, which is responsible for managing power usage for the system, administers the system-wide power policy and tracks the path of power IRPs through the system. As with Windows 2000, the Power Manager requests power operations by sending IRP_MJ_POWER requests to drivers. A request can specify a new power state or can query whether a change in power state is feasible. When sleep, hibernation, or shut down is required, the Power Manager sends an IRP_MJ_POWER request to each leaf node in the device tree requesting the appropriate power action. The Power Manager considers the following in determining whether the system should sleep, hibernate, or shut down:
Power management works on two levels, one applying to individual devices and the other to the system as a whole. The Power Manager, part of the operating system kernel, manages the power level of the entire system. If all drivers in the system support power management, the Power Manager can manage power consumption on a system-wide basis, utilizing not only the fully-on and fully-off states, but also various intermediate system sleep states.
Legacy drivers that were written before the operating system included power management support will continue to work as they did previously. However, systems that include legacy drivers cannot enter any of the intermediate system sleep states; they can operate only in the fully on or fully off states as before. Device power management applies to individual devices. A driver that supports power management can turn its device on when it is needed, and off when it is not in use. Devices that have the hardware capability can enter intermediate device power states. The presence of legacy drivers in the system does not affect the ability of newer drivers to manage power for their devices.
Improved Boot and Logon Performance
Customer research has shown that one of the features users most frequently request is fast system startup, whether from cold boot or when resuming from standby or hibernation. As described earlier, data saved about logical disk reads during boot is used on later boots to prefetch files in parallel with other boot operations. Windows XP improves PC startup times, which provides opportunities for system manufacturers who want to improve boot and resume times for new PCs. Windows XP has several improvements in the boot and resume processes, including:
Boot Loader Improvements
The key boost to boot loader performance comes from optimizing disk reads. The Windows XP boot loader (Ntldr) caches file and directory metadata in large chunks in a most-recently-used manner, which reduces disk seeking, and each system file is now read with a single I/O operation. As a result, the Windows XP boot loader is approximately four to five times faster than the Windows 2000 boot loader. Additionally, redundant PS/2-compatible mouse checks were removed from Ntldr.
Boot loader enhancements also provide similar improvements in hibernation resume times, mainly by streamlining the I/O paths that Ntldr uses to read the hibernation image. The hibernation file is compressed as it is written, and for efficiency, the compression overlaps with the file I/O. When resuming from hibernation, however, Ntldr uses the BIOS to perform the I/O, so it is not feasible to overlap the disk reads with the decompression.
Operating System Boot Improvements
Operating system load in Windows XP is optimized by overlapping device initialization with the required disk I/Os, and by removing or delaying the loading of processes and services that are unnecessary at boot time. When tuning a system for fast booting, it is crucial to look at both the efficiency of device initialization and the disk I/Os. Windows XP initializes device drivers in parallel to improve boot time: instead of each device being waited on sequentially, many can now be brought up in parallel, so the slowest device has the greatest effect on boot time. Overlapped device initialization can be viewed with the Bootvis.exe tool.
For more information, see Fast Boot /Fast Resume Design.
For optimizing boot time, device drivers should only do what is required to initialize the device during boot, and defer all else to post-boot. The goal is that the device is usable after boot, but does not delay boot unnecessarily. Device drivers should never hold spin locks except for very short durations, particularly during boot, as this defeats any parallelism.
Examples of gains made by overlapping device initialization with disk I/O include:
Operating System Hibernation Improvements
During hibernation, all devices are powered off, and the system's physical memory is written to disk in the system hibernation file, \Hiberfil.sys. Before Windows XP writes to the hibernation file, all memory pages on the zero, free, and standby lists are freed; these pages do not need to be written to disk. Memory pages are also compressed before being written.
Several improvements optimize the hibernation process in Windows XP. The compression algorithm has been optimized to compress and decompress large (64K) blocks of data, and the compression is overlapped with the disk write: as the current block of data is being transferred to disk, the next block is being compressed. Overlapping the compression with disk writes makes the compression time virtually free. Also, the hibernation file is written using IDE DMA instead of PIO mode; most modern IDE controllers and disks achieve their best performance only in DMA mode.
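The ping-pong overlap can be sketched in user-mode C. Everything here is a model with invented names: the "compressor" just copies bytes and the "disk write" just counts them, but the control flow (compress block i+1 on a worker thread while block i is being written) mirrors the pipeline described above.

```c
#include <pthread.h>
#include <string.h>
#include <stddef.h>

#define BLOCK_SIZE (64 * 1024)  /* 64K blocks, as in the XP hibernation path */
#define NUM_BLOCKS 8

static unsigned char memory[NUM_BLOCKS][BLOCK_SIZE];   /* pages to save    */
static unsigned char compressed[2][BLOCK_SIZE];        /* ping-pong buffers */
static size_t compressed_len[2];

struct job { int block; int slot; };

/* Stand-in for the real compressor: copy the block and record its length. */
static void *compress_block(void *arg)
{
    struct job *j = arg;
    memcpy(compressed[j->slot], memory[j->block], BLOCK_SIZE);
    compressed_len[j->slot] = BLOCK_SIZE;  /* pretend no size reduction */
    return NULL;
}

/* Stand-in for the DMA disk write: accumulate bytes "written". */
static size_t bytes_written;
static void write_block(int slot)
{
    bytes_written += compressed_len[slot];
}

/* Write all blocks, overlapping the compression of block i+1 with the
 * write of block i. Returns the total bytes written. */
size_t save_hibernation_image(void)
{
    pthread_t worker;
    struct job jobs[2];
    int slot = 0;

    jobs[slot] = (struct job){ .block = 0, .slot = slot };
    compress_block(&jobs[slot]);           /* prime the pipeline */

    for (int i = 0; i < NUM_BLOCKS; i++) {
        int have_next = (i + 1 < NUM_BLOCKS);
        if (have_next) {
            jobs[1 - slot] = (struct job){ .block = i + 1, .slot = 1 - slot };
            pthread_create(&worker, NULL, compress_block, &jobs[1 - slot]);
        }
        write_block(slot);                 /* overlaps the compression above */
        if (have_next)
            pthread_join(worker, NULL);
        slot = 1 - slot;
    }
    return bytes_written;
}
```

With two buffers, the writer never waits on the compressor except at the very first block, which is why the compression time becomes virtually free once the write is the slower stage.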
Standby Resume Improvements
During resume from standby, the operating system sends S0 IRPs to devices to indicate the change in system power state. As noted in the Power Management section of this article, device drivers typically request D0 IRPs to change their device power state. The operating system is responsible for notifying each device in the correct order. There are two key ordering rules that must be followed to prevent deadlocks:
Because many devices may take significant time to go from D3 to D0, the key to good resume performance is to overlap device initialization as much as possible. The ordering chosen by the operating system is important in maximizing parallelism.
The following changes have been implemented to optimize the resume performance in Windows XP:
Windows XP Boot and Resume Tools
Windows XP can trace boot and resume metrics and dump the resulting information to a binary file viewable with Bootvis.exe and other tools. Bootvis.exe displays various time-interlocked graphs showing such things as CPU usage, disk I/O, driver delays, and resume activity. Bootvis.exe can show many types of useful details; the best way to start is by dragging an area on the graph and either double-clicking it or right-clicking to see what options the context menu offers. The operating system instrumentation starts about one second after the boot loader loads, so to compute overall boot time, add BIOS power-on self-test (POST) time plus one second to the time shown in Bootvis.exe. Taking boot traces with driver delays enabled lengthens boot by two to three seconds, and the resulting binary file will be several megabytes in size.
Information on using Bootvis.exe and improving boot and resume can be found at Fast Boot /Fast Resume Design.
Windows Server 2003 provides native support for "headless server" operation on server platforms--that is, support for operating without a local display or input devices. Microsoft and Intel worked with the computer industry to define firmware and hardware requirements for server operations that include requirements for headless operation under Windows Server 2003 as part of Hardware Design Guide Version 3.0 for Microsoft Windows 2000 Server, which is available for download on Hardware Design Guide for Windows Server.
Additionally, Microsoft provides detailed documentation on how to design hardware and firmware that integrates well with headless server support included in Windows XP. For additional technical information about Windows support for headless operation, see Headless Server and EMS Design.
Providing Out-of-Band Management
Windows provides many mechanisms for remotely managing a system when the operating system is loaded, fully initialized, and functioning. This type of system management is called "in-band" management, and it typically occurs over the network. However, a different solution is required for managing a system when the Windows network software is unavailable: for example, when the system has failed, when Windows is not loaded or is in the process of loading, or when Windows is fully loaded but the network software is not available. Remote management that does not rely on Windows network software is called "out-of-band" management.
The goal of out-of-band management is always to return the system to a fully functioning state in which in-band management is available. Making both in-band and out-of-band management available remotely is a key component of headless functionality. With the exception of hardware replacement, all administrative functions that can be accomplished locally should be available remotely. For example, the following capabilities should be available for remote management:
To accommodate the many possible system operation states, careful thought must be given to hardware, firmware, and operating system functionality to ensure a comprehensive implementation of out-of-band management capabilities. Each system component must contribute to a coherent, complementary solution in order to avoid expensive, confusing administration.
Implementing Headless Support
A headless solution requires that local console I/O dependencies be removed from the operating system. Windows Server 2003 supports operating without a keyboard, mouse, or monitor attached to the system. On an ACPI-enabled system, Windows Server 2003 supports operating without a legacy 8042 keyboard controller. Windows Server 2003 also includes support for systems without VGA/display hardware.
Based on built-in Windows support, a USB-capable system running Windows Server 2003 (and Windows XP) can support hot plugging the mouse and keyboard while the operating system is running, and a system with a VGA card can support hot plugging a monitor.
In Windows Server 2003, out-of-band console I/O support is provided for all Windows kernel components--the loader, setup, recovery console, operating system kernel, and blue-screens--and for a text-mode management console available after Windows Server 2003 has initialized, which is called the Special Administration Console. This functionality is called the Emergency Management Services (EMS). This I/O conforms to the following by default:
Output conventions are defined in "Standardizing Out-of-Band Management Console Output and Terminal Emulation (VT-UTF8 and VT100+)," available on Headless Server and EMS Design.
Note: These settings were chosen because they are the current "standard" configuration in Unix administration and allow interoperability. However, additional settings are supported and may be passed via the Serial Port Console Redirection Table; documentation is available on the web at Headless Server and EMS Design.
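As an illustrative sketch only (the ARC path, partition numbers, and baud rate shown here are assumptions, not prescriptions), EMS console redirection on an x86-based Windows Server 2003 system is configured through boot.ini, using redirect settings in the [boot loader] section and the /redirect switch on the operating system entry:

```
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
redirect=COM1
redirectbaudrate=9600
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /redirect
```

On systems whose firmware reports its redirection settings through the Serial Port Console Redirection Table, redirect=USEBIOSSETTINGS can be specified instead so that the operating system inherits the firmware's port configuration.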
Designing Hardware and Firmware
There are three key areas that must be addressed to provide a high-quality headless platform that is cleanly integrated with Windows Server 2003 EMS:
For more information, see Headless Server and EMS Design.
Hot-Add Memory and Memory Mirroring Support
Windows Server 2003 introduces hot-add memory support, which allows memory devices to be added to a machine and be made available to the operating system and applications as part of the normal memory pool--without turning the machine off or rebooting.
Hot-add memory support will be especially valuable to enterprise system administrators who need to add memory resources to keep up with production demands but are working with systems that require continuous (24x7) availability. Adding memory would otherwise require shutting down the system and interrupting service; with the hot-add feature, server capacity can be increased without service interruption.
Windows Server 2003 does not support removal of memory regions or a memory upgrade that requires removing and then adding memory.
Memory mirroring (distinct from hot-add memory) is also introduced in Windows Server 2003, providing increased availability for extremely high-end systems (for example, Stratus systems).
CAUTION: This feature will operate only on servers that have hardware support for adding memory while the server is operating. Most existing servers do not have this type of hardware support and could be damaged if memory is installed while the power is on. Consult the server operator's manual for more information. A small number of new servers will offer the ability to add memory while the server is operating. For these servers, the act of installing memory will automatically invoke the hot-add memory feature in the operating system.
Windows Server 2003 supports ccNUMA and NUMA-"lite" designs whose near-to-far memory access time ratios are 1:3 or less.
Note: For optimal performance with Windows 2000 and later operating systems, system designers who are building platforms that present memories with different access times should keep the ratio of access times for "near" versus "far" memory, relative to a given microprocessor and as seen by the operating system, at 1:3 or less.
As processor speeds have increased relative to bus speeds, and as the number of processors in symmetric multiprocessor (SMP) systems has grown, the system bus and main memory have become a significant bottleneck. The trend in High Performance Computing (HPC) over the last decade has been to reduce dependence on a single system bus or system memory by building smaller systems (nodes), each with its own processors and memory, and then providing a high-speed cache-coherent interconnect network to form a larger system. Access from a processor to node-local memory is fast, as in small SMP systems; access to memory in another node is much slower. In earlier systems using this approach, remote access could be as much as 50 times slower than local access.
Today, the difference has been reduced significantly, but remote memory access still usually takes two to three times as long as local access. Operating systems such as Windows 2000, which were designed for SMP systems, can work acceptably on these newer systems, but significant performance gains are possible if the operating system is aware of the underlying architecture. Windows Server 2003 contains code to take advantage of both the memory in the current processor node and the memory assigned to other processor nodes.
Per Node Memory Allocation
An effective method for improving performance on a ccNUMA machine is to ensure that processors use the memory closest to them, which includes keeping threads running on the same processor node. Windows 2000 uses page coloring to ensure that page allocation is spread as evenly as possible throughout the physical address space of the system. To achieve good page placement on a ccNUMA system, each node has its own palette of colors. Page allocation proceeds round-robin within the palette of the node containing the thread's ideal processor; if no pages are available in that palette, allocation proceeds round-robin across all colors in all palettes.
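The per-node palette policy described above can be sketched in portable C. This is a hypothetical illustration of the algorithm, not the kernel's actual code; all names, node counts, and palette sizes are assumptions chosen for clarity:

```c
#include <assert.h>

/* Sketch of per-node page coloring: each node owns a palette of
   colors; allocation round-robins within the requesting node's
   palette and falls back to a global round-robin over all palettes
   when the node's palette is exhausted. */

#define NODES           2
#define COLORS_PER_NODE 4

typedef struct {
    int free_pages[COLORS_PER_NODE]; /* free pages per color */
    int next_color;                  /* per-node round-robin cursor */
} NodePalette;

static NodePalette palettes[NODES];
static int global_cursor; /* round-robin cursor over all palettes */

/* Returns the node a page was taken from, or -1 if none is free. */
static int alloc_page(int ideal_node)
{
    NodePalette *p = &palettes[ideal_node];
    int i;

    /* First pass: round-robin within the ideal node's palette. */
    for (i = 0; i < COLORS_PER_NODE; i++) {
        int c = (p->next_color + i) % COLORS_PER_NODE;
        if (p->free_pages[c] > 0) {
            p->free_pages[c]--;
            p->next_color = (c + 1) % COLORS_PER_NODE;
            return ideal_node;
        }
    }

    /* Fallback: round-robin across all colors in all palettes. */
    for (i = 0; i < NODES * COLORS_PER_NODE; i++) {
        int slot = (global_cursor + i) % (NODES * COLORS_PER_NODE);
        int n = slot / COLORS_PER_NODE, c = slot % COLORS_PER_NODE;
        if (palettes[n].free_pages[c] > 0) {
            palettes[n].free_pages[c]--;
            global_cursor = (slot + 1) % (NODES * COLORS_PER_NODE);
            return n;
        }
    }
    return -1; /* no free pages anywhere */
}
```

The point of the sketch is the two-level fallback: local allocations stay within the thread's node for as long as possible, which preserves NUMA locality, while the global round-robin keeps the system functional when a node runs dry.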
Per Node Scheduling
The thread scheduler has been changed to examine and favor the following processor configurations:
If this prioritization fails, Windows Server 2003 considers other processor nodes only if the average processor utilization of those nodes is lower than the average utilization of the ideal node.
OEM Support for ccNUMA
For Windows Server 2003 to use these additions for proper ccNUMA support, original equipment manufacturers (OEMs) must supply a HAL that identifies and communicates with the hardware in a way that Windows Server 2003 can understand. There is also a BIOS feature--the Static Resource Affinity Table (SRAT)--that, if provided by the OEM hardware, allows the system to function properly out of the box with the Itanium-based HAL that ships with Windows Server 2003.
Static Resource Affinity Table
With the current Enterprise market trend to build large-scale systems out of "nodes" of smaller scale systems, the best performance is typically achieved when processors use memory that is physically located in the same smaller system node as the processor, rather than using memory located in other nodes.
SRAT, as described in the ACPI specification, can be used to describe the physical location of processors and memory in large-scale systems (such as ccNUMA) to Windows, allowing threads and memory to be grouped in an optimal manner.
The SRAT contains topology information for all the processors and memory present in a system at system boot, including memory that can be hot removed. The topology information identifies the sets of processors and physical memory ranges in the system, enabling a tight coupling between the processors and memory ranges in a node.
The ACPI 2.0 specification introduces the concept of proximity domains within a system. Devices in a system that belong to a proximity domain are tightly coupled, or "closer," to each other than to other devices in the system. For example, in a ccNUMA machine consisting of several nodes interconnected through a switch, the latency of memory accesses from a processor to memory on the same node would typically be shorter than the latency of memory accesses to other nodes. The operating system should use this affinity information when allocating memory resources and scheduling software threads, thereby improving performance on large systems.
Including the methods that provide affinity information in the ACPI Namespace enables a generic mechanism that provides topology information. This information can be used to resolve scenarios in which components such as processors and memory can be hot added and hot removed from the system.
The Proximity method (_PXM) included in the ACPI 2.0 specification enables the specification of affinity between various devices in the ACPI Namespace. The _PXM method associated with a device returns a domain number; all devices that return the same domain number belong to the same proximity domain. In addition, specifying memory and processor devices, along with the methods for identifying and removing these devices, provides the infrastructure for supporting hot-plug memory. However, because the proximity and hot-plug information is defined in the ACPI Namespace (instead of in static ACPI tables), it is unavailable to the operating system until the ACPI components of the operating system load.
Depending on the load order of the ACPI components with respect to the operating system construction of memory pools and processor startup, affinity and memory hot-plug information in the ACPI Namespace might not be available at the desired system startup phase. The SRAT addresses this problem and makes the affinity and hot-remove memory information available to the operating system in a static fashion before the ACPI components load. This information is needed only at boot time; only processors and memory present at boot time are represented. Unpopulated "slots" for memory and processors that can be hot added do not need to be represented.
The SRAT uses the concept of proximity domains to provide resource affinity information, similar to the domains used in the ACPI 2.0 specification. However, the domain numbers provided by the SRAT do not need to be identical to those used in the ACPI 2.0 Namespace. The SRAT is intended to enable operating system optimizations for the Windows Datacenter Server class of machines with multiple processors and ccNUMA characteristics. These machines must also support the APIC/SAPIC interrupt model.
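The two SRAT entry formats described above can be sketched as packed C structures. Field names here are informal readings of the ACPI 2.0 SRAT definitions; consult the ACPI specification for the authoritative layout:

```c
#include <stdint.h>

#pragma pack(push, 1)

/* Type 0: Processor Local APIC/SAPIC Affinity Structure (16 bytes).
   Associates one processor with a proximity domain. */
typedef struct {
    uint8_t  Type;            /* 0 */
    uint8_t  Length;          /* 16 */
    uint8_t  ProximityDomain; /* node this processor belongs to */
    uint8_t  ApicId;          /* local APIC ID of the processor */
    uint32_t Flags;           /* bit 0: entry enabled */
    uint8_t  LocalSapicEid;   /* SAPIC EID, for Itanium systems */
    uint8_t  Reserved[7];
} SRAT_PROCESSOR_AFFINITY;

/* Type 1: Memory Affinity Structure (40 bytes).
   Associates one physical memory range with a proximity domain. */
typedef struct {
    uint8_t  Type;            /* 1 */
    uint8_t  Length;          /* 40 */
    uint8_t  ProximityDomain; /* node this memory range belongs to */
    uint8_t  Reserved1[5];
    uint32_t BaseAddressLow;  /* physical base, low 32 bits */
    uint32_t BaseAddressHigh; /* physical base, high 32 bits */
    uint32_t LengthLow;       /* range length, low 32 bits */
    uint32_t LengthHigh;      /* range length, high 32 bits */
    uint32_t Reserved2;
    uint32_t Flags;           /* bit 0: enabled; bit 1: hot-pluggable */
    uint8_t  Reserved3[8];
} SRAT_MEMORY_AFFINITY;

#pragma pack(pop)
```

A sequence of these entries follows the standard ACPI table header; the hot-pluggable flag on a memory range is what lets the operating system reserve bookkeeping for memory that can appear after boot.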
How Windows Can Use SRAT
The SRAT is provided as an intermediate step to enable features such as ccNUMA optimizations on Windows Datacenter Server class machines until the capabilities of accessing the ACPI 2.0 Namespace are available at the desired startup phase. Until these capabilities are available, the operating system will scan and use the information provided in the SRAT at boot time. However, after these capabilities are integrated into the operating system, the operating system will no longer use the SRAT, relying instead on the information provided in the ACPI Namespace.
The SRAT provides data for the performance optimization of ccNUMA systems. Fine-grained interleaving, in which memory is interleaved across the nodes at a small granularity, is mutually exclusive with ccNUMA optimizations: the BIOS should not provide an SRAT if fine-grained interleaving is enabled, and fine-grained interleaving should be disabled if an SRAT is provided. ccNUMA optimization is expected to provide superior performance over fine-grained interleaving. Furthermore, the operating system does not support hot-plugging memory in the presence of interleaving.
The operating system will scan the SRAT only at boot time. The BIOS/system is expected to retain the proximity information used by the operating system at boot time across the lifetime of the operating system instance (until the next boot). This specifically implies that the BIOS/system retains the system topology across a system sleep state transition in which processor or memory context is lost (S2-S5), such that the proximity information provided by the SRAT at operating system boot time is valid across the transition. If the system topology is not retained, memory viewed as "local" to a set of processors could potentially become "non-local" memory on return from the sleep state and have an adverse impact on system performance.
The BIOS can use the Prepare to Sleep (_PTS) method to initiate the process of saving all information necessary to replicate the system topology, and can use the sleep-state argument that the operating system passes to _PTS to determine whether the state save is required. If memory and processor hot-plug are not supported, providing a deterministic selection of boot processor/processor order in the Multiple APIC Description Table (MADT) and deterministic programming of the memory regions is sufficient to ensure identical system topology on return from the sleep state.
The BIOS must reconstruct the SRAT on system reboot, just as it must the MADT.
Support for New Hardware
The 64-bit versions of Windows XP and Windows Server 2003 support the Extensible Firmware Interface (EFI), a new standard for the interface provided by the firmware that boots PCs. Microsoft supports EFI as the only firmware interface to boot the 64-bit version of Windows for Itanium-based systems. Because the 64-bit version of Windows cannot boot with BIOS or with the system abstraction layer (SAL) alone, EFI is a requirement for all Itanium-based systems to boot Windows.
The 64-bit versions of Windows also support the Globally Unique Identifier (GUID) Partition Table (GPT), which was introduced as part of the EFI initiative. GPT complements the older Master Boot Record (MBR) partitioning scheme that has been common to PCs. GPT allows use of very large disks. The number of partitions on a GPT disk is not constrained by temporary schemes such as container partitions, as defined by the MBR Extended Boot Record (EBR).
The GPT disk partition format is well defined and fully self-identifying. Data critical to platform operation is located in partitions and not in unpartitioned or hidden sectors. GPT disks use primary and backup partition tables for redundancy and CRC32 fields for improved partition data structure integrity. The GPT partition format uses version number and size fields for future expansion. Each GPT partition has a GUID and a partition content type, so no coordination is necessary to prevent partition identifier collision. Each GPT partition has a 36-character Unicode name, which means that any software can present a human-readable name for the partition without any additional understanding of the partition.
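The self-identifying layout described above can be sketched as packed C structures. Field names here are informal readings of the EFI GPT definitions, included only to make the on-disk format concrete; consult the EFI specification for the authoritative layout:

```c
#include <stdint.h>

#pragma pack(push, 1)

/* GPT header (92 bytes), stored at LBA 1 with a backup copy at the
   last LBA of the disk. */
typedef struct {
    uint64_t Signature;              /* the ASCII string "EFI PART" */
    uint32_t Revision;
    uint32_t HeaderSize;             /* 92 */
    uint32_t HeaderCrc32;            /* CRC32 over this header */
    uint32_t Reserved;
    uint64_t MyLba;                  /* LBA of this header copy */
    uint64_t AlternateLba;           /* LBA of the backup header */
    uint64_t FirstUsableLba;
    uint64_t LastUsableLba;
    uint8_t  DiskGuid[16];           /* GUID identifying the disk */
    uint64_t PartitionEntryLba;      /* start of the entry array */
    uint32_t NumberOfPartitionEntries;
    uint32_t SizeOfPartitionEntry;   /* commonly 128 */
    uint32_t PartitionEntryArrayCrc32; /* CRC32 over the entries */
} GPT_HEADER;

/* One partition entry (128 bytes). The content-type GUID and the
   per-partition GUID make each partition self-identifying, and the
   36-character Unicode name is directly human readable. */
typedef struct {
    uint8_t  PartitionTypeGuid[16];  /* partition content type */
    uint8_t  UniquePartitionGuid[16];
    uint64_t StartingLba;
    uint64_t EndingLba;
    uint64_t Attributes;
    uint16_t PartitionName[36];      /* UTF-16, 36 characters */
} GPT_PARTITION_ENTRY;

#pragma pack(pop)
```

Note how the redundancy claims in the text map onto the fields: MyLba/AlternateLba link the primary and backup tables, and the two CRC32 fields cover the header and the partition entry array separately.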
Intel Itanium Support
The 64-bit versions of Windows XP and Windows Server 2003 for Intel Itanium-based systems are fully featured operating systems that are compatible with most existing 32-bit applications. The 64-bit Windows operating system provides high availability, advanced scalability, and large memory support based on the Intel Itanium chip with its extensive multiprocessing features, powerful floating-point arithmetic extensions, and multimedia-specific instructions. 64-bit Windows and the Itanium microprocessor are designed to address the most demanding business needs of today's Internet-based world, including e-commerce, data mining, online transaction processing, memory-intensive high-end graphics, complex mathematics, and high-performance multimedia applications.
The Microsoft vision is to make a broad portfolio of applications available, including leading Microsoft applications on 64-bit Windows. To achieve this goal, Microsoft provides a rich set of development tools that make it easy to write new applications and port existing ones. The 64-bit Windows platform brings the following benefits to developers and end users:
Benefiting from 64-bit Architecture
A 64-bit operating system supports far more virtual memory than a 32-bit operating system: 32-bit Windows XP supports 4 GB of virtual memory, while 64-bit Windows supports 16 TB. Non-paged pool also increases substantially, up to 128 GB on the 64-bit platform compared to a 256-MB maximum on the 32-bit platform. With these higher limits, the 64-bit platform offers enormous scalability in terms of terminal server clients, page pools, network connections, and so on.
There are also important benefits for businesses:
Designing for 64-Bit Compatible Interfaces
Porting drivers or software created for 32-bit Windows to 64-bit Windows should not create problems for distributed applications, whether they use Remote Procedure Calls (RPC) or DCOM. The RPC programming model specifies well-defined data sizes and integer types that are the same size on each end of the connection. In addition, in the LLP64 abstract data model developed for 64-bit Windows, only pointers expand to 64 bits--all other integer data types remain 32 bits. Because pointers are local to each side of the client/server connection and are usually transmitted as NULL or non-NULL markers, the marshaling engine can transparently handle different pointer sizes on either end of a connection.
However, backward compatibility issues arise when you add new data types or methods to an interface, change old data types, or use data types inappropriately.
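The most common LLP64 pitfall is storing a pointer in a long, which stays 32 bits on 64-bit Windows and therefore truncates the value. The portable C sketch below illustrates the safe pattern using the standard uintptr_t type (standing in for the Windows ULONG_PTR typedef); the type alias and function name are illustrative, not part of any Windows API:

```c
#include <stdint.h>

/* Under LLP64 only pointers widen to 64 bits; int and long remain
   32 bits. A pointer must therefore never be stored in a long.
   uintptr_t (like the Windows ULONG_PTR typedef) is defined to be
   exactly pointer-sized on every data model, so round-tripping a
   pointer through it is always lossless. */

typedef uintptr_t PortablePointerInt; /* stands in for ULONG_PTR */

/* Round-trips a pointer through a pointer-sized integer. */
static void *round_trip(void *p)
{
    PortablePointerInt v = (PortablePointerInt)p;
    return (void *)v; /* no truncation on LLP64, LP64, or ILP32 */
}
```

The same discipline applies to interface design: declaring a field as a pointer-sized integer rather than as long keeps the wire format and the in-memory layout consistent when the 32-bit and 64-bit builds interoperate.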
Several Windows 2000 components are removed in Windows XP and Windows Server 2003:
Windows Logo Program Requirements
Requirements of the Windows Logo Program for hardware that apply specifically to systems or peripherals that will receive the "Designed for Windows XP" and "Designed for Windows Server 2003" logos are defined in Microsoft Windows Logo Program System and Device Requirements, Version 2.2, available on the web at http://msdn.microsoft.com/en-us/windows/hardware/gg463010.aspx.
For the Windows XP debugger, see:
For more information on improving boot and resume, see:
For more information on headless support, see:
For more information on the PCI Hot-Plug Specification, see:
ACL - access control list