MCA Support in 64-bit Windows XP and Windows Server 2003
Updated: July 29, 2008
The 64-bit versions of Microsoft Windows XP Professional and Windows Server 2003 Enterprise and Datacenter editions support the Machine Check Architecture (MCA) defined by Intel for the Itanium processor. This paper describes the MCA support model for these releases of the Windows operating system. The information in this paper is intended to help manufacturers plan support for MCA on systems running 64-bit Windows XP Professional and Windows Server 2003.
As computer systems grow larger and the number and density of components increase, hardware failure rates for each system are also likely to increase. At the same time, it is becoming increasingly important for large enterprise-class computer systems to have higher levels of reliability, availability, and serviceability (RAS). To address these requirements, Intel has defined a Machine Check Architecture (MCA) for the Itanium processor. MCA is a hardware and software architecture for Itanium that is intended to improve the RAS features of the system.
Machine Check Architecture (MCA) on Itanium-based computer systems refers to:
This white paper describes how the delivery mechanisms for hardware error events on Itanium-based platforms differ from the delivery mechanisms used on IA-32.
This paper makes references to the following documents:
References to Windows XP and Server 2003 64-bit in this document refer to the Windows XP Professional 64-bit and Windows Server 2003 Enterprise and Datacenter 64-bit editions.
What Are MCA Events?
Almost all MCA events are hardware errors, but not every hardware error is reported as an MCA event. For example, a failure in a memory device in system memory would probably be reported as an MCA event, but a component failure on a hard disk drive would probably not be reported to the operating system as an MCA event. The difference between these two events is the hardware mechanism that is used to alert the CPU that the event occurred.
Types of MCA Events
All MCA events fall within two categories based on the hardware reporting mechanism used. These are described in the SAL 3.0 Specification as follows:
Any hardware error event that does not fall within these two categories is not delivered or processed as an MCA event. There is some flexibility in this model. System designers can connect hardware error event signals to the pins provided on the CPU to ensure that the errors are processed as MCA events. However, an MCA event must successfully identify the component that failed, or the component that is likely to fail in the future. For peripherals on an I/O bus such as Peripheral Component Interconnect (PCI), there is no standardized direct link from peripheral devices to the system chipset that could be used to raise an MCA event that identifies the failing component. MCA does provide full error handling for the standard PCI fatal error types such as PERR and SERR.
MCA events can be further categorized based on the severity of the error. Each error delivered to the operating system falls within one of the following categories:
Different reporting mechanisms are used for fatal errors versus corrected errors:
Dispatch Flow for MCA Events
The most significant difference between the IA-32 MCA model and the Itanium MCA model is that with the Itanium MCA model, the operating system depends on external software to provide the hardware context for MCA events. As a result, it is critical that all Itanium-based Windows XP and Windows Server 2003 systems support a generic definition for the format of this hardware context data structure, as described in "MCA Error Record Format" later in this paper.
Dispatch mechanisms. Itanium-based systems dispatch MCA events to software differently than IA-32-based systems do. On IA-32-based systems, the CPU detects the error and dispatches directly into the operating system. The operating system is responsible for gathering the error context and identifying the error source. On Itanium-based systems, the dispatch flow has changed and the action taken depends on the category of the error event. The dispatch flows for Itanium are described in Chapter 4 of the SAL 3.0 specification. Figure 1 later in this paper shows the Windows XP and Windows Server 2003 64-bit specific dispatch flows.
Dispatch flow on Itanium-based systems. All fatal MCA events are handled first by firmware and then dispatched into the operating system by using the operating system MCA handler address. During the boot process, the operating system MCA code registers this fatal error (machine check abort) handler address with the SAL firmware. The firmware uses this dispatch point for all fatal MCA events.
Corrected events are initially handled by firmware, too. These events can be dispatched to the operating system by using interrupts or can be polled for by the operating system. The SAL has the option to use ACPI to provide interrupt information for dispatching platform-corrected error events. If no platform interrupt information is provided, the operating system enters polled mode for corrected events. In this mode, the operating system polls the SAL at regular intervals for any corrected MCA events that might have occurred.
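The choice between interrupt-driven and polled delivery of corrected events can be sketched as follows. This is a minimal illustrative model in Python, not the Windows or SAL implementation: `FirmwareModel`, `platform_interrupt_vector`, `select_delivery_mode`, and `poll_once` are hypothetical names introduced here, and error records are represented as plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FirmwareModel:
    """Hypothetical stand-in for the SAL: it queues corrected-error records."""
    # None models the case where no platform interrupt information is provided.
    platform_interrupt_vector: Optional[int] = None
    pending: List[str] = field(default_factory=list)

    def get_corrected_records(self) -> List[str]:
        """Return all corrected-error records logged so far and clear the queue."""
        records, self.pending = self.pending, []
        return records


def select_delivery_mode(fw: FirmwareModel) -> str:
    """Use interrupts when the platform describes one; otherwise fall back to polling."""
    return "interrupt" if fw.platform_interrupt_vector is not None else "polled"


def poll_once(fw: FirmwareModel, handler: Callable[[str], None]) -> int:
    """One polling pass: hand every pending record to the OS handler."""
    records = fw.get_corrected_records()
    for record in records:
        handler(record)
    return len(records)
```

In polled mode, the operating system would call something like `poll_once` at a regular interval; the interval itself is not specified in this paper and is left out of the sketch.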
When the CPU detects an error during normal operation, one of two things can happen:
If the CPU dispatches into the PAL, the dispatch flow follows these steps:
MCA Error Record Format
To ensure the operating system can parse the error and analyze the error severity and root cause, the error record must be in a generic format. The minimum error record format required by the 64-bit versions of Windows XP and Windows Server 2003 is the format defined in Appendix B of the SAL 3.0 specification. This format is a logo requirement for the Windows Server 2003 Enterprise and Datacenter 64-bit releases.
This error record structure defines seven classes of error: one CPU class and six platform classes. The six platform classes are listed below. The CPU class and the following two platform classes are generic enough so that the operating system can parse the error record and perform an analysis to identify the source of the error.
The remaining four platform classes are more platform-specific and the operating system can only identify the class of the error:
To fully analyze and identify the cause of these more platform-specific classes of error, the operating system needs assistance from the platform OEM, as described in "Windows 64-bit MCA Features" later in this paper.
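The split between classes the operating system can analyze by itself and classes that need OEM assistance can be illustrated with a small sketch. The class names below are placeholders chosen for illustration only; the authoritative class list and the binary record layout are defined in Appendix B of the SAL specification and are not reproduced here. `Section`, `PARSEABLE_CLASSES`, and `triage` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Placeholder names (an assumption): the CPU class plus two generic platform classes.
PARSEABLE_CLASSES: FrozenSet[str] = frozenset({"cpu", "memory", "pci_bus"})


@dataclass(frozen=True)
class Section:
    error_class: str  # class of error that this section of the record reports
    payload: bytes    # raw section body as produced by firmware


def triage(
    sections: List[Section],
    parseable: FrozenSet[str] = PARSEABLE_CLASSES,
) -> Tuple[List[Section], List[Section]]:
    """Split sections into those the OS can analyze and those that need OEM help."""
    analyzable = [s for s in sections if s.error_class in parseable]
    oem_only = [s for s in sections if s.error_class not in parseable]
    return analyzable, oem_only
```

For the second group, the operating system can still report the class of the error, but a root-cause analysis would be delegated to OEM-supplied code.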
Windows 64-bit MCA Features
Operating systems can provide three basic features for MCA:
The first implementation of Windows 64-bit operating system MCA code is in the Windows XP 64-bit Professional and the Windows Server 2003 Enterprise and Datacenter 64-bit releases. This implementation consists primarily of error logging and error prediction features.
Figure 1 shows the MCA flow within the operating system for Windows Server 2003 64-bit editions.
Figure 1. Windows XP/Server 2003 64-bit MCA flow.
The kernel/HAL (hardware abstraction layer) provides a handler for each of the three MCA events that can be delivered from the hardware. In addition, it supports polled mode for corrected errors, as described in "Dispatch Flow for MCA Events" earlier in this paper.
Windows Management Instrumentation (WMI) has built-in classes and events for 64-bit MCA in the Windows XP and Windows Server 2003 64-bit releases.
The rest of this section describes the MCA features in the Windows XP and Server 2003 64-bit releases.
Error Prediction in 64-bit Windows Server 2003
Error records for both CMC and CPE events are delivered to WMI, which executes the following steps:
These capabilities can be used to support features like remote health monitoring of the system, failure prediction, and component swap out before a fatal error occurs. The more sources of corrected errors there are in the hardware, the better the error prediction will be. These features can go a long way to improve the reliability and availability of the system.
Error Logging on 64-bit Windows XP and Server 2003
When a fatal error event is delivered to the operating system, the handling is different from that performed on corrected errors. For example:
Maintaining error records for fatal errors in NVRAM ensures that an error record that describes the cause of the error is automatically saved for every fatal event that occurs on every system running 64-bit Windows XP and Windows Server 2003. This provides the following benefits:
These features help ensure that no system crashes more than once as the result of any fatal hardware error. After the system is rebooted, notification of the fatal event can be delivered remotely to the system administrator or support staff by using management software. This is a powerful feature that enhances availability and serviceability.
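The save-then-report pattern described above can be sketched as follows. This is a simplified model under stated assumptions, not the Windows implementation: `NvramSlot`, `on_fatal_error`, and `on_boot` are hypothetical names, the store holds a single record, and clearing the slot after the record is read is an assumption made so that each fatal event is reported once.

```python
from typing import List, Optional


class NvramSlot:
    """Hypothetical single-record non-volatile store that survives a reboot."""

    def __init__(self) -> None:
        self._record: Optional[bytes] = None

    def save(self, record: bytes) -> None:
        """Store the record, replacing any earlier one."""
        self._record = record

    def take(self) -> Optional[bytes]:
        """Return the saved record, if any, and clear the slot (an assumption)."""
        record, self._record = self._record, None
        return record


def on_fatal_error(nvram: NvramSlot, record: bytes) -> None:
    """Persist the error record before the system goes down."""
    nvram.save(record)


def on_boot(nvram: NvramSlot, event_log: List[bytes]) -> bool:
    """After reboot, move any saved record into the event log exactly once."""
    record = nvram.take()
    if record is None:
        return False
    event_log.append(record)
    return True
```

Once the record is in the event log, management software can pick it up and notify an administrator remotely.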
For information about access to error records from the event log or through WMI, see the MCA Implementation Guide for 64-bit Windows XP and Windows Server 2003.
Error Recovery on 64-bit Windows Server 2003
There is no fatal error recovery support in Windows Server 2003. However, the operating system MCA code monitors corrected errors and checks for related events. Related events are single-bit ECC errors on the same physical page or corrected errors on the same CPU. This capability allows the operating system MCA code to keep a count of the number of related events and take action to avoid the occurrence of non-correctable events, such as multi-bit ECC errors.
The operating system MCA code automatically attempts to remove any physical page of memory that experiences more than a preset limit of corrected error events. The data from the physical page is copied to another physical page and the failing physical page is no longer used by the operating system. When a page is removed in this way, an event log entry and a WMI MCA event are created.
After a page has been mapped out, it will not be used again as long as the operating system continues to execute. However, if the system is rebooted, the operating system will not remember which physical pages were removed in a previous session and will use all physical memory that is made available by the system firmware.
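The counting and page removal behavior described above can be sketched as follows. This is a minimal model, not the operating system's code: `CorrectedErrorTracker` and `record_error` are hypothetical names, pages are identified by integer page frame numbers, and the limit value is supplied by the caller because the paper does not state the preset limit.

```python
from collections import Counter
from typing import Set


class CorrectedErrorTracker:
    """Counts corrected errors per physical page and retires pages over a limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: Counter = Counter()
        # Held in memory only, so the set is lost when the system reboots.
        self.retired: Set[int] = set()

    def record_error(self, page: int) -> bool:
        """Count one corrected error; return True if this call retires the page."""
        if page in self.retired:
            return False  # page is already out of use
        self.counts[page] += 1
        if self.counts[page] > self.limit:  # "more than a preset limit"
            self.retired.add(page)
            del self.counts[page]
            return True
        return False
```

Because `retired` is never persisted, constructing a new tracker after a reboot starts with every page available again, which mirrors the behavior described in the preceding paragraph.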
Call to Action and Resources
Call to Action
Manufacturers of Itanium-based systems should plan to support MCA in 64-bit Windows XP and Windows Server 2003 by doing the following:
Itanium Processor Family System Abstraction Layer Specification, January 2001.
Intel IA-64 Architecture Software Developer’s Manual, Volume 2, Section 11, IA-64 Processor Abstraction Layer, http://developer.intel.com