MCA Implementation Guide for 64-bit Windows Systems
Updated: July 29, 2008
Introduction

This guide describes the operating system Machine Check Architecture (MCA) implementation model provided in Windows XP 64-bit Edition and the 64-bit versions of the Windows Server 2003 family of operating systems. Certain MCA features discussed in this guide may be supported only in selected client and server editions of the operating system. This document highlights the OEM components available in the model and provides guidelines on how each should be implemented. Details are provided for:
This guide is provided for engineers creating Intel Itanium-based systems and for developers creating System Abstraction Layer (SAL) and system management software. It is assumed that the reader understands the operation of the hardware and software components of Itanium-based MCA. The SAL implementation discussed in this document is based on the Itanium Processor Family System Abstraction Layer Specification, January 2001, published by Intel Corporation and available for download from http://developer.intel.com.

SAL Requirements for Operating System MCA

The MCA model for Windows XP 64-bit Edition and the 64-bit versions of Windows Server 2003 supports machine check abort (also known as fatal error) processing by supplying the SAL with a callback vector into the operating system MCA code during boot. The operating system also supplies the SAL with interrupt vectors for use with RENDEZ and WAKEUP operations. These features are supported as described in the SAL 3.0 specification. The MCA model also supports both interrupt-driven and polled-mode processing of:
The corrected-error hardware interrupt configuration details are passed to the operating system in the Advanced Configuration and Power Interface (ACPI) data structures. The operating system enables interrupt mode if an interrupt source is found in the ACPI data structures. If no interrupt source is found, the operating system defaults to polled mode and polls the SAL once every 60 seconds for corrected events. These features are implemented as detailed in the SAL 3.0 specification. Three other requirements are placed on the system SAL:
Error Record Retrieval Mechanism

Error record retrieval works differently for corrected and uncorrected MCA events. Windows XP and Windows Server 2003 do not currently support error correction, so every MCA event reported to the operating system MCA callback vector causes a bug check of the operating system and requires a reboot.

When the operating system boots after a fatal error has occurred, the operating system MCA code requests an MCA record from each CPU until all records are retrieved.

When the operating system detects a corrected platform error (CPE) or corrected machine check (CMC) event, through an ISR or by polling, it repeatedly polls each CPU until all error records are retrieved. The latency between the time that a CPE or CMC event occurs and the time that the operating system retrieves the records is unspecified, regardless of whether the operating system processes these events in interrupt or polling mode.

Where a global error has occurred, it is possible that each CPU in the system could build an error record. The result would be one error record per CPU, of which perhaps only one contains pertinent error information. We recommend that only error records that contain pertinent error information be returned to the operating system. If multiple CPUs experience error conditions during a global error event, an error record should be returned by each CPU that experienced an error.

The operating system logs all non-recovered error records returned from the SAL to the Event Log and makes them available to management software, as described in the "Access to MCA Error Records" section of this paper. Providing multiple error records that contain no useful information is likely to cause confusion and result in a bad customer experience.

When an MCA event results in multiple hardware components experiencing related error conditions, we recommend that a single error record be built that describes the error event. The operating system treats each error record returned from the SAL as a separate MCA event.
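The retrieval behavior described above can be sketched as a simple drain loop. This is a minimal illustration in Python, not the operating system's code: `drain_error_records`, `get_record`, and the fake firmware below are hypothetical names, and the sketch assumes the firmware query returns nothing once a CPU has no pending records.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical stand-in for the firmware query: returns the next pending
# error record for a CPU, or None when that CPU has nothing left.
GetRecord = Callable[[int], Optional[bytes]]


def drain_error_records(
    cpus: Iterable[int], get_record: GetRecord
) -> List[Tuple[int, bytes]]:
    """Poll each CPU repeatedly until it reports no further records."""
    events: List[Tuple[int, bytes]] = []
    for cpu in cpus:
        while True:
            record = get_record(cpu)
            if record is None:
                break
            # Every record that comes back counts as a separate MCA event.
            events.append((cpu, record))
    return events


# Fake firmware for illustration: CPU 1 holds two records, the others none.
pending = {0: deque(), 1: deque([b"rec-a", b"rec-b"]), 2: deque()}


def fake_get_record(cpu: int) -> Optional[bytes]:
    queue = pending[cpu]
    return queue.popleft() if queue else None


events = drain_error_records(sorted(pending), fake_get_record)
```

Because every returned record becomes its own event, each record the firmware hands back in this loop surfaces separately to the event log and to management software.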
Producing multiple error records for a single MCA event may cause confusion and result in a bad customer experience.

Where the SAL builds an error record that contains more than one section, we recommend that the error section that contains the most useful information about the error condition be included as the first section in the record. The operating system cannot generically determine which section in a multi-section error record contains the most useful information. The operating system MCA code assumes that the first section in the record contains the most useful information and creates an error analysis string based on that section.

Standard Error Record Format

The MCA model in Windows XP 64-bit Edition and the 64-bit versions of Windows Server 2003 places a key requirement on the format of the error record that the SAL passes to the operating system on MCA events. It is a logo requirement of the 64-bit versions of Windows Server 2003 that all error records passed to the operating system for all fatal and corrected MCA events must, at a minimum, comply with the SAL 3.0 specification, Appendix B, "Error Record Structures."

The SAL specification defines seven classes of errors. Six are categorized as platform-specific error classes and the seventh is categorized as the processor error class. The ability of the operating system to analyze the information provided by the SAL varies depending on the class of the error:
The SAL error record definition also provides OEM-specific data fields for some error classes. This allows additional machine-specific data to be passed within the error record. This information can then be used by platform-specific management software to provide a more detailed analysis of errors. For more information, see "Access to MCA Error Records" later in this paper. Defining a standard error record format enables the operating system MCA code to:
If an error record that does not comply with the SAL definition is presented to the operating system MCA code, none of the features of the operating system MCA code will be usable and a generic blue screen will result for fatal errors. Additionally, no operating system processing can be performed on corrected errors that present non-compliant error records to the operating system. Such errors would be identified to the user as non-compliant.

Non-volatile Storage of Error Records

If the operating system MCA code determines that an MCA event is non-recoverable, it makes a best effort to take a crash dump and bug check. The machine is unstable while this type of error is processed and, as a result, the error record is not logged during the event handling. On each boot, the operating system MCA code calls into the SAL to query for any unprocessed MCA error records. The operating system MCA code retrieves and logs these records at that point. This mechanism requires that the SAL maintain the error record in non-volatile storage until the system reboots. This feature is required by the SAL 3.0 specification.

It is a logo requirement for 64-bit versions of Windows Server 2003 that the SAL maintain all non-recovered MCA error records across a system reboot and make them available to the operating system after the system has rebooted.

We further suggest that OEMs include an error analysis screen within their firmware. This is not a requirement, but it covers the situation in which a hardware error prevents the operating system from rebooting. The firmware could analyze the error record saved in NVRAM to identify the source of the error. The failing part could then be replaced to enable reboot.

Corrected-Error Processing

The policy of the Windows Server 2003 operating system MCA code is to publish all corrected error information delivered from the SAL.
Details of the mechanisms used to access the information are available in the "Access to MCA Error Records" section of this paper. It is the responsibility of the SAL to decide which corrected errors need to be communicated to the operating system and made available to system users. Three corrected error topics are discussed in this section:
Note that the default operating system MCA corrected-error policy described in this section of the guide is subject to change by either Microsoft or Itanium-based system vendors.

Thresholding and Clearing Corrected Errors

The SAL should ensure that the operating system MCA code is not flooded with continuous corrected-error events. If the system were polling for corrected events, such a situation would result in the error event context being overwritten and lost. If interrupts are being used, flooding with corrected events would negatively affect the CPU's performance and fill up the event log.

To ensure that the interrupt flooding scenario does not occur, the Windows Server 2003 operating system MCA code monitors the rate of delivery of corrected interrupt events. If a rate of ten interrupts within a 60-second period is detected, the operating system MCA code disables the interrupt and automatically switches to polled mode for corrected error events. This thresholding mechanism is applied separately to CPU (CMCI) and platform (CPEI) corrected error interrupts.

Having the operating system MCA code disable the interrupt is a fairly crude method of solving this problem. Firmware thresholding of corrected errors delivered to the operating system allows more flexible control over which corrected error sources are disabled.

The SAL must also ensure that the source of a corrected error is cleared during processing. An example of this is scrubbing memory on a single-bit ECC error. This scrubbing ensures that a single, transient corrected event does not result in multiple corrected events being reported to the operating system. The operating system MCA code does not have the context required to disable and clean up every corrected-error source within the system.

Mapping Out Failing Hardware

The operating system MCA code monitors corrected errors and checks for related events.
Related events are single-bit ECC errors to the same physical page or corrected errors on the same CPU. This capability allows the operating system MCA code to keep a count of the number of related events and take action to avoid the occurrence of non-correctable events, such as multi-bit ECC errors.

The operating system MCA code automatically attempts to remove any physical page of memory that experiences more than a preset limit of corrected error events. The data from the physical page is copied to another physical page and the failing physical page is no longer used by the operating system. When a paging event occurs, an event log entry and a WMI MCA event are created. See the "Access to MCA Error Records" section of this paper for more detail.

After a page has been mapped out, it is not used again as long as the operating system continues to run. However, if the system is rebooted, the operating system does not remember which physical pages were removed in a previous session and uses all physical memory that is made available by the system firmware.

Thresholding of Event Log Entries

Because the operating system MCA code monitors for delivery of related corrected events, it can also threshold the number of event log entries that are created for corrected events. A fixed amount of space is available for the event log. Error records produced for corrected MCA events are likely to be quite large on high-end server systems and could cause the event log to fill up with corrected error events.

In a situation where the same, or related, corrected events are being delivered to the operating system, it does not add value to create an event log entry for each occurrence of the event. The operating system MCA code disables event log creation for any related event that occurs within a five-second period of an original corrected event.
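The five-second rule above can be sketched as a small gate in front of event log creation. This is a minimal illustration in Python under stated assumptions, not the operating system's implementation: `EventLogThrottle`, `should_log`, and the source keys are hypothetical, and the sketch assumes that an event is "original" when no related event from the same source arrived in the preceding five seconds, and that only original events are logged.

```python
from typing import Dict, Hashable

WINDOW_SECONDS = 5.0  # the five-second period stated in the text


class EventLogThrottle:
    """Decides whether a corrected event should get an event log entry."""

    def __init__(self, window: float = WINDOW_SECONDS) -> None:
        self.window = window
        # Last delivery time per source key. The key is hypothetical, for
        # example ("page", physical_page_number) or ("cpu", cpu_number).
        self._last_seen: Dict[Hashable, float] = {}

    def should_log(self, key: Hashable, now: float) -> bool:
        last = self._last_seen.get(key)
        self._last_seen[key] = now
        # Assumption: log only "original" events, meaning no related event
        # was delivered within the preceding window.
        return last is None or (now - last) > self.window


throttle = EventLogThrottle()
page = ("page", 0x1234)
decisions = [throttle.should_log(page, t) for t in (0.0, 3.0, 7.0, 13.0)]
other_source = throttle.should_log(("cpu", 2), 1.0)
```

The sketch tracks the last delivery time rather than the last logged time, so a steady stream of related events stays suppressed until a quiet gap of more than five seconds. Whether the operating system treats such a stream the same way is not stated in the text.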
An original corrected event is one that occurs when no related event has been delivered in the preceding five-second period.

To further protect against filling up the event log with corrected error events, the operating system MCA code keeps a count of the total number of corrected error event log entries that have been created since the last operating system boot and allows no more than 20 entries to be created. When the 20-entry limit has been reached, no more event log entries are created for corrected events. When this limit is reached, an event log entry and a WMI MCA event are created. See the "Access to MCA Error Records" section of this paper for more detail.

The thresholding described in this section refers to event log creation only. A WMI event is generated for all corrected events that are delivered to the operating system, regardless of the event log limit, to ensure that all error information is made available to management software.

Access to MCA Error Records

All corrected and fatal error records passed to the operating system MCA code are delivered as events to WMI, using the WMI Windows Driver Model (WDM) provider. Three base WMI classes are defined for MCA:

MSMCAInfo_RawCMCEvent                  CPU-corrected event
MSMCAInfo_RawMCAEvent                  Uncorrected event
MSMCAInfo_RawCorrectedPlatformEvent    Platform-corrected event

Several other WMI MCA classes are defined that allow management software to register for subsets of these base classes. The WMI MCA classes are documented in the WMI Software Development Kit (SDK), which can be downloaded from:

The MCA classes are documented under "WMI SDK > WMI References > WMI Classes > MSMCA Classes."

When an MCA-related event is delivered to WMI, two things happen:
Registration for, and delivery of, MCA events uses the standard Event Consumer WMI-specified interfaces. For documentation on this topic, see the WMI SDK under "WMI SDK > Using WMI > Monitoring Events."

Additionally, the error records can be accessed directly from the event log. A string is supplied with each MCA event log entry to provide a first-level analysis of the error. The level of analysis possible by the operating system varies across the error classes. For more information, see the "Standard Error Record Format" section of this paper. The platform-specific errors will have a very generic error message.

These WMI features ensure that both event delivery and error record analysis are available to local and remote consumers.

To support the fatal error reboot sequence, as described earlier in "SAL Requirements for Operating System MCA," the WMI MCA model supports querying for fatal error events and error records, that is, MSMCAInfo_RawMCAData. During boot, the operating system MCA code retrieves any unprocessed errors from the SAL and maintains them until WMI is initialized. At that point, the events are delivered to WMI. If an application misses this delivery, it can query WMI for the event at a later point or access the event log directly. There is no support for querying WMI for corrected errors that were generated before the reboot.

WMI event registration can be permanent or temporary:
Both registration methods are described in the WMI SDK documentation.

Error Recovery

The MCA framework in Windows XP 64-bit Edition and the 64-bit versions of Windows Server 2003 will enable future work on operating system MCA error recovery. Windows Server 2003 does not currently support operating system MCA error recovery.

Error recovery is performed at different levels within the system. Some errors can be corrected by the hardware, some by the processor abstraction layer (PAL), and others by the SAL. Some categories of error require access to operating system context for correction. The use of a standard error record to pass the error information to the operating system MCA code is a key enabler for operating system MCA error recovery.

For certain types of platform-specific errors, the operating system does not have enough knowledge to attempt recovery, as discussed in the "Standard Error Record Format" section of this paper. The SAL could potentially recover from these classes of errors.

Work is ongoing to investigate and prioritize the error scenarios from which the operating system could recover, and it will continue in future Windows operating system releases.

Error containment will be key to future operating system MCA error recovery work. The 64-bit MCA model allows the OEM to use hardware error containment techniques to restrict error propagation. The SAL can use the error record to communicate the error containment status to the operating system. The better the error containment, the better the chance that the operating system MCA code can recover from the error.

The most obvious recovery scenario would be a fatal error in memory or in an unmodified cache-line data field. Levels of correction can be applied with each error type. Generally, cruder recovery techniques are easier to implement, but results are not optimal.
For the fatal-memory error scenario, if the failing address is in User mode, the process owning the memory could be terminated, and the page containing the failing location could be removed from the operating system working set. This is a crude solution, but it is better than a system crash.

The 64-bit MCA model provides the interface to the operating system for all hardware errors within the system that are reported as machine checks. As new reliability features such as multipath I/O or dynamic system partitioning are added to future releases of the operating system, MCA can play a key role in enabling these features to recover from what would now be fatal hardware errors.

Call to Action

For OEMs who are implementing SALs:
For software vendors who implement management applications to access MCA information:
Resources

