
Dump Switch Support for Windows

Updated: December 4, 2001


On This Page

Introduction
Hardware Prototype for Dump Switch
Software Solution for Dump Switch
Enabling and Disabling Crash Dump Capabilities
HAL Changes to Support Dump Switch
Future Directions for Dump Switch Support


Introduction

This article describes a simple hardware device called a "dump switch," which can be added to a server system to force the system to crash and write a crash dump. Microsoft Windows 2000, Windows XP, and Windows NT 4.0 with Service Pack 4 (SP4) include support for dump switches. This article also describes how to enable the crash dump capability.

For developers, being able to analyze a crash dump is an essential part of eliminating reliability problems such as system stalls and crashes in operating systems, device drivers, and applications. Many crashes cause a system to freeze in such a way that a user's only recourse is to perform a hard reset, which erases any information that would support an analysis of the problem. Developers would benefit from the information the system could provide if it performed a memory dump before the hard reset.

Hardware Prototype for Dump Switch

The objective of the hardware prototype is to ensure that a user is able to mechanically force the system to issue a non-maskable interrupt (NMI). The Microsoft prototype for the dump switch hardware is a momentary push button mounted on a Peripheral Component Interconnect (PCI) card. (There is an old debugger's trick of shorting out the last two pins in an ISA slot, which causes an NMI. This solution is obviously not supportable, but it could be easily implemented on an ISA card. A PCI solution was chosen because robust servers should not rely on ISA dependencies, and it's probable that future servers will be ISA free. The PCI solution is only slightly more difficult.)

The NMI that results from pressing the button causes the Windows NT system dump routine to copy the contents of the memory to the hard disk for later analysis. The Microsoft implementation is intended to provide proof of concept only, and there is no intent to create a PCI-based dump switch product. Instead, Microsoft is recommending that server vendors embed this function into the motherboard and expects that OEMs will begin shipping this capability based on standardized components shortly. For example, Intel will provide this capability on a forthcoming Pentium II Xeon server system board.

The circuitry on the PCI card, which is extremely simple, is connected to the power, ground, system error (SERR#), and clock (CLK) lines of the PCI bus. The momentary push button is mounted at the rear bulkhead of the PCI adapter card. In the proof-of-concept design, there are no PCI configuration registers, nor is there any need for the card to be recognized and configured into the system. This should not pose problems because the card uses no system resources other than a slot, and it can be invisible during normal operation.

The button is debounced with a capacitor and resistor, and the result is input to a Schmitt trigger. The trigger feeds a latch and the logic that implements the appropriate state machines on a programmable array logic (PAL) device. The state machines assert SERR# for one clock pulse and then tri-state the signal in accordance with the PCI specification. The following is a schematic diagram.
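The one-clock pulse described above can be modeled in software to make the intended behavior concrete. The following is a minimal sketch in C, not the actual PAL equations: the state names, the function name dump_switch_clock, and the rule that a held button produces only one pulse are all assumptions made for illustration. SERR# is active low, so "asserted" is modeled as driving the line low.

```c
#include <stdbool.h>

/* What the card does to the SERR# line during one PCI clock. */
typedef enum { SERR_TRISTATED, SERR_DRIVEN_LOW } SerrDrive;

/* Hypothetical states; the article does not name the PAL's states. */
typedef enum { PULSE_IDLE, PULSE_ASSERT, PULSE_WAIT_RELEASE } PulseState;

/* Advance the model by one clock edge.
 * `pressed` is the debounced, latched button level sampled on that edge.
 * Updates *state and returns how SERR# is driven during the new clock. */
SerrDrive dump_switch_clock(PulseState *state, bool pressed)
{
    switch (*state) {
    case PULSE_IDLE:
        /* A press starts the one-clock pulse. */
        if (pressed) *state = PULSE_ASSERT;
        break;
    case PULSE_ASSERT:
        /* The pulse lasts exactly one clock, then the line is released. */
        *state = PULSE_WAIT_RELEASE;
        break;
    case PULSE_WAIT_RELEASE:
        /* Assumption: holding the button must not cause a second pulse. */
        if (!pressed) *state = PULSE_IDLE;
        break;
    }
    return (*state == PULSE_ASSERT) ? SERR_DRIVEN_LOW : SERR_TRISTATED;
}
```

Calling dump_switch_clock once per simulated clock with the button held down yields a single SERR_DRIVEN_LOW result followed by SERR_TRISTATED until the button is released and pressed again.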

Figure 1. Schematic for Dump Switch

Software Solution for Dump Switch

There are three basic approaches that could have been chosen to handle the NMI.

  1. Bug-check during NMI if Windows NT detected that the card was causing the NMI and that a registry key was set. This would totally preserve the default system behavior. Unfortunately, without significantly adding to the hardware and microcode complexity, Windows NT cannot detect that an NMI came from this card. Future extensions in hardware and microcode architecture should be considered to allow this, along with options for a more fault-resilient architecture when an NMI is issued. Because this work is not yet complete, this approach was eliminated for the first implementation.

  2. Bug-check during NMI if a registry key was set. This would preserve default behavior, which is no bug-check, until the user enabled the registry key setting. However, Windows NT still cannot tell whether an NMI came from this card. Because there are rare instances where KeBugCheckEx() might not be a safe call during HalHandleNMI(), setting this registry key forces the user to accept the slight risk associated with it. This is currently the best option for handling the NMI.

  3. Bug-check during every NMI, unconditionally. With this option, the user would never explicitly accept the slight additional risk by setting a registry key, so the safety issue would not have been addressed.

Given these choices, option #2 was chosen as the best compromise solution.
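The difference between the three options comes down to which conditions must hold before the system bug-checks. The following sketch in C expresses each option as a predicate. The enum, the function name policy_bugchecks, and the parameter names are hypothetical and exist only to compare the options; note that nmi_from_card is an input that Windows NT cannot actually determine today, which is why option 1 was eliminated.

```c
#include <stdbool.h>

/* The three approaches, numbered as in the list above. */
typedef enum {
    POLICY_CARD_AND_KEY = 1, /* option 1: card caused the NMI and key is set */
    POLICY_KEY_ONLY     = 2, /* option 2: key is set (the chosen approach)   */
    POLICY_ALWAYS       = 3  /* option 3: every NMI bug-checks               */
} NmiPolicy;

/* Returns true if the given policy would bug-check on this NMI. */
bool policy_bugchecks(NmiPolicy policy, bool nmi_from_card, bool key_set)
{
    switch (policy) {
    case POLICY_CARD_AND_KEY:
        return nmi_from_card && key_set;
    case POLICY_KEY_ONLY:
        return key_set;
    case POLICY_ALWAYS:
        return true;
    }
    return false; /* unknown policy: keep the default (no bug-check) */
}
```

Under option 2, an NMI from any source bug-checks once the key is set, which is the slight risk the user accepts by setting it.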

Enabling and Disabling Crash Dump Capabilities

To enable this capability, use RegEdit or a similar registry editor to go to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl and add the value NMICrashDump as a DWORD, where 0 means no crash dump and 1 means crash dump. To disable the capability, set the value back to 0. When the value is 1, the global variable HalpNMIDumpFlag is set to true the next time the system initializes.

HAL Changes to Support Dump Switch

The HalHandleNMI() routine has been modified to call KeBugCheckEx() if the global BOOLEAN HalpNMIDumpFlag is set to true.
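The control flow of that change can be illustrated with a small user-mode model in C. This is a sketch under stated assumptions, not the HAL source: only the name HalpNMIDumpFlag comes from this article, while model_handle_nmi, stub_bug_check, and the NmiOutcome type are hypothetical. The stub stands in for KeBugCheckEx(), which in the real system does not return.

```c
#include <stdbool.h>

/* Mirrors the flag named in the article; everything else is hypothetical. */
bool HalpNMIDumpFlag = false;

/* Counts calls so the model can be observed without crashing anything. */
int stub_bugcheck_calls = 0;

/* Stand-in for KeBugCheckEx(); the real routine never returns. */
void stub_bug_check(void)
{
    stub_bugcheck_calls++;
}

typedef enum { NMI_DEFAULT_HANDLING, NMI_BUGCHECK_REQUESTED } NmiOutcome;

/* Model of the modified NMI path: bug-check only when the flag is set,
 * otherwise fall through to whatever the default handling was. */
NmiOutcome model_handle_nmi(void)
{
    if (HalpNMIDumpFlag) {
        stub_bug_check();
        return NMI_BUGCHECK_REQUESTED;
    }
    return NMI_DEFAULT_HANDLING;
}
```

With the flag left at false, the model reports default handling and the stub is never called, which matches the goal of preserving default behavior until the user opts in.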

During system initialization, the value is read from the registry into non-pageable memory. Specifically, the hardware abstraction layer (HAL) reads the registry entry into HalpNMIDumpFlag at the end of PCI bus initialization.

The registry value used is HKLM\System\CurrentControlSet\Control\CrashControl\NMICrashDump. It is a DWORD entry with a value of 0 (default behavior) or 1 (bug-check on NMI). If it does not exist, the system assumes the default behavior (0).
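The mapping from the registry entry to the flag, including the missing-entry case, can be sketched as a small C function. The function name and the use of a null pointer to represent a missing entry are illustrative choices, not part of the HAL. The article does not say how values other than 0 or 1 are treated; this sketch assumes they fall back to the default behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Derive the dump flag from the NMICrashDump DWORD.
 * `value` points to the DWORD read from the registry, or is NULL when
 * the entry does not exist (hypothetical convention for this sketch). */
bool nmi_dump_flag_from_registry(const uint32_t *value)
{
    if (value == NULL) {
        return false;  /* missing entry: default behavior (0) */
    }
    /* Assumption: only the documented value 1 enables the bug-check. */
    return *value == 1;
}
```

Treating unknown values as the default keeps the model conservative: a bug-check on NMI happens only when the user has explicitly set the documented value.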

The following HALs have been updated with the HalHandleNMI() changes to enable dump switch support:

halx86
halapic
halmps
halmpsm
hal486c

halmca
haloli
halsp
halcbus
halcbusm

Alpha HALs:
haltlev56
haltlev6


Future Directions for Dump Switch Support

There are many areas where this functionality can be enhanced or incorporated into other forms of platform manageability. One extremely interesting area would be to incorporate this type of capability into out-of-band management devices. For example, a remote-site administrator could access a stalled server system by way of a local area network (LAN) or dial-up link, force a system dump, and reset the system to facilitate both root cause analysis and quick recovery. This would eliminate the need for a technician to be physically on the site.

Call to action for "dump switch" capabilities:

  • Test the system dump capability in your platform designs and implement this feature in new systems.

  • Let Microsoft know which of your platforms will have this capability and which HALs you have tested.
