Dump Switch Support for Windows
Updated: December 4, 2001
Introduction

This article describes a simple hardware device called a "dump switch," which can be added to a server system to force the system to crash and write a crash dump. Microsoft Windows 2000, Windows XP, and Windows NT 4.0 Service Pack 4 (SP4) include support for dump switches. This article also describes how to enable the crash dump capability.

For developers, the ability to analyze a crash dump is an essential part of eliminating reliability problems such as system stalls and crashes in operating systems, device drivers, and applications. Many crashes cause a system to freeze in such a way that a user's only recourse is to perform a hard reset, which erases any information that would support an analysis of the problem. Developers would benefit from the information the system could provide if it performed a memory dump before the hard reset.

Hardware Prototype for Dump Switch

The objective of the hardware prototype is to ensure that a user can mechanically force the system to issue a non-maskable interrupt (NMI). The Microsoft prototype for the dump switch hardware is a momentary push button mounted on a Peripheral Component Interconnect (PCI) card. (There is an old debugger's trick of shorting the last two pins in an ISA slot, which causes an NMI. This solution is obviously not supportable, but could be easily implemented on an ISA card. A PCI solution was chosen because robust servers should not rely on ISA dependencies, and it is probable that future servers will be ISA free. The PCI solution is only slightly more difficult.)

The NMI that results from pressing the button causes the Windows NT system dump routine to copy the contents of memory to the hard disk for later analysis. The Microsoft implementation is intended to provide proof of concept only, and there is no intent to create a PCI-based dump switch product.
Instead, Microsoft recommends that server vendors embed this function into the motherboard and expects that OEMs will begin shipping this capability based on standardized components shortly. For example, Intel will provide this capability on a forthcoming Pentium II Xeon server system board.

The circuitry on the PCI card, which is extremely simple, is connected to power, ground, system error (SERR#), and the clock line (CLK) through the PCI bus. The momentary push button is mounted at the rear bulkhead of the PCI adapter card. In the proof-of-concept design, there are no PCI configuration registers, nor is there any need for the card to be recognized and configured into the system. This should not pose problems because the card uses no system resources other than a slot, and it can be invisible during normal operation.

The button is debounced with a capacitor and resistor, and then input to a Schmitt trigger. The trigger feeds a latch and the logic that implements the appropriate state machines in a programmable array logic (PAL) device. The state machines assert SERR# for one clock pulse and then tri-state the signal in accordance with the PCI specification. The following is a schematic diagram.
Figure 1. Schematic for Dump Switch
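
The one-clock SERR# pulse described for the prototype can be modeled in software. The following Python sketch is only an illustration: the article does not give the PAL equations, so the three states (IDLE, ASSERT, WAIT_RELEASE), the function name simulate_serr, and the rule that the button must be released before the next pulse are all assumptions made for this example.

```python
def simulate_serr(button_samples):
    """Return the modeled SERR# driver state for each PCI clock.

    button_samples: iterable of booleans, one per clock, giving the
    debounced and latched button level (True means pressed).
    Output per clock: "0" when SERR# is driven low (asserted, since the
    signal is active low) and "Z" when the driver is tri-stated.
    The state names and the release rule are hypothetical.
    """
    state = "IDLE"
    trace = []
    for pressed in button_samples:
        if state == "IDLE":
            trace.append("Z")          # driver released while waiting
            if pressed:
                state = "ASSERT"       # pulse begins on the next clock
        elif state == "ASSERT":
            trace.append("0")          # drive SERR# low for exactly one clock
            state = "WAIT_RELEASE"
        else:  # WAIT_RELEASE
            trace.append("Z")          # tri-state again after the pulse
            if not pressed:
                state = "IDLE"         # re-arm only once the button is released
    return trace
```

Under this model, simulate_serr([False, True, True, True, False]) returns ['Z', 'Z', '0', 'Z', 'Z']: a button held for several clocks still produces a single one-clock assertion.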
Software Solution for Dump Switch

There are three basic approaches that could have been chosen to handle the NMI.
Given these choices, option #2 was chosen as the best compromise solution.

Enabling and Disabling Crash Dump Capabilities

To enable this capability, a user needs to use RegEdit or a similar registry editor to go to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl and add the value NMICrashDump as a DWORD, where 0 means no crash dump and 1 means crash dump. With the value set to 1, the global variable HalpNMIDumpFlag will be set to true.

HAL Changes to Support Dump Switch

The HalHandleNMI() routine has been modified to call KeBugCheckEx() if the global BOOLEAN HalpNMIDumpFlag is set to true. During system initialization, the variable is read from the registry into non-pageable memory. Specifically, the hardware abstraction layer (HAL) reads the registry entry into HalpNMIDumpFlag at the end of PCI bus initialization. The registry value used is HKLM\System\CurrentControlSet\Control\CrashControl\NMICrashDump. It is a DWORD entry with a value of 0 (default behavior) or 1 (bug check on NMI). If the entry does not exist, the system assumes the default behavior (0).

The following HALs have been updated with the HalHandleNMI() changes to enable dump switch support:
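
The registry setting and the flag check described in this section can be sketched in Python. This is a rough illustration under stated assumptions, not the HAL's code: the function names nmi_crash_dump_reg, read_nmi_dump_flag, and handle_nmi are hypothetical, the dictionary stands in for the CrashControl key, and the returned strings stand in for the real call to KeBugCheckEx(). The .reg header shown is the one used by Windows 2000 and later; Windows NT 4.0 registry files use the header REGEDIT4 instead.

```python
CRASH_CONTROL_KEY = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl"
)


def nmi_crash_dump_reg(enabled):
    """Build .reg file text that sets NMICrashDump to 1 or 0.

    Hypothetical helper: the article describes editing the value by
    hand in RegEdit; a .reg file is one alternative way to apply it.
    """
    value = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",  # "REGEDIT4" on NT 4.0
        "",
        "[" + CRASH_CONTROL_KEY + "]",
        '"NMICrashDump"=dword:{:08x}'.format(value),
        "",
    ]
    return "\r\n".join(lines)


def read_nmi_dump_flag(crash_control_values):
    """Model the startup read of NMICrashDump into HalpNMIDumpFlag.

    crash_control_values is a plain dict standing in for the registry
    key. A missing entry means the default behavior (0). The article
    documents only 0 and 1, so treating any other number as disabled
    is an assumption of this sketch.
    """
    return crash_control_values.get("NMICrashDump", 0) == 1


def handle_nmi(nmi_dump_flag):
    """Model the modified HalHandleNMI() decision.

    Returns "bugcheck" where the real routine would call
    KeBugCheckEx(), and "default" for the unmodified NMI handling.
    """
    return "bugcheck" if nmi_dump_flag else "default"
```

For example, handle_nmi(read_nmi_dump_flag({})) follows the default path because the value is absent, while handle_nmi(read_nmi_dump_flag({"NMICrashDump": 1})) takes the bug-check path.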
Future Directions for Dump Switch Support

There are many areas where this functionality can be enhanced or incorporated into other forms of platform manageability. One extremely interesting area would be to incorporate this type of capability into out-of-band management devices. An example would be a remote-site administrator who could access a stalled server system by way of a Local Area Network (LAN) or dial-up link, forcing a system dump and resetting the system to facilitate both root cause analysis and quick recovery. This would eliminate the need for a technician to be physically on the site.

Call to action for "dump switch" capabilities:
