I/O Completion/Cancellation Guidelines
This topic provides I/O completion and cancellation guidelines for Windows Vista and Windows Server 2008. It describes how drivers can cause applications to hang and how they can cause application terminations to fail. It then discusses ways that driver developers and equipment and device manufacturers can prevent these failures, as well as support the new application-initiated I/O cancellation feature planned for Windows Vista and Windows Server 2008.
- Application Hangs and Drivers
- I/O Completion/Cancellation Guidelines
- Call to Action and Resources
Applications that hang and cannot be terminated (also referred to as unresponsive, frozen, or locked-up applications) are the leading cause of unplanned reboots. Hangs are more frequent than crashes for most popular applications and contribute significantly to customer dissatisfaction. Although drivers do not contribute to a large proportion of application hangs, they are almost always responsible for application termination failures. Driver-related hangs are by far the most painful, because they nearly always require the user to reboot the computer.
Driver writers and manufacturers can prevent driver-related hangs in Windows Vista and Windows Server 2008 by implementing the new I/O completion/cancellation guidelines presented in this paper. Another key reason to implement the guidelines is to support application-initiated I/O cancellation, which is a Windows Vista and Windows Server 2008 feature that developers of new applications will be encouraged to implement. I/O Manager enhancements that enable this feature include the new abilities to:
- Cancel create
- Cancel a specific I/O request packet (IRP)
- Cancel a synchronous IRP from another thread
This topic is organized into two sections:
- The first section discusses ways that drivers can cause applications to hang or cause application termination to fail and then discusses ways to prevent these problems.
- The second section presents the I/O completion/cancellation guidelines, which are the basis for all proposed solutions.
Application Hangs and Drivers
This section describes some of the reasons that drivers can cause application termination to fail. It shows why the current Windows architecture cannot gracefully terminate an application without cooperation from drivers that are handling I/O requests for that application.
Application termination hangs and application hangs are different. The most common application hang is an application that waits indefinitely for an I/O request to finish. Frequently, such applications were written with local file I/O in mind, and they hang when accessing remote shares, especially over slow links. In most cases, the user quickly becomes impatient and tries to close the application. Usually the application terminates successfully, but sometimes termination fails and the user is unable to close or re-launch the application. This is an application termination hang, and the user has no choice but to reboot the computer.
Other examples of applications that wait indefinitely on I/O include applications that read from named pipes, the network, or the keyboard. In these cases the application depends on some external event that should not be relied upon to occur in a finite time.
If an application needs to terminate such requests, it should cancel the request. During process termination, the system walks the list of I/O requests for a process and attempts to cancel each one. This paper discusses why drivers need to implement cancellation and timely completion of I/O requests.
Common Reasons for Application Termination Hangs
Applications fail to terminate for such common reasons as the following:
- Drivers block a user thread using kernel-mode waits. The process-termination asynchronous procedure call (APC) cannot satisfy these waits, so the process termination logic cannot execute. Drivers should queue the I/O request and return STATUS_PENDING to the I/O Manager, but many drivers wait instead because synchronous I/O is simpler to write, for example because it allows structures to be allocated on the stack. Note that many driver developers come from other environments, such as earlier versions of Windows and UNIX, which do not support asynchronous I/O.
- Drivers don't implement cancellation for a number of reasons, including the following:
- Some drivers, for example storage stack drivers, are written with the assumption that their I/O requests can be completed within a finite time. In practice, a bus reset or storage on an unreliable network can result in long time-outs.
- The driver stack is layered, and there is no simple way to pass the cancellation down the stack. The IRP could stop in the middle of a layer and never get passed down, as in the case of the network stack, which is protected by time-outs that mitigate this problem.
- Cancellation logic is difficult to get right. Cancellation is inherently a race: the cancellation code races with the normal processing of an I/O request. The queuing and synchronization logic required to create reliable code is complicated. The Windows Driver Foundation and the Cancel-Safe IRP Queues library simplify this logic, and driver writers should use them to simplify their queues.
- Cancellation is difficult to test. Tools to support cancellation testing are planned, and adherence to the guidelines that are listed later in this paper is strongly encouraged. Driver developers as well as system and device manufacturers need to add adequate cancellation tests to their existing test suites for Windows Vista and Windows Server 2008.
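The race between cancellation and normal completion can be illustrated with a small user-mode model. This is only a sketch under stated assumptions, not driver code: `model_irp`, `model_try_complete`, and `model_cancel` are hypothetical names, and the atomic exchange stands in for the role that clearing the cancel routine (for example with `IoSetCancelRoutine(Irp, NULL)`) plays in a real driver. Whichever path swaps the routine pointer to NULL first owns the request and completes it, so the request is completed exactly once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-mode stand-in for an IRP; not the WDK IRP structure. */
enum { MODEL_PENDING = 0, MODEL_SUCCESS = 1, MODEL_CANCELLED = 2 };

typedef struct model_irp model_irp;
typedef void (*model_cancel_fn)(model_irp *);

struct model_irp {
    _Atomic(model_cancel_fn) cancel_routine; /* non-NULL while cancelable */
    atomic_bool cancel_requested;            /* set by the canceling side */
    int completions;                         /* must end at exactly 1 */
    int status;
};

/* Completing a request twice is the bug this model guards against. */
static void model_complete(model_irp *irp, int status)
{
    assert(irp->completions == 0);
    irp->status = status;
    irp->completions++;
}

/* Cancel routine: runs only if the cancel path won the exchange. */
void model_on_cancel(model_irp *irp)
{
    model_complete(irp, MODEL_CANCELLED);
}

void model_irp_init(model_irp *irp, model_cancel_fn routine)
{
    atomic_init(&irp->cancel_routine, routine);
    atomic_init(&irp->cancel_requested, false);
    irp->completions = 0;
    irp->status = MODEL_PENDING;
}

/* Normal completion path. Returns false if cancellation already owns
   the request, in which case this path must not touch it again. */
bool model_try_complete(model_irp *irp)
{
    model_cancel_fn none = NULL;
    model_cancel_fn old = atomic_exchange(&irp->cancel_routine, none);
    if (old == NULL)
        return false;
    model_complete(irp, MODEL_SUCCESS);
    return true;
}

/* Cancellation path. Returns false if normal completion won the race. */
bool model_cancel(model_irp *irp)
{
    model_cancel_fn none = NULL;
    atomic_store(&irp->cancel_requested, true);
    model_cancel_fn old = atomic_exchange(&irp->cancel_routine, none);
    if (old == NULL)
        return false;
    old(irp);
    return true;
}
```

In this model, calling `model_try_complete` and `model_cancel` in either order completes the request once; the loser of the exchange sees NULL and backs off. A real driver must also protect its queue, which is the part that the Cancel-Safe IRP Queues library takes over.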
Why the I/O Subsystem Depends on Drivers
The I/O system and the driver model are an extension of the kernel, so drivers run with the same privilege level as the rest of the operating system. They have full access to memory, can map memory or I/O registers into user-mode processes, and can program the hardware to do direct memory access (DMA). Therefore, the process termination path has to wait for the driver to complete the I/O request. For example:
- A driver has received an I/O request from an application. It locks the application’s pages and starts DMA to those pages. If the process terminated and the application pages were reclaimed, then DMA transfers could be made to freed memory, potentially corrupting the system.
- A driver has mapped some memory or I/O registers. This is becoming popular with the arrival of sophisticated hardware like TCP Offload. In this case the process address space cannot be torn down without notifying the driver and waiting for it to un-map that memory.
- A driver has allocated an event on its thread stack and is blocking the thread. The event will be signaled by the completion of the I/O request, for example in an interrupt service routine (ISR) or deferred procedure call (DPC). If the thread were reclaimed, the DPC would signal an event that is now in freed memory.
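The dependency described above can be modeled in a few lines of portable C. This is a simplified sketch, not the actual process-termination code: `model_process`, `model_terminate`, and the other names are hypothetical, and the model assumes that a request whose driver supports cancellation completes as soon as cancellation is requested. It shows that termination can only ask for cancellation; the address space can be torn down only after every driver has given its request back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_REQUESTS 8

/* Hypothetical stand-in for one outstanding I/O request. */
typedef struct {
    bool in_flight;        /* the driver still owns the request */
    bool cancel_requested; /* set by the termination path */
    bool cancelable;       /* the driver implemented cancellation */
} model_request;

typedef struct {
    model_request requests[MODEL_MAX_REQUESTS];
    size_t count;
} model_process;

/* Records a new outstanding request and returns its index. */
size_t model_issue(model_process *p, bool cancelable)
{
    assert(p->count < MODEL_MAX_REQUESTS);
    model_request *r = &p->requests[p->count];
    r->in_flight = true;
    r->cancel_requested = false;
    r->cancelable = cancelable;
    return p->count++;
}

/* Termination path: walk the request list and try to cancel each one.
   Returns the number of requests that are still owned by a driver. */
size_t model_terminate(model_process *p)
{
    size_t remaining = 0;
    for (size_t i = 0; i < p->count; i++) {
        model_request *r = &p->requests[i];
        if (!r->in_flight)
            continue;
        r->cancel_requested = true;
        if (r->cancelable)
            r->in_flight = false; /* driver completes it promptly */
        else
            remaining++;          /* termination must keep waiting */
    }
    return remaining;
}

/* The driver finishes a request on its own schedule. */
void model_driver_completes(model_process *p, size_t index)
{
    assert(index < p->count);
    p->requests[index].in_flight = false;
}

/* Memory may be reclaimed only when no driver still owns a request. */
bool model_can_tear_down(const model_process *p)
{
    for (size_t i = 0; i < p->count; i++)
        if (p->requests[i].in_flight)
            return false;
    return true;
}
```

If one of two outstanding requests is not cancelable, `model_terminate` returns 1 and `model_can_tear_down` stays false until the driver completes that request. In the model, as in the scenarios above, a driver that never completes the request leaves the process unable to exit.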
Ways To Prevent Driver-Related Hangs
To prevent driver-related hangs, it is necessary to adhere to the I/O completion/cancellation guidelines presented in the next section. There are several ways to accomplish this:
- Use the Cancel-Safe IRP Queues library included in the Windows DDK. With this library, driver writers can implement queues without having to worry about cancellation logic. The library does the hard work and can also be used for previous releases of Windows, such as Windows 2000.
- For new drivers, the Windows Driver Foundation, which is currently being developed, will be the recommended approach. It simplifies many difficult aspects of the Windows Driver Model, including Plug and Play (PnP), power management (PM), Windows Management Instrumentation (WMI), and asynchronous I/O. The Windows Driver Foundation provides a synchronous model wherein the driver can block (similar to Windows 98 or UNIX), but also ensures that I/O requests can be cancelled.
- Use the information provided in the white paper "Cancel Logic in Windows Drivers" listed in the resources section at the end of this paper. It provides detailed technical information about cancellation, including code samples that support cancellation.
- Many books on drivers now contain detailed descriptions of how to implement cancellation. Programming the Microsoft Windows Driver Model, Second Edition, by Walter Oney, also contains a description of the cancel-safe queues and an explanation of how to use them.
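To show what a cancel-safe queue takes off the driver writer's hands, the following is a minimal user-mode model of the idea. It is a sketch under assumptions, not the library itself: `csq_model`, `csq_insert`, `csq_remove_next`, and `csq_cancel` are hypothetical names chosen to echo the library's insert and remove operations, and a C11 `atomic_flag` spin lock stands in for whatever lock a real driver supplies. The key property is that removal and cancellation both take the same lock, so exactly one of them gets ownership of any given request.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { CSQ_PENDING = 0, CSQ_CANCELLED = 1 };

/* Hypothetical queued request; not the WDK IRP structure. */
typedef struct csq_item {
    struct csq_item *next;
    int id;
    int status;
} csq_item;

typedef struct {
    atomic_flag lock; /* spin lock guarding the list */
    csq_item *head;
} csq_model;

static void csq_lock(csq_model *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
        /* spin until the holder releases the lock */
    }
}

static void csq_unlock(csq_model *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

void csq_init(csq_model *q)
{
    atomic_flag_clear(&q->lock);
    q->head = NULL;
}

/* Appends a request; from now on it can be dequeued or canceled. */
void csq_insert(csq_model *q, csq_item *item)
{
    assert(item != NULL);
    csq_lock(q);
    csq_item **link = &q->head;
    while (*link != NULL)
        link = &(*link)->next;
    item->next = NULL;
    item->status = CSQ_PENDING;
    *link = item;
    csq_unlock(q);
}

/* Normal processing: take ownership of the oldest request, or NULL. */
csq_item *csq_remove_next(csq_model *q)
{
    csq_lock(q);
    csq_item *item = q->head;
    if (item != NULL) {
        q->head = item->next;
        item->next = NULL;
    }
    csq_unlock(q);
    return item;
}

/* Cancellation: succeeds only if the request is still queued. If normal
   processing already removed it, that path owns it and this returns false. */
bool csq_cancel(csq_model *q, csq_item *item)
{
    bool removed = false;
    csq_lock(q);
    for (csq_item **link = &q->head; *link != NULL; link = &(*link)->next) {
        if (*link == item) {
            *link = item->next;
            item->next = NULL;
            removed = true;
            break;
        }
    }
    csq_unlock(q);
    if (removed)
        item->status = CSQ_CANCELLED; /* complete outside the lock */
    return removed;
}
```

A request canceled while queued is unlinked and completed as cancelled; a request already handed to normal processing by `csq_remove_next` is left alone by `csq_cancel`. The model completes the canceled request only after releasing the lock, which keeps completion work out of the critical section.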
I/O Completion/Cancellation Guidelines
The following guidelines apply to all drivers that ship with, or are intended to be used with, Windows Vista and Windows Server 2008.
Drivers written for previous versions of Windows may not automatically satisfy these guidelines due to an important change for Windows Vista and Windows Server 2008, namely support for cancellation of IRP_MJ_CREATE IRPs.
Use of the Windows Driver Foundation or Cancel-Safe IRP Queues library is strongly recommended, because they automatically implement these guidelines.
- Reasonable period
Here, reasonable period means less than 10 seconds for most operations and their cancellations. This time is derived from a user's tolerance for delays when closing an application or canceling an I/O operation that the user perceives is taking too long. It should be much shorter for most operations. There may also be legitimate reasons for it to be longer for some types of devices and/or operations.
- IRPs
IRPs are issued by the I/O Manager on behalf of a user-mode application.
- Long-term IRPs
Long-term IRPs are IRPs that take more than a reasonable period to complete.
- Pend
Pend means a driver should return STATUS_PENDING and mark the IRP pending.
- All IRPs (including Create) that can take an indefinite amount of time must be able to be cancelled. These are waits that block on user-initiated events, for example named pipes or waiting for keyboard input.
- Close-IRPs should never block for more than a reasonable period.
- All long-term IRPs should be pended. Drivers should not block a user-mode thread (for example, to acquire a mutex) inside their dispatch routines for more than a reasonable period.
- Whenever a driver pends an IRP, including Create, it must either:
- Support IRP cancellation; or
- Complete the operation within a reasonable period. This may require implementing time-outs. The only exception is for hardware that is malfunctioning.
- If a driver creates new IRPs that are passed to other drivers, then it must pass on cancellation or be able to disassociate these IRPs from the original IRP issued by the I/O Manager.
- All IRPs should be completed in a reasonable period once canceled. A driver that is about to complete an IRP for anything other than the current thread must be suspension-proof. The only exceptions are delays that are caused by hardware that is malfunctioning. If the hardware is likely to malfunction frequently, then the driver should have sufficient recovery mechanisms to complete the IRP in a reasonable period.
- A driver should never pend a canceled IRP.
- Drivers should not have any path that would miss a cancellation unless the IRP will be completed shortly anyway (code just has to run forward).
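Several of the guidelines above can be read as a decision procedure for a dispatch routine. The following user-mode model is one possible reading, offered as a sketch rather than a definitive implementation: all names (`model_irp`, `model_dispatch`, `model_timer_tick`, `model_cancel`) and the `ready` flag are hypothetical, and the 10-second constant is taken from the definition of a reasonable period. It never pends an already-canceled request, pends anything that cannot complete immediately, and gives each pended request either cancellation support or a time-out.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_SUCCESS = 0, MODEL_PENDING = 1, MODEL_CANCELLED = 2, MODEL_TIMEOUT = 3 };

/* Upper bound from the definition of "reasonable period". */
#define REASONABLE_PERIOD_MS 10000u

/* Hypothetical user-mode stand-in for an IRP; not the WDK IRP structure. */
typedef struct {
    bool cancel_requested; /* cancellation was requested */
    bool ready;            /* the operation can finish without waiting */
    bool pended;           /* the driver returned "pending" for it */
    bool cancelable;       /* the driver honors cancellation for it */
    unsigned deadline_ms;  /* used only when the request is not cancelable */
    int status;
} model_irp;

static int model_complete(model_irp *irp, int status)
{
    irp->pended = false;
    irp->status = status;
    return status;
}

/* Dispatch: decide whether to complete inline or pend. */
int model_dispatch(model_irp *irp, bool driver_supports_cancel, unsigned now_ms)
{
    if (irp->cancel_requested)               /* never pend a canceled IRP */
        return model_complete(irp, MODEL_CANCELLED);
    if (irp->ready)                          /* short operation: finish now */
        return model_complete(irp, MODEL_SUCCESS);
    irp->pended = true;                      /* long-term: do not block */
    irp->cancelable = driver_supports_cancel;
    if (!driver_supports_cancel)             /* otherwise bound it by time */
        irp->deadline_ms = now_ms + REASONABLE_PERIOD_MS;
    irp->status = MODEL_PENDING;
    return MODEL_PENDING;
}

/* Periodic timer: enforce the time-out on non-cancelable requests. */
void model_timer_tick(model_irp *irp, unsigned now_ms)
{
    if (irp->pended && !irp->cancelable && now_ms >= irp->deadline_ms)
        model_complete(irp, MODEL_TIMEOUT);
}

/* Cancel request: completes promptly only if the driver supports it. */
bool model_cancel(model_irp *irp)
{
    irp->cancel_requested = true;
    if (irp->pended && irp->cancelable) {
        model_complete(irp, MODEL_CANCELLED);
        return true;
    }
    return false;
}
```

In the model, a request that arrives already canceled is completed as cancelled without being pended, a cancelable pended request completes as soon as `model_cancel` is called, and a non-cancelable one is completed by the timer once its deadline passes. The model ignores the race between these paths for brevity; a real driver has to resolve it with proper synchronization.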
Call to Action:
- For developers and manufacturers: Understand that hangs are the leading cause of reboots and a significant contributor to customer dissatisfaction.
- For driver developers: Ensure that drivers you are responsible for or will be writing for Windows Vista and Windows Server 2008 fully adhere to the I/O completion/cancellation guidelines, where possible.
- If you are implementing a new driver, strongly consider using the Windows Driver Foundation. It will simplify your work and ensure that your drivers support cancellation.
- Consider using the Cancel-Safe IRP Queues library to ensure cancelability with little effort on your part.
- Create your own completion and/or cancellation implementation using the guidelines and information in "Cancel Logic in Windows Drivers."
- Ensure that test suites include thorough testing to ensure adherence to the guidelines.
- In a few cases, adherence to the guidelines may be risky or virtually impossible for some devices. Understand the limitations and trade-offs implied by the guidelines. Encourage the hardware provider to rectify the problem as soon as possible.
- For device and system manufacturers: Ensure that drivers you ship with your product fully adhere to the I/O completion/cancellation guidelines, where possible.
- Ensure that your driver developers read and understand the guidelines and implementation options.
- Ensure that test suites for your products include thorough testing to ensure adherence to the guidelines.
- If adherence to the guidelines is risky or virtually impossible for your products, attempt to rectify the underlying problem as soon as possible.