Audio Device Performance: Driver Best Practices
Updated: November 22, 2002
This article describes how to write a highly efficient audio driver for audio devices designed to work with Microsoft Windows XP and later versions of the Windows family of operating systems. Many of the techniques described are also relevant for Windows operating systems back to Windows 98 Second Edition. It provides guidelines for audio hardware vendors to maximize the performance of their devices. The reader should have a basic understanding of the material found in the Windows DDK, including the audio section and the PortCls subsection.
Introduction

The PC has become an important multimedia platform for many types of media applications. As the hardware industry develops increasingly capable hardware, the software industry takes advantage of those developments to improve its products. One area that the hardware industry can provide resources for is hardware acceleration. Improving driver/device performance reduces the workload on the host system, leaving more system resources for other tasks. Currently, some WDM audio drivers needlessly consume a large amount of system resources when playing audio content or mixing multiple streams. By following a few guidelines, an audio driver's performance impact on the system can be significantly reduced:
This paper covers the first two guidelines. An important metric for determining the efficiency of any driver/device is to measure the overall system performance while the system is performing hardware-intensive tasks, such as audio mixing or 3-D rendering. Note that this may not necessarily imply moving all processing to hardware: IHVs should carefully weigh the total cost (including system slowdown from PCI contention, memory bus traffic, and so on) when making a design decision.

Wave Filters

Wave filters represent devices that render or capture wave-formatted digital audio data. Applications typically access the capabilities of these devices through the Microsoft DirectSound API or through the Windows Multimedia waveOutXxx and waveInXxx functions. For information about the wave formats that Windows Driver Model (WDM) audio drivers can support, see WAVEFORMATEX and WAVEFORMATEXTENSIBLE in the Windows Driver Development Kit (Windows DDK).

A wave-rendering filter receives as input a digital audio stream. It outputs either an analog audio signal (to a set of speakers or external mixer) or a digital audio stream (for example, to an S/PDIF connector).

A wave-capture filter receives as input either an analog audio signal (from a microphone or input jack) or a digital stream (for example, from an S/PDIF connector). It outputs a wave stream containing digital audio data.

A single wave filter can simultaneously perform both rendering and capture. This type of filter is needed, for example, for an audio device that can play audio through a set of speakers while simultaneously recording audio through a microphone.

WaveCyclic versus WavePci

This section provides an introduction to the two port filter modules that are supported by PortCls, the port class driver for streaming audio miniport drivers. The wave ports provided by PortCls can be used to create two types of wave filters: WaveCyclic and WavePci.
A WaveCyclic filter can represent an audio device that resides on one of the following system buses: ISA, PCI, or PCMCIA. As the name "WavePci" implies, a WavePci filter usually represents a type of device that connects to a PCI bus, although, in principle, a WavePci device might instead connect to another type of bus. Conceptually, the name "WavePci" could have instead been coined "WaveScatterGather." Unlike the simpler devices supported by WaveCyclic, a device supported by WavePci must have scatter/gather DMA capabilities. An audio device that resides on the PCI bus but lacks scatter/gather DMA can be represented as a WaveCyclic filter but not as a WavePci filter. A WavePci device is able to perform DMA transfers to or from buffers that can be located at arbitrary memory addresses, and that begin and end with arbitrary byte alignments. In contrast, the DMA hardware for a WaveCyclic device requires only the ability to move data to or from a single buffer that the device's miniport driver allocates. A WaveCyclic miniport driver is free to allocate a cyclic buffer that meets the limited capabilities of its DMA channel. For example, the DMA channel for a typical WaveCyclic device might require a buffer that satisfies the following restrictions:
In return for this simplicity, however, a WaveCyclic device must rely on software copying of data to or from the cyclic buffer, but a WavePci device relies on the scatter/gather capabilities of its DMA hardware to avoid such copying. The IRPs that deliver wave audio data to a rendering device or retrieve data from a capture device are accompanied by data buffers; each of these buffers contains a portion of the audio stream that is being rendered or captured. A WavePci device is able to access these buffers directly through its DMA engine, whereas a WaveCyclic device requires that the data be copied to its cyclic buffer from the IRP, or vice versa. A WavePci filter has the additional advantage that it can control the amount of queuing of its rendering or capture stream. The queuing is determined by the following criteria:
In contrast, a WaveCyclic filter relies on the port driver's queuing/buffering policies, which are fixed and generally cannot be tuned. For more information, see "Factors Governing Wave-Output Latency" in the Windows DDK.

WaveCyclic Filter

A WaveCyclic filter is implemented as a port/miniport driver pair. A WaveCyclic filter factory creates a WaveCyclic filter as follows:
The code example in the "Subdevice Creation" topic of the Windows DDK illustrates this process. The port and miniport drivers communicate with each other through their IPortWaveCyclic and IMiniportWaveCyclic interfaces. The WaveCyclic filter's cyclic buffer always consists of a contiguous block of virtual memory. The port driver's implementation of the IDmaChannel::AllocateBuffer method always allocates a buffer that is contiguous in both physical and virtual memory address space. If the WaveCyclic device's DMA engine imposes additional constraints on the buffer memory, the miniport is free to implement its own buffer-allocation method to meet these constraints. A WaveCyclic miniport driver that asks for a large buffer (for example, eight physically contiguous memory pages) should accept a smaller buffer size if the operating system denies the original request. In Windows 98 and Windows Millennium Edition (Windows Me), a driver can be written with the assumption that it can always obtain its buffers at system startup time before significant memory fragmentation has occurred. This assumption is not valid for a Windows 2000 (or later) system, which might run for months between reboots. During this time, an audio device might occasionally be unloaded and reloaded to rebalance system resources. For more information, see "Stopping a Device to Rebalance Resources" in the Windows DDK. A WaveCyclic device with built-in, bus-mastering DMA hardware is called a master device. Alternatively, a WaveCyclic device can be a subordinate device with no built-in DMA-hardware capabilities. A subordinate device has to rely on the system DMA controller to perform any data transfers that it requires. 
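The advice above about accepting a smaller buffer when a large contiguous request is denied can be illustrated with a minimal sketch. Everything named here is hypothetical: `TryAllocate` is a stand-in for whatever allocation call the miniport driver actually uses, and the halving policy is only one possible fallback strategy, not something the DDK prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

constexpr std::size_t kPageSize = 4096;

// Hypothetical stand-in for the driver's allocation call. It returns true
// if a physically contiguous buffer of `bytes` bytes could be obtained.
using TryAllocate = std::function<bool(std::size_t bytes)>;

// Requests `preferredPages` contiguous pages and halves the request after
// each failure, stopping once the request would drop below `minimumPages`.
// Returns the buffer size obtained in bytes, or 0 if every attempt failed.
std::size_t AllocateWithFallback(const TryAllocate& tryAllocate,
                                 std::size_t preferredPages,
                                 std::size_t minimumPages)
{
    assert(minimumPages >= 1 && preferredPages >= minimumPages);
    for (std::size_t pages = preferredPages; pages >= minimumPages; pages /= 2) {
        const std::size_t bytes = pages * kPageSize;
        if (tryAllocate(bytes)) {
            return bytes;  // Accept the first size the system grants.
        }
    }
    return 0;  // No acceptable buffer could be obtained.
}
```

A driver written this way degrades gracefully on a long-running, fragmented system: it starts with the size it would like and settles for the largest size that is actually available.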
A WaveCyclic miniport driver can implement its own DMA-channel object instead of using the default DMA-channel object, which is created by one of the port driver's NewXxxDmaChannel methods:

IPortWaveCyclic::NewMasterDmaChannel
IPortWaveCyclic::NewSlaveDmaChannel

The adapter driver's custom IDmaChannel implementation can perform custom handling of data to meet special hardware constraints. For example, the Windows Multimedia functions use wave formats in which 16-bit samples are always signed values, but the audio-rendering hardware might be designed to use unsigned 16-bit values. In this case, one would create a custom IDmaChannel::CopyTo method to convert the signed source values to the unsigned destination values that the hardware requires. Although this technique can be useful for working around hardware-design flaws, it can also incur a significant cost in software overhead.

For more information about master and subordinate devices, see the IDmaChannel and IDmaChannelSlave sections in the Windows DDK.

For an example of a driver that implements its own DMA-channel object, see the SB16 sample audio adapter in the Windows DDK. If the constant OVERRIDE_DMA_CHANNEL is defined to be TRUE, the conditional compilation statements in the source code enable the implementation of a proprietary IDmaChannel object, which the driver uses in place of the default IDmaChannel object from the IPortWaveCyclic::NewXxxDmaChannel call.

WavePci Filter

A WavePci filter is implemented as a port/miniport pair. A WavePci filter factory creates a WavePci filter as follows:
The code example in the "Subdevice Creation" topic of the Windows DDK illustrates this process. The port and miniport drivers communicate with each other through their IPortWavePci and IMiniportWavePci interfaces.

A WavePci device's scatter/gather DMA engine must handle buffers that straddle memory page boundaries. For example, a buffer that contains 10 milliseconds' worth of PCM audio data for a 48-kHz, 5.1-channel wave stream has the following size:

(10 milliseconds) * (48000 frames/second) * (6 samples[channels]/frame) * (2 bytes/sample) = 5760 bytes

This exceeds the memory page size (4096 bytes), which means that the buffer contains either one or two page boundaries, depending on how it is positioned in memory. The buffer contains an integral number (480) of frames of audio data, but one or two of those frames might straddle page boundaries. For this reason, the scatter/gather DMA hardware for a WavePci device should be designed to handle audio frames that are split between two physically non-contiguous pages in memory. Frame 197 in Figure 1 is an example of such a frame.

Figure 1. An Audio Buffer at an Offset from the Start of a Page
At the top of Figure 1 is a 5760-byte buffer that straddles the boundary between two pages. In this example, the buffer begins at a 1728-byte offset from the start of the first page, which aligns the start of the buffer to a 64-byte boundary in memory. Assume that each audio frame occupies 12 bytes and contains six channels. The first page contains all of frames 0 through 196 but only the first four bytes of frame 197.

At the bottom of Figure 1 is a detailed view of audio frame 197, which shows that only the samples for channels 0 and 1 fall within the first page. The samples for channels 2 through 5 are contained in the second page.

Although the two pages appear next to each other at the top of the figure, they are, in fact, contiguous only in kernel virtual memory. Because the pages containing the buffer are non-contiguous in physical memory, a scatter/gather DMA controller, which uses physical addresses, must specify the two parts of the buffer as two separate entries in its transfer queue. The WavePci port driver automatically creates two separate mappings at the page boundary.

Even if the example shown in Figure 1 is changed to align the buffer with the start of the first page, the split-frame problem does not disappear. Figure 2 shows this point. In this case, frame 341 gets split at the page boundary with the samples for channels 0 and 1 falling within the first page, and the samples for channels 2 through 5 located in the second page.

Figure 2. An Audio Buffer Aligned to the Start of a Page
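The arithmetic behind Figures 1 and 2 can be checked with a short sketch. The names `BufferBytes`, `FindSplitFrame`, and `SplitFrame` are hypothetical helpers introduced only for illustration, and the sketch assumes that the buffer is long enough to reach the first page boundary, as it is in both figures.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of a PCM buffer holding `milliseconds` of audio.
constexpr std::size_t BufferBytes(std::size_t milliseconds,
                                  std::size_t framesPerSecond,
                                  std::size_t channels,
                                  std::size_t bytesPerSample)
{
    return milliseconds * framesPerSecond / 1000 * channels * bytesPerSample;
}

struct SplitFrame {
    bool split;                    // True if a frame straddles the boundary.
    std::size_t frameIndex;        // Frame that contains the page boundary.
    std::size_t bytesInFirstPage;  // Bytes of that frame before the boundary.
};

// Locates the audio frame that crosses the first page boundary for a buffer
// that starts `offsetInPage` bytes into a page.
SplitFrame FindSplitFrame(std::size_t offsetInPage,
                          std::size_t frameBytes,
                          std::size_t pageSize = 4096)
{
    assert(frameBytes > 0 && offsetInPage < pageSize);
    const std::size_t bytesToBoundary = pageSize - offsetInPage;
    const std::size_t remainder = bytesToBoundary % frameBytes;
    return SplitFrame{remainder != 0, bytesToBoundary / frameBytes, remainder};
}
```

With a 12-byte frame, `FindSplitFrame(1728, 12)` reports frame 197 with four bytes in the first page, and `FindSplitFrame(0, 12)` reports frame 341 with four bytes in the first page. Four bytes correspond to two 16-bit samples, which is why only channels 0 and 1 fall within the first page in both figures.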
A WavePci device whose scatter/gather DMA controller does not properly handle split audio frames is limited in the kinds of audio data formats it can handle, although software workarounds might alleviate some hardware design flaws. For more information, see "WavePci Latency" in the Windows DDK.

Although the size of a typical mapping is one memory page or less, a single mapping can exceed the page size if a portion of an audio buffer happens to occupy two or more contiguous pages in physical memory. Larger mappings can create problems for DMA hardware with design flaws that limit the block size. For example, if a DMA controller can handle a maximum block size of a single page, and GetMapping outputs a mapping that is larger than a page, the miniport driver must split the mapping into smaller blocks that the DMA hardware can handle. If the resulting number of blocks exceeds the number of available map registers in the DMA hardware, the driver cannot queue all of the blocks in a single scatter/gather DMA operation. When this occurs, the driver must keep track of the unqueued portion of the mapping and initiate DMA transfers of the remaining blocks at a later time when additional map registers become available.

Maximizing a WavePci Driver

The WavePci port driver provides services for true scatter/gather bus master PCI devices. One leading cause of CPU consumption in audio drivers is data copies. By eliminating these copies, driver performance can be significantly improved. Unlike the current WaveCyclic implementation, WavePci has no inherent data copies, making it preferred for multi-stream and/or hardware-accelerated drivers. Using the principles described in this section enables one to create a device driver that consumes almost no CPU resources, making it possible to send large numbers of audio streams (64 or more) to the hardware for 3D processing and mixing.
(Again, note that this does not take into account the total system impact from cache effects, bus traffic/contention, and so on.) For current operating systems, Microsoft recommends WavePci over WaveCyclic; new hardware designs should be fully compliant with WavePci.

Device Hardware Constraints

In order to take advantage of the WavePci port driver, a device must have true scatter/gather capabilities. This includes the following capabilities:
Hardware design that takes these issues into consideration can enable all the efficiency gains detailed in the following sections.

Number of Mappings

In some WavePci designs, the device interrupts when the supply of mappings is nearly exhausted. Other designs interrupt on every mapping completion, or upon some other interval. Device interrupts incur a certain amount of CPU overhead, and this overhead should be minimized. For this reason, some WavePci drivers request as many mappings as are available, whenever the chance arises. Although this will not necessarily decrease the CPU consumption, it can make a driver more resistant to starvation and audio "glitching" stemming from badly behaved devices (such as hard disks using PIO, and so on). This is because the kernel audio mixer increases its amount of buffering, up to 80 milliseconds, in the event that its mixing thread is held off. In the case of a hardware-accelerated device and a game that takes advantage of DirectSound hardware buffers, that game may similarly decide to increase its padding interval from the write cursor.

Whether or not a driver employs this mapping maximization technique, it is especially critical that the driver and device correctly handle mapping cancellation when large numbers of mappings are queued; otherwise clients will notice an objectionable latency upon a stream STOP or PAUSE. The above software technique is beneficial only to the limit of what the hardware can map.

Using the IPreFetchOffset Interface

DirectSound users are familiar with the dual concepts of the Play Cursor and the Write Cursor. The Play Position is defined as the position in the stream of the data actually physically being emitted from the device (the closest approximation of the "sample currently at the DAC"); the Write Position is the position in the stream of the next safe place for the client to write additional data.
For WavePci, the default assumption is that the Write Position is actually the end of the last mapping requested. Thus, if a vendor maximizes the number of outstanding mappings, the delta between the Play Position and the Write Position can be large enough to fail certain WHQL Audio Position tests. A new interface called IPreFetchOffset has been added to the WavePci pin for Windows XP. This interface is detailed in the latest Windows DDK, and a functional prototype from the portcls.h header file is included below:
DECLARE_INTERFACE_(IPreFetchOffset,IUnknown)
{
    DEFINE_ABSTRACT_UNKNOWN()   // For IUnknown

    STDMETHOD_(VOID,SetPreFetchOffset)
    (   THIS_
        IN  ULONG   PreFetchOffset
    )   PURE;
};

typedef IPreFetchOffset *PPREFETCHOFFSET;
IPreFetchOffset is used to specify prefetch characteristics of the bus master hardware, essentially the hardware FIFO size. The audio subsystem uses this data to set a constant offset between the Play Cursor and the Write Cursor, taking advantage of the fact that data can be written into a mapping even after the mapping has been handed to the hardware, as long as the Play Cursor is far enough away from the location into which data is written. Using IPreFetchOffset with a small offset - say 64 samples, depending on the hardware DMA engine design - can enable a WavePci driver to be fully responsive and functional while still requesting a large number of mappings. Note also that DirectSound currently pads the Write Cursor of hardware-accelerated pins by 10 milliseconds. Refer to the KSAUDIO_POSITION reference page in the Windows DDK for a more detailed explanation of how write and play cursors work at the driver level.

Other Optimizations

Additional optimizations can increase an audio driver's efficiency, whether or not it is WavePci.

Minimizing the Interrupt Frequency

Hardware should interrupt only when necessary. A device that generates large numbers of interrupts will have a negative impact on system performance. If hardware is designed to interrupt upon every mapping completion, explore using timer DPCs instead of hardware interrupts. This can drastically reduce the number of times a ServiceGroup runs its RequestService() routine. Alternatively, the hardware or driver can be modified to interrupt less frequently. Drivers must call Notify() periodically for events to fire. Therefore, either interrupt approximately every 10 milliseconds or use a 10-millisecond timer DPC. This is one of the most important optimizations.

Notify(), ServiceGroups and RequestService()

Many drivers automatically call Notify() in response to a hardware interrupt, regardless of whether DMA is active.
However, a truly optimized ISR will call Notify() only if one or more hardware streams are in the RUN state. Because position and clock events are triggered from a Notify() call, Windows XP generates any relevant position events immediately upon a pin transition from RUN to PAUSE. If a driver runs on operating systems older than Windows XP, and if it incorporates this ISR optimization, it should call Notify() one last time, immediately after the pin leaves the RUN state, to take this into account (see PAUSE/ACQUIRE Optimizations, below).

The CServiceGroup Service()/RequestService() routines can be called very frequently, so do not automatically add members to service groups. Be very sparing in this regard.

Similarly, the WavePci pin's RequestService() (called from the ServiceGroup) calls each stream's Service() routine. Only do work in the Service() routine if that particular stream is running. Note that in Windows XP, RequestService() only calls the stream's Service() routine if that pin is in the RUN state. If a driver runs on operating systems older than Windows XP, checking the pin's state before doing any work is a worthwhile optimization. This is particularly compelling in the scenario in which numerous streams are open, but only one is running.

CDmaChannel Specifics

The CDmaChannel object's CopyTo()/CopyFrom() routines are well optimized. Do not override them unnecessarily. If a driver requires custom CopyTo/CopyFrom routines, consider calling the built-in versions from within the custom routines.

Drivers must use map registers when the system can address more than 4 GB of RAM but the hardware device cannot. Note that the system only supplies 15 map registers for each DMA channel. As a result, create a new DMA channel per stream if a driver maps large transfers or addresses large memory spaces.

Utilizing Non-Blocking Code

A driver is less likely to have deadlock or performance problems now and in the future if the code is as non-blocking as possible.
Specifically, the driver's thread of execution should strive to run to completion without potentially stalling while waiting for another thread or resource. For example, some very clever things can be done with the Interlocked commands. While this is a great general rule, it may not be possible to safely remove spinlocks, mutexes, and so on from the data path. In this case, use these very judiciously, with the knowledge that any indefinite wait might cause data starvation. In particular, do not create custom synchronization primitives - the built-in Microsoft primitives (mutexes, spinlocks, and so on) will be modified as needed to support new scheduler features in the future, and creating and using one's own constructs virtually guarantees that the driver will not work in the future.

Furthermore, avoid having your hardware driver touch any of the data within the mappings if at all possible. Specifically, any software processing should be split out into a software filter separate from the hardware driver. Arbitrary processing code in a hardware driver reduces its efficiency and adds latency problems. A hardware driver should strive to be transparent about what is supported natively in hardware - do not claim to support more in hardware than is really the case.

Additional Considerations

IPinCount for Voice Management

The IPinCount interface was added to PortCls in Windows XP. It is documented fully in the Windows DDK, but is mentioned here to draw attention to it as a way to implement driver support for hardware streams that are of variable weight. For example, if for a given hardware device each 3D stream consumes the resources of two 2D streams, then for correct operation the pin counts should increase by two (not just one) when a client opens a 3D stream. This ensures that the client has an accurate view of how many additional streams can be opened. Note that this interface is not callable at elevated IRQL, so pin count manipulations must occur at passive level.
This is valid because pin counts would probably only change as a result of a pin close or open, both of which are triggered by passive level calls.

Requirements Change for Hardware Support

Previously, the PC Design Guidelines stipulated that all audio output devices must support both 44.1 kHz and 48 kHz sample rates. Note that a device/driver may now choose to support only 48 kHz. The system audio mixer handles any necessary sample rate conversion.

DirectX 8 Acceleration

The new DirectMusic and DirectSound interfaces in DirectX 8 have not yet been enabled for hardware acceleration. Specifically, the AudioPath functionality that adds chains of effects, sends/returns, sub-mixes, as well as the technology that links DLS synthesizers to the system audio mixer, cannot be accelerated at this time. In other words, one cannot route the output from a hardware-accelerated DirectMusic synthesizer through a hardware-accelerated DirectSound mixer. Microsoft may add this capability in future versions of DirectX and/or Windows.

Special Concerns for Older Operating Systems

When creating a high-performance audio driver for delivery onto multiple operating systems, keep in mind a few additional nuances.

Quick Fix Package for WavePci on Windows 98 Second Edition

Windows 98 and Windows 98 Second Edition contain a version of WavePci, but implementation problems in these versions render them unusable. WavePci should not be used on Windows 98; a simple way to distinguish between Windows 98 and Windows 98 Second Edition is to query the port object for IPortEvents - this interface is only present on Windows 98 Second Edition or higher.

Microsoft has addressed the Windows 98 Second Edition WavePci problems in a QFE package labeled 269601USA8 (reference KB article 242937). IHVs should contact Microsoft PSS to obtain this package and to obtain permission to ship it with their product (reference KB article 263843). Without this QFE, WavePci should not be used on Windows 98 Second Edition.
Windows 98 will not be addressed.

There is no built-in API to detect whether this QFE package is present. One way that has been discussed outside Microsoft is to check for the presence of the following registry key:

HKLM\Software\Microsoft\Windows\CurrentVersion\Setup\Updates\W98.SE\UPD\269601

However, because subsequent QFEs will also include the fixed PortCls, a user may have a future QFE installed instead of 269601. A better way to determine whether PortCls contains the WavePci fixes (after Windows 98 Second Edition is detected) is simply to check the file version and/or date of PortCls.sys. This version (or any newer version) of PortCls has the fix:

03/21/2000 8:34:03pm 4.10.2223 169,376 PortCls.sys

To reiterate, although one can programmatically check for the existence of various registry keys, the best way to determine whether the Windows 98 Second Edition QFE for WavePci is present is simply to check the file version and/or date of PortCls.sys.

Using IPortClsVersion

The IPortClsVersion interface indicates which version of PortCls is loaded, for when one must know which audio services are present (non-PCM support, for example). IPortClsVersion is implemented on Windows XP and later operating systems, plus Windows 2000 SP2 and a forthcoming QFE package for Windows Me. A simple QueryInterface() for the requested functionality (if it is implemented as a PortCls interface) is more correct in a COM sense. Note that this can be called at elevated IRQL, but IoIsWdmVersionAvailable() cannot.

Availability of IPreFetchOffset

The IPreFetchOffset interface is currently available in Windows XP only. This functionality has not been redistributed back onto Windows 2000, Windows Me or Windows 98 SE. Clients should support this interface regardless of the operating system version, as support may be added in a future Service Pack or QFE.

Availability of IPinCount

The IPinCount interface is available in Windows XP.
It has also been redistributed back onto Windows Me as QFE # 25334 (reference Knowledge Base article 316638). Microsoft does not plan to redistribute this functionality back onto Windows 2000 or Windows 98. The English versions of QFE binaries containing this fix have the following attributes (or later):

02/02/2002 11:09am 4.90.0.3002 147,744 KMixer.sys
02/02/2002 12:45am 4.90.3001.0 147,184 PortCls.sys
08/08/2002 04:16pm 4.90.3001.0 49,296 SysAudio.sys

PAUSE/ACQUIRE Optimizations

As mentioned above, on Windows Me and earlier, the WavePci RequestService() routine calls Service() regardless of the pin state. For this reason, if your driver is running on those older operating systems, check the stream's state in the Service() routine before spending time doing actual work.

Similarly, if InterruptServiceRoutine() only calls Notify() when one or more streams are running, the driver needs to process the ISR immediately after the last running stream is paused. Residual position and clock events are not automatically fired in this case, so end-of-stream and clock events will be lost unless the driver calls Notify() one last time.

System Audio Mixer Considerations

Note that in Windows XP and Windows Me, if the system mixer (KMixer) is connected to a hardware render pin, that pin will be kept running if any of the pins feeding KMixer are in the RUN or PAUSE state. This is so that audio data is correctly flushed through the system.

High bit-depth is now more common in audio hardware. Windows 2000 and Windows Me contain a bug in the system audio mixer (KMixer) that causes high bit-depth (>16 bit) audio data to be truncated down to 16 bits. This bug is not present in Windows 98, Windows 98 Second Edition, or Windows XP. The fix is included in Windows 2000 Service Pack 3 (reference Knowledge Base article Q308883). QFE # 25334 for Windows Me (reference Knowledge Base article Q316638) contains both the KMixer change and the IPinCount interface mentioned above.
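The PAUSE/ACQUIRE optimization described earlier in this section (call Notify() from the ISR only while streams are running, and call it one last time when the last running stream leaves RUN on operating systems older than Windows XP) can be modeled with a small bookkeeping class. This is only a sketch: `NotifyGate` and its methods are hypothetical names, not PortCls types, and the sketch deliberately omits the synchronization that a real driver needs between its ISR and its state-change path.

```cpp
#include <cassert>

// Hypothetical per-adapter bookkeeping for the "Notify() only while
// running" ISR optimization. Not a PortCls type.
class NotifyGate {
public:
    // Call when a stream enters the RUN state.
    void OnStreamRun() { ++runningStreams_; }

    // Call when a stream leaves the RUN state. Returns true if the caller
    // should issue one final Notify() so that residual position and clock
    // events are not lost. Pass true for `osFiresResidualEvents` on systems
    // that generate those events themselves (Windows XP, per the text).
    bool OnStreamLeaveRun(bool osFiresResidualEvents)
    {
        assert(runningStreams_ > 0);
        --runningStreams_;
        return runningStreams_ == 0 && !osFiresResidualEvents;
    }

    // Call from the ISR: Notify() is worthwhile only if a stream is running.
    bool ShouldNotifyFromIsr() const { return runningStreams_ > 0; }

private:
    int runningStreams_ = 0;
};
```

The design choice is to keep the ISR check down to a single counter comparison, so that the common case of open but idle streams costs almost nothing per interrupt.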
Call to Action

Evaluate WaveCyclic versus WavePci and make the appropriate choice for how to support an existing hardware design. In future hardware designs, create audio hardware that has full scatter/gather capabilities.
Feedback: Send feedback on this article to WHQLSYS@microsoft.com with Audio Device Performance in the Subject line.