High Performance Network Adapters and Drivers

Updated: January 15, 2003

Networking performance has increasingly become one of the most important factors in overall system performance. Many of the factors that affect networking performance fall into three categories: network adapter hardware and driver performance, network stack performance, and the way applications interact with the network stack. This article discusses issues related to network adapter hardware and driver performance in Microsoft Windows 2000, Windows XP, and Windows Server 2003.

On This Page

Introduction
Network Adapter Hardware
Network Adapter Driver
Scatter-Gather DMA
Auto-tuning
Call to Action


Introduction

Networking performance is measured in throughput and response time. It is also important to reach the optimal performance operating point without over-utilizing system resources. To reach that operating point, network adapter vendors need to examine several areas, including:

  • Which network adapter hardware capabilities are implemented

  • Network adapter driver implementation for critical path length and scalability

  • Dynamically adjustable hardware and software parameters to allow for auto-tuning

Network Adapter Hardware

There are always tradeoffs in deciding which hardware functions to implement on a network adapter. It is becoming increasingly important to consider adding task offload features, interrupt moderation, dynamic tuning on the hardware, more efficient use of the PCI bus, and support for Jumbo Frames. These are particularly important for high-end network adapters that will be used in configurations requiring top performance.

Task Offload Features

Task offload features can be implemented in the driver or in the firmware. For optimal performance, it is best to implement task offload in the hardware. Windows enables the following three task-offload features. Full documentation on task offload can be found in the Windows DDK.

  • TCP and IP Checksum Offload: For most common network traffic, offloading checksum calculation to the network adapter hardware offers a significant performance advantage by reducing the number of CPU cycles required per byte. Checksum calculation is the most expensive function in the networking stack for two reasons:

    • It contributes to long path length

    • It causes cache churning effects (typically on the sender)

    Offloading checksum calculation on the sender improves overall system performance by reducing the load on the host CPU and increasing cache effectiveness.

    In the Windows Performance Lab, we have measured TCP throughput improvements of 19% when checksum was offloaded during network-intensive workloads. Analysis of this improvement shows that 11 percentage points are due to the path length reduction and 8 percentage points are due to increased cache effectiveness.

    Offloading checksum on the receiver has the same advantages as offloading checksum on the sender. Increased benefit can be seen on systems that act as both client and server, such as a sockets proxy server. On systems where the CPU is not necessarily busy, such as a client system, the benefit of offloading checksum may be seen in better network response times, rather than in noticeably improved throughput.

  • Large Send Offload: Windows offers the ability for the network adapter/driver to advertise to TCP a Maximum Segment Size (MSS) larger than the MTU, up to 64K. This allows TCP to pass a buffer of up to 64K to the driver, which divides the large buffer into packets that fit within the network MTU.

    The TCP segmentation work is done by the network adapter hardware and driver instead of the host CPU. This results in a significant performance improvement if the processor on the network adapter is able to handle the additional work.

    For many of the network adapters tested, little improvement was seen for pure networking activities when the host CPU was more powerful than the network adapter hardware. However, for typical business workloads, an overall throughput improvement of up to 9% has been measured, because the host CPU uses most of its cycles to execute transactions. In these cases, offloading TCP segmentation to the hardware frees the host CPU from the load of segmentation, giving it extra cycles to perform more transactions.

  • IP Security (IPSec) Offload: Windows offers the ability to offload the encryption work of IPSec to the network adapter hardware. Encryption, especially 3DES, has a very high cycles/byte ratio. Therefore, it is no surprise that offloading IPSec to the network adapter hardware yielded a measured 30% performance boost in secure Internet and VPN tests.
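To illustrate why checksum calculation costs cycles on every byte, the following is a minimal sketch of the ones' complement Internet checksum in portable C. It is not the Windows stack's implementation; the function name is hypothetical, and the code only shows the kind of per-byte work that checksum offload moves to the network adapter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement Internet checksum over a byte buffer.
 * Every byte is read once, so the cost grows with packet size and the
 * payload is pulled through the CPU cache. Illustrative sketch only. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);

    /* Add the buffer as a sequence of big-endian 16-bit words. */
    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    /* An odd trailing byte is treated as the high byte of a final word. */
    if (i < len) {
        sum += (uint32_t)data[i] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

With hardware offload, the host skips this loop entirely for packets the adapter has checksummed or validated.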

Interrupt Moderation

A simple network adapter generates a hardware interrupt on the host upon the arrival of a packet or to signal completion of a packet send request. Interrupt latency and resulting cache churning effects add overhead to the overall networking performance. In many scenarios (for example, heavy system usage or heavy network traffic), it is best to reduce the cost of the hardware interrupt by processing several packets for each interrupt.

With network-intensive workloads, a throughput improvement of up to 9% has been measured. However, tuning Interrupt Moderation parameters only for throughput may degrade response time. To maintain optimum settings and accommodate different workloads, it is best to allow for dynamically adjusted parameters, as described in the Auto-tuning section later in this article.
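The trade-off between throughput and response time can be sketched with a simple moderation policy: raise an interrupt after a fixed number of packets, or once the oldest pending packet has waited too long. The structure, function names, and thresholds below are hypothetical and only model the decision logic; real adapters implement this in hardware or firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical moderation state; all names are illustrative. */
typedef struct {
    uint32_t max_packets;       /* interrupt after this many packets...  */
    uint32_t max_delay_us;      /* ...or after the oldest waited this long */
    uint32_t pending;           /* packets received since last interrupt */
    uint64_t first_pending_us;  /* arrival time of oldest pending packet */
} moderation_t;

void moderation_init(moderation_t *m, uint32_t max_packets,
                     uint32_t max_delay_us)
{
    m->max_packets = max_packets;
    m->max_delay_us = max_delay_us;
    m->pending = 0;
    m->first_pending_us = 0;
}

/* Record a packet arrival; returns true if an interrupt should fire now. */
bool moderation_on_packet(moderation_t *m, uint64_t now_us)
{
    if (m->pending == 0) {
        m->first_pending_us = now_us;
    }
    m->pending++;
    if (m->pending >= m->max_packets ||
        now_us - m->first_pending_us >= m->max_delay_us) {
        m->pending = 0;
        return true;
    }
    return false;
}

/* Timer tick; bounds latency when traffic is too light to hit max_packets. */
bool moderation_on_timer(moderation_t *m, uint64_t now_us)
{
    if (m->pending > 0 &&
        now_us - m->first_pending_us >= m->max_delay_us) {
        m->pending = 0;
        return true;
    }
    return false;
}
```

A larger max_packets favors throughput, while a smaller max_delay_us favors response time, which is why both are candidates for dynamic adjustment.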

PCI Bus Usage

One of the most important factors in the network adapter hardware performance is its efficient usage of the PCI bus. Further, the network adapter's DMA performance affects the performance of all PCI cards that are on the same PCI bus. The following guidelines must be considered when optimizing PCI usage:

  • Streamline DMA transfers by aggregating target pages where appropriate.

  • Reduce PCI protocol overhead by performing DMA in large chunks (at a minimum of 256 bytes). If possible, time the flow of data so that entire packets are transferred in a single PCI transaction. However, consider how the transfer should take place; for example, do not wait for all of the data to arrive before initiating transfers, as this will increase latency and consume additional buffer space.

  • It is better to pad the DMA packet transfer with additional bytes, rather than requiring a short extra transfer to "clean up" by transferring the last few bytes of the packet.

  • Use the Memory Read, Memory Read Line, and Memory Read Multiple transactions as recommended by the PCI specification.

  • The network adapter bus interface hardware should detect limitations in the host memory controller and adjust its behavior accordingly. For example, the bus interface hardware should detect memory-controller pre-fetch limitations on DMA Memory Reads and wait for a short period before attempting the transaction again. The hardware should detect excessive retries on the part of the network adapter and, when cut off by the host, increase the time before the first retry on future transactions. There is no point in continuing to submit transactions to the memory controller when it is certain that the controller is still busy fetching the next sequential set of data.

  • Minimize the insertion of wait states, especially during data transfers. It is better to relinquish the bus and let another PCI adapter using the bus get some work done if more than one or two wait states are going to be inserted.

  • Use Memory Mapped I/O instead of Programmed I/O. This is also true for drivers.
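As a small illustration of the padding guideline above, the following sketch rounds a transfer length up to a whole number of bursts so that no short "clean up" transfer is needed. The function name and the burst size are assumptions made for illustration; the right burst size depends on the adapter and the bus, and the receive buffer must be allocated with room for the padding.

```c
#include <assert.h>
#include <stdint.h>

/* Round a DMA transfer length up to a multiple of the burst size.
 * Hypothetical helper; burst is whatever chunk size the adapter uses. */
uint32_t pad_dma_length(uint32_t length, uint32_t burst)
{
    uint64_t bursts;

    assert(burst != 0);
    /* 64-bit intermediate avoids overflow for very large lengths. */
    bursts = ((uint64_t)length + burst - 1) / burst;
    return (uint32_t)(bursts * burst);
}
```

For example, a 1514-byte frame transferred in 256-byte chunks would be padded to 1536 bytes, which is six full chunks instead of five chunks plus a 234-byte remainder.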

Jumbo Frame Support

Supporting larger Maximum Transmission Units (MTUs) and thus larger frame sizes, specifically Jumbo Frames, reduces the network stack overhead incurred per byte. A 20% TCP throughput increase has been measured when the MTU was changed from 1514 to 9000. CPU utilization also drops significantly because the network stack makes fewer calls to the network driver.
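The reduction in per-packet work can be estimated with simple arithmetic. The sketch below counts the packets needed to carry a given amount of TCP data, assuming 40 bytes of IPv4 and TCP headers with no options; the 1514-byte figure above corresponds to a 1500-byte IP MTU plus a 14-byte Ethernet header. The function name and header constant are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_IP_HEADER_BYTES 40u  /* assumption: IPv4 + TCP, no options */

/* Number of packets needed to carry data_bytes of TCP payload
 * at the given IP MTU. Illustrative arithmetic only. */
uint32_t packets_needed(uint32_t data_bytes, uint32_t ip_mtu)
{
    uint64_t payload;

    assert(ip_mtu > TCP_IP_HEADER_BYTES);
    payload = ip_mtu - TCP_IP_HEADER_BYTES;
    return (uint32_t)(((uint64_t)data_bytes + payload - 1) / payload);
}
```

Under these assumptions, 64K of data needs 45 packets at a 1500-byte MTU but only 8 packets at a 9000-byte MTU, so the per-packet path through the stack and driver runs far less often.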

Network Adapter Driver

The performance impact of the network adapter driver is an important factor in overall network performance. Performance issues with the network adapter driver fall into three categories: path length, scalability, and alignment.

Path Length

Although the send and receive paths differ from driver to driver, there are some general rules for performance optimizations:

  • Optimize for the common paths. The Kernprof.exe tool, provided with the MSDN and IDW builds of Windows, extracts the needed information. The developer should look at the routines that consume the most CPU cycles and attempt to reduce either how often these routines are called or the time spent in them.

  • Reduce time spent in DPC so that the network adapter driver does not use excessive system resources, which would cause overall system performance to suffer.

  • Make sure that debug code is not compiled into the final released version of the driver; this avoids executing excess code.

Scalability Issues

Scalability issues are very important to address, especially for SMP systems. Various scalability issues are discussed in this section.

Partitioning

Partitioning is needed to minimize shared data and code across processors. Partitioning helps reduce system bus utilization and improves the effectiveness of processor cache. To minimize sharing, driver writers should consider the following:

  • Implement the driver as a de-serialized miniport as described in the Windows DDK.

  • Use per-processor data structures to reduce global and shared data access. This allows you to keep statistic counters without synchronization, which reduces the code path length and increases performance. For vital statistics, have per-processor counters that are added together at query time. If you must have a global counter, use interlocked operations instead of spinlocks to manipulate the counter. See Locking Mechanisms (Avoiding Spinlocks) later in this article for information about how to avoid using spinlocks.

    To facilitate this, KeGetCurrentProcessorNumber can be used to determine the current processor. To determine the number of processors when allocating per-processor data structures, KeQueryActiveProcessors can be used. This API is scheduled for inclusion in a future release of the DDK, but is defined as follows:

    KeQueryActiveProcessors returns an affinity mask that represents the active processors in the system.

    KAFFINITY
      KeQueryActiveProcessors (
        );

    Include: ntddk.h

    Comments

    The total number of bits set in the affinity mask indicates the number of active processors in the system. Drivers should not assume that all the set bits in the mask will be contiguous because the processors might not be consecutively numbered in the future releases of the operating system. The number of processors in an SMP machine is a zero-based value.

    A Windows XP/Windows 2000 driver might call KeQueryActiveProcessors if it maintained per-processor data in an attempt to reduce cache-line contention.

    KeQueryActiveProcessors can be called at any IRQL.

    See also: KeGetCurrentProcessorNumber
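The per-processor counter approach can be sketched in portable C as follows. This is not kernel code: the 64-bit mask type stands in for KAFFINITY, the caller-supplied processor number stands in for the value returned by KeGetCurrentProcessorNumber, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PROCESSORS 64           /* assumption: one mask bit per processor */

typedef uint64_t affinity_mask_t;   /* stand-in for KAFFINITY */

/* Count the set bits; they are not assumed to be contiguous. */
unsigned count_active_processors(affinity_mask_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;           /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* One counter per zero-based processor number. Each processor updates
 * only its own slot, so no lock or interlocked operation is needed. */
typedef struct {
    uint64_t packets_received[MAX_PROCESSORS];
} per_cpu_stats_t;

void stats_count_packet(per_cpu_stats_t *s, unsigned cpu)
{
    assert(cpu < MAX_PROCESSORS);
    s->packets_received[cpu]++;
}

/* Counters are added together only at query time. */
uint64_t stats_query_total(const per_cpu_stats_t *s, affinity_mask_t active)
{
    uint64_t total = 0;
    unsigned cpu;

    for (cpu = 0; cpu < MAX_PROCESSORS; cpu++) {
        if (active & ((affinity_mask_t)1 << cpu)) {
            total += s->packets_received[cpu];
        }
    }
    return total;
}
```

In a real driver each per-processor slot would also be placed on its own cache line; the densely packed array here is kept simple for readability.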

 

False Sharing

False sharing occurs when processors access variables that are independent of each other but reside on the same cache line. Because the variables occupy the same line, the line is effectively shared among the processors. In such situations, the cache line travels back and forth between processors for every access to any of the variables in it, causing an increase in cache flushes and reloads. This increases system bus utilization and reduces overall system performance.

To avoid false sharing, align important data structures (such as spinlocks, buffer queue headers, SLists, and so on) to cache line boundaries by using NdisGetSharedDataAlignment.
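A minimal sketch of cache-line alignment in portable C11 follows. It assumes a fixed 64-byte cache line for illustration; a driver would instead use the alignment reported at run time by NdisGetSharedDataAlignment. The structure and helper names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption; real value is platform-specific */

/* Each hot counter starts on its own cache line, so a processor updating
 * send_count never invalidates the line holding receive_count. */
typedef struct {
    alignas(CACHE_LINE_SIZE) uint64_t send_count;
    alignas(CACHE_LINE_SIZE) uint64_t receive_count;
} adapter_counters_t;

/* True if two addresses fall within the same cache line. */
bool same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE_SIZE) ==
           ((uintptr_t)b / CACHE_LINE_SIZE);
}
```

The cost of this layout is memory: the structure grows from 16 bytes to two full cache lines, which is why it is worth applying only to frequently accessed data.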

Locking Mechanisms (Avoiding Spinlocks)

Spinlocks can be expensive if they are not used properly. Drivers should avoid spinlocks as much as possible by using interlocked operations where applicable. However, a spinlock might be the best choice for some purposes. For example, if a driver acquires a spinlock while handling the reference count for the number of packets that have not been indicated back to the driver, the spinlock already protects the count, so an interlocked operation is not also necessary.

Other suggestions for locking mechanisms include the following:

  • Use SLists and the NDIS singly-linked list functions for managing resource pools. (For example, NdisInitializeSListHead, NdisInterlockedPushEntrySList, NdisInterlockedPopEntrySList, NdisQueryDepthSList)

  • If you need to use spinlocks, make sure that they protect data and not code. Do not use one lock to protect all data used in common paths. For example, separate the data used in the send and receive paths into two data structures so that when the send path needs to lock its data, the receive path is not affected.

  • If you are using spinlocks and the path is already at DPC level, use the functions for spinlock acquire and release at DPC level to avoid extra code. (For example, NdisDprAcquireSpinLock, NdisDprReleaseSpinLock)

  • To minimize the number of spinlock acquires and releases, use the NDIS ReadWriteLock functions (NdisInitializeReadWriteLock, NdisAcquireReadWriteLock, NdisReleaseReadWriteLock). The ReadWriteLock functions allow multiple concurrent readers to use a single lock and limit write access to a single writer thread. No read access is allowed during a write access.
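The idea of replacing a spinlock with an interlocked operation for a simple counter can be sketched with C11 atomics, used here as a portable stand-in for the Windows interlocked increment and decrement operations. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Count of packets handed up the stack and not yet returned. */
typedef struct {
    atomic_long outstanding;
} packet_refs_t;

void refs_init(packet_refs_t *r)
{
    atomic_init(&r->outstanding, 0);
}

/* Called when a packet is indicated up; returns the new count. */
long refs_indicate(packet_refs_t *r)
{
    return atomic_fetch_add(&r->outstanding, 1) + 1;
}

/* Called when a packet is returned; returns the new count.
 * A result of zero means no packets remain outstanding. */
long refs_return(packet_refs_t *r)
{
    long previous = atomic_fetch_sub(&r->outstanding, 1);
    assert(previous > 0);
    return previous - 1;
}
```

Because each update is a single atomic read-modify-write, the counter needs no lock of its own. If the surrounding code path already holds a spinlock for other reasons, a plain increment under that lock is sufficient.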

64-Bit DMA

If the network adapter supports 64-bit DMA, steps must be taken to avoid extra copies for addresses above the 4 GB range. When the driver calls NdisMInitializeScatterGatherDma, Dma64Addresses must be set to TRUE.

Mapping Sent User Data

A flag has been added to the AttributeFlags parameter passed to NdisMSetAttributesEx during initialization. The NDIS_ATTRIBUTE_USES_SAFE_BUFFER_APIS bit in AttributeFlags enables drivers to avoid mapping the sent user data. The driver should then need only the Physical Address passed down from NDIS when sending the data. This bit indicates to NDIS either that the driver needs only the Physical Address or that the driver will be using the NdisXxxSafe API set (listed below). When using the Safe API set, if NdisQueryBufferSafe is used to obtain the buffer length, the VirtualAddress parameter should be set to NULL in order to avoid mapping the data buffers sent down by NDIS.

Setting NDIS_ATTRIBUTE_USES_SAFE_BUFFER_APIS improves performance by avoiding mapping buffers to system virtual addresses wherever possible. The following are the NdisXxxSafe set of APIs:

NdisQueryBufferSafe
NdisGetFirstBufferFromPacketSafe
NdisCopyFromPacketToPacketSafe
NdisBufferVirtualAddressSafe

Alignment

Buffer alignment on a cache line boundary improves performance when copying data from one buffer to another. Most network adapter receive buffers are properly aligned when they are first allocated, but the user data that must eventually be copied into the application buffer is misaligned due to the header space consumed. With TCP data (the most common scenario) the shift due to the TCP, IP and Ethernet headers results in a shift of 0x36 bytes. In order to resolve this problem for the common case, we recommend that a slightly larger buffer be allocated and packet data be inserted at an offset of 0xa bytes. This will ensure that after shifting the buffers by 0x36 bytes for the header, user data will be properly aligned.
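The offset arithmetic can be checked with a short calculation. The sketch below computes how many bytes to skip at the start of an aligned buffer so that the data following the headers lands on a cache line boundary. The function name is hypothetical and the cache line size is a parameter, because this article does not state one; the 0xa offset is consistent with either a 32-byte or a 64-byte line.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to skip at the start of a cache-line-aligned buffer so that the
 * payload following header_bytes of headers is itself line-aligned. */
uint32_t payload_align_offset(uint32_t header_bytes, uint32_t cache_line)
{
    assert(cache_line != 0);
    return (cache_line - (header_bytes % cache_line)) % cache_line;
}
```

With 0x36 bytes of Ethernet, IP, and TCP headers, the result is 0xa, and 0xa + 0x36 = 0x40, so the user data begins exactly 64 bytes into the buffer.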

Scatter-Gather DMA

Scatter-Gather DMA provides the hardware with support to transfer data to and from noncontiguous ranges of physical memory. Scatter-Gather DMA uses a SCATTER_GATHER_LIST structure, which includes an array of SCATTER_GATHER_ELEMENT structures and the number of elements in the array. This structure is retrieved from the packet descriptor passed to the driver's send function. Each element of the array provides the length and starting physical address of a physically contiguous Scatter-Gather region. The driver uses the length and address information to transfer the data.
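The way a driver consumes such a list can be sketched as follows. The types below are simplified stand-ins for the scatter-gather list and its elements, not the DDK definitions, and the transmit descriptor layout is entirely hypothetical because it depends on the adapter hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the scatter-gather element and list. */
typedef struct {
    uint64_t address;   /* starting physical address of the region */
    uint32_t length;    /* length of the contiguous region in bytes */
} sg_element_t;

typedef struct {
    uint32_t number_of_elements;
    const sg_element_t *elements;
} sg_list_t;

/* Hypothetical hardware transmit descriptor. */
typedef struct {
    uint64_t buffer_address;
    uint32_t buffer_length;
    uint8_t  end_of_packet;
} tx_descriptor_t;

/* Fill one descriptor per scatter-gather element. Returns the number of
 * descriptors written, or 0 if the list is empty or the ring lacks room. */
size_t build_tx_descriptors(const sg_list_t *sg, tx_descriptor_t *ring,
                            size_t ring_free)
{
    uint32_t i;

    if (sg->number_of_elements == 0 || sg->number_of_elements > ring_free) {
        return 0;
    }
    for (i = 0; i < sg->number_of_elements; i++) {
        ring[i].buffer_address = sg->elements[i].address;
        ring[i].buffer_length  = sg->elements[i].length;
        ring[i].end_of_packet  = (i == sg->number_of_elements - 1);
    }
    return sg->number_of_elements;
}
```

No data is copied in this path; the adapter reads each region directly from the physical addresses supplied in the list.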

Using the Scatter Gather routines for DMA operations can improve utilization of system resources by not locking these resources down statically, as would occur if map registers were used.

If the network adapter supports TCP Segmentation Offload (Large Send Offload), the driver needs to pass the maximum buffer size it can get from TCP/IP in the MaximumPhysicalMapping parameter of the NdisMInitializeScatterGatherDma function. This guarantees that enough map registers are available to build the Scatter-Gather list and eliminates any possible buffer allocations and copying.

Auto-tuning

Auto-tuning is an important concept and needs to be implemented in the driver or network adapter hardware. It is important to maintain optimal performance while eliminating all unnecessary and confusing user controls.

In order to ensure the optimal performance operating point, the network adapter driver must support auto-tuning of its parameters. The driver must periodically adjust the hardware and software to maximize both throughput and response time without a driver reset. Not all parameters can be adjusted, especially hardware parameters. Auto-tuning is preferable to static parameters because the optimal values of parameters may change over time, and static parameters are hard for users to understand and tune correctly.

Auto-tuning is a difficult problem to solve. Much research and development time can be consumed to achieve optimal tuning algorithms. However, there are many simple algorithms that can be considered, allowing you to achieve reasonable auto-tuning and a near optimal operating point.

There are two kinds of auto-tuning: static and dynamic. In static auto-tuning, driver and network adapter hardware parameters are based on system configuration, such as machine role (server versus desktop), machine hardware (CPU and memory), and the network adapter hardware capabilities. Dynamic auto-tuning is based on system conditions such as resource utilization and network load. Dynamic auto-tuning is harder to implement, but it yields better results than static auto-tuning. Static auto-tuning, by contrast, tends to overutilize resources on high-end machines because there is no mechanism to trim resource usage.

It is recommended that each driver maintain a table that is loaded from the registry upon driver initialization. The table selects the operating parameters based on CPU utilization (NdisGetCurrentProcessorCounts) and other factors, including network utilization. The driver can update these parameters by calculating the CPU utilization based on a timer (once each second or every 5 seconds), or based on certain counters, for example, the number of packets per second.

Some of the major parameters that need to be considered for auto-tuning include the following items:

  • Interrupt Moderation: Off or On, and the interrupt rate for both receives and sends

  • Packets processed per DPC, or time spent in DPC

  • Buffer allocation, especially for receive buffers

  • Other processing such as small fragment coalescing

The following table is provided as an example:

CPU Utilization %   Receive Packets/Interrupt   Send Packets/Interrupt   Packets Processed at DPC Level
<=5                 1                           1                        200
10                  1                           1                        180
...                 ...                         ...                      ...
95                  19                          70                       45
100                 20                          80                       40


One reason to load the table from the registry is to avoid hard-coded values in the driver. This also allows for experimenting with the table entries to find the optimal values for several representative workloads. The actual table must appear in the registry, rather than in the user interface, to minimize user intervention. The driver must use the registry if the table is provided.
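A table-driven lookup of this kind can be sketched as follows. The rows are hard-coded here only to keep the sketch self-contained; as recommended above, a driver would load them from the registry. The values follow one reading of the example rows in this article (CPU utilization, receive packets per interrupt, send packets per interrupt, packets per DPC), the intermediate rows are omitted here as they are in the article, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned cpu_max_percent;           /* row applies up to this utilization */
    unsigned rx_packets_per_interrupt;
    unsigned tx_packets_per_interrupt;
    unsigned packets_per_dpc;
} tuning_row_t;

/* Illustrative rows only; intermediate rows are not filled in. */
static const tuning_row_t example_table[] = {
    {   5,  1,  1, 200 },
    {  10,  1,  1, 180 },
    {  95, 19, 70,  45 },
    { 100, 20, 80,  40 },
};

/* Select the first row whose limit covers the measured CPU utilization.
 * Rows must be sorted by ascending cpu_max_percent. */
const tuning_row_t *select_tuning_row(const tuning_row_t *table, size_t rows,
                                      unsigned cpu_percent)
{
    size_t i;

    assert(table != NULL && rows > 0);
    for (i = 0; i < rows; i++) {
        if (cpu_percent <= table[i].cpu_max_percent) {
            return &table[i];
        }
    }
    return &table[rows - 1];    /* clamp out-of-range readings */
}
```

The driver would call the lookup from its periodic timer and reprogram the adapter only when the selected row changes, which avoids needless hardware writes.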

In addition, based on network traffic, it is very important for the driver to have enough buffers to handle the network adapter's maximum load. One of the most important performance issues encountered is the lack of sufficient buffers, in particular receive buffers. A lack of buffers leads to the driver dropping packets. In the case of TCP, this leads to costly retransmissions (penalties as high as 10% have been seen on network intensive workloads). From the performance point of view, it is generally better to over-allocate buffers than to under-allocate them. With auto-tuning, the driver can dynamically allocate and free its buffers, especially receive buffers, based on usage. The driver always needs to have ample buffers to handle bursts of network traffic.
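One simple policy for sizing the receive buffer pool is to track the peak number of buffers in use over an interval and keep a margin above it. The sketch below is an assumption made for illustration, not a policy prescribed by this article; the structure, function name, and headroom percentage are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizing policy for the receive buffer pool. */
typedef struct {
    uint32_t min_buffers;       /* never shrink below this */
    uint32_t max_buffers;       /* never grow beyond this */
    uint32_t headroom_percent;  /* extra margin kept for traffic bursts */
} rx_pool_policy_t;

/* Target pool size for the next interval, given the peak number of
 * receive buffers in use during the last interval. */
uint32_t rx_pool_target(const rx_pool_policy_t *p, uint32_t peak_in_use)
{
    uint64_t target;

    assert(p->min_buffers <= p->max_buffers);
    target = (uint64_t)peak_in_use * (100u + p->headroom_percent) / 100u;
    if (target < p->min_buffers) {
        target = p->min_buffers;
    }
    if (target > p->max_buffers) {
        target = p->max_buffers;
    }
    return (uint32_t)target;
}
```

Erring on the side of a generous headroom matches the guidance above that over-allocating buffers is generally better than under-allocating them.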

If the hardware supports dynamically adjusting the hardware memory allocated to send and receive buffers, auto-tuning can adjust the on-board memory reserved for send or receive buffers based on send and receive packet rates. This allows for better utilization of the on-board memory.

Call to Action

  • Use the information provided in this article to ensure that your network adapter and driver operate at the optimal system performance point.

  • The current Windows DDK contains detailed information on using the APIs and mechanisms described in this article.

  • See Power Management for related tools and information about optimizing for fast system startup on PCs running Windows XP.

  • For questions, please send e-mail to ndisfb@microsoft.com. Please be sure to include your name, title, company name, company type (IHV, ISV, ISP, or OEM), and phone number.
