High Performance Network Adapters and Drivers
Updated: January 15, 2003
Networking performance has increasingly become one of the most important factors in overall system performance. Many of the factors that affect networking performance fall into three categories: network adapter hardware and driver performance, network stack performance, and the way applications interact with the network stack. This article discusses issues related to network adapter hardware and driver performance in Microsoft Windows 2000, Windows XP, and Windows Server 2003.
Networking performance is measured by throughput and response time. It is also important to reach the optimal performance operating point without over-utilizing system resources. To achieve such an operating point, network adapter vendors need to examine several areas, including:
Network Adapter Hardware
There are always tradeoffs in deciding which hardware functions to implement on a network adapter. It is becoming increasingly important to consider adding task offload features, supporting interrupt moderation, allowing dynamic tuning of the hardware, improving use of the PCI bus, and supporting Jumbo Frames. These features are particularly important for high-end network adapters that will be used in configurations requiring top performance.
Task Offload Features
Task offload features can be implemented in the driver or in the firmware, but for optimal performance it is best to implement them in the hardware. Windows enables three task-offload features. Full documentation on task offload can be found in the Windows DDK.
Interrupt Moderation
A simple network adapter generates a hardware interrupt on the host upon the arrival of a packet or to signal completion of a packet send request. Interrupt latency and the resulting cache-churning effects add overhead to overall networking performance. In many scenarios (for example, heavy system usage or heavy network traffic), it is best to reduce the cost of hardware interrupts by processing several packets for each interrupt.
Throughput improvements of up to 9% have been measured on network-intensive workloads. However, tuning interrupt moderation parameters for throughput alone may hurt response time. To maintain optimal settings and accommodate different workloads, it is best to allow the parameters to be adjusted dynamically, as described in the Auto-tuning section later in this article.
PCI Bus Usage
One of the most important factors in network adapter hardware performance is efficient use of the PCI bus. Furthermore, the network adapter's DMA performance affects the performance of all PCI cards on the same PCI bus. The following guidelines should be considered when optimizing PCI usage:
Jumbo Frame Support
Supporting larger Maximum Transmission Units (MTUs), and thus larger frame sizes, specifically Jumbo Frames, reduces the per-byte overhead incurred in the network stack. A 20% TCP throughput increase has been measured when the MTU was increased from 1514 to 9000. CPU utilization also drops significantly because fewer calls are made from the network stack to the network driver.
Network Adapter Driver
The performance impact of the network adapter driver is an important factor in overall network performance. Performance issues with the network adapter driver fall into three categories: path length, scalability, and alignment.
Although the send and receive paths differ from driver to driver, there are some general rules for performance optimizations:
Scalability is especially important to address on SMP systems. Various scalability issues are discussed in this section.
Partitioning is needed to minimize shared data and code across processors. Partitioning helps reduce system bus utilization and improves the effectiveness of processor cache. To minimize sharing, driver writers should consider the following:
False Sharing False sharing occurs when processors access variables that are logically independent of each other but happen to reside on the same cache line, so the line is shared among the processors. In this situation, the cache line travels back and forth between processors on every access to any variable in it, increasing cache flushes and reloads. This raises system bus utilization and reduces overall system performance.
Locking Mechanisms (Avoiding Spinlocks) The use of spinlocks can be expensive in terms of performance if they are not used properly. Drivers should avoid spinlocks as much as possible by using interlocked operations where applicable. However, a spinlock might still be the best choice for some purposes; for example, if a driver already holds a spinlock while handling the reference count for the number of packets that have not been indicated back to the driver, a separate interlocked operation is unnecessary.
Other suggestions for locking mechanisms include the following:
64-Bit DMA If the network adapter supports 64-bit DMA, steps must be taken to avoid extra copies for addresses above the 4 GB boundary: when the driver calls NdisMInitializeScatterGatherDma, Dma64Addresses must be set to TRUE.
Mapping Sent User Data
Setting NDIS_ATTRIBUTE_USES_SAFE_BUFFER_APIS improves performance by avoiding mapping buffers to system virtual addresses wherever possible. The following are the NdisXxxSafe set of APIs:
Buffer alignment on a cache-line boundary improves performance when copying data from one buffer to another. Most network adapter receive buffers are properly aligned when they are first allocated, but the user data that must eventually be copied into the application buffer is misaligned because of the space consumed by the headers. With TCP data (the most common scenario), the TCP, IP, and Ethernet headers shift the data by 0x36 (54) bytes. To resolve this problem for the common case, we recommend allocating a slightly larger buffer and placing packet data at an offset of 0xa (10) bytes. On a 64-byte cache line, the 0x36 bytes of headers then end exactly on a line boundary, leaving the user data properly aligned.
Scatter-Gather DMA provides the hardware with support for transferring data to and from noncontiguous ranges of physical memory. Scatter-Gather DMA uses a SCATTER_GATHER_LIST structure, which contains an array of SCATTER_GATHER_ELEMENT structures and the number of elements in the array. This structure is retrieved from the packet descriptor passed to the driver's send function. Each element of the array provides the length and starting physical address of a physically contiguous scatter-gather region. The driver uses the length and address information to transfer the data.
Using the Scatter Gather routines for DMA operations can improve utilization of system resources by not locking these resources down statically, as would occur if map registers were used.
If the network adapter supports TCP Segmentation Offload (Large Send Offload), the driver needs to pass the maximum buffer size it can receive from TCP/IP in the MaximumPhysicalMapping parameter of the NdisMInitializeScatterGatherDma function. This guarantees that enough map registers are available to build the scatter-gather list, eliminating possible buffer allocations and copying.
Auto-tuning is an important concept and needs to be implemented in the driver or network adapter hardware. It is important to maintain optimal performance while eliminating all unnecessary and confusing user controls.
In order to ensure the optimal performance operating point, the network adapter driver must support auto-tuning of its parameters. The driver must periodically adjust the hardware and software to maximize both throughput and response time without a driver reset. Not all parameters can be adjusted, especially hardware parameters. Auto-tuning is preferable to static parameters because the optimal values of parameters may change over time, and static parameters are hard for users to understand and tune correctly.
Auto-tuning is a difficult problem to solve. Much research and development time can be consumed to achieve optimal tuning algorithms. However, there are many simple algorithms that can be considered, allowing you to achieve reasonable auto-tuning and a near optimal operating point.
There are two kinds of auto-tuning: static and dynamic. In static auto-tuning, driver and network adapter hardware parameters are based on system configuration, such as the machine role (server versus desktop), the machine hardware (CPU and memory), and the network adapter hardware capabilities. Dynamic auto-tuning is based on system conditions such as resource utilization and network load. Dynamic auto-tuning is harder to implement, but it yields better results than static auto-tuning. Moreover, static auto-tuning tends to overutilize resources on high-end machines because there is no mechanism to trim resource usage.
It is recommended that each driver maintain a table that is loaded from the registry upon driver initialization. The table selects the operating parameters based on CPU utilization (NdisGetCurrentProcessorCounts) and other factors, including network utilization. The driver can update these parameters by calculating the CPU utilization on a timer (for example, once every one to five seconds) or based on certain counters, such as the number of packets per second.
Some of the major parameters that need to be considered for auto-tuning include the following items:
The following table is provided as an example:
One reason to load the table from the registry is to avoid hard-coding values in the driver. Loading from the registry also allows experimenting with the table entries to find the optimal values for several representative workloads. The table must appear in the registry, rather than in the user interface, to minimize user intervention, and the driver must use the table if it is present.
In addition, based on network traffic, it is very important for the driver to have enough buffers to handle the network adapter's maximum load. One of the most important performance issues encountered is the lack of sufficient buffers, in particular receive buffers. A lack of buffers leads to the driver dropping packets. In the case of TCP, this leads to costly retransmissions (penalties as high as 10% have been seen on network intensive workloads). From the performance point of view, it is generally better to over-allocate buffers than to under-allocate them. With auto-tuning, the driver can dynamically allocate and free its buffers, especially receive buffers, based on usage. The driver always needs to have ample buffers to handle bursts of network traffic.
If the hardware supports dynamically adjusting the hardware memory allocated to send and receive buffers, auto-tuning can adjust the on-board memory reserved for send or receive buffers based on send and receive packet rates. This allows for better utilization of the on-board memory.
Call to Action