
Optimizing the Performance of a Windows Network Projector (Windows Embedded CE 6.0)

1/19/2010

Qiong Xu, Brigette Huang, Luke Bayler

July 2008

The Windows Network Projector target device technology for Windows Embedded CE 6.0 lets users create a remote connection from a Windows Network Projector display to a Windows Vista desktop computer over any IP-based network. The wireless Windows Network Projector and the Windows Vista desktop computer use Web Services on Devices (WSD) to discover the Windows Network Projector and the Remote Desktop Protocol (RDP) to display the Windows Vista desktop. This technical article focuses on the performance characteristics of RDP during Windows Network Projector sessions.

Windows Embedded CE 6.0

The Windows Network Projector target device technology for CE 6.0 provides basic functionality for business user scenarios, and OEMs can customize this technology for their hardware platform. Beyond the basic functionality and any customizations, the performance of the Windows Network Projector is an important factor that OEMs should weigh in any customizations they implement. This technical article provides a general discussion of important performance guidelines and considerations, and describes the flow of execution and the data structures used in the Windows Network Projector's RDP sessions. It also presents data that Microsoft collected from profiling tests and observations from Microsoft's performance investigations.

The performance issues discussed in this technical article focus on the core business user scenarios, such as Microsoft PowerPoint slide show animations. Animation performance is crucial because animations are the most processor-intensive component of a Microsoft PowerPoint slide show. Do not rely on the performance characteristics presented in this technical article as a replacement for performance benchmark tests or other tests.

This technical article does not discuss the following topics:

  • Overall RDP performance in Windows Network Projector scenarios other than the core business user scenarios, or when the Windows Network Projector is in Thin Client mode. Thin Client RDP mode in a Windows Network Projector differs from collaborative mode: in collaborative mode, the server sends RDP packets in push mode, while in Thin Client mode the server sends RDP packets in receive-acknowledge mode.
  • The functionality and configuration of a Windows Network Projector. For more information about a Windows Network Projector, see Developing a Windows Network Projector.
  • Discovery of and connection to Web Services on Devices (WSD).

The performance of a Windows Network Projector depends on many factors. As with any performance measurement, overall performance is limited by the slowest components of the system. Microsoft tested performance bottlenecks on many hardware platforms and software configurations. These tests included performance benchmark tests, Microsoft PowerPoint tests, and graphics device interface (GDI) performance tests. Microsoft conducted performance tests on a 1.8 GHz Microsoft Windows Embedded CE PC-based hardware platform (CEPC), a 233 MHz CEPC, and a 400 MHz NEC Solution Gear 2-Vr5500. Analysis of the performance data collected on these hardware platforms showed that only Microsoft PowerPoint slide shows with demanding animations, running on hardware platforms with low-frequency CPUs, had performance issues. Test results on other hardware platforms running a Windows Network Projector showed favorable performance data.

During the development of the Windows Network Projector technology, Microsoft created a set of Microsoft PowerPoint slide shows to exercise as many as possible of the effects that Microsoft PowerPoint could have on the functionality and performance of the Windows Network Projector. Slide shows with no animation, or with animations in small regions, ran well on all hardware platforms and the user experience was favorable. In slide shows with animations that were CPU intensive and occupied a large region of a slide, the animations ran well with low latency on fast devices such as a 1.8 GHz CEPC, but performance degraded on slower devices such as the 233 MHz CEPC and the 400 MHz NEC Solution Gear 2-Vr5500. In some of the slides, frame dropping and jittering were noticeable. The following sections discuss the performance elements based on data collected from a resource-intensive slide show.

Slide shows with animations place demands on the CPU. Every component in the execution path, including networking, decryption, decompression, and the display, competes for CPU resources. Profiling data collected from a 1.8 GHz CEPC shows that the CPU is idle about 50 percent of the time; on a 233 MHz CEPC, the CPU is never idle. Based on this, Microsoft estimates that the minimum CPU speed for acceptable performance on a CEPC is about 800 MHz. However, for most slides, a 300 MHz CPU provides acceptable performance. Microsoft investigated other factors such as decompression, decryption, and the compression algorithm by modifying code, changing the configuration of the registry, and replacing the RDP 6.0 Compression algorithm with the MPPC compression algorithm. Modifying these factors did not increase the idle time of the CPU. The distribution of the CPU time did change, but CPU usage was still 100 percent, so the CPU time saved in one component was lost in another component. If the CPU power is over a certain threshold, as with the 1.8 GHz CEPC, the CPU is not the performance bottleneck. In this situation, other components such as the display driver or network throughput become the bottleneck.
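The idle figures above can be turned into a rough CPU budget. The following minimal sketch assumes that the work scales linearly with clock speed, which is a simplification; the article does not state how Microsoft arrived at its estimate of about 800 MHz. The function name busy_mhz is illustrative only.

```python
def busy_mhz(clock_mhz: float, idle_fraction: float) -> float:
    """Clock rate actually consumed, assuming work scales linearly with clock speed."""
    return clock_mhz * (1.0 - idle_fraction)

# Profiling figure from the article: a 1.8 GHz CEPC is idle about 50 percent of the time.
needed = busy_mhz(1800, 0.50)
print(f"Approximate CPU demand: {needed:.0f} MHz")

# A 233 MHz CEPC supplies only a fraction of that demand, so it never idles.
print(f"A 233 MHz CEPC covers about {233 / needed:.0%} of the demand")
```

Under this naive assumption the demand comes out near 900 MHz, which is in the same range as the estimate of about 800 MHz quoted above.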

Currently, the CPU power of the hardware platforms from Windows Network Projector vendors is around 300 to 400 MHz, so the CPU is the major bottleneck for resource intensive Microsoft PowerPoint slide shows. One solution to relieve this bottleneck is to offload certain tasks to an accompanying digital signal processor (DSP).

During Microsoft PowerPoint slide shows, most of the commands from the Windows Vista desktop computer to GDI or the display come from the StretchDIBits and SetDIBitsToDevice functions. Typically, the number of calls to other GDI functions is negligible. To simplify the analysis, Microsoft used Windows Vista desktop computers and CEPCs with resolutions of 1024x768x16 as the baseline for tests, because this setup did not involve any color conversions. The execution path shows that both of the blitting functions eventually call into the SRCCOPY raster operation (ROP). For the SRCCOPY ROP, RDP handles the source bitmap, which resides in the device's system memory, and the destination surface is in video memory. Therefore, the implementation of the bitmap copy from system memory to video memory is the key performance element, and you should analyze the speed of this copy when developing a display driver. Also consider using technologies such as direct memory access (DMA) to improve the performance in this scenario.
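To make the copy concrete, the following minimal sketch models a SRCCOPY blit as a row-by-row copy between two buffers with different strides. It is an illustration only, not display driver code: the function name blit_srccopy, the buffer sizes, and the use of Python byte buffers in place of system and video memory are all assumptions.

```python
def blit_srccopy(src: bytes, src_stride: int,
                 dst: bytearray, dst_stride: int,
                 width_px: int, height_px: int,
                 dst_x: int, dst_y: int, bytes_per_px: int = 2) -> None:
    """Copy a width_px by height_px bitmap into dst at (dst_x, dst_y), one row at a time."""
    row_bytes = width_px * bytes_per_px
    for row in range(height_px):
        s = row * src_stride                                   # start of the source row
        d = (dst_y + row) * dst_stride + dst_x * bytes_per_px  # start of the destination row
        dst[d:d + row_bytes] = src[s:s + row_bytes]

# Hypothetical 8x4 destination surface and 2x2 source bitmap, both 16 bits per pixel.
surface = bytearray(8 * 4 * 2)
bitmap = bytes(range(1, 9))  # 2 rows x 2 pixels x 2 bytes
blit_srccopy(bitmap, 4, surface, 16, 2, 2, dst_x=1, dst_y=1)
```

Each row is one contiguous copy, so the cost grows with the area of the bitmap. That is why large animated regions stress this path much more than small ones.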

The Windows Network Projector and the Windows Vista desktop computer communicate with each other through an IP-based network. The RDP connection can be either wired or wireless. Microsoft conducted most of its tests with a resource-intensive PowerPoint slide show over a wired connection, so network throughput was not a performance issue. The maximum throughput observed was 30 Mbps or 3.7 MB per second on a 1.8 GHz CEPC. On a 233 MHz CEPC, this number dropped to 14 Mbps or 2.0 MB per second because of CPU saturation and frame dropping. Because the CPU can idle on a 1.8 GHz CEPC, 3.7 MB per second is the theoretical maximum throughput required for Windows Network Projector scenarios. For most wired connections this is not an issue, but it could be an issue with wireless connections. The maximum bandwidth requirement applies only to heavily animated slides. Most Microsoft PowerPoint slide shows that do not require heavy animations run at an acceptable speed over a wireless connection, where the throughput is usually less than 1 MB per second. Microsoft is still acquiring more performance data for wireless connections.
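The relationship between the two units used above is a division by 8 (bits per byte). The following minimal sketch shows the conversion and a simple check of whether a link can carry the peak load. The helper names and the 20 Mbps and 100 Mbps link rates are hypothetical examples, not measured values.

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second."""
    return mbps / 8.0

def link_is_sufficient(link_mbps: float, required_mb_per_s: float) -> bool:
    """True if the effective link rate covers the required throughput."""
    return mbps_to_mb_per_s(link_mbps) >= required_mb_per_s

peak = mbps_to_mb_per_s(30)  # 3.75 MB per second, quoted in the article as 3.7
print(f"Peak requirement: {peak} MB per second")

# Hypothetical effective link rates, for illustration only.
for link_mbps in (20, 100):
    print(link_mbps, "Mbps sufficient for peak:", link_is_sufficient(link_mbps, peak))
```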

The impact of the cache has not yet been measured. Because CPU usage is currently maximized on almost all hardware platforms that have a slower CPU, the cache could be the dominant factor in terms of performance. Microsoft is currently investigating the impact of the cache by tuning each module.

The following illustration shows a high-level view of the Windows Network Projector component stack. In CE 6.0, the RDP protocol pipeline is implemented in Mstscax.dll. The illustration does not show components that have no major impact on performance or that are not related to the Windows Network Projector.

[Illustration: high-level view of the Windows Network Projector component stack]

The Monte Carlo Profiler shows that when a session between a Windows Network Projector and a Windows Vista desktop computer is active, there are three threads actively running. The following table describes the three threads.

Thread                       Description
_WS2AsyncSelectThread        Windows Sockets (Winsock) thread
TSStaticThreadEntry          RDP receiving and data processing thread
_CeInterruptServiceThread    Network interrupt handler

The TSStaticThreadEntry thread runs most of the time during a Windows Network Projector session. It is the RDP worker thread. Its work includes copying data from the network driver buffer to the RDP buffer (if necessary), decrypting and decompressing the RDP packets, decoding RDP orders, and sending drawing orders to GDI, after which the display driver draws to the screen.

This section discusses performance-related components within RDP. Profiling shows that the main RDP DLL, Mstscax.dll, occupies more than 40 percent of the CPU time on a 233 MHz CEPC, but only 12 percent on a 1.8 GHz CEPC. Therefore, it is a very CPU-intensive DLL. The following sections discuss the performance elements related to RDP.
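The worker thread's stages can be pictured as a chain in which each stage consumes the output of the previous one. The following minimal sketch accumulates the time spent per stage, similar in spirit to a profiler's per-function breakdown. Every stage here is a placeholder and an assumption: XOR stands in for the real cipher, zlib stands in for the RDP bulk decompressor, and decode_and_draw only counts pretend drawing orders.

```python
import time
import zlib
from collections import defaultdict

def decrypt(data: bytes) -> bytes:
    # Placeholder: XOR with a fixed byte stands in for the real cipher.
    return bytes(b ^ 0x5A for b in data)

def decompress(data: bytes) -> bytes:
    # Placeholder: zlib stands in for the RDP bulk decompressor.
    return zlib.decompress(data)

def decode_and_draw(data: bytes) -> int:
    # Placeholder: pretend that every 4 bytes is one drawing order.
    return len(data) // 4

STAGES = [("decrypt", decrypt),
          ("decompress", decompress),
          ("decode_and_draw", decode_and_draw)]

def process_packet(packet, totals):
    """Run one packet through every stage and add the elapsed time per stage to totals."""
    value = packet
    for name, stage in STAGES:
        start = time.perf_counter()
        value = stage(value)
        totals[name] += time.perf_counter() - start
    return value

totals = defaultdict(float)
payload = b"draw" * 1000
packet = decrypt(zlib.compress(payload))  # XOR is its own inverse, so this "encrypts"
orders = process_packet(packet, totals)
print(orders, dict(totals))
```

Because all stages run on one thread, time saved in one stage is only a net gain if the CPU is not already saturated by the others.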

RDP supports various mechanisms to reduce the amount of data transmitted through the network connection. The two most important mechanisms are RDP 6.0 compression and bitmap caching.

RDP 6.0 Compression is a bulk compression algorithm for RDP. The server applies it to the RDP protocol data unit (PDU), and the client decompresses the PDU. RDP 6.0 Compression can save between 15 and 200 percent of the bandwidth, depending on the data type received.

On a 233 MHz CEPC and the 400 MHz NEC Solution Gear 2-Vr5500, the RDP 6.0 Decompression function uses about 30 percent of the CPU and is always among the top two functions in terms of CPU usage. However, on the 1.8 GHz CEPC, it uses only 7 percent. Therefore, RDP 6.0 Compression is a CPU-intensive algorithm and takes many CPU cycles to complete.

Instrumentation of the RDP 6.0 Compression algorithm shows that more than a hundred RDP PDUs were processed during a resource-intensive slide show. The actual number should be much higher, because instrumented kernel profiling itself consumes resources. For each RDP packet, RDP 6.0 Compression takes one to two milliseconds, as measured by calling the GetTickCount function in the RDP 6.0 Compression profiling function. The raw data ranges from hundreds of bytes to around 11 KB, and the decompressed data ranges from 5 KB to 15 KB. In lightweight PowerPoint slide shows, RDP 6.0 Compression usually takes zero to one millisecond, the data size is much smaller, and there are far fewer RDP packets. In addition, the compression rate for a resource-intensive slide show is lower than for a lightweight PowerPoint slide show, and the compressed size is larger. This shows that the server sent a large number of orders during an animation session, which caused hardware platforms with a slow CPU to slow down during decompression, decryption, and display of the resource-intensive slide show.
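Measurements taken with a millisecond tick counter are coarse, which is why per-packet times show up as whole numbers such as zero, one, or two. The following minimal sketch shows the measurement pattern. It is an illustration under stated assumptions: tick_count_ms is a stand-in for GetTickCount, and zlib stands in for the RDP 6.0 decompressor.

```python
import time
import zlib

def tick_count_ms() -> int:
    """Millisecond tick counter, a stand-in for GetTickCount."""
    return time.monotonic_ns() // 1_000_000

def timed_decompress(packet: bytes):
    """Decompress one packet and return the data with the elapsed whole milliseconds."""
    start = tick_count_ms()
    data = zlib.decompress(packet)
    return data, tick_count_ms() - start

raw = zlib.compress(bytes(range(256)) * 40)  # 10 KB once decompressed
data, elapsed_ms = timed_decompress(raw)
print(len(raw), len(data), elapsed_ms)
```

At this resolution a single reading says little, so per-packet values are best interpreted in aggregate over many packets.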

By default, the Windows Network Projector uses RDP 6.0 Compression. However, CE 6.0, Windows CE 5.0, and Windows CE .NET 4.2 also support another compression algorithm, the MPPC compression algorithm. To use the MPPC compression algorithm, change the value of the AllowRdpStrongCompression registry entry to zero (0). The following registry sample shows the entry set to 1, which selects RDP 6.0 Compression. Using MPPC might not improve performance.

[HKEY_LOCAL_MACHINE\Software\Microsoft\Terminal Server Client]
  "AllowRdpStrongCompression" = dword:1 ; 1 for RDP 6.0 Compression and 0 for MPPC

RDP uses different levels of encryption, including RSA Security's RC4 cipher and SHATransform. The function that corresponds to the RC4 cipher is _rc4, or _xrc4 if system encryption is disabled. This stream cipher is designed to efficiently encrypt small amounts of data of varying sizes. Encryption requires a significant amount of CPU time. For example, a resource-intensive slide show spends more than 15 percent of the CPU time on decryption on a 233 MHz CEPC, and about 3 to 4 percent on a 1.8 GHz CEPC.

Microsoft modified encryption by disabling certain types of encryption. The following sample shows the registry settings for encryption.
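For readers unfamiliar with RC4, the following minimal sketch shows the textbook form of the cipher: a key-scheduling pass followed by a keystream generator whose output is XORed with the data. It is not the _rc4 or _xrc4 implementation used by RDP, and it is not suitable for production use. The per-byte loop gives a sense of why decryption cost grows with the amount of data received.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: the same call both encrypts and decrypts."""
    # Key-scheduling algorithm: permute the state array using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation: one keystream byte is XORed with each data byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))
```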

[HKEY_LOCAL_MACHINE\Software\Microsoft\Terminal Server Client]
  "Encryption enabled"=dword:0
  "UseSystemEncryption"=dword:0
  "DisableMACCheck"=dword:1

DisableMACCheck removes signature checking so SHATransform will not show in profiling data, and UseSystemEncryption=0 forces RDP to use _xrc4 instead of _rc4.

Removing compression does not improve performance, because the CPU cycles are redistributed to other RDP elements. Encryption cannot be disabled completely, because the server forces encryption to level one (1) regardless of what the client requests.

RDP applies another layer of compression to bitmaps. The unit of this compression is the bitmap data plus part of an RDP data stream. The algorithm for this compression is a 2D lossless run-length encoding (RLE) scheme.
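The basic idea of run-length encoding can be shown with a minimal one-dimensional sketch that stores each run of identical bytes as a (count, value) pair. This is a simplification for illustration: the RDP bitmap codec is a 2D scheme with its own order format, which this sketch does not reproduce.

```python
def rle_encode(row: bytes) -> list:
    """Encode a row of bytes as (count, value) runs, with counts capped at 255."""
    runs = []
    for value in row:
        if runs and runs[-1][1] == value and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            runs.append((1, value))
    return runs

def rle_decode(runs) -> bytes:
    """Expand (count, value) runs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in runs)

row = b"\x00\x00\x00\x07\x07"
encoded = rle_encode(row)
print(encoded)
print(rle_decode(encoded) == row)
```

Decoding is a cheap expansion loop, which is consistent with the low CPU share that the profiler reports for bitmap decompression in the resource-intensive case.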

According to the data flow diagram, this decompression occurs when the client receives a bitmap cache order or when a bitmap cache miss occurs. When running a resource-intensive slide show on a 233 MHz CEPC, there were no bitmap cache misses. The profiler also shows that this decompression did not use much of the CPU. However, when running a less resource-intensive slide show, the CPU spent 15 percent of the time in bitmap decompression.

In an RDP session, there are several buffer copies involved in the process pipeline. The following list shows these buffer copies:

  • Copying from the network buffer to the RDP TD buffer with winsock's recv function.
  • Copying from the TD buffer to the x224 filter's head buffer or data buffer. Sometimes the TD buffer is used directly as the client buffer, which avoids this copy.
  • When decompressing RDP 6.0 data, the decompressor writes the decompressed data to its history buffer. This buffer is the working buffer for the rest of the pipeline, so no additional buffer copy is involved.
  • If it is a bitmap cache order, the decompressed bitmap is copied into the cache line.
  • If it is a memory blit order, the bitmap will be copied from memory to the frame buffer by the display driver.

According to the Monte Carlo Profiler, the memcpy function is always the most resource-intensive function. It uses 19 percent of the CPU on a 233 MHz CEPC and 17 percent of the CPU on a 1.8 GHz CEPC.

You can also improve performance by increasing the size of the receive buffer and the data buffer. This improves performance for two reasons:

  • RDP can receive more data from the network driver, which frees the network driver buffers to receive more data and drop fewer frames.
  • More packets fit in the receive buffer, so the receive buffer can be used directly as the input buffer for RDP 6.0 Compression, which saves one memory copy operation from the receive buffer to the data buffer.
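The copy-saving idea in the second reason can be sketched with a receive call that writes straight into a preallocated buffer and then hands a view of that buffer to the next stage. This is an analogy under stated assumptions, not RDP code: the helper receive_exact, the 64 KB buffer, and the local socket pair are illustrative only.

```python
import socket

def receive_exact(sock: socket.socket, buffer: bytearray, nbytes: int) -> memoryview:
    """Receive nbytes directly into buffer and return a view of them, without an extra copy."""
    view = memoryview(buffer)
    received = 0
    while received < nbytes:
        count = sock.recv_into(view[received:nbytes])
        if count == 0:
            raise ConnectionError("peer closed the connection")
        received += count
    return view[:nbytes]

sender, receiver = socket.socketpair()  # local pair standing in for the network
recv_buffer = bytearray(64 * 1024)      # preallocated receive buffer
sender.sendall(b"rdp-packet")
packet = receive_exact(receiver, recv_buffer, 10)
print(bytes(packet))
sender.close()
receiver.close()
```

The returned view refers to the receive buffer itself, so a later stage can read the packet in place instead of copying it into a separate data buffer first.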

RDP uses bitmap caching to draw orders efficiently. Bitmaps in RDP are cached in slots. There are three cache slot types, 16, 32, and 64, depending on the size of the bitmap. The following list shows the three slot lists on a sample CEPC:

  • 120 slots of size 16 by 16
  • 120 slots of size 32 by 32
  • 337 slots of size 64 by 64
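The slot lists above can be modeled as three small caches, one per size class. The following minimal sketch picks the smallest class that fits a bitmap and counts hits and misses. The slot counts come from the sample CEPC above; the class and function names and the least-recently-used eviction policy are assumptions made for illustration and are not taken from RDP.

```python
from collections import OrderedDict

# Slot counts from the sample CEPC: slot size in pixels -> number of slots.
SLOT_COUNTS = {16: 120, 32: 120, 64: 337}

def slot_class(width: int, height: int):
    """Return the smallest slot size that fits the bitmap, or None if it is too large."""
    side = max(width, height)
    for size in sorted(SLOT_COUNTS):
        if side <= size:
            return size
    return None

class BitmapCache:
    def __init__(self):
        self.slots = {size: OrderedDict() for size in SLOT_COUNTS}
        self.hits = 0
        self.misses = 0

    def get(self, key, width, height):
        cache = self.slots.get(slot_class(width, height))
        if cache is not None and key in cache:
            cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return cache[key]
        self.misses += 1
        return None

    def put(self, key, width, height, bitmap) -> bool:
        size = slot_class(width, height)
        if size is None:
            return False  # too large to cache
        cache = self.slots[size]
        cache[key] = bitmap
        cache.move_to_end(key)
        if len(cache) > SLOT_COUNTS[size]:
            cache.popitem(last=False)  # assumed policy: evict the least recently used
        return True

cache = BitmapCache()
cache.put("logo", 20, 10, b"pixels")
print(cache.get("logo", 20, 10), cache.get("photo", 60, 60), cache.hits, cache.misses)
```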

There is also a persistent bitmap cache. On Windows Vista desktop computers, the persistent cache uses the hard disk to make the cache much larger. On CE 6.0, the persistent cache is stored in memory, or in persistent storage if persistent storage exists.

Profiling of a resource-intensive slide show on a 233 MHz CEPC showed no bitmap cache misses, so bitmap caching was not the performance bottleneck in this scenario.

To tune the performance of the Windows Network Projector, you can disable RDP orders so that bitmaps are sent instead. This yields better performance in Microsoft PowerPoint scenarios without affecting the performance in other scenarios. Other factors that can affect Windows Network Projector performance on different platforms are the size of the RDP receive buffer and the threshold size that RDP uses to decide when to call the Winsock API to retrieve data. You can adjust these factors through the registry, as the following registry sample shows.

[HKEY_LOCAL_MACHINE\Software\Microsoft\Terminal Server Client]
    "NPPerfTune"=dword:1
    "NPRecvBufferSize"=dword:80000
    "NPRecvThreshold"=dword:800

Registry key        Description
NPPerfTune          Setting NPPerfTune to 1 disables all RDP orders and allows the size of the RDP receive buffer to be adjusted with the other registry entries. If NPPerfTune does not exist or is set to 0, RDP orders and the receive buffer cannot be adjusted. The default value is 0.
NPRecvBufferSize    Sets the size of the receive buffer. The default value is 64 KB. Try values between 64 KB and 1024 KB.
NPRecvThreshold     Sets the receive threshold, which controls how often RDP calls Winsock to retrieve data. RDP calls Winsock to fill the receive buffer whenever the data size is below NPRecvThreshold. The default value is 2 KB. Try values from 2 KB up to five percent of the receive buffer size. The larger the threshold, the less frequently Winsock is called.
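Registry samples in this format express dword values in hexadecimal, so the sample values are easier to compare with the recommended ranges after conversion. The following minimal sketch decodes the two sample values and checks them against the ranges given in the descriptions. The helper names dword_to_int and check_tuning are hypothetical, and the check assumes the values are byte counts.

```python
KB = 1024

def dword_to_int(text: str) -> int:
    """Parse a registry-sample value such as 'dword:80000' (hexadecimal)."""
    prefix = "dword:"
    if not text.startswith(prefix):
        raise ValueError(f"not a dword value: {text!r}")
    return int(text[len(prefix):], 16)

def check_tuning(buffer_size: int, threshold: int) -> list:
    """Return a list of problems with respect to the recommended ranges."""
    problems = []
    if not 64 * KB <= buffer_size <= 1024 * KB:
        problems.append("NPRecvBufferSize outside 64 KB to 1024 KB")
    if not 2 * KB <= threshold <= buffer_size * 0.05:
        problems.append("NPRecvThreshold outside 2 KB to 5 percent of the buffer")
    return problems

buffer_size = dword_to_int("dword:80000")  # 524288 bytes = 512 KB
threshold = dword_to_int("dword:800")      # 2048 bytes = 2 KB
print(buffer_size // KB, "KB buffer,", threshold // KB, "KB threshold")
print(check_tuning(buffer_size, threshold))
```

Under these assumptions the sample configures a 512 KB receive buffer and a 2 KB threshold, both of which fall inside the recommended ranges.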

After reading this technical article, you should understand how to approach the performance analysis and optimization of a resource-intensive Microsoft PowerPoint slide show on a Windows Network Projector.
