Media Support in the Microsoft Windows Real-Time Communications Platform
Published November 2001
Updated June 2005
Summary: The Real-Time Communications (RTC) platform is a set of core components that provide rich real-time communications features. This platform is used by many in the industry, as well as by various Microsoft products. This paper outlines the media-related features and enhancements provided by these components. Application developers can use the RTC SDK to integrate real-time features that add video and voice to new or existing applications. (7 printed pages)
Video Bandwidth and Frame Rate
Acoustic Echo Cancellation (AEC)
Redundant Audio Coding
Dynamic Jitter Buffer and Adjustment
Automatic Gain Control (AGC)
Quality Control Algorithms
With the introduction of Microsoft Windows XP, rich communications features have been combined and enhanced to provide the infrastructure for a real-time communications (RTC) experience. These features are leveraged by various Microsoft applications, including Microsoft Office Communicator, MSN Messenger, and Windows Messenger, to expose user-to-user communications by using real-time voice and video, instant messaging, and other collaboration features. In addition, an application programming interface (API) exposes functions that make this rich communications infrastructure available to any application.
This paper details the media features that were added to the Real-Time Communications platform, which provides a rich experience for both end-users and application developers. When applications are built on the Real-Time Communications platform, the end user receives a vivid audio and video experience, and the developer gets a broad set of functionality. Applications built using this API also have access to the instant messaging and presence functionality that the Real-Time Communications platform provides. Information about the RTC API can be found in the Windows Platform SDK documentation.
This paper discusses the following features and improvements:
- Audio and Video Codec Availability
- Acoustic Echo Cancellation
- Redundant Audio Coding
- Dynamic Jitter Buffer and Adjustment
- Automatic Gain Control
- Bandwidth Estimation
- Quality Control Algorithms
The real-time communications platform supports the audio codecs listed in the table below, along with their sampling rates and bit rates. The codec selection is based on both the capability of the parties involved in the session and the bandwidth between them. For example, if one party is on a dial-up link with a speed of 56 kilobits per second (Kbps), G.711 will be disabled, because its 64 Kbps bit rate exceeds the bandwidth available. To take another example, if one party supports SIREN but the other does not, SIREN will be disabled. If both parties support SIREN and the bandwidth is sufficient, SIREN will be chosen over other audio codecs. Except for G.729, the platform does not support plugging in third-party audio codecs.
| Codec | Sampling Rate | Bit Rate | RTP Packet Duration |
|---|---|---|---|
| G.711 | 8 kHz | 64 Kbps | 20 msec |
| G.722.1 | 16 kHz | 24 Kbps | 20 msec |
| G.723 | 8 kHz | 6.4 Kbps | 30 msec, 60 msec, or 90 msec |
| GSM | 8 kHz | 13 Kbps | 20 msec |
| DVI4 | 8 kHz | 32 Kbps | 20 msec |
| SIREN | 16 kHz | 16 Kbps | 20 msec or 40 msec |
Audio Codec Selection
The codec for an outgoing audio stream is selected and configured based on the following factors:
- Available bandwidth
- The maximum possible bandwidth
- The presence of an outgoing video stream
- Preferred codec order from the SDP negotiation
- Predefined minimum bandwidth for each audio codec
- Whether or not the RTP module has reported bandwidth estimation
- The bandwidth threshold for codec switch
The best codec and the frame size will be selected for different conditions. The changes will happen dynamically during the call. Endpoints that need to interoperate with Microsoft real-time clients should be prepared to support dynamic payload type and frame size changes if multiple payload types are published for the same media in the SDP (Session Description Protocol).
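The selection logic described above can be sketched as follows. The codec names and bit rates come from the table in this paper; the simple first-fit preference rule and the function name are assumptions made for illustration, not the platform's actual algorithm.

```python
# Illustrative sketch of bandwidth- and capability-based audio codec
# selection. Codecs are listed in an assumed preference order (SIREN
# first, per the text); the first codec supported by both parties whose
# bit rate fits the available bandwidth wins.

# (codec, sampling rate in kHz, bit rate in Kbps)
CODECS = [
    ("SIREN", 16, 16),
    ("G.722.1", 16, 24),
    ("G.711", 8, 64),
    ("GSM", 8, 13),
    ("DVI4", 8, 32),
    ("G.723", 8, 6.4),
]

def select_audio_codec(local_caps, remote_caps, available_kbps):
    """Pick the first mutually supported codec whose bit rate fits."""
    for name, rate_khz, bitrate_kbps in CODECS:
        if name in local_caps and name in remote_caps \
                and bitrate_kbps <= available_kbps:
            return name
    return None

# On a 56 Kbps dial-up link, G.711 (64 Kbps) is ruled out, as described above.
print(select_audio_codec({"G.711", "SIREN"}, {"G.711", "SIREN"}, 56))  # SIREN
print(select_audio_codec({"G.711"}, {"G.711"}, 56))                    # None
```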
The H.263 and H.261 codecs are supported for video. H.263 is always preferred. The bit rate for this codec can vary from 6 to 125 KBps, depending on network conditions. The platform supports both Quarter Common Intermediate Format (QCIF, 176 x 144) and Common Intermediate Format (CIF, 352 x 288). The platform supports a variety of capture modes. In priority order, the currently enabled modes are: MSH263, MSH261, YVU9, I420, IYUV, YUY2, UYVY, RGB16, RGB24, RGB4, and RGB8. Plugging in third-party video codecs is not supported.
The computed setting for the outgoing video stream is based on the following factors:
- Available bandwidth.
- The gross bandwidth of the newly selected audio codec.
- The temporal-spatial trade-off set by the application.
The computed video bit rate and frame rate are based on the above factors so that video will not interrupt the audio traffic. Again, all changes happen dynamically. The application can use MaxBitRate and TemporalSpatialTradeOff properties on the IRTCClient interface to influence the algorithm, but it cannot dictate the final settings.
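As a rough illustration of the rule that audio must not be interrupted by video, the following hypothetical sketch derives a video bit rate from the factors above. The 6 to 125 range for H.263 comes from the text; the arithmetic and the function name are illustrative assumptions, not the platform's actual algorithm.

```python
# Illustrative sketch: audio gets its gross bandwidth first, and video
# receives whatever remains, clamped to the codec's supported range and
# the application's MaxBitRate hint (all names hypothetical).

def video_bit_rate(available_kbps, audio_kbps, max_bit_rate_kbps=None):
    MIN_VIDEO, MAX_VIDEO = 6.0, 125.0   # H.263 range from the text
    remainder = available_kbps - audio_kbps
    if remainder < MIN_VIDEO:
        return 0.0                      # not enough left over: audio wins
    if max_bit_rate_kbps is not None:
        remainder = min(remainder, max_bit_rate_kbps)
    return min(remainder, MAX_VIDEO)

print(video_bit_rate(100.0, 16.0))      # 84.0: audio carved out first
print(video_bit_rate(20.0, 16.0))       # 0.0: too little left for video
print(video_bit_rate(300.0, 64.0))      # 125.0: clamped to the codec maximum
```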
AEC (acoustic echo cancellation) works by modeling the output from the speakers and removing it from the signal captured by the microphone. AEC helps to ensure that no echo is heard at the other end.
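The idea of modeling the speaker output and subtracting it from the microphone capture can be shown with a textbook adaptive filter. The sketch below uses a normalized LMS (NLMS) filter, a standard technique for this problem; it is purely illustrative and is not the DirectSound AEC implementation.

```python
# Minimal NLMS echo-cancellation sketch: the far-end (speaker) signal is
# filtered through a learned model of the echo path, and the model's
# output is subtracted from the microphone signal.
import numpy as np

def nlms_cancel(far_end, mic, taps=32, mu=0.5, eps=1e-8):
    """Return the echo-cancelled microphone signal."""
    w = np.zeros(taps)                     # adaptive echo-path model
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]      # most recent far-end samples
        e = mic[n] - w @ x                 # residual after cancellation
        w += mu * e * x / (x @ x + eps)    # NLMS weight update
        out[n] = e
    return out

# Simulated echo: the speaker signal delayed by 5 samples and attenuated.
rng = np.random.default_rng(0)
far = rng.standard_normal(4000)
echo = 0.6 * np.concatenate([np.zeros(5), far[:-5]])
cleaned = nlms_cancel(far, echo)
# After adaptation, the residual energy is far below the echo energy.
```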
AEC can be enabled or disabled through the Audio and Video Tuning wizard. In Microsoft products, this wizard is commonly found on the Tools or Options menus. As shown in Figure 1 below, selecting the I am using headphones check box in the Audio and Video Tuning wizard disables AEC. AEC is on by default if this check box is clear. Many cameras and microphones ship with hardware-specific AEC. Hardware-specific AEC often disables this check box, which the user then sees as dimmed. For information about configuring hardware-specific AEC, see the OEM's software documentation.
Figure 1. Audio and Video Tuning Wizard dialog box
AEC can be programmatically enabled or disabled by using the PreferredAEC method on the IRTCClient interface.
For more information about the RTC Client API and its interfaces, see the Windows Platform SDK documentation.
The AEC module that the real-time media platform uses is part of the Microsoft DirectSound application programming interface. This module includes, among other things, the following features and limitations:
- AEC works in small rooms only, up to 25 by 15 by 9 feet.
- AEC works only on mono streams. If the output is multichannel stereo, only one of the channels will receive echo cancellation.
- AEC does not cancel audio coming from other sources, such as a song played on the radio in the background.
Redundant audio coding is a technique that is used to compensate for packet losses. The real-time media platform implements a one-packet redundancy algorithm. When it is enabled, each packet will carry both the current audio frame and one of the earlier audio frames. If a packet is lost, the receiver has a second chance to get the audio frame in a later packet. This process is documented in IETF RFC 2198. The maximum number of consecutive packets that can be recovered is three. This algorithm adapts to information provided by the Real-Time Control Protocol (RTCP).
The algorithm starts with zero redundancy and introduces redundancy when packet loss is detected. The distance between the original packet and the packet that carries a copy of the original data determines how many lost packets can be recovered. This distance can vary from one to three packets. For example, if the distance is two and a program loses packet n, it will get the same information in packet n+2. If it loses both packet n and packet n+1, it can still recover all the information from packets n+2 and n+3. If it loses the n, n+1, and n+2 packets, then the information in packet n cannot be recovered (it was in packet n+2). The table below shows the distance for the different low and high packet loss rates.
| Distance | Loss Rate (Low) | Loss Rate (High) |
|---|---|---|
The Real-Time Communications platform performs Redundant Audio Coding automatically.
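The distance mechanism described above can be sketched in a few lines. The pairing of a current frame with an earlier frame follows RFC 2198 as summarized in the text; the packet layout and function names here are simplified assumptions for illustration.

```python
# Illustrative sketch of distance-based audio redundancy: each packet
# carries its own frame plus a copy of the frame sent `distance`
# packets earlier, so an isolated loss can be repaired later.

def build_packets(frames, distance):
    """Pair each frame with the frame from `distance` packets back."""
    return [(i, frames[i], frames[i - distance] if i >= distance else None)
            for i in range(len(frames))]

def recover(packets, lost, distance):
    """Reassemble frames from the packets that survived the network."""
    got = {}
    for seq, primary, redundant in packets:
        if seq in lost:
            continue
        got[seq] = primary
        if redundant is not None:
            got.setdefault(seq - distance, redundant)
    return got

frames = [f"frame{i}" for i in range(8)]
pkts = build_packets(frames, distance=2)
# Lose packets 3 and 4: their frames come back inside packets 5 and 6.
print(sorted(recover(pkts, {3, 4}, 2)))        # all 8 frames recovered
# Lose packets 3, 4, and 5: frame 3's copy (in packet 5) is gone too.
print(3 in recover(pkts, {3, 4, 5}, 2))        # False
```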
Jitter buffers smooth delay variations in received audio by buffering the packets and adjusting their rendering. The result is a smoother delivery of audio to the user. The client has a jitter buffer that can grow to 500 msec. In other words, the buffer can absorb up to 500 msec of delay variations in the received packets without causing choppy sound.
The total render buffer is a two-second circular buffer. If a packet storm gives the program more than two seconds' worth of data in a very short time, new packets will be discarded.
The jitter buffer is readjusted at the beginning of a new audio spurt. By default, the real-time media platform uses silence suppression.
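The buffering behavior described above can be sketched with a small class. The 500 msec cap comes from the text; the 20 msec frame duration, the class design, and all names are illustrative assumptions.

```python
# Simplified jitter-buffer sketch: packets are held in sequence order
# and released one per render tick, absorbing arrival-time variation up
# to the buffer's cap; packets beyond the cap are dropped.
import heapq

class JitterBuffer:
    def __init__(self, max_depth_ms=500, frame_ms=20):
        self.max_frames = max_depth_ms // frame_ms  # 25 frames at 20 msec
        self.heap = []                              # ordered by sequence number

    def put(self, seq, payload):
        """Buffer a packet; returns False if the buffer is full."""
        if len(self.heap) < self.max_frames:
            heapq.heappush(self.heap, (seq, payload))
            return True
        return False

    def get(self):
        """Called once per render tick; returns the next frame in order."""
        return heapq.heappop(self.heap)[1] if self.heap else None

jb = JitterBuffer()
for seq in (2, 0, 1):                # packets arrive out of order
    jb.put(seq, f"frame{seq}")
print(jb.get(), jb.get(), jb.get())  # rendered back in order
```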
AGC (automatic gain control) is a mechanism by which gain is adjusted automatically as the input signal level changes. The real-time media platform implements AGC by adjusting the microphone gain depending on the level of the captured audio.
When the capture or render device's audio output no longer varies according to the input gain, so that the output is essentially a flat line at maximum level, the audio breaks up. This condition is called clipping. When the real-time media platform detects that the running average peak Pulse Code Modulation (PCM) value (the audio gain) of each packet exceeds a ceiling threshold, it automatically reduces the gain so that clipping does not occur.
On the other hand, if the captured audio is too low (for example, if the running average peak PCM value of each packet is below a floor threshold), the real-time media platform boosts the gain. However, the gain is adjusted so that the level does not exceed the levels set by the user in the tuning wizard.
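The ceiling-and-floor rule in the two paragraphs above can be sketched as a single adjustment step. The thresholds, step size, and function name below are hypothetical; only the overall behavior (reduce gain above a ceiling, boost below a floor, never exceed the user's tuning-wizard level) comes from the text.

```python
# Illustrative AGC step: compare the running average peak PCM value of
# recent packets against a ceiling and a floor, and nudge the microphone
# gain accordingly. All numeric thresholds are assumed, not documented.

def agc_step(gain, running_peak, ceiling=30000, floor=5000,
             step=0.05, user_max=1.0):
    if running_peak > ceiling:
        gain *= 1.0 - step          # back off before clipping occurs
    elif running_peak < floor:
        gain *= 1.0 + step          # boost a too-quiet capture
    return min(gain, user_max)      # honor the user's tuning-wizard ceiling

print(agc_step(1.0, 32000))         # reduced: signal is near clipping
print(agc_step(0.5, 1000))          # boosted: signal is too quiet
print(agc_step(1.0, 1000))          # unchanged: already at the user's maximum
```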
The actual available bandwidth may be less than the local connection speed reported by using Windows Sockets. Several factors can cause this discrepancy, including a low-speed connection in the path or bandwidth consumed by other applications.
In order to estimate the actual available bandwidth, the real-time media platform sends back-to-back RTCP packets (a technique commonly referred to as packet-pair bandwidth estimation). The other endpoint measures the delay between the packets to estimate the actual bandwidth. The estimation is initially done for every RTCP report, which arrives approximately once every 5 seconds; the frequency is then gradually reduced to one estimation for every three RTCP reports.
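The receiver-side arithmetic of packet-pair estimation can be shown in one line: two packets sent back-to-back are spread apart by the bottleneck link, so the second packet's size divided by the inter-arrival gap approximates the link's capacity. The function name is hypothetical, and the platform's RTCP-based exchange is more involved than this sketch.

```python
# Packet-pair bandwidth estimation sketch: bits carried by the second
# packet divided by the dispersion (inter-arrival gap) at the receiver.

def estimate_bandwidth_kbps(packet_bytes, arrival_gap_s):
    return packet_bytes * 8 / arrival_gap_s / 1000

# A 1500-byte packet arriving 0.1 s after its twin implies ~120 Kbps.
print(estimate_bandwidth_kbps(1500, 0.1))   # 120.0
```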
The aim of quality control (QC) in the real-time media platform is to provide a good audio and video experience under different network conditions. QC constantly monitors network conditions, computes the available bandwidth for outgoing streams, and dynamically alters the settings of outgoing audio and video streams to keep them smooth and to minimize jitter and delay. Between the outgoing audio and video streams, QC puts a higher priority on audio.
QC applies its adjustments to outgoing streams when it receives commands or events from the application, the remote party, or the Real-Time Transport Protocol (RTP) module. The application triggers an adjustment by adding or removing streams or by changing the maximum bit rate setting. The adjustment is also triggered when the remote party sends a new SDP (Session Description Protocol) packet, which in turn changes streams and bit rate settings. The RTP module periodically sends real-time communications media events to inform the peer of the estimated bandwidth and packet loss rate. Upon receiving these events, QC adjusts the outgoing audio and video streams.
The QC algorithm consists of three main parts:
- Computing available bandwidth for outgoing streams
- Dynamically selecting and setting up the audio codec
- Computing bandwidth and frame rate for the video outgoing stream
QC computes the available bandwidth for outgoing streams according to these factors:
- Local Bandwidth
Local bandwidth is the link speed detected minus the reserved bandwidth. Reserved bandwidth is the lesser of 20 Kbps and two-fifths of the detected link speed. Bandwidth is reserved for usage other than audio/video streaming, such as SIP (Session Initiation Protocol) signaling. At the beginning of the call, before any estimated bandwidth is reported by the RTP module, local bandwidth is restricted so that it is not greater than 120 Kbps if the detected value is greater than 200 Kbps.
- Remote Bandwidth
Remote bandwidth is received from the SDP packet.
- Application Bandwidth
Application bandwidth is set by the application. It has an upper limit of 1 Mbps. The application can configure this by setting the MaxBitrate property on the IRTCClient interface.
- Estimated Bandwidth
Estimated bandwidth is the bandwidth reported by the RTP module minus the reserved bandwidth. Reserved bandwidth is the lesser of 10 Kbps and three-tenths of the reported value.
- Previously Allocated Bandwidth
Previously allocated bandwidth is the available bandwidth computed during the last media session between the two devices.
- Current Bandwidth
Current bandwidth is the actual total bandwidth in use by outgoing streams.
- Current Loss Rate
Current loss rate is the loss percentage of packets sent from the local end.
- Number of Continuous Zero Loss reported
When a zero loss report is received, this number is incremented by one. When a non-zero loss report is received, this number is reset to zero.
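Two of the factors above come with concrete rules that can be checked with a short sketch: local bandwidth subtracts a reserve equal to the lesser of 20 Kbps and two-fifths of the detected link speed (and is held to 120 Kbps early in the call when the link reports more than 200 Kbps), while estimated bandwidth subtracts the lesser of 10 Kbps and three-tenths of the RTP module's report. The function names below are hypothetical; the arithmetic is taken directly from the text.

```python
# Sketch of the reserved-bandwidth rules quoted in the factor list above.

def local_bandwidth_kbps(link_speed_kbps, have_rtp_estimate):
    reserved = min(20.0, 0.4 * link_speed_kbps)   # lesser of 20 Kbps and 2/5
    local = link_speed_kbps - reserved
    if not have_rtp_estimate and link_speed_kbps > 200.0:
        local = min(local, 120.0)                 # restricted until RTP reports
    return local

def estimated_bandwidth_kbps(rtp_reported_kbps):
    reserved = min(10.0, 0.3 * rtp_reported_kbps) # lesser of 10 Kbps and 3/10
    return rtp_reported_kbps - reserved

print(local_bandwidth_kbps(56.0, False))    # 36.0: 56 minus the 20 Kbps reserve
print(local_bandwidth_kbps(1000.0, False))  # 120.0: capped until RTP reports
print(local_bandwidth_kbps(1000.0, True))   # 980.0: cap lifted
print(estimated_bandwidth_kbps(100.0))      # 90.0
```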
The media features included in the Real-Time Communications platform enable developers to create rich VoIP and Video over IP client experiences in their applications. The Real-Time Communications platform is used by developers both inside and outside Microsoft, and applications built on it place millions of calls each month.
See the following resources for further information: