Windows Media Audio Professional Codec Features


by Nick Vicars-Harris

Microsoft Corporation

September 2004


Applies to:

   Microsoft® Windows Media® Format SDK

   Microsoft Windows Media Encoder 9 Series


Summary: This article explains how the Windows Media Audio Professional codec transforms multichannel audio content into stereo audio content when that is necessary. It also shows how you can create an encoding application that enables a content author to determine how that transformation happens.



Fold-down to Stereo

   Matrix Fold-down

   Author-controlled Fold-down

   Fold-down Algorithm

Dynamic Range Control

Code Examples

   Writing Multichannel Output

   Reading Multichannel Audio

For More Information


The Microsoft® Windows Media® Audio Professional codec supports high-resolution multichannel content, author-controlled fold-down from multiple channels to stereo, and dynamic range control. This article provides information that enables programmers using the Windows Media Format Software Development Kit (SDK) to create encoding or rendering applications that take advantage of these capabilities.

Multichannel playback requires the Windows® XP operating system. An important feature of the codec is the ability to fold down multiple channels into stereo playback on earlier versions of the Windows operating system or on system configurations that do not support multichannel playback (systems having only two speakers, for example).

This article describes the multichannel fold-down algorithm in detail, and describes how dynamic range control works.

This article includes the following topics:

  • Fold-down to Stereo. Describes how an encoding application can provide values that affect the fold-down process, and provides details about the fold-down algorithm.
  • Dynamic Range Control. Describes how an encoding application can provide peak and average volume levels that will be used for dynamic range control.
  • Code Examples. Shows how to insert and retrieve the relevant metadata attributes in an encoded file.

Fold-down to Stereo

The Windows Media Audio Professional codec supports multichannel encoding and decoding. If the playback system supports multichannel playback, then no further action is required of an application developer. Multichannel playback requires Windows Media Player 9 Series or later on the Windows XP operating system. If the playback system does not support multichannel playback, then the decoding process must transform the multichannel content into stereo output.

Fold-down is the process of taking a certain number of channels of audio content and restructuring them in fewer channels. The fold-down discussed in this article is from six channels to two channels. The six-channel audio configuration discussed includes two speakers right of center, two left of center, one center channel speaker, and a subwoofer. This configuration is referred to as 5.1 audio because there are five standard speakers and one subwoofer.

The following circumstances require this fold-down process:

  • Playing multichannel content using a version of Windows Media Player earlier than Windows Media Player 9 Series or on an operating system earlier than Windows XP.
  • Playing multichannel content using a supported Player and Windows version when the Windows speaker configuration is set to any two-speaker option. To set the speaker configuration, open Control Panel and click Sounds and Audio Devices.
  • Exporting multichannel content to portable devices using Windows Media Player 9 Series or later.
  • Using Windows Media Encoder 9 Series to transcode multichannel content that was encoded with the Windows Media Audio Professional codec into two-channel content using a version 8 or later codec.

The fold-down process starts with three input values, called the fold-down coefficients. These values are used in formulas that yield the signal levels from each of the six input channels that will be combined to produce the two output channels.

An encoding application is responsible for presenting a user interface that enables a content author to specify the fold-down coefficients. The application then uses the coefficients to calculate the various signal levels. Finally, the application inserts a metadata attribute in the encoded file. The value of the metadata attribute is a comma-delimited string containing the signal levels.

During decoding, the Windows Media Audio Professional codec determines whether fold-down is required. If so, it retrieves the signal level values from the metadata header in the encoded file and performs the fold-down.

Matrix Fold-down

The Windows Media Audio Professional codec uses a special fold-down mode if the playback system uses the following configuration:

  • Windows Media 9 Series or later technology.
  • Windows XP operating system.
  • Windows speaker configuration is configured to support surround sound speakers.

In this configuration, the two-channel folded-down signal is mixed down to two physical channels in a way that a Matrix Decoder can interpret. If the audio card is sending its output to a consumer audio/video receiver with a Matrix Decoder, the receiver will be able to decode the input signal back to four channels (left, right, center, and surround) for a surround sound experience.

This mode does not use any fold-down metadata that may have been provided by the content author.

This mode is not used to fold down 7.1 channel content. Content that includes 7.1 channel audio is always folded down to stereo.

Author-controlled Fold-down

To influence fold-down behavior, content authors can provide up to three fold-down coefficients. An encoding application should present a user interface, like the one in Windows Media Encoder, that allows an author to enter these values. The application then writes fold-down information to the file by using interfaces provided by the Windows Media Format SDK.

The following table describes the three fold-down coefficients.

Fold-Down Coefficient   Description
SurroundMix             Mix level, in decibels (dB), applied to the Left Surround (LS) and Right Surround (RS) channels.
CenterMix               Mix level, in dB, applied to the Center (C) channel.
LFEMix                  Mix level, in dB, applied to the Low Frequency Effect (LFE) channel.

The application must enforce the limits described in the following table.

Parameter                           Value
SurroundMix default value           -3 dB
CenterMix default value             -3 dB
LFEMix default value                -12 dB
Maximum value of each coefficient   0 dB
Minimum value of each coefficient   -144 dB
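
The following is a minimal sketch of how an encoding application might enforce these limits before it calculates any signal levels. It uses standard C++ only. The FoldDownCoefficients structure and the IsValidMixLevel, AreValidCoefficients, and ClampMixLevel functions are hypothetical names chosen for this illustration; they are not part of the Windows Media Format SDK.

```cpp
#include <algorithm>
#include <cassert>

// Limits from the table above, in dB.
const long kMinMixLevel = -144;
const long kMaxMixLevel = 0;

// Hypothetical container for the three author-supplied coefficients,
// initialized to the documented default values.
struct FoldDownCoefficients
{
    long surroundMix = -3;
    long centerMix   = -3;
    long lfeMix      = -12;
};

// Returns true if a single mix level lies within the permitted range.
bool IsValidMixLevel( long mixLevel )
{
    return mixLevel >= kMinMixLevel && mixLevel <= kMaxMixLevel;
}

// Returns true only if all three coefficients are within range.
bool AreValidCoefficients( const FoldDownCoefficients& c )
{
    return IsValidMixLevel( c.surroundMix ) &&
           IsValidMixLevel( c.centerMix ) &&
           IsValidMixLevel( c.lfeMix );
}

// Forces an out-of-range value entered by the author into range.
long ClampMixLevel( long mixLevel )
{
    long clamped = std::min( kMaxMixLevel, std::max( kMinMixLevel, mixLevel ) );
    assert( IsValidMixLevel( clamped ) );
    return clamped;
}
```

Whether to reject an out-of-range entry or silently clamp it is a user interface decision; the sketch offers both options.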

Fold-down Algorithm

To implement author-controlled fold-down, an encoding application must insert the g_wszFold6To2Channels3 metadata attribute into the encoded file. The value of the attribute is a comma-delimited string that contains the numbers the Windows Media Audio Professional codec uses to derive stereo output from multichannel input. This section shows how to derive those numbers from the coefficients specified by the content author.

These equations show the general form of the 5.1-to-2.0 fold-down:

LResult = B * (L + (LinearSurroundMix * LS) + 
                   (LinearCenterMix * C) + 
                   (LinearLFEMix * LFE))
RResult = B * (R + (LinearSurroundMix * RS) + 
                   (LinearCenterMix * C) + 
                   (LinearLFEMix * LFE))

LResult and RResult are the two output channels. L, R, LS, RS, C, and LFE are the six input channels.

Three of the terms in each formula are weighted by linear coefficients that are calculated from the corresponding fold-down coefficient provided by the content author. B is a scaling factor calculated from these linear coefficients. The scaling factor is required so that the overall mix never results in a gain greater than 1.

These equations are used to calculate the linear coefficients:

LinearSurroundMix = 10 ^ (SurroundMix / 20)
LinearCenterMix = 10 ^ (CenterMix / 20)
LinearLFEMix = 10 ^ (LFEMix / 20)

This formula is used to calculate B:

B = 1 / (1 + LinearSurroundMix + LinearCenterMix + LinearLFEMix)

The mixer service provided by the operating system, KMixer, requires log values scaled by 65,536. To provide compatibility with that functionality, we calculate the following values for the codec to use during fold-down:

W = 20 * 65536 * log10(B)
X = 20 * 65536 * log10(LinearSurroundMix * B)
Y = 20 * 65536 * log10(LinearCenterMix * B)
Z = 20 * 65536 * log10(LinearLFEMix * B)
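
As a worked illustration of the formulas above, the following sketch computes B and the four scaled values by using standard C++ only. The FoldDownLevels structure and the ComputeFoldDownLevels function are names invented for this illustration, and truncating the results to long integers is an assumption about how the values are rounded.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type for this illustration.
struct FoldDownLevels
{
    double B;          // linear scaling factor
    long   W, X, Y, Z; // log values scaled by 65,536
};

FoldDownLevels ComputeFoldDownLevels( long surroundMix,
                                      long centerMix,
                                      long lfeMix )
{
    // Each coefficient is limited to the range -144 dB to 0 dB.
    assert( surroundMix <= 0 && surroundMix >= -144 );
    assert( centerMix <= 0 && centerMix >= -144 );
    assert( lfeMix <= 0 && lfeMix >= -144 );

    // Convert the dB coefficients to linear gains.
    const double linearSurround = std::pow( 10.0, surroundMix / 20.0 );
    const double linearCenter   = std::pow( 10.0, centerMix / 20.0 );
    const double linearLFE      = std::pow( 10.0, lfeMix / 20.0 );

    FoldDownLevels r;

    // Scale so that the sum of all gains for one output channel is 1.
    r.B = 1.0 / ( 1.0 + linearSurround + linearCenter + linearLFE );

    // Convert back to dB and scale by 65,536 (truncation is assumed).
    const double scale = 20.0 * 65536.0;
    r.W = (long)( scale * std::log10( r.B ) );
    r.X = (long)( scale * std::log10( linearSurround * r.B ) );
    r.Y = (long)( scale * std::log10( linearCenter * r.B ) );
    r.Z = (long)( scale * std::log10( linearLFE * r.B ) );
    return r;
}
```

With the default coefficients (-3, -3, and -12), B is approximately 0.375, so W corresponds to roughly -8.5 dB. Because log10(LinearSurroundMix * B) equals log10(LinearSurroundMix) + log10(B), X is simply W plus 65,536 times SurroundMix (before truncation), and the same relationship holds for Y and Z.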

The following table shows how these four values relate to the six input channels and the two output channels:

Output    L              R              C   LFE  LS             RS
LResult   W              minusInfinity  Y   Z    X              minusInfinity
RResult   minusInfinity  W              Y   Z    minusInfinity  X

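The string that is the value of the fold-down attribute follows this pattern, which lists the LResult value and then the RResult value for each input channel, in the order L, R, C, LFE, LS, RS:

W,minusInfinity,minusInfinity,W,Y,Y,Z,Z,X,minusInfinity,minusInfinity,X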

The encoding application can use this value for minusInfinity: 

-2147483648 (0x80000000)

The Windows Media Format SDK constant that specifies the fold-down attribute is g_wszFold6To2Channels3. The final section of this article shows sample code that calculates the required values, constructs the comma-delimited string, and inserts the attribute in the encoded file.
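
To make the fold-down table and the minusInfinity value more concrete, the following sketch converts the scaled log values back to linear gains and applies them to one frame of 5.1 samples. It illustrates the arithmetic only; how the codec applies the values internally is not described in this article. The Frame51 and StereoFrame structures and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// The minusInfinity value, -2147483648, written so that it is valid
// whether long is 32 or 64 bits wide.
const long kMinusInfinity = ( -2147483647L - 1 );

// Hypothetical sample containers for this illustration.
struct Frame51     { double L, R, C, LFE, LS, RS; };
struct StereoFrame { double L, R; };

// Converts a value of the form 20 * 65536 * log10(gain) back to a
// linear gain. minusInfinity means the channel does not contribute.
double ScaledLogToLinear( long scaled )
{
    if( scaled == kMinusInfinity )
        return 0.0;
    return std::pow( 10.0, scaled / ( 20.0 * 65536.0 ) );
}

// Applies the fold-down table to one frame of 5.1 samples.
StereoFrame FoldDownFrame( const Frame51& in,
                           long W, long X, long Y, long Z )
{
    // All four values are logs of gains no greater than 1.
    assert( W <= 0 && X <= 0 && Y <= 0 && Z <= 0 );

    const double gW = ScaledLogToLinear( W );
    const double gX = ScaledLogToLinear( X );
    const double gY = ScaledLogToLinear( Y );
    const double gZ = ScaledLogToLinear( Z );

    StereoFrame out;
    // LResult row: the R and RS entries are minusInfinity (gain 0).
    out.L = gW * in.L + gY * in.C + gZ * in.LFE + gX * in.LS;
    // RResult row: the L and LS entries are minusInfinity (gain 0).
    out.R = gW * in.R + gY * in.C + gZ * in.LFE + gX * in.RS;
    return out;
}
```

Because B normalizes the sum of the four gains in each row to 1, a frame in which every input channel is at full scale produces output channels that are also approximately at full scale rather than clipping.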

Dynamic Range Control

Some audio streams contain a wide range of audio volume, and users may want to limit this range. This can be true of a variety of audio sources, from classical music to movie soundtracks.

The difference between the peak and average volume levels is called the dynamic range. The Windows Media Audio Professional codec supports authoring peak and average volume levels that will control the dynamic range available during playback, if the user enables dynamic range control.

Note   The Windows Media Audio Lossless codec also supports author-controlled dynamic range control.

After encoding audio content, the Windows Media Audio Professional codec calculates the peak and average volume levels of the encoded content. It inserts those values twice in the metadata of the file: as read-only reference values and as read-write target values. The reference values preserve the original values of the content. All four values are used during decoding if the user enables dynamic range control. Because the target values can be changed, an encoding application can enable the content author to specify new values that will affect playback when the user enables dynamic range control.

An encoding application should display the reference values to enable the content author to select appropriate target values. It should enforce a maximum value of 0 dB and a minimum value of -90 dB for both the peak and average target values.

The Windows Media Format SDK provides the following constants for the four attributes:

  • g_wszWMWMADRCPeakReference
  • g_wszWMWMADRCAverageReference
  • g_wszWMWMADRCPeakTarget
  • g_wszWMWMADRCAverageTarget

An encoding application should make clear to the content author that setting the average target value is not recommended. Adjusting the average value does not affect the difference between loud and soft sounds. Instead, it cuts or boosts the overall average volume, which may cause undesirable distortion during playback.

A playback application enables dynamic range control by using the output setting g_wszDynamicRangeControl. The application uses the IWMReaderAdvanced2::SetOutputSetting method of the Reader object to configure the setting. A value of zero (the default) specifies no alteration of the dynamic range. A value of 1 specifies a moderate level of dynamic range compression. A value of 2 specifies a high level of dynamic range compression.

Note   In Windows Media Player 9 Series and later, the user interface choices for Quiet Mode determine dynamic range control. The settings Off, Medium difference, and Little difference correspond to the g_wszDynamicRangeControl values 0, 1, and 2.

The following table explains the effect of the g_wszDynamicRangeControl settings during playback.

Setting  Target Values                             Range of Delivered Audio
0        Any target values.                        Same range as the original content.
1        Target values equal to reference values.  Average level is maintained and peaks are confined to the average +12 dB.
2        Target values equal to reference values.  Average level is maintained and peaks are confined to the average +6 dB.
1        Target values specified.                  Average level set to the target average value and peaks confined to the target peak value.
2        Target values specified.                  Average level set to the target average value and peaks confined to the mean of the target average and target peak values.
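
The table can be read as a small decision procedure. The following sketch encodes it in standard C++, assuming that all levels are expressed in dB as floating-point numbers. The DrcLevels structure and the DeliveredRange function are hypothetical names; the sketch models only the ranges listed in the table, not the compression that the codec actually performs.

```cpp
#include <cassert>

// Levels in dB. In a result, peak is the ceiling to which peaks
// are confined. Names are illustrative only.
struct DrcLevels
{
    double average;
    double peak;
};

// setting is the g_wszDynamicRangeControl value: 0, 1, or 2.
DrcLevels DeliveredRange( int setting,
                          const DrcLevels& reference,
                          const DrcLevels& target )
{
    assert( setting >= 0 && setting <= 2 );

    // Setting 0: the target values are ignored.
    if( setting == 0 )
        return reference;

    // Targets count as "specified" when they differ from the
    // reference values.
    const bool targetsSpecified =
        ( target.average != reference.average ) ||
        ( target.peak != reference.peak );

    DrcLevels out;
    if( !targetsSpecified )
    {
        // Average maintained; peaks confined to average +12 or +6 dB.
        out.average = reference.average;
        out.peak = reference.average + ( setting == 1 ? 12.0 : 6.0 );
    }
    else
    {
        // Average moves to the target; setting 2 halves the headroom.
        out.average = target.average;
        out.peak = ( setting == 1 )
                 ? target.peak
                 : ( target.average + target.peak ) / 2.0;
    }
    return out;
}
```

For example, with reference values of -20 dB average and -2 dB peak and unchanged targets, setting 1 confines peaks to -8 dB and setting 2 confines them to -14 dB.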

If you are using the Windows Media Format SDK to build an encoding application, you should provide a user interface that allows content authors to specify the target values. The application should then modify the g_wszWMWMADRCPeakTarget and g_wszWMWMADRCAverageTarget metadata attributes as necessary. You should also provide a user interface that enables users to specify the degree of dynamic range control to be applied.

Code Examples

The following code examples demonstrate how to write and read multichannel audio by using the Windows Media Format 9 Series SDK or later.

Writing Multichannel Output

The following example code shows how to write fold-down data into a multichannel ASF file:

#include <windows.h>
#include <math.h>   // pow, log10
#include <stdio.h>  // _snwprintf
#include <wmsdk.h>  // IWMHeaderInfo, g_wszFold6To2Channels3

//Some useful defines, although not all are used in this code example
#define  DEFAULT_SURROUND_MIX   -3
#define  DEFAULT_CENTER_MIX     -3
#define  DEFAULT_LFE_MIX        -12

#define     MINUS_INFINITY      0x80000000 // -2147483648
#define     PLUS_INFINITY       0x7FFFFFFF // 2147483647
#define     KMIXER_LOG_CONSTANT1    20.0
#define     KMIXER_LOG_CONSTANT2    65536.0
#define     OUT_FORMAT L"%ld,%ld,%ld,%ld,%ld,%ld,%ld,%ld,%ld,%ld,%ld,%ld"

// Name: SixToTwoFoldDown
// Desc: This is the essential fold-down algorithm. It writes the
//       comma-delimited fold-down string into pwszTable, which must
//       have room for dwLen wide characters.
HRESULT SixToTwoFoldDown(  long lSurroundMix,
                           long lCenterMix,
                           long lLFEMix,
                           LPWSTR pwszTable,
                           DWORD dwLen )
{
    if( NULL == pwszTable )
        return( E_POINTER );
    if( 0 == dwLen )
        return( E_INVALIDARG );

    double  dLinearSurroundMix, dLinearCenterMix, dLinearLFEMix;
    long    W, X, Y, Z;
    double  dbB;

    // Calculate the linear coefficients from the dB values
    dLinearSurroundMix = pow( 10.0, ( (double) lSurroundMix ) / 
                              KMIXER_LOG_CONSTANT1 );
    dLinearCenterMix = pow( 10.0, ( (double) lCenterMix ) / 
                            KMIXER_LOG_CONSTANT1 );
    dLinearLFEMix = pow( 10.0, ( (double) lLFEMix ) / 
                         KMIXER_LOG_CONSTANT1 );

    // Scaling factor that keeps the overall gain at or below 1
    dbB = 1.0 / ( 1.0 + dLinearSurroundMix + dLinearCenterMix + 
                  dLinearLFEMix );

    // KMixer requires log values scaled by 65,536
    W = (long) ( KMIXER_LOG_CONSTANT1 * KMIXER_LOG_CONSTANT2 * 
                 log10( dbB ) );
    X = (long) ( KMIXER_LOG_CONSTANT1 * KMIXER_LOG_CONSTANT2 * 
                 log10( dLinearSurroundMix * dbB ) );
    Y = (long) ( KMIXER_LOG_CONSTANT1 * KMIXER_LOG_CONSTANT2 * 
                 log10( dLinearCenterMix * dbB ) );
    Z = (long) ( KMIXER_LOG_CONSTANT1 * KMIXER_LOG_CONSTANT2 * 
                 log10( dLinearLFEMix * dbB ) );

    // Print out into a string.
    _snwprintf( pwszTable, dwLen-1, 
                OUT_FORMAT,
                W, (long) MINUS_INFINITY,    // L values
                (long) MINUS_INFINITY, W,    // R values
                Y, Y,                        // C values
                Z, Z,                        // LFE values
                X, (long) MINUS_INFINITY,    // LS values
                (long) MINUS_INFINITY, X );  // RS values

    // _snwprintf does not terminate the string if it is truncated
    *(pwszTable + dwLen - 1 ) = L'\0';

    return S_OK;
}

// Name: SetFoldDownProperty 
// Desc: Sets the fold-down property string on the specified ASF stream
HRESULT SetFoldDownProperty ( long lSurroundMix, long lCenterMix, long lLFEMix, 
                              WORD wStreamNum, IWMHeaderInfo* pHeader )
{
    if( NULL == pHeader )
        return( E_POINTER );

    HRESULT hr;
    // Wide character buffer to hold the property strings
    WCHAR   wszTable[MAX_PATH];
    memset( wszTable, 0, sizeof(WCHAR) * MAX_PATH );
    // Variables used in SetAttribute
    WORD cbLen = 0;
    BYTE* pData = NULL;

    // Check for valid input
    if ( ( lSurroundMix <= 0 && lSurroundMix >= -144) && 
        ( lCenterMix <= 0 && lCenterMix >= -144) &&
        ( lLFEMix <= 0 && lLFEMix >= -144))
    {
        // Calculate the fold-down values and write them as a string
        // into the buffer provided. 
        // In this example, we only do 5.1 to 2 fold-down.
        hr = SixToTwoFoldDown( lSurroundMix, lCenterMix, lLFEMix, 
                                wszTable, MAX_PATH );
        if( SUCCEEDED( hr ) )
        {
            // Length in bytes, including the terminating null character
            cbLen = (WORD) ( sizeof(WCHAR) * ( wcslen( wszTable ) + 1 ) );
            pData = (BYTE*) wszTable;

            // Now set the attribute for the stream in the ASF file
            hr = pHeader->SetAttribute( wStreamNum, //zero-based
                                        g_wszFold6To2Channels3,
                                        WMT_TYPE_STRING,
                                        pData,
                                        cbLen );
        }
    }
    else
    {
        hr = E_INVALIDARG;
    }

    return hr;
}

Reading Multichannel Audio

To read or play back multichannel audio by using the Reader object, perform these three steps, in this order:

  1. Call IWMReaderAdvanced2::SetOutputSetting twice, first to set the g_wszEnableDiscreteOutput setting to TRUE, and then to set the g_wszSpeakerConfig setting to DSSPEAKER_5POINT1 (defined in dsound.h in the DirectX SDK).
  2. Obtain the supported multichannel media type from the Reader object.
  3. Set the multichannel media type on the output.

These steps are demonstrated in the code examples that follow. The audioplayer sample application in the Windows Media Format SDK uses the waveOut APIs to render audio that has been parsed and decoded by a Reader object. To add multichannel audio support to audioplayer, perform the following steps:

  1. Add the following lines to the top of audioplay.cpp.
    #include "atlbase.h" //for CComPtr and CComQIPtr
    #include <dsound.h> //For the DSSPEAKER_5POINT1 value
    #include <mmreg.h> //For WAVEFORMATEXTENSIBLE (if you use it)
    #include <new> //For std::nothrow
  2. In audioplay.cpp, insert the following lines in the Open function after the call to RetrieveAndDisplayAttributes.
    //Set up multichannel playback
    // Note: the break statements outside the for loop assume that
    // this code is placed inside an enclosing loop-style
    // error-handling block in the Open function.
    BOOL fEnableDiscreteOutput = TRUE;
    DWORD dwSpeakerConfig = DSSPEAKER_5POINT1;
    CComQIPtr<IWMReaderAdvanced2, &IID_IWMReaderAdvanced2>
        pReaderAdvanced2( m_pReader );
    if(! pReaderAdvanced2)
    {
        hr = E_NOINTERFACE;
        break;
    }
    //Make the required settings on the Reader Object
    hr = pReaderAdvanced2->SetOutputSetting(0,
        g_wszEnableDiscreteOutput,
        WMT_TYPE_BOOL,
        (BYTE *)&fEnableDiscreteOutput,
        sizeof( BOOL ) );
    if(FAILED(hr)) break;
    hr = pReaderAdvanced2->SetOutputSetting(0,
        g_wszSpeakerConfig,
        WMT_TYPE_DWORD,
        (BYTE *)&dwSpeakerConfig,
        sizeof( DWORD ) );
    if(FAILED(hr)) break;
    // Dynamic range control can also be set using SetOutputSetting, 
    // although we don't show that here. 
    // Get the various formats supported by the audio output. 
    // In this example, to keep things simple, we only handle 
    // audio-only files with a single stream. In other words, we
    // assume that there is one audio output and that its number is zero.
    DWORD dwAudioOutput = 0;
    DWORD formats = 0;
    hr = m_pReader->GetOutputFormatCount(dwAudioOutput, &formats);
    if(FAILED(hr)) break;
    // Multichannel formats, if available, are returned first
    for(DWORD j = 0; j < formats; j++)
    {
        CComPtr<IWMOutputMediaProps> pProps;
        hr = m_pReader->GetOutputFormat(dwAudioOutput, j, &pProps);
        if(FAILED(hr)) break;
        WM_MEDIA_TYPE* pNativeType = NULL;
        DWORD cbFormat = 0;
        // The first call retrieves the required buffer size
        hr = pProps->GetMediaType( NULL, &cbFormat );
        if(FAILED(hr)) break;     
        pNativeType = (WM_MEDIA_TYPE *)new (std::nothrow) BYTE[ cbFormat ];
        if( NULL == pNativeType )
        {
            printf( "Not enough core\n" );
            hr = E_OUTOFMEMORY;
            break;
        }
        hr = pProps->GetMediaType( pNativeType, &cbFormat );
        if( hr != S_OK )
        {
            printf( "Failed getting the media type\n");
            delete [] (BYTE *)pNativeType;
            return hr;
        }
        //  This works for WAVEFORMATEXTENSIBLE formats as long
        //  as we are only looking at the WAVEFORMATEX members.
        WAVEFORMATEX* pWFX = (WAVEFORMATEX*) pNativeType->pbFormat;
        if(pWFX->nChannels == 6)
        {
            // We have found a six-channel output supported for
            // this file. If we were going to examine all the contents
            // of the structure, we would need to cast 
            // pNativeType->pbFormat to WAVEFORMATEXTENSIBLE. In this
            // example, we just set the first multichannel type
            // we find. The format block has been correctly 
            // allocated, so we just pass it to SetMediaType.
            hr = pProps->SetMediaType(pNativeType);
            if( hr != S_OK )
            {
                printf( "Failed to set the media type\n");
                delete [] (BYTE *)pNativeType;
                return hr;
            }
            hr = m_pReader->SetOutputProps(dwAudioOutput, pProps);
            if( hr != S_OK )
            {
                printf( "Failed to set output props\n");
                delete [] (BYTE *)pNativeType;
                return hr;
            }
            // The multichannel type is set, so stop searching
            delete [] (BYTE *)pNativeType;
            break;
        } // end if
        delete [] (BYTE *)pNativeType;
    } // end for
  3. Move the GetAudioOutput method call down to the bottom of the Open method, just before the "return hr" line. This method sets the WAVEFORMATEX structure that will be used in the waveOutOpen call later in the application.

Note   When the format type set by the reader object is WAVEFORMATEXTENSIBLE, then wFormatTag is set to WAVE_FORMAT_EXTENSIBLE and cbSize is set to 22. Although the pointer used by the waveOutOpen function is WAVEFORMATEX, the function detects those two values and correctly handles the extra data that is present in the WAVEFORMATEXTENSIBLE structure.
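
The value 22 is the number of extra bytes that the WAVEFORMATEXTENSIBLE structure appends to WAVEFORMATEX. The following sketch shows where that figure comes from and how a renderer could test for the extensible layout. It uses simplified mirror structures with hypothetical names so that it compiles without the Windows headers; a real application would use the definitions in mmreg.h instead.

```cpp
#include <cassert>
#include <cstdint>

#pragma pack(push, 1)
// Mirror of WAVEFORMATEX: 18 bytes when packed.
struct MirrorWaveFormatEx
{
    uint16_t wFormatTag;
    uint16_t nChannels;
    uint32_t nSamplesPerSec;
    uint32_t nAvgBytesPerSec;
    uint16_t nBlockAlign;
    uint16_t wBitsPerSample;
    uint16_t cbSize;          // size of the data that follows
};

// Mirror of GUID: 16 bytes.
struct MirrorGuid
{
    uint32_t Data1;
    uint16_t Data2;
    uint16_t Data3;
    uint8_t  Data4[8];
};

// Mirror of WAVEFORMATEXTENSIBLE: 40 bytes when packed.
struct MirrorWaveFormatExtensible
{
    MirrorWaveFormatEx Format;
    uint16_t   wValidBitsPerSample; // a union in the real structure
    uint32_t   dwChannelMask;       // which speakers are present
    MirrorGuid SubFormat;           // the actual sample format
};
#pragma pack(pop)

// Numeric value of WAVE_FORMAT_EXTENSIBLE.
const uint16_t kWaveFormatExtensible = 0xFFFE;

// The extension adds 2 + 4 + 16 = 22 bytes.
static_assert( sizeof( MirrorWaveFormatExtensible ) -
               sizeof( MirrorWaveFormatEx ) == 22,
               "extensible part should be 22 bytes" );

// Returns true if the extra WAVEFORMATEXTENSIBLE members follow
// the WAVEFORMATEX header.
bool IsExtensibleFormat( const MirrorWaveFormatEx* pWfx )
{
    assert( pWfx != nullptr );
    return pWfx->wFormatTag == kWaveFormatExtensible &&
           pWfx->cbSize >= 22;
}
```

A check of this kind is what allows code that receives only a WAVEFORMATEX pointer to decide whether it is safe to read the channel mask and subformat.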

For More Information

To learn more about how Windows Media Player handles dynamic range information, see the article "Dynamic Range Control in Windows Media 9 Series" on the Microsoft Web site.

For more information about multichannel audio on Windows XP, see the article "Multiple Channel Audio Data and WAVE Files" on the Microsoft Web site.