January 2013

Volume 28 Number 01

DirectX Factor - Windows 8 Sound Generation with XAudio2

By Charles Petzold | January 2013

A Windows Store app for Windows 8 can play MP3 or WMA sound files easily using MediaElement—you simply give it a URI or a stream to the sound file. Windows Store apps can also access the Play To API for streaming video or audio to external devices.

But what if you need more sophisticated audio processing? Perhaps you’d like to modify the contents of an audio file on its way to the hardware, or generate sounds dynamically.

A Windows Store app can also perform these jobs through DirectX. Windows 8 supports two DirectX components for sound generation and processing—Core Audio and XAudio2. In the grand scheme of things, both of these are rather low-level interfaces, but Core Audio is lower than XAudio2 and is geared more toward applications that require a closer connection with audio hardware.

XAudio2 is the successor to DirectSound and the Xbox XAudio library, and it’s probably your best bet for Windows Store apps that need to do interesting things with sound. While XAudio2 is intended primarily for games, that shouldn’t stop us from using it for more serious purposes—such as making music or entertaining the business user with funny sounds.

XAudio2 version 2.8 is a built-in component of Windows 8. Like the rest of DirectX, the XAudio2 programming interface is based on COM. While it’s theoretically possible to access XAudio2 from any programming language supported by Windows 8, the most natural and easiest language for XAudio2 is C++. Working with sound often requires high-performance code, so C++ is a good choice in that respect as well.

A First XAudio2 Program

Let’s begin writing a program that uses XAudio2 to play a simple 5-second sound in response to the push of a button. Because you might be new to Windows 8 and DirectX programming, I’ll take it a little slow.

I’ll assume you have a version of Visual Studio installed that’s suitable for creating Windows Store apps. In the New Project dialog box, select Visual C++ and Windows Store at the left, and Blank App (XAML) in the list of available templates. I gave my project the name SimpleAudio, and you can find that project among the downloadable code for this article.

In building an executable that uses XAudio2, you’ll need to link the program with the xaudio2.lib import library. Bring up the project Properties dialog box by selecting the last item on the Project menu, or by right-clicking the project name in the Solution Explorer and selecting Properties. In the left column, select Configuration Properties, then Linker, then Input. Click Additional Dependencies (the top item), click the little arrow that appears, select Edit and type xaudio2.lib into the box.

You’ll also want a reference to the xaudio2.h header file, so add the following statement to the precompiled headers list in pch.h:

#include <xaudio2.h>
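Alternatively, MSVC can be told about the import library directly in source code, making the linker dialog step unnecessary. A one-line equivalent using the standard #pragma comment directive (it can go in pch.h right after the include):

#pragma comment(lib, "xaudio2.lib")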

In the MainPage.xaml file, I added a TextBlock for displaying any errors that the program might encounter working with the XAudio2 API, and a Button for playing a sound. These are shown in Figure 1.

Figure 1 The MainPage.xaml File for SimpleAudio

<Page x:Class="SimpleAudio.MainPage" ... >
  <Grid Background="{StaticResource ApplicationPageBackgroundThemeBrush}">
    <TextBlock Name="errorText"
               FontSize="24"
               TextWrapping="Wrap"
               HorizontalAlignment="Center"
               VerticalAlignment="Center" />
    <Button Name="submitButton"
            Content="Submit Audio Button"
            Visibility="Collapsed"
            HorizontalAlignment="Center"
            VerticalAlignment="Center"
            Click="OnSubmitButtonClick" />
  </Grid>
</Page>

The bulk of the MainPage.xaml.h header file is shown in Figure 2. I’ve removed the declaration of the OnNavigatedTo method because I won’t be using it. The Click handler for the Button is declared, as are four fields connected with the program’s use of XAudio2.

Figure 2 The MainPage.xaml.h Header File for SimpleAudio

namespace SimpleAudio
{
  public ref class MainPage sealed
  {
  private:
    Microsoft::WRL::ComPtr<IXAudio2> pXAudio2;
    IXAudio2MasteringVoice * pMasteringVoice;
    IXAudio2SourceVoice * pSourceVoice;
    byte soundData[2 * 5 * 44100];
  public:
    MainPage();
  private:
    void OnSubmitButtonClick(Platform::Object^ sender,
      Windows::UI::Xaml::RoutedEventArgs^ args);
  };
}

Any program that wishes to use XAudio2 must create an object that implements the IXAudio2 interface. (You’ll see how this is done shortly.) IXAudio2 derives from the famous IUnknown interface in COM, and it’s inherently reference-counted, which means that it deletes its own resources when it’s no longer referenced by a program. The ComPtr (COM Pointer) class in the Microsoft::WRL namespace turns a pointer to a COM object into a “smart pointer” that keeps track of its own references. This is the recommended approach for working with COM objects in a Windows Store app.

Any non-trivial XAudio2 program also needs pointers to objects that implement the IXAudio2MasteringVoice and IXAudio2SourceVoice interfaces. In XAudio2 parlance, a “voice” is an object that generates or modifies audio data. The mastering voice is conceptually a sound mixer that assembles all the individual voices and prepares them for the sound-generation hardware. You’ll have only one of these, but you might have a number of source voices that generate separate sounds. (There are also ways to apply filters or effects to source voices.)

The IXAudio2MasteringVoice and IXAudio2SourceVoice pointers are not reference-counted; their lifetimes are governed by the IXAudio2 object.

I’ve also included a large array for 5 seconds’ worth of sound data:

byte soundData[2 * 5 * 44100];

In a real program, you’ll want to allocate an array of this size at run time—and get rid of it when you don’t need it—but you’ll see shortly why I did it this way.

How did I calculate that array size? Although XAudio2 supports compressed audio, most programs that generate sound will stick with the format known as pulse-code modulation, or PCM. Sound waveforms in PCM are represented by values of a fixed size at a fixed sampling rate. For music on compact discs, the sampling rate is 44,100 times per second, with 2 bytes per sample in each of two channels, for a total of 176,400 bytes of data for 1 second of audio. (When embedding sounds in an application, compression is recommended. XAudio2 supports ADPCM; WMA and MP3 are also supported in the Media Foundation engine.)

For this program, I’ve also chosen to use a sampling rate of 44,100 with 2 bytes per sample. In C++, each sample is therefore a short. I’ll stick with monaural sound for now, so 88,200 bytes are required per second of audio. In the array allocation, that’s multiplied by 5 for 5 seconds.
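Spelled out as code (the constant names here are mine, purely for illustration), the arithmetic behind the array size is:

const int sampleRate = 44100;      // samples per second
const int bytesPerSample = 2;      // 16-bit PCM
const int channels = 1;            // monaural for now
const int seconds = 5;
const int bufferSize = seconds * channels * bytesPerSample * sampleRate;  // 441,000 bytes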

Creating the Objects

Much of the MainPage.xaml.cpp file is shown in Figure 3. All of the XAudio2 initialization is performed in the MainPage constructor. It begins with a call to XAudio2Create to obtain a pointer to an object that implements the IXAudio2 interface. This is the first step in using XAudio2. Unlike some COM interfaces, no call to CoCreateInstance is required.

Figure 3 MainPage.xaml.cpp

MainPage::MainPage()
{
  InitializeComponent();
  // Create an IXAudio2 object
  HRESULT hr = XAudio2Create(&pXAudio2);
  if (FAILED(hr))
  {
    errorText->Text = "XAudio2Create failure: " + hr.ToString();
    return;
  }
  // Create a mastering voice
  hr = pXAudio2->CreateMasteringVoice(&pMasteringVoice);
  if (FAILED(hr))
  {
    errorText->Text = "CreateMasteringVoice failure: " + hr.ToString();
    return;
  }
  // Create a source voice
  WAVEFORMATEX waveformat;
  waveformat.wFormatTag = WAVE_FORMAT_PCM;
  waveformat.nChannels = 1;
  waveformat.nSamplesPerSec = 44100;
  waveformat.nAvgBytesPerSec = 44100 * 2;
  waveformat.nBlockAlign = 2;
  waveformat.wBitsPerSample = 16;
  waveformat.cbSize = 0;
  hr = pXAudio2->CreateSourceVoice(&pSourceVoice, &waveformat);
  if (FAILED(hr))
  {
    errorText->Text = "CreateSourceVoice failure: " + hr.ToString();
    return;
  }
  // Start the source voice
  hr = pSourceVoice->Start();
  if (FAILED(hr))
  {
    errorText->Text = "Start failure: " + hr.ToString();
    return;
  }
  // Fill the array with sound data
  for (int index = 0, second = 0; second < 5; second++)
  {
    for (int cycle = 0; cycle < 441; cycle++)
    {
      for (int sample = 0; sample < 100; sample++)
      {
        short value = sample < 50 ? 32767 : -32768;
        soundData[index++] = value & 0xFF;
        soundData[index++] = (value >> 8) & 0xFF;
      }
    }
  }
  // Make the button visible
  submitButton->Visibility = Windows::UI::Xaml::Visibility::Visible;
}
void MainPage::OnSubmitButtonClick(Object^ sender, RoutedEventArgs^ args)
{
  // Create a buffer to reference the byte array
  XAUDIO2_BUFFER buffer = { 0 };
  buffer.AudioBytes = 2 * 5 * 44100;
  buffer.pAudioData = soundData;
  buffer.Flags = XAUDIO2_END_OF_STREAM;
  buffer.PlayBegin = 0;
  buffer.PlayLength = 5 * 44100;
  // Submit the buffer
  HRESULT hr = pSourceVoice->SubmitSourceBuffer(&buffer);
  if (FAILED(hr))
  {
    errorText->Text = "SubmitSourceBuffer failure: " + hr.ToString();
    submitButton->Visibility = Windows::UI::Xaml::Visibility::Collapsed;
    return;
  }
}

Once the IXAudio2 object is created, the CreateMasteringVoice and CreateSourceVoice methods obtain pointers to the other two interfaces defined as fields in the header file.

The CreateSourceVoice call requires a WAVEFORMATEX structure, which will be familiar to anyone who has worked with audio in the Win32 API. This structure defines the nature of the audio data you’ll be using for this particular voice. (Different source voices can use different formats.) For PCM, only three numbers are really relevant: the sampling rate (44,100 in this example), the size of each sample (2 bytes, or 16 bits) and the number of channels (1 here). The other fields derive from these: the nBlockAlign field is nChannels times wBitsPerSample divided by 8, and the nAvgBytesPerSec field is the product of nSamplesPerSec and nBlockAlign. But in this example I’ve shown all the fields with explicit values.
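If you fill in WAVEFORMATEX structures often, a small helper of your own can derive the dependent fields from the three that matter. This is just a sketch (the function is mine, not part of the XAudio2 API):

WAVEFORMATEX MakePcmFormat(WORD channels, DWORD samplesPerSec, WORD bitsPerSample)
{
  WAVEFORMATEX format = { 0 };
  format.wFormatTag = WAVE_FORMAT_PCM;
  format.nChannels = channels;
  format.nSamplesPerSec = samplesPerSec;
  format.wBitsPerSample = bitsPerSample;
  format.nBlockAlign = channels * bitsPerSample / 8;            // bytes per multichannel sample
  format.nAvgBytesPerSec = samplesPerSec * format.nBlockAlign;  // bytes per second
  format.cbSize = 0;                                            // no extra format data for PCM
  return format;
}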

Once the IXAudio2SourceVoice object is obtained, the Start method can be called on it. At this point, XAudio2 is conceptually playing, but we haven’t actually given it any audio data to play. A Stop method is also available, and a real program would use these two methods to control when sounds should and shouldn’t be playing.

These four calls are not quite as simple as they appear in this code! They all have additional arguments, but convenient defaults are defined, and I’ve simply chosen to accept those defaults for now.
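For reference, here are the same four calls with their defaulted arguments spelled out. (I’m paraphrasing the declarations in xaudio2.h; consult the header for the authoritative signatures.)

hr = XAudio2Create(&pXAudio2, 0, XAUDIO2_DEFAULT_PROCESSOR);
hr = pXAudio2->CreateMasteringVoice(&pMasteringVoice,
  XAUDIO2_DEFAULT_CHANNELS, XAUDIO2_DEFAULT_SAMPLERATE);
hr = pXAudio2->CreateSourceVoice(&pSourceVoice, &waveformat,
  0, XAUDIO2_DEFAULT_FREQ_RATIO);
hr = pSourceVoice->Start(0, XAUDIO2_COMMIT_NOW);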

Virtually all function and method calls in DirectX return HRESULT values to indicate success or failure. There are different strategies for dealing with these errors. I’ve chosen simply to display the error code using the TextBlock defined in the XAML file, and stop further processing.

PCM Audio Data

The constructor concludes by filling up the soundData array with audio data, but the array isn’t actually forked over to the IXAudio2SourceVoice until the button is pressed.

Sound is vibration, and humans are sensitive to vibrations in the air roughly in the range of 20 Hz (or cycles per second) to 20,000 Hz. Middle C on the piano is approximately 261.6 Hz.

Suppose you’re working with a sampling rate of 44,100 Hz and 16-bit samples and you wish to generate audio data for a waveform at a frequency of 4,410 Hz, which is just beyond the highest key on a piano. Each cycle of such a waveform requires 10 samples of signed 16-bit values. These 10 values would be repeated 4,410 times for each second of sound.

A waveform at a frequency of 441 Hz—very close to 440 Hz, corresponding to the A above middle C used as a tuning standard—is rendered with 100 samples. This cycle would be repeated 441 times for each second of sound.

Because PCM involves a constant sampling rate, low-frequency sounds seem to be sampled and rendered at a much higher resolution than high-frequency sounds. Isn’t this a problem? Doesn’t a 4,410 Hz waveform rendered with just 10 samples have a considerable amount of distortion compared with the 441 Hz waveform?

It turns out that any distortion introduced by discrete sampling occurs at frequencies greater than half the sampling rate. (This boundary is known as the Nyquist frequency, after Bell Labs engineer Harry Nyquist.) One reason a sampling frequency of 44,100 Hz was chosen for CD audio is that the Nyquist frequency is then 22,050 Hz, and human hearing maxes out at about 20,000 Hz. In other words, at a sampling rate of 44,100 Hz, the sampling distortion is inaudible to humans.

The SimpleAudio program generates an algorithmically simple waveform—a square wave at a frequency of 441 Hz. There are 100 samples per cycle. In each cycle the first 50 are maximum positive values (32,767 when dealing with short integers) and the next 50 are maximum negative values (-32,768). Notice that these short values must be stored in the byte array with the low byte first:

soundData[index + 0] = value & 0xFF;
soundData[index + 1] = (value >> 8) & 0xFF;

So far, nothing has actually played. This happens in the Click handler for the Button. The XAUDIO2_BUFFER structure is used to reference the byte array with a count of the bytes and a duration specified as the number of samples. This buffer is passed to the SubmitSourceBuffer method of the IXAudio2SourceVoice object. If the Start method has already been called (as it has in this example) then the sound begins playing immediately.

I suspect I don’t have to mention that the sound plays asynchronously. The SubmitSourceBuffer call returns immediately while a separate thread is devoted to the actual process of shoveling data to the sound hardware. The XAUDIO2_BUFFER passed to SubmitSourceBuffer can be discarded after the call—as it is in this program when the Click handler is exited and the local variable goes out of scope—but the actual array of bytes must remain in accessible memory. Indeed, your program can manipulate these bytes as the sound is playing. However, there are much better techniques (involving callback methods) that let your program dynamically generate sound data.

Without using a callback to determine when the sound has completed, this program needs to retain the soundData array for the program’s duration.
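That notification works by implementing the IXAudio2VoiceCallback interface and passing a pointer to your implementation as an additional argument to CreateSourceVoice. Here’s a minimal sketch; every method must be overridden, even those you don’t need, and the __stdcall matches the STDMETHODCALLTYPE used in the header:

class SoundCallback : public IXAudio2VoiceCallback
{
public:
  // Called on XAudio2's audio-processing thread when a submitted buffer finishes
  void __stdcall OnBufferEnd(void * pBufferContext) { /* safe to release the buffer here */ }
  // The remaining methods are required by the interface but can be left empty
  void __stdcall OnBufferStart(void * pBufferContext) { }
  void __stdcall OnLoopEnd(void * pBufferContext) { }
  void __stdcall OnStreamEnd() { }
  void __stdcall OnVoiceError(void * pBufferContext, HRESULT error) { }
  void __stdcall OnVoiceProcessingPassStart(UINT32 bytesRequired) { }
  void __stdcall OnVoiceProcessingPassEnd() { }
};

Keep in mind that these calls arrive on XAudio2’s audio-processing thread, not the UI thread.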

You can press the button multiple times, and each call effectively queues up another buffer to be played when the previous buffer finishes. If the program moves to the background, the sound is muted, but playback continues silently. In other words, if you click the button and move the program to the background for at least 5 seconds, nothing will be playing when the program returns to the foreground.

The Characteristics of Sound

Much of the sound we hear in daily life comes simultaneously from a variety of different sources and, hence, is quite complex. However, in some cases—and particularly when dealing with musical sounds—individual tones can be defined with just a few characteristics:

  • Amplitude, which is interpreted by our senses as volume.
  • Frequency, which is interpreted as pitch.
  • Space, which can be mimicked in audio playback with multiple speakers.
  • Timbre, which is related to the mix of overtones in a sound and represents the perceived difference between a trumpet and a piano, for example.

The SoundCharacteristics project demonstrates these four characteristics in isolation. It keeps the 44,100 Hz sampling rate and 16-bit samples of the SimpleAudio project but generates sound in stereo. For two channels of sound, the data must be interleaved: a 16-bit sample for the left channel, followed by a 16-bit sample for the right channel.

The MainPage.xaml.h header file for SoundCharacteristics defines some constants:

static const int sampleRate = 44100;
static const int seconds = 5;
static const int channels = 2;
static const int samples = seconds * sampleRate;

It also defines four arrays for sound data, but these are of type short rather than byte:

short volumeSoundData[samples * channels];
short pitchSoundData[samples * channels];
short spaceSoundData[samples * channels];
short timbreSoundData[samples * channels];

Using short arrays makes the initialization easier because the 16-bit waveform values don’t need to be split into separate bytes. A simple cast allows the array to be referenced by the XAUDIO2_BUFFER when submitting the sound data. These arrays have twice as many bytes as the array in SimpleAudio because I’m using stereo in this program.
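For example, submitting volumeSoundData might look like this (a sketch that follows the same pattern as SimpleAudio):

XAUDIO2_BUFFER buffer = { 0 };
buffer.AudioBytes = sizeof(volumeSoundData);
buffer.pAudioData = reinterpret_cast<byte *>(volumeSoundData);
buffer.Flags = XAUDIO2_END_OF_STREAM;
buffer.PlayBegin = 0;
buffer.PlayLength = samples;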

All four of these arrays are initialized in the MainPage constructor. For the volume demonstration, a 441 Hz square wave is still involved, but it starts at zero volume, gets progressively louder over the first 2 seconds, and then declines in volume over the last 2 seconds. Figure 4 shows the code to initialize volumeSoundData.

Figure 4 Sound Data That Changes in Volume for SoundCharacteristics

for (int index = 0, sample = 0; sample < samples; sample++)
{
  double t = 1;
  if (sample < 2 * samples / 5)
    t = sample / (2.0 * samples / 5);
  else if (sample > 3 * samples / 5)
    t = (samples - sample) / (2.0 * samples / 5);
  double amplitude = pow(2, 15 * t) - 1;
  short waveform = sample % 100 < 50 ? 1 : -1;
  short value = short(amplitude * waveform);
  volumeSoundData[index++] = value;
  volumeSoundData[index++] = value;
}

Human perception of volume is logarithmic: Each doubling of a waveform’s amplitude is equivalent to a 6 dB increase in volume. (The 16-bit amplitude used for CD audio has a dynamic range of 96 decibels.) The code shown in Figure 4 to alter the volume first calculates a value of t that increases linearly from 0 to 1, and then decreases back to 0. The amplitude variable is calculated using the pow function and ranges from 0 to 32,767. This is multiplied by a square wave that has values of 1 and -1. The result is added to the array twice: first for the left channel, then for the right channel.
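Where do those decibel figures come from? The decibel equivalent of an amplitude ratio is 20 times its base-10 logarithm:

dB = 20 × log10(A2 / A1)

A doubling of amplitude thus corresponds to 20 × log10(2), or about 6.02 dB, and the full 16-bit range spans 20 × log10(65,536), or about 96.3 dB.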

Human perception of frequency is also logarithmic. Much of the world’s music organizes pitch around the interval of an octave, which is a doubling of frequency. The first two notes of the chorus of “Somewhere over the Rainbow” are an octave leap whether it’s sung by a bass or a soprano. Figure 5 shows code that varies the pitch over a range of two octaves, from 220 Hz to 880 Hz, based on a value of t that (like the one in the volume example) goes from 0 to 1 and then back down to 0.

Figure 5 Sound Data That Changes in Frequency for SoundCharacteristics

double angle = 0;
for (int index = 0, sample = 0; sample < samples; sample++)
{
  double t = 1;
  if (sample < 2 * samples / 5)
    t = sample / (2.0 * samples / 5);
  else if (sample > 3 * samples / 5)
    t = (samples - sample) / (2.0 * samples / 5);
  double frequency = 220 * pow(2, 2 * t);
  double angleIncrement = 360 * frequency / waveformat.nSamplesPerSec;
  angle += angleIncrement;
  while (angle > 360)
    angle -= 360;
  short value = angle < 180 ? 32767 : -32767;
  pitchSoundData[index++] = value;
  pitchSoundData[index++] = value;
}

In earlier examples I chose a frequency of 441 Hz because it divides cleanly into the sampling rate of 44,100. In the general case, the sampling rate isn’t an integral multiple of the frequency, and hence there can’t be an integral number of samples per cycle. Instead, this program maintains a floating-point angleIncrement variable that is proportional to the frequency and is used to increment an angle value that ranges from 0 to 360 degrees. This angle value is then used to construct the waveform.

The demonstration for space moves the sound from the center to the left channel, then to the right, then back to the center.

For the timbre demo, the waveform starts with a sine curve at 441 Hz. A sine curve is the mathematical representation of the most fundamental type of vibration: the solution of the differential equation in which a restoring force is directly proportional to displacement and opposite in direction. All other periodic waveforms contain harmonics, which are also sine waves but with frequencies that are integral multiples of the fundamental frequency. The timbre demo changes the waveform smoothly from a sine wave to a triangle wave, to a square wave, to a sawtooth wave, progressively increasing the harmonic content of the sound.
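To make the harmonic idea concrete, here’s a sketch (my own illustration, not code from the SoundCharacteristics project) showing how a square wave can be built from its harmonic series: odd harmonics only, with amplitudes falling off as 1/n. (A triangle wave likewise has only odd harmonics, falling off as 1/n squared; a sawtooth has all harmonics, falling off as 1/n.)

// Sum the first several odd harmonics of a square wave.
// The angle is in degrees, matching the convention in Figure 5.
double SquareWaveFromHarmonics(double angle, int harmonicCount)
{
  const double pi = 3.14159265358979;
  double value = 0;
  for (int n = 1; n <= harmonicCount; n += 2)
    value += sin(n * angle * pi / 180) / n;
  // The Fourier series of a square wave converges to ±1 when scaled by 4/pi
  return 4 * value / pi;
}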

The Bigger Picture

Although I’ve just demonstrated how you can control volume, pitch, space and timbre by generating sound data for a single IXAudio2SourceVoice object, the object itself includes methods to change the volume and the space, and even the frequency. (A “3D” space facility is also supported.) And while it’s possible to generate composite sound data that combines a bunch of individual tones in a single source voice, you can instead create multiple IXAudio2SourceVoice objects and play them all together through the same mastering voice.
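As a quick illustration (the argument values here are arbitrary), those built-in controls look like this on an existing source voice:

// Halve the amplitude of everything this voice plays
pSourceVoice->SetVolume(0.5f);
// Double the playback frequency, raising the pitch one octave
pSourceVoice->SetFrequencyRatio(2.0f);
// Pan a one-channel voice entirely to the left of a two-channel output
float levels[2] = { 1.0f, 0.0f };
pSourceVoice->SetOutputMatrix(nullptr, 1, 2, levels);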

In addition, XAudio2 defines an IXAudio2SubmixVoice that allows you to define filters and other effects, such as reverberation or echo. Filters can change the timbre of existing tones dynamically, which can contribute greatly to creating interesting and realistic musical sounds.

Perhaps the most essential enhancement beyond what I’ve shown in these two programs requires working with XAudio2 callback functions. Instead of allocating and initializing big chunks of sound data as these two programs do, it makes much more sense for a program to generate sound data dynamically as it’s being played.


Charles Petzold is a longtime contributor to MSDN Magazine, and the author of “Programming Windows, 6th edition” (O’Reilly Media, 2012), a book about writing applications for Windows 8. His Web site is charlespetzold.com.

Thanks to the following technical expert for reviewing this article: Scott Selfon