09/04/2015

February 2013

Volume 28 Number 02

DirectX Factor - Constructing Audio Oscillators for Windows 8

By Charles Petzold | February 2013

I’ve been making electronic music instruments as a hobby for about 35 years now. I started in the late 1970s wiring up CMOS and TTL chips, and much later went the software route: first with the Multimedia Extensions to Windows in 1991, and more recently with the NAudio library for Windows Presentation Foundation (WPF) and the MediaStreamSource class in Silverlight and Windows Phone 7. Just last year, I devoted a couple of installments of my Touch & Go column to applications for Windows Phone that play sound and music.

I should probably be jaded by this time, and perhaps reluctant to explore yet another sound-generation API. But I’m not, because I think Windows 8 is probably the best Windows platform yet for making musical instruments. Windows 8 combines a high-performance audio API—the XAudio2 component of DirectX—with touchscreens on handheld tablets. This combination offers much potential, and I’m particularly interested in exploring how touch can be exploited as a subtle and intimate interface to a musical instrument implemented entirely in software.

Oscillators, Samples and Frequency

At the heart of the sound-generation facility of any music synthesizer are multiple oscillators, so called because they generate a more or less periodic oscillating waveform at a particular frequency and volume. In generating sounds for music, oscillators that create unvarying periodic waveforms usually sound rather boring. More interesting oscillators incorporate vibrato, tremolo or changing timbres, and they’re only roughly periodic.

A program that wishes to create oscillators using XAudio2 begins by calling the XAudio2Create function. This provides an object that implements the IXAudio2 interface. From that object you can call CreateMasteringVoice just once to obtain an instance of IXAudio2MasteringVoice, which functions as the main audio mixer. Only one IXAudio2MasteringVoice exists at any time. In contrast, you’ll generally call CreateSourceVoice multiple times to create multiple instances of the IXAudio2SourceVoice interface. Each of these IXAudio2SourceVoice instances can function as an independent oscillator. Combine multiple oscillators for a multiphonic instrument, an ensemble or a full orchestra.

An IXAudio2SourceVoice object generates sound by creating and submitting buffers containing a sequence of numbers that describe a waveform. These numbers are often called samples. They’re often 16 bits wide (the standard for CD audio), and they come at a constant rate—usually 44,100 Hz (also the standard for CD audio) or thereabouts. This technique has the fancy name Pulse Code Modulation, or PCM.

Although this sequence of samples can describe a very complex waveform, often a synthesizer generates a fairly simple stream of samples—most commonly a square wave, a triangle wave or a sawtooth—with a periodicity corresponding to the waveform’s frequency (perceived as pitch) and an average amplitude that is perceived as volume.

For example, if the sample rate is 44,100 Hz, and every cycle of 100 samples has values that get progressively larger, then smaller, then negative, and back to zero, the frequency of the resultant sound is 44,100 divided by 100, or 441 Hz—a frequency close to the perceptual center of the audible range for humans. (A frequency of 440 Hz is the A above middle C and is used as a tuning standard.)

The IXAudio2SourceVoice interface inherits a method named SetVolume from IXAudio2Voice and defines a method of its own named SetFrequencyRatio. I was particularly intrigued by this latter method, because it seemed to provide a way to create an oscillator that generates a particular periodic waveform at a variable frequency with a minimum of fuss.

Figure 1 shows the bulk of a class named SawtoothOscillator1 that implements this technique. Although I use familiar 16-bit integer samples for defining the waveform, internally XAudio2 uses 32-bit floating point samples. For performance-critical applications, you’ll probably want to explore the performance differences between integer and floating-point.

Figure 1 Much of the SawtoothOscillator1 Class

SawtoothOscillator1::SawtoothOscillator1(IXAudio2* pXAudio2)
{
  // Create a source voice
  WAVEFORMATEX waveFormat;
  waveFormat.wFormatTag = WAVE_FORMAT_PCM;
  waveFormat.nChannels = 1;
  waveFormat.nSamplesPerSec = 44100;
  waveFormat.nAvgBytesPerSec = 44100 * 2;
  waveFormat.nBlockAlign = 2;
  waveFormat.wBitsPerSample = 16;
  waveFormat.cbSize = 0;
  HRESULT hr = pXAudio2->CreateSourceVoice(&pSourceVoice, &waveFormat,
                                           0, XAUDIO2_MAX_FREQ_RATIO);
  if (FAILED(hr))
    throw ref new COMException(hr, "CreateSourceVoice failure");
  // Initialize the waveform buffer
  for (int sample = 0; sample < BUFFER_LENGTH; sample++)
    waveformBuffer[sample] =
      (short)(65535 * sample / BUFFER_LENGTH - 32768);
  // Submit the waveform buffer
  XAUDIO2_BUFFER buffer = {0};
  buffer.AudioBytes = 2 * BUFFER_LENGTH;
  buffer.pAudioData = (byte *)waveformBuffer;
  buffer.Flags = XAUDIO2_END_OF_STREAM;
  buffer.PlayBegin = 0;
  buffer.PlayLength = BUFFER_LENGTH;
  buffer.LoopBegin = 0;
  buffer.LoopLength = BUFFER_LENGTH;
  buffer.LoopCount = XAUDIO2_LOOP_INFINITE;
  hr = pSourceVoice->SubmitSourceBuffer(&buffer);
  if (FAILED(hr))
    throw ref new COMException(hr, "SubmitSourceBuffer failure");
  // Start the voice playing
  pSourceVoice->Start();
}
void SawtoothOscillator1::SetFrequency(float freq)
{
  pSourceVoice->SetFrequencyRatio(freq / BASE_FREQ);
}
void SawtoothOscillator1::SetAmplitude(float amp)
{
  pSourceVoice->SetVolume(amp);
}

In the header file, a base frequency is set that divides cleanly into the 44,100 sampling rate. From that, a buffer size can be calculated that is the length of a single cycle of a waveform of that frequency:

static const int BASE_FREQ = 441;
static const int BUFFER_LENGTH = (44100 / BASE_FREQ);

Also in the header file is the definition of that buffer as a field:

short waveformBuffer[BUFFER_LENGTH];

After creating the IXAudio2SourceVoice object, the SawtoothOscillator1 constructor fills up a buffer with one cycle of a sawtooth waveform, a simple ramp that starts at an amplitude of -32,768 and climbs toward 32,767 before the cycle repeats. This buffer is submitted to the IXAudio2SourceVoice with instructions that it should be repeated forever.
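
The buffer-filling arithmetic can be tried outside of XAudio2. The following is a minimal portable sketch, not the project's code: the function names makeSawtoothCycle and cycleFrequency are hypothetical, and only the formula inside the loop comes from Figure 1.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: builds one cycle of a 16-bit sawtooth using the
// same formula as the Figure 1 constructor. The 64-bit intermediate is an
// added precaution against overflow for long cycles.
std::vector<std::int16_t> makeSawtoothCycle(int length)
{
    std::vector<std::int16_t> cycle(length);
    for (int sample = 0; sample < length; ++sample)
        cycle[sample] =
            static_cast<std::int16_t>(65535LL * sample / length - 32768);
    return cycle;
}

// Hypothetical helper: frequency of a looped cycle of the given length.
double cycleFrequency(int sampleRate, int cycleLength)
{
    return static_cast<double>(sampleRate) / cycleLength;
}
```

With a length of 100, the first sample is -32,768 and the last is 32,111, because the ramp stops one step short of the top before wrapping. Looping that cycle at 44,100 samples per second gives 441 Hz.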

Without any further code, this is an oscillator that plays a 441 Hz sawtooth wave forever. That’s great, but it’s not very versatile. To give SawtoothOscillator1 a bit more versatility, I’ve also included a SetFrequency method. The argument to this is a frequency that the class uses to call SetFrequencyRatio. The value passed to SetFrequencyRatio can range from float values of XAUDIO2_MIN_FREQ_RATIO (or 1/1,024.0) up to a maximum value earlier specified as an argument to CreateSourceVoice. I used XAUDIO2_MAX_FREQ_RATIO (or 1,024.0) for that argument. The range of human hearing—about 20 Hz to 20,000 Hz—is well within the bounds defined by those two constants applied to the base frequency of 441.
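
To see how much headroom those limits leave, here is a small sketch of the ratio calculation. The constant and function names are hypothetical stand-ins; the two limits are the values quoted above for XAUDIO2_MIN_FREQ_RATIO and XAUDIO2_MAX_FREQ_RATIO, written out so the example doesn't need the XAudio2 headers.

```cpp
// Stand-ins for the values described in the text (assumptions, not the
// actual header definitions).
constexpr float kBaseFreq = 441.0f;
constexpr float kMinFreqRatio = 1.0f / 1024.0f;
constexpr float kMaxFreqRatio = 1024.0f;

// The value SetFrequency would hand to SetFrequencyRatio.
constexpr float frequencyRatio(float freq)
{
    return freq / kBaseFreq;
}

// True if the ratio falls inside the range the voice accepts.
constexpr bool ratioInRange(float ratio)
{
    return ratio >= kMinFreqRatio && ratio <= kMaxFreqRatio;
}
```

With a base frequency of 441 Hz, 20 Hz maps to a ratio of about 0.045 and 20,000 Hz to about 45.4, both comfortably inside the allowed range.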

Buffers and Callbacks

I must confess that I was initially somewhat skeptical of the SetFrequencyRatio method. Digitally increasing and decreasing the frequency of a waveform is not a trivial task. I felt obliged to compare the results with a waveform generated algorithmically. This is the impetus behind the OscillatorCompare project, which is among the downloadable code for this column.

The OscillatorCompare project includes the SawtoothOscillator1 class I’ve already described as well as a SawtoothOscillator2 class. This second class has a SetFrequency method that controls how the class dynamically generates the samples that define the waveform. This waveform is continuously constructed in a buffer and submitted in real time to the IXAudio2SourceVoice object in response to callbacks.

A class can receive callbacks from IXAudio2SourceVoice by implementing the IXAudio2VoiceCallback interface. An instance of the class that implements this interface is then passed as an argument to the CreateSourceVoice method. The SawtoothOscillator2 class implements this interface itself and it passes its own instance to CreateSourceVoice, also indicating that it won’t be making use of SetFrequencyRatio:

pXAudio2->CreateSourceVoice(&pSourceVoice, &waveFormat,
        XAUDIO2_VOICE_NOPITCH, 1.0f,
        this);

A class that implements IXAudio2VoiceCallback can use the OnBufferStart method to be notified when it’s time to submit a new buffer of waveform data. Generally when using OnBufferStart to keep waveform data up-to-date, you’ll want to maintain a pair of buffers and alternate them. This is probably the best solution if you’re obtaining audio data from another source, such as an audio file. The goal is to not let the audio processor become “starved.” Keeping a buffer ahead of the processing helps prevent starvation, but does not guarantee against it.

But I gravitated toward another method defined by IXAudio2VoiceCallback—OnVoiceProcessingPassStart. Unless you’re working with very small buffers, generally OnVoiceProcessingPassStart is called more frequently than OnBufferStart and indicates when a chunk of audio data is about to be processed and how many bytes are needed. In the XAudio2 documentation, this callback method is promoted as the one with the lowest latency, which is often highly desirable for interactive electronic music instruments. You don’t want a delay between pressing a key and hearing the note!

The SawtoothOscillator2 header file defines two constants:

static const int BUFFER_LENGTH = 1024;
static const int WAVEFORM_LENGTH = 8192;

The first constant is the length of the buffer used to submit waveform data. Here it functions as a circular buffer. Calls to the OnVoiceProcessingPassStart method request a particular number of bytes. The method responds by putting those bytes in the buffer (starting from where it left off the last time) and calling SubmitSourceBuffer just for that updated segment of the buffer. You want this buffer to be sufficiently large so your program code isn’t overwriting the part of the buffer still being played in the background.

It turns out that for a voice with a sample rate of 44,100 Hz, calls to OnVoiceProcessingPassStart always request 882 bytes, or 441 16-bit samples. In other words, OnVoiceProcessingPassStart is called at the constant rate of 100 times per second, or every 10 ms. Although not documented, this 10 ms duration can be treated as an XAudio2 audio processing “quantum,” and it’s a good figure to keep in mind. Consequently, the code you write for this method can’t dawdle. Avoid API calls and runtime library calls.
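
The arithmetic behind those numbers is easy to check. This sketch assumes the 10 ms figure observed above; the function names are hypothetical.

```cpp
// Number of samples consumed in one processing pass, assuming the
// 10 ms quantum described in the text.
constexpr int samplesPerQuantum(int sampleRate, int quantumMs = 10)
{
    return sampleRate * quantumMs / 1000;
}

// The same quantity in bytes for a given sample width.
constexpr int bytesPerQuantum(int sampleRate, int bytesPerSample,
                              int quantumMs = 10)
{
    return samplesPerQuantum(sampleRate, quantumMs) * bytesPerSample;
}

// How many whole passes fit in a circular buffer of the given length
// (measured in samples).
constexpr int quantaThatFit(int bufferLengthInSamples, int sampleRate)
{
    return bufferLengthInSamples / samplesPerQuantum(sampleRate);
}
```

For 44,100 Hz mono 16-bit audio this gives 441 samples, or 882 bytes, per pass. A 1,024-sample circular buffer therefore holds two full passes, which is the margin that keeps new data from overwriting samples still being played.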

The second constant is the length of a single cycle of the desired waveform. It could be the size of an array containing the samples of that waveform, but in SawtoothOscillator2 it’s used only for calculations.

The SetFrequency method in SawtoothOscillator2 uses that constant to calculate an angle increment that’s proportional to the desired frequency of the waveform:

angleIncrement = (int)(65536.0
                * WAVEFORM_LENGTH
                * freq / 44100.0);

Although angleIncrement is an integer, it’s treated as though it comprises integral and fractional words. This is the value used to determine each successive sample of the waveform.

For example, suppose the argument to SetFrequency is 440 Hz. The angleIncrement is calculated as 5,356,535. In hexadecimal, this is 0x51BBF7, which is treated as an integer of 0x51 (or 81 decimal), with a fractional part of 0xBBF7, equivalent to 0.734. If the complete cycle of a waveform is 8,192 samples and you use only the integer part, advancing 81 positions through that cycle for each output sample, the resultant frequency is about 436.05 Hz. (That’s 44,100 times 81 divided by 8,192.) If you advance 82 positions, the resultant frequency is 441.43 Hz. You want something between these two frequencies.

This is why a fractional part also needs to enter the calculation. The whole thing would probably be easier in floating point, and floating point might even be faster on some modern processors, but Figure 2 shows a more “traditional” integer-only approach. Notice that only the updated section of the circular buffer is specified with each call to SubmitSourceBuffer.

Figure 2 OnVoiceProcessingPassStart in SawtoothOscillator2

void _stdcall SawtoothOscillator2::OnVoiceProcessingPassStart(UINT32 bytesRequired)
{
  if (bytesRequired == 0)
      return;
  int startIndex = index;
  int endIndex = startIndex + bytesRequired / 2;
  if (endIndex <= BUFFER_LENGTH)
  {
    FillAndSubmit(startIndex, endIndex - startIndex);
  }
  else
  {
    FillAndSubmit(startIndex, BUFFER_LENGTH - startIndex);
    FillAndSubmit(0, endIndex % BUFFER_LENGTH);
  }
  index = (index + bytesRequired / 2) % BUFFER_LENGTH;
}
void SawtoothOscillator2::FillAndSubmit(int startIndex, int count)
{
  for (int i = startIndex; i < startIndex + count; i++)
  {
    pWaveformBuffer[i] = (short)(angle / WAVEFORM_LENGTH - 32768);
    angle = (angle + angleIncrement) % (WAVEFORM_LENGTH * 65536);
  }
  XAUDIO2_BUFFER buffer = {0};
  buffer.AudioBytes = 2 * BUFFER_LENGTH;
  buffer.pAudioData = (byte *)pWaveformBuffer;
  buffer.Flags = 0;
  buffer.PlayBegin = startIndex;
  buffer.PlayLength = count;
  HRESULT hr = pSourceVoice->SubmitSourceBuffer(&buffer);
  if (FAILED(hr))
    throw ref new COMException(hr, "SubmitSourceBuffer");
}
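
The sample-generation loop in FillAndSubmit can be exercised without XAudio2. The class below is a hypothetical, portable sketch of the same phase-accumulator idea: it writes into a std::vector instead of submitting buffers, and it counts completed cycles so the frequency can be checked.

```cpp
#include <cstdint>
#include <vector>

constexpr int kSampleRate = 44100;
constexpr std::uint32_t kWaveformLength = 8192;
constexpr std::uint32_t kAngleModulus = kWaveformLength * 65536u;  // 2^29

// Hypothetical stand-in for the oscillator's sample loop.
class PhaseAccumulatorSawtooth
{
public:
    void setFrequency(double freq)
    {
        angleIncrement_ = static_cast<std::uint32_t>(
            65536.0 * kWaveformLength * freq / kSampleRate);
    }

    // Appends count samples to out and returns the number of waveform
    // cycles completed while doing so.
    int generate(std::vector<std::int16_t>& out, int count)
    {
        int wraps = 0;
        for (int i = 0; i < count; ++i)
        {
            // angle_ / kWaveformLength spans 0..65535; shift to signed.
            out.push_back(static_cast<std::int16_t>(
                static_cast<std::int32_t>(angle_ / kWaveformLength) - 32768));

            std::uint64_t next =
                static_cast<std::uint64_t>(angle_) + angleIncrement_;
            wraps += static_cast<int>(next / kAngleModulus);
            angle_ = static_cast<std::uint32_t>(next % kAngleModulus);
        }
        return wraps;
    }

private:
    std::uint32_t angle_ = 0;
    std::uint32_t angleIncrement_ = 0;
};
```

Generating one second of samples (44,100 of them) at 440 Hz should complete very nearly 440 cycles. Because the increment is truncated to an integer, the last cycle may fall just short of finishing.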

SawtoothOscillator1 and SawtoothOscillator2 can be compared side-by-side in the OscillatorCompare program. MainPage has two pairs of Slider controls to change the frequency and volume of each oscillator. The Slider control for the frequency generates only integer values ranging from 24 to 132. I borrowed these values from the codes used in the Musical Instrument Digital Interface (MIDI) standard to represent pitches. The value of 24 corresponds to the C three octaves below middle C, which is called C1 (C in octave 1) in scientific pitch notation and has a frequency of about 32.7 Hz. The value of 132 corresponds to C10, six octaves above middle C, and a frequency of about 16,744 Hz. A tooltip converter on these sliders displays the current value in both scientific pitch notation and the frequency equivalent.
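
The conversion such a tooltip needs is short. The following sketch is not the project's converter; the function names are hypothetical, and it assumes non-negative note codes with middle C at 60 and A 440 at 69.

```cpp
#include <cmath>
#include <string>

// Equal-tempered frequency for a MIDI note code (69 is the A at 440 Hz).
double midiToFrequency(int note)
{
    return 440.0 * std::pow(2.0, (note - 69) / 12.0);
}

// Scientific pitch notation for a non-negative MIDI note code.
// Sharps are used for the black keys; middle C (60) is C4.
std::string midiToPitchName(int note)
{
    static const char* const names[] = { "C", "C#", "D", "D#", "E", "F",
                                         "F#", "G", "G#", "A", "A#", "B" };
    return std::string(names[note % 12]) + std::to_string(note / 12 - 1);
}
```

For the two ends of the slider range, midiToPitchName returns "C1" for 24 and "C10" for 132, with frequencies of about 32.7 Hz and 16,744 Hz.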

As I experimented with these two oscillators, I couldn’t hear a difference. I also installed a software oscilloscope on another computer to visually examine the resultant waveforms, and I couldn’t see any difference either. This indicates to me that the SetFrequencyRatio method is implemented intelligently, which of course we should expect in a system as sophisticated as DirectX. I suspect that interpolations are being performed on resampled waveform data to shift the frequency. If you’re nervous, you can set the BASE_FREQ very low—for example, to 20 Hz—and the class will generate a detailed waveform consisting of 2,205 samples. You can also experiment with a high value: For example, 8,820 Hz will cause a waveform of just five samples to be generated! To be sure, this has a somewhat different sound because the interpolated waveform lies somewhere between a sawtooth and a triangle wave, but the resultant waveform is still smooth without “jaggies.”
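
How XAudio2 actually resamples isn't something this column establishes, so the following is only an illustration of the guess above. It assumes plain linear interpolation between neighboring samples of a looped cycle; the function name is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a looped cycle at a fractional, non-negative position by linearly
// interpolating between the two nearest samples. The sample after the last
// one is the first one, because the cycle repeats.
double interpolateLooped(const std::vector<std::int16_t>& cycle,
                         double position)
{
    const std::size_t n = cycle.size();
    const double wrapped = std::fmod(position, static_cast<double>(n));
    const std::size_t i0 = static_cast<std::size_t>(wrapped);
    const std::size_t i1 = (i0 + 1) % n;
    const double frac = wrapped - static_cast<double>(i0);
    return cycle[i0] + frac * (cycle[i1] - cycle[i0]);
}
```

A five-sample sawtooth built with the Figure 1 formula is -32768, -19661, -6554, 6553, 19660. Reading it at position 4.5 yields -6554, halfway down the slope from the last sample back to the first. Under this assumption, the instantaneous drop of an ideal sawtooth becomes a steep ramp lasting one sample interval, which is consistent with a shape somewhere between a sawtooth and a triangle.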

This is not to imply that everything works hunky-dory. With either sawtooth oscillator, the top couple of octaves get rather chaotic. The sampling of the waveform tends to emit high- and low-frequency overtones of a sort I’ve heard before, and which I plan to investigate more fully in the future.

Keep the Volume Down!

The SetVolume method defined by IXAudio2Voice and inherited by IXAudio2SourceVoice is documented as a floating-point multiplier that can be set to values ranging from -2²⁴ to 2²⁴, which equals 16,777,216.

In real life, however, you’ll probably want to keep the volume on an IXAudio2SourceVoice object to a value between 0 and 1. The 0 value corresponds to silence and 1 corresponds to no gain or attenuation. Keep in mind that whatever the source of the waveform associated with an IXAudio2SourceVoice—whether it’s being generated algorithmically or originates in an audio file—it probably has 16-bit samples that quite possibly come close to the minimum and maximum values of -32,768 and 32,767. If you try to amplify those waveforms with a volume level greater than 1, the samples will exceed the width of a 16-bit integer and will be clipped at the minimum and maximum values. Distortion and noise will result.
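
The effect of a volume above 1 on a 16-bit sample can be sketched as follows. This models only the arithmetic described above, not where in the XAudio2 pipeline the clipping takes place; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Scales a 16-bit sample by a volume multiplier and clips the result
// to the range a 16-bit integer can hold.
std::int16_t applyVolume(std::int16_t sample, float volume)
{
    float scaled = static_cast<float>(sample) * volume;
    scaled = std::max(-32768.0f, std::min(32767.0f, scaled));
    return static_cast<std::int16_t>(scaled);
}
```

A sample of 20,000 passes through unchanged at a volume of 1, but at a volume of 2 it is pinned at 32,767 rather than reaching 40,000. That flattening of the peaks is what is heard as distortion.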

This becomes critical when you start combining multiple IXAudio2SourceVoice instances. The waveforms of these multiple instances are mixed by being added together. If you allow each of these instances to have a volume of 1, the sum of the voices could very well result in samples that exceed the size of the 16-bit integers. This might happen sporadically—resulting only in intermittent distortion—or chronically, resulting in a real mess.

When using multiple IXAudio2SourceVoice instances that generate full 16-bit-wide waveforms, one safety measure is setting the volume of each oscillator to 1 divided by the number of voices. That guarantees that the sum never exceeds a 16-bit value. An overall volume adjustment can also be made via the mastering voice. You might also want to look into the XAudio2CreateVolumeMeter function, which lets you create an audio processing object that can help monitor volume for debugging purposes.
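
A quick worst-case check of that rule, as a sketch: mixing is modeled here as a plain sum of scaled samples, and the function name is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Sums one sample from each voice after applying the same volume to all.
double mixVoices(const std::vector<std::int16_t>& samples,
                 double volumePerVoice)
{
    double sum = 0.0;
    for (std::int16_t sample : samples)
        sum += sample * volumePerVoice;
    return sum;
}
```

With six voices all at the maximum of 32,767, a per-voice volume of 1 produces a sum of 196,602, far beyond the 16-bit range. A per-voice volume of 1/6 brings the worst case back to 32,767.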

Our First Musical Instrument

It’s common for musical instruments on tablets to have a piano-style keyboard, but I’ve been intrigued recently by a type of button keyboard found on accordions such as the Russian bayan (which I’m familiar with from the work of Russian composer Sofia Gubaidulina). Because each key is a button rather than a long lever, many more keys can be packed within the limited space of the tablet screen, as shown in Figure 3.

Figure 3 The ChromaticButtonKeyboard Program

The bottom two rows duplicate the keys on the top two rows and are provided to ease the fingering of common chords and melodic sequences. Otherwise, each group of 12 keys in the top three rows provides all the notes of the octave, generally ascending from left to right. The total range here is four octaves, which is about twice what you’d get with a piano keyboard of the same size.

A real bayan has an additional octave, but I couldn’t fit it in without making the buttons too small. The source code allows you to set constants to try out that extra octave, or to eliminate another octave and make the buttons even larger.

Because I can’t claim that this program sounds like any instrument that exists in the real world, I simply called it ChromaticButtonKeyboard. The keys are instances of a custom control named Key that derives from ContentControl but performs some touch processing to maintain an IsPressed property and generate an IsPressedChanged event. The difference between the touch handling in this control and the touch handling in an ordinary button (which also has an IsPressed property) is noticeable when you sweep your finger across the keyboard: A standard button will set the IsPressed property to true only if the finger press occurs on the surface of the button, while this custom Key control considers the key to be pressed if a finger sweeps in from the side.

The program creates six instances of a SawtoothOscillator class that’s virtually identical to the SawtoothOscillator1 class from the earlier project. If your touchscreen supports it, you can play six simultaneous notes. There are no callbacks and the oscillator frequency is controlled by calls to the SetFrequencyRatio method.

To keep track of which oscillators are available and which oscillators are playing, the MainPage.xaml.h file defines two standard collection objects as fields:

std::vector<SawtoothOscillator *> availableOscillators;
std::map<int, SawtoothOscillator *> playingOscillators;

Originally, each Key object had its Tag property set to the MIDI note code I discussed earlier. That’s how the IsPressedChanged handler determines what key is being pressed, and what frequency to calculate. That MIDI code was also used as the map key for the playingOscillators collection. It worked fine until I played a note from the bottom two rows that duplicated a note already playing, which resulted in a duplicate key and an exception. I easily solved that problem by incorporating a value into the Tag property indicating the row in which the key is located: The Tag now equals the MIDI note code plus 1,000 times the row number.
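
The encoding is simple enough to show in a few lines. The helper names are hypothetical; the scheme itself (MIDI note code plus 1,000 times the row number) is the one just described, and taking the tag modulo 1,000 is how the Figure 4 handler recovers the note.

```cpp
constexpr int kRowMultiplier = 1000;

// Combines a MIDI note code and a row number into one unique tag.
constexpr int makeKeyTag(int midiNote, int row)
{
    return midiNote + kRowMultiplier * row;
}

// Recovers the MIDI note code from a tag.
constexpr int midiNoteFromTag(int tag)
{
    return tag % kRowMultiplier;
}

// Recovers the row number from a tag.
constexpr int rowFromTag(int tag)
{
    return tag / kRowMultiplier;
}
```

Middle C in row 0 and its duplicate in row 3 get the tags 60 and 3,060, so they can coexist as map keys while still decoding to the same note.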

Figure 4 shows the IsPressedChanged handler for the Key instances. When a key is pressed, an oscillator is removed from the availableOscillators collection, given a frequency and non-zero volume, and put into the playingOscillators collection. When a key is released, the oscillator is given a zero volume and moved back to availableOscillators.

Figure 4 The IsPressedChanged Handler for the Key Instances

void MainPage::OnKeyIsPressedChanged(Object^ sender, bool isPressed)
{
  Key^ key = dynamic_cast<Key^>(sender);
  int keyNum = (int)key->Tag;
  if (isPressed)
  {
    if (availableOscillators.size() > 0)
    {
      SawtoothOscillator* pOscillator = availableOscillators.back();
      availableOscillators.pop_back();
      double freq = 440 * pow(2, (keyNum % 1000 - 69) / 12.0);
      pOscillator->SetFrequency((float)freq);
      pOscillator->SetAmplitude(1.0f / NUM_OSCILLATORS);
      playingOscillators[keyNum] = pOscillator;
    }
  }
  else
  {
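    // Use find() rather than operator[] so that releasing a key that
    // never got an oscillator doesn't insert an empty entry into the map
    auto it = playingOscillators.find(keyNum);
    if (it != playingOscillators.end())
    {
      it->second->SetAmplitude(0);
      availableOscillators.push_back(it->second);
      playingOscillators.erase(it);
    }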
  }
}

That’s about as simple as a multi-voice instrument can be, and of course it’s flawed: Sounds should not be turned off and on like a switch. The volume should glide up rapidly but smoothly when a note starts, and fall back when it stops. Many real instruments also have a change in volume and timbre as the note progresses. There’s still plenty of room for enhancements.
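
One way to approach the volume glide is a ramp that moves toward a target level in small steps. The class below is a hypothetical sketch of that idea, not part of ChromaticButtonKeyboard; the step size and how often the ramp is advanced are left to the caller.

```cpp
#include <algorithm>

// Moves a level toward a target by a fixed step on each call to next().
class LinearRamp
{
public:
    explicit LinearRamp(float stepPerUpdate) : step_(stepPerUpdate) {}

    // 1.0f when a key is pressed, 0.0f when it is released, for example.
    void setTarget(float target) { target_ = target; }

    // Advances one step toward the target and returns the new level.
    float next()
    {
        if (level_ < target_)
            level_ = std::min(level_ + step_, target_);
        else if (level_ > target_)
            level_ = std::max(level_ - step_, target_);
        return level_;
    }

private:
    float level_ = 0.0f;
    float target_ = 0.0f;
    float step_;
};
```

A caller could advance the ramp at regular intervals and pass the result, scaled by the per-voice maximum, to SetAmplitude. A step of 0.25 reaches full volume in four updates; smaller steps give a gentler attack and release.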

But considering the simplicity of the code, it works surprisingly well and is very responsive. If you compile the program for the ARM processor, you can deploy it on the ARM-based Microsoft Surface and walk around cradling the untethered tablet in one arm while playing on it with the other hand, which I must say is a bit of a thrill.

Charles Petzold is a longtime contributor to MSDN Magazine and the author of “Programming Windows, 6th edition” (O’Reilly Media, 2012), a book about writing applications for Windows 8. His Web site is charlespetzold.com.

Thanks to the following technical experts for reviewing this article: Tom Mathews and Thomas Petchel
