Audio Object (SAPI 5.3)


Audio Object


This document is intended to help developers write custom audio objects. Application developers can use this object to move speech data between memory and SAPI for speech recognition (SR) and text-to-speech (TTS). The object does not generate or consume any audio data. Instead, it works as an audio buffer manager. For SR, the application passes audio data to the object using the custom method ISpAudioPlug::SetData, and SAPI retrieves the data from the object using IStream::Read. For TTS, SAPI passes audio data to the object using IStream::Write, and the application retrieves the data by calling the custom method ISpAudioPlug::GetData.

To use the audio object for TTS output, the application must use asynchronous speak, because the audio object does not consume the audio data itself. The application must consume the audio data by calling ISpAudioPlug::GetData. If the application uses synchronous speak, SAPI blocks the client thread, and SAPI's write call on the audio object blocks once the internal queue cannot allocate more space. Retrieving the audio data on a different thread avoids this problem.
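The blocking behavior described above can be illustrated with a minimal, portable C++ sketch. It does not use SAPI: BoundedAudioQueue, its Write and GetData methods, and RunProducerConsumer are hypothetical names, and the real ISpAudioPlug::GetData may well return immediately rather than wait for data. The sketch only shows why a writer that fills a bounded queue needs a second thread to drain it.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the audio object's internal queue (not a SAPI type).
class BoundedAudioQueue {
public:
    explicit BoundedAudioQueue(std::size_t capacity) : capacity_(capacity) {}

    // Plays the role of IStream::Write: blocks while the queue is full.
    void Write(const std::uint8_t* data, std::size_t size) {
        std::size_t written = 0;
        while (written < size) {
            std::unique_lock<std::mutex> lock(mutex_);
            notFull_.wait(lock, [this] { return bytes_.size() < capacity_; });
            const std::size_t n =
                std::min(size - written, capacity_ - bytes_.size());
            bytes_.insert(bytes_.end(), data + written, data + written + n);
            written += n;
            notEmpty_.notify_one();
        }
    }

    // Plays the role of ISpAudioPlug::GetData: waits for data, then drains it.
    std::vector<std::uint8_t> GetData() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !bytes_.empty(); });
        std::vector<std::uint8_t> out(bytes_.begin(), bytes_.end());
        bytes_.clear();
        notFull_.notify_one();
        return out;
    }

private:
    const std::size_t capacity_;  // must be greater than zero
    std::deque<std::uint8_t> bytes_;
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
};

// Writes totalBytes through a small queue while a second thread drains it.
// Returns the number of bytes the consumer thread received.
std::size_t RunProducerConsumer(std::size_t capacity, std::size_t totalBytes) {
    BoundedAudioQueue queue(capacity);
    std::size_t received = 0;
    std::thread consumer([&] {
        while (received < totalBytes) {
            received += queue.GetData().size();
        }
    });
    const std::vector<std::uint8_t> audio(totalBytes, 0x7F);
    queue.Write(audio.data(), audio.size());  // would block forever without the consumer
    consumer.join();
    return received;
}
```

Calling RunProducerConsumer(16, 1000) pushes 1000 bytes through a 16-byte queue. Without the consumer thread, the Write call would stall as soon as the first 16 bytes were queued, which mirrors the synchronous-speak case.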

Interface description


The SAPI audio object needs to implement the ISpAudio interface.

Custom interface ISpAudioPlug

ISpAudioPlug inherits from ISpeechAudio. It provides methods to send and retrieve audio data. The interface is automation compliant and can be used easily in languages that support automation.

ISpAudioPlug::Init(VARIANT_BOOL fWrite, SpeechAudioFormatType FormatType)

This method is used to initialize the audio object's basic mode, including the read/write mode, as well as initialize the audio data format. If fWrite is TRUE, then the object is in write mode; if fWrite is FALSE the object is in read mode. FormatType specifies the audio format. By default, the object is in write mode and the format is set to SPSF_22kHz16BitMono. If the method is called while the object is processing audio data, SPERR_DEVICE_BUSY is returned.

ISpAudioPlug::SetData(VARIANT vData, long * pWritten)

Use this method for SR, when the object is set to read mode. The caller uses this method to send audio data so that SAPI can retrieve the audio data through IStream::Read.

ISpAudioPlug::GetData(VARIANT* vData)

Use this method for TTS, when the object is set to write mode. The caller uses this method to retrieve audio data.

SAPI automation ISpeechAudio

The sample audio object provides an empty implementation of ISpeechAudio. To make the audio object usable in languages that support automation, SAPI requires that the object implement ISpeechAudio, which inherits from IDispatch. Internally, SAPI does not use ISpeechAudio directly. It calls QueryInterface for ISpAudio and then calls the methods on ISpAudio. For this reason, the audio object only needs to provide an empty implementation of ISpeechAudio.

Buffer management

Internally, the audio object uses a queue object, CBasicQueueByArray, to manage the incoming and outgoing audio data. The queue stores its data in an array. When the data reaches the end of the array, the queue moves the remaining data back to the head of the array to make room for new data. The methods on the queue object are thread safe.
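As an illustration of this kind of array-backed queue, here is a minimal sketch in standard C++. ArrayByteQueue and its Push, Pop, and Size methods are hypothetical; the actual layout and method names of CBasicQueueByArray are not given in this document. The sketch assumes that "moving to the head" means sliding the unread bytes to the front of the array when the write position reaches the end.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical array-backed byte queue; not the real CBasicQueueByArray.
class ArrayByteQueue {
public:
    explicit ArrayByteQueue(std::size_t capacity)
        : buffer_(capacity), head_(0), tail_(0) {}

    // Appends up to size bytes and returns the number actually stored.
    std::size_t Push(const std::uint8_t* data, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (buffer_.size() - tail_ < size && head_ > 0) {
            // Not enough room at the end: slide unread bytes to the front.
            const std::size_t unread = tail_ - head_;
            std::memmove(buffer_.data(), buffer_.data() + head_, unread);
            head_ = 0;
            tail_ = unread;
        }
        const std::size_t n = std::min(size, buffer_.size() - tail_);
        if (n > 0) {
            std::memcpy(buffer_.data() + tail_, data, n);
            tail_ += n;
        }
        return n;
    }

    // Removes up to size bytes into out and returns the number removed.
    std::size_t Pop(std::uint8_t* out, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::size_t n = std::min(size, tail_ - head_);
        if (n > 0) {
            std::memcpy(out, buffer_.data() + head_, n);
            head_ += n;
        }
        return n;
    }

    // Number of unread bytes currently held.
    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return tail_ - head_;
    }

private:
    std::vector<std::uint8_t> buffer_;
    std::size_t head_;  // index of the first unread byte
    std::size_t tail_;  // index one past the last stored byte
    mutable std::mutex mutex_;
};
```

A single mutex around every method is the simplest way to get the thread safety the text mentions. Compacting instead of wrapping keeps the unread data contiguous, at the cost of an occasional memmove.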

State management

When SAPI starts an audio stream, the audio state changes to SPAS_RUN. When SAPI closes an audio stream, the audio state changes to SPAS_CLOSED. The audio object must perform the appropriate action according to the audio state. For example, when the audio state changes to SPAS_CLOSED, the audio object needs to free the audio buffer and signal any other threads that are waiting for audio data.


The thread that calls IStream::Read on the audio object is the same one that the SR engine uses to call SAPI, so the client thread calling SetData or GetData is different from SAPI's IStream::Read thread. The audio object therefore needs to be thread safe.
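The following portable C++ sketch shows one way a state change to closed can free the buffer and wake a reader that is blocked waiting for data. It is not SAPI code: AudioState, StatefulAudioBuffer, and CloseWhileReaderWaits are hypothetical names, and ignoring SetData while the stream is not running is an assumption made for the example.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical two-state mirror of the closed and running audio states.
enum class AudioState { Closed, Run };

class StatefulAudioBuffer {
public:
    void SetState(AudioState newState) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            state_ = newState;
            if (newState == AudioState::Closed) {
                std::vector<std::uint8_t>().swap(data_);  // free the buffer
            }
        }
        changed_.notify_all();  // wake every thread waiting in Read
    }

    // Client side: queue audio data (ignored unless the stream is running).
    void SetData(const std::vector<std::uint8_t>& bytes) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (state_ != AudioState::Run) return;
            data_.insert(data_.end(), bytes.begin(), bytes.end());
        }
        changed_.notify_all();
    }

    // Reader side: blocks until data arrives or the stream is closed.
    // Returns the number of bytes copied; 0 means the stream was closed.
    std::size_t Read(std::uint8_t* out, std::size_t size) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] {
            return !data_.empty() || state_ == AudioState::Closed;
        });
        const std::size_t n = std::min(size, data_.size());
        const auto last = data_.begin() + static_cast<std::ptrdiff_t>(n);
        std::copy(data_.begin(), last, out);
        data_.erase(data_.begin(), last);
        return n;
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    std::vector<std::uint8_t> data_;
    AudioState state_ = AudioState::Closed;
};

// Starts a reader on an empty running stream, then closes the stream.
// Returns what the reader's Read call returned.
std::size_t CloseWhileReaderWaits() {
    StatefulAudioBuffer buffer;
    buffer.SetState(AudioState::Run);
    std::size_t result = 99;
    std::thread reader([&] {
        std::uint8_t scratch[16];
        result = buffer.Read(scratch, sizeof(scratch));
    });
    buffer.SetState(AudioState::Closed);
    reader.join();
    return result;
}
```

Because the wait predicate checks the state as well as the buffer, the reader returns whether the close happens before or after it starts waiting.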

Event rerouting

The audio object implements ISpEventSink and ISpEventSource. SAPI forwards the SR/TTS events to the audio object. The audio object then forwards to SAPI the events whose audio position is later than the current device position.
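Read literally, the forwarding rule is a comparison between each event's audio position and the device position. The sketch below illustrates only that comparison in standard C++. AudioEvent is a hypothetical, reduced stand-in for an event record that carries a stream offset, and TakeEventsPastDevicePosition is an invented helper; neither is part of SAPI, and the real object's queuing logic is not described in this document.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, reduced stand-in for an event record with a stream offset.
struct AudioEvent {
    int id;
    std::uint64_t audioOffset;  // position of the event in the audio stream, in bytes
};

// Removes from pending every event whose audio position is later than
// devicePosition and returns those events in their original order.
// Events at or before devicePosition stay in pending.
std::vector<AudioEvent> TakeEventsPastDevicePosition(
    std::vector<AudioEvent>& pending, std::uint64_t devicePosition) {
    std::vector<AudioEvent> forward;
    std::vector<AudioEvent> keep;
    for (const AudioEvent& e : pending) {
        (e.audioOffset > devicePosition ? forward : keep).push_back(e);
    }
    pending.swap(keep);
    return forward;
}
```

For example, with pending events at offsets 100, 200, and 300 and a device position of 200, the helper returns only the event at offset 300 and leaves the other two in the pending list.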