Speech Telephony Application Guide SAPI 5.4


This paper describes how to add speech capabilities to telephony applications.


Application developers can use SAPI 5.x to speech-enable Microsoft TAPI applications. This includes processing speech either through telecommunication devices (such as a modem) or across a network. This paper provides two examples: one uses devices such as a standard voice modem, and the other uses an Internet connection. These samples assume that you are familiar with TAPI programming and possibly already have a telephony application that you wish to speech-enable.

This paper primarily covers the following topics:

The section Adding SAPI automation to a telephony application discusses how to set up SAPI audio input and output to telecommunication devices, and then demonstrates a method to process the recognition results. It is intended to help TAPI developers add SAPI automation to telephony applications that use a standard voice modem or other telephony hardware. However, the samples in the Recognition result storage and retrieval section can be used over the Internet as well.

The next section, Custom real time audio stream, describes how to build a custom real-time audio stream and connect SAPI audio input and output to it. The stream object enables an application to carry voice communication using SAPI on either a telephony device or the network.

Adding SAPI automation to a telephony application

The following sections demonstrate adding SAPI to a telephony application. The procedures and code samples are suggested methods only; other methods may be used to fit your specific needs.

The following methods or procedures may be used to incorporate the SAPI run time, including SAPI text-to-speech (TTS) and speech recognition (SR). SimpleTelephony, a simple speech-enabled telephony application in C++ that uses device connections, is available with the SAPI SDK. See the SAPI SDK for additional details. The examples in this section use Visual Basic 6.0 or later. To use the sample code in Visual Basic, SAPI 5.1 or later must also be installed on your system.

Set up SAPI audio input/output

To transmit voice data over telecommunications devices using SAPI, it is very important to set up the audio input and output to the specific audio device correctly. The following are examples of how to set the audio output and input in C++ and Visual Basic, respectively.

To set the audio output object for TTS in C/C++, use the following steps:

  1. Create an ISpMMSysAudio audio object.
  2. Retrieve the wave/out device identifier and assign it to the audio object by calling ISpMMSysAudio::SetDeviceId().
  3. Find a wave format that your audio device supports and assign it to the audio object using ISpMMSysAudio::SetFormat().
  4. Call ISpVoice::SetOutput() to inform the TTS engine of the audio object.

Note that the second parameter of ISpVoice::SetOutput() is set to FALSE. This means that the ISpVoice object will use the SAPI format converter to translate between the data rendered by TTS engines and the format of the output audio data. This prevents the audio format from being changed on the output device.

To set the audio input object for SR in C/C++, use the following steps:

  1. Create an ISpMMSysAudio audio object.
  2. Get the wave/in device identifier and assign it to the audio object by calling ISpMMSysAudio::SetDeviceId().
  3. Find a wave format that your audio device supports and assign it to the audio object using ISpMMSysAudio::SetFormat().
  4. Call ISpRecognizer::SetInput() to inform the SR engine of the audio object. Set the second parameter of SetInput() to FALSE to prevent the audio format on the input device from changing.

For additional information about setting up audio input and output in C++, consult the SimpleTelephony sample application in the SAPI SDK.

Similarly, you can follow the above procedures to set the audio input and output in Visual Basic. However, it is difficult to obtain the device identifier using TAPI ITLegacyCallMediaControl.GetID in Visual Basic, so you may write supporting code in C++ to help the Visual Basic application obtain it. In the following example, a TAPI helper interface, ITAPIHelper, is created in C/C++ to retrieve the device identifier and the wave format supported by the audio device for Visual Basic applications.

Snippet 1: ITAPIHelper Interface

Assume that the following APIs are exposed by the ITAPIHelper interface:

interface ITAPIHelper : IDispatch
{
    [id(1), helpstring("method GetDeviceIDWaveOut")]
    HRESULT GetDeviceIDWaveOut([in] IUnknown *pBasicCallCtl,
                               [out, retval] long *pDeviceId);
    [id(2), helpstring("method GetDeviceIDWaveIn")]
    HRESULT GetDeviceIDWaveIn([in] IUnknown *pBasicCallCtl,
                              [out, retval] long *pDeviceId);
    [id(3), helpstring("method FindSupportedWaveOutFormat")]
    HRESULT FindSupportedWaveOutFormat([in] long DeviceId,
                                       [out, retval] SpeechAudioFormatType *pFormat);
    [id(4), helpstring("method FindSupportedWaveInFormat")]
    HRESULT FindSupportedWaveInFormat([in] long DeviceId,
                                      [out, retval] SpeechAudioFormatType *pFormat);
};

GetDeviceIDWaveOut() and GetDeviceIDWaveIn() take pBasicCallCtl, which points to an ITBasicCallControl object created by the Visual Basic application. From the ITBasicCallControl object, you can query the ITLegacyCallMediaControl interface and then call ITLegacyCallMediaControl::GetID() to obtain the device identifier. The following sample code demonstrates the implementation of the GetDeviceIDWaveOut() method.

STDMETHODIMP CTAPIHelper::GetDeviceIDWaveOut(IUnknown *pBasicCallCtl,
                                             long *pDeviceId)
{
    if ( !pBasicCallCtl || !pDeviceId )
        return E_POINTER;

    HRESULT hr = S_OK;
    CComPtr<ITBasicCallControl> cpBasicCallCtl;

    hr = pBasicCallCtl->QueryInterface( IID_ITBasicCallControl,
                                        (void**)&cpBasicCallCtl );

    // Get the ITLegacyCallMediaControl interface so that we can
    // get a device ID to reroute the audio
    CComPtr<ITLegacyCallMediaControl> cpLegacyCallMediaControl;
    if ( SUCCEEDED(hr) )
    {
        hr = cpBasicCallCtl->QueryInterface( IID_ITLegacyCallMediaControl,
                                             (void**)&cpLegacyCallMediaControl );
    }

    // Get the device ID through the ITLegacyCallMediaControl interface
    UINT *puDeviceID = NULL;
    BSTR bstrWavOut = ::SysAllocString( L"wave/out" );
    if ( !bstrWavOut )
        return E_OUTOFMEMORY;

    DWORD dwSize = sizeof( puDeviceID );
    if ( SUCCEEDED(hr) )
    {
        hr = cpLegacyCallMediaControl->GetID( bstrWavOut, &dwSize,
                                              (BYTE**)&puDeviceID );
    }

    // Only dereference the returned buffer when GetID() succeeded
    if ( SUCCEEDED(hr) && puDeviceID )
        *pDeviceId = *puDeviceID;

    // Clean up (the CComPtr members release themselves)
    ::SysFreeString( bstrWavOut );
    ::CoTaskMemFree( puDeviceID );

    return hr;
}

To implement GetDeviceIDWaveIn(), you only need to change wave/out to wave/in in SysAllocString().

The following sample demonstrates an implementation of the FindSupportedWaveOutFormat() method. The sample loops through all of the SAPI audio formats and queries the wave/out device as to whether it supports the given format. The method returns as soon as it finds one. Similarly, you can change waveOutOpen to waveInOpen for the implementation of the FindSupportedWaveInFormat() method.

STDMETHODIMP CTAPIHelper::FindSupportedWaveOutFormat(long DeviceId,
                                                     SpeechAudioFormatType *pFormat)
{
    if ( !pFormat )
        return E_POINTER;

    HRESULT hr = S_OK;
    GUID guidWave = SPDFID_WaveFormatEx;
    SPSTREAMFORMAT enumFmtId = SPSF_NoAssignedFormat;
    WAVEFORMATEX *pWaveFormatEx = NULL;
    MMRESULT mmr = MMSYSERR_ALLOCATED;

    // Loop through all of the SAPI audio formats and query the wave/out device
    // about whether it supports each one. We take the first one that we find.
    for ( DWORD dw = 0;
          ( MMSYSERR_NOERROR != mmr ) && ( dw < SPSF_NUM_FORMATS ); dw++ )
    {
        if ( pWaveFormatEx )
        {
            // The audio device does not support the previous format.
            // Free up the WAVEFORMATEX pointer
            ::CoTaskMemFree( pWaveFormatEx );
            pWaveFormatEx = NULL;
        }

        // Get the next format from SAPI and convert it into a WAVEFORMATEX
        enumFmtId = (SPSTREAMFORMAT)( SPSF_8kHz8BitMono + dw );
        HRESULT hrConvert = SpConvertStreamFormatEnum(
            enumFmtId, &guidWave, &pWaveFormatEx );

        if ( SUCCEEDED( hrConvert ) )
        {
            // This call to waveOutOpen() does not actually open the device;
            // it just queries the device whether it supports the given format
            mmr = ::waveOutOpen( NULL, DeviceId, pWaveFormatEx, 0, 0,
                                 WAVE_FORMAT_QUERY );
        }
    }

    // If we made it all the way through the loop without a successful
    // query, we found no supported formats
    if ( MMSYSERR_NOERROR != mmr )
        hr = E_FAIL;

    if ( SUCCEEDED( hr ) )
        *pFormat = (SpeechAudioFormatType)enumFmtId;

    if ( pWaveFormatEx )
        ::CoTaskMemFree( pWaveFormatEx );

    return hr;
}
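
The "take the first supported format" search used by FindSupportedWaveOutFormat() can be sketched in portable C++ without any SAPI or multimedia headers. This is only an illustration under stated assumptions: AudioFormat, FindFirstSupportedFormat and the deviceSupports callback are hypothetical names, and the callback stands in for the query-only waveOutOpen() call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a wave format description (not a SAPI type).
struct AudioFormat {
    int sampleRateHz;
    int bitsPerSample;
    int channels;
};

// Returns the index of the first candidate the device accepts, or -1 if the
// device rejects every candidate. `deviceSupports` plays the role of the
// query-only waveOutOpen() call in the sample above.
int FindFirstSupportedFormat(
    const std::vector<AudioFormat>& candidates,
    const std::function<bool(const AudioFormat&)>& deviceSupports)
{
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (deviceSupports(candidates[i])) {
            return static_cast<int>(i);  // stop at the first match
        }
    }
    return -1;  // no supported format found
}
```

Reporting "not found" with a separate result (here -1) rather than by inspecting the last format tried keeps the failure check independent of how the candidates are numbered.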

Snippet 2: Use of ITAPIHelper Object in Visual Basic

The following code snippet illustrates setting up audio input and output to the devices in Visual Basic using the ITAPIHelper object.

Set up the audio output for TTS:

Dim MMSysAudioOut As ISpeechMMSysAudio
Dim TapiHelper As TAPIHelper

'Create an SpMMAudioOut object
Set MMSysAudioOut = New SpMMAudioOut

'Create helper object
Set TapiHelper = New TAPIHelper

'Get the device identifier and set it to audio out
MMSysAudioOut.DeviceId = TapiHelper.GetDeviceIDWaveOut(??)

'Find the supported wav format and set it to audio out object
MMSysAudioOut.Format.Type = TapiHelper.FindSupportedWaveOutFormat(??)

'Prevent format changes
VoiceObj.AllowAudioOutputFormatChangesOnNextSet = False

'Set the object as the audio output
Set VoiceObj.AudioOutputStream = MMSysAudioOut

Set up the audio input for SR:

Dim MMSysAudioIn As ISpeechMMSysAudio

'Create an SpMMAudioIn object
Set MMSysAudioIn = New SpMMAudioIn

'Get the device identifier and assign it to audio in object
MMSysAudioIn.DeviceId = TapiHelper.GetDeviceIDWaveIn(??)

'Find the supported wave in format and set it to audio in object
MMSysAudioIn.Format.Type = TapiHelper.FindSupportedWaveInFormat(??)

'Prevents format changes
RecognizerObj.AllowAudioInputFormatChangesOnNextSet = False

'Set the object as the audio input
Set RecognizerObj.AudioInputStream = MMSysAudioIn

'Release the helper object
Set TapiHelper = Nothing

Recognition result storage and retrieval

Once the audio input and output are set up, you can use SAPI TTS and SR functions to play or transcribe audio through the audio devices. This section details the processing of recognition results after a call has been connected. The sample code in this section is general purpose, so it can also be used for other connections such as the Internet.

To demonstrate SAPI playback and transcription in a telephony application, the following code examples use a simple speech-enabled voice mail system. The caller chooses from a menu of two options: leave a message or check messages. To support these functions, the application needs to store the recognition results when a message is left and retrieve them when messages are checked. The following code snippets illustrate the initialization of SAPI, the use of ISpeechMemoryStream, and the storage and retrieval of recognition results in Visual Basic.

Declaration of variables

The following are declared as global variables and used in the speech-related APIs in the example.

Dim WithEvents VoiceObj As SpVoice				'TTS Voice
Dim RecognizerObj As SpInprocRecognizer			'SR recognizer
Dim WithEvents RecoContextObj As SpInProcRecoContext	'Recognition context
Dim DictationGrammarObj As ISpeechRecoGrammar		'Dictation grammar
Dim gMemStream As SpMemoryStream			'Memory stream

Dim StreamLength(100) As Long		'Array of the lengths of each recognition result
Dim NumOfResults As Long		'Number of stored recognition results

'Grammar IDs (the enum name GrammarIds is illustrative)
Enum GrammarIds
    GID_DICTATION = 1   		'ID for the dictation grammar
    GID_CC = 2          		'ID for the command and control (C&C) grammar
End Enum

SAPI initialization

In the following sample, the TTS voice object is obtained from RecoContextObj instead of being created as a separate voice object. This allows the application to play back the retained audio later using ISpeechRecoResult.SpeakAudio. Code snippets 3 through 6 demonstrate initializing and using the SAPI objects. For simplicity, only dictation is used, although realistically you may use a command and control (C&C) grammar for better recognition of the menu choices and a dictation grammar for transcribing messages.

Snippet 3: Initialization

'Create a recognizer object
Set RecognizerObj = New SpInprocRecognizer

'Create a RecoContext object
Set RecoContextObj = RecognizerObj.CreateRecoContext

'Get the voice object from the RecoContext object
Set VoiceObj = RecoContextObj.Voice

'By default, all SR events except the audio level event are set as events of interest.
'The sample application assumes that only the following five events are of interest
'and ignores the others: Recognition, SoundStart, SoundEnd, StreamStart, StreamEnd.
RecoContextObj.EventInterests = SRERecognition + SRESoundStart + SRESoundEnd + _
                                SREStreamStart + SREStreamEnd

'Retain the audio data in recognition result
RecoContextObj.RetainedAudio = SRAORetainAudio

'Create the dictation grammar
Set DictationGrammarObj = RecoContextObj.CreateGrammar(GID_DICTATION)

'Load dictation grammar
DictationGrammarObj.DictationLoad vbNullString, SLOStatic

Answer the call

After the application receives a call notification, it will perform the following to handle the call:

  1. Set the audio input and output to the right audio device (refer to the Set up SAPI audio input/output section) or streams.
  2. Answer the call using ITBasicCallControl.Answer.
  3. Prompt the caller using ISpeechVoice.Speak. For example, "Welcome! Please select from the following two options: Leave a message or Check your messages."
  4. Activate the recognition.
  5. Process recognition results.
  6. Disconnect the call.

Handling the call is now straightforward except for processing the recognition results. Snippets 4 through 6 show in detail how to use ISpeechMemoryStream to store the recognition results and extract them later.

Snippet 4: Leave a message

After the caller selects the Leave a message option, the application creates an ISpeechMemoryStream object. The stream will be used to save the recognition results in the recognition event handler.

'Reset the number of recognition results
NumOfResults = 0

'Clean up the stream and create a new one
Set gMemStream = New SpMemoryStream

'Activate the recognition
DictationGrammarObj.DictationSetState SGDSActive

'Wait for a maximum of 30 seconds to allow the caller to leave a message
Dim Start
Start = Timer   ' Get start time.
Do While (Timer < Start + 30)
    DoEvents     ' Yield to other processes.
Loop

'Deactivate the recognition
DictationGrammarObj.DictationSetState SGDSInactive

Snippet 5: Handle the recognition event

ISpeechRecoResult.SaveToMemory and ISpeechMemoryStream.Write are used in the following example to save the entire current recognition result to the memory stream. At the same time, the application increments the number of recognition results saved in the memory stream and records the length of each recognition result in bytes. These two variables are used when retrieving the recognition information from the memory stream.

Private Sub RecoContextObj_Recognition(??, ByVal Result As SpeechLib.ISpeechRecoResult)

    Select Case Result.PhraseInfo.GrammarId

        Case GID_DICTATION
            Dim SerializeResult As Variant

            'Save the entire recognition result to memory
            SerializeResult = Result.SaveToMemory

            'Write the result to the memory stream and store the length of the
            'current recognition result in the array
            StreamLength(NumOfResults) = gMemStream.Write(SerializeResult)

            'Record the number of recognition results saved in the memory stream
            NumOfResults = NumOfResults + 1

            'Additional speech processing code here

    End Select

End Sub

Snippet 6: Check the message

The following code snippet may be used when the caller chooses the Check the message option. NumOfResults stores the number of recognition results in the memory stream. If this variable is zero, there is no message. Otherwise, the sample uses ISpeechMemoryStream.Read and ISpeechRecoContext.CreateResultFromMemory to restore each recognition result from the memory stream, and calls ISpeechRecoResult.SpeakAudio to play the message in the original voice.

If (NumOfResults <> 0) Then
    Dim resultGet As Variant, length As Long
    Dim RecoResultGet As ISpeechRecoResult

    'Set the pointer to the beginning of the stream
    gMemStream.Seek 0, SSSPTRelativeToStart

    'Speak using TTS voice
    VoiceObj.Speak "Your message is paused."

    Dim i As Integer
    For i = 0 To NumOfResults - 1

        'Extract data in bytes from the stream
        length = gMemStream.Read(resultGet, StreamLength(i))

        'Restore the recognition result
        Set RecoResultGet = RecoContextObj.CreateResultFromMemory(resultGet)

        'Speak the audio
        RecoResultGet.SpeakAudio

        'Release the result object
        Set RecoResultGet = Nothing

    Next i

    'Speak using TTS voice
    VoiceObj.Speak "End of your messages"
Else
    'Speak using TTS voice
    VoiceObj.Speak "You have no messages, good bye!"

End If
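
The bookkeeping behind Snippets 5 and 6 (append each serialized result to one stream, remember its length, then seek to the start and read the results back one length at a time) can be sketched in portable C++. This is a minimal illustration, not SAPI code: MemoryStream and MessageStore are hypothetical classes, and a std::string stands in for the serialized recognition result.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory stream: append-only writes, sequential reads.
class MemoryStream {
public:
    // Appends the bytes and returns how many were written.
    std::size_t Write(const std::string& bytes) {
        data_.append(bytes);
        return bytes.size();
    }
    // Moves the read position back to the start of the stream.
    void SeekToStart() { readPos_ = 0; }
    // Reads up to `count` bytes from the current read position.
    std::string Read(std::size_t count) {
        std::string out = data_.substr(readPos_, count);
        readPos_ += out.size();
        return out;
    }
private:
    std::string data_;
    std::size_t readPos_ = 0;
};

// Stores variable-length records back to back and remembers each length,
// mirroring the roles of StreamLength and NumOfResults in the snippets.
class MessageStore {
public:
    void Save(const std::string& serializedResult) {
        lengths_.push_back(stream_.Write(serializedResult));
    }
    std::vector<std::string> LoadAll() {
        std::vector<std::string> results;
        stream_.SeekToStart();
        for (std::size_t len : lengths_) {
            results.push_back(stream_.Read(len));
        }
        return results;
    }
    std::size_t Count() const { return lengths_.size(); }
private:
    MemoryStream stream_;
    std::vector<std::size_t> lengths_;
};
```

The per-record lengths are what make retrieval possible, because the stream itself carries no record boundaries. A growable container is used here instead of a fixed-size array, so the sketch has no upper limit on the number of stored results.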

Custom real time audio stream

Under some circumstances, you might want a custom real-time audio stream to read the audio data from one entity and write it to another. You can use this stream object in two ways: for calls over the phone system, or for calls over the Internet (VoIP) from or to another computer. In addition, you can play the SAPI TTS voice and transcribe messages using SAPI SR functionality over the network. The code snippets in this example require Windows 2000 or later, because the example needs TAPI 3.x, which is installed with Windows 2000.

The following example builds a custom real-time audio stream to send and receive audio data between SAPI and TAPI objects using media streaming terminals (MSTs) and other media controls provided by TAPI Media Service Providers (MSPs). Assume that STCustomStream is the name of the custom audio stream component, which contains two interfaces: ITTSStream and IASRStream. ITTSStream handles data exchange between the TTS object and the media stream, while IASRStream handles data transfer between the SR object and the media stream. The media stream interfaces used in ITTSStream and IASRStream are queried from the TAPI media streaming terminals, which are created by the TAPI application. Using these two media streaming terminals with the aid of other media streaming interfaces, a TAPI application should be able to capture the audio data from the SAPI TTS engine and send it to the remote caller, or render the audio data arriving from the remote end to the SAPI SR engine over the network.

Code samples are provided for the following topics:

  • TTS custom stream
  • SR custom stream

TTS custom stream

The TTS custom stream (the aforementioned ITTSStream) is used to capture the audio data from a TTS engine and inject the live audio data into a TAPI media stream. In this example, ITTSStream inherits from ISpStreamFormat. This allows SAPI to eventually call ITTSStream::Write() to feed the live audio data to the ITTSStream object. The object then simply uses the media streaming terminal to send the audio data to the remote end. The following sample code snippets illustrate ITTSStream and its uses.

Snippet 7: ITTSStream idl

interface ITTSStream : IDispatch
{
    [id(1), helpstring("method InitTTSStream")]
    HRESULT InitTTSStream(IUnknown *pCaptureTerminal);
};

ITTSStream::InitTTSStream() initializes the IMediaStream object by querying it from a capture terminal, pCaptureTerminal, which points to an ITTerminal object. The method also obtains the audio wave format using ITAMMediaFormat::get_MediaFormat() and stores the format for later use.

Snippet 8: Use of the ITTSStream in Visual Basic

When the call is connected, the TAPI application creates an MST for capture. The word "capture" is used in the DirectShow sense, and indicates the fact that MST captures an application's data to be introduced into the TAPI data stream.

Dim objTTSTerminal As ITTerminal
Dim MediaStreamTerminalClsid As String

MediaStreamTerminalClsid = "{E2F7AEF7-4971-11D1-A671-006097C9A2E8}"

'Create a capture terminal
Set objTTSTerminal = objTerminalSupport.CreateTerminal( _
            MediaStreamTerminalClsid, TAPIMEDIATYPE_AUDIO, TD_CAPTURE)

'Process here for selecting terminals, answering calls, etc.

'Set the output for SAPI TTS
Dim CustomStream As New  SpCustomStream
Dim MyTTSStream As New TTSStream

'Initialize the TTSStream object
MyTTSStream.InitTTSStream objTTSTerminal

'Set MyTTSStream as a BaseStream for the SAPI ISpeechCustomStream
Set CustomStream.BaseStream = MyTTSStream

'Prevent the format change
gObjVoice.AllowAudioOutputFormatChangesOnNextSet = False

'Set the CustomStream as an audio output
Set gObjVoice.AudioOutputStream = CustomStream

Set MyTTSStream = Nothing
Set CustomStream = Nothing

After your application receives the media event, CME_STREAM_ACTIVE, you can call Speak. For instance:

gObjVoice.Speak "Welcome!"

Snippet 9: Implementation for ITTSStream methods

Listed below are the methods that must be implemented in the TTS custom stream. Currently in SAPI 5.1, other methods, such as CopyTo() and Commit(), may return E_NOTIMPL if the application does not define them.

//  IStream interface

	STDMETHODIMP Write(const void * pv, ULONG cb, ULONG * pcbWritten);
	STDMETHODIMP Seek(LARGE_INTEGER dlibMove, DWORD dwOrigin, ULARGE_INTEGER * plibNewPosition);

//  ISpStreamFormat interface

	STDMETHODIMP GetFormat(GUID * pFormatId, WAVEFORMATEX ** ppCoMemWaveFormatEx);

SAPI calls ITTSStream::Write() on behalf of the TTS engine whenever audio data is ready. This method copies the data from the input buffer, void *pv, to an IStreamSample buffer and then submits it to the MST. SAPI calls Seek() to query the position of the seek pointer in the stream. SAPI calls GetFormat() to obtain the current stream format.

STDMETHODIMP CTTSStream::Write(const void * pv, ULONG cb, ULONG * pcbWritten)
{
    HRESULT hr = S_OK;
    ULONG ulPos = 0;
    const BYTE *pbData = (const BYTE *)pv;

    // Keep reading samples from void *pv and sending them on
    // until all cb bytes are submitted or a call fails.
    while ( SUCCEEDED( hr ) && ulPos < cb )
    {
        // Allocate a sample on the terminal's media stream. m_cpTTSMediaStream
        // is a data member variable, defined as CComPtr<IMediaStream>.
        CComPtr<IStreamSample> cpStreamSample;
        hr = m_cpTTSMediaStream->AllocateSample( 0, &cpStreamSample );
        if ( FAILED( hr ) )
            break;

        // Get the IMemoryData interface from the sample
        CComPtr<IMemoryData> cpSampleMemoryData;
        hr = cpStreamSample->QueryInterface( IID_IMemoryData,
                                             (void**)&cpSampleMemoryData );
        if ( FAILED( hr ) )
            break;

        // Get the sample buffer information
        ULONG nBufferSize = 0;
        BYTE *pBuffer = NULL;
        hr = cpSampleMemoryData->GetInfo( &nBufferSize, &pBuffer, NULL );
        if ( FAILED( hr ) )
            break;

        // Copy the audio data to the buffer provided by the sample
        nBufferSize = min( nBufferSize, cb - ulPos );
        memcpy( pBuffer, pbData + ulPos, nBufferSize );

        // Tell the sample how many useful bytes are available in the sample buffer
        hr = cpSampleMemoryData->SetActual( nBufferSize );
        if ( FAILED( hr ) )
            break;

        // Tell the MST that the sample is ready for processing
        hr = cpStreamSample->Update( 0, NULL, NULL, 0 );
        if ( SUCCEEDED( hr ) )
            ulPos += nBufferSize;
    }

    // Report how many bytes were actually submitted
    if ( pcbWritten )
        *pcbWritten = ulPos;

    return hr;
}

STDMETHODIMP CTTSStream::Seek(LARGE_INTEGER dlibMove,
                              DWORD dwOrigin,
                              ULARGE_INTEGER *plibNewPosition)
{
    // We only accept queries for the current stream position
    if ( STREAM_SEEK_CUR != dwOrigin || dlibMove.QuadPart )
        return E_INVALIDARG;

    // Validate the OUT parameter
    if ( SPIsBadWritePtr( plibNewPosition, sizeof(ULARGE_INTEGER) ) )
        return E_POINTER;

    plibNewPosition->QuadPart = (LONG)dlibMove.LowPart;

    return S_OK;
}

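STDMETHODIMP CTTSStream::GetFormat(GUID * pFmtId,
                                   WAVEFORMATEX ** ppCoMemWaveFormatEx)
{
    HRESULT hr = S_OK;

    // Guard the stored format while it is copied to the caller
    m_hCritSec.Lock();
    hr = m_StreamFormat.ParamValidateCopyTo( pFmtId, ppCoMemWaveFormatEx );
    m_hCritSec.Unlock();

    return hr;
}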


  • SPIsBadWritePtr() and SPIsBadReadPtr(), used in the above example, are parameter-checking helper functions. They are defined in Spddkhlp.h in the SAPI SDK.
  • The m_StreamFormat variable is declared as CSpStreamFormat, which is defined in Sphelper.h.
  • The m_hCritSec variable is defined as CComAutoCriticalSection.
  • IStreamSample::Update() in the above Write() method performs a synchronous update of a sample. If you would like to update the samples asynchronously, you need to define a mechanism to keep track of all of the samples that have been submitted, in order to ensure that they are completely processed by the MST.
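
The chunking logic at the heart of the Write() method above (split the caller's buffer into pieces no larger than the sample buffer and submit each piece in order) can be sketched in portable C++. This is a sketch under stated assumptions: SampleSink and WriteInChunks are hypothetical names, and the sink stands in for the media streaming terminal with a fixed sample capacity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sink with fixed-capacity samples, standing in for the MST.
class SampleSink {
public:
    explicit SampleSink(std::size_t sampleCapacity) : capacity_(sampleCapacity) {}
    std::size_t Capacity() const { return capacity_; }
    // Accepts one filled sample; a real terminal would stream it out.
    void Submit(const std::vector<unsigned char>& sample) { samples_.push_back(sample); }
    const std::vector<std::vector<unsigned char> >& Samples() const { return samples_; }
private:
    std::size_t capacity_;
    std::vector<std::vector<unsigned char> > samples_;
};

// Splits `cb` bytes at `pv` into samples no larger than the sink's capacity
// and returns the number of bytes submitted.
std::size_t WriteInChunks(SampleSink& sink, const void* pv, std::size_t cb)
{
    if (sink.Capacity() == 0) {
        return 0;  // avoid looping forever on a sink that cannot hold data
    }
    const unsigned char* src = static_cast<const unsigned char*>(pv);
    std::size_t pos = 0;
    while (pos < cb) {
        const std::size_t n = std::min(sink.Capacity(), cb - pos);
        std::vector<unsigned char> sample(n);
        std::memcpy(sample.data(), src + pos, n);
        sink.Submit(sample);
        pos += n;
    }
    return pos;
}
```

Returning the byte count plays the same role as the pcbWritten output parameter of IStream::Write(): the caller learns how much of its buffer was actually consumed.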

SR custom stream

The SR custom stream (the above-mentioned IASRStream) is used for rendering the audio data from a TAPI media stream to the SAPI SR object using the MST. In this example, IASRStream inherits from ISpStreamFormat. This allows SAPI to eventually call IASRStream::Read() to retrieve the live audio data from the media stream. The following sample code snippets illustrate IASRStream and its uses.

Snippet 10: IASRStream idl

interface IASRStream : IDispatch
{
    [id(1), helpstring("method InitSRStream")]
    HRESULT InitSRStream(IUnknown *pRenderTerminal);
    [id(2), helpstring("method StopRenderStream")]
    HRESULT StopRenderStream();
};

IASRStream::InitSRStream() initializes the IMediaStream object by querying it from pRenderTerminal, which points to an ITTerminal object. The method also obtains the audio wave format using ITAMMediaFormat::get_MediaFormat() and stores the format for later use. TAPI applications use IASRStream::StopRenderStream() to notify the IASRStream object to stop providing audio data to SAPI during the read operation.

Snippet 11: Use of the IASRStream in Visual Basic

Dim objSRTerminal As ITTerminal
Dim MediaStreamTerminalClsid As String
MediaStreamTerminalClsid = "{E2F7AEF7-4971-11D1-A671-006097C9A2E8}"

'Create a render terminal
Set objSRTerminal = objTerminalSupport.CreateTerminal( _
            MediaStreamTerminalClsid, TAPIMEDIATYPE_AUDIO, TD_RENDER)

'Process here for selecting terminals, answering calls, etc.

'Set input for SAPI SR
Dim CustomStream As New SpCustomStream
Dim MySRStream As New ASRStream

'Initialize the ASRStream object
MySRStream.InitSRStream objSRTerminal

'Set MySRStream as the BaseStream for the SAPI ISpeechCustomStream
 Set CustomStream.BaseStream = MySRStream

'Prevent the format change
gObjRecognizer.AllowAudioInputFormatChangesOnNextSet = False

'Set the CustomStream as an audio input
Set gObjRecognizer.AudioInputStream = CustomStream

Set CustomStream = Nothing

'Assume the RecoDictationGrammar, ISpeechRecoGrammar, is valid
RecoDictationGrammar.DictationSetState SGDSActive

'Wait here for a few seconds for recognition events

'Deactivate the dictation
RecoDictationGrammar.DictationSetState SGDSInactive

'Tell the ASRStream object to stop providing any audio data to SAPI in IASRStream::Read()
MySRStream.StopRenderStream

'Release ASRStream object
Set MySRStream = Nothing

Snippet 12: Implementation for IASRStream methods

The following snippet lists the methods that must be implemented in the SR custom stream. Currently in SAPI, other methods such as Commit() and CopyTo() may simply return E_NOTIMPL.

//  IStream interface

	STDMETHODIMP Read(void * pv, ULONG cb, ULONG * pcbRead);
	STDMETHODIMP Seek(LARGE_INTEGER dlibMove, DWORD dwOrigin, ULARGE_INTEGER * plibNewPosition);

//  ISpStreamFormat interface

	STDMETHODIMP GetFormat(GUID * pFormatId, WAVEFORMATEX ** ppCoMemWaveFormatEx);

SR engines call IASRStream::Read() through SAPI. This method retrieves the audio data from a TAPI media stream and copies it to the buffer pointed to by void *pv.

The following is sample code for the Read() method. For the implementation of Seek() and GetFormat(), refer to the TTS custom stream section.

STDMETHODIMP CASRStream::Read(void * pv, ULONG cb, ULONG *pcbRead)
{
    HRESULT hr = S_OK;
    BYTE *pbData = (BYTE *)pv;

    // Report zero bytes unless data is copied below
    if ( pcbRead )
        *pcbRead = 0;

    if ( m_bPurgeFlag )
    {
        // Add code here for cleanup, such as releasing events, samples, etc.
        return hr;
    }

    // Allocate the buffer
    if ( m_pnDataBuffer == NULL )
    {
        m_ulBufferSize = cb;
        m_pnDataBuffer = new BYTE[ m_ulBufferSize ];
    }
    else if ( m_ulBufferSize != cb )
    {
        // cb might be different from that in the previous Read()
        delete [] m_pnDataBuffer;
        m_ulBufferSize = cb;
        m_pnDataBuffer = new BYTE[ m_ulBufferSize ];
    }

    // Retrieve cb bytes of audio data from the TAPI media stream
    hr = RenderAudioStream();

    if ( SUCCEEDED( hr ) )
    {
        // Copy into the caller's buffer (pbData), not the pointer variable itself
        memcpy( pbData, m_pnDataBuffer, m_ulActualRead );
        if ( pcbRead )
            *pcbRead = m_ulActualRead;
    }

    return hr;
}

The RenderAudioStream() function extracts the audio data from the media streaming terminal into the buffer m_pnDataBuffer. The function first reads the terminal's allocator properties and then performs the following:

  • Obtains the number of samples.
  • Calls IMediaStream::AllocateSample() on the terminal's IMediaStream interface to allocate each stream sample, and keeps the samples in an array.
  • Creates an array of events and associates each sample with an event. An event is signaled when the corresponding sample has been filled with data by the media streaming terminal and is ready for use.
  • Calls IStreamSample::CompletionStatus() to ensure that the sample contains valid data, and then copies the data to the buffer m_pnDataBuffer.
  • Calls IStreamSample::Update() to return the sample to the terminal, so that the application is notified again when the sample is refilled with a new portion of data.

For detailed information, please refer to the MSDN Media Streaming Terminal sample application.


  • m_pnDataBuffer is a data member, pointing to BYTE. It stores the audio data received from the TAPI media stream.
  • m_ulActualRead is a data member containing the number of bytes of audio data stored in m_pnDataBuffer.
  • m_bPurgeFlag is a data member. It gets set when the application calls StopRenderStream() of the IASRStream object.

Pitfalls: Common problems encountered

The following are possible issues that developers might encounter during development:

Audio input/output devices

If your audio input or output source is not a standard Windows multimedia device, you need to create an audio object first and then call the SAPI SetInput and SetOutput methods to attach it (see the Set up SAPI audio input/output section of this paper). Your application will not work if you simply select your wave in/out device as the default audio input or output device using Speech properties in Control Panel.

Custom stream object

In your SR custom stream object, when the Read() method returns an error, SAPI deactivates the recognizer state. In the case of telephony, you must explicitly set the recognizer state to active for every connection, even though the recognizer state is set to active by default. If you do not, the Read() method might return E_ABORT or another error after the caller hangs up, and the recognizer will be turned off. This might cause problems during subsequent calls.

After your application sets either the dictation or the command and control grammar state to inactive, you may purge the stream by simply returning zero bytes in Read() to inform SAPI and the SR engine that the end of the stream has been reached. Otherwise, some SR engines might keep calling the Read() method, which might cause your application to hang.
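
The purge behavior described above can be illustrated with a small portable C++ sketch: once the application asks the stream to stop, Read() reports zero bytes, and a consumer that reads until it sees zero bytes terminates instead of spinning. The names PurgeableSource, Stop and DrainUntilEnd are hypothetical stand-ins for the SR custom stream, StopRenderStream() and the SR engine's read loop; no SAPI calls are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical audio source standing in for the SR custom stream.
class PurgeableSource {
public:
    explicit PurgeableSource(const std::vector<unsigned char>& audio) : audio_(audio) {}

    // Called by the application after it deactivates the grammar.
    void Stop() { purged_ = true; }

    // Copies up to `cb` bytes into `pv` and returns the count. Returns 0 once
    // Stop() has been called or the data runs out, which marks end of stream.
    std::size_t Read(void* pv, std::size_t cb) {
        if (purged_) {
            return 0;
        }
        const std::size_t n = std::min(cb, audio_.size() - pos_);
        if (n > 0) {
            std::memcpy(pv, audio_.data() + pos_, n);
        }
        pos_ += n;
        return n;
    }
private:
    std::vector<unsigned char> audio_;
    std::size_t pos_ = 0;
    bool purged_ = false;
};

// A consumer that, like an SR engine, keeps reading until it gets 0 bytes.
// Returns the total number of bytes received.
std::size_t DrainUntilEnd(PurgeableSource& source, std::size_t chunkSize)
{
    if (chunkSize == 0) {
        return 0;
    }
    std::vector<unsigned char> buffer(chunkSize);
    std::size_t total = 0;
    for (;;) {
        const std::size_t n = source.Read(buffer.data(), buffer.size());
        if (n == 0) {
            break;  // zero bytes: end of stream reached
        }
        total += n;
    }
    return total;
}
```

Signalling the end of the stream with a successful zero-byte read, rather than with an error code, matches the advice above: an error from Read() would deactivate the recognizer, while zero bytes simply ends the current stream.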