Persisting Recognized WAV Audio from the SR Engine (SAPI 5.3)

This document is intended to help developers of speech recognition (SR) applications use the Microsoft speech recognition and audio APIs to persist, or store, the WAV audio recognized by an SR engine. The topics covered include:

  • Typical file input scenario
  • Typical audio storage scenario
  • Relevant APIs for C++ and Visual Basic/Scripting developers
  • Sample recognized audio storage source code for developers (written in both C++ and Visual Basic 6.0)

Typical file input scenario

The following are typical scenarios in which an application would need to store the WAV audio recognized by the SR engine:

  • Transcription applications (e.g., convert voice mail to email)
  • Audio correction user interface (e.g., replay and/or re-recognize audio snippets)
  • SR engine testing (e.g., measure and improve engine accuracy with reproducible audio input data)

Typical audio storage scenario

Follow these basic steps to retrieve and store recognized wav audio:

  1. Create an SR engine (InProc or shared).
  2. Enable retained audio on the relevant recognition context.
  3. Set the retained audio format (specify lower quality for smaller storage size, higher quality for clearer audio). Default is the SR engine's audio format.
  4. Set up and receive recognition events for relevant recognition context.
  5. Retrieve audio stream from recognition result.
  6. Copy result's audio stream to file-bound stream.

Relevant WAV audio storage APIs for COM/C/C++ Developers:

  • SpStream object, ISpStream interface: Basic SAPI audio stream
  • ISpStream::BindToFile: Bind an audio stream to a WAV file
  • SPBindToFile: Helper function to set up a stream over a WAV file
  • ISpRecoContext::SetAudioOptions: To enable/disable retained audio
  • ISpRecoResult::GetAudio: To retrieve recognized audio
  • ISpStreamFormat::GetFormat: To retrieve audio format
  • CSpStreamFormat helper object: Helper for handling audio formats
  • ISpStream::Read/Write: Methods for reading and writing stream data
  • SPEI_RECOGNITION/SPEI_FALSE_RECOGNITION: Events sent by SAPI when a recognition or false recognition has occurred

Relevant WAV audio storage APIs for Automation/Visual Basic/Scripting Developers:

  • SpFileStream object: Basic file-based SAPI audio stream
  • SpMemoryStream object: Basic memory-based SAPI audio stream
  • ISpeechRecoContext::RetainedAudio property: To enable/disable retained audio
  • ISpeechBaseStream::Read/Write: Methods for reading and writing stream data
  • ISpeechBaseStream::Format property: To retrieve audio format
  • SpFileStream::Open/Close: Methods for opening and closing a file-based stream
  • ISpeechRecoContext::Recognition/FalseRecognition: Events sent by SAPI when a recognition or false recognition has occurred

Sample recognized audio storage source code

Note: Error handling is omitted for brevity.

COM/C++ Developers (C-style is very similar)

   HRESULT hr = S_OK;
   CComPtr<ISpRecoContext> cpRecoContext;
   CComPtr<ISpRecoGrammar> cpRecoGrammar;
   CComPtr<ISpRecoResult> cpRecoResult;
   CComPtr<ISpStreamFormat> cpStreamFormat;
   CSpEvent spEvent;

   // format for storing the audio
   const SPSTREAMFORMAT spFormat = SPSF_22kHz8BitMono;
   CSpStreamFormat Fmt(spFormat, &hr);
   // Check hr

   // Create shared recognition context for receiving events
   hr = cpRecoContext.CoCreateInstance(CLSID_SpSharedRecoContext);
   // Check hr

   // Create a grammar
   hr = cpRecoContext->CreateGrammar(NULL, &cpRecoGrammar);
   // Check hr

   // Load dictation
   hr = cpRecoGrammar->LoadDictation(NULL, SPLO_STATIC);
   // Check hr

   // Enable audio retention in the SAPI runtime, and set the retained audio format
   hr = cpRecoContext->SetAudioOptions(SPAO_RETAIN_AUDIO, &Fmt.FormatId(), Fmt.WaveFormatExPtr());
   // Check hr

   // Activate dictation
   hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
   // Check hr

   // Wait 15 seconds for an event to occur (specifically, the default event, recognition)
   hr = cpRecoContext->WaitForNotifyEvent(15000);
   if (S_OK == hr)
   {
      // Retrieve the event from the recognition context
      hr = spEvent.GetFrom(cpRecoContext);
      // Verify that the event is a recognition event
      if (S_OK == hr && SPEI_RECOGNITION == spEvent.eEventId)
      {
         // Store the recognition result pointer
         cpRecoResult = spEvent.RecoResult();
         // Release the recognition result pointer held by the event object
         spEvent.Clear();
      }
   }

   // Deactivate dictation (only processing one recognition in sample code)
   hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
   // Check hr
   // Unload dictation
   hr = cpRecoGrammar->UnloadDictation();
   // Check hr

   // If a recognition was received and the result stored, then store the audio
   if (cpRecoResult)
   {
      // Get a stream pointer to the recognized audio.
      // Note: specifying zero for the start element and element count defaults to the
      // entire recognized audio stream. Correction UI may only need a subset of the
      // audio for playback.
      hr = cpRecoResult->GetAudio(0, 0, &cpStreamFormat);
      // Check hr

      // Basic SAPI stream for file-based storage
      CComPtr<ISpStream> cpStream;
      ULONG cbWritten = 0;

      // Create a file on the hard disk for storing the recognized audio, and specify
      // the retained audio format as the file's audio format
      hr = SPBindToFile(L"c:\\recognized_audio.wav", SPFM_CREATE_ALWAYS, &cpStream, &Fmt.FormatId(), Fmt.WaveFormatExPtr(), SPFEI_ALL_EVENTS);
      // Check hr

      // Continuously transfer data between the two streams until no more data is
      // found (i.e., end of stream).
      // Note: only transfer 1000 bytes at a time, to avoid handling large chunks of
      // data at once.
      while (TRUE)
      {
         // For logging purposes, the app can retrieve the recognized audio stream
         // length in bytes
         STATSTG stats;
         hr = cpStreamFormat->Stat(&stats, NULL);
         // Check hr

         // Create a 1000-byte buffer for transferring
         BYTE bBuffer[1000];
         ULONG cbRead;

         // Request 1000 bytes of data from the input stream
         hr = cpStreamFormat->Read(bBuffer, 1000, &cbRead);
         // If data was returned...
         if (SUCCEEDED(hr) && cbRead > 0)
         {
            // ...then transfer/write the audio to the file-based stream
            hr = cpStream->Write(bBuffer, cbRead, &cbWritten);
            // Check hr
         }

         // Since no more data is being added to the input stream, a read that returns
         // less than requested means the end of stream was reached, so break out of
         // the transfer loop
         if (cbRead < 1000)
            break;
      }

      // Explicitly close the file-based stream to flush the file data and allow the
      // app to use the file immediately
      hr = cpStream->Close();
      // Check hr
   }

Automation/Visual Basic 6.0 Developers

Scripting code is similar to Visual Basic.

Option Explicit

Dim WithEvents RecoContext As SpSharedRecoContext ' context for receiving SR events
Dim Grammar As ISpeechRecoGrammar ' grammar

' Setup/Initialization code for application startup
Private Sub Form_Load()
    ' Create new shared recognition context (inproc works similarly)
    Set RecoContext = New SpSharedRecoContext
    ' Create grammar
    Set Grammar = RecoContext.CreateGrammar
    ' Activate retained audio
    RecoContext.RetainedAudio = SRAORetainAudio
    ' Optionally, set the retained audio format to lower quality for smaller size
    ' RecoContext.RetainedAudioFormat = ???

    ' load and activate dictation
    Grammar.DictationSetState SGDSActive
End Sub

' Recognition event was received
Private Sub RecoContext_Recognition(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal RecognitionType As SpeechLib.SpeechRecognitionType, ByVal Result As SpeechLib.ISpeechRecoResult)
    ' Create new file-based stream for audio storage
    Dim FileStream As New SpFileStream
    ' Variable for accessing the recognized audio stream
    Dim AudioStream As SpMemoryStream

    ' Retrieve recognized audio from result object
    ' Note: application can also retrieve smaller portions of the audio stream by specifying a starting phrase element and phrase element length
    Set AudioStream = Result.Audio

    ' Setup the file-based stream format with the same format as the audio stream format
    Set FileStream.Format = AudioStream.Format
    ' Create a file on the hard-disk for storing the recognized audio
    FileStream.Open "c:\recognized_audio.wav", SSFMCreateForWrite

    Dim Buffer As Variant ' Buffer for storing stream data
    Dim lRead As Long ' Amount of data read from the stream
    Dim lWritten As Long ' Amount of data written to the stream

    ' Continuously transfer data between the two streams until no more data is found (i.e. end of stream)
    ' Note: only transfer 1000 bytes at a time, to avoid creating large chunks of data at one time
    Do While True
        ' read 1000 bytes of stream data
        lRead = AudioStream.Read(Buffer, 1000)
        ' if data was retrieved, then transfer/write it to the file-based stream
        If (lRead > 0) Then
            lWritten = FileStream.Write(Buffer)
        End If

        ' Since the input stream will not increase in size, the number of bytes read
        ' will only be less than requested if there is no more data to be transferred
        If lRead < 1000 Then
            Exit Do ' exit if no more data
        End If
    Loop

    ' Explicitly close the file-based stream
    ' Note: The stream will be closed automatically when the object is released, but
    ' explicit closing enables the app to immediately use the file stream's data
    FileStream.Close

    ' Sample code will deactivate and unload dictation, then shutdown after one recognition
    Grammar.DictationSetState SGDSInactive
    Unload Me ' shutdown app
End Sub