Persisting Recognized WAV Audio from the SR Engine SAPI 5.4

Speech API 5.4
Microsoft Speech API 5.4

Persisting Recognized WAV Audio from the SR Engine


This document is intended to help developers of speech recognition (SR) applications use the Microsoft speech recognition and audio APIs to persist or store the wav audio recognized by a SR engine. The topics covered include:

  • Typical file input scenario
  • Typical audio storage scenario
  • Relevant APIs for C++ and Visual Basic/Scripting developers
  • Sample recognized audio storage source code for developers (written in both C++ and Visual Basic 6.0)
  • Typical file input scenario

    The following are typical scenarios that would need to store the wav audio recognized by the SR engine:

  • Transcription applications (e.g., convert voice mail to email)
  • Audio correction user interface (e.g., replay and/or re-recognize audio snippets)
  • SR engine testing (e.g., measure and improve engine accuracy with reproducible audio input data)
  • Typical audio storage scenario

    Follow these basic steps to retrieve and store recognized wav audio:

    1. Create an SR engine (InProc or shared).
    2. Enable retained audio on the relevant recognition context.
    3. Set the retained audio format (specify lower quality for smaller storage size, higher quality for clearer audio). Default is the SR engine's audio format.
    4. Set up and receive recognition events for relevant recognition context.
    5. Retrieve audio stream from recognition result.
    6. Copy result's audio stream to file-bound stream.

    Relevant wav audio file input APIs for COM/C/C++ Developers:

  • SpStream object, ISpStream interface: Basic SAPI audio stream
  • ISpStream::BindToFile: Setup audio stream for wav file input
  • SpBindToFile: Helper function to setup stream with a wav file
  • ISpRecoContext::SetAudioOptions: To enable/disable retained audio
  • ISpRecoResult::GetAudio: To retrieve recognized audio
  • ISpStreamFormat::GetFormat: To retrieve audio format
  • CSpStreamFormat helper object: Helper for handling audio formats
  • ISpStream::Read/Write: Methods for reading and writing stream data
  • SPEI_RECOGNITION/SPEI_FALSE_RECOGNITION: Events sent by SAPI when a recognition or false recognition has occurred
  • Relevant wav audio file input APIs for Automation/Visual Basic/Scripting Developers:

  • SpFileStream object: Basic file-based SAPI audio stream
  • SpMemoryStream object: Basic memory-based SAPI audio stream
  • ISpeechRecoContext::RetainedAudio property: To enable/disable retained audio
  • ISpeechBaseStream::Read/Write: Methods for reading and writing stream data
  • ISpeechBaseStream::Format property: To retrieve audio format
  • SpFileStream::Open/Close: Methods for opening and closing a file-based stream
  • ISpeechRecoContext::Recognition/FalseRecognition: Events sent by SAPI when a recognition or false recognition has occurred
  • Sample recognized audio storage source code

    Note: Error handling is omitted for brevity

    COM/C++ Developers (C-style is very similar)

       HRESULT hr = S_OK;
       CComPtr<ISpRecoContext> cpRecoContext;
       CComPtr<ISpRecoGrammar> cpRecoGrammar;
       CComPtr<ISpRecoResult> cpRecoResult;
       CComPtr<ISpStreamFormat> cpStreamFormat;
       CSpEvent spEvent;
       WAVEFORMATEX* pexFormat = NULL;
       SPAUDIOOPTIONS eAudioOptions = SPAO_NONE;
       ' format for storing the audio
       const SPSTREAMFORMAT spFormat = SPSF_22kHz8BitMono;
       CSpStreamFormat Fmt( spFormat, &hr;);
       // Check hr
       // Create shared recognition context for receiving events
       hr = cpRecoContext.CoCreateInstance(CLSID_SpSharedRecoContext);
       // Check hr
       // Create a grammar
       hr = cpRecoContext->CreateGrammar(NULL, &cpRecoGrammar;);
       // Check hr
       // Load dictation
       hr = cpRecoGrammar->LoadDictation(NULL, SPLO_STATIC);
       // Check hr
       // Enabled audio retention in the SAPI runtime, and set the retained audio format
       hr = cpRecoContext->SetAudioOptions( SPAO_RETAIN_AUDIO, &Fmt;.FormatId(), Fmt.WaveFormatExPtr());
       // Check hr
       // activate dictation
       hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
       // Check hr
       // wait 15 seconds for an event to occur (specifically, the default event, recognition)
       hr = cpRecoContext->WaitForNotifyEvent(15000);
       if (S_OK == hr)
          // retrieve the event from the recognition context
          hr = spEvent.GetFrom(cpRecoContext);
          if (S_OK == hr)
             // verify that the event is a recognition event
             if (SPEI_RECOGNITION == spEvent.eEventId)
                // store the recognition result pointer
                cpRecoResult = spEvent.RecoResult();
                // release recognition result pointer in event object
       // deactivate dictation (only processing one recognition in sample code)
       hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
       // Check hr
       // unload dictation
       hr = cpRecoGrammar->UnloadDictation();
       // Check hr
       // if recognition received, and result stored then store the audio
       if (cpRecoResult)
          // get stream pointer to recognized audio
          // Note: specifying NULL for the start element and element length defaults to the entire recognized audio stream. Correction UI may only need a subset of the audio for playback
          hr = cpRecoResult->GetAudio( 0, 0, &cpStreamFormat; );
          // Check hr
          // basic SAPI-stream for file-based storage
          CComPtr<ISpStream> cpStream;
          ULONG cbWritten = 0;
          // create file on hard-disk for storing recognized audio, and specify audio format as the retained audio format
          hr = SPBindToFile(L"c:\\recognized_audio.wav", SPFM_CREATE_ALWAYS, &cpStream;, &Fmt;.FormatId(), Fmt.WaveFormatExPtr(), SPFEI_ALL_EVENTS);
          // Check hr
          ' Continuously transfer data between the two streams until no more data is found (i.e. end of stream)
          ' Note only transfer 1000 bytes at a time to creating large chunks of data at one time
          while (TRUE)
             // for logging purposes, the app can retrieve the recognized audio stream length in bytes
             STATSTG stats;
             hr = cpStreamFormat->Stat(&stats;, NULL);
             // Check hr
             // create a 1000-byte buffer for transferring
             BYTE bBuffer[1000];
             ULONG cbRead;
             // request 1000 bytes of data from the input stream
             hr = cpStreamFormat->Read(bBuffer, 1000, &cbRead;);
             // if data was returned??
             if (SUCCEEDED(hr) && cbRead > 0)
                ' then transfer/write the audio to the file-based stream
                hr = cpStream->Write(bBuffer, cbRead, &cbWritten;);
                // Check hr
             // since there is no more data being added to the input stream, if the read request returned less than expected, the end of stream was reached, so break data transfer loop
             if (cbRead < 1000)
       ' explicitly close the file-based stream to flush file data and allow app to immediately use the file
       hr = cpStream->Close();

    Automation/Visual Basic 6.0 Developers

    Scripting code is similar to Visual Basic.

    Option Explicit
    Dim WithEvents RecoContext As SpSharedRecoContext ' context for receiving SR events
    Dim Grammar As ISpeechRecoGrammar ' grammar
    ' Setup/Initialization code for application startup
    Private Sub Form_Load()
        ' Create new shared recognition context (inproc works similarly)
        Set RecoContext = New SpSharedRecoContext
        ' Create grammar
        Set Grammar = RecoContext.CreateGrammar
        ' Activate retained audio
        RecoContext.RetainedAudio = SRAORetainAudio
        ' Optionally, set the retained audio format to lower quality for smaller size
        ' RecoContext.RetainedAudioFormat = ???
        ' load and activate dictation
        Grammar.DictationSetState SGDSActive
    End Sub
    ' Recognition event was received
    Private Sub RecoContext_Recognition(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal RecognitionType As SpeechLib.SpeechRecognitionType, ByVal Result As SpeechLib.ISpeechRecoResult)
        ' Create new file-based stream for audio storage
        Dim FileStream As New SpFileStream
        ' Variable for accessing the recognized audio stream
        Dim AudioStream As SpMemoryStream
        ' Retrieve recognized audio from result object
        ' Note: application can also retrieve smaller portions of the audio stream by specifying a starting phrase element and phrase element length
        Set AudioStream = Result.Audio
        ' Setup the file-based stream format with the same format as the audio stream format
        Set FileStream.Format = AudioStream.Format
        ' Create a file on the hard-disk for storing the recognized audio
        FileStream.Open "c:\recognized_audio.wav", SSFMCreateForWrite
        Dim Buffer As Variant ' Buffer for storing stream data
        Dim lRead As Long ' Amount of data read from the stream
        Dim lWritten As Long ' Amount of data written to the stream
        ' Continuously transfer data between the two streams until no more data is found (i.e. end of stream)
        ' Note only transfer 1000 bytes at a time to creating large chunks of data at one time
        Do While True
       ' read 1000 bytes of stream data
            lRead = AudioStream.Read(Buffer, 1000)
            ' if data was retrieved, then transfer/write it to the file-based stream
            If (lRead > 0) Then
                lWritten = FileStream.Write(Buffer)
            End If
            ' Since the input stream will not increase in size, the number of bytes read will only be less than requested if there is no more data to be transferred
            If lRead < 1000
                Exit Do ' exit if no more data
            End If
        ' close the file-based stream
        ' Note: The stream will be closed automatically when the object is released, but explicit closing enables app to immediately use the file stream's data
        ' Sample code will deactivate and unload dictation, then shutdown after one recognition
        Grammar.DictationSetState SGDSInactive
        Unload Me ' shutdown app
    End Sub