Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Microsoft Speech Platform

Use Speech Recognition Events

The Microsoft Speech Platform includes events that provide notifications and return information about the status of speech recognition. Most speech recognition events are raised by the speech recognition engine while it is performing recognition. An application can register to receive notification of the events that it wants to process. In addition to subscribing to events, an application should use a window message, a callback function, or a Win32 event to signal when speech events are available.

Note that both variations of callback notification (a callback function or a callback interface), as well as window message notification, require a window message pump to run on the thread that initialized the notification source. A callback function is called only as the result of window message processing, and is always called on the same thread that initialized the notification source. In contrast, using Win32 events for SAPI event notification does not require a window message pump.
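
For comparison, the following is a minimal sketch (not part of the original sample) of the callback alternative. It assumes that cpContext is an initialized ISpRecoContext smart pointer and that a window message pump is running on the thread that registers the callback.

`

// Callback-based notification (illustrative sketch).
// The recognition context pointer is passed through the lParam slot so the
// callback can drain the event queue with the CSpEvent helper.
void __stdcall SpeechNotifyCallback(WPARAM wParam, LPARAM lParam)
{
    ISpRecoContext* pContext = reinterpret_cast<ISpRecoContext*>(lParam);

    // Sequentially grab the available speech events from the speech event queue.
    CSpEvent spevent;
    while (S_OK == spevent.GetFrom(pContext))
    {
        if (SPEI_RECOGNITION == spevent.eEventId)
        {
            // Process the recognition result here.
        }
    }
}

// Register the callback in place of SetNotifyWin32Event; the callback is
// invoked as a result of window message processing on this thread.
hr = cpContext->SetNotifyCallbackFunction(&SpeechNotifyCallback, 0, reinterpret_cast<LPARAM>(cpContext.p));

`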

Example

The following code excerpt subscribes to the events raised when the engine returns a recognition and when the audio input stream ends, and establishes a Win32 event that is signaled when speech events are available.

`

// Subscribe to the speech recognition event and end stream event.
if (SUCCEEDED(hr))
{
    ULONGLONG ullEventInterest = SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM);
    hr = cpContext->SetInterest(ullEventInterest, ullEventInterest);
}

// Establish a Win32 event to signal when speech events are available.
HANDLE hSpeechNotifyEvent = INVALID_HANDLE_VALUE;

if (SUCCEEDED(hr))
{
    hr = cpContext->SetNotifyWin32Event();
}

if (SUCCEEDED(hr))
{
    hSpeechNotifyEvent = cpContext->GetNotifyEventHandle();

    if (INVALID_HANDLE_VALUE == hSpeechNotifyEvent)
    {
        // Notification handle unsupported.
        hr = E_NOINTERFACE;
    }
}

`
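
After the notification event is established, the application waits on the handle and then retrieves the queued events. The following is a minimal sketch (not part of the original excerpt); the 30-second timeout is an arbitrary value chosen for illustration.

`

// Wait for a speech event to become available (30-second timeout for illustration).
DWORD dwWaitResult = WaitForSingleObject(hSpeechNotifyEvent, 30000);

if (WAIT_OBJECT_0 == dwWaitResult)
{
    // One or more speech events are queued on the recognition context.
    // Retrieve them with CSpEvent::GetFrom, as shown in the complete example
    // at the end of this topic.
}

`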

See the end of this topic for an example of a complete console application that subscribes to events, recognizes speech input, and processes events raised during speech recognition.

Speech recognition events in the Speech Platform

The following table lists and describes the events raised in the Speech Platform to support speech recognition.

| Event | Description |
|-------|-------------|
| SPEI_END_SR_STREAM | The SR engine has finished receiving an audio input stream. lParam points to the SR engine's final HRESULT code (see CSpEvent::EndStreamResult). wParam points to a Boolean value signifying whether the audio input stream object was released (see CSpEvent::InputStreamReleased). |
| SPEI_SOUND_START | The SR engine has determined that audible sound is available through the input stream. |
| SPEI_SOUND_END | The SR engine has determined that audible sound is no longer available through the input stream, or that the sound stream has been inactive for a period. |
| SPEI_PHRASE_START | The SR engine is starting to recognize a phrase. This event precedes any SPEI_FALSE_RECOGNITION, SPEI_HYPOTHESIS, or SPEI_RECOGNITION event. |
| SPEI_RECOGNITION | The SR engine is returning a full recognition, its best guess at a text representation of the audio data. lParam is a pointer to an ISpRecoResult object (see CSpEvent::RecoResult). |
| SPEI_HYPOTHESIS | The SR engine is returning a partial phrase recognition, effectively its best guess up to that point in the stream. lParam is a pointer to an ISpRecoResult object (see CSpEvent::RecoResult). |
| SPEI_SR_BOOKMARK | The SR engine has processed to the stream position of a bookmark. lParam is an application-specified value set using ISpRecoContext::Bookmark. wParam is SPREF_AutoPause if ISpRecoContext::Bookmark was called with SPBO_PAUSE; otherwise, wParam is NULL. |
| SPEI_PROPERTY_NUM_CHANGE | A numeric property supported by the SR engine has changed. lParam is a string pointer to the name of the property that changed (see CSpEvent::PropertyName). wParam contains the new value (see CSpEvent::PropertyNumValue). |
| SPEI_PROPERTY_STRING_CHANGE | A string property supported by the SR engine has changed. lParam is a string pointer to the name of the property that changed (see CSpEvent::PropertyName). Immediately following the NULL termination of the property name is the new property value (see CSpEvent::PropertyStringValue). |
| SPEI_FALSE_RECOGNITION | Apparent speech was detected without a valid recognition. An SR engine can optionally return a result object, which is referenced by lParam (see CSpEvent::RecoResult). |
| SPEI_INTERFERENCE | The SR engine has determined that there is a problem in the sound stream that is preventing a successful recognition. lParam is any combination of SPINTERFERENCE flags (see CSpEvent::Interference). |
| SPEI_REQUEST_UI | The SR engine is requesting that a specific user interface be displayed. lParam is a null-terminated string (see CSpEvent::RequestTypeOfUI). Microsoft engines in the Speech Platform do not support display of graphical user interfaces (GUIs); calls to any ::DisplayUI method will fail. |
| SPEI_RECO_STATE_CHANGE | The recognizer state has changed. wParam is the new recognizer state (see SPRECOSTATE and CSpEvent::RecoState). |
| SPEI_START_SR_STREAM | The SR engine has reached the start of a new audio stream. |
| SPEI_RECO_OTHER_CONTEXT | A recognition was sent to another recognition context. |
| SPEI_SR_AUDIO_LEVEL | Fired by the audio input stream object. wParam is the current audio level, from zero to 100. |
| SPEI_SR_RETAINEDAUDIO | Returns the audio that was sent to the recognizer. |
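
An application can register interest in any combination of these events and use the CSpEvent helper class to read the event-specific lParam and wParam data. The following sketch is illustrative only and is not part of the original sample; it assumes cpContext is an initialized ISpRecoContext and widens the interest set from the earlier excerpt.

`

// Register interest in several of the events described in the table.
ULONGLONG ullEventInterest = SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_FALSE_RECOGNITION) |
                             SPFEI(SPEI_SOUND_START) | SPFEI(SPEI_SOUND_END) |
                             SPFEI(SPEI_INTERFERENCE) | SPFEI(SPEI_END_SR_STREAM);
hr = cpContext->SetInterest(ullEventInterest, ullEventInterest);

// Later, when the notification is signaled, drain and inspect the event queue.
CSpEvent spevent;

while (S_OK == spevent.GetFrom(cpContext))
{
    switch (spevent.eEventId)
    {
        case SPEI_SOUND_START:
            wprintf(L"Audible sound detected\r\n");
            break;
        case SPEI_SOUND_END:
            wprintf(L"Audible sound ended\r\n");
            break;
        case SPEI_INTERFERENCE:
            // lParam carries the SPINTERFERENCE flags (CSpEvent::Interference).
            wprintf(L"Interference in the sound stream: %d\r\n", spevent.Interference());
            break;
        case SPEI_FALSE_RECOGNITION:
            wprintf(L"Apparent speech without a valid recognition\r\n");
            break;
        case SPEI_RECOGNITION:
            // lParam is a pointer to the ISpRecoResult object (CSpEvent::RecoResult).
            wprintf(L"Recognition received\r\n");
            break;
        case SPEI_END_SR_STREAM:
            // lParam carries the SR engine's final HRESULT (CSpEvent::EndStreamResult).
            wprintf(L"End of stream, hr = 0x%08lX\r\n", spevent.EndStreamResult());
            break;
    }
}

`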

Complete speech recognition example

The following is an example of a complete console application that subscribes to events, recognizes speech, and processes events raised during speech recognition. The example loads a speech recognition grammar that recognizes phrases such as "Find restaurants near Madrid". See the end of the code example for the contents of the grammar.

`

// Speech Platform/SAPI headers; sphelper.h provides the CSpEvent, SpFindBestToken,
// SpConvertStreamFormatEnum, and sp_countof helpers, and atlbase.h provides CComPtr.
#include <windows.h>
#include <sapi.h>
#include <sphelper.h>
#include <atlbase.h>
#include <stdio.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
CoInitialize(NULL);

// Scope block so that the CComPtr smart pointers release before CoUninitialize.
{
    HRESULT hr = S_OK;

    // Find the best matching installed en-us recognizer.
    CComPtr<ISpObjectToken> cpRecognizerToken;

    if (SUCCEEDED(hr))
    {
        hr = SpFindBestToken(SPCAT_RECOGNIZERS, L"language=409", NULL, &cpRecognizerToken);
    }

    // Create a recognizer and immediately set its state to inactive.
    CComPtr<ISpRecognizer> cpRecognizer;

    if (SUCCEEDED(hr))
    {
        hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
    }

    if (SUCCEEDED(hr))
    {
        hr = cpRecognizer->SetRecognizer(cpRecognizerToken);
    }

    if (SUCCEEDED(hr))
    {
        hr = cpRecognizer->SetRecoState(SPRST_INACTIVE);
    }

    // Create a new recognition context from the recognizer.
    CComPtr<ISpRecoContext> cpContext;

    if (SUCCEEDED(hr))
    {
        hr = cpRecognizer->CreateRecoContext(&cpContext);
    }

    // Subscribe to the speech recognition event and end stream event.
    if (SUCCEEDED(hr))
    {
        ULONGLONG ullEventInterest = SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_END_SR_STREAM);
        hr = cpContext->SetInterest(ullEventInterest, ullEventInterest);
    }

    // Establish a Win32 event to signal when speech events are available.
    HANDLE hSpeechNotifyEvent = INVALID_HANDLE_VALUE;

    if (SUCCEEDED(hr))
    {
        hr = cpContext->SetNotifyWin32Event();
    }

    if (SUCCEEDED(hr))
    {
        hSpeechNotifyEvent = cpContext->GetNotifyEventHandle();

        if (INVALID_HANDLE_VALUE == hSpeechNotifyEvent)
        {
            // Notification handle unsupported.
            hr = E_NOINTERFACE;
        }
    }

    // Initialize an audio object to use the default audio input of the system and set the recognizer to use it.
    CComPtr<ISpAudio> cpAudioIn;

    if (SUCCEEDED(hr))
    {
        hr = cpAudioIn.CoCreateInstance(CLSID_SpMMAudioIn);
    }

    // This will typically use the microphone input. 
    // Speak a phrase such as "Find restaurants near Madrid".  
    if (SUCCEEDED(hr))
    {
        hr = cpRecognizer->SetInput(cpAudioIn, TRUE);
    }

    // Populate a WAVEFORMATEX struct with the desired retained audio format.
    // (This format could be passed to ISpRecoContext::SetAudioOptions to retain
    // recognized audio; it is not used further in this example.)
    WAVEFORMATEX* pWfexCoMemRetainedAudioFormat = NULL;
    GUID guidRetainedAudioFormat = GUID_NULL;

    if (SUCCEEDED(hr))
    {
        hr = SpConvertStreamFormatEnum(SPSF_16kHz16BitMono, &guidRetainedAudioFormat, &pWfexCoMemRetainedAudioFormat);
    }

    // Create a new grammar and load an SRGS grammar from file.
    CComPtr<ISpRecoGrammar> cpGrammar;

    if (SUCCEEDED(hr))
    {
        hr = cpContext->CreateGrammar(0, &cpGrammar);
    }

    if (SUCCEEDED(hr))
    {
        hr = cpGrammar->LoadCmdFromFile(L"C:\\Test\\FindServices.grxml", SPLO_STATIC);
    }

    // Set all top-level rules in the new grammar to the active state.
    if (SUCCEEDED(hr))
    {
        hr = cpGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);
    }

    // Set the recognizer state to active to begin recognition.
    if (SUCCEEDED(hr))
    {
        hr = cpRecognizer->SetRecoState(SPRST_ACTIVE_ALWAYS);
    }    

    // Establish a separate win32 event to signal event loop exit.
    HANDLE hExitEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

    // Collect the events listened for to pump the speech event loop.
    HANDLE rghEvents[] = { hSpeechNotifyEvent, hExitEvent };

    // Speech recognition event loop.
    BOOL fContinue = TRUE;

    while (fContinue && SUCCEEDED(hr))
    {
        // Wait for either a speech event or an exit event.
        DWORD dwMessage = WaitForMultipleObjects(sp_countof(rghEvents), rghEvents, FALSE, INFINITE);

        switch (dwMessage)
        {
            // With the WaitForMultipleObjects call above, WAIT_OBJECT_0 is a speech event from hSpeechNotifyEvent.
            case WAIT_OBJECT_0: 
            {
                // Sequentially grab the available speech events from the speech event queue.
                CSpEvent spevent;

                while (S_OK == spevent.GetFrom(cpContext))
                {
                    switch (spevent.eEventId)
                    {
                        case SPEI_RECOGNITION:
                        {
                            // Retrieve the recognition result and output the text of that result.
                            ISpRecoResult* pResult = spevent.RecoResult();

                            LPWSTR pszCoMemResultText = NULL;
                            hr = pResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, &pszCoMemResultText, NULL);

                            if (SUCCEEDED(hr))
                            {
                                wprintf(L"Recognition event received, text=\"%s\"\r\n", pszCoMemResultText);
                            }

                            if (NULL != pszCoMemResultText)
                            {
                                CoTaskMemFree(pszCoMemResultText);
                            }

                            break;
                        }
                        case SPEI_END_SR_STREAM:
                        {
                            // The stream has ended; signal the exit event if it hasn't been signaled already.
                            wprintf(L"End stream event received\r\n");
                            SetEvent(hExitEvent);
                            break;
                        }
                    }
                }

                break;
            }
            case WAIT_OBJECT_0 + 1:
            {
                // Exit event; discontinue the speech loop.
                fContinue = FALSE;
                break;
            }
        }
    }

    // Free the retained audio format structure and close the exit event handle.
    if (NULL != pWfexCoMemRetainedAudioFormat)
    {
        CoTaskMemFree(pWfexCoMemRetainedAudioFormat);
    }

    CloseHandle(hExitEvent);

    // Pause to prevent application exit.
    wprintf(L"Press ENTER to exit!\r\n");
    getchar();
}
CoUninitialize();

return 0;
}

`

The following are the contents of the grammar file (FindServices.grxml) used in the preceding example. See Grammar Authoring Overview for more information about authoring grammars.

`

<?xml version="1.0" encoding="utf-8"?>
<grammar version="1.0" xml:lang="en-US" mode="voice" root="findServices"
xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">

<rule id="findServices"> <item> Find </item> <ruleref uri="#services"/> <item> near </item> <ruleref uri="#city"/> </rule>

<rule id="services"> <one-of> <item> restaurants </item> <item> gas stations </item> <item> coffee </item> </one-of> </rule>

<rule id="city"> <one-of> <item> Seattle </item> <item> Madrid </item> <item> London </item> </one-of> </rule>

</grammar>

`