Creating Speech Recognition Calculators in UCMA 3.0: UCMA Infrastructure (Part 3 of 4)

Summary:   Add speech recognition and speech synthesis to your Microsoft Unified Communications Managed API (UCMA) 3.0 application by incorporating the recognition and synthesis APIs of Microsoft Speech Platform SDK. Part 3 describes the application, and how it interacts with the speech recognition engine and speech synthesizer.

Applies to:   Microsoft Unified Communications Managed API (UCMA) 3.0 Core SDK | Microsoft Speech Platform SDK

Published:   November 2011 | Provided by:   Mark Parker, Microsoft | About the Author

Contents

  • Application Outline

  • Creating the Platform and Endpoint

  • Receiving an Incoming Call

  • Setting Up the Speech Recognition Engine

  • Creating and Loading the Grammar

  • Setting Up the Speech Synthesizer

  • Shutting Down the Application

  • Part 4

  • Additional Resources

This article is the third in a four-part series of articles about how to create a calculator that uses speech recognition and speech synthesis.

Application Outline

  1. Create and establish a UserEndpoint instance.

  2. Register for an incoming call.

  3. Handle the incoming call.

  4. Set up the speech recognition engine.

  5. Create the grammar that will be used.

  6. Load the grammar.

  7. Set the input stream for the speech recognition engine.

  8. Start speech recognition.

  9. After all recognitions have been made, shut down the application.

The following sections provide a detailed description of these steps.

Application Global Variables

The following example lists the global variables that are used in the application that is discussed in this series of articles. All of the code for the application described in the articles appears in Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4).

private UCMASampleHelper _helper;
private UserEndpoint _userEndpoint;
private AudioVideoCall _audioVideoCall;
private AudioVideoFlow _audioVideoFlow;
private SpeechSynthesizer _speechSynthesizer;

// Wait handles are used to keep the main and worker threads synchronized.
private AutoResetEvent _waitForCallToBeAccepted = new AutoResetEvent(false);
private AutoResetEvent _waitForGrammarToLoad = new AutoResetEvent(false);
private AutoResetEvent _waitForConversationToBeTerminated = new AutoResetEvent(false);
private AutoResetEvent _waitForShutdownEventCompleted = new AutoResetEvent(false);
private AutoResetEvent _waitForRecoCompleted = new AutoResetEvent(false);
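
The wait handles in the previous example keep the main thread blocked while worker threads raise the UCMA and speech events that are described in later sections. The following sketch is hypothetical (the Run method name, the comments, and the exact ordering are illustrative); the complete flow appears in Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4).

// Hypothetical outline of the main flow of the application.
public void Run()
{
  // Steps 1 and 2: create the platform and endpoint, and register for incoming calls.
  _helper = new UCMASampleHelper();
  _userEndpoint = _helper.CreateEstablishedUserEndpoint("Speech Calculator Sample User");
  _userEndpoint.RegisterForIncomingCall<AudioVideoCall>(AudioVideoCall_Received);

  // Step 3: block the main thread until a call arrives and is accepted.
  _waitForCallToBeAccepted.WaitOne();

  // Steps 4 through 8: set up the speech recognition engine, load the grammar,
  // set up the speech synthesizer, and start recognition (shown in later sections).
  // The main thread waits until the grammar has finished loading.
  _waitForGrammarToLoad.WaitOne();

  // Step 9: block until the user says "exit," and then shut down.
  _waitForRecoCompleted.WaitOne();
  // ... stop and detach the connectors, and terminate the call (see "Shutting Down the Application") ...
  _waitForConversationToBeTerminated.WaitOne();
  _helper.ShutdownPlatform();
}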

Creating the Platform and Endpoint

The sample uses the CreateEstablishedUserEndpoint method, which is defined in UCMASampleHelper.cs, to create a CollaborationPlatform instance and then start it. This method then creates and establishes a UserEndpoint instance.

_helper = new UCMASampleHelper();
// Create a user endpoint using the network credential object. 
_userEndpoint = _helper.CreateEstablishedUserEndpoint("Speech Calculator Sample User");
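
The following is a simplified sketch of what a helper such as CreateEstablishedUserEndpoint typically does. The user agent name, SIP URI, server name, port, and credentials shown here are placeholders; the actual helper in UCMASampleHelper.cs obtains these settings at run time.

// Simplified sketch of an endpoint-creation helper. The URI, server, and credentials are
// placeholders, and the friendly name is used only for display in the actual helper.
public UserEndpoint CreateEstablishedUserEndpoint(string endpointFriendlyName)
{
  // Create and start the collaboration platform.
  ClientPlatformSettings platformSettings =
      new ClientPlatformSettings("UCMASpeechCalculator", SipTransportType.Tls);
  CollaborationPlatform platform = new CollaborationPlatform(platformSettings);
  platform.EndStartup(platform.BeginStartup(null, null));

  // Create the user endpoint and supply its credentials.
  UserEndpointSettings endpointSettings =
      new UserEndpointSettings("sip:user@contoso.com", "sipserver.contoso.com", 5061);
  endpointSettings.Credential = System.Net.CredentialCache.DefaultNetworkCredentials;
  UserEndpoint userEndpoint = new UserEndpoint(platform, endpointSettings);

  // Establish the endpoint so that it can receive incoming calls.
  userEndpoint.EndEstablish(userEndpoint.BeginEstablish(null, null));
  return userEndpoint;
}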

Receiving an Incoming Call

After the UserEndpoint is created and established, the application calls the RegisterForIncomingCall<TCall> method on the UserEndpoint instance. This action registers a delegate that is called when an audio/video call arrives.

_userEndpoint.RegisterForIncomingCall<AudioVideoCall>(AudioVideoCall_Received);

AudioVideoCall_Received Delegate

The delegate that handles an incoming call must perform three important tasks.

  • Set a global variable that is a reference to the incoming audio/video call.

  • Register for notification of the AudioVideoFlowConfigurationRequested event.

    The handler for the AudioVideoFlowConfigurationRequested event sets a global variable that is a reference to an AudioVideoFlow instance. The flow variable is required for initializing the speech recognition connector and the speech synthesis connector that are used in later steps.

  • Accept the call by using the BeginAccept method on the call.

The following example is the definition for the AudioVideoCall_Received method.

void AudioVideoCall_Received(object sender, CallReceivedEventArgs<AudioVideoCall> e)
{
  _audioVideoCall = e.Call;
  _audioVideoCall.AudioVideoFlowConfigurationRequested += this.AudioVideoCall_FlowConfigurationRequested;

  // For logging purposes, register for notification of the StateChanged event on the call.
  _audioVideoCall.StateChanged +=
            new EventHandler<CallStateChangedEventArgs>(AudioVideoCall_StateChanged);

  // Remote Participant URI represents the far end (caller) in this conversation. 
  Console.WriteLine("Call received from: " + e.RemoteParticipant.Uri);

  // Now, accept the call. CallAcceptCB runs on another thread when the accept operation completes.
  _audioVideoCall.BeginAccept(CallAcceptCB, _audioVideoCall);
}
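
The AudioVideoCall_StateChanged handler that is registered in the previous example is used only for logging. A minimal sketch follows; the definition that the sample uses appears in Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4).

// Minimal logging handler for call state transitions (sketch only).
void AudioVideoCall_StateChanged(object sender, CallStateChangedEventArgs e)
{
  Console.WriteLine("Call has changed state. The previous call state was: " + e.PreviousState
      + " and the current state is: " + e.State);
}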

AudioVideoFlowConfigurationRequested Event Handler

When an AudioVideoFlow object is created, the AudioVideoFlowConfigurationRequested event is raised. The event handler in this article retrieves a reference to the flow from the Flow property on the AudioVideoFlowConfigurationRequestedEventArgs parameter.

The following example is the definition for the AudioVideoCall_FlowConfigurationRequested method.

public void AudioVideoCall_FlowConfigurationRequested(object sender, AudioVideoFlowConfigurationRequestedEventArgs e)
{
  Console.WriteLine("Flow Created.");
  _audioVideoFlow = e.Flow;

  // Now that the flow is non-null, bind a handler for the StateChanged event.
  // When the flow goes active (as indicated by the StateChanged event), the application can take media-related actions on the flow.
  _audioVideoFlow.StateChanged += new EventHandler<MediaFlowStateChangedEventArgs>(AudioVideoFlow_StateChanged);
}
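
The AudioVideoFlow_StateChanged handler that is bound in the previous example is also used for logging. When the flow reports the Active state, media operations such as attaching the speech connectors can proceed. A minimal sketch follows; the definition that the sample uses appears in Part 4.

// Minimal handler for flow state transitions (sketch only).
private void AudioVideoFlow_StateChanged(object sender, MediaFlowStateChangedEventArgs e)
{
  Console.WriteLine("Flow state changed from " + e.PreviousState + " to " + e.State);

  if (e.State == MediaFlowState.Active)
  {
    // The flow is active, so media-related operations on the flow can begin.
  }
}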

CallAcceptCB Callback Method

The CallAcceptCB callback method completes the call acceptance process by invoking the EndAccept method on the incoming audio/video call.

The following example is the definition for the CallAcceptCB method.

private void CallAcceptCB(IAsyncResult ar)
{
  AudioVideoCall audioVideoCall = ar.AsyncState as AudioVideoCall;
  try
  {
    // Determine whether the call was accepted successfully.
    audioVideoCall.EndAccept(ar);
  }
  catch (RealTimeException exception)
  {
    // RealTimeException may be thrown on media or link-layer failures. 
    // A production application should catch additional exceptions, such as
    // OperationFailureException and OperationTimeoutException.

    Console.WriteLine(exception.ToString());
  }
  finally
  {
    // Synchronize with main thread.
    _waitForCallToBeAccepted.Set();
  }
}

Setting Up the Speech Recognition Engine

To set up Speech Recognition

  1. Create a SpeechRecognitionConnector instance.

  2. Attach the flow that was obtained in the handler for the AudioVideoFlowConfigurationRequested event, by using the AttachFlow method on the speech recognition connector object.

  3. Start the speech recognition connector object, by using a call to the Start method on the speech recognition connector. The Start method returns a SpeechRecognitionStream object that is used in a later step as a parameter to the SpeechRecognitionEngine constructor.

  4. Create a SpeechRecognitionEngine instance.

  5. Register for notification of the SpeechRecognized event on the speech recognition engine.

  6. Register for notification of the LoadGrammarCompleted event on the speech recognition engine.

  7. Create and load the grammar that is used for speech recognition.

    The example code that follows uses a helper method, CreateGrammar, to create a GrammarBuilder grammar, which is used to initialize a Grammar instance. For more information, see Creating Speech Recognition Calculators in UCMA 3.0: Grammar Creation (Part 2 of 4).

  8. Create a SpeechAudioFormatInfo instance. This object specifies the number of samples per second, the number of bits per sample, and the number of channels to use.

  9. Set the input on the speech recognition engine, by using the SetInputToAudioStream method. The first parameter is the stream obtained in step 3, and the second parameter is the SpeechAudioFormatInfo instance that was created in the previous step. In the example that follows, the SpeechAudioFormatInfo instance is initialized to 8,000 samples per second, 16 bits per sample, and a single (monaural) audio channel.

  10. Begin recognition by using the RecognizeAsync method on the speech recognition engine. The Multiple value on the RecognizeMode enumeration permits the application to accept multiple utterances from the user.

The following example shows the code for the steps in the previous list.

// Create a speech recognition connector and attach an AudioVideoFlow to it.
SpeechRecognitionConnector speechRecognitionConnector = new SpeechRecognitionConnector();
speechRecognitionConnector.AttachFlow(_audioVideoFlow);

// Start the speech recognition connector.
SpeechRecognitionStream stream = speechRecognitionConnector.Start();

// Create a speech recognition engine.
SpeechRecognitionEngine speechRecognitionEngine = new SpeechRecognitionEngine();
speechRecognitionEngine.SpeechRecognized += 
        new EventHandler<SpeechRecognizedEventArgs>(SpeechRecognitionEngine_SpeechRecognized);
speechRecognitionEngine.LoadGrammarCompleted += new EventHandler<LoadGrammarCompletedEventArgs>(SpeechRecognitionEngine_LoadGrammarCompleted);

Grammar gr = new Grammar(CreateGrammar());     

speechRecognitionEngine.LoadGrammarAsync(gr);
SpeechAudioFormatInfo speechAudioFormatInfo = 
        new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, Microsoft.Speech.AudioFormat.AudioChannel.Mono);
speechRecognitionEngine.SetInputToAudioStream(stream, speechAudioFormatInfo);
speechRecognitionEngine.RecognizeAsync(RecognizeMode.Multiple);

SpeechRecognized Event Handler

The SpeechRecognitionEngine_SpeechRecognized method is invoked when the SpeechRecognized event on the speech recognition engine is raised. This event handler retrieves the semantic items that are available by way of the SpeechRecognizedEventArgs argument.

By necessity, the SpeechRecognitionEngine_SpeechRecognized method is closely tied to the grammar that is used for recognition, because this method uses the semantic keys that are defined in the grammar to gain access to the list of key-value pairs that is returned by the grammar. This list can be accessed from the expression e.Result.Semantics. For example, the expression e.Result.Semantics["operator"] provides access to the arithmetic operator spoken by the user.

Before exiting, this method calls the Calculate helper method, which calculates the result of the arithmetic expression spoken by the user.

The following example is the definition for the SpeechRecognitionEngine_SpeechRecognized method.

void SpeechRecognitionEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  RecognitionResult recoResult = e.Result;
  Int32 op1, op2;
  char operation;
  
  if (recoResult != null)
  {
    Console.WriteLine("Speech recognized: " + recoResult.Text);

    if (recoResult.Text.Contains("exit"))
    {
      _waitForRecoCompleted.Set();
    }
    else
    {
      op1 = Convert.ToInt32(recoResult.Semantics["number1"].Value.ToString());
      operation = Convert.ToChar(recoResult.Semantics["operator"].Value.ToString());
      op2 = Convert.ToInt32(recoResult.Semantics["number2"].Value.ToString());

      Calculate(op1, operation, op2);
    }
  }
}

Calculate Method

The Calculate method calculates the value of the expression that is formed from its three arguments. The only arithmetic operation that is not allowed is division by zero.

After calculating the value of the input expression, this method displays a string in the console and calls the Speak method on the speech synthesizer to speak the answer.

The following example is the definition of the Calculate method.

private void Calculate(int op1, char operation, int op2)
{
  int result = 0;
  String prompt;
  String operationStr = "";
  if (operation == '/' && op2 == 0)
  {
    prompt = op1.ToString() + " divided by zero is undefined. You cannot divide by zero.";
    Console.WriteLine("{0} {1} {2} is undefined. You cannot divide by zero.",
          op1, operation, op2);
    _speechSynthesizer.Speak(prompt);
  }
  else
  {
    switch (operation)
    {
      case '+':
        result = op1 + op2;
        operationStr = " plus ";
        break;
      case '-':
        result = op1 - op2;
        operationStr = " minus ";
        break;
      case '*':
        result = op1 * op2;
        operationStr = " times ";
        break;
      case '/':
        result = op1 / op2;
        operationStr = " divided by ";
        break;
    }
    Console.WriteLine("{0} {1} {2} = {3}", op1, operation, op2, result);
    prompt = op1.ToString() + operationStr + op2.ToString() + " is " + result.ToString();
    _speechSynthesizer.Speak(prompt);
  }
}

Note

In the interest of simplicity, integer division is performed in this method. If the user says “six divided by four,” the application replies by speaking, “six divided by four is one.”
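
If fractional answers were needed, the division branch could perform floating-point division instead. The following is a hedged sketch of such a variation; the DivideExactly helper is hypothetical and is not part of the sample.

// Hypothetical alternative to the integer division that Calculate performs.
// Casting one operand to double yields a fractional result, so six divided by four is 1.5.
private static double DivideExactly(int op1, int op2)
{
  // The caller is assumed to have already rejected division by zero.
  return (double)op1 / op2;
}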

Creating and Loading the Grammar

Creating Speech Recognition Calculators in UCMA 3.0: Grammar Creation (Part 2 of 4) discusses two grammars: a GrammarBuilder instance that is created at run time, and a Speech Recognition Grammar Specification (SRGS) grammar, which is an XML text file that is loaded into memory at run time.

The GrammarBuilder grammar is created by a helper method named CreateGrammar, which returns a GrammarBuilder object. This method is discussed in Creating Speech Recognition Calculators in UCMA 3.0: Grammar Creation (Part 2 of 4). After the GrammarBuilder is used to initialize a Grammar object, the grammar is loaded into the speech recognition engine by a call to the LoadGrammarAsync method.

Grammar gr = new Grammar(CreateGrammar());
speechRecognitionEngine.LoadGrammarAsync(gr);

The SRGS grammar can be used by providing the path and file name of the grammar file, together with the name of its root rule, to the appropriate Grammar constructor. After the Grammar object is created, it is passed as a parameter to the LoadGrammarAsync method on the speech recognition engine.

String currDirPath = Environment.CurrentDirectory;
Grammar gr = new Grammar(currDirPath + @"\NumberOpNumber.grxml", "Expression");
speechRecognitionEngine.LoadGrammarAsync(gr);

LoadGrammarCompleted Event Handler

If a grammar is loaded into the speech recognition engine by a call to the asynchronous LoadGrammarAsync method, the LoadGrammarCompleted event on the speech recognition engine is raised when the load operation finishes. An application must register a handler to be notified of this event.

The following example is the definition for the LoadGrammarCompleted method.

void SpeechRecognitionEngine_LoadGrammarCompleted(object sender, LoadGrammarCompletedEventArgs e)
{
  Console.WriteLine("Grammar is now loaded.");
  _waitForGrammarToLoad.Set();
}
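
The previous handler does not inspect the result of the load operation. As a variation that is not part of the sample, a production application might check the Error property on the LoadGrammarCompletedEventArgs parameter before signaling the main thread.

// Variation (not part of the sample) that reports a grammar load failure.
void SpeechRecognitionEngine_LoadGrammarCompleted(object sender, LoadGrammarCompletedEventArgs e)
{
  if (e.Error != null)
  {
    Console.WriteLine("Grammar failed to load: " + e.Error.Message);
  }
  else
  {
    Console.WriteLine("Grammar is now loaded.");
  }
  _waitForGrammarToLoad.Set();
}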

Setting Up the Speech Synthesizer

Speech synthesizer setup resembles speech recognition engine setup.

To set up the Speech Synthesizer

  1. Create a SpeechSynthesisConnector instance.

  2. Attach the flow that was obtained in the handler for the AudioVideoFlowConfigurationRequested event, by using the AttachFlow method on the speech synthesis connector object.

    Note

    This is the same flow that was attached to the speech recognition connector.

  3. Create a SpeechSynthesizer instance.

  4. Create a SpeechAudioFormatInfo instance. This object is used to initialize parameters for the audio stream that the speech synthesizer uses for output. In the following example, these parameters are 16,000 samples per second, 16 bits per sample, and a single (monaural) audio channel.

  5. Call the SetOutputToAudioStream method on the speech synthesizer. This method sets the output of the synthesizer to the stream that is associated with the SpeechSynthesisConnector instance.

  6. Register for notification of the SpeakStarted and SpeakCompleted events.

    In the application presented here, the handlers for these events are used only for debugging. Their definitions appear in Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4); a minimal sketch also follows the code example below.

  7. Call the Start method on the SpeechSynthesisConnector. This method causes the speech synthesis connector to accept input from the SpeechSynthesizer and transmit it to the AudioVideoFlow object that is mentioned in step 2.

  8. Call the Speak method on the speech synthesizer.

The following example shows the code for the steps of the previous list.

// Create a speech synthesis connector and attach the AudioVideoFlow instance to it.
SpeechSynthesisConnector speechSynthesisConnector = new SpeechSynthesisConnector();

speechSynthesisConnector.AttachFlow(_audioVideoFlow);

// Create a speech synthesizer and set its output to the speech synthesis connector.
_speechSynthesizer = new SpeechSynthesizer();
SpeechAudioFormatInfo audioformat = new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, Microsoft.Speech.AudioFormat.AudioChannel.Mono);
_speechSynthesizer.SetOutputToAudioStream(speechSynthesisConnector, audioformat);

// Register for notification of the SpeakCompleted and SpeakStarted events on the speech synthesizer.
_speechSynthesizer.SpeakStarted += new EventHandler<SpeakStartedEventArgs>(SpeechSynthesizer_SpeakStarted);
_speechSynthesizer.SpeakCompleted += new EventHandler<SpeakCompletedEventArgs>(SpeechSynthesizer_SpeakCompleted);

// Start the speech synthesis connector.
speechSynthesisConnector.Start();
_speechSynthesizer.Speak("Simple speech calculator");
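
For reference, minimal logging handlers for the SpeakStarted and SpeakCompleted events might look like the following sketch; the definitions that the sample uses appear in Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4).

// Minimal logging handlers for the speech synthesizer events (sketch only).
void SpeechSynthesizer_SpeakStarted(object sender, SpeakStartedEventArgs e)
{
  Console.WriteLine("SpeakStarted event raised.");
}

void SpeechSynthesizer_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
{
  Console.WriteLine("SpeakCompleted event raised.");
}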

Shutting Down the Application

When the user says “exit,” the application begins an orderly shutdown process.

  1. Stop the speech recognition connector, by using the connector’s Stop method.

  2. Detach the AudioVideoFlow object from the speech recognition connector, by using the connector’s DetachFlow method.

  3. Stop the speech synthesis connector, by using the connector’s Stop method.

  4. Detach the AudioVideoFlow object from the speech synthesis connector, by using the connector’s DetachFlow method.

  5. Terminate the call, by using the BeginTerminate method on the audio/video call.

  6. Terminate the conversation, by using the BeginTerminate method on the conversation.

  7. Unregister the endpoint from receiving any more incoming calls, by using the UnregisterForIncomingCall<TCall> method on the endpoint.

  8. Shut down the collaboration platform, by using the ShutdownPlatform helper method. A simplified sketch of this helper appears after the following code example.

Note

Steps 6 and 7 in the previous list do not appear in the following code sample. Termination of the Conversation object occurs in the CallTerminateCB callback method, and unregistering for additional incoming calls occurs in the ConversationTerminateCB callback method. This programming style is known as callback chaining.

The following example shows the code that represents the steps in the previous list.

// Stop the speech recognition connector.
speechRecognitionConnector.Stop();
Console.WriteLine("Stopping the speech recognition connector");

// Detach the flow from the speech recognition connector, to prevent the flow from being kept in memory.
speechRecognitionConnector.DetachFlow();

// Stop the speech synthesis connector.
speechSynthesisConnector.Stop();
Console.WriteLine("Stopping the speech synthesis connector.");

// Detach the flow from the speech synthesis connector, to prevent the flow from being kept in memory.
speechSynthesisConnector.DetachFlow();

// Terminate the call, the conversation, and then unregister the 
// endpoint from receiving an incoming call. 
_audioVideoCall.BeginTerminate(CallTerminateCB, _audioVideoCall);
_waitForConversationToBeTerminated.WaitOne();
 
// Shut down the platform.
_helper.ShutdownPlatform();
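
The ShutdownPlatform helper method is defined in UCMASampleHelper.cs. The following is a simplified sketch of what such a helper does, assuming that the helper keeps references to the endpoint and platform that it created; the _endpoint and _collabPlatform field names are illustrative.

// Simplified sketch of a platform shutdown helper; _endpoint and _collabPlatform are
// illustrative fields that the helper populated when it created these objects.
public void ShutdownPlatform()
{
  // Terminate the endpoint so that it no longer accepts signaling traffic.
  _endpoint.EndTerminate(_endpoint.BeginTerminate(null, null));

  // Shut down the collaboration platform.
  _collabPlatform.EndShutdown(_collabPlatform.BeginShutdown(null, null));
  Console.WriteLine("The platform is now shut down.");
}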

CallTerminateCB Callback Method

The CallTerminateCB callback method completes the call termination process, by using the EndTerminate method on the AudioVideoCall instance. This method then unregisters the handler for the StateChanged event on the call. Before returning, this method calls the BeginTerminate method on the Conversation object.

The following example shows the definition of the CallTerminateCB callback method.

private void CallTerminateCB(IAsyncResult ar)
{
  AudioVideoCall audioVideoCall = ar.AsyncState as AudioVideoCall;

  // Finish terminating the incoming call.
  audioVideoCall.EndTerminate(ar);

  // Unregister this event handler now that the call has been terminated.
  _audioVideoCall.StateChanged -= AudioVideoCall_StateChanged;

  // Terminate the conversation.
  _audioVideoCall.Conversation.BeginTerminate(ConversationTerminateCB, _audioVideoCall.Conversation);
}

ConversationTerminateCB Callback Method

The ConversationTerminateCB callback method completes the conversation termination process, by using the EndTerminate method on the Conversation object. This method then unregisters the endpoint from receiving new calls.

The following example shows the definition of the ConversationTerminateCB callback method.

private void ConversationTerminateCB(IAsyncResult ar)
{
  Conversation conversation = ar.AsyncState as Conversation;

  // Finish terminating the conversation.
  conversation.EndTerminate(ar);

  // Unregister for incoming calls.
  _userEndpoint.UnregisterForIncomingCall<AudioVideoCall>(AudioVideoCall_Received);
  // Synchronize with main thread.
  _waitForConversationToBeTerminated.Set();
}

Part 4

Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4)

Additional Resources

For more information, see the following resources:

About the Author

Mark Parker is a programming writer at Microsoft whose current responsibility is the UCMA SDK documentation. Mark previously worked on the Microsoft Speech Server 2007 documentation.