Creating Speech Recognition Calculators in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4)

Summary:   Add speech recognition and speech synthesis to your Microsoft Unified Communications Managed API (UCMA) 3.0 application by incorporating the recognition and synthesis APIs of Microsoft Speech Platform SDK. Part 4 includes the complete application code and a conclusion.

Applies to:   Microsoft Unified Communications Managed API (UCMA) 3.0 Core SDK | Microsoft Speech Platform SDK

Published:   November 2011 | Provided by:   Mark Parker, Microsoft | About the Author


This article is the last in a four-part series of articles about how to create a calculator that uses speech recognition and speech synthesis.

App.Config, the application configuration file, is used to configure settings for the computer that hosts the application. If you enter the appropriate parameter values in the add elements (and remove the XML comment delimiters), you do not have to enter them from the keyboard when the application runs.

The following example shows the App.Config file.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <!-- Provide parameters necessary for the sample to run without 
    prompting for input. -->

    <!-- Provide the FQDN of the Microsoft Lync Server. -->
    <!-- <add key="ServerFQDN1" value=""/> -->

    <!-- The user ID of the user on whose behalf the application runs. -->
    <!-- Leave this value blank to use the credentials of the currently logged-on user. -->
    <!-- <add key="UserName1" value=""/> -->

    <!-- The domain of the user on whose behalf the application runs. -->
    <!-- Leave this value blank to use the credentials of the currently logged-on user. -->
    <!-- <add key="UserDomain1" value=""/> -->

    <!-- The URI of the user on whose behalf the application runs, in the form user@host. -->
    <!-- <add key="UserURI1" value=""/> -->
  </appSettings>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="mscorlib" publicKeyToken="b77a5c561934e089" culture="neutral" />
        <bindingRedirect oldVersion="" newVersion=""/>
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
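The application reads these keys at startup and prompts at the console for any value that is absent. The following fragment is a hypothetical sketch of how such keys can be read with the ConfigurationManager class; it is not the sample helper's actual code, although the key name matches the add elements above.

```csharp
// Hypothetical sketch: reading one of the App.Config keys shown above.
// Requires a project reference to System.Configuration.dll.
using System;
using System.Configuration;

class AppSettingsSketch
{
    static void Main()
    {
        // AppSettings returns null when an add element is commented out or missing;
        // that is the case in which the sample falls back to keyboard input.
        string serverFqdn = ConfigurationManager.AppSettings["ServerFQDN1"];
        if (String.IsNullOrEmpty(serverFqdn))
        {
            Console.Write("Enter the FQDN of the Microsoft Lync Server: ");
            serverFqdn = Console.ReadLine();
        }
        Console.WriteLine("Using server: " + serverFqdn);
    }
}
```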

The following example is the code for the application that is described in this article.

using System;
using System.Threading;
using Microsoft.Rtc.Collaboration;
using Microsoft.Rtc.Collaboration.AudioVideo;
using Microsoft.Rtc.Signaling;
using Microsoft.Speech.Recognition;
using Microsoft.Speech.Synthesis;
using Microsoft.Speech.AudioFormat;
using Microsoft.Rtc.Collaboration.Sample.Common;

namespace Microsoft.Rtc.Collaboration.Sample.GrammarBuilderSpeechReco
{
  // After starting, the application waits for an incoming audio/video call. When an audio/video call
  // arrives, the application uses speech synthesis to provide a brief introduction, and asks the user 
  // to speak a simple arithmetic problem, such as "how much is eight times thirteen". 
  // The application uses speech recognition to parse the user's speech, and then speaks the 
  // answer to the user, using speech synthesis. 
  // The application continues to respond to questions until the user says "exit," after which the 
  // application shuts down gracefully.

  public class UCMASpeechCalculator
  {
    private UCMASampleHelper _helper;
    private UserEndpoint _userEndpoint;
    private AudioVideoCall _audioVideoCall;
    private AudioVideoFlow _audioVideoFlow;
    private SpeechSynthesizer _speechSynthesizer;

    // Wait handles are used to keep the main and worker threads synchronized.
    private AutoResetEvent _waitForCallToBeAccepted = new AutoResetEvent(false);
    private AutoResetEvent _waitForGrammarToLoad = new AutoResetEvent(false);
    private AutoResetEvent _waitForConversationToBeTerminated = new AutoResetEvent(false);
    private AutoResetEvent _waitForShutdownEventCompleted = new AutoResetEvent(false);
    private AutoResetEvent _waitForRecoCompleted = new AutoResetEvent(false);
    static void Main(string[] args)
    {
      UCMASpeechCalculator calc = new UCMASpeechCalculator();
      calc.Run();
    }

    public void Run()
    {
      // A helper class to take care of platform and endpoint setup and cleanup. 
      _helper = new UCMASampleHelper();

      // Create a user endpoint using the network credential object. 
      _userEndpoint = _helper.CreateEstablishedUserEndpoint("Speech Calculator Sample User");

      // Register a delegate to be called when an incoming audio/video call arrives.
      _userEndpoint.RegisterForIncomingCall<AudioVideoCall>(AudioVideoCall_Received);

      // Wait for the incoming call to be accepted.
      Console.WriteLine("Waiting for incoming call...");
      _waitForCallToBeAccepted.WaitOne();

      // Create a speech recognition connector and attach an AudioVideoFlow to it.
      SpeechRecognitionConnector speechRecognitionConnector = new SpeechRecognitionConnector();
      speechRecognitionConnector.AttachFlow(_audioVideoFlow);

      // Start the speech recognition connector.
      SpeechRecognitionStream stream = speechRecognitionConnector.Start();

      // Create a speech recognition engine.
      SpeechRecognitionEngine speechRecognitionEngine = new SpeechRecognitionEngine();
      speechRecognitionEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(SpeechRecognitionEngine_SpeechRecognized);

      // Register for notification of the LoadGrammarCompleted event.
      // This event is needed when a grammar is loaded asynchronously.
      speechRecognitionEngine.LoadGrammarCompleted += new EventHandler<LoadGrammarCompletedEventArgs>(SpeechRecognitionEngine_LoadGrammarCompleted);

      String currDirPath = Environment.CurrentDirectory;
      Grammar gr = new Grammar(currDirPath + @"\NumberOpNumber.grxml", "Expression");
      // Grammar gr = new Grammar(CreateGrammar());
      speechRecognitionEngine.LoadGrammarAsync(gr);
      _waitForGrammarToLoad.WaitOne();

      SpeechAudioFormatInfo speechAudioFormatInfo = new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, Microsoft.Speech.AudioFormat.AudioChannel.Mono);
      speechRecognitionEngine.SetInputToAudioStream(stream, speechAudioFormatInfo);
      Console.WriteLine("\r\nGrammar loaded, say exit to shut down.");

      // Start asynchronous, continuous speech recognition.
      speechRecognitionEngine.RecognizeAsync(RecognizeMode.Multiple);

      // Create a speech synthesis connector and attach the AudioVideoFlow instance to it.
      SpeechSynthesisConnector speechSynthesisConnector = new SpeechSynthesisConnector();
      speechSynthesisConnector.AttachFlow(_audioVideoFlow);

      // Create a speech synthesizer and set its output to the speech synthesis connector.
      _speechSynthesizer = new SpeechSynthesizer();
      SpeechAudioFormatInfo audioformat = new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, Microsoft.Speech.AudioFormat.AudioChannel.Mono);
      _speechSynthesizer.SetOutputToAudioStream(speechSynthesisConnector, audioformat);

      // Register for notification of the SpeakCompleted and SpeakStarted events on the speech synthesizer.
      _speechSynthesizer.SpeakStarted += new EventHandler<SpeakStartedEventArgs>(SpeechSynthesizer_SpeakStarted);
      _speechSynthesizer.SpeakCompleted += new EventHandler<SpeakCompletedEventArgs>(SpeechSynthesizer_SpeakCompleted);

      // Start the speech synthesis connector.
      speechSynthesisConnector.Start();
      _speechSynthesizer.Speak("Simple speech calculator");
      _speechSynthesizer.Speak("For example, you can say how much is six times nine?");
      _speechSynthesizer.Speak("When you are finished, say exit");

      // Pause the main thread until recognition is finished, which is indicated by the user saying "exit".
      _waitForRecoCompleted.WaitOne();

      // Stop the speech recognition connector.
      Console.WriteLine("Stopping the speech recognition connector");
      speechRecognitionConnector.Stop();

      // Detach the flow from the speech recognition connector, to prevent the flow from being kept in memory.
      speechRecognitionConnector.DetachFlow();

      // Stop the speech synthesis connector.
      Console.WriteLine("Stopping the speech synthesis connector.");
      speechSynthesisConnector.Stop();

      // Detach the flow from the speech synthesis connector, to prevent the flow from being kept in memory.
      speechSynthesisConnector.DetachFlow();

      // Terminate the call, the conversation, and then unregister the 
      // endpoint from receiving an incoming call. 
      _audioVideoCall.BeginTerminate(CallTerminateCB, _audioVideoCall);
      _waitForConversationToBeTerminated.WaitOne();

      // Shut down the platform.
      _helper.ShutdownPlatform();
    }
    // Helper method that creates and returns a GrammarBuilder grammar.
    private GrammarBuilder CreateGrammar()
    {
      GrammarBuilder[] gb = new GrammarBuilder[] { null, null };
      gb[0] = new GrammarBuilder(new Choices("exit"));
      gb[1] = new GrammarBuilder();
      gb[1].Append("how much is", 0, 1);

      string[] numberString = { "zero", "one", "two", "three", "four",
                               "five", "six", "seven", "eight", "nine", "ten",
                               "eleven", "twelve", "thirteen", "fourteen", "fifteen",
                               "sixteen", "seventeen", "eighteen", "nineteen", "twenty"};
      Choices numberChoices = new Choices();
      for (int i = 0; i < numberString.Length; i++)
      {
        numberChoices.Add(new SemanticResultValue(numberString[i], i));
      }
      gb[1].Append(new SemanticResultKey("number1", (GrammarBuilder)numberChoices));

      Choices operatorChoices = new Choices();
      operatorChoices.Add(new SemanticResultValue("plus", "+"));
      operatorChoices.Add(new SemanticResultValue("and", "+"));
      operatorChoices.Add(new SemanticResultValue("minus", "-"));
      operatorChoices.Add(new SemanticResultValue("times", "*"));
      operatorChoices.Add(new SemanticResultValue("multiplied by", "*"));
      operatorChoices.Add(new SemanticResultValue("divided by", "/"));
      gb[1].Append(new SemanticResultKey("operator", (GrammarBuilder)operatorChoices));
      gb[1].Append(new SemanticResultKey("number2", (GrammarBuilder)numberChoices));

      Choices choices = new Choices(gb);
      return new GrammarBuilder(choices);
    }

    // Validates the number, operator, and number semantic values extracted from the user's speech.
    // The results are spoken by the speech synthesizer.
    private void Calculate(int op1, char operation, int op2)
    {
      int result = 0;
      String prompt;
      String operationStr = "";
      if (operation == '/' && op2 == 0)
      {
        prompt = op1.ToString() + " divided by zero is undefined. You cannot divide by zero.";
        Console.WriteLine("{0} {1} {2} is undefined. You cannot divide by zero.",
              op1, operation, op2);
      }
      else
      {
        switch (operation)
        {
          case '+':
            result = op1 + op2;
            operationStr = " plus ";
            break;
          case '-':
            result = op1 - op2;
            operationStr = " minus ";
            break;
          case '*':
            result = op1 * op2;
            operationStr = " times ";
            break;
          case '/':
            result = op1 / op2;
            operationStr = " divided by ";
            break;
        }
        Console.WriteLine("{0} {1} {2} = {3}", op1, operation, op2, result);
        prompt = op1.ToString() + operationStr + op2.ToString() + " is " + result.ToString();
      }
      _speechSynthesizer.Speak(prompt);
    }
    #region EVENT HANDLERS
    // Delegate that is called when an incoming AudioVideoCall arrives.
    void AudioVideoCall_Received(object sender, CallReceivedEventArgs<AudioVideoCall> e)
    {
      _audioVideoCall = e.Call;
      _audioVideoCall.AudioVideoFlowConfigurationRequested += this.AudioVideoCall_FlowConfigurationRequested;

      // For logging purposes, register for notification of the StateChanged event on the call.
      _audioVideoCall.StateChanged +=
                new EventHandler<CallStateChangedEventArgs>(AudioVideoCall_StateChanged);

      // Remote Participant URI represents the far end (caller) in this conversation. 
      Console.WriteLine("Call received from: " + e.RemoteParticipant.Uri);

      // Now, accept the call. The CallAcceptCB callback runs when the accept operation completes.
      _audioVideoCall.BeginAccept(CallAcceptCB, _audioVideoCall);
    }

    // Handles the StateChanged event on the incoming audio/video call.
    void AudioVideoCall_StateChanged(object sender, CallStateChangedEventArgs e)
    {
      Console.WriteLine("Previous call state: " + e.PreviousState + "\nCurrent state: " + e.State);
    }

    // Handles the StateChanged event on the audio/video flow.
    private void AudioVideoFlow_StateChanged(object sender, MediaFlowStateChangedEventArgs e)
    {
      // When the flow is active, media operations can begin.
      if (e.State == MediaFlowState.Terminated)
      {
        // Detach the speech recognition connector, because the state of the flow is now Terminated.
        AudioVideoFlow avFlow = (AudioVideoFlow)sender;
        if (avFlow.SpeechRecognitionConnector != null)
        {
          avFlow.SpeechRecognitionConnector.DetachFlow();
        }
      }
    }

    // Handles the AudioVideoFlowConfigurationRequested event on the call.
    // This event is raised when a flow is available (that is, no longer null) to begin media operations.
    public void AudioVideoCall_FlowConfigurationRequested(object sender, AudioVideoFlowConfigurationRequestedEventArgs e)
    {
      Console.WriteLine("Flow Created.");
      _audioVideoFlow = e.Flow;

      // Now that the flow is non-null, bind a handler for the StateChanged event.
      // When the flow goes active (as indicated by the StateChanged event), the application can take media-related actions on the flow.
      _audioVideoFlow.StateChanged += new EventHandler<MediaFlowStateChangedEventArgs>(AudioVideoFlow_StateChanged);
    }

    // Handles the LoadGrammarCompleted event on the speech recognition engine.
    void SpeechRecognitionEngine_LoadGrammarCompleted(object sender, LoadGrammarCompletedEventArgs e)
    {
      Console.WriteLine("Grammar is now loaded.");

      // Synchronize with the main thread, which is waiting for the grammar to load.
      _waitForGrammarToLoad.Set();
    }

    // Handles the SpeechRecognized event on the speech recognition engine.
    // This method extracts the semantic values from the recognized speech.
    // If the user says "exit" the application begins to shut down.
    // Otherwise, the first number, operation, and second number are retrieved 
    // by the use of the "number1", "operator", and "number2" semantic keys.
    void SpeechRecognitionEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
      RecognitionResult recoResult = e.Result;
      Int32 op1, op2;
      char operation;
      if (recoResult != null)
      {
        Console.WriteLine("Speech recognized: " + recoResult.Text);

        if (recoResult.Text.Contains("exit"))
        {
          // Unblock the main thread so that the application can shut down.
          _waitForRecoCompleted.Set();
        }
        else
        {
          op1 = Convert.ToInt32(recoResult.Semantics["number1"].Value.ToString());
          operation = Convert.ToChar(recoResult.Semantics["operator"].Value.ToString());
          op2 = Convert.ToInt32(recoResult.Semantics["number2"].Value.ToString());

          Calculate(op1, operation, op2);
        }
      }
    }

    // Handles the SpeakStarted event on the speech synthesizer.
    void SpeechSynthesizer_SpeakStarted(object sender, SpeakStartedEventArgs e)
    {
      Console.WriteLine("SpeakStarted event raised.");
    }

    // Handles the SpeakCompleted event on the speech synthesizer.
    void SpeechSynthesizer_SpeakCompleted(object sender, SpeakCompletedEventArgs e)
    {
      Console.WriteLine("SpeakCompleted event raised.");
    }


    private void CallAcceptCB(IAsyncResult ar)
    {
      AudioVideoCall audioVideoCall = ar.AsyncState as AudioVideoCall;
      try
      {
        // Determine whether the call was accepted successfully.
        audioVideoCall.EndAccept(ar);
      }
      catch (RealTimeException exception)
      {
        // RealTimeException may be thrown on media or link-layer failures. 
        // A production application should catch additional exceptions, such as OperationFailureException,
        // OperationTimeoutException, and CallOperationTimeoutException.
        Console.WriteLine(exception.ToString());
      }
      finally
      {
        // Synchronize with main thread.
        _waitForCallToBeAccepted.Set();
      }
    }

    private void CallTerminateCB(IAsyncResult ar)
    {
      AudioVideoCall audioVideoCall = ar.AsyncState as AudioVideoCall;

      // Finish terminating the incoming call.
      audioVideoCall.EndTerminate(ar);

      // Unregister this event handler now that the call has been terminated.
      _audioVideoCall.StateChanged -= AudioVideoCall_StateChanged;

      // Terminate the conversation.
      _audioVideoCall.Conversation.BeginTerminate(ConversationTerminateCB, _audioVideoCall.Conversation);
    }

    private void ConversationTerminateCB(IAsyncResult ar)
    {
      Conversation conversation = ar.AsyncState as Conversation;

      // Finish terminating the conversation.
      conversation.EndTerminate(ar);

      // Unregister for incoming calls.
      _userEndpoint.UnregisterForIncomingCall<AudioVideoCall>(AudioVideoCall_Received);

      // Synchronize with main thread.
      _waitForConversationToBeTerminated.Set();
    }
    #endregion
  }
}

The helper class contains member methods that create and start a CollaborationPlatform instance, and then create and establish the UserEndpoint instance that is used in the sample. For more information about the helper class, see Using UCMA 3.0 BackToBackCall: Code Listing (Part 3 of 4).

The following sample is the Speech Recognition Grammar Specification (SRGS) grammar that is discussed in this article. This grammar, which is included only for reference purposes, is not needed by the application, because a GrammarBuilder grammar is created at run time.

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="en-US" mode="voice" root="Expression"
xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
  <rule id="Expression" scope="public"> 
    <example>four plus seven</example>
    <example>four and seven</example>
    <example>how much is four multiplied by seven</example>

    <tag>out.number1=0; out.operator=""; out.number2=0;</tag>
    <item repeat="0-1">how much is</item>
    <item>
      <ruleref uri="#Digit" type="application/srgs+xml"/>
      <tag>out.number1 = rules.latest();</tag>
    </item>
    <item>
      <ruleref uri="#Operator" type="application/srgs+xml"/>
      <tag>out.operator = rules.latest();</tag>
    </item>
    <item>
      <ruleref uri="#Digit" type="application/srgs+xml"/>
      <tag>out.number2 = rules.latest();</tag>
    </item>
  </rule>

  <rule id="Digit">
    <one-of>
      <item> zero <tag>out = 0; </tag> </item>
      <item> one <tag>out = 1; </tag> </item>
      <item> two <tag>out = 2; </tag> </item>
      <item> three <tag>out = 3; </tag> </item>
      <item> four <tag>out = 4; </tag> </item>
      <item> five <tag>out = 5; </tag> </item> 
      <item> six <tag>out = 6; </tag> </item>
      <item> seven <tag>out = 7; </tag> </item>
      <item> eight <tag>out = 8; </tag> </item>
      <item> nine <tag>out = 9; </tag> </item>
      <item> ten <tag>out = 10; </tag> </item>
      <item> eleven <tag>out = 11; </tag> </item>
      <item> twelve <tag>out = 12; </tag> </item>
      <item> thirteen <tag>out = 13; </tag> </item>
      <item> fourteen <tag>out = 14; </tag> </item>
      <item> fifteen <tag>out = 15; </tag> </item>
      <item> sixteen <tag>out = 16; </tag> </item>
      <item> seventeen <tag>out = 17; </tag> </item>
      <item> eighteen <tag>out = 18; </tag> </item>
      <item> nineteen <tag>out = 19; </tag> </item>
      <item> twenty <tag>out = 20; </tag> </item>
    </one-of>
  </rule>

  <rule id="Operator">
    <one-of>
      <item> plus <tag>out = "+"; </tag> </item>
      <item> and <tag>out = "+"; </tag> </item>
      <item> minus <tag>out = "-"; </tag> </item>
      <item> times <tag>out = "*"; </tag> </item>
      <item> multiplied by <tag>out = "*"; </tag> </item>
      <item> divided by <tag>out = "/"; </tag> </item>
    </one-of>
  </rule>
</grammar>

The intent of this article is to show how you can incorporate speech recognition and speech synthesis in your UCMA 3.0 application.

For speech recognition, the following grammars are presented.

  • A GrammarBuilder grammar that is created at application run time. The SpeechRecognitionConnector class supplies the connection between the UCMA 3.0 API and the SpeechRecognitionEngine class and the other classes in the Speech Platform SDK.

  • An SRGS grammar, in the form of a text file that exists before the application runs. Although the two grammars seem different at first glance, their actions are equivalent.

Both grammars are more sophisticated than their short lengths imply. Each grammar returns a result that carries a list of key-value pairs. The application uses semantic keys to access the semantic values that are contained in this list.
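For example, for the spoken input "how much is eight times thirteen", the semantics that reach the SpeechRecognized handler can be pictured as follows. The key names come from the grammars shown earlier in this series; this fragment restates the access pattern used in the handler in the listing.

```csharp
// Conceptual contents of recoResult.Semantics for "how much is eight times thirteen":
//   "number1"  -> 8     (from SemanticResultValue("eight", 8))
//   "operator" -> "*"   (from SemanticResultValue("times", "*"))
//   "number2"  -> 13    (from SemanticResultValue("thirteen", 13))
// The handler retrieves each value by its semantic key:
int op1 = Convert.ToInt32(recoResult.Semantics["number1"].Value.ToString());
char operation = Convert.ToChar(recoResult.Semantics["operator"].Value.ToString());
int op2 = Convert.ToInt32(recoResult.Semantics["number2"].Value.ToString());
```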

For speech synthesis, the SpeechSynthesisConnector class forms the connection between the UCMA 3.0 API and the SpeechSynthesizer class in the Speech Platform SDK.

Mark Parker is a programming writer at Microsoft whose current responsibility is the UCMA SDK documentation. Mark previously worked on the Microsoft Speech Server 2007 documentation.

© 2016 Microsoft