Using Speech Recognition in UCMA 3.0 and Lync 2010: UCMA Application (Part 3 of 5)

Summary:   This is the third in a series of five articles that describe how a Microsoft Unified Communications Managed API (UCMA) 3.0 application and a Microsoft Lync 2010 application can be combined to perform speech recognition. Part 3 describes the parts of the UCMA 3.0 application that are involved with speech recognition.

Applies to:   Microsoft Lync Server 2010 | Microsoft Unified Communications Managed API 3.0 Core SDK | Microsoft Lync 2010

Published:   April 2011 | Provided by:   Mark Parker, Microsoft | About the Author

Contents

  • Setting Up Speech Recognition in the UCMA Application

  • The Recognition Grammar

  • Processing the Recognition Result

  • Part 4

  • Additional Resources


This is the third in a five-part series of articles that describe how to incorporate speech recognition in Lync 2010 applications that interoperate with UCMA 3.0.

Setting Up Speech Recognition in the UCMA Application

Before you set up speech recognition, ensure that an AudioVideoFlow instance exists. For information about how to create an AudioVideoFlow instance, see the CreateAudioVideoFlow method in Using Speech Recognition in UCMA 3.0 and Lync 2010: Code Walkthrough (Part 5 of 5).
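
The following is a minimal sketch of one common way to obtain the flow and wait for it to become active; it assumes an established AudioVideoCall in a field named _audioVideoCall and a ManualResetEvent named _waitForFlowActive, both of which are hypothetical names used only for illustration. The CreateAudioVideoFlow method in Part 5 shows the approach that the sample application actually uses.

// A sketch only: capture the flow when the platform offers it, and signal
// when the flow becomes active. The field names here are hypothetical.
_audioVideoCall.AudioVideoFlowConfigurationRequested += (sender, e) =>
{
    _audioVideoFlow = e.Flow;
    _audioVideoFlow.StateChanged += (s, args) =>
    {
        // Speech recognition setup should not begin until the flow is active.
        if (args.State == MediaFlowState.Active)
        {
            _waitForFlowActive.Set();
        }
    };
};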

Note

To see all of the UCMA and Lync code that is used in the scenario, see Using Speech Recognition in UCMA 3.0 and Lync 2010: Code Walkthrough (Part 5 of 5).

To set up speech recognition in the UCMA application

  1. Create a SpeechRecognitionConnector instance, and then use the AttachFlow method to attach the SpeechRecognitionConnector instance to the AudioVideoFlow instance.

    SpeechRecognitionConnector speechRecognitionConnector = new SpeechRecognitionConnector();
    speechRecognitionConnector.AttachFlow(_audioVideoFlow);
    
  2. Start the speech recognition connector by using the Start method. The Start method returns a SpeechRecognitionStream object, which is used in step 5 to connect the speech recognition connector to the speech recognition engine that is created in step 3.

    SpeechRecognitionStream stream = speechRecognitionConnector.Start();
    
  3. Create a SpeechRecognitionEngine instance, and then register to receive notifications for the SpeechRecognized and LoadGrammarCompleted events. (A sketch of a LoadGrammarCompleted handler appears after this procedure.) The SpeechRecognitionEngine class belongs to the Microsoft.Speech.Recognition namespace.

    _speechRecognitionEngine = new SpeechRecognitionEngine();
    _speechRecognitionEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(speechRecognitionEngine_SpeechRecognized);
    _speechRecognitionEngine.LoadGrammarCompleted += new EventHandler<LoadGrammarCompletedEventArgs>(speechRecognitionEngine_LoadGrammarCompleted);
    
  4. Create a Grammar object and then load it into the SpeechRecognitionEngine instance. The grammar that is loaded in this step is discussed in the next section.

    Grammar gr = new Grammar(@"C:\Users\mp\Documents\Visual Studio 2008\Projects\ChooseFlight\Airports.grxml", "Main");
    _speechRecognitionEngine.LoadGrammarAsync(gr);
    
  5. Connect the audio stream that is created in step 2 to the SpeechRecognitionEngine instance.

    SpeechAudioFormatInfo speechAudioFormatInfo = new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, Microsoft.Speech.AudioFormat.AudioChannel.Mono);
    _speechRecognitionEngine.SetInputToAudioStream(stream, speechAudioFormatInfo);
    
  6. Begin recognition by calling the RecognizeAsync method. The RecognizeMode.Multiple argument specifies that recognition continues until it is explicitly stopped or canceled (for example, by calling RecognizeAsyncStop or RecognizeAsyncCancel).

    _speechRecognitionEngine.RecognizeAsync(RecognizeMode.Multiple);
    
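Step 3 registers a handler for the LoadGrammarCompleted event. The following is a minimal sketch of what such a handler might look like, assuming that the application only logs the outcome of the asynchronous load; the handler that the sample application actually uses appears in Using Speech Recognition in UCMA 3.0 and Lync 2010: Code Walkthrough (Part 5 of 5).

// A sketch only: report whether the grammar loaded successfully.
void speechRecognitionEngine_LoadGrammarCompleted(object sender, LoadGrammarCompletedEventArgs e)
{
  if (e.Error != null)
  {
    Console.WriteLine("Grammar did not load: " + e.Error.Message);
    return;
  }
  Console.WriteLine("Grammar with root rule " + e.Grammar.RuleName + " is loaded.");
}
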

The Recognition Grammar

The UCMA 3.0 application that is discussed in this series of articles uses a Speech Recognition Grammar Specification (SRGS) XML grammar, which appears in the next example. For more information about this kind of grammar, see Speech Recognition Grammar Specification Version 1.0, Semantic Interpretation for Speech Recognition (SISR) Version 1.0, and Speech Recognition.

The following code, Airports.grxml, shows the SRGS XML grammar that is loaded into the speech recognition engine in the UCMA 3.0 application.

<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="en-US" mode="voice" root="Main"
xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">

<rule id="Main" scope="public">
  <example>I would like to fly from Seattle to Denver</example>
  <example>I want to fly from San Francisco to Miami</example>
  <example>I want a ticket from Vancouver to Denver</example>
  <tag>out.Origination=""; out.Destination="";</tag>
  <one-of>
     <item>I would like to fly </item>
     <item>I want to fly </item>  
     <item>I want a ticket </item>
  </one-of>
  <item>from</item>
  <ruleref uri="#Cities" type="application/srgs+xml"/>  
  <tag>out.Origination=rules.latest();</tag>
  <item>to</item>
  <ruleref uri="#Cities" type="application/srgs+xml"/> 
  <tag>out.Destination=rules.latest();</tag>
</rule>

<rule id="Cities">
<one-of>
  <item> Atlanta <tag>out="Atlanta, GA";</tag> </item>
  <item> Baltimore <tag>out="Baltimore, MD";</tag> </item>
  <item> Boston <tag>out="Boston, MA";</tag> </item>
  <item> Dallas <tag>out="Dallas, TX";</tag> </item>
  <item> Denver <tag>out="Denver, CO";</tag> </item>
  <item> Detroit <tag>out="Detroit, MI";</tag> </item>
  <item> Jackson <tag>out="Jackson, MS";</tag> </item>
  <item> Miami <tag>out="Miami, FL";</tag> </item>
  <item> New York <tag>out="New York, NY";</tag> </item>
  <item> Philadelphia <tag>out="Philadelphia, PA";</tag> </item>
  <item> Phoenix <tag>out="Phoenix, AZ";</tag> </item>
  <item> San Francisco <tag>out="San Francisco, CA";</tag> </item>
  <item> Seattle <tag>out="Seattle, WA";</tag> </item>
  <item> Vancouver <tag>out="Vancouver, BC";</tag> </item>
</one-of>
</rule>
</grammar>

The Airports.grxml grammar can be used to recognize utterances in the following form.

{I would like to fly | I want to fly | I want a ticket} {from} {City1} {to} {City2}

The grammar consists of two rules: Main and Cities. The Main rule returns an object that has two fields: Origination and Destination. The UCMA 3.0 application is set up to use the object returned by this grammar.

The Main rule consists of the following elements (in sequential order).

  • A one-of element that contains three item elements (I would like to fly, I want to fly, I want a ticket).

    A recognition match occurs if the utterance matches one of the items in the one-of element.

  • An item element (from).

  • A ruleref element (a reference to the Cities rule).

    A recognition match occurs if the utterance contains the name of one of the cities that are listed in the Cities rule. If a match occurs, the Cities rule returns a string that consists of the city name and state abbreviation.

  • A tag element that stores the most recently returned value in the Origination field.

  • An item element (to).

  • A ruleref element (a second reference to the Cities rule).

  • A tag element that stores the most recently returned value in the Destination field.
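
For example, given the grammar shown earlier, one of its own sample utterances produces the following semantic values, which the next section shows the application reading from the recognition result.

Utterance:    "I would like to fly from Seattle to Denver"
Origination:  "Seattle, WA"
Destination:  "Denver, CO"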

Processing the Recognition Result

When the Lync 2010 user makes an utterance, the audio stream is sent to the UCMA 3.0 application, where it is processed by the SpeechRecognitionEngine instance. If the utterance matches the recognition engine’s grammar, the SpeechRecognized event is raised. The following example is the handler for the SpeechRecognized event. This handler was registered in step 3 in the previous procedure.

The Airports.grxml grammar returns an object that has an Origination field and a Destination field. This object can be accessed through the e parameter (of type SpeechRecognizedEventArgs) of the event handler, from the expression e.Result.Semantics.

// Event handler for the SpeechRecognized event on the SpeechRecognitionEngine.
void speechRecognitionEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  RecognitionResult recoResult = e.Result;
  String str; 
  String origCity, destCity;
  String[] prices = new String[4] { "$368.00", "$429.00", "$525.00", "$631.00" };
  int idx;
  Random rand = new Random();
    
  if (recoResult != null)
  {
    Console.WriteLine("Speech recognized: " + recoResult.Text);
    origCity = recoResult.Semantics["Origination"].Value.ToString();
    destCity = recoResult.Semantics["Destination"].Value.ToString();
    str = origCity;
    str = String.Concat(str, ";");
    str = String.Concat(str, destCity);
    str = String.Concat(str, ";");
    
    // The (bogus) cost of the flight, chosen at random from the four prices.
    // The upper bound of Random.Next is exclusive, so use the array length.
    idx = rand.Next(0, prices.Length);
    str = String.Concat(str, prices[idx]);
    SendDataToRemote(str);
  }
}

After the handler retrieves the values of the Origination and Destination fields of the object that is returned by the grammar, it constructs a string that contains the two city names and a random value from an array of four ticket prices. The city names and the ticket price are separated by semicolons. This string is passed to the SendDataToRemote helper method, which converts the string to an array of type Byte, and then sends it through the context channel to the Lync 2010 application.

The SendDataToRemote method is defined in Using Speech Recognition in UCMA 3.0 and Lync 2010: Code Walkthrough (Part 5 of 5). For more information about context channel usage, see Using UCMA 3.0 and Lync 2010 for Contextual Communication: Creating the Lync Application (Part 4 of 6).
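
The following is a minimal sketch of how a helper such as SendDataToRemote might be written, assuming that the application holds an established ConversationContextChannel in a field named _channel (a hypothetical name used only for illustration); the method that the sample application actually uses appears in Part 5.

// A sketch only: convert the semicolon-delimited string to a byte array and
// send it over the context channel. The _channel field is hypothetical.
private void SendDataToRemote(string str)
{
  byte[] body = System.Text.Encoding.UTF8.GetBytes(str);
  _channel.BeginSendData(
    new System.Net.Mime.ContentType("text/plain; charset=utf-8"),
    body,
    ar => { _channel.EndSendData(ar); },
    null);
}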

Part 4

Using Speech Recognition in UCMA 3.0 and Lync 2010: Lync Application (Part 4 of 5)

Additional Resources

For more information, see the following resources:

About the Author

Mark Parker is a programming writer at Microsoft whose current responsibility is the UCMA SDK documentation. Mark previously worked on Microsoft Speech Server 2007 documentation.