Working with speech recognition results for Windows Phone 8

[ This article is for Windows Phone 8 developers. If you’re developing for Windows 10, see the latest documentation. ]

At the end of a speech recognition operation, the speech recognizer returns a result that contains info about the outcome of recognition. Apps typically use the recognition result to determine the user's intent and what to do next, such as present a list of options, perform a search, confirm a user's input, or prompt the user for additional input.

If recognition was successful (a user's speech was matched to an active grammar), the recognition result includes the following info that apps typically find most useful:

  • The text of the recognized phrase.

  • A confidence rating assigned by the speech recognizer that indicates the certainty that the speech input matches the grammar.

  • The semantics for the recognized phrase (only for Speech Recognition Grammar Specification (SRGS) grammars).

This topic contains the following sections.

  • Recognized text

  • Handling profanity

  • Confidence rating

  • Semantic result

Recognized text

The text of the recognized phrase is the phrase in the grammar that is the speech recognizer's best match for speech input by a user. This may not be exactly what the user said, but it’s the closest match to a phrase in an enabled grammar for what the user said. In grammars that don’t contain semantics, an app typically uses the recognized text to interpret the user's intent and to initiate an action in response.

The following example shows how to access the text of the recognized phrase in the result object.

// Declare and initialize the SpeechRecognizerUI object at the class level.
SpeechRecognizerUI recoWithUI = new SpeechRecognizerUI();
        
// Handle the button click event.
private async void Reco1_Click(object sender, RoutedEventArgs e)
{
  // Display text to prompt the user's input.
  recoWithUI.Settings.ListenText = "Reminder...";

  // Give an example of ideal speech input.
  recoWithUI.Settings.ExampleText = "'Buy milk' or 'Call baby sitter'";
            
  // Load the pre-defined dictation grammar and start recognition.
  SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();

  // Access the text of the recognition result.
  MessageBox.Show("You said: " + recoResult.RecognitionResult.Text + "\nSave this reminder?");
}

Handling profanity

If the speech recognizer uses a predefined grammar type and a user speaks profanity, the profanity words within the recognized phrases are encapsulated in <profanity> tags in the speech recognition result's Text property.

If the SpeechRecognizerUI class is used, any profanity words are individually censored on the displayed Heard you say screen, but the actual recognition result is still returned with the <profanity> tags in place, as described earlier.

It’s up to you to handle the profanity returned in the recognition result in whatever way is appropriate for your app.
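For example, your app might mask the tagged words before displaying the recognized text. The following minimal sketch uses a regular expression to replace each tagged word with asterisks; it assumes the tags appear literally in the Text property as <profanity>word</profanity>, so verify the exact format against actual recognition results.

// A minimal sketch: mask each word wrapped in <profanity> tags with asterisks.
// Requires a using directive for System.Text.RegularExpressions.
private static string CensorProfanity(string recognizedText)
{
  return Regex.Replace(
      recognizedText,
      @"<profanity>\s*(.*?)\s*</profanity>",
      match => new string('*', match.Groups[1].Value.Length),
      RegexOptions.IgnoreCase | RegexOptions.Singleline);
}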

Confidence rating

The confidence rating is the speech recognizer's assessment of how accurately it matched a user's speech to a phrase in an active grammar. A speech recognizer may assign a low confidence score to spoken input for various reasons, including background interference, inarticulate speech, or unanticipated words or word sequences.

The speech recognizer may return one or more possible recognized phrases (called alternates) in the result for a recognition operation. The alternates are phrases from the grammar. Recognition alternates may be used in any of the following ways:

  • If there’s only one alternate and its confidence score meets or exceeds the preset threshold, then the recognizer matches the alternate to the speech input.

  • If there’s only one alternate and its confidence score does not meet or exceed the preset threshold, then the alternate is rejected and the user's speech is not recognized.

  • If there are multiple recognition alternates for speech input that meet or exceed the preset threshold, and one alternate has a substantially higher confidence score than the other alternates, then the recognizer matches the higher scoring alternate to the speech input.

  • If there are multiple recognition alternates for speech input that meet or exceed the preset threshold, and their confidence scores are similar, the standard GUI for speech on Windows Phone 8 displays a Did you say screen. The Did you say screen displays and optionally speaks up to four phrases from the grammar that most likely match what the user spoke.

If the recognition operation is initiated with a call to SpeechRecognizerUI.RecognizeWithUIAsync() and only local grammars are used for matching, the Did you say screen automatically processes the recognition alternates and confidence scores. If the utterance was matched with a Low or Medium confidence score and the speech recognizer returned alternates, the Did you say screen displays up to four of the highest-scoring alternates.
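If your app calls the SpeechRecognizer class directly instead of using the built-in UI, it can examine the alternates itself and implement its own disambiguation. The following minimal sketch assumes the SpeechRecognitionResult class exposes a GetAlternates method that returns the top-scoring alternates; verify the exact signature against the class reference before relying on it.

// A minimal sketch of inspecting recognition alternates without the built-in UI.
private async void RecoAlternates_Click(object sender, RoutedEventArgs e)
{
  SpeechRecognizer recognizer = new SpeechRecognizer();

  // Start recognition against the default dictation grammar.
  SpeechRecognitionResult result = await recognizer.RecognizeAsync();

  // Retrieve up to four alternates, mirroring what the Did you say screen shows.
  foreach (SpeechRecognitionResult alternate in result.GetAlternates(4))
  {
    System.Diagnostics.Debug.WriteLine(
        alternate.Text + " (" + alternate.TextConfidence + ")");
  }
}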

Your app can use the confidence level returned with the recognition result to decide how to act on it. For example, your app may want to accept only recognition results with a confidence rating of "high", or with a rating of "medium" or better. The following example checks the confidence level of the recognition result and creates different responses to speech input depending on the confidence level returned.

// Declare and initialize the SpeechRecognizerUI object at the class level.
SpeechRecognizerUI recoWithUI = new SpeechRecognizerUI();
        
// Handle the button click event.
private async void Reco1_Click(object sender, RoutedEventArgs e)
{
  // Display text to prompt the user's input.
  recoWithUI.Settings.ListenText = "Find what?";

  // Give an example of ideal speech input.
  recoWithUI.Settings.ExampleText = " 'Coffee', 'ATMs', 'restaurants' ";

  // Prevent the Heard you say screen from displaying.
  recoWithUI.Settings.ShowConfirmation = false;
            
  // Add the web search grammar to the grammar set.
  recoWithUI.Recognizer.Grammars.AddGrammarFromPredefinedType(
          "webSearch", SpeechPredefinedGrammar.WebSearch);

  // Load the grammar set and start recognition.
  SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();

  // Check the confidence level of the recognition result. In the
  // SpeechRecognitionConfidence enumeration, High has the lowest numeric value,
  // so compare against the named values instead of casting to int.
  if (recoResult.RecognitionResult.TextConfidence == SpeechRecognitionConfidence.Low ||
      recoResult.RecognitionResult.TextConfidence == SpeechRecognitionConfidence.Rejected)
  {
    // If the confidence level of the result is too low, prompt the user to try again and restart recognition.
    MessageBox.Show("Not sure what you said! Try again.");
    await recoWithUI.RecognizeWithUIAsync();
  }
                
  // If the confidence level of the result is good, display a confirmation.
  else
  {
    MessageBox.Show("Searching for " + recoResult.RecognitionResult.Text);
  }
}
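The examples in this topic read RecognitionResult as soon as RecognizeWithUIAsync() returns. In practice a user can cancel the speech UI, so it’s safer to test the ResultStatus property of the SpeechRecognitionUIResult before using the result. The following sketch shows one way to guard the final step of the previous example.

// A defensive variation: confirm that recognition completed successfully
// before reading the recognition result.
SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();

if (recoResult.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
{
  MessageBox.Show("Searching for " + recoResult.RecognitionResult.Text);
}
else
{
  // The user canceled the UI, the recognizer was busy, or recognition was preempted.
  MessageBox.Show("Recognition did not complete: " + recoResult.ResultStatus);
}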

Semantic result

Semantics are info that a grammar author adds when creating an XML grammar that conforms to the Speech Recognition Grammar Specification (SRGS) Version 1.0. Semantics assign meaning to the contents of an item element or a ruleref element and the phrases they define. For example, a semantic assignment might attach an airport code to each city name in a list of city names that can be recognized. When a user speaks a city name that matches the grammar, the speech recognition engine returns the airport code in addition to the name of the city.

The following excerpt from an SRGS grammar illustrates how to assign a semantic value in the form of a string literal to the contents of an item element. The tag element within the item element defines the semantic value.

<rule id="flightCities">
  <one-of>
    <item>
      New York <tag> out = "JFK" </tag>
    </item>
    <item>
      London <tag> out = "LHR" </tag>
    </item>
    <item>
      Beijing <tag> out = "PEK" </tag>
    </item>
  </one-of>
</rule>

A semantic assignment to a ruleref element could indicate that the airport name returned by one rule reference to a list of airports is the point of origin for a flight, and that the airport name returned by another rule reference to the same list is the flight's destination. In the following example, the first rule reference to the rule named flightCities is directly followed by a tag element. The tag element creates a property called LeavingFrom for the Rule Variable of the rule named flightBooker, and assigns the recognition result from the flightCities rule to the LeavingFrom property. The second reference to the flightCities rule assigns the recognition result from the flightCities rule to the GoingTo property.

<grammar mode="voice"
         root="flightBooker"
         tag-format="semantics/1.0"
         version="1.0"
         xml:lang="en-US"
         xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="flightBooker" scope="public">
    
    <item repeat="0-1"> I want to </item>
    
    <item> fly from </item>

    <item>
      <ruleref uri="#flightCities"/>
      <tag>out.LeavingFrom = rules.latest();</tag>
    </item>

    <item> to </item>

    <item>
      <ruleref uri="#flightCities"/>
      <tag>out.GoingTo = rules.latest();</tag>
    </item>

  </rule>

  <rule id="flightCities">
    <one-of>
      <item>
        New York <tag> out = "JFK" </tag>
      </item>
      <item>
        London <tag> out = "LHR" </tag>
      </item>
      <item>
        Beijing <tag> out = "PEK" </tag>
      </item>
    </one-of>
  </rule>

</grammar>

To make sure that an SRGS grammar is correctly deployed, add it to your solution, set the Build Action property for the SRGS grammar to Content, and set the Copy To Output Directory property to Copy if newer.
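In a standard project file, those two property settings correspond to an entry like the following (shown as a sketch; in MSBuild, the Copy if newer option is stored as PreserveNewest):

<Content Include="FlightBooker.grxml">
  <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>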

The following example demonstrates how to retrieve semantic values defined by the flightBooker and flightCities rules (the airport codes and the flight origin and destination) from the recognition result.

// Declare and initialize the SpeechRecognizerUI object at the class level.
SpeechRecognizerUI recoWithUI = new SpeechRecognizerUI();

// Handle the button click event.
private async void Reco1_Click(object sender, RoutedEventArgs e)
{
  // Display text to prompt the user's input.
  recoWithUI.Settings.ListenText = "Fly from where to where?";

  // Give an example of ideal speech input.
  recoWithUI.Settings.ExampleText = "'New York to London', 'London to Beijing', 'Beijing to New York'";

  // Initialize a URI with a path to the SRGS-compliant XML file.
  Uri bookFlights = new Uri("ms-appx:///FlightBooker.grxml", UriKind.Absolute);

  // Add the grammar to the grammar set.
  recoWithUI.Recognizer.Grammars.AddGrammarFromUri("bookFlights", bookFlights);

  // Load the grammar set and start recognition.
  SpeechRecognitionUIResult recoResult = await recoWithUI.RecognizeWithUIAsync();

  // Show the semantic information retrieved from the recognition result.
  MessageBox.Show("The origin airport is " + recoResult.RecognitionResult.Semantics["LeavingFrom"].Value
        + "\nThe destination airport is " + recoResult.RecognitionResult.Semantics["GoingTo"].Value);
}
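The previous example assumes that both semantic keys are present. If the spoken phrase doesn't traverse both rule references, the LeavingFrom or GoingTo keys may be absent from the Semantics dictionary, and indexing into it will throw. A defensive variation, assuming the Semantics property supports dictionary-style key lookups as the example above implies, checks for the keys first.

// A defensive variation: confirm both semantic keys exist before reading them.
var semantics = recoResult.RecognitionResult.Semantics;

if (semantics.ContainsKey("LeavingFrom") && semantics.ContainsKey("GoingTo"))
{
  MessageBox.Show("The origin airport is " + semantics["LeavingFrom"].Value
        + "\nThe destination airport is " + semantics["GoingTo"].Value);
}
else
{
  MessageBox.Show("Couldn't determine both airports. Please try again.");
}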

For more info about using semantics in SRGS grammars, see tag Element, Referencing Grammar Rule Variables, Using the tag element, and Semantic Results Content.

See Also

Other Resources

Speech recognition for Windows Phone 8

Speech for Windows Phone 8