November 2012

Volume 27 Number 11

Windows Phone - Speech-Enabling a Windows Phone 8 App with Voice Commands

By F Avery | November 2012

One recent evening I was running late for an after-work meeting with an old friend. I knew he was already driving to the rendezvous, so calling him would be out of the question. Nevertheless, as I dashed out of my office and ran toward my car, I grabbed my Windows Phone and held down the Start button. When I heard the “earcon” listening prompt, I said, “Text Robert Brown,” and when the text app started up, I said, “Running late, leaving office now,” followed by “send” to send the text message.

Without the speech features in the built-in texting app, I would’ve had to stop running and fumble around in frustration to send a text because I find the keypad hard to use with my fat fingers and the screen difficult to read while on the run. Using speech to text saved me time, frustration and no small amount of anxiety.

Windows Phone 8 offers these same speech features for developers to interact with their users through speech recognition and text-to-speech. These features support the two scenarios illustrated in my example: From anywhere on the phone, the user can say a command to launch an app and carry out an action with just one utterance; and once in the app, the phone carries on a dialog with the user by capturing commands or text from the speaker’s spoken utterances and by audibly rendering text to the user for notification and feedback.

The first scenario is supported by a feature called voice commands. To enable this feature, the app provides a Voice Command Definition (VCD) file to specify a set of commands that the app is equipped to handle. When the app is launched by voice commands, it receives parameters in a query string such as the command name, parameter names and the recognized text that it can use to execute the command specified by the user. This first installment of a two-part article explains how to enable voice commands in your app on Windows Phone 8.

In the second installment I’ll discuss in-app speech dialog. To support this, Windows Phone 8 provides an API for speech recognition and synthesis. This API includes a default UI for confirmation and disambiguation as well as default values for speech grammars, timeouts and other properties, making it possible to add speech recognition to an app with just a few lines of code. Similarly, the speech synthesis API (also known as text-to-speech, or TTS) is easy to code for simple scenarios; it also provides advanced features such as fine-tuned manipulation via the World Wide Web Consortium Speech Synthesis Markup Language (SSML) and switching between end-user voices already on the phone or downloaded from the marketplace. Stay tuned for a detailed exploration of this feature in the follow-up article.

To demonstrate these features, I’ve developed a simple app called Magic Memo. You can launch Magic Memo and execute a command by holding the Start button and then speaking a command when prompted. Inside the app, you can enter your memo using simple dictation or navigate within the app and execute commands using speech. Throughout this article, I’ll explain the source code that implements these features.

Requirements for Using Speech Features in Apps

The Magic Memo app should work out of the box, assuming your development environment meets the hardware and software requirements for developing Windows Phone 8 apps and testing on the phone emulator. When this article went to press the requirements were as follows:

  • 64-bit version of Windows 8 Pro or higher
  • 4GB or more of RAM
  • Second Level Address Translation supported by the BIOS
  • Hyper-V installed and running
  • Visual Studio 2012 Express for Windows Phone or higher

As always, it’s best to check MSDN documentation for the latest requirements before attempting to develop and run your app.

Three other things to keep in mind when you develop your own app from scratch:

  1. Ensure that the device microphone and speaker are working properly.

  2. Add capabilities for speech recognition and microphone to the WMAppManifest.xml file, either by checking the appropriate boxes in the properties editor or by manually including the following in the XML file:

        <Capability Name="ID_CAP_SPEECH_RECOGNITION"/>
        <Capability Name="ID_CAP_MICROPHONE"/>

  3. When attempting speech recognition, you should catch the exception thrown when the user hasn’t accepted the speech privacy policy. The GetNewMemoByVoice helper function in MainPage.xaml.cs in the accompanying sample code download gives an example of how to do this; a minimal sketch follows this list.
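
The following sketch shows one way to handle this case. It’s modeled on the sample’s GetNewMemoByVoice helper but is not the actual sample code; in particular, the memoTextBox control is hypothetical, and the 0x80045509 HRESULT used to detect the privacy-policy error is my assumption and should be verified against the current MSDN documentation:

using System;
using System.Windows;
using Windows.Phone.Speech.Recognition;
// ...
// Prompts for speech input; handles the case where the user
// hasn't yet accepted the speech privacy policy
private async void GetNewMemoByVoice()
{
  try
  {
    var recoUI = new SpeechRecognizerUI();
    SpeechRecognitionUIResult result = await recoUI.RecognizeWithUIAsync();
    if (result.ResultStatus == SpeechRecognitionUIStatus.Succeeded)
    {
      memoTextBox.Text = result.RecognitionResult.Text; // Hypothetical TextBox
    }
  }
  catch (Exception ex)
  {
    // Assumed HRESULT for "speech privacy policy not accepted"
    if ((uint)ex.HResult == 0x80045509)
    {
      MessageBox.Show(
        "To use speech, you must first accept the speech privacy policy.");
    }
  }
}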

The Scenario

On any smartphone, a common scenario is to launch an app and execute a single command, optionally followed by more commands. Doing this manually requires several steps: finding the app, navigating to the right place, finding the button or menu item, tapping on that, and so on. Many users find this frustrating even after they’ve become accustomed to the steps.

For example, to display a saved memo—for instance, memo No. 12—in the Magic Memo sample app, the user must find and launch the app, tap on “View saved memos” and scroll down until the desired memo is displayed. Contrast this with the experience of using the Windows Phone 8 voice commands feature: The user holds the Start button and says “Magic Memo, show memo 12,” after which the Magic Memo app is launched and the desired memo is displayed in a message box. Even for this simple command, there’s a clear savings in user interaction.

There are three steps to implementing voice commands in an app and an optional fourth step for handling dynamic content. The following sections outline those steps.

Specifying the User Commands to Recognize

The first step to implementing voice commands is to specify the commands to listen for in a VCD file. A VCD file is authored in a simple XML format consisting of a collection of CommandSet elements, each with Command child elements that contain the phrases to listen for. An example from the Magic Memo app is shown in Figure 1.

Figure 1 Voice Command Definition File for the Magic Memo App

<?xml version="1.0" encoding="utf-8"?>
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.0">
  <CommandSet xml:lang="en-us" Name="MagicMemoEnu">
    <!-- Command set for all US English commands-->
    <CommandPrefix>Magic Memo</CommandPrefix>
    <Example>enter a new memo</Example>

    <Command Name="newMemo">
      <Example>enter a new memo</Example>
      <ListenFor>Enter [a] [new] memo</ListenFor>
      <ListenFor>Make [a] [new] memo</ListenFor>
      <ListenFor>Start [a] [new] memo</ListenFor>
      <Feedback>Entering a new memo</Feedback>
      <Navigate />    <!-- Navigation defaults to Main page -->
    </Command>

    <Command Name="showOne">
      <Example>show memo number two</Example>
      <ListenFor>show [me] memo [number] {num} </ListenFor>
      <ListenFor>display memo [number] {num}</ListenFor>
      <Feedback>Showing memo number {num}</Feedback>
      <Navigate Target="/ViewMemos.xaml"/>
    </Command>

    <PhraseList Label="num">
      <Item> 1 </Item>
      <Item> 2 </Item>
      <Item> 3 </Item>
    </PhraseList>
  </CommandSet>

  <CommandSet xml:lang="ja-JP" Name="MagicMemoJa">
    <!-- Command set for all Japanese commands -->
    <CommandPrefix>マジック・メモ</CommandPrefix>
    <Example>新規メモ</Example>

    <Command Name="newMemo">
      <Example>新規メモ</Example>
      <ListenFor>新規メモ[を]</ListenFor>
      <ListenFor>新しいメモ</ListenFor>
      <Feedback>メモを言ってください</Feedback>
      <Navigate/>
    </Command>

    <Command Name="showOne">
      <Example>メモ1を表示</Example>
      <ListenFor>メモ{num}を表示[してください] </ListenFor>
      <Feedback>メモ{num}を表示します。 </Feedback>
      <Navigate Target="/ViewMemos.xaml"/>
    </Command>

    <PhraseList Label="num">
      <Item> 1 </Item>
      <Item> 2 </Item>
      <Item> 3 </Item>
    </PhraseList>
  </CommandSet>
</VoiceCommands>

Following are guidelines to design a VCD file:

  1. Keep the command prefix phonetically distinct from Windows Phone keywords. This will help to avoid confusing your app with a built-in phone feature. For U.S. English, the keywords are call, dial, start, open, find, search, text, note and help.
  2. Make your command prefix a subset or a natural pronunciation of your app name rather than something completely different. This will avoid user confusion and reduce the chance of misrecognizing your app for some other app or feature.
  3. Keep in mind that recognition of the command prefix requires an exact match. Thus, it’s a good idea to keep the command prefix simple and easy to remember.
  4. Give each command set a Name attribute so that you can access it in your code.
  5. Keep ListenFor elements in different Command elements phonetically distinct from each other to reduce the chances of misrecognition.
  6. Ensure that ListenFor elements in the same command are different ways to specify the same command. If ListenFor elements in a command correspond to more than one action, split them into separate commands. This will make it easier to handle the commands in your app.
  7. Keep in mind the limits: 100 Command elements in a command set; 10 ListenFor entries in a command; 50 total PhraseList elements; and 2,000 total PhraseList items across all PhraseLists.
  8. Keep in mind that recognition on PhraseList elements requires an exact match, not a subset. Thus, to recognize both “Star Wars” and “Star Wars Episode One,” you should include both as PhraseList elements.
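
To illustrate guideline 8, a hypothetical phrase list of movie titles would have to spell out every complete phrase you want to recognize; a spoken phrase that matches only part of an item won’t match the item:

<PhraseList Label="movie">
  <Item> Star Wars </Item>
  <Item> Star Wars Episode One </Item>
</PhraseList>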

In the Figure 1 example there are two CommandSet elements, each with different xml:lang and Name attributes. There can be only one CommandSet per xml:lang value. The Name attribute values must also be unique, but they’re otherwise unrestricted. Though the Name attribute is optional, I highly recommend including it because you’ll need it to access the CommandSet from your app code to implement step 4. Also note that only one CommandSet is active for your app at a time: the one whose xml:lang attribute exactly matches that of the current global speech recognizer, as set by the user in Settings/Speech. You should include a CommandSet for each language you expect your users to require in their market.

The next thing to notice is the CommandPrefix element. Think of this as an alias the user can say to invoke your app. This is useful if your app name has nonstandard spelling or unpronounceable characters, such as Mag1c or gr00ve. Remember that this word or phrase has to be something the speech recognition engine can recognize, and it must be phonetically distinct from the Windows Phone built-in keywords.

You’ll note there are Example elements as children of both the CommandSet element and the Command elements. The Example under CommandSet is a general example for your app that shows up on the system help “What can I say?” screen, as shown in Figure 2. In contrast, the Example element under a Command is specific to that command; it shows up on a system help page (see Figure 3) that’s displayed when the user taps the app name on the help page shown in Figure 2.

Figure 2 Help Page Showing Voice Command Examples for Installed Apps

Figure 3 Example Page for Magic Memo Voice Commands

Speaking of which, each Command child element within a CommandSet corresponds to an action to take in the app once launched. There may be multiple ListenFor elements in a Command, but they should all be different ways of telling the app to carry out the action (command) of which they are a child.

Note also that the text in a ListenFor element can contain two special constructs. Square brackets around text mean the text is optional; that is, the user’s utterance will be recognized with or without the enclosed text. Curly braces contain a label that references a PhraseList element. In the U.S. English example in Figure 1, the first ListenFor under the “showOne” command has a label {num} referencing the phrase list below it. You can think of this as a slot that can be filled with any of the phrases in the referenced list, in this case numbers.

What happens when a command is recognized in the user’s utterance? The phone’s global speech recognizer will launch the app at the page specified in the Target attribute of the Navigate element under the corresponding Command, as explained later in step 3. But first, I’ll discuss step 2.

Enabling Voice Commands

Once you’ve included the VCD file in your installation package, step 2 is to register the file so that Windows Phone 8 can include the app’s commands in the system grammar. You do this by calling the static method InstallCommandSetsFromFileAsync on the VoiceCommandService class, as shown in Figure 4. Most apps will make this call on first run, but of course it can be done at any time. The implementation of VoiceCommandService is smart enough to do nothing on subsequent calls if there has been no change in the file, so don’t worry about the fact that it’s called on each launch of the app.

Figure 4 Initializing the VCD file from Within the App

using Windows.Phone.Speech.VoiceCommands;
// ...
// Standard boilerplate method in the App class in App.xaml.cs
private async void Application_Launching(object sender, 
  LaunchingEventArgs e)
{
  try // try block recommended to detect compilation errors in VCD file
  {
    await VoiceCommandService.InstallCommandSetsFromFileAsync(
      new Uri("ms-appx:///MagicMemoVCD.xml"));
  }
  catch (Exception ex)
  {
    // Handle exception
  }
}

As the method name InstallCommandSetsFromFileAsync implies, the operational unit in the VCD file is a CommandSet element rather than the file itself. The call to this method inspects and validates all of the command sets contained in the file, but it installs only the one whose xml:lang attribute matches exactly that of the global speech engine. If the user switches the global recognition language to one that matches the xml:lang of a different CommandSet in your VCD, that CommandSet will be loaded and activated.
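
If your app needs to know which of its command sets is currently active, for example to pass the right command-set Name to the phrase-list update described in step 4, it can check the language of the phone’s default speech recognizer by using the InstalledSpeechRecognizers class from the in-app speech API (covered in Part 2). The following sketch assumes the command-set names from Figure 1; the mapping logic itself is hypothetical:

using Windows.Phone.Speech.Recognition;
// ...
// Returns the Name of the CommandSet in Figure 1 that matches the
// phone's current global speech recognizer language, or null if the
// app has no command set for that language
private string GetActiveCommandSetName()
{
  // Language the user selected in Settings/Speech, e.g. "en-US" or "ja-JP"
  string speechLanguage =
    InstalledSpeechRecognizers.Default.Language.ToLowerInvariant();
  if (speechLanguage.StartsWith("en")) return "MagicMemoEnu";
  if (speechLanguage.StartsWith("ja")) return "MagicMemoJa";
  return null;
}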

Handling a Voice Command

Now I’ll discuss step 3. When the global speech recognizer recognizes the command prefix and a command from your app, it launches the app at the page specified in the Target attribute of the Navigate element, using your default task target (usually MainPage.xaml for Silverlight apps) if no Target is specified. It also appends to the query string key/value pairs for the Command name and PhraseList values. For example, if the recognized phrase is, “Magic Memo show memo number three,” the query string might look something like the following (the actual string may vary by implementation or version):

"/ViewMemos.xaml?voiceCommandName=show&num=3&
reco=show%20memo%20number%20three"

Fortunately, you don’t have to parse the query string and dig out the parameters yourself because they’re available on the NavigationContext object’s QueryString collection. The app can use this data to determine whether it was launched by voice command—and, if so, handle the command appropriately (for example, in the page’s Loaded handler). Figure 5 shows an example from the Magic Memo app for the ViewMemos.xaml page.

Figure 5 Handling Voice Commands in an App

// Takes appropriate action if the application was launched by voice command.
private void ViewMemosPage_Loaded(object sender, RoutedEventArgs e)
{
  // Other code omitted
  // Handle the case where the page was launched by Voice Command
  if (this.NavigationContext.QueryString != null
    && this.NavigationContext.QueryString.ContainsKey("voiceCommandName"))
  {
    // Page was launched by Voice Command
    string commandName =
      NavigationContext.QueryString["voiceCommandName"];
    string spokenNumber = "";
    if (commandName == "showOne" &&
      this.NavigationContext.QueryString.TryGetValue("num", 
        out spokenNumber))
    {
      // Command was "Show memo number 'num'"
      int index = -1;
      if (int.TryParse(spokenNumber, out index) &&
        index <= memoList.Count && index > 0)
      { // Display the specified memo
        this.Dispatcher.BeginInvoke(delegate
          { MessageBox.Show(String.Format(
          "Memo {0}: \"{1}\"", index, memoList[index - 1])); });
      }
    }
    // Note: no need for an "else" block because if launched by another VoiceCommand
    // then commandName="showAll" and page is shown
  }
}

Because there’s more than one way to navigate to any page, the code in Figure 5 first checks for the presence of the voiceCommandName key in the query string to determine if the user launched the app by a voice command. If so, it verifies the command name and gets the value of the PhraseList parameter num, which is the number of the memo the user wishes to see. This page has only two voice commands and the processing is simple, but a page that can be launched by many voice commands would use something like a switch block on the commandName to decide what action to take.
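
For example, such a page might dispatch on the command name as follows. This is just a sketch; the deleteOne command and the ShowSingleMemo and DeleteMemo helpers are hypothetical and aren’t part of the Magic Memo sample:

string commandName = NavigationContext.QueryString["voiceCommandName"];
switch (commandName)
{
  case "showOne":   // "Show memo number {num}"
    ShowSingleMemo(NavigationContext.QueryString["num"]);
    break;
  case "showAll":   // "Show all memos"; nothing extra to do
    break;
  case "deleteOne": // Hypothetical additional command
    DeleteMemo(NavigationContext.QueryString["num"]);
    break;
  default:
    break;
}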

The PhraseList in this example is also simple; it’s just a series of numbers, one for each stored memo. You can envision more sophisticated scenarios, however, requiring phrase lists that are populated dynamically—for example, from data on a Web site. The optional step 4 mentioned earlier addresses how to implement PhraseLists for these scenarios. I’ll discuss it next.

Updating Phrase Lists from Your App

You may have noticed a problem with the VCD file in Figure 1: The “num” PhraseList defined statically in the VCD supports recognition up to three items, but eventually there are likely to be many more than three memos stored in the app’s isolated storage. For use cases where the phrase list changes over time, there’s a way to update the phrase list dynamically from within the app, as shown in Figure 6. This is especially useful for apps that need to recognize against dynamic lists such as downloaded movies, favorite restaurants or points of interest near the phone’s current location.

Figure 6 Updating the Installed Phrase Lists Dynamically

// Updates the "num" PhraseList to have the same number of
// entries as the number of saved memos; this supports
// "Magic Memo show memo 5" if there are five or more memos saved
private async void UpdateNumberPhraseList(string phraseList,
  int newLimit, string commandSetName)
{
  // Helper function that sets string array to {"1", "2", etc.}
  List<string> positiveIntegers =
    Utilities.GetStringListOfPositiveIntegers(Math.Max(1, newLimit));
  try
  {
    VoiceCommandSet vcs = null;
    if (VoiceCommandService.InstalledCommandSets.TryGetValue(
      commandSetName, out vcs))
    {
      // Update "num" phrase list to the new numbers
      await vcs.UpdatePhraseListAsync(phraseList, positiveIntegers);
    }
  }
  catch (Exception ex)
  {
    this.Dispatcher.BeginInvoke(delegate
      { MessageBox.Show("Exception in UpdateNumberPhraseList " 
        + ex.Message); }
    );
  }
}
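
The Utilities.GetStringListOfPositiveIntegers helper isn’t shown in Figure 6. A minimal version might look like the following; this is an assumption based on the helper’s name and the way it’s called, not the actual sample code:

using System.Collections.Generic;
using System.Linq;

public static class Utilities
{
  // Returns { "1", "2", ..., max } as strings for use in a phrase list
  public static List<string> GetStringListOfPositiveIntegers(int max)
  {
    return Enumerable.Range(1, max).Select(i => i.ToString()).ToList();
  }
}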

Although the Magic Memo app doesn’t demonstrate it, dynamically updated phrase lists are perfect candidates for updating in a background agent, because the updating can happen behind the scenes, even when the app isn’t running.
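
For instance, a periodic background agent could refresh a phrase list on a schedule. The sketch below assumes a standard ScheduledTaskAgent; the MoviesEnu command set, the movieTitle phrase list and the DownloadLatestTitlesAsync helper are all hypothetical and aren’t part of the Magic Memo sample:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Phone.Scheduler;
using Windows.Phone.Speech.VoiceCommands;

public class PhraseListUpdateAgent : ScheduledTaskAgent
{
  protected override async void OnInvoke(ScheduledTask task)
  {
    // Hypothetical helper that fetches the latest phrases, e.g. movie titles
    IEnumerable<string> titles = await DownloadLatestTitlesAsync();
    VoiceCommandSet vcs;
    if (VoiceCommandService.InstalledCommandSets.TryGetValue("MoviesEnu", out vcs))
    {
      await vcs.UpdatePhraseListAsync("movieTitle", titles);
    }
    NotifyComplete(); // Tell the OS this agent run is finished
  }

  // Placeholder: a real app would call a Web service here
  private Task<IEnumerable<string>> DownloadLatestTitlesAsync()
  {
    return Task.FromResult<IEnumerable<string>>(
      new[] { "Star Wars", "Star Wars Episode One" });
  }
}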

And there you have it: four steps to enabling voice commands in your app. Try it out with the Magic Memo sample app. Remember that you need to run it once normally to load the VCD file, but after that you can say things such as the following to launch the app, go straight to the right page and carry out the command:

  • Magic Memo, enter a new memo
  • Magic Memo, show all memos
  • Magic Memo, show memo number four

Next Up: In-App Dialog

Implementing voice commands as I’ve discussed in this article is the first step to letting your users interact with your app on Windows Phone 8 just like they can with built-in apps such as Text, Find and Call.

The second step is to provide in-app dialog, in which the user speaks to your app after it’s launched to record text or execute commands, and receives audio feedback as spoken text. I’ll delve into that topic in Part 2, so stay tuned.


F Avery Bishop has been working in software development for more than 20 years, 12 years of that at Microsoft, where he’s a program manager for the speech platform. He has published numerous articles on natural language support in applications including topics such as complex script support, multilingual applications and speech recognition.

Thanks to the following technical experts for reviewing this article: Robert Brown, Victor Chang, Jay Waltmunson and Travis Wilson