December 2014

Volume 29 Number 12

Voice Recognition: Speech Recognition with .NET Desktop Applications

James McCaffrey

With the introduction of Windows Phone Cortana, the speech-activated personal assistant (as well as the similar she-who-must-not-be-named from the Fruit company), speech-enabled applications have taken an increasingly important place in software development. In this article, I’ll show you how to get started with speech recognition and speech synthesis in Windows console applications, Windows Forms applications, and Windows Presentation Foundation (WPF) applications.

Note that you can also add speech capabilities to Windows Phone apps, ASP.NET Web apps, Windows Store apps, Windows RT apps and Xbox Kinect, but the techniques to do so are different from those presented in this article.

A good way to see what this article will explain is to take a look at the screenshots of two different demo programs in Figure 1 and Figure 2. After the console application in Figure 1 was launched, the app immediately spoke the phrase “I am awake.” Of course, you can’t hear the demo while reading this article, so the demo program displays the text of what the computer is saying. Next, the user spoke the command “Speech on.” The demo echoed the text that was recognized, and then, behind the scenes, enabled the application to listen for and respond to requests to add two numbers.

Figure 1 Speech Recognition and Synthesis in a Console Application

Figure 2 Speech Recognition in a Windows Forms Application

The user asked the application to add one plus two, then two plus three. The application recognized these spoken commands and gave the answers out loud. I’ll describe more useful ways to use speech recognition later.

The user then issued the command “Speech off,” which deactivated listening for commands to add numbers, but didn’t completely deactivate speech recognition. With speech off, the next spoken command to add one plus two was ignored. Finally, the user turned speech back on, and spoke the nonsense command, “Klatu barada nikto,” which the application recognized as the command to completely deactivate speech recognition and exit the application.

Figure 2 shows a dummy speech-enabled Windows Forms application. The application recognizes spoken commands, but doesn’t respond with speech output. When the application was first launched, the Speech On checkbox control wasn’t checked, indicating speech recognition wasn’t active. The user checked the Speech On control and then spoke, “Hello.” The application echoed the recognized spoken text in the ListBox control at the bottom of the application.

The user then said, “Set text box 1 to red.” The application recognized, “Set text box 1 red,” which is almost—but not quite—exactly what the user spoke. Although not visible in Figure 2, the text in the TextBox control at the top of the application was in fact set to “red.”

Next, the user spoke, “Please set text box 1 to white.” The application recognized “set text box 1 white” and did just that. The user concluded by speaking, “Good-bye,” and the application echoed the command, but didn’t manipulate the Windows Forms, although it could have, for example, by unchecking the Speech On checkbox control.

In the sections that follow, I’ll walk you through the process of creating both demo programs, including the installation of the required .NET speech libraries. This article assumes you have at least intermediate programming skills, but doesn’t assume you know anything about speech recognition or speech synthesis.

Adding Speech to a Console Application

To create the demo shown in Figure 1, I launched Visual Studio and created a new C# console application named ConsoleSpeech. I have successfully used speech with Visual Studio 2010 and 2012, but any recent version should work. After the template code loaded into the editor, in the Solution Explorer window I renamed file Program.cs to the more descriptive ConsoleSpeechProgram.cs and then Visual Studio renamed class Program for me.

Next, I added a Reference to file Microsoft.Speech.dll, which was located at C:\Program Files (x86)\Microsoft SDKs\Speech\v11.0\Assembly. This DLL was not on my host machine and had to be downloaded. Installing the files necessary to add speech recognition and synthesis to an application is not entirely trivial. I’ll explain the installation process in detail in the next section of this article, but for now, assume that Microsoft.Speech.dll exists on your machine.

After adding the reference to the speech DLL, at the top of the source code I deleted all using statements except for the one that points to the top-level System namespace. Then, I added using statements to namespaces Microsoft.Speech.Recognition, Microsoft.Speech.Synthesis and System.Globalization. The first two namespaces are associated with the speech DLL. Note: Somewhat confusingly, there are also System.Speech.Recognition and System.Speech.Synthesis namespaces. I’ll explain the difference shortly. The Globalization namespace was available by default and didn’t require adding a new reference to the project.

The entire source code for the console application demo is shown in Figure 3, and is also available in the code download that accompanies this article. I removed all normal error checking to keep the main ideas as clear as possible.

Figure 3 Demo Console Application Source Code

using System;
using Microsoft.Speech.Recognition;
using Microsoft.Speech.Synthesis;
using System.Globalization;
namespace ConsoleSpeech
{
  class ConsoleSpeechProgram
  {
    static SpeechSynthesizer ss = new SpeechSynthesizer();
    static SpeechRecognitionEngine sre;
    static bool done = false;
    static bool speechOn = true;
    static void Main(string[] args)
    {
      try
      {
        ss.SetOutputToDefaultAudioDevice();
        Console.WriteLine("\n(Speaking: I am awake)");
        ss.Speak("I am awake");
        CultureInfo ci = new CultureInfo("en-us");
        sre = new SpeechRecognitionEngine(ci);
        sre.SetInputToDefaultAudioDevice();
        sre.SpeechRecognized += sre_SpeechRecognized;
        Choices ch_StartStopCommands = new Choices();
        ch_StartStopCommands.Add("speech on");
        ch_StartStopCommands.Add("speech off");
        ch_StartStopCommands.Add("klatu barada nikto");
        GrammarBuilder gb_StartStop = new GrammarBuilder();
        gb_StartStop.Append(ch_StartStopCommands);
        Grammar g_StartStop = new Grammar(gb_StartStop);
        Choices ch_Numbers = new Choices();
        ch_Numbers.Add("1");
        ch_Numbers.Add("2");
        ch_Numbers.Add("3");
        ch_Numbers.Add("4");
        GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
        gb_WhatIsXplusY.Append("What is");
        gb_WhatIsXplusY.Append(ch_Numbers);
        gb_WhatIsXplusY.Append("plus");
        gb_WhatIsXplusY.Append(ch_Numbers);
        Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);
        sre.LoadGrammarAsync(g_StartStop);
        sre.LoadGrammarAsync(g_WhatIsXplusY);
        sre.RecognizeAsync(RecognizeMode.Multiple);
        while (done == false) { ; }
        Console.WriteLine("\nHit <enter> to close shell\n");
        Console.ReadLine();
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
      }
    } // Main
    static void sre_SpeechRecognized(object sender,
      SpeechRecognizedEventArgs e)
    {
      string txt = e.Result.Text;
      float confidence = e.Result.Confidence;
      Console.WriteLine("\nRecognized: " + txt);
      if (confidence < 0.60) return;
      if (txt.IndexOf("speech on") >= 0)
      {
        Console.WriteLine("Speech is now ON");
        speechOn = true;
      }
      if (txt.IndexOf("speech off") >= 0)
      {
        Console.WriteLine("Speech is now OFF");
        speechOn = false;
      }
      if (speechOn == false) return;
      if (txt.IndexOf("klatu") >= 0 && txt.IndexOf("barada") >= 0)
      {
        ((SpeechRecognitionEngine)sender).RecognizeAsyncCancel();
        done = true;
        Console.WriteLine("(Speaking: Farewell)");
        ss.Speak("Farewell");
      }
      if (txt.IndexOf("What") >= 0 && txt.IndexOf("plus") >= 0)
      {
        string[] words = txt.Split(' ');
        int num1 = int.Parse(words[2]);
        int num2 = int.Parse(words[4]);
        int sum = num1 + num2;
        Console.WriteLine("(Speaking: " + words[2] + " plus " +
          words[4] + " equals " + sum + ")");
        ss.SpeakAsync(words[2] + " plus " + words[4] +
          " equals " + sum);
      }
    } // sre_SpeechRecognized
  } // Program
} // ns

After the using statements, the demo code begins like so:

namespace ConsoleSpeech
{
  class ConsoleSpeechProgram
  {
    static SpeechSynthesizer ss = new SpeechSynthesizer();
    static SpeechRecognitionEngine sre;
    static bool done = false;
    static bool speechOn = true;
    static void Main(string[] args)
    {
...

The class-scope SpeechSynthesizer object gives the application the ability to speak. The SpeechRecognitionEngine object allows the application to listen for and recognize spoken words or phrases. The Boolean variable “done” determines when the entire application is finished. Boolean variable speechOn controls whether the application acts on any commands other than the commands that turn listening on and off.

The idea here is that the console application doesn’t accept typed input from the keyboard, so the application is always listening for commands. However, if speechOn is false, only the commands that turn speech recognition back on (or off) will be acted on; all other commands, including the exit command, will be recognized but ignored.

The Main method begins:

try
{
  ss.SetOutputToDefaultAudioDevice();
  Console.WriteLine("\n(Speaking: I am awake)");
  ss.Speak("I am awake");

The SpeechSynthesizer object was instantiated when it was declared. Using a synthesizer object is quite simple. The SetOutputToDefaultAudioDevice method sends output to your machine’s speakers (output can also be sent to a file). The Speak method accepts a string and then, well, speaks. It’s that easy.
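
You aren’t limited to the speakers. Here’s a minimal sketch of sending synthesized speech to a .wav file instead, assuming the Microsoft.Speech synthesizer exposes SetOutputToWaveFile the way its System.Speech counterpart does (the file name is just an illustration):

ss.SetOutputToWaveFile("IAmAwake.wav");  // hypothetical output file
ss.Speak("I am awake");                  // written to the file, not heard
ss.SetOutputToDefaultAudioDevice();      // switch back to the speakers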

Speech recognition is much more difficult than speech synthesis. The Main method continues by creating the recognizer object:

CultureInfo ci = new CultureInfo("en-us");
sre = new SpeechRecognitionEngine(ci);
sre.SetInputToDefaultAudioDevice();
sre.SpeechRecognized += sre_SpeechRecognized;

First, the language to recognize is specified, United States English in this case, in a CultureInfo object. The CultureInfo object is located in the Globalization namespace that was referenced with a using statement. Next, after calling the SpeechRecognitionEngine constructor, voice input is set to the default audio device, a microphone in most situations. Note that most laptops have a built-in microphone, but most desktop machines will need an external microphone (often combined with a headset these days).

The key method for the recognizer object is the SpeechRecognized event handler. When using Visual Studio, if you type “sre.SpeechRecognized +=” and wait just a fraction of a second, the IntelliSense feature will auto-complete your statement with “sre_SpeechRecognized” for the name of the event handler. I recommend hitting the tab key to accept and use that default name.

Next, the demo sets up the ability to recognize commands to add two numbers:

Choices ch_Numbers = new Choices();
ch_Numbers.Add("1");
ch_Numbers.Add("2");
ch_Numbers.Add("3");
ch_Numbers.Add("4"); // Technically Add(new string[] { "4" });
GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
gb_WhatIsXplusY.Append("What is");
gb_WhatIsXplusY.Append(ch_Numbers);
gb_WhatIsXplusY.Append("plus");
gb_WhatIsXplusY.Append(ch_Numbers);
Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

The three key objects here are a Choices collection, a GrammarBuilder template and the controlling Grammar. When I’m designing recognition Grammar, I start by listing some specific examples of what I want to recognize. For example, “What is one plus two?” and, “What is three plus four?”

Then, I determine the corresponding general template, for example, “What is <x> plus <y>?” The template is a GrammarBuilder and the specific values that go into the template are the Choices. The Grammar object encapsulates the template and choices.

In the demo, I restrict the numbers to add to 1 through 4, and add them as strings to the Choices collection. A better approach is:

string[] numbers = new string[] { "1", "2", "3", "4" };
Choices ch_Numbers = new Choices(numbers);

I present the weaker approach to creating a Choices collection for two reasons. First, adding one string at a time was the only approach I saw in other speech examples. Second, you might think that adding one string at a time shouldn’t even work; Visual Studio IntelliSense shows that one of the Add overloads accepts a parameter of type “params string[] phrases.” If you don’t notice the params keyword, you might conclude that the Add method accepts only an array of strings, rather than either a string array or a single string. I recommend passing an array.

Creating a Choices collection of consecutive numbers is somewhat of a special case, and allows a programmatic approach like this:

string[] numbers = new string[100];
for (int i = 0; i < 100; ++i)
  numbers[i] = i.ToString();
Choices ch_Numbers = new Choices(numbers);

After creating the Choices to fill in the slots of the GrammarBuilder, the demo creates the GrammarBuilder and then the controlling Grammar, like so:

GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
gb_WhatIsXplusY.Append("What is");
gb_WhatIsXplusY.Append(ch_Numbers);
gb_WhatIsXplusY.Append("plus");
gb_WhatIsXplusY.Append(ch_Numbers);
Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

The demo uses a similar pattern to create a Grammar for start- and stop-related commands:

Choices ch_StartStopCommands = new Choices();
ch_StartStopCommands.Add("speech on");
ch_StartStopCommands.Add("speech off");
ch_StartStopCommands.Add("klatu barada nikto");
GrammarBuilder gb_StartStop = new GrammarBuilder();
gb_StartStop.Append(ch_StartStopCommands);
Grammar g_StartStop = new Grammar(gb_StartStop);

You have a lot of flexibility when defining grammars. Here, the commands “speech on,” “speech off,” and “klatu barada nikto” are all placed in the same grammar because they’re logically related. The three commands could’ve been defined in three separate grammars, or the “speech on” and “speech off” commands could’ve been placed in one grammar and the “klatu barada nikto” command in a second grammar, as sketched below.
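
For example, a minimal sketch of that second factoring (not part of the demo) might look like this:

// Sketch: on/off commands in one grammar, the exit command in another
Choices ch_OnOff = new Choices(new string[] { "speech on", "speech off" });
Grammar g_OnOff = new Grammar(new GrammarBuilder(ch_OnOff));
Grammar g_Exit = new Grammar(new GrammarBuilder("klatu barada nikto"));
sre.LoadGrammarAsync(g_OnOff);
sre.LoadGrammarAsync(g_Exit);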

After all the Grammar objects have been created, they’re passed to the speech recognizer, and speech recognition is activated:

sre.LoadGrammarAsync(g_StartStop);
sre.LoadGrammarAsync(g_WhatIsXplusY);
sre.RecognizeAsync(RecognizeMode.Multiple);

The RecognizeMode.Multiple argument is required when you have more than one grammar, which will be the case in all but the simplest programs. The Main method finishes like so:

...
    while (done == false) { ; }
    Console.WriteLine("\nHit <enter> to close shell\n");
    Console.ReadLine();
  }
  catch (Exception ex)
  {
    Console.WriteLine(ex.Message);
    Console.ReadLine();
  }
} // Main

The curious-looking empty while loop allows the console application shell to stay alive. The loop will terminate when Boolean class-scope variable “done” is set to true by the speech recognizer event handler.
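
If the spinning loop bothers you, one alternative is to block on a wait handle and have the event handler signal it. A minimal sketch, not part of the demo, that assumes a using statement for System.Threading:

static ManualResetEvent exitSignal = new ManualResetEvent(false);
...
exitSignal.WaitOne();  // in Main, replaces the empty while loop
...
exitSignal.Set();      // in the event handler, replaces done = true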

Handling Recognized Speech

The code for the speech-recognized event handler begins like this:

static void sre_SpeechRecognized(object sender,
  SpeechRecognizedEventArgs e)
{
  string txt = e.Result.Text;
  float confidence = e.Result.Confidence;
  Console.WriteLine("\nRecognized: " + txt);
  if (confidence < 0.60) return;
...

The actual text that’s recognized is stored in the SpeechRecognizedEventArgs Result.Text property. You can also use the Result.Words collection. The Result.Confidence property holds a value between 0.0 and 1.0 that’s a rough measure of how well the spoken text matches any of the grammars associated with the recognizer. The demo instructs the event handler to ignore any recognized text that has low confidence.

Confidence values can vary wildly depending on the complexity of your grammars, the quality of your microphone and so on. For example, if the demo program must recognize only 1 through 4, the confidence values on my machine are typically about 0.75. However, if the grammar must recognize 1 through 100, the confidence values drop to about 0.25. In short, you must typically experiment with confidence values to get good speech-recognition results.
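
When you’re experimenting, it can help to see how confident the recognizer was about each individual word, not just the whole phrase. Here’s a minimal sketch for the event handler, assuming the Result.Words collection exposes Text and Confidence the way the System.Speech API does:

// Sketch: log per-word recognition confidence
foreach (RecognizedWordUnit w in e.Result.Words)
  Console.WriteLine(w.Text + " : " + w.Confidence.ToString("F2"));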

Next, the speech-recognizer event handler toggles recognition on and off:

if (txt.IndexOf("speech on") >= 0)
{
  Console.WriteLine("Speech is now ON");
  speechOn = true;
}
if (txt.IndexOf("speech off") >= 0)
{
  Console.WriteLine("Speech is now OFF");
  speechOn = false;
}
if (speechOn == false) return;

Although perhaps not entirely obvious at first, the logic makes sense: the on and off commands are checked before the speechOn guard, so recognition can always be turned back on, while every other command is ignored when speechOn is false. Next, the secret exit command is processed:

if (txt.IndexOf("klatu") >= 0 && txt.IndexOf("barada") >= 0)
{
  ((SpeechRecognitionEngine)sender).RecognizeAsyncCancel();
  done = true;
  Console.WriteLine("(Speaking: Farewell)");
  ss.Speak("Farewell");
}

Notice that the speech recognition engine can in fact recognize nonsense words. If a Grammar object contains words that aren’t in the object’s built-in dictionary, the Grammar attempts to identify such words as best it can using semantic heuristics, and is usually quite successful. This is why I used “klatu” rather than the correct “klaatu” (from an old science fiction movie).

Also notice that you don’t have to process the entire recognized Grammar text (“klatu barada nikto”), you only need to have enough information to uniquely identify a grammar phrase (“klatu” and “barada”).

Next, commands to add two numbers are processed, and the event handler, Program class and namespace finish up:

...
      if (txt.IndexOf("What") >= 0 && txt.IndexOf("plus") >= 0)
      {
        string[] words = txt.Split(' ');
        int num1 = int.Parse(words[2]);
        int num2 = int.Parse(words[4]);
        int sum = num1 + num2;
        Console.WriteLine("(Speaking: " + words[2] +
          " plus " + words[4] + " equals " + sum + ")");
        ss.SpeakAsync(words[2] + " plus " + words[4] +
          " equals " + sum);
      }
    } // sre_SpeechRecognized
  } // Program
} // ns

Notice that the text in Result.Text is case-sensitive (“What” vs. “what”). Once you’ve recognized a phrase, you can parse out specific words. In this case, the recognized text has the form “What is x plus y,” so “What” is in words[0], and the two numbers to add (as strings) are in words[2] and words[4].
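
Because the Grammar restricts the number slots to the known strings “1” through “4,” calling int.Parse is reasonably safe here. In a more defensive handler you might prefer int.TryParse so an unexpected token can’t throw an exception. A minimal sketch:

// Sketch: defensive version of the parse
int num1, num2;
if (int.TryParse(words[2], out num1) && int.TryParse(words[4], out num2))
{
  int sum = num1 + num2;
  ss.SpeakAsync(num1 + " plus " + num2 + " equals " + sum);
}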

Installing the Libraries

The explanation of the demo program assumes you have all the necessary speech libraries installed on your machine. To create and run the demo programs, you need to install four packages: an SDK to be able to create the demos in Visual Studio, a runtime to be able to execute the demos after they’ve been created, a recognition language, and a synthesis (speaking) language.

To install the SDK, do an Internet search for “Speech Platform 11 SDK.” This will bring you to the correct page in the Microsoft Download Center, as shown in Figure 4. After clicking the Download button, you’ll see the options shown in Figure 5. The SDK comes in 32-bit and 64-bit versions. I strongly recommend using the 32-bit version, regardless of whether your host machine is 32-bit or 64-bit, because the 64-bit version doesn’t interoperate with some applications.

Figure 4 The SDK Installation Main Page at the Microsoft Download Center

Figure 5 Installing the Speech SDK

You don’t need anything except the single x86 (32-bit) .msi file. After selecting that file and clicking the Next button, you can run the installation program directly. The speech libraries don’t give you much feedback about when the installation has completed, so don’t look for some sort of success message.

Next, you want to install the speech runtime. After finding the main download page and clicking the Download button, you’ll see the options shown in Figure 6.

Figure 6 Installing the Speech Runtime

It’s critical you choose the same platform version (11 in the demo) and bit version (32 [x86] or 64 [x64]) as the SDK. Again, I strongly recommend the 32-bit version even if you’re working on a 64-bit machine.

Next, you can install the recognition language. The download page is shown in Figure 7. The demo used file MSSpeech_SR_en-us_TELE.msi (English-U.S.). The SR stands for speech recognition and the TELE stands for telephony, which means that the recognition language is designed to work with low-quality audio input, such as that from a telephone or desktop microphone.

Figure 7 Installing the Recognition Language

Finally, you can install the speech synthesis language and voice. The download page is shown in Figure 8. The demo uses file MSSpeech_TTS_en-us_Helen.msi. The TTS stands for text-to-speech, which is essentially synonymous with speech synthesis. Notice there are two English, U.S. voices available. There are other English, non-U.S. voices, too. Creating synthesis files is quite difficult, but it’s possible to buy and then install other voices from a handful of companies.

Figure 8 Installing the Synthesis Language and Voice

Interestingly, even though a speech recognition language and a speech synthesis voice/language are really two entirely different things, both downloads are options from a single download page. The Download Center UI allows you to check both a recognition language and a synthesis language, but trying to install them at the same time was disastrous for me, so I recommend installing them one at a time.

Microsoft.Speech vs. System.Speech

If you’re new to speech recognition and synthesis for Windows applications, you can easily get confused by the documentation because there are multiple speech platforms. In particular, in addition to the Microsoft.Speech.dll library used by the demos in this article, there’s a System.Speech.dll library that’s part of the Windows OS. The two libraries are similar in the sense that the APIs are almost, but not quite, the same. So, if you’re searching online for speech examples and you see a code snippet rather than a complete program, it’s not always obvious if the example is referring to System.Speech or Microsoft.Speech.

The bottom line: if you’re a beginner adding speech to a .NET application, use the Microsoft.Speech library, not the System.Speech library.

Although the two libraries share some of the same core base code and have similar APIs, they’re definitely different. Some of the key differences are summarized in the table in Figure 9.

Figure 9 Microsoft.Speech vs. System.Speech

Microsoft.Speech.dll      | System.Speech.dll
--------------------------|--------------------------------
Must install separately   | Part of the OS (Windows Vista+)
Can package with apps     | Cannot redistribute
Must construct Grammars   | Uses Grammars or free dictation
No user training          | Training for specific user
Managed code API (C#)     | Native code API (C++)

The System.Speech DLL is part of the OS, so it’s installed on every Windows machine. The Microsoft.Speech DLL (and an associated runtime and languages) must be downloaded and installed onto a machine. System.Speech recognition usually requires user training, where the user reads some text and the system learns to understand that particular user’s pronunciation. Microsoft.Speech recognition works immediately for any user. System.Speech can recognize virtually any words (called free dictation). Microsoft.Speech will recognize only words and phrases that are in a program-defined Grammar.
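
To make the free-dictation difference concrete, here’s a minimal sketch of dictation using System.Speech. This snippet assumes a reference to System.Speech.dll and a using statement for System.Speech.Recognition; it won’t compile against Microsoft.Speech:

// Free dictation: no program-defined Grammar is required
SpeechRecognitionEngine dictEngine = new SpeechRecognitionEngine();
dictEngine.SetInputToDefaultAudioDevice();
dictEngine.LoadGrammar(new DictationGrammar());
dictEngine.SpeechRecognized += (s, evt) =>
  Console.WriteLine("Dictated: " + evt.Result.Text);
dictEngine.RecognizeAsync(RecognizeMode.Multiple);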

Adding Speech Recognition to a Windows Forms Application

The process of adding speech recognition and synthesis to a Windows Forms or WPF application is similar to that of adding speech to a console application. To create the dummy demo program shown in Figure 2, I launched Visual Studio and created a new C# Windows Forms application and named it WinFormSpeech.

After the template code loaded into the Visual Studio editor, in the Solution Explorer window, I added a Reference to file Microsoft.Speech.dll, just as I did with the console application demo. At the top of the source code, I deleted unnecessary using statements, leaving just references to the System, Data, Drawing and Forms namespaces. I added two using statements to bring the Microsoft.Speech.Recognition and System.Globalization namespaces into scope.

The Windows Forms demo doesn’t use speech synthesis, so I didn’t add a using statement for the Microsoft.Speech.Synthesis namespace. Adding speech synthesis to a Windows Forms app is exactly like adding synthesis to a console app.
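
If you did want spoken responses, a minimal sketch, assuming the same Microsoft.Speech.Synthesis setup as in the console demo, is to add a synthesizer field to the Form and speak from the recognition event handler:

// Sketch only; the demo Form doesn't include this
static SpeechSynthesizer ss = new SpeechSynthesizer();  // requires using Microsoft.Speech.Synthesis
...
ss.SetOutputToDefaultAudioDevice();       // in the Form constructor
...
ss.SpeakAsync("I heard you say " + txt);  // in sre_SpeechRecognized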

In the Visual Studio design view, I dragged a TextBox control, a CheckBox control and a ListBox control onto the Form. I double-clicked on the CheckBox control and Visual Studio automatically created a skeleton of the CheckChanged event handler method.

Recall that the console app demo started listening for spoken commands immediately, and continuously listened until the app exited. That approach can be used for a Windows Forms app, but instead I decided to allow the user to toggle speech recognition on and off by using the CheckBox control.

The source code for the demo program’s Form1.cs file, which defines a partial class, is presented in Figure 10. A speech recognition engine object is declared and instantiated as a Form member. Inside the Form constructor, I hook up the SpeechRecognized event handler and create and load two Grammars:

public Form1()
{
  InitializeComponent();
  sre.SetInputToDefaultAudioDevice();
  sre.SpeechRecognized += sre_SpeechRecognized;
  Grammar g_HelloGoodbye = GetHelloGoodbyeGrammar();
  Grammar g_SetTextBox = GetTextBox1TextGrammar();
  sre.LoadGrammarAsync(g_HelloGoodbye);
  sre.LoadGrammarAsync(g_SetTextBox);
  // sre.RecognizeAsync() is in CheckBox event
}

Figure 10 Adding Speech Recognition to a Windows Forms Application

using System;
using System.Data;
using System.Drawing;
using System.Windows.Forms;
using Microsoft.Speech.Recognition;
using System.Globalization;
namespace WinFormSpeech
{
  public partial class Form1 : Form
  {
    static CultureInfo ci = new CultureInfo("en-us");
    static SpeechRecognitionEngine sre = 
      new SpeechRecognitionEngine(ci);
    public Form1()
    {
      InitializeComponent();
      sre.SetInputToDefaultAudioDevice();
      sre.SpeechRecognized += sre_SpeechRecognized;
      Grammar g_HelloGoodbye = GetHelloGoodbyeGrammar();
      Grammar g_SetTextBox = GetTextBox1TextGrammar();
      sre.LoadGrammarAsync(g_HelloGoodbye);
      sre.LoadGrammarAsync(g_SetTextBox);
      // sre.RecognizeAsync() is in CheckBox event
    }
    static Grammar GetHelloGoodbyeGrammar()
    {
      Choices ch_HelloGoodbye = new Choices();
      ch_HelloGoodbye.Add("hello");
      ch_HelloGoodbye.Add("goodbye");
      GrammarBuilder gb_result = 
        new GrammarBuilder(ch_HelloGoodbye);
      Grammar g_result = new Grammar(gb_result);
      return g_result;
    }
    static Grammar GetTextBox1TextGrammar()
    {
      Choices ch_Colors = new Choices();
      ch_Colors.Add(new string[] { "red", "white", "blue" });
      GrammarBuilder gb_result = new GrammarBuilder();
      gb_result.Append("set text box 1");
      gb_result.Append(ch_Colors);
      Grammar g_result = new Grammar(gb_result);
      return g_result;
    }
    private void checkBox1_CheckedChanged(object sender, 
      EventArgs e)
    {
      if (checkBox1.Checked == true)
        sre.RecognizeAsync(RecognizeMode.Multiple);
      else if (checkBox1.Checked == false) // Turn off
        sre.RecognizeAsyncCancel();
    }
    void sre_SpeechRecognized(object sender, 
      SpeechRecognizedEventArgs e)
    {
      string txt = e.Result.Text;
      float conf = e.Result.Confidence;
      if (conf < 0.65) return;
      this.Invoke(new MethodInvoker(() =>
        { listBox1.Items.Add("I heard you say: " + txt); })); // WinForm specific
      if (txt.IndexOf("text") >= 0 && txt.IndexOf("box") >= 0 &&
        txt.IndexOf("1") >= 0)
      {
        string[] words = txt.Split(' ');
        this.Invoke(new MethodInvoker(() =>
          { textBox1.Text = words[4]; })); // WinForm specific
      }
    }
  } // Form
} // ns

I could’ve created the two Grammar objects directly as I did in the console application demo, but instead, to keep things a bit cleaner, I defined two helper methods, GetHelloGoodbyeGrammar and GetTextBox1TextGrammar, to do that work.

Notice that the Form constructor doesn’t call the RecognizeAsync method, which means that speech recognition won’t immediately be active when the application is launched.

Helper method GetHelloGoodbyeGrammar follows the same pattern as described earlier in this article:

static Grammar GetHelloGoodbyeGrammar()
{
  Choices ch_HelloGoodbye = new Choices();
  ch_HelloGoodbye.Add("hello"); // Should be an array!
  ch_HelloGoodbye.Add("goodbye");
  GrammarBuilder gb_result =
    new GrammarBuilder(ch_HelloGoodbye);
  Grammar g_result = new Grammar(gb_result);
  return g_result;
}

Similarly, the helper method that creates a Grammar object to set the text in the Windows Forms TextBox control doesn’t present any surprises:

static Grammar GetTextBox1TextGrammar()
{
  Choices ch_Colors = new Choices();
  ch_Colors.Add(new string[] { "red", "white", "blue" });
  GrammarBuilder gb_result = new GrammarBuilder();
  gb_result.Append("set text box 1");
  gb_result.Append(ch_Colors);
  Grammar g_result = new Grammar(gb_result);
  return g_result;
}

The helper will recognize the phrase, “set text box 1 red.” However, the user doesn’t have to speak this phrase exactly. For example, a user could say, “Please set the text in text box 1 to red,” and the speech recognition engine would still recognize the phrase as “set text box 1 red,” although with a lower confidence value than if the user had matched the Grammar pattern exactly. Put another way, when you’re creating Grammars, you don’t have to take into account every variation of a phrase. This dramatically simplifies using speech recognition.

The CheckBox event handler is defined like so:

private void checkBox1_CheckedChanged(object sender, EventArgs e)
{
  if (checkBox1.Checked == true)
    sre.RecognizeAsync(RecognizeMode.Multiple);
  else if (checkBox1.Checked == false) // Turn off
    sre.RecognizeAsyncCancel();
}

The speech recognition engine object, sre, always remains in existence during the Windows Forms app’s lifetime. The object is activated and deactivated using methods RecognizeAsync and RecognizeAsyncCancel when the user toggles the CheckBox control.

The definition of the speech-recognized event handler begins with:

void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  string txt = e.Result.Text;
  float conf = e.Result.Confidence;
  if (conf < 0.65) return;
...

In addition to the more or less always-used Result.Text and Result.Confidence properties, the Result object has several other useful, but more advanced, properties you might want to investigate, such as Homophones and ReplacementWordUnits. Additionally, the speech recognition engine has several useful events, such as SpeechHypothesized.
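
For example, here’s a minimal sketch of hooking SpeechHypothesized in the Form constructor to watch the recognizer’s interim guesses, assuming the event args expose Result.Text the way the System.Speech API does:

// Sketch: display interim recognition hypotheses in the ListBox
sre.SpeechHypothesized += (s, evt) =>
  this.Invoke(new MethodInvoker(() =>
    { listBox1.Items.Add("(hypothesis) " + evt.Result.Text); }));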

The event handler code concludes with:

...
  this.Invoke(new MethodInvoker(() =>
    { listBox1.Items.Add("I heard you say: " + txt); }));
  if (txt.IndexOf("text") >= 0 &&
    txt.IndexOf("box") >= 0 && txt.IndexOf("1") >= 0)
  {
    string[] words = txt.Split(' ');
    this.Invoke(new MethodInvoker(() =>
    { textBox1.Text = words[4]; }));
  }
}

The recognized text is echoed in the ListBox control using the MethodInvoker delegate. Because the speech recognizer is running in a different thread from the Windows Forms UI thread, a direct attempt to access the ListBox control, such as:

listBox1.Items.Add("I heard you say: " + txt);

will fail and throw an exception. An alternative to MethodInvoker is to use the Action delegate like this:

this.Invoke( (Action)( () =>
  listBox1.Items.Add("I heard you say: " + txt)));

In theory, in this situation, using the MethodInvoker delegate is slightly more efficient than using the Action delegate because MethodInvoker is part of the System.Windows.Forms namespace and, therefore, specific to Windows Forms applications. The Action delegate is more general. This example shows that you can completely manipulate a Windows Forms application using speech recognition, which is incredibly powerful and useful.

Wrapping Up

The information presented in this article should get you up and running if you want to explore speech recognition and speech synthesis with .NET applications. Mastering the technology itself isn’t too difficult once you get over the initial installation and learning hurdles. The real issue with speech recognition and synthesis is determining when they’re useful.

With console applications, you can create interesting back-and-forth dialogs where the user asks a question and the application answers, resulting in a Cortana-like environment. You have to be a bit careful because when your computer speaks, that speech will be picked up by the microphone, and may be recognized. I’ve found myself in some amusing situations where I ask a question, the application recognizes and answers, but the spoken answer triggers another recognition event, and I end up in an entertaining infinite speech loop.

Another possible use of speech with a console application is to recognize commands such as, “Launch Notepad” and “Launch Word.” In other words, a console application can be used to perform actions on your host machine that would normally be performed using multiple mouse and keyboard interactions.
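
A minimal sketch of that idea inside a speech-recognized event handler, assuming a Grammar containing the phrases “launch notepad” and “launch word” has been loaded, and that Word is installed and registered so the winword.exe name resolves:

// Sketch: launch applications from recognized commands
if (txt.IndexOf("launch notepad") >= 0)
  System.Diagnostics.Process.Start("notepad.exe");
else if (txt.IndexOf("launch word") >= 0)
  System.Diagnostics.Process.Start("winword.exe");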


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following Microsoft Research technical experts for reviewing this article: Rob Gruen, Mark Marron and Curtis von Veh