Enabling Speech Recognition in Microsoft Word 2007 with Visual Studio 2008
by Alessandro Del Sole – Microsoft MVP
In my previous article we saw how to easily integrate text-to-speech features into Microsoft Word 2007 with Visual Studio 2008 by using the System.Speech.Synthesis namespace exposed by .NET Framework 3.5. Another feature provided by the .NET Framework (since version 3.0) is speech recognition, which is included by default in Microsoft Windows Vista. Vista ships with a speech recognition engine based on Windows Desktop Speech technology, which lets users dictate vocal commands that can be received by applications as well as by the operating system itself. To utilize Vista's built-in speech recognition engine, both as a normal user and as a developer, you will need to configure the system by following the steps shown here, where you will also find a list of supported languages.
An important consideration is that most recent Microsoft applications, particularly the 2007 Microsoft Office System, support speech recognition by default. For example, you could simply launch Vista's speech engine and then open Microsoft Word 2007 to automatically enable dictation into your documents or to control Word's UI elements with vocal commands. But this is the simplest scenario; we can also control the speech engine programmatically if needed. In this article we're going to focus on how this can be done by creating a custom task pane for Microsoft Word 2007 with Visual Studio 2008. This article will help you understand the basics of the System.Speech.Recognition namespace so that you can use it in your own .NET applications.
This article will also point out some new features introduced by Visual Studio 2008 Service Pack 1 and illustrate how speech recognition and text-to-speech can be easily combined to provide helpful capabilities in your applications, especially for users with disabilities. Many of the concepts here were already explained in my previous article about creating a text-to-speech add-in for Microsoft Word 2007 with Visual Studio 2008, so I suggest you read that article first.
Creating the Project
Visual Studio solutions for Microsoft Word and Microsoft Excel are of two kinds: application-level solutions and document-level solutions. Application-level solutions are add-ins that affect the host application each time it is loaded, while document-level solutions affect only specific documents or spreadsheets that you program. We're going to develop an application-level solution, because an important feature like speech recognition should be available to all documents hosted in Microsoft Word 2007.
First, let's create a new Visual Studio Word 2007 Add-In project. With Visual Studio 2008 opened, select the New > Project command from the File menu. When the New Project window appears, browse the Office templates folder and select the Word 2007 add-in template. Our new project should be called SpeechRecoWordAddin, as shown in Figure 1:
Figure 1 – Selecting the Word add-in project template
After the project is created, add a reference to the System.Speech.dll assembly, which provides namespaces and classes for handling the speech recognition engine via the .NET Framework. If you read the previous article, you'll remember how Visual Studio 2008 adds a code file to the project called ThisAddin.vb, which defines a class called ThisAddin. This represents the instance of the custom component and exposes two events representing the two main events of an add-in's lifecycle: Startup and Shutdown. For this simple example we're going to handle just the Startup event: this is the moment when our custom task pane is instantiated and added to Microsoft Word's task panes collection (CustomTaskPaneCollection).
A custom task pane consists of a user control, so we need to create one. But first we have to declare a CustomTaskPane object, which requires an Imports Microsoft.Office.Tools directive. The code in Listing 1 declares an object that will represent our custom control.
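The original declaration can be sketched as follows; the field name matches the SpeechRecoTaskPane member used in the Startup handler of Listing 2, and it belongs inside the ThisAddin class:

```vb
'Holds the instance of our custom task pane
'(requires an Imports Microsoft.Office.Tools directive)
Private SpeechRecoTaskPane As CustomTaskPane
```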
Custom task panes are managed by Visual Studio via a particular object called CustomTaskPanes, which is a collection of type CustomTaskPaneCollection. The next step is to instantiate our custom component and add it to the collection. This can be accomplished by handling the Startup event, as shown in Listing 2.
Private Sub ThisAddIn_Startup() Handles Me.Startup
    Me.SpeechRecoTaskPane = Me.CustomTaskPanes.Add(New SpeechRecoUserControl, _
        "Speech recognition")
    Me.SpeechRecoTaskPane.Visible = True
End Sub
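Although this example only handles Startup, the Shutdown event mentioned earlier is a reasonable place to clean up. A minimal sketch, assuming the same SpeechRecoTaskPane field, might look like this:

```vb
Private Sub ThisAddIn_Shutdown() Handles Me.Shutdown
    'Remove our pane from the task panes collection when the add-in unloads
    If Me.SpeechRecoTaskPane IsNot Nothing Then
        Me.CustomTaskPanes.Remove(Me.SpeechRecoTaskPane)
    End If
End Sub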
Customizing the Task Pane
Now it's time to add a new user control to our project. To accomplish this, select Project > Add User Control. When the Add New Item window appears, type SpeechRecoUserControl.vb in the Name text box.
Our user control will be really simple. It will expose two buttons for starting and stopping the speech recognition engine and a Label for showing status messages. Speech recognition in .NET development is a big topic, so in this article we'll just pay attention to the most common features. Figure 2 shows what our control will look like.
Figure 2 – Layout of the new custom control
Visual Studio enables you to add Windows Forms controls easily both to applications' documents and to custom controls. So let's add three Windows Forms controls onto our user control's design surface:
- a Button control named StartButton for starting the speech recognition engine. Set the Text property for this button to Start engine.
- a Button control named StopButton for stopping the speech recognition engine. Set the Text property for this button to Stop engine.
- a Label control named statusLabel for displaying messages. Set the Text property for this control to Ready and change the font style as you like.
Now we're ready to explore the speech recognition features in .NET.
Working with Speech Recognition in .NET
A little bit of explanation is required before jumping into code. The .NET Framework provides speech recognition through an object called SpeechRecognizer. This is exposed by the System.Speech.Recognition namespace. When instantiated, the SpeechRecognizer object automatically launches Windows Vista's speech recognition engine, and applications like Microsoft Word 2007 can automatically take advantage of the engine to provide dictation capabilities to the user. So, how can we interact with the engine and manage the dictation?
The SpeechRecognizer object raises several events. The most important of these are the following:

- SpeechDetected, raised when the engine detects that the user has started speaking;
- SpeechHypothesized, raised when the engine has not yet recognized the spoken words but tries to parse them anyway;
- SpeechRecognitionRejected, raised when the engine cannot recognize what was spoken;
- SpeechRecognized, raised when the engine successfully recognizes a spoken phrase.

Handling these events is important, because each of them enables us to take action depending on the result of the recognition. For example, if no word is recognized, we could send a message to the user. We'll do this by integrating text-to-speech techniques.
The above-mentioned events rely on grammar rules to determine what was spoken by the user. The speech engine itself cannot know every language in the world, nor can it know what vocal commands you intend to send to an application's user interface; so it's necessary to define a so-called grammar, a collection of words and phrases that the speech engine can recognize. In the speech recognition API, grammars are compliant with the W3C and CFG specifications, as illustrated in the MSDN Library. We can create two kinds of grammars. The first is composed of specific words and/or phrases; this is the case in applications where the user interface receives vocal commands, e.g., the user speaks a word to click a button. We can construct this type of grammar by using a GrammarBuilder object. The second kind of grammar consists of default grammar rules defined by the operating system. This is the case in applications that support free-text dictation, like Microsoft Word.
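As a sketch of the first kind of grammar (not used in this article's add-in), a command grammar limited to two hypothetical UI commands could be built like this:

```vb
'A minimal command grammar: the engine will only recognize
'the phrases "start engine" and "stop engine" (hypothetical commands)
Dim builder As New GrammarBuilder()
builder.Append(New Choices("start engine", "stop engine"))
Dim commandGrammar As New Grammar(builder)
'The grammar can then be passed to SpeechRecognizer.LoadGrammar
```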
Regarding this second case, Windows Vista has a built-in set of grammar rules corresponding to the current system's culture and language. In this example, we're going to use the default grammar rules for the en-US culture and the English language. We do this by instantiating a DictationGrammar object, which represents the default grammar rules for the system localization and allows free-text dictation. When the user speaks unrecognized words or non-English phrases, the SpeechRecognitionRejected event is raised. When this happens, we want to communicate any unrecognized words, as well as the engine's state, to the user via audio messages. A good way to accomplish this is the System.Speech.Synthesis namespace, which provides text-to-speech capabilities, just as I showed in my previous article.
Now it's time to write code. First we need some Imports statements (see Listing 3).
Imports System.Speech.Recognition
Imports System.Speech.Recognition.SrgsGrammar
Imports System.Speech.Synthesis
Imports Microsoft.Office.Tools.Word
Inside the SpeechRecoUserControl class we declare some fields that will hold references to the speech engine, the grammar, and the active Word document. The code in Listing 4 includes comments explaining each one.
'This will handle the speech recognition engine
Private WithEvents Recognizer As SpeechRecognizer

'This will hold a reference to the active Word document
Private MyDocument As Document

'We'll use this object to "say" messages to the user
Private Synthesizer As SpeechSynthesizer

'A reference to Windows Vista's built-in grammar
Private CustomGrammar As DictationGrammar
We now have to initialize the speech engine and explicitly specify event handlers. We can modify the constructor as shown in Listing 5.
Public Sub New()
    ' This call is required by the Windows Form Designer.
    InitializeComponent()

    ' Add any initialization after the InitializeComponent() call.
    'The instance of the SpeechRecognizer will automatically
    'launch Windows Vista's speech recognition engine
    Me.Recognizer = New SpeechRecognizer()

    'We need a SpeechSynthesizer to send audio messages to the user.
    Me.Synthesizer = New SpeechSynthesizer()
    'We know the name of Vista's default voice, so we select that
    Me.Synthesizer.SelectVoice("Microsoft Anna")

    Me.statusLabel.Text = "Ready"
    'An asynchronous vocal message tells the user that the engine is ready
    Me.Synthesizer.SpeakAsync("Speech engine is ready")
End Sub
It's worth mentioning that the SpeechRecognizer object requires Full Trust permissions. Notice how text-to-speech is used to tell the user, via a vocal message, that the speech recognition engine has started.
At this point we can implement the event handlers. Listing 6 shows how to accomplish this.
Private Sub recognizer_SpeechDetected(ByVal sender As Object, _
    ByVal e As SpeechDetectedEventArgs) Handles Recognizer.SpeechDetected
    'Raised when the engine detects that the user is speaking some words.
    'In our scenario we don't need to handle this event.
End Sub

Private Sub recognizer_SpeechHypothesized(ByVal sender As Object, _
    ByVal e As SpeechHypothesizedEventArgs) Handles Recognizer.SpeechHypothesized
    'Raised when the engine doesn't recognize spoken words
    'but tries to parse them anyway.
    'In our scenario we don't need to handle this event.
End Sub

Private Sub recognizer_SpeechRecognitionRejected(ByVal sender As Object, _
    ByVal e As SpeechRecognitionRejectedEventArgs) _
    Handles Recognizer.SpeechRecognitionRejected
    Me.Synthesizer.SpeakAsync("Words spoken were not recognized")
End Sub

Private Sub recognizer_SpeechRecognized(ByVal sender As Object, _
    ByVal e As SpeechRecognizedEventArgs) Handles Recognizer.SpeechRecognized
    'This example adds the spoken words to the first paragraph in the document
    Dim currentParagraph As Word.Range = MyDocument.Paragraphs(1).Range
    currentParagraph.InsertAfter(String.Concat(e.Result.Text, " "))
End Sub
The first two event handlers (SpeechDetected and SpeechHypothesized) are left empty in the example. In a simple scenario like our user control, we don't need to take action when the user begins to speak (SpeechDetected) or when the engine tries to parse unrecognized words (SpeechHypothesized); we just want Microsoft Word to add grammar-compliant words to the active document. If the engine can't recognize a phrase at all, the SpeechRecognitionRejected event is raised, and we send another vocal message to the user saying that the engine could not recognize what was said.
If the engine is able to recognize and parse a phrase, the code adds the phrase to the active document. The result of the speech recognition is stored in the Result.Text property of the event argument (of type SpeechRecognizedEventArgs). This Text property is a string, so we can add it to the first paragraph in the document (MyDocument.Paragraphs(1).Range) via the InsertAfter method. This is just one of the possible ways to send the result of the speech recognition to the active Word document. The important thing to remember is that the event argument of type SpeechRecognizedEventArgs gives you access to the recognition result.
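As a sketch of one alternative approach (not part of the original add-in), the SpeechRecognized handler could type the recognized text at the current caret position instead of appending it to the first paragraph, using the Word Application object exposed by the add-in:

```vb
Private Sub recognizer_SpeechRecognized(ByVal sender As Object, _
    ByVal e As SpeechRecognizedEventArgs) Handles Recognizer.SpeechRecognized
    'Insert the recognized text at the current cursor position,
    'followed by a space
    Globals.ThisAddIn.Application.Selection.TypeText(e.Result.Text & " ")
End Sub
```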
We need to implement two remaining event handlers to handle the Click events for the StartButton and StopButton buttons. When clicking the StartButton, we have to load the default grammar rules into the speech recognition engine and get a reference to the active document, as shown in Listing 7.
Private Sub StartButton_Click(ByVal sender As System.Object, _
    ByVal e As System.EventArgs) Handles StartButton.Click
    'If there is at least one grammar, the engine is already running
    If Recognizer.Grammars.Count > 0 Then Exit Sub

    'Otherwise, instantiate Windows Vista's default built-in grammar
    CustomGrammar = New DictationGrammar
    Recognizer.LoadGrammar(CustomGrammar)

    'Retrieve the active document with a new extension method
    'called GetVstoObject
    MyDocument = Globals.ThisAddIn.Application.ActiveDocument.GetVstoObject()

    statusLabel.Text = "Engine started"
    Synthesizer.SpeakAsync("Speech engine has started")
End Sub
Here we have to make sure that there are no grammar rules already loaded into the speech engine (e.g., if the user clicks the Start button twice without stopping the engine). We check this using the Recognizer.Grammars.Count property. If no grammar has been loaded, we can instantiate a new default system grammar and pass it to the engine using the LoadGrammar method.
To get a reference to the active document opened inside Microsoft Word, we use a new method called GetVstoObject, which obtains a VSTO host item object from the PIA object for a Document, Worksheet, or Workbook. This is an extension method, introduced by Visual Studio 2008 Service Pack 1, that enables you to get a managed reference to the active Microsoft Word document or Microsoft Excel workbook or worksheet.
Next, a vocal message is sent to the user to communicate that the engine was started correctly. When we don't need to send our dictation to Word anymore, we can stop the engine. At this point we must also unload unnecessary resources like the grammar we previously loaded. In this example, we can do this by handling the Click event in the StopButton, as shown in Listing 8.
Private Sub StopButton_Click(ByVal sender As System.Object, _
    ByVal e As System.EventArgs) Handles StopButton.Click
    'We have to release unnecessary resources
    If CustomGrammar IsNot Nothing Then
        Recognizer.UnloadGrammar(CustomGrammar)
        statusLabel.Text = "Engine stopped"
        Synthesizer.SpeakAsync("Speech engine has been stopped")
    End If
End Sub
Calling the UnloadGrammar method releases our custom grammar, so that it is loaded only when needed, by pressing the Start button again.
Dictating our Documents
We are now ready to see dictation in action. Press F5 to build and run the project; this starts an instance of Microsoft Word 2007. While Word is running, you'll first notice the speech engine control appear at the top of the screen. The first time you use the speech dictation features, you'll be prompted to configure the engine by going through a setup wizard. The engine is initially sleeping, so the first thing to do is activate it. Click the Start engine button in our custom task pane; this loads the default grammar into the speech engine. The "Engine started" message appears, and a spoken message (via text-to-speech) advises the user that the engine is working. Now we can turn on our microphone and say the words "Start listening." This is a default phrase that you always need to say to activate Vista's speech recognition engine. We can then begin dictating our phrases, and Microsoft Word will show them inside the new document whenever the engine correctly recognizes them. Figure 3 shows an example of dictation. It's worth mentioning that Microsoft Word 2007 can also receive UI commands and punctuation statements; for example, speaking the words "question mark" will add a ? symbol to the document.
Figure 3. Our custom task pane in action inside Microsoft Word 2007 showing the result of our dictation.
Terminating dictation is very simple: just say the words "Stop listening" to deactivate the engine. If you want to start dictating again, say "Start listening" again. If you're not going to dictate for a while, it's better to press the Stop engine button so that resources are freed.
This article has shown you how Microsoft speech technologies can enhance the user experience of applications. Receiving dictation or vocal commands and speaking vocal messages can be very useful in your applications, especially if they are used by people with disabilities. Microsoft Office applications can take advantage of Windows Vista's speech recognition engine automatically; however, as I have shown, you can easily customize Office applications to interact with the speech recognition engine using Visual Studio 2008.
Resources
- Office Development with Visual Studio
- System.Speech.Recognition namespace
- System.Speech.Synthesis namespace
About the author
Alessandro Del Sole is a Microsoft Visual Basic MVP and team member in the Italian Visual Basic Tips & Tricks community. He writes lots of Italian and English language community articles and books about .NET development. He also enjoys writing freeware and open-source developer tools. You can visit Alessandro's blog.