by Alessandro Del Sole – Microsoft MVP
Introduction
In my previous article
we saw how to integrate text-to-speech features into Microsoft Word 2007 easily
with Visual Studio 2008 by using the System.Speech.Synthesis
namespace, which is exposed by .NET Framework 3.5. Another feature provided by
.NET Framework (since version 3.0) is speech
recognition, which is included by default in Microsoft Windows Vista. The
Vista operating system includes a speech recognition engine based on Windows
Desktop Speech technology which enables users to dictate vocal commands that
can be received by applications as well as the operating system itself. To
utilize Windows Vista's built-in speech recognition engine, both as a normal
user and as a developer, you will need to configure the system by following the
steps shown here,
where you will also find a list of supported languages.
An important consideration that we have to take into account
is that most of the latest versions of Microsoft applications, particularly the
Microsoft Office System 2007, support speech recognition by default. For
example, you could simply launch Vista's speech engine and then open Microsoft
Word 2007 to automatically enable dictation into your documents or to control
UI elements in Word with vocal commands. But this is the simplest scenario; we
can also programmatically control the speech engine if needed. In this article
we're going to focus on how this can be done by creating a custom task pane for
Microsoft Word 2007 with Visual Studio 2008. This article will help you
understand the basics of the System.Speech.Recognition
namespace so that you will be able to use it in your .NET applications.
This article will also point out some new features
introduced by Visual Studio 2008 Service Pack 1 and illustrate how speech recognition and text-to-speech
can be easily integrated in order to provide helpful capabilities to your
applications, especially to users with disabilities. You'll find many concepts
that have already been explained in my previous article
about creating a text-to-speech add-in for Microsoft Word 2007 with Visual
Studio 2008, so I suggest you read that article first.
Creating the Project
Visual Studio solutions for Microsoft Word and Microsoft
Excel are of two kinds: application-level
solutions and document-level
solutions. Application-level solutions are add-ins that affect the host
application each time it is loaded, while document-level solutions affect only
specific documents or spreadsheets that you program. We're going to develop an
application-level solution, because an important feature like speech
recognition should be available to all documents hosted in Microsoft Word 2007.
First, let's create a new Visual Studio Word 2007 Add-In
project. With Visual Studio 2008 opened, select the New > Project command
from the File menu. When the New Project window appears, browse the Office
templates folder and select the Word 2007 add-in template. Our new project
should be called SpeechRecoWordAddin, as shown in Figure 1:
Figure 1 – Selecting the Word add-in project template
After the project is created, add a reference to the
System.Speech.dll assembly, which provides namespaces and classes for handling
the speech recognition engine via the .NET Framework. If you read the previous
article, you'll remember how Visual Studio 2008 adds a code file to the project
called ThisAddIn.vb, which defines a class called ThisAddIn. This class represents the instance of the custom component and
exposes two events representing the two main moments of an add-in's lifecycle: Startup and Shutdown. For this simple example we're going to handle just the
Startup event: this is the moment when our custom task pane is instantiated and
added to Microsoft Word's task panes collection (CustomTaskPaneCollection).
A custom task pane consists of a user control, so we need to
create one. But first we have to declare a CustomTaskPane
object, which requires an Imports
Microsoft.Office.Tools directive. The code in Listing 1 declares
the field that will hold our custom task pane.
Private SpeechRecoTaskPane As CustomTaskPane
Listing 1
Custom task panes are managed by Visual Studio via a
particular object called CustomTaskPanes,
which is a collection of type CustomTaskPaneCollection.
The next step is to instantiate our custom component and add it to the collection.
This can be accomplished by handling the Startup
event, as shown in Listing 2.
Private Sub ThisAddIn_Startup() Handles Me.Startup
    Me.SpeechRecoTaskPane = Me.CustomTaskPanes.Add(New SpeechRecoUserControl, "Speech recognition")
    Me.SpeechRecoTaskPane.Visible = True
End Sub
Listing 2
Customizing the Task Pane
Now it's time to add a new user control to our project. To
accomplish this, select Project > Add
user control. When the Add new item window appears, type SpeechRecoUserControl.vb inside the
appropriate text box.
Our user control will be really simple. It will expose two
buttons for starting and stopping the speech recognition engine and a Label for showing status messages.
Speech recognition in .NET development is a big topic, so in this article we'll
just pay attention to the most common features. Figure 2 shows what our control
will look like.
Figure 2 – Layout of the new custom control
Visual Studio enables you to add Windows Forms controls easily
both to applications' documents and to custom controls. So let's add three
Windows Forms controls onto our user control's design surface:
- a Button control for starting the speech recognition engine. Set the Name property for this button to StartButton.
- a Button control for stopping the speech recognition engine. Set the Name property for this button to StopButton.
- a Label control for displaying messages. Set the Name property for this control to statusLabel and change the font style as you like.
Now we're ready to explore the speech recognition features
in .NET.
Working with Speech Recognition in .NET
A little bit of explanation is required before jumping into
code. The .NET Framework provides speech recognition through an object called SpeechRecognizer. This is exposed by
the System.Speech.Recognition
namespace. When instantiated, the SpeechRecognizer
object automatically launches Windows Vista's speech recognition engine, and
applications like Microsoft Word 2007 can automatically take advantage of the
engine to provide dictation capabilities to the user. So, how can we interact
with the engine and manage the dictation?
The SpeechRecognizer
raises some events. The most important of these are the following:
- SpeechDetected
- SpeechRecognized
- SpeechHypothesized
- SpeechRecognitionRejected
Handling these events is important, because each of them enables
us to take some actions depending on the result of the recognition. For
example, if no word is recognized, we could send a message to the user. We'll
do this by integrating text-to-speech techniques.
The above-mentioned events refer to grammar rules to
determine what was spoken by the user. The speech engine itself cannot know
every language in the world, nor can it know which vocal commands you want to
send to an application's user interface; so it's necessary to define a so-called
grammar, a collection of words and
phrases that the speech engine can recognize. In the speech recognition API,
grammars are compliant with W3C and CFG specifications as illustrated in the MSDN
Library. We can create two kinds of grammars. The first one is composed of
specific words and/or phrases; this is the case in applications where the user
interface receives vocal commands, e.g., the user speaks a word to click a
button. We can construct this type of grammar by using a GrammarBuilder object. The second kind of grammar consists of
default grammar rules defined by the operating system. This is the case in
applications that support free-text dictation like Microsoft Word.
Regarding this second case, Windows Vista has a built-in set
of grammar rules corresponding to the current system's culture and language. In
this example, we're going to use the default grammar rules for en-US culture and the English language.
We do this by instantiating a DictationGrammar
object, which represents the default grammar rules for the system localization
allowing free-text dictation. When the user speaks words the engine cannot recognize, or non-English
phrases, a SpeechRecognitionRejected
event will be raised. When this happens we want to tell the user that the
words were not recognized, and we also want to report the engine's state, via audio
messages. A good way to accomplish this is by utilizing the System.Speech.Synthesis namespace,
which provides text-to-speech capabilities just as I showed you in my previous article.
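The two kinds of grammar described above can be sketched side by side. This is only a minimal illustration, assuming a reference to System.Speech.dll; the module name, the function names, and the two command phrases are made up for the example and are not part of the add-in built in this article.

```vb
Imports System.Speech.Recognition

Module GrammarExamples

    'Builds a small command grammar from a fixed list of phrases.
    'The phrases are illustrative only.
    Function CreateCommandGrammar() As Grammar
        Dim commands As New Choices(New String() {"start engine", "stop engine"})
        Dim builder As New GrammarBuilder(commands)
        Dim commandGrammar As New Grammar(builder)
        commandGrammar.Name = "Commands"
        Return commandGrammar
    End Function

    'Returns the system-provided grammar for free-text dictation.
    Function CreateDictationGrammar() As Grammar
        Dim dictation As New DictationGrammar()
        dictation.Name = "Dictation"
        Return dictation
    End Function

End Module
```

Either result can be passed to the recognizer's LoadGrammar method. The add-in in this article only needs the dictation grammar.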
Now it's time to write code. First we need some Imports statements (see Listing 3).
Imports System.Speech.Recognition
Imports System.Speech.Recognition.SrgsGrammar
Imports System.Speech.Synthesis
Imports Microsoft.Office.Tools.Word
Listing 3
Inside the SpeechRecoUserControl
class we can declare some objects that will handle references to the speech
engine, the grammar, and the active Word document. Code in Listing 4 has
comments which explain this.
'This will handle the speech recognition engine
Private WithEvents Recognizer As SpeechRecognizer
'This will obtain a reference to the active Word document
Private MyDocument As Document
'We'll use this object to "say" messages to the user
Private Synthesizer As SpeechSynthesizer
'A reference to Windows Vista's built-in grammar
Private CustomGrammar As DictationGrammar
Listing 4
We now have to initialize the speech engine and explicitly
specify event handlers. We can modify the constructor as shown in Listing 5.
Public Sub New()
    ' This call is required by the Windows Form Designer.
    InitializeComponent()
    ' Add any initialization after the InitializeComponent() call.
    'The instance of the SpeechRecognizer will automatically
    'launch Windows Vista's speech recognition engine
    Me.Recognizer = New SpeechRecognizer()
    'We need a SpeechSynthesizer to send audio messages to the user.
    Me.Synthesizer = New SpeechSynthesizer()
    'We know the name of Vista's default voice, so we select that
    Me.Synthesizer.SelectVoice("Microsoft Anna")
    Me.statusLabel.Text = "Ready"
    'An asynchronous vocal message tells the user that the engine is ready
    Me.Synthesizer.SpeakAsync("Speech engine is ready")
End Sub
Listing 5
It's worth mentioning that the SpeechRecognizer object requires Full Trust permissions. Notice how
text-to-speech features are used to tell the user, via a vocal message, that the speech
recognition engine is ready.
At this point we can implement the event handlers. Listing 6
shows how to accomplish this.
Private Sub recognizer_SpeechDetected(ByVal sender As Object, _
        ByVal e As SpeechDetectedEventArgs) Handles Recognizer.SpeechDetected
    'Raised when the engine detects that the user is speaking some words.
    'In our scenario we don't need to handle this event.
End Sub
Private Sub recognizer_SpeechHypothesized(ByVal sender As Object, _
        ByVal e As SpeechHypothesizedEventArgs) Handles Recognizer.SpeechHypothesized
    'Raised when the engine has a tentative, partial result while it is still processing speech.
    'In our scenario we don't need to handle this event.
End Sub
Private Sub recognizer_SpeechRecognitionRejected(ByVal sender As Object, _
        ByVal e As SpeechRecognitionRejectedEventArgs) Handles Recognizer.SpeechRecognitionRejected
    Me.Synthesizer.SpeakAsync("Words spoken were not recognized")
End Sub
Private Sub recognizer_SpeechRecognized(ByVal sender As Object, _
        ByVal e As SpeechRecognizedEventArgs) Handles Recognizer.SpeechRecognized
    'This example adds the spoken words to the first paragraph in the document
    Dim currentParagraph As Word.Range = myDocument.Paragraphs(1).Range
    currentParagraph.InsertAfter(String.Concat(e.Result.Text, " "))
End Sub
Listing 6
The first two events (SpeechDetected
and SpeechHypothesized) are empty in
the example. In a simple scenario like our user control, we don't need to
handle them. This is because we don't need to take action when the user begins
to speak (SpeechDetected) or when
the engine offers a tentative, partial result while it is still processing the input (SpeechHypothesized). We just want Microsoft Word to add
grammar-compliant words to the active document. If the engine cannot match
what was said to the loaded grammar, the SpeechRecognitionRejected
event is raised. In that handler we send another vocal message to the user
saying that the engine could not recognize what was said.
If the engine is able to recognize and parse a phrase, the
code will add the phrase to the active document. The result of the speech
recognition is stored inside the Result.Text
property of the event argument e
(of type SpeechRecognizedEventArgs).
This Text property is a string so we
can add it to the first paragraph in the document (myDocument.Paragraphs(1).Range) via the InsertAfter method. This is just one of
the possible ways to send the result of the speech recognition to the active
Word document. The important thing to remember is that the event
argument of type SpeechRecognizedEventArgs
gives you access to the result of the recognition.
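Besides Text, the recognition result also carries a Confidence value between 0 and 1, which can be used to skip doubtful results. The following is a minimal sketch, assuming a reference to System.Speech.dll; the module name, the function name, and the 0.5 threshold are hypothetical choices for illustration, not values recommended by the speech API.

```vb
Imports System.Speech.Recognition

Module RecognitionHelpers

    'Hypothetical threshold chosen only for this example
    Private Const MinimumConfidence As Single = 0.5F

    'Returns the text to insert into the document, followed by a space,
    'or an empty string when the result is missing or its confidence is too low.
    Function GetTextToInsert(ByVal result As RecognitionResult) As String
        If result Is Nothing OrElse result.Confidence < MinimumConfidence Then
            Return String.Empty
        End If
        Return String.Concat(result.Text, " ")
    End Function

End Module
```

Inside the SpeechRecognized handler, the string returned by GetTextToInsert(e.Result) could be passed to InsertAfter in place of the direct String.Concat call, so that low-confidence phrases are not written to the document.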
We need to implement two remaining event handlers to handle
the Click events for the StartButton and StopButton buttons. When clicking the StartButton, we have to load the default grammar rules into the
speech recognition engine and get a reference to the active document, as shown
in Listing 7.
Private Sub StartButton_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles StartButton.Click
    'If at least one grammar is loaded, the engine has already been started
    If recognizer.Grammars.Count > 0 Then Exit Sub
    'Otherwise instantiate Windows Vista's default built-in grammar
    customGrammar = New DictationGrammar
    recognizer.LoadGrammar(customGrammar)
    'Retrieve the active document with a new extension method called GetVstoObject
    myDocument = Globals.ThisAddIn.Application.ActiveDocument.GetVstoObject()
    statusLabel.Text = "Engine started"
    synthesizer.SpeakAsync("Speech engine has started")
End Sub
Listing 7
Here we have to make sure that there are no grammar rules
already loaded into the speech engine (e.g., if the user clicks the Start
button twice without stopping the engine). We check this using the recognizer.Grammars.Count property. If
no grammar has been loaded, we can instantiate a new default system grammar and
pass it to the engine using the LoadGrammar
method.
To get a reference to the active document opened inside
Microsoft Word, we are using a new method called GetVstoObject, which obtains a VSTO host item object from the PIA object for a Document, Worksheet,
or Workbook. This is an extension method, introduced by Visual Studio 2008 Service
Pack 1, that enables you to get a managed reference to the active Microsoft
Word document or Microsoft Excel workbook or worksheet.
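Because ActiveDocument raises an error when Word has no document open, the call to GetVstoObject can be guarded. The sketch below assumes it is added to the article's SpeechRecoUserControl class; the helper name GetActiveVstoDocument is hypothetical. It also assumes that, in Visual Studio 2008 SP1 projects, the extension method lives in the Microsoft.Office.Tools.Word.Extensions namespace, so check this against your own project if the call does not resolve.

```vb
'Assumption: this namespace provides GetVstoObject in Visual Studio 2008 SP1 projects
Imports Microsoft.Office.Tools.Word.Extensions

Partial Public Class SpeechRecoUserControl

    'Hypothetical helper: returns the VSTO host item for the active document,
    'or Nothing when Word has no document open.
    Private Function GetActiveVstoDocument() As Microsoft.Office.Tools.Word.Document
        Dim wordApp As Microsoft.Office.Interop.Word.Application = Globals.ThisAddIn.Application
        'ActiveDocument raises an error when no document is open, so check first
        If wordApp.Documents.Count = 0 Then Return Nothing
        Return wordApp.ActiveDocument.GetVstoObject()
    End Function

End Class
```

With a helper like this, the Start button handler could check the result for Nothing and show a message in the status label instead of failing.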
Next, a vocal message is sent to the user to communicate
that the engine was started correctly. When we don't need to send our dictation
to Word anymore, we can stop the engine. At this point we must also unload
unnecessary resources like the grammar we previously loaded. In this example,
we can do this by handling the Click event in the StopButton, as shown in
Listing 8.
Private Sub StopButton_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles StopButton.Click
    'We have to release unnecessary resources
    If customGrammar IsNot Nothing Then
        recognizer.UnloadGrammar(customGrammar)
        'Clear the reference so a second click does not try to unload it again
        customGrammar = Nothing
        statusLabel.Text = "Engine stopped"
        synthesizer.SpeakAsync("Speech engine has been stopped")
    End If
End Sub
Listing 8
By calling the UnloadGrammar
method, we release our custom grammar so that it can be loaded again only when
needed by pressing the Start button.
Dictating our Documents
We are now ready to see the dictation in action. Press F5 to
build and run the project, which starts an instance of Microsoft Word 2007. While
Word is running, you'll first notice how the speech engine control appears at the
top of the screen. The first time you use speech dictation features, you'll be
prompted to configure the engine. You can do this by going through the Wizard
to set it up. Now the engine is sleeping, so the first thing to do is to
activate it. At this point we have to click the Start engine button in our
custom task pane. This will load the default grammar into the speech engine.
The "Engine started" message appears, and a spoken message will be
sent to the user advising that the engine is working (via text-to-speech), so
we can turn on our microphone and say the words "Start listening."
This is a default phrase that you will always need to say whenever you want to
activate Vista's speech recognition engine. Now we can begin dictating our
phrases and Microsoft Word will show them inside the new document whenever the
engine correctly recognizes them. Figure 3 shows an example of dictation. It's
worth mentioning that Microsoft Word 2007 is able to also receive UI commands
and punctuation statements; for example, speaking the words "question mark"
will add a ? symbol to the document.
Figure 3 – Our custom task pane in action inside Microsoft Word 2007 showing the result of our dictation
Terminating dictation is a very simple task. Just say the
words "Stop listening" and this will deactivate the engine. If you want
to start dictating again, say "Start listening" again. If you're not
going to dictate to the engine again for a while, then it's better to press the
Stop Engine button so that resources will be freed.
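The switch between "Start listening" and "Stop listening" can also be reflected in the task pane. The following sketch assumes it is added to the article's SpeechRecoUserControl class, reusing the Recognizer field from Listing 4 and the statusLabel control; the label texts are made up. It further assumes the event reaches the handler on the thread that created the control; if that turns out not to be the case, the label update would need to be marshaled with Control.Invoke.

```vb
Imports System.Speech.Recognition

Partial Public Class SpeechRecoUserControl

    'Updates the status label whenever the shared recognizer
    'switches between its listening and stopped states.
    Private Sub recognizer_StateChanged(ByVal sender As Object, _
            ByVal e As StateChangedEventArgs) Handles Recognizer.StateChanged
        If e.RecognizerState = RecognizerState.Listening Then
            statusLabel.Text = "Listening"
        Else
            statusLabel.Text = "Not listening"
        End If
    End Sub

End Class
```

This gives the user visual feedback in addition to the spoken messages already sent through the synthesizer.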
This article has shown you how Microsoft speech technologies
can enhance the user experience of applications. Receiving dictation or vocal
commands and speaking vocal messages can be very useful in your applications,
especially if they are used by people with disabilities. Microsoft Office
applications can take advantage of Windows Vista's speech recognition engine
automatically; however, as I have shown, you can easily customize Office
applications to interact with the speech recognition engine using Visual Studio
2008.
About the author
Alessandro Del Sole is a Microsoft Visual Basic MVP and team
member in the Italian Visual Basic Tips & Tricks community. He writes lots of Italian
and English language community articles and books about .NET development. He also enjoys
writing freeware and open-source developer tools. You can visit Alessandro's blog.