How to enable continuous dictation

Article
08/31/2015

Learn how to capture and recognize long-form, continuous dictation speech input.

Note: Voice commands and speech recognition are not supported by Windows Store apps in Windows 8 and Windows 8.1.

In [How to dictate short speech responses], you learned how to capture and recognize relatively short speech input, such as when composing a short message service (SMS) message or asking a question, by using the recognizeAsync or recognizeWithUIAsync methods of a SpeechRecognizer object.

For longer, continuous speech recognition sessions, such as dictation or email, use the continuousRecognitionSession property of a SpeechRecognizer to obtain a SpeechContinuousRecognitionSession object.

What you need to know

Technologies

Windows.Media.SpeechRecognition

Prerequisites

This topic builds on Quickstart: Speech recognition and references the "Continuous Dictation Scenario" of the [Speech and TTS sample]. You don’t need the sample to understand the key points and code snippets explained here, but it does let you experiment freely with the code.

To complete this tutorial, have a look through these topics to get familiar with the technologies discussed here.

Install Microsoft Visual Studio.
Get a developer license. For instructions, see Develop using Visual Studio 2013.
Create your first app using JavaScript.
Roadmap for Windows Store apps using JavaScript
Learn about events with Quickstart: adding HTML controls and handling events
See Speech design guidelines for Windows Phone for helpful tips on designing a useful and engaging speech-enabled app.

Instructions

Set up

Your app needs a few objects to manage a continuous dictation session:

An instance of a SpeechRecognizer object.
A reference to a UI dispatcher to update the UI during dictation.
A way to track the accumulated words spoken by the user.

Here, we declare a SpeechRecognizer instance as a private field of the code-behind class. Your app needs to store a reference elsewhere if you want continuous dictation to persist beyond a single Extensible Application Markup Language (XAML) page.

N/A

During dictation, the recognizer raises events from a background thread. Because a background thread cannot directly update the UI in XAML, your app must use a dispatcher to update the UI in response to recognition events.

Here, we declare a private field that will be initialized later with the UI dispatcher.

N/A

To track what the user is saying, you need to handle recognition events raised by the speech recognizer. These events provide the recognition results for chunks of user utterances.

Here, we hold all the recognition results obtained during the session. New results are appended as they are processed.

N/A
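
As a minimal sketch of the three pieces listed above, the state can be held in one plain object. The helper names `createDictationState` and `appendDictatedText` and the field names are hypothetical, chosen for illustration only, and a plain string stands in for a StringBuilder.

```javascript
// Hypothetical helper: bundles the objects a dictation session needs.
// None of these names come from the Speech and TTS sample.
function createDictationState() {
    return {
        recognizer: null,  // SpeechRecognizer instance, set during initialization
        dispatcher: null,  // UI dispatcher reference, set during initialization if needed
        dictatedText: ""   // text accumulated from accepted recognition results
    };
}

// Appends one recognized chunk, followed by a single space.
function appendDictatedText(state, text) {
    state.dictatedText += text + " ";
    return state.dictatedText;
}
```

Keeping the state in one object makes it easier to store the reference outside a single page if dictation has to persist across navigation.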

Initialization

During initialization of continuous speech recognition, you must:

Fetch the dispatcher for the UI thread if you update the UI of your app in the continuous recognition event handlers.
Initialize the speech recognizer.
Compile the built-in dictation grammar. Note: Speech recognition requires at least one constraint to define a recognizable vocabulary. If no constraint is specified, a predefined dictation grammar is used. See Quickstart: Speech recognition.
Set up the event listeners for recognition events.

We initialize speech recognition in the OnNavigatedTo page event.

Because events raised by the speech recognizer occur on a background thread, create a reference to the dispatcher for updates to the UI thread. OnNavigatedTo is always invoked on the UI thread.
```
N/A
```
We then initialize the SpeechRecognizer instance.
```
N/A
```
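A minimal sketch of this step follows. The function name `createRecognizer` is hypothetical, and the namespace is passed in as a parameter (an assumption made here so the function can also be exercised with a stand-in outside a Windows app).
```javascript
// Creates the recognizer. `speechNs` is assumed to be the
// Windows.Media.SpeechRecognition namespace.
function createRecognizer(speechNs) {
    return new speechNs.SpeechRecognizer();
}

// In an app, the call would look like this:
// var recognizer = createRecognizer(Windows.Media.SpeechRecognition);
```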
We then add and compile the grammar that defines all of the words and phrases that can be recognized by the SpeechRecognizer.

If you don't specify a grammar explicitly, a predefined dictation grammar is used by default. Typically, the default grammar is best for general dictation.

Here, we call CompileConstraintsAsync immediately without adding a grammar.
```
N/A
```
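A minimal sketch of compiling the default grammar and checking the outcome follows. The function name `compileDictationGrammar` is hypothetical, and `successStatus` is assumed to be Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus.success, passed in by the caller.
```javascript
// Compiles the recognizer's constraints. Because no constraint was added,
// the predefined dictation grammar is used.
function compileDictationGrammar(recognizer, successStatus) {
    return recognizer.compileConstraintsAsync().then(function (compilationResult) {
        if (compilationResult.status !== successStatus) {
            // Surface the failure instead of starting a session that cannot work.
            throw new Error("Grammar compilation failed with status " +
                compilationResult.status);
        }
        return recognizer;
    });
}
```
Checking the compilation status before starting a session keeps later failures easier to diagnose.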

Handle recognition events

At this point, you could capture a single, brief utterance or phrase by calling recognizeAsync or recognizeWithUIAsync. However, we want to capture a longer, continuous recognition session.

To do this, we specify event listeners to run in the background as the user speaks and define handlers to build the dictation string.

We then use the continuousRecognitionSession property of our recognizer to obtain a SpeechContinuousRecognitionSession object that provides methods and events for managing a continuous recognition session.

Two events in particular are critical:

resultGenerated, which occurs when the recognizer has generated some results.
completed, which occurs when the continuous recognition session has ended.

The resultGenerated event is raised as the user speaks. The recognizer continuously listens to the user and periodically raises an event that passes a chunk of speech input. You must examine the speech input, using the result property of the event argument, and take appropriate action in the event handler, such as appending the text to a StringBuilder object.

As an instance of SpeechRecognitionResult, the result property is useful for determining whether you want to accept the speech input:

status indicates whether the recognition was successful. Recognition can fail for a variety of reasons.
confidence indicates the relative confidence that the recognizer understood the correct words.

Here, we register the handler for the resultGenerated continuous recognition event in the onNavigatedTo page event.
```
N/A
```
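A minimal sketch of the registration follows. The function name `registerResultHandler` is hypothetical. The event name is written in lowercase because the JavaScript projection of Windows Runtime events uses lowercase names.
```javascript
// Registers `onResultGenerated` for the session's resultGenerated event.
function registerResultHandler(recognizer, onResultGenerated) {
    recognizer.continuousRecognitionSession.addEventListener(
        "resultgenerated", onResultGenerated, false);
}
```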
We then check the confidence property. If the value of confidence is Medium or better, we append the text. We also update the UI as we collect input.

Note: The resultGenerated event is raised on a background thread that cannot update the UI directly. If a handler needs to update the UI (as the [Speech and TTS sample] does), you must dispatch the updates to the UI thread through the runAsync method of the dispatcher.
```
N/A
```
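A minimal sketch of such a handler follows. The names `acceptResult`, `makeResultHandler`, `state`, and `updateUi` are hypothetical. `Confidence` is assumed to be Windows.Media.SpeechRecognition.SpeechRecognitionConfidence, and `updateUi` is assumed to be a callback that takes care of any dispatching to the UI thread.
```javascript
// Returns the dictation text, extended only when confidence is Medium or High.
function acceptResult(dictatedText, result, Confidence) {
    var confident = result.confidence === Confidence.high ||
                    result.confidence === Confidence.medium;
    return confident ? dictatedText + result.text + " " : dictatedText;
}

// Builds a resultGenerated handler that accumulates text in `state`
// and reports the current text through `updateUi`.
function makeResultHandler(state, Confidence, updateUi) {
    return function (eventArgs) {
        state.dictatedText = acceptResult(state.dictatedText, eventArgs.result, Confidence);
        updateUi(state.dictatedText);
    };
}
```
Separating the accept decision from the handler keeps the confidence rule easy to adjust, for example to accept only High confidence results.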
We then handle the completed event, which indicates the end of continuous dictation.

The session ends when you call the stopAsync or cancelAsync method (described in Start and stop recognition). The session can also end when an error occurs or when the user has stopped speaking. Check the status property of the event argument, a SpeechRecognitionResultStatus value, to determine why the session ended.

Here, we register the handler for the completed continuous recognition event in the onNavigatedTo page event.
```
N/A
```
The event handler checks the status property to determine whether the recognition was successful. It also handles the case where the user has stopped speaking. A TimeoutExceeded status is often treated as successful recognition because it means the user has finished speaking. Handle this case in your code for a good experience.

Note: As with resultGenerated, the completed event is raised on a background thread that cannot update the UI directly. If a handler needs to update the UI (as the [Speech and TTS sample] does), you must dispatch the updates to the UI thread through the runAsync method of the dispatcher.
```
N/A
```
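A minimal sketch of registering and handling the completed event follows. The names `describeCompletion`, `registerCompletedHandler`, and `showMessage` are hypothetical, and the message strings are illustrative. `Status` is assumed to be Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus.
```javascript
// Maps a completion status to a user-facing message.
function describeCompletion(status, Status) {
    if (status === Status.success) {
        return "Dictation complete.";
    }
    if (status === Status.timeoutExceeded) {
        // The user stopped speaking; treat this as a normal ending.
        return "Dictation ended after a period of silence.";
    }
    return "Dictation ended with status " + status + ".";
}

// Registers a completed handler that reports why the session ended.
function registerCompletedHandler(recognizer, Status, showMessage) {
    recognizer.continuousRecognitionSession.addEventListener(
        "completed",
        function (eventArgs) {
            showMessage(describeCompletion(eventArgs.status, Status));
        },
        false);
}
```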

Provide ongoing recognition feedback

When people converse, they often rely on context to fully understand what is being said. Similarly, the speech recognizer often needs context to provide high-confidence recognition results. For example, by themselves, the words "weight" and "wait" are indistinguishable until more context can be gleaned from surrounding words. Until the recognizer has some confidence that a word, or words, have been recognized correctly, it will not raise the resultGenerated event.

This can make the app feel unresponsive: the user keeps speaking, but no results appear until the recognizer is confident enough to raise the resultGenerated event.

Handle the hypothesisGenerated event to improve this apparent lack of responsiveness. This event is raised whenever the recognizer generates a new set of potential matches for the word being processed. The event argument provides a hypothesis property that contains the current matches. Show these to the user as they continue speaking and reassure them that processing is still active. Once confidence is high and a recognition result has been determined, replace the interim hypothesis results with the final result provided in the resultGenerated event.

Here, we append the hypothetical text and an ellipsis ("…") to the current value of the output text box. The contents of the text box are updated as new hypotheses are generated and until the final results are obtained from the resultGenerated event.

N/A
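
A minimal sketch of this feedback loop follows. The names `formatHypothesis`, `registerHypothesisHandler`, `state`, and `updateUi` are hypothetical, and `updateUi` is assumed to handle any dispatching to the UI thread.

```javascript
// Combines the accepted text with the interim hypothesis and an ellipsis.
function formatHypothesis(dictatedText, hypothesisText) {
    return dictatedText + hypothesisText + " …";
}

// Registers a hypothesisGenerated handler on the recognizer itself
// (not on the continuous recognition session).
function registerHypothesisHandler(recognizer, state, updateUi) {
    recognizer.addEventListener(
        "hypothesisgenerated",
        function (eventArgs) {
            updateUi(formatHypothesis(state.dictatedText, eventArgs.hypothesis.text));
        },
        false);
}
```

The hypothesis is only displayed and never stored, so the next resultGenerated event can overwrite it with the final text.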

Start and stop recognition

Before starting a recognition session, check the value of the speech recognizer state property. The speech recognizer must be in an Idle state.

After checking the state of the speech recognizer, we start the session by calling the startAsync method of the speech recognizer's continuousRecognitionSession property.

N/A
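
A minimal sketch of the state check and start call follows. The function name `startDictation` is hypothetical, and `RecognizerState` is assumed to be Windows.Media.SpeechRecognition.SpeechRecognizerState.

```javascript
// Starts a continuous session only when the recognizer is idle.
// Returns the async operation, or null when no session was started.
function startDictation(recognizer, RecognizerState) {
    if (recognizer.state !== RecognizerState.idle) {
        return null;
    }
    return recognizer.continuousRecognitionSession.startAsync();
}
```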

Recognition can be stopped in two ways:

stopAsync lets any pending recognition events complete (resultGenerated continues to be raised until all pending recognition operations are complete).
cancelAsync terminates the recognition session immediately and discards any pending results.

After checking the state of the speech recognizer, we stop the session by calling the cancelAsync method of the speech recognizer's continuousRecognitionSession property.

N/A
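
A minimal sketch covering both ways of ending a session follows. The function name `endDictation` and the `discardPending` flag are hypothetical, and `RecognizerState` is assumed to be Windows.Media.SpeechRecognition.SpeechRecognizerState.

```javascript
// Ends the session if one is running. When `discardPending` is true the
// session is canceled immediately; otherwise pending results are delivered.
// Returns the async operation, or null when the recognizer is already idle.
function endDictation(recognizer, RecognizerState, discardPending) {
    if (recognizer.state === RecognizerState.idle) {
        return null;
    }
    var session = recognizer.continuousRecognitionSession;
    return discardPending ? session.cancelAsync() : session.stopAsync();
}
```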

Note

A resultGenerated event can occur after a call to cancelAsync.

Because of multithreading, a resultGenerated event might still remain on the stack when cancelAsync is called. If so, the resultGenerated event still fires.

If you set any private fields when canceling the recognition session, always confirm their values in the resultGenerated handler. For example, don't assume a field is initialized in your handler if you set it to null when you cancel the session.
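
A minimal sketch of such a guard follows. It assumes, for illustration only, that the app sets a hypothetical `state.dictatedText` field to null when it cancels the session; the name `makeGuardedResultHandler` is also hypothetical.

```javascript
// Builds a resultGenerated handler that tolerates events arriving
// after cancellation has cleared the accumulated text.
function makeGuardedResultHandler(state) {
    return function (eventArgs) {
        if (state.dictatedText === null) {
            return; // session was canceled; ignore the late result
        }
        state.dictatedText += eventArgs.result.text + " ";
    };
}
```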

Summary and next steps

Here, you learned how to handle long-form, unconstrained speech dictation, which is useful for authoring emails or documents.

Next, you might like to know how to listen for a continuous series of verbal commands, such as those in a video game. See [How to listen for continuous phrases from a list].

Responding to speech interactions

Designers

Speech design guidelines