Developing for Speech
In the past, Microsoft offered Windows Desktop Speech technologies for developers through a separate Speech SDK that contained development files and redistributable binaries. The latter files had to be packaged with the application and installed on each user's machine to enable speech capabilities. Because speech technologies have matured and entered mainstream use, this state of affairs has changed for Windows Vista®: the operating system now has integrated speech capabilities and the speech APIs are now included with the Windows SDK.
Windows Vista automatically provides basic speech capabilities to any application that is designed to work with two Windows accessibility technologies: Microsoft Active Accessibility (MSAA) and Microsoft Windows UI Automation (WUIA). At runtime, when speech is used to open or switch to an application, the speech engine queries that application to determine which accessibility features it supports, then works through those. Developers can add accessibility support to their applications by using the Active Accessibility API, which is COM-based. For more information, see "Microsoft Active Accessibility" in the Windows SDK. Managed applications can use the types provided in the Accessibility namespace, which are largely just COM wrappers for the Active Accessibility API. For more information, see "Accessibility Namespace" in the Windows SDK.
For advanced speech capabilities, Microsoft provides native and managed interfaces for developing speech-enabled applications: the COM-based Microsoft Speech API (SAPI) and the .NET Framework 3.0 System.Speech.* namespaces. For more information, see "Welcome to Microsoft Speech SDK Version 5.3" and the System.Speech.* namespaces in the Windows SDK.
Architecture and Concepts
SAPI is middleware that provides an API and a device driver interface (DDI) for speech engines to implement. The speech engines are either speech recognizers or synthesizers. (Although the word device is used, these engines are typically software implementations.) Although this is hidden from developers, the managed System.Speech.* namespaces communicate with these engines both directly and indirectly, by calling through SAPI (Sapi.dll). Windows Vista supplies default recognition and synthesis speech engines, but this architecture enables additional engines to be plugged in without changes to applications. Each speech engine is language specific.
Although they share some commonality, the recognition and synthesis speech capabilities can be, and are, used separately. A speech synthesis engine is instantiated locally in every application that uses it, whereas a speech recognition engine can either be instantiated privately or accessed through the shared desktop instance. (The programmer has more control over a private instance.) After instantiating an engine, an application can adjust its characteristics, invoke operations on it, and register for speech event notifications. Applications can choose to receive events through window messages, method callbacks, or Win32 events. These events can be filtered through the use of interests supplied to the engine.
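The interest mechanism can be pictured as a bitmask filter that is applied before an event is delivered. The following sketch is a language-neutral model written in Python, not SAPI code: the SpeechEvent flags, the EventSource class, and its set_interest, subscribe, and raise_event methods are hypothetical names invented for illustration.

```python
from enum import Flag, auto


class SpeechEvent(Flag):
    """Hypothetical event kinds; real SAPI defines its own event identifiers."""
    START_STREAM = auto()
    END_STREAM = auto()
    WORD_BOUNDARY = auto()
    RECOGNITION = auto()


class EventSource:
    """Toy engine-side event source that delivers only events of interest."""

    def __init__(self):
        self._interest = SpeechEvent(0)  # nothing is delivered until an interest is set
        self._callbacks = []

    def set_interest(self, interest):
        self._interest = interest

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def raise_event(self, event, payload=None):
        """Deliver the event if it matches the interest mask; report whether it was delivered."""
        if not (event & self._interest):
            return False
        for callback in self._callbacks:
            callback(event, payload)
        return True


received = []
source = EventSource()
source.subscribe(lambda event, payload: received.append((event, payload)))
source.set_interest(SpeechEvent.WORD_BOUNDARY | SpeechEvent.END_STREAM)

source.raise_event(SpeechEvent.START_STREAM)            # filtered out
source.raise_event(SpeechEvent.WORD_BOUNDARY, "hello")  # delivered
source.raise_event(SpeechEvent.END_STREAM)              # delivered
print([event.name for event, _ in received])
```

Filtering at the source keeps an application's handler from being invoked for events it never asked for, which is the purpose the interests serve in the description above.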
Targeted mainly at C/C++ developers, SAPI is divided into the following seven logical categories:
Audio
Used to control and customize real-time audio streams compatible with speech synthesis.
Eventing
Defines the events that can be received during speech synthesis and speech recognition operations.
Grammar Compiler
Used to dynamically define context-free grammars (CFGs) and compile them into the binary form used by the speech recognition engine.
Lexicon
Provides a uniform way for applications and engines to access the user lexicon, application lexicon, and engine private lexicons. Lexicons provide custom word pronunciations for speech synthesis.
Resource
Provides a uniform way to find and select SAPI speech data, such as voice files and pronunciation lexicons. SAPI represents each resource as a token object, which enables a program to inspect the various attributes of a resource without instantiating it.
Speech Recognition
Provides access to the speech recognition engine, contexts, grammars, and the resultant recognized phrases.
Speech Synthesis (Text-to-Speech)
Provides access to the speech synthesis engine characteristics and operations.
In addition, the SDK provides a set of helper classes and functions that simplify programming by consolidating related functionality. For more information, see the "Helper Functions" section, under the Speech SDK portion of the Windows SDK.
Speech Synthesis in SAPI
Speech synthesis, commonly referred to as text-to-speech (TTS), is used to translate either plain text or XML into voice. SAPI 5.3 supports the W3C Speech Synthesis Markup Language (SSML) version 1.0. SSML provides the ability to mark up voice characteristics, rate, volume, pitch, emphasis, and pronunciation so that developers can make TTS sound more natural in their applications.
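To make the markup concrete, the following sketch builds a small SSML 1.0 document that sets rate, volume, and pitch and emphasizes one word. It is written in Python purely as neutral notation and does not call SAPI. The build_ssml function and its parameters are hypothetical helpers, while the element names, attribute names, and namespace come from the W3C SSML 1.0 specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # namespace defined by SSML 1.0


def build_ssml(text, lang="en-US", rate="medium", volume="medium",
               pitch="medium", emphasize=None):
    """Return an SSML 1.0 document that speaks `text` with the given prosody.

    If `emphasize` is a substring of `text`, its first occurrence is wrapped
    in <emphasis level="strong">.
    """
    body = escape(text)
    if emphasize:
        target = escape(emphasize)
        body = body.replace(target, f'<emphasis level="strong">{target}</emphasis>', 1)
    return (
        f'<speak version="1.0" xmlns={quoteattr(SSML_NS)} xml:lang={quoteattr(lang)}>'
        f'<prosody rate={quoteattr(rate)} volume={quoteattr(volume)} pitch={quoteattr(pitch)}>'
        f'{body}'
        f'</prosody></speak>'
    )


ssml = build_ssml("Your flight departs at noon.", rate="slow", volume="loud",
                  emphasize="noon")
root = ET.fromstring(ssml)  # raises ParseError if the markup is not well formed
print(ssml)
```

The resulting string is the kind of XML input that an SSML-capable synthesis engine accepts in place of plain text. How it is handed to a particular engine is outside the scope of this sketch.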
Lacking specific markup, the engine uses the current voice object — which has an associated volume, pitch, and rate — to synthesize speech. The default voice is defined by the user through the Speech properties in the Control Panel, although custom voices can easily be created and saved as voice tokens.
Using speech synthesis is relatively straightforward because there is one main COM interface: ISpVoice. Through this interface, an application can:
Select a non-default voice and/or change the characteristics of the current voice.
Output speech from a string (Speak method) or a stream (SpeakStream method). These methods can be called synchronously or asynchronously.
Control operations through calls to the Pause and Resume methods of the ISpVoice object.
Optionally, an application can register for speech synthesis events. The application will need to write event-handler logic for events of interest.
If the application needs to display user messages relating to speech operations, it should first check whether the engine supplies this functionality by calling the ISpVoice::IsUISupported method and then, if it does, call DisplayUI.
Collections of hints, called lexicons, can be supplied to the engine to assist it in pronunciation and part-of-speech information for specific words. There are two types of lexicons in SAPI:
User lexicons — created for each user by the speech subsystem, these lexicons have words added either programmatically through the ISpLexicon interface or by the user through an application's UI.
Application lexicons — created and managed by and typically shipped with applications, these are read-only collections of specialized words for specific knowledge domains.
Each of these lexicon types implements the ISpLexicon interface and can be created directly, but SAPI provides the ISpContainerLexicon interface, which combines the user lexicon and all application lexicons into a single entity and makes manipulating the lexicon information much simpler. For more information, see "Lexicon Interfaces" in the Windows SDK.
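The idea of a container lexicon can be modeled as a single view over one writable user lexicon and any number of read-only application lexicons. The sketch below is a conceptual model in Python, not the SAPI interface: the ContainerLexicon class, its method names, the lookup order (user entries first), and the placeholder pronunciation strings are all assumptions made for illustration.

```python
from types import MappingProxyType


class ContainerLexicon:
    """Toy combined view over a writable user lexicon and read-only application lexicons."""

    def __init__(self, user_lexicon, application_lexicons):
        self._user = user_lexicon
        # Application lexicons are read-only, so expose them through immutable proxies.
        self._apps = [MappingProxyType(dict(lex)) for lex in application_lexicons]

    def add_pronunciation(self, word, pronunciation):
        """New entries always go to the user lexicon."""
        self._user.setdefault(word, []).append(pronunciation)

    def get_pronunciations(self, word):
        """Collect pronunciations from every lexicon, user entries first (an assumed order)."""
        found = list(self._user.get(word, []))
        for lexicon in self._apps:
            found.extend(p for p in lexicon.get(word, []) if p not in found)
        return found


# Placeholder pronunciation strings; these are not valid SAPI phoneme identifiers.
medical = {"tachycardia": ["t ae k ih k aa r d iy ax"]}
lexicon = ContainerLexicon(user_lexicon={}, application_lexicons=[medical])
lexicon.add_pronunciation("Nguyen", "w ih n")

print(lexicon.get_pronunciations("tachycardia"))
print(lexicon.get_pronunciations("Nguyen"))
```

A caller queries one object and never needs to know which underlying lexicon supplied an entry, which is the simplification the combined entity offers.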
Speech Recognition in SAPI
Speech recognition is more complicated than speech synthesis, and the API reflects this complexity. Whereas synthesis has one main interface, recognition has at least nine. However, the main interfaces represent the SR engine (ISpRecognizer2), the context in which it is used (ISpRecoContext2), custom grammars (ISpRecoGrammar2), and the resultant interpretation (ISpRecoResult or ISpXMLRecoResult). An application usually registers for one or more recognition events.
Speech recognition (SR) has two modes of operation:
Dictation mode — an unconstrained, free-form speech interpretation mode that uses a built-in grammar provided by the recognizer for a specific language. This is the default recognizer.
Grammar mode — matches spoken words to one or more specific context-free grammars (CFGs). A CFG is a structure that defines a specific set of words, and the combinations of these words that can be used. In basic terms, a CFG defines the sentences that are valid for SR. Grammars must be supplied by the application in the form of precompiled grammar files or supplied at runtime in the form of W3C Speech Recognition Grammar Specification (SRGS) markup or the older CFG specification. The Windows SDK includes a grammar compiler: gc.exe.
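As a rough illustration of grammar mode, the sketch below describes a tiny command grammar as a sequence of word slots, enumerates the sentences it accepts, and serializes it as SRGS 1.0 XML. It is written in Python as neutral notation and does not call SAPI. The SLOTS data, the valid_sentences and to_srgs helpers, and the rule name command are hypothetical. A real CFG can also contain optional and recursive rules, which this finite model leaves out.

```python
import itertools
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SRGS_NS = "http://www.w3.org/2001/06/grammar"  # namespace defined by SRGS 1.0

# Each slot lists the alternatives allowed at that position in the sentence.
SLOTS = [["play", "pause"], ["the"], ["album", "song", "playlist"]]


def valid_sentences(slots):
    """Enumerate every sentence the grammar accepts."""
    return {" ".join(words) for words in itertools.product(*slots)}


def to_srgs(slots, rule_id="command", lang="en-US"):
    """Serialize the slots as an SRGS 1.0 XML grammar with a single public root rule."""
    parts = []
    for alternatives in slots:
        items = "".join(f"<item>{escape(word)}</item>" for word in alternatives)
        parts.append(f"<one-of>{items}</one-of>")
    body = "".join(parts)
    return (
        f'<grammar version="1.0" xmlns={quoteattr(SRGS_NS)} xml:lang={quoteattr(lang)} '
        f'mode="voice" root={quoteattr(rule_id)}>'
        f'<rule id={quoteattr(rule_id)} scope="public">{body}</rule>'
        f'</grammar>'
    )


sentences = valid_sentences(SLOTS)
print("play the song" in sentences)   # a sentence the grammar allows
print("song the play" in sentences)   # same words, invalid order

grammar_xml = to_srgs(SLOTS)
ET.fromstring(grammar_xml)  # raises ParseError if the markup is not well formed
print(grammar_xml)
```

Constraining recognition to a small set of valid sentences like this is what distinguishes grammar mode from free-form dictation.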
In both modes, when the SR engine recognizes a word (or words), it fires a recognition event that an application can subscribe to. Alternatively, an application can poll the engine for results.
A single SR engine object can have multiple context objects associated with it because the shared desktop engine can be simultaneously accessed by multiple applications, and even a single application can have different modes or domains of operation, each requiring its own context. A single context object can use multiple grammars to separate different states within the same domain.
The Windows SDK contains a set of examples that demonstrate the use of SAPI speech recognition APIs.
SAPI 5.3 Changes
Windows Vista contains a new version of SAPI (version 5.3), which is binary compatible with the last version (5.1) released for Windows XP. SAPI 5.3 has the following general improvements:
Support for W3C XML speech grammars for recognition and synthesis. The Speech Synthesis Markup Language (SSML) version 1.0 provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation. The Speech Recognition Grammar Specification (SRGS) supports the definition of context-free grammars, with two limitations:
It does not support the use of SRGS to specify dual-tone multi-frequency (DTMF, or touch-tone) grammars.
It does not support Augmented Backus-Naur Form (ABNF).
User-specified shortcuts in lexicons, which is the ability to add a string to the lexicon and associate it with a shortcut word. When dictating, the user can say the shortcut word and the recognizer will return the expanded string.
Additional functionality and ease-of-programming provided by new types.
Improved reliability and security.
For more information, see "What's New in SAPI 5.3" in the Windows SDK.
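The user-specified shortcut feature listed above amounts to a lookup from a spoken word to a longer string. The following Python sketch models only that observable behavior. The expand_shortcuts function and the sample shortcut are hypothetical and do not reflect how the recognizer stores or applies shortcuts internally.

```python
def expand_shortcuts(dictated_words, shortcuts):
    """Replace each shortcut word with its expansion; other words pass through unchanged."""
    return " ".join(shortcuts.get(word.lower(), word) for word in dictated_words)


# Hypothetical user-defined shortcut: spoken word -> expanded string.
shortcuts = {"myaddress": "1 Example Street, Springfield"}

print(expand_shortcuts(["Ship", "it", "to", "myaddress"], shortcuts))
```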
SAPI 5.3 introduces a number of new interfaces, enumerations, and structures. The following table lists the new application-level interfaces.
ISpEventSource2
Provides the mechanism to filter and queue events. Extends the ISpEventSource interface with the GetEventsEx function, which retrieves extended event information.
ISpPhrase2
Accesses and serializes recognition information contained in a phrase. Extends the ISpPhrase interface to provide result information in SML and audio format.
ISpRecoContext2
Enables the creation of different functional views or contexts of an SR engine. Extends the ISpRecoContext interface with better support for grammars and language models.
ISpRecognizer2
Controls aspects of the SR engine, including specifying the recognition grammars, the current operational state, which events and results are to be used, when to display the engine UI, and so on. Extends the ISpRecognizer interface by supporting training states, case-sensitive recognition emulation, and resetting the user's recognition profile.
ISpRecoGrammar2
Manages the sets of grammars that the SR engine will recognize. Extends the ISpRecoGrammar interface by supporting customized grammar loading, custom security policies, and setting priority and weights on grammar rules.
ISpRecoResult2
Retrieves information about the SR engine's hypotheses and recognitions. Extends the ISpRecoResult interface by enabling the recognizer to be updated with new text or alternative phrases, and by adding the ability to set a feedback message.
ISpSerializeState
Accesses the recognizer's internal state, which can then be stored and fed back into the recognizer at a later stage.
ISpXMLRecoResult
Acquires the semantic results of speech recognition.
ISpeechRecoResult2
Provides COM Automation access to the ISpeechRecoResult and ISpeechXMLRecoResult interfaces.
ISpeechXMLRecoResult
Provides Automation access to the semantic results of speech recognition as an SML document.
The following table lists the new engine-level interfaces in version 5.3, all of which are found in the SR engine interface:
ISpEnginePronunciation
Retrieves lists of information about a given word in a dictionary.
ISpPrivateEngineCallEx
Supports SR engine invocation through the ISpRecoContext context interface. Extends the ISpPrivateEngineCall interface by supporting synchronized engine invocation.
ISpSRAlternates2
Defines the interface to an SR engine's alternate analyzer. Extends the ISpSRAlternates interface to support committing a corrected text string.
ISpSREngine2
Defines the main interface for an SR engine. Extends the ISpSREngine interface with much new functionality, including the capability to directly consume binary CFG data, recognition emulation, control of characteristics such as training state and the weight and priority of rules, and so on.
ISpSREngineSite2
Retrieves audio data and grammar information, sends events, and returns speech recognition information. Extends the ISpSREngineSite interface with support for extended events, retrieval of the current recognizer position and time, retrieval of the time of an event, and a method that enables the engine to obtain information about grammar rules.
Managed Speech Namespaces
The .NET Framework 3.0 System.Speech.* namespaces are largely built upon and follow the general programming approaches of SAPI 5.3. These namespaces can be used to speech-enable console, Windows Forms, and Windows Presentation Foundation applications. To use this set of managed libraries, a reference to the System.Speech.dll assembly must be added to the project.
ASP.NET Web applications should not use the System.Speech.* namespaces. Instead, ASP.NET applications can be speech-enabled when powered by the Microsoft Speech Server (MSS) and developed with the Microsoft Speech Application SDK (SASDK). These speech-enabled Web applications can be designed for devices ranging from telephones to Windows Mobile-based devices to desktop computers. For more information, see the Microsoft Speech Server site.
Although .NET Framework 3.0 classes will be made available down-level to Windows XP and Windows Server 2003, the platform support for speech applications is complicated because these technologies are only fully integrated within Windows Vista. Therefore the following limitations and caveats apply:
Although all versions of Windows compatible with the .NET Framework 3.0 include a speech synthesis engine, only Windows Vista and Windows XP Tablet PC Edition integrate a speech recognition engine. For other versions of Windows, it will be necessary to include the Speech SDK redistributable binaries. For more information, see the Licensing Microsoft Speech Technology site.
Certain functionality within the System.Speech.* namespaces depends on SAPI 5.3, but Microsoft has no plans to redistribute SAPI 5.3 binaries down-level to Windows XP or Windows Server 2003. If a managed application that depends upon this advanced functionality is run on one of these older operating systems, a PlatformNotSupportedException will be thrown.
The speech namespaces include some capabilities not found in SAPI 5.3, including a grammar builder (GrammarBuilder), prompt builder (PromptBuilder), an SRGS document object model (SrgsDocument), and strongly-typed grammars.
The following namespaces comprise the managed speech portion of the .NET Framework 3.0 library:
System.Speech.AudioFormat
Contains types that describe the audio stream used for speech recognition input and speech synthesis output.
System.Speech.Recognition
Contains types for implementing speech recognition, including access to the default shared SR service, control of the SR engine, access to built-in grammars, the ability to create custom grammars, and the ability to receive SR events.
System.Speech.Recognition.SrgsGrammar
Contains types to create and manipulate SRGS grammars.
System.Speech.Synthesis
Contains types for implementing speech synthesis, including access to the synthesizer engine, characteristics of the voice used, support for SSML documents, and so on.
System.Speech.Synthesis.TtsEngine
Supports the creation of SSML-based custom engines for speech synthesis.
The main classes supporting speech synthesis are:
SpeechSynthesizer
Provides access to the current system speech synthesis engine.
VoiceInfo
Contains information about a synthetic voice.
Prompt
Represents the text and associated hints that are to be spoken by the TTS engine.
PromptBuilder
Dynamically builds an SSML document that can be serialized.
PromptStyle
Represents customized characteristics associated with a prompt: volume, pitch, and rate.
SpeakStarted / SpeakStartedEventArgs
Fired when a prompt is first spoken / information related to this event.
SpeakProgress / SpeakProgressEventArgs
Fired when individual words or characters are spoken / information related to this event.
SpeakCompleted / SpeakCompletedEventArgs
Fired when a prompt is completely spoken / information related to this event.
Using speech synthesis is straightforward; the following steps demonstrate a minimal approach:
Create an instance of SpeechSynthesizer.
Create a Prompt instance from a string, SSML document, or PromptBuilder object.
Use the Speak method of the SpeechSynthesizer instance, or one of the related methods, to output the speech.
For more information, see "System.Speech.Synthesis Namespace" in the Windows SDK.
The main classes supporting speech recognition are:
SpeechRecognizer
Provides access to the default shared desktop SR service.
SpeechRecognitionEngine
Provides access to any installed SR engine.
RecognitionResult
Contains information on best and alternate matches by the SR engine.
SpeechUI
Provides an engine-specified mechanism to display result-specific feedback text to the user.
Grammar
Represents an SRGS or CFG grammar loaded from a string, stream, or document.
DictationGrammar
Provides access to the system-provided grammar used for free text dictation.
GrammarBuilder
Used to dynamically build custom recognition grammars.
SpeechDetected / SpeechDetectedEventArgs
Fired when speech is detected by the SR engine in the audio input stream / information related to this event.
SpeechRecognized / SpeechRecognizedEventArgs
Fired when the SR engine recognizes a word or phrase / information related to this event.
Although speech recognition and grammar construction are inherently more complex operations than speech synthesis, the System.Speech.Recognition.* namespaces make basic operations straightforward; the following steps demonstrate a minimal approach to recognition:
Create an instance of SpeechRecognizer or SpeechRecognitionEngine.
Create an instance of Grammar or DictationGrammar.
Load the grammar into the recognizer using either the LoadGrammar or LoadGrammarAsync method.
Subscribe to the SpeechRecognized event. The event-handler for this event will typically do the main processing for speech input in an application.
For more information, see "System.Speech.Recognition Namespace" in the Windows SDK.