Microsoft Speech API (SAPI) 5.3
This is the documentation for Microsoft Speech API (SAPI) 5.3, the native speech API for Windows.
This topic lists the interfaces, enumerations, and structures that were added for the SAPI 5.3 release.
It also includes conceptual material that describes the new scenarios that SAPI 5.3 supports:
- W3C Speech Synthesis Markup Language
- W3C Speech Recognition Grammar Specification
- Semantic Interpretation
New Managed API for Speech
Windows Vista includes a new .NET namespace, System.Speech, that allows developers to speech-enable applications, especially those based on the Windows Presentation Foundation. Authors of managed applications can use it in addition to, or as an alternative to, SAPI. For more information, see the System.Speech.* namespaces in the Windows SDK Class Library:
- System.Speech.AudioFormat
- System.Speech.Recognition
- System.Speech.Recognition.SrgsGrammar
- System.Speech.Synthesis
- System.Speech.Synthesis.TtsEngine
New SAPI 5.3 Interfaces
The new interfaces in SAPI 5.3 are:
| Interface Name | Description |
|---|---|
| ISpEnginePronunciation | Provides methods that retrieve lists of information about a given word in a dictionary. |
| ISpEventSource2 | Extends the ISpEventSource interface by providing a function that retrieves extended event information. |
| ISpGrammarBuilder2 | Extends the ISpGrammarBuilder interface. |
| ISpPhoneticAlphabetConverter | Enables applications to convert phonemes from SAPI to the Universal Phone Set (UPS) and from UPS to SAPI. |
| ISpPhoneticAlphabetSelection | Enables the client to control the character phoneset in which pronunciations are encoded by the GetPronunciations method. |
| ISpPhrase2 | Extends the ISpPhrase interface to provide SML result information and the audio stream containing the utterance corresponding to the result. |
| ISpPrivateEngineCallEx | Extends the ISpPrivateEngineCall interface. |
| ISpRecoContext2 | Extends the ISpRecoContext interface. |
| ISpRecognizer2 | Extends the ISpRecognizer interface. |
| ISpRecoGrammar2 | Extends the ISpRecoGrammar interface. |
| ISpRecoResult2 | Extends the ISpRecoResult interface. |
| ISpSerializeState | Enables applications to save and restore the recognizer's internal state. |
| ISpShortcut | Provides methods that implement user shortcuts. |
| ISpSRAlternates2 | Extends the ISpSRAlternates interface to enable an application to commit a particular text. |
| ISpSREngine2 | Extends the ISpSREngine interface. |
| ISpSREngineSite2 | Extends the ISpSREngineSite interface. |
| ISpXMLRecoResult | Defines the semantic results of speech recognition. |
| ISpeechResourceLoader | Gives applications control over loading resources. |
| ISpeechRecoResultDispatch | Supports IDispatch access to the ISpeechRecoResult and ISpeechXMLRecoResult interfaces. |
| ISpeechXMLRecoResult | Gets recognition results from the ISpXMLRecoResult as an SML document. |
New SAPI 5.3 Enumerations
The new enumerations in SAPI 5.3 are:
| Enum Name | Description |
|---|---|
| DISPID_SpeechXMLRecoResult | Enumerates the types of V-Table access to XML results. |
| PHONETICALPHABET | Placeholder definition. |
| SPADAPTATIONRELEVANCE | Defines levels of bias for adaptation settings. |
| SPADAPTATIONSETTINGS | Defines values for the types of adaptation settings. |
| SPCOMMITFLAGS | Defines values for the types of recognizer corrections. |
| SPGRAMMAROPTIONS | Specifies the types of grammar options in a recognition context. |
| SPMATCHINGMODE | Enumerates the modes of matching. |
| SPPRONUNCIATIONFLAGS | Is used with the ISpEnginePronunciation::GetPronunciations method. |
| SPSHORTCUTTYPE | Enumerates the types of shortcut pair. |
| SPXMLRESULTOPTIONS | Is used to designate whether the main result or the alternates are desired when calling the ISpPhrase2::GetXMLResult or ISpXMLRecoResult::GetXMLResult method. |
| SpeechEmulationCompareFlags | Enumerates values of comparison options in emulation. |
New SAPI 5.3 Structures
The new structures in SAPI 5.3 are:
| Structure Name | Description |
|---|---|
| SPEVENTEX | Contains information about an event and extends the SPEVENT structure. |
| SPNORMALIZATIONLIST | Contains a list of alternative form normalizations for a word. |
| SPRULE | Represents a rule in a collection of run-time grammar rules. |
| SPSEMANTICERRORINFO | Represents information about a recognition error. |
| SPSHORTCUTPAIR | Describes a pairing of a spoken shortcut phrase with the corresponding text or a node in a linked list of shortcut pairs. |
| SPSHORTCUTPAIRLIST | Contains the shortcut pair linked list. |
W3C Speech Synthesis Markup Language
SAPI 5.3 supports the W3C Speech Synthesis Markup Language (SSML) version 1.0, which is defined at http://www.w3.org/TR/speech-synthesis. SSML provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation, so that developers can make TTS sound more natural in their applications.
In addition to SSML, SAPI 5.3 continues to support the proprietary SAPITTS markup language for annotating text for TTS rendering. SSML and SAPITTS map onto each other closely enough that most SSML can be transformed into SAPITTS. This is what SAPI does when it receives SSML, so underlying TTS engines that were built for SAPITTS do not also need to support SSML.
SAPI 5.3 does not add a new device driver interface (DDI) through which TTS engines could accept SSML directly.
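The kinds of markup that SSML provides can be illustrated with a small document. The following is a minimal sketch in Python that uses only the standard library and does not call SAPI. The sample sentences, the `SSML_DOC` constant, and the `markup_used` helper are hypothetical names introduced for illustration; the element and attribute names (`speak`, `prosody`, `emphasis`, `phoneme`) are taken from the SSML 1.0 specification.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML 1.0 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Hypothetical sample text. It marks up speed, volume, and pitch (prosody),
# emphasis, and pronunciation (phoneme).
SSML_DOC = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody rate="slow" volume="loud" pitch="high">Welcome back.</prosody>
  Your order is <emphasis level="strong">ready</emphasis>.
  Ask for <phoneme alphabet="ipa" ph="t&#x259;&#x2C8;me&#x26A;to&#x28A;">tomato</phoneme> soup.
</speak>"""


def markup_used(ssml_text):
    """Return the sorted local names of the elements nested inside <speak>."""
    root = ET.fromstring(ssml_text)
    if root.tag != f"{{{SSML_NS}}}speak":
        raise ValueError("root element is not an SSML <speak>")
    names = set()
    for elem in root.iter():
        if elem is not root:
            # Strip the "{namespace}" prefix that ElementTree adds to tags.
            names.add(elem.tag.split("}", 1)[-1])
    return sorted(names)


if __name__ == "__main__":
    print(markup_used(SSML_DOC))
```

A document like `SSML_DOC` is the kind of input that an application would hand to a synthesizer. The helper only confirms that the text is well-formed SSML and reports which markup elements it uses.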
W3C Speech Recognition Grammar Specification
SAPI 5.3 supports the definition of context-free grammars using the W3C Speech Recognition Grammar Specification (SRGS), with these two important constraints:
- It does not support the use of SRGS to specify DTMF (touch-tone) grammars.
- It supports only the XML form of SRGS, not the augmented BNF (ABNF) form.
SRGS is defined at http://www.w3.org/TR/speech-grammar.
In addition to SRGS, SAPI 5.3 continues to support the proprietary SAPI CFG XML format for specifying a grammar.
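The two constraints on SRGS support can be checked before a grammar is handed to a recognizer. The following is a minimal sketch in Python, assuming the grammar is available as a string. The `srgs_problems` helper and the sample grammars are hypothetical; the `mode` attribute, the `#ABNF` header, and the namespace come from the SRGS specification. The sketch does not call SAPI.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SRGS 1.0 specification.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical sample grammar in the XML form of SRGS.
VOICE_GRAMMAR = """<grammar version="1.0" xml:lang="en-US" root="color"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="color">
    <one-of>
      <item>red</item>
      <item>green</item>
      <item>blue</item>
    </one-of>
  </rule>
</grammar>"""

# The same grammar declared as a touch-tone grammar, and an ABNF-form grammar.
DTMF_GRAMMAR = VOICE_GRAMMAR.replace('version="1.0"', 'version="1.0" mode="dtmf"')
ABNF_GRAMMAR = "#ABNF 1.0;\nlanguage en-US;\nroot $color;\n$color = red | green | blue;\n"


def srgs_problems(grammar_text):
    """List the reasons a grammar falls outside the two constraints above."""
    # ABNF-form grammars start with an "#ABNF" header instead of XML markup.
    if grammar_text.lstrip().startswith("#ABNF"):
        return ["grammar is in ABNF form; only the XML form is supported"]
    try:
        root = ET.fromstring(grammar_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != f"{{{SRGS_NS}}}grammar":
        problems.append("root element is not an SRGS <grammar>")
    # SRGS defines mode="voice" (the default) and mode="dtmf".
    if root.get("mode", "voice") == "dtmf":
        problems.append("DTMF grammars are not supported")
    return problems


if __name__ == "__main__":
    for name, text in [("voice", VOICE_GRAMMAR), ("dtmf", DTMF_GRAMMAR), ("abnf", ABNF_GRAMMAR)]:
        print(name, srgs_problems(text))
```

An empty list means the grammar passes both checks. The helper does not validate the grammar against the full SRGS schema.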
Semantic Interpretation
SAPI 5.3 enables an SRGS grammar to be annotated with semantic information, so that a recognition result can contain not only the recognized text but also the semantic interpretation of that text. For example, the recognized text of a yes/no grammar might be "yes", "yeah", or "yep", but the semantic meaning of all of these is "yes". This makes recognition results easier for applications to consume, and it lets grammar authors cover the full range of possible utterances without leaving the interpretation task to the application developer.
The annotation of semantic information within SRGS can be either of the following:
- A string literal containing the semantic value.
- A JScript statement that ultimately returns a string containing the semantic value.
In addition to the annotation of SRGS, SAPI also provides results that contain not only the recognized text but also the semantic information as a hierarchy of name-value pairs.
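The yes/no example can be sketched as an SRGS grammar whose items carry string-literal semantic tags. The following Python sketch is a toy interpreter, not SAPI: it reads the tags straight from the grammar and returns the recognized text together with a name-value pair. The `tag-format` value is the W3C string-literal format, and whether a given recognizer accepts that exact value is an assumption. The helper names and the shape of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SRGS 1.0 specification.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical yes/no grammar. Each <tag> holds a string literal with the
# semantic value of the item that contains it.
YES_NO_GRAMMAR = """<grammar version="1.0" xml:lang="en-US" root="answer"
         tag-format="semantics/1.0-literals"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="answer">
    <one-of>
      <item>yes<tag>yes</tag></item>
      <item>yeah<tag>yes</tag></item>
      <item>yep<tag>yes</tag></item>
      <item>no<tag>no</tag></item>
      <item>nope<tag>no</tag></item>
    </one-of>
  </rule>
</grammar>"""


def build_lookup(grammar_text):
    """Map each item's spoken text to a (rule id, semantic value) pair."""
    root = ET.fromstring(grammar_text)
    lookup = {}
    for rule in root.iter(f"{{{SRGS_NS}}}rule"):
        for item in rule.iter(f"{{{SRGS_NS}}}item"):
            tag = item.find(f"{{{SRGS_NS}}}tag")
            if tag is not None and item.text and tag.text:
                spoken = item.text.strip().lower()
                lookup[spoken] = (rule.get("id"), tag.text.strip())
    return lookup


def interpret(grammar_text, recognized_text):
    """Return the recognized text plus its semantics as name-value pairs."""
    rule_id, value = build_lookup(grammar_text)[recognized_text.strip().lower()]
    return {"text": recognized_text, "semantics": {rule_id: value}}


if __name__ == "__main__":
    for heard in ("yes", "yeah", "yep", "nope"):
        print(interpret(YES_NO_GRAMMAR, heard))
```

All three affirmative utterances produce the same semantic value, so the consuming application only has to test `semantics["answer"]` and never the raw text.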