Microsoft Speech API (SAPI) 5.3

This is the documentation for Microsoft Speech API (SAPI) 5.3, the native speech API for Windows.

The sections below list the interfaces, enumerations, and structures that have been added for the SAPI 5.3 release.

This topic also includes conceptual material that describes the new scenarios that SAPI 5.3 supports.

New Managed API for Speech

Windows Vista includes a new .NET namespace, System.Speech, that allows developers to speech-enable applications, especially those based on the Windows Presentation Foundation. Authors of managed applications can use this namespace in addition to, or as an alternative to, SAPI. For more information, see the System.Speech.* namespaces in the Windows SDK Class Library.

New SAPI 5.3 Interfaces

The new interfaces in SAPI 5.3 are:

Interface Name                Description
ISpEnginePronunciation        Provides methods that retrieve lists of information about a given word in a dictionary.
ISpEventSource2               Extends the ISpEventSource interface by providing a function that retrieves extended event information.
ISpGrammarBuilder2            Extends the ISpGrammarBuilder interface.
ISpPhoneticAlphabetConverter  Enables applications to convert phonemes from SAPI to the Universal Phone Set (UPS) and from UPS to SAPI.
ISpPhoneticAlphabetSelection  Enables the client to control the character phoneset in which pronunciations are encoded by the GetPronunciations method.
ISpPhrase2                    Extends the ISpPhrase interface to provide SML result information and the audio stream containing the utterance corresponding to the result.
ISpPrivateEngineCallEx        Extends the ISpPrivateEngineCall interface.
ISpRecoContext2               Extends the ISpRecoContext interface.
ISpRecognizer2                Extends the ISpRecognizer interface.
ISpRecoGrammar2               Extends the ISpRecoGrammar interface.
ISpRecoResult2                Extends the ISpRecoResult interface.
ISpSerializeState             Enables applications to save and restore the recognizer's internal state.
ISpShortcut                   Provides methods that implement user shortcuts.
ISpSRAlternates2              Extends the ISpSRAlternates interface to enable an application to commit a particular text.
ISpSREngine2                  Extends the ISpSREngine interface.
ISpSREngineSite2              Extends the ISpSREngineSite interface.
ISpXMLRecoResult              Defines the semantic results of speech recognition.
ISpeechResourceLoader         Gives applications control over loading resources.
ISpeechRecoResultDispatch     Supports IDispatch access to the ISpeechRecoResult and ISpeechXMLRecoResult interfaces.
ISpeechXMLRecoResult          Gets recognition results from the ISpXMLRecoResult as an SML document.

New SAPI 5.3 Enumerations

The new enumerations in SAPI 5.3 are:

Enum Name                     Description
DISPID_SpeechXMLRecoResult    Enumerates the types of V-Table access to XML results.
PHONETICALPHABET              Placeholder definition.
SPADAPTATIONRELEVANCE         Defines levels of bias for adaptation settings.
SPADAPTATIONSETTINGS          Defines values for the types of adaptation settings.
SPCOMMITFLAGS                 Defines values for the types of recognizer corrections.
SPGRAMMAROPTIONS              Specifies the types of grammar options in a recognition context.
SPMATCHINGMODE                Enumerates the modes of matching.
SPPRONUNCIATIONFLAGS          Is used with the ISpEnginePronunciation::GetPronunciations function.
SPSHORTCUTTYPE                Enumerates the types of shortcut pair.
SPXMLRESULTOPTIONS            Designates whether the main result or the alternates are desired when calling the ISpPhrase2::GetXMLResult or ISpXMLRecoResult::GetXMLResult method.
SpeechEmulationCompareFlags   Enumerates values of comparison options in emulation.

New SAPI 5.3 Structures

The new structures in SAPI 5.3 are:

Structure Name        Description
SPEVENTEX             Contains information about an event and extends the SPEVENT structure.
SPNORMALIZATIONLIST   Contains a list of alternative form normalizations for a word.
SPRULE                Represents a rule in a collection of run-time grammar rules.
SPSEMANTICERRORINFO   Represents information about a recognition error.
SPSHORTCUTPAIR        Describes a pairing of a spoken shortcut phrase with the corresponding text or a node in a linked list of shortcut pairs.
SPSHORTCUTPAIRLIST    Contains the shortcut pair linked list.

W3C Speech Synthesis Markup Language

SAPI 5.3 supports the W3C Speech Synthesis Markup Language (SSML) version 1.0. SSML provides the ability to mark up voice characteristics, speed, volume, pitch, emphasis, and pronunciation, so that developers can make TTS sound more natural in their applications.
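
As an illustration of the markup described above, the following Python sketch builds a small SSML 1.0 document using only the standard library. The element and attribute names (speak, prosody, emphasis, rate, volume, pitch, level) and the namespace URI come from the W3C SSML 1.0 specification. The function name build_ssml and the sample sentence are hypothetical, and the sketch makes no claim about which of these elements a particular SAPI voice honors.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the W3C SSML 1.0 and XML specifications.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(lang="en-US"):
    """Return a small SSML 1.0 document as a string (illustrative only)."""
    # Serialize SSML elements without a prefix (default namespace).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    speak.text = "Your order "

    # <prosody> controls speed (rate), volume, and pitch.
    prosody = ET.SubElement(
        speak,
        f"{{{SSML_NS}}}prosody",
        {"rate": "slow", "volume": "loud", "pitch": "high"},
    )
    prosody.text = "has shipped"
    prosody.tail = " and should arrive "

    # <emphasis> marks text that should be stressed.
    emphasis = ET.SubElement(speak, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = "tomorrow"
    emphasis.tail = "."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```

The resulting string is the kind of text an application would hand to a TTS engine in place of plain text; how it is passed to SAPI is outside the scope of this sketch.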

In addition to SSML, SAPI 5.3 continues to support the proprietary SAPITTS markup language for annotating text for TTS rendering. SSML and SAPITTS map onto each other closely enough that most SSML can be transformed into SAPITTS. This is what SAPI does when it receives SSML, so that underlying TTS engines that have been built for SAPITTS do not also need to support SSML.

SAPI 5.3 does not add a new DDI through which TTS engines can accept SSML directly.

W3C Speech Recognition Grammar Specification

SAPI 5.3 supports the definition of context-free grammars using the W3C Speech Recognition Grammar Specification (SRGS), with these two important constraints:

  • It does not support the use of SRGS to specify DTMF (touch-tone) grammars.
  • It supports only the XML form of SRGS, not the augmented BNF (ABNF) form.

SRGS is defined in a W3C Recommendation.
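
The two constraints above can be sketched as a simple pre-check on a grammar document. The following Python example is a hypothetical helper, not the way SAPI itself validates grammars. It relies on two facts from the SRGS specification: ABNF-form grammars begin with the self-identifying header "#ABNF", and XML-form grammars carry a mode attribute whose value is "voice" (the default) or "dtmf". The names check_srgs_constraints and VOICE_GRAMMAR are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the XML form of SRGS 1.0.
SRGS_NS = "http://www.w3.org/2001/06/grammar"


def check_srgs_constraints(grammar_text):
    """Return a list of reasons the grammar falls outside the two constraints.

    Hypothetical helper: it mirrors the constraints listed in the text and
    does not reproduce SAPI's own grammar validation.
    """
    problems = []

    # ABNF-form SRGS documents start with the header "#ABNF".
    if grammar_text.lstrip().startswith("#ABNF"):
        problems.append("ABNF form is not supported; use the XML form")
        return problems

    root = ET.fromstring(grammar_text)
    if root.tag != f"{{{SRGS_NS}}}grammar":
        problems.append("root element is not an SRGS <grammar>")
        return problems

    # "mode" defaults to "voice"; "dtmf" marks a touch-tone grammar.
    if root.get("mode", "voice") == "dtmf":
        problems.append("DTMF grammars are not supported")
    return problems


# A minimal XML-form voice grammar that satisfies both constraints.
VOICE_GRAMMAR = f"""<grammar version="1.0" xml:lang="en-US" mode="voice"
         root="answer" xmlns="{SRGS_NS}">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>"""

if __name__ == "__main__":
    print(check_srgs_constraints(VOICE_GRAMMAR))
```

An empty list means the grammar is in XML form and is not a DTMF grammar; it does not guarantee that a recognizer will accept the grammar.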

In addition to SRGS, SAPI 5.3 continues to support the proprietary SAPI CFG XML format for specifying a grammar.

Semantic Interpretation

SAPI 5.3 enables an SRGS grammar to be annotated with semantic information, so that a recognition result may contain not only the recognized text but also the semantic interpretation of that text. For example, the recognized text of a yes/no grammar might be "yes", "yeah" or "yep", but the semantic meaning of all of these is "yes". This makes it easier for applications to consume recognition results, and it lets grammar authors cover the full range of possible utterances without burdening the application developer with interpreting them.

The annotation of semantic information within SRGS can be either of the following:

  • A string literal containing the semantic value.
  • A JScript statement that ultimately returns a string containing the semantic value.
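
The string-literal form of annotation can be sketched with the yes/no example from above. In the following Python example, each SRGS item carries a tag element whose content is the semantic value, and a toy lookup stands in for the recognizer. The tag-format value "semantics/1.0-literals" is taken from the W3C Semantic Interpretation for Speech Recognition specification; this document does not state which tag formats SAPI 5.3 accepts, so treat it as an assumption. The "no" and "nope" items and the names YES_NO_GRAMMAR, literal_semantics, and interpret are hypothetical. The JScript form is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the XML form of SRGS 1.0.
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Each <item> pairs a spoken form with a string-literal semantic value.
YES_NO_GRAMMAR = f"""<grammar version="1.0" xml:lang="en-US" root="answer"
         tag-format="semantics/1.0-literals" xmlns="{SRGS_NS}">
  <rule id="answer">
    <one-of>
      <item>yes<tag>yes</tag></item>
      <item>yeah<tag>yes</tag></item>
      <item>yep<tag>yes</tag></item>
      <item>no<tag>no</tag></item>
      <item>nope<tag>no</tag></item>
    </one-of>
  </rule>
</grammar>"""


def literal_semantics(grammar_text):
    """Map each item's spoken text to the string literal in its <tag>."""
    root = ET.fromstring(grammar_text)
    mapping = {}
    for item in root.iter(f"{{{SRGS_NS}}}item"):
        tag = item.find(f"{{{SRGS_NS}}}tag")
        if tag is not None and item.text:
            mapping[item.text.strip()] = (tag.text or "").strip()
    return mapping


def interpret(recognized_text, grammar_text=YES_NO_GRAMMAR):
    """Toy stand-in for a recognizer: return (text, semantic value)."""
    semantics = literal_semantics(grammar_text)
    return recognized_text, semantics.get(recognized_text)


if __name__ == "__main__":
    for spoken in ("yes", "yeah", "yep", "nope"):
        print(interpret(spoken))
```

An application that consumes such results can branch on the semantic value alone and ignore which spoken variant the user chose.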

Beyond supporting annotated SRGS grammars, SAPI provides results that contain both the recognized text and the semantic information, the latter as a hierarchy of name-value pairs.