Speech Glossary


A letter sequence formed from the initial letters of a name, such as IPA for International Phonetic Alphabet, or variations on the initial letters, such as XML for Extensible Markup Language.

acoustic model (AM)

A component of a speech engine that describes the sounds of a spoken language. A speech engine uses its acoustic model, in conjunction with its language model and lexicon, to determine how to recognize or pronounce words or phrases.

active grammar

A speech grammar or DTMF grammar that is loaded by a speech recognition engine and is currently active for recognition.


A-law

A standard compression algorithm, used in digital communications systems of the European digital hierarchy, to modify and optimize the dynamic range of an analog signal for digitizing.


alignment

The process of matching words in a transcript to words in a .wav file.


alternative

In speech recognition grammars, one of a set of alternate words or phrases, any of which will match all or part of a rule. In XML-format SRGS grammars, a set of alternative elements is contained within a one-of element. Each alternative is contained in an item element.
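
As an illustration of the one-of and item elements, the following minimal sketch embeds a small SRGS grammar in a Python string and lists the alternatives of one rule. The grammar content, the rule id "color", and the function name list_alternatives are hypothetical and exist only for this example; the namespace URI is the one defined by the SRGS specification.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SRGS specification for XML-format grammars.
SRGS_NS = {"g": "http://www.w3.org/2001/06/grammar"}

# Hypothetical grammar: one rule whose one-of element holds three alternatives.
GRAMMAR_XML = """<grammar version="1.0" xml:lang="en-US" root="color"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="color">
    <one-of>
      <item>red</item>
      <item>green</item>
      <item>blue</item>
    </one-of>
  </rule>
</grammar>"""


def list_alternatives(grammar_xml: str, rule_id: str) -> list[str]:
    """Return the text of each item directly inside the rule's one-of element."""
    root = ET.fromstring(grammar_xml)
    for rule in root.findall("g:rule", SRGS_NS):
        if rule.get("id") == rule_id:
            items = rule.findall("g:one-of/g:item", SRGS_NS)
            return [(item.text or "").strip() for item in items]
    return []


print(list_alternatives(GRAMMAR_XML, "color"))
```

A recognizer that loads such a grammar would match any one of the three words against the rule; the sketch only inspects the markup and performs no recognition.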

ambient noise

The background noise in an area or environment, being a composite of sounds from many sources near and far.


amplitude

A measure of the strength of a signal, for example an audio signal, determined by the distance from the baseline to the peak of the waveform.

audio format

A collection of characteristics that describe the composition of an audio signal that is being received for recognition, including bit depth, samples per second, bytes per second, block align, channel count, and encoding.
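
Several of these characteristics are related arithmetically for uncompressed PCM audio. The sketch below assumes the usual PCM conventions (block align is the size in bytes of one sample across all channels, and bytes per second is samples per second multiplied by block align); the class name AudioFormat and its fields are hypothetical and not part of any speech API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    """Hypothetical description of an uncompressed PCM audio format."""
    samples_per_second: int
    bits_per_sample: int   # bit depth
    channel_count: int

    @property
    def block_align(self) -> int:
        # Bytes needed to store one sample for every channel.
        return self.channel_count * self.bits_per_sample // 8

    @property
    def bytes_per_second(self) -> int:
        # Average data rate of the stream.
        return self.samples_per_second * self.block_align


# 16 kHz, 16-bit, mono: a format often used for speech input.
fmt = AudioFormat(samples_per_second=16000, bits_per_sample=16, channel_count=1)
print(fmt.block_align, fmt.bytes_per_second)
```

These relationships do not hold for compressed encodings, where the data rate depends on the codec.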

audio level

The perceived loudness of an audio signal that is being received for recognition. Factors include the loudness of the sound source (speaker), the proximity of the sound source to the recording device, and the gain of the recording device.


babble

Audio input that is neither speech that matches the initial rule of any of the recognizer's loaded and enabled speech recognition grammars, nor silence.


bookmark

A marker that can be placed in a speech synthesis prompt for which the speech synthesizer will raise an event.


.cfg

The file name extension for an XML-format grammar file that has been compiled to a binary format for consumption by a speech recognition engine.


CFG

See context-free grammar (CFG).

close-talk microphone

A standard type of microphone often used in headsets and other devices in which the user speaks directly into the microphone.


codec

An abbreviation for compressor/decompressor. Software or hardware used to compress or decompress digital media, including audio.

confidence score

A value indicating the likelihood that a word or phrase recognized by a speech engine matches the word or phrase actually uttered by the speaker.

context-free grammar (CFG)

A context-free grammar describes how phrases in a language are built from smaller blocks. SRGS grammars are context-free grammars.

conversational understanding

The ability of a system to recognize spontaneous, conversational speech.


contour

In speech synthesis, an attribute that specifies the frequency and location of changes in pitch that the TTS engine should apply when speaking the contents of a prosody element.


culture

In managed code, a class of information about a particular nation or people including their collective name, writing system, calendar used, and conventions for formatting dates and sorting strings.


diacritic

An ancillary glyph added to a letter, for example ç, ö, or ê.


dictation

A mode of speech recognition, made possible by a large grammar with comprehensive coverage of a language, which accepts freely spoken input.


diphone

A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. For example, the word "hello" consists of these diphones: [silence-h] [h-eh] [eh-l] [l-oe] [oe-silence].
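
The "hello" example can be reproduced with a short sketch that pads a phone sequence with silence on both ends and pairs each unit with its successor. The function name to_diphones and the phone labels are taken from the example above or invented for illustration; a real text-to-speech engine works with recorded audio segments rather than strings.

```python
def to_diphones(phones: list[str]) -> list[str]:
    """Pair each phone with the next one, with silence at both ends."""
    padded = ["silence", *phones, "silence"]
    return [f"{lead}-{finish}" for lead, finish in zip(padded, padded[1:])]


# Phone labels for "hello" as used in the glossary example.
print(to_diphones(["h", "eh", "l", "oe"]))
```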

diphone concatenation

The process by which a text-to-speech engine concatenates short digital-audio segments and performs inter-segment smoothing to produce a continuous sound.

DOCTYPE declaration

A declaration at the beginning of an XML-format document that gives a public or system identifier for the document type definition (DTD) of the document.

DTMF grammar

A grammar that recognizes dual tone multi-frequency (DTMF) inputs. Contrast with a speech grammar.

dual tone multi-frequency (DTMF)

The signaling system used in telephones with touch-tone keypads, in which each digit is associated with two specific frequencies.
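
The pairing of each key with two frequencies can be sketched as a lookup over the keypad grid: the row selects a low-group frequency and the column selects a high-group frequency. The frequency values below are the standard DTMF keypad frequencies; the names KEYPAD and dtmf_frequencies are hypothetical.

```python
# Standard DTMF frequencies in hertz: one per keypad row, one per column.
ROW_FREQS_HZ = (697, 770, 852, 941)
COLUMN_FREQS_HZ = (1209, 1336, 1477, 1633)

# Keypad layout; the fourth column (A-D) is absent from most telephones.
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key: str) -> tuple[int, int]:
    """Return the (row, column) frequency pair that signals the given key."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            column = keys.find(key)
            if column != -1:
                return ROW_FREQS_HZ[row], COLUMN_FREQS_HZ[column]
    raise ValueError(f"not a DTMF key: {key!r}")


print(dtmf_frequencies("5"))
```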

Dublin Core Metadata Initiative

An open forum engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models.


duration

In speech synthesis, an attribute that specifies the desired time that the TTS engine should take to read the contents of a prosody element.

dynamic grammar

A grammar that is created during application execution.


ECMAScript

A scripting programming language created to capture the common core language elements of JavaScript and JScript. ECMAScript is used in the tag element of XML-format grammars to define the semantics for words or phrases.


emphasis

In speech synthesis, the stress or spoken prominence given to a word or phrase.


emulation

The process by which text, rather than speech, is provided as input to a speech recognition engine. Emulation is useful to determine whether a word or phrase is in a grammar and to inspect the recognition result for the word or phrase.

Extensible Markup Language (XML)

A markup language that provides a format for describing structured data. XML is a World Wide Web Consortium (W3C) specification, and is a subset of Standard Generalized Markup Language (SGML).

external grammar

A stand-alone grammar file that is linked from another grammar file using a rule reference. Also referred to as an imported grammar.


garbage

In grammars, a special rule that will match any speech up to the next rule match, the next token, or until the end of spoken input.

globally unique identifier (GUID)

A program-generated number that creates a unique identity for an object.


grammar

A structured list of rules that identify words or phrases that the speech recognition engine should attempt to identify in the spoken input. Also known as a grammar file or a grammar document.

grammar compiler

A compiler that transforms an XML-format grammar file into a binary format file with a .cfg file extension for consumption by a speech recognition (SR) engine.

grammar library

A collection of ready-to-use rules and rule sets designed to recognize commonly used types of user voice input such as dates, times, currency units, numbers, and confirmatory responses, as well as dual tone multi-frequency (DTMF) input.

Grammar Rule Name (GRN) referencing

A type of Semantic Markup Language (SML) script referencing in which the script expression evaluates semantic values of, or assigns semantic values to, the Rule Variable (RV) of the rule element that contains the expression.

Grammar Rule Name (GRN) Rule Variable

A predefined object that holds a semantic value that can be composed of multiple properties. Every rule element in a grammar has a single GRN Rule Variable. The GRN Rule Variable is identified by a dollar sign ($).

Grammar Rule Reference (GRR) referencing

A type of Semantic Markup Language (SML) script referencing in which the script expression evaluates semantic values of the Rule Variable (RV) of a rule element outside of the rule element that contains the expression.

Grammar Rule Reference (GRR) Rule Variable

The Rule Variable of the external rule element to which a grammar rule reference is made. The GRR Rule Variable is identified by a double dollar sign ($$).


grapheme

A fundamental unit in a written language, for example, a letter, a numeral, or a punctuation mark.


.grxml

The file name extension for XML-format grammar files.


GUID

See globally unique identifier (GUID).


homonym

A word with the same sound (homophone) or spelling (homograph) as another, but with a different meaning.


homophone

One of a set of words that are pronounced the same way but differ in meaning, and sometimes in spelling, for example night and knight in English.


homograph

One of a set of words of the same written form but of different meaning and usually origin, whether pronounced the same way or not. For example: lead (to conduct) and lead (the metal).

in-process recognizer

A speech recognition engine that is under the control of a single application. Contrast with a shared recognizer.

International Phonetic Alphabet (IPA)

A standardized system of letters and marks, partially based on the letters of the Roman alphabet, used internationally to represent speech sounds.

inverse text normalization (ITN)

In speech recognition, the process of converting the spelled-out textual result of recognized speech into its more commonly used written form. For example, the recognized text "one dollar and sixteen cents" would be normalized to "$1.16".


.js

The file name extension for JScript files.

language model (LM)

A component of a speech engine that describes how words of a spoken language are constructed into meaningful sequences. A speech engine has a language model that typically encompasses an entire language, such as French. A grammar is a language model that defines only the words and phrases that have meaning for an application. A speech engine uses its language model(s), in conjunction with its acoustic model and lexicon(s), to determine how to recognize or pronounce words and phrases.

language pack

A set of language resources that supports the development and deployment of applications in a particular language.


lexeme

The name of an element in the Pronunciation Lexicon Specification that contains one or more written representations of a word, one or more pronunciations, and one or more examples.


lexicon

A file that contains the mapping between the written representations and the pronunciations of words or short phrases in a language. A speech engine typically has a default lexicon. Applications and users may also create lexicons. A speech engine uses the phonetic spellings in lexicons, in conjunction with its acoustic model and language model, to determine how to recognize or pronounce words or phrases.

managed code

Code executed by the Microsoft .NET Framework common language runtime (CLR).


metadata

Data that is used to describe other data.


morpheme

A morpheme is the smallest meaningful unit in the grammar of a language. The word "unladylike" consists of three morphemes and four syllables: "un", "lady", and "like". The word "dogs" consists of two morphemes and one syllable: "dog" and "s", a plural marker on nouns.


mu-law

A standard analog signal compression or companding algorithm, used in digital communications systems of the North American and Japanese digital hierarchies, to optimize the dynamic range of an audio analog signal prior to digitizing.
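
The companding used in the North American and Japanese hierarchies is the mu-law curve. The sketch below applies the continuous mu-law formula with mu = 255 to a sample normalized to the range -1.0 to 1.0; it illustrates how quiet samples are boosted before quantization and is not an implementation of the 8-bit quantized telephony codec. The function names compress and expand are hypothetical.

```python
import math

MU = 255  # value conventionally used for 8-bit mu-law telephony


def compress(x: float) -> float:
    """Map a normalized sample in [-1, 1] onto the mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y: float) -> float:
    """Invert compress(), recovering the original normalized sample."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


# A quiet sample occupies a much larger share of the compressed range.
quiet = 0.01
print(compress(quiet), expand(compress(quiet)))
```

Because the curve is logarithmic, small amplitudes receive finer resolution than large ones once the compressed value is quantized.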

Multipurpose Internet Mail Extensions (MIME)

A protocol widely used on the Internet that extends the Simple Mail Transfer Protocol (SMTP) to permit data (such as video, sound, and binary files) to be transmitted by Internet e-mail without having to be translated into ASCII format first. Used to describe the media type of lexicons.

natural language

A human language, as opposed to a command or programming language traditionally used to communicate with a computer.

natural language understanding

The ability to infer the intended meaning of a natural language utterance based on the words contained in that utterance.


N-best

The recognition results in which the speech recognition engine has the highest levels of confidence. N is the number of results returned.

.NET Framework

An environment for building, deploying, and running Web Services and other applications. It consists of three main parts: the Common Language Runtime, the Framework classes, and ASP.NET.


node

A word, or the phonetic spelling for a word, on a recognition path in a recognition/alternative graph generated by a speech recognition engine.


orthography

Describes the written form of a language. The elements of orthography include spelling, hyphenation, capitalization, word breaks, and punctuation.

out-of-grammar utterance

An utterance containing words or phrases that are not included in a speech recognition grammar.


PCM8

A recording format typically used in desktop applications. In PCM8, the bit depth of each sample is 8 bits. This format results in lower quality audio than PCM16, but it requires less disk space.


PCM16

A recording format typically used in desktop applications. In PCM16, the bit depth of each sample is 16 bits.


phone

A notation (consisting of letters, numbers, or characters) that represents a discrete sound in a spoken language. Phones are used to create the phonetic spellings that specify how a word should be pronounced, or to specify the pronunciations of a word that should be recognized. Changing a phone in a word will alter its pronunciation.


phoneme

A phoneme is a basic component of written language, typically a letter of an alphabet (or a combination of letters) that represents one or more distinct sounds. For example, the letter “c” is a phoneme that may sound like “s” in “cinder”, or like “k” in “catch”. A written word is an assemblage of phonemes. Changing a phoneme in a word will alter its spelling.

The term phoneme is also used as the name of an XML element in the SRGS and SSML specifications that contains the phonetic spelling (consisting of phones) for a word or phrase.

phonetic alphabet

A set of symbols that represents the sounds in spoken languages. Also referred to as pronunciation alphabet. There are three phonetic alphabets used by System.Speech: the International Phonetic Alphabet (IPA), the Universal Phone Set (UPS), and the SAPI Phone Set.


phrase

An ordered list of words that are spoken in the same utterance.


pitch

A characteristic of a sound that is determined by the frequency of its vibration. A high-pitched sound has a higher frequency; a low-pitched sound has a lower frequency. In speech synthesis, an attribute that specifies the baseline pitch that the TTS engine should apply when speaking the contents of a prosody element.


postamble

Optional ending words or phrases.


preamble

Optional beginning words or phrases.


prompt

A question, directive, greeting, or information spoken by a speech application. The term also refers to the contents of a prompt, which may be unmarked text or text that is formatted with SSML markup.


pronunciation

The way a word or a language is usually spoken.

pronunciation lexicon

See lexicon.

Pronunciation Lexicon Specification (PLS)

The W3C specification that defines the elements, attributes, and syntax for XML-format lexicons used in speech recognition and speech synthesis.


prosody

A collection of phonological features (pitch, range, contour, volume, rate, and duration) that define the characteristics of spoken language.

pulse code modulation (PCM)

The most common method of encoding an analog voice signal into a digital bit stream.


range

In speech synthesis, an attribute that specifies the span of pitch fluctuations that the TTS engine should apply when speaking the contents of a prosody element.


rate

In speech synthesis, an attribute that specifies the tempo that the TTS engine should apply when speaking the contents of a prosody element.

recognition engine

See speech recognition engine.

recognition grammar

See speech recognition grammar.

recognition path

A sequence of words or phonemes that an engine analyzed while attempting to recognize an utterance.

recognition alternate

One of a collection of possible matches for input to a speech recognition engine.

rejection threshold

A confidence value below which a recognition alternate is rejected by the application.
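
How an application might apply a rejection threshold can be sketched as a filter over a list of recognition alternates, each carrying a confidence score. The data shape (text paired with a confidence between 0.0 and 1.0), the threshold value, and the function name accept_alternates are assumptions made for illustration, not part of any speech API.

```python
def accept_alternates(
    alternates: list[tuple[str, float]], threshold: float
) -> list[tuple[str, float]]:
    """Keep alternates at or above the threshold, most confident first."""
    kept = [(text, conf) for text, conf in alternates if conf >= threshold]
    return sorted(kept, key=lambda alternate: alternate[1], reverse=True)


# Hypothetical alternates returned for a single utterance.
candidates = [("call Rome", 0.62), ("call home", 0.91), ("cold foam", 0.18)]
print(accept_alternates(candidates, threshold=0.5))
```

Raising the threshold reduces false acceptances at the cost of rejecting more correct recognitions, so the value is usually chosen during tuning.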

root rule

The rule that is the entry point to a grammar, and that is active when a speech recognition engine loads the grammar.

Root Rule Variable (RRV)

The GRN Rule Variable of the root rule of a grammar. The RRV provides the semantic result of a recognition.


rule

In grammars, a rule defines a pattern or sequence of words or phrases that a speech recognition engine can use to perform recognition.

semantic interpretation (SI)

The process by which a semantic interpreter generates a result based on a spoken word or phrase that matches a grammar rule.

Semantic Interpretation for Speech Recognition (SISR)

The W3C specification that describes how tags may be inserted within SRGS grammars to support basic post-processing or full semantic interpretation.

semantic item

A value returned by a grammar rule when a user's utterance matches the rule.

Semantic Markup Language (SML)

An XML-based markup language that allows the application to identify and parse meaningful parts of speech recognition output.

shared recognizer

The speech recognition engine in Windows that any open application can use to recognize spoken input. Contrast with an in-process recognizer.


silence

In speech recognition, the absence of spoken input or background noise.

In speech synthesis, the space between spoken words; also referred to as the break or pause between words.


speaker

The user who utters the speech to be recognized by an application.


speaker-dependent

Describes a speech recognition engine that requires the user to train it to recognize his or her voice.


speaker-independent

Describes a speech recognition engine that does not require training.

Speech API (SAPI)

The native-code speech application programming interface (API) for the Windows desktop. The System.Speech managed-code API is a subset of SAPI.

speech application

An application that uses the System.Speech managed-code API to provide speech recognition and/or speech synthesis for user interaction.

speech grammar

A grammar that recognizes speech inputs. Contrast with a DTMF grammar.

speech recognition (SR)

The process of converting spoken language into a text transcription. Also referred to as automated speech recognition (ASR).

speech recognition engine

Software that accepts spoken language as input, determines what words and phrases or semantic information are present, and outputs a text transcription of the result. Also referred to as a speech recognizer.

Speech Recognition Grammar Specification (SRGS)

A specification developed by the World Wide Web Consortium (W3C) that defines syntax for representing grammars for use in speech recognition. SRGS enables developers to specify the words and patterns of words that a speech recognition engine can use to perform recognition.

speech synthesis

The process of producing synthesized speech output from text on computers.

speech synthesis engine

Software that converts plain text or text with XML markup into artificial speech. A speech synthesis engine synthesizes the glottal pulse from human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position. Also referred to as text-to-speech engine, TTS engine, speech synthesizer, and voice.

Speech Synthesis Markup Language (SSML)

An XML-based markup language used to control various characteristics of synthetic speech output including voice, pitch, rate, volume, pronunciation, and other characteristics.

speech synthesizer

See speech synthesis engine.


SRGS

See Speech Recognition Grammar Specification (SRGS).


SSML

See Speech Synthesis Markup Language (SSML).


tag

The name of the element in a grammar that contains semantic information, either as string literals or as ECMAScript.

text normalization

Performed as part of speech synthesis, the process of converting numbers, abbreviations, acronyms, and other non-word written symbols into words that a speaker would say when reading that symbol out loud.

text-to-speech (TTS)

Technologies for converting textual (ASCII) information into synthetic speech output.


token

A string that a speech recognizer can use to match spoken input.


training

The process of speaking a series of preselected phrases to a speech recognition engine. This provides the engine with information about the voice, speaking habits, and acoustic environment of a specific speaker. Training typically improves the accuracy of speech recognition.


transcription

A textual record of a spoken input. Transcriptions are commonly used to analyze the performance of a speech application and its grammars by matching what was said during speech input with what was recognized.


TTS

See text-to-speech (TTS).

TTS engine

See speech synthesis engine.


tuning

The process of refining a speech application to improve the accuracy of speech recognition or speech synthesis.


u-law

A standard analog signal-compression algorithm, used in digital communications systems of the North American digital hierarchy, to optimize the dynamic range of an analog signal prior to digitizing.


Unicode

A character encoding standard that extends ASCII and allows characters from virtually any written language to be represented in a text string. The Unicode character set contains a subset for International Phonetic Alphabet (IPA) symbols.

Uniform Resource Identifier (URI)

A character string used to identify a resource (such as a file) from anywhere on the Internet by type and location. The set of Uniform Resource Identifiers includes Uniform Resource Names (URNs) and Uniform Resource Locators (URLs).

Uniform Resource Locator (URL)

An address for a resource on the Internet, which specifies the protocol used to access the resource, the name of the server on which the resource resides, and (optionally) the path to a resource.

Universal Naming Convention (UNC)

A name used on Windows to access a drive or directory containing files shared across a network.

Universal Phone Set (UPS)

A machine-readable phone set that is based on the International Phonetic Alphabet (IPA).


UTF-8

The 8-bit Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes.
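
The one-to-four-byte range can be observed directly with Python's built-in UTF-8 encoder. The helper name utf8_length is hypothetical; the characters are written as escape sequences so that the example does not depend on the encoding of the source file.

```python
def utf8_length(character: str) -> int:
    """Return the number of bytes UTF-8 uses to serialize one character."""
    return len(character.encode("utf-8"))


samples = {
    "A": "LATIN CAPITAL LETTER A (U+0041)",
    "\u00e7": "LATIN SMALL LETTER C WITH CEDILLA (U+00E7)",
    "\u20ac": "EURO SIGN (U+20AC)",
    "\U0001F600": "GRINNING FACE (U+1F600)",
}

for character, name in samples.items():
    print(f"{name}: {utf8_length(character)} byte(s)")
```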


UTF-16

The 16-bit Unicode Transformation Format that serializes a Unicode scalar value as one or two 16-bit code units (two or four bytes), in either big-endian or little-endian byte order.


utterance

Anything heard by the engine as a finite series of sounds that the engine attempts to recognize as speech.


viseme

The basic visual unit of speech, which essentially corresponds to the position of the mouth and face when pronouncing a phoneme.


vocabulary

The set of words used in the grammars that a speech application uses. Words that are not in the vocabulary, called out-of-vocabulary (OOV) words, cannot be recognized by the speech application.


voice

A speech synthesis (TTS) engine that has specific characteristics, such as culture, age, and gender.

voice command

A word or phrase associated with a voice menu. When an engine recognizes a voice command, it notifies the application that owns the voice menu containing the command.

voice grammar

See speech grammar.

voice user interface (VUI)

A voice-controlled application on a computer, smartphone, game console, or other device or platform that can host applications.


volume

In speech synthesis, an attribute that specifies how loudly to speak the contents of a prosody element.


.wav

The file name extension for waveform audio files.

waveform audio file

A file format for storing audio on computers.


weight

Within a grammar, an attribute assigned to alternatives that biases the likelihood that an alternative will be chosen as a match for speech input.

As a property of a grammar, a rating that determines the degree of influence that a grammar will have on the ranking of recognition alternatives, relative to other active grammars.


wildcard

A component of a grammar that will match any spoken word.

word boundary

The beginning and ending of individual words, marked by the spacing or silence between words in a speech synthesis prompt.

World Wide Web Consortium (W3C)

The organization that sets standards for Web-based technologies, including the syntax for the XML-format technologies used by speech: SRGS, SSML, SISR, SML, and PLS.


.xml

The file name extension for an XML file.


XML Path Language (XPath)

A language that describes a way to locate and process items in XML documents by using an addressing syntax based on a path through the document's logical structure or hierarchy.

XPath expression

An expression that searches through an XML document and extracts information from the nodes (any part of the document, such as an element or attribute) in that document.
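
A minimal sketch of an XPath expression at work, using Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The XML document, its element and attribute names, and the function name english_alternates are hypothetical and do not represent the output format of any speech engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical recognition output used only to demonstrate the expression.
RESULT_XML = """
<result>
  <alternate confidence="0.91" lang="en-US">call home</alternate>
  <alternate confidence="0.62" lang="en-US">call Rome</alternate>
  <alternate confidence="0.40" lang="fr-FR">colle</alternate>
</result>
"""


def english_alternates(xml_text: str) -> list[tuple[str, float]]:
    """Select alternate elements whose lang attribute is en-US."""
    root = ET.fromstring(xml_text)
    # Path step plus attribute predicate: children named "alternate"
    # whose lang attribute equals "en-US".
    nodes = root.findall("./alternate[@lang='en-US']")
    return [(node.text or "", float(node.get("confidence"))) for node in nodes]


print(english_alternates(RESULT_XML))
```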