ACM (Audio Compression Module)
Code typically used by an engine that converts PCM data to a different format.
active voice menu
A set of voice commands that can be recognized.
Storing copies of programs and data to guard against loss.
asleep state
The state in which an application listens to each sound but responds only to commands on the sleep menu. See also awake state.
A device such as an audio speaker or the telephone over which text is played as speech. An audio-destination object is an OLE COM object that supports audio communication interfaces in common with a text-to-speech engine.
An electrical signal with varying voltage that becomes sound when amplified and converted to vibrations played by an audio speaker.
A device such as a microphone or telephone that provides audio data for speech recognition. An audio-source object is an OLE COM object that supports audio communication interfaces in common with a speech recognition engine.
awake state
The state in which an application recognizes and executes commands on active voice menus. See also asleep state.
A marker embedded in an audio recording that can be used to locate and play back an audio segment.
The number of milliseconds that the engine waits before regarding a phrase as complete after the user has stopped speaking.
An object defined according to the OLE Component Object Model (COM). A component object has a set of interfaces that communicate with the object, data associated with an instance of the object at run-time, and the ability to support multiple instances of the object running at the same time.
COM (Component Object Model)
See OLE Component Object Model.
Uses rules that predict the words that might follow the word just spoken, reducing the number of candidates that need to be evaluated to recognize the next word.
A continuous utterance without pauses between words. Some speech recognition engines can recognize continuous speech.
A reduction in quality or performance of a communications channel.
The gradual loss of data stored by a speech recognition results object. The information in a results object can occupy a significant amount of memory, so an engine developer may permit the object to discard data automatically as time passes.
Defines a context for the speaker by identifying the subject of the dictation, the expected style of language, and what dictation has already been done.
Audio format controlled by binary or numeric data.
Continuous audio data received from or sent to an audio device.
Digital Signal Processor (DSP)
A microprocessor tailored to a particular type of operation. Applications involving communications, compression, and audio run more efficiently on a DSP than on the host computer.
A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. For example, the word "hello" consists of these diphones: silence-h h-eh eh-l l-oe oe-silence.
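The "hello" example above can be reproduced with a short sketch. It assumes the phoneme sequence is already known and that silence pads both ends of the utterance; the function name `diphones` and the phoneme labels are illustrative, not part of any engine's interface.

```python
def diphones(phonemes, boundary="silence"):
    """Return the diphone labels for a phoneme sequence.

    The sequence is padded with a boundary marker at each end, and every
    adjacent pair of units becomes one diphone label.
    """
    padded = [boundary, *phonemes, boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Phoneme labels follow the glossary's own example for "hello".
print(" ".join(diphones(["h", "eh", "l", "oe"])))
# prints: silence-h h-eh eh-l l-oe oe-silence
```

A sequence of n phonemes always yields n + 1 diphones under this padding scheme.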
The text-to-speech engine concatenates short digital-audio segments and performs intersegment smoothing to produce a continuous sound.
Every word must be isolated by a pause before and after it (usually about a quarter of a second) for the engine to recognize it.
DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.
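As a rough illustration of the two-tone scheme, the sketch below maps a key to its low (row) and high (column) frequency and synthesizes the combined waveform. The frequency values are the standard DTMF assignments; the function names, the 8,000 Hz sample rate, and the 0.1 second duration are assumptions chosen for the example.

```python
import math

LOW_HZ = (697, 770, 852, 941)        # one low tone per keypad row
HIGH_HZ = (1209, 1336, 1477, 1633)   # one high tone per keypad column
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) tone pair in hertz for one keypad character."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return LOW_HZ[row], HIGH_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key, duration_s=0.1, sample_rate=8000):
    """Return the summed two-tone waveform as floats in the range [-1, 1]."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * (math.sin(2 * math.pi * low * i / sample_rate)
               + math.sin(2 * math.pi * high * i / sample_rate))
        for i in range(count)
    ]


print(dtmf_frequencies("5"))  # prints: (770, 1336)
```

Each sine is scaled by 0.5 so that the sum of the two tones never exceeds full scale.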
A method of controlling echoing on communication lines, in which the sender checks the inbound channel for a slightly delayed duplicate of its own transmission. In echo canceling, the sender adds an appropriately modified, reversed version of its transmission to the path on which it receives information. The result is to erase the echo electronically but leave incoming data intact.
See noise floor.
A program that does the actual work of recognizing speech or translating text into speech. Most speech recognition engines convert incoming audio data to engine-specific phonemes, which are then translated into text for use by an application. A text-to-speech engine performs the same process, only in reverse. An engine object is an OLE COM object that represents a mode of a speech recognition or text-to-speech engine.
Enumerates the speech recognition or text-to-speech modes supported by a particular engine.
engine-specific phoneme character set
A character set that describes phonemes, pauses, and so on, and that is specific to a text-to-speech engine.
The rate of vibration or oscillation, measured in hertz (Hz). The normal human ear can detect sounds ranging from 20 Hz to 20,000 Hz.
The increase in signaling power, measured in decibels (dB), that occurs as the signal is boosted by an electronic device.
global voice menu
A voice menu that is active all of the time regardless of which window is in the foreground.
A set of words and phrases that can be recognized by an engine. A grammar object is an OLE COM object that an application uses to control how an engine uses the grammar to recognize speech.
Globally unique identifier used by an interface or object for identification.
The number of milliseconds that the speech recognition engine waits, after the user has stopped speaking, before discarding an incomplete phrase.
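The two timeouts defined in this glossary (the wait before a phrase is regarded as complete, and the wait before an incomplete phrase is discarded) can be read as a single decision made while the user is silent. The sketch below is only one plausible interpretation: the function name, the returned labels, and the default values of 500 and 2,000 milliseconds are assumptions, not values taken from any engine.

```python
def phrase_action(silence_ms, phrase_is_complete,
                  complete_timeout_ms=500, incomplete_timeout_ms=2000):
    """Decide what to do after `silence_ms` of silence.

    Returns "accept" when a complete phrase has been followed by enough
    silence, "discard" when an incomplete phrase has been abandoned for
    too long, and "wait" otherwise. Default timeouts are hypothetical.
    """
    if phrase_is_complete:
        return "accept" if silence_ms >= complete_timeout_ms else "wait"
    return "discard" if silence_ms >= incomplete_timeout_ms else "wait"


print(phrase_action(600, phrase_is_complete=True))    # prints: accept
print(phrase_action(600, phrase_is_complete=False))   # prints: wait
print(phrase_action(2500, phrase_is_complete=False))  # prints: discard
```

In this reading, the timeout for incomplete phrases is set longer than the one for complete phrases so that a speaker who hesitates in the middle of a command is not cut off.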
A set of semantically related functions that an application can call to perform the actions defined for that interface.
Noise or other external signals that affect the performance of a communications channel; also, the electromagnetic signals generated by electronic devices, such as computers, that can disturb radio or television reception.
IPA (International Phonetic Alphabet)
A standard system for indicating specific sounds, first introduced in 1886. The Unicode character set includes all single symbols and diacritics in the most recent revision of the IPA, which occurred in 1989, as well as a few IPA symbols no longer in use.
See pronunciation lexicon.
Provides a set of words to recognize without using strict syntax structures. A limited-domain grammar is a hybrid between a context-free grammar and a dictation grammar.
Adaptation of a software package to the language and conventions of a particular country or region.
If an instance uses a separate process space from that of the application that invokes it, its data must be marshaled across the process boundary. Each interface contains marshaling code that allows its parameters to be transmitted across process boundaries.
The methods by which the engine matches a detected word to known words in its vocabulary.
A word or phoneme on a recognition path in a recognition/alternative graph generated by an engine.
Any interference that affects the operation of a device. In communications, noise consists of random electronic signals, produced either naturally or by the circuitry, that degrade the quality or performance of a communications channel.
noise floor
The noise value in the signal-to-noise ratio (SNR) for an environment. In general, the higher the noise floor, the more sensitive the engine will be to background noise.
Similar to a callback function, except the sink is implemented as an interface with a set of functions rather than as a single function.
OLE Component Object Model (COM)
A specification that defines a binary standard for OLE object implementation independent of programming language.
PCM (pulse code modulation)
The most common method of encoding an analog voice signal into a digital bit stream. First, the amplitude of the voice signal is sampled at regular intervals. Then, each sample is coded into binary data, which can then be switched, transmitted, and stored digitally.
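The two steps of PCM, sampling and then coding, can be sketched as follows. This is a minimal illustration that assumes linear 16-bit signed samples at 8,000 Hz and an input signal given as a function of time with values in [-1, 1]; the name `pcm_encode` and the 440 Hz test tone are choices made for the example.

```python
import math
import struct


def pcm_encode(signal, duration_s, sample_rate=8000):
    """Sample `signal(t)` and quantize each sample to a 16-bit signed integer."""
    peak = 2 ** 15 - 1  # largest positive 16-bit signed value (32767)
    codes = []
    for i in range(round(duration_s * sample_rate)):
        amplitude = signal(i / sample_rate)          # step 1: sample the amplitude
        amplitude = max(-1.0, min(1.0, amplitude))   # clip to the valid range
        codes.append(round(amplitude * peak))        # step 2: code it as an integer
    return codes


def tone(t):
    """A 440 Hz sine wave standing in for the analog voice signal."""
    return math.sin(2 * math.pi * 440 * t)


codes = pcm_encode(tone, duration_s=0.01)
# Pack as little-endian 16-bit values, the byte layout used by many PCM files.
raw = struct.pack(f"<{len(codes)}h", *codes)
print(len(codes), len(raw))  # prints: 80 160
```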
The number of choices at a given node in a recognition path.
The smallest structural unit of sound in any language that can be used to distinguish one word from another.
An ordered list of words that are spoken in the same utterance.
The tone of a sound, which generally is determined by the sound's frequency. A high-pitched sound has a higher frequency; a low-pitched sound has a lower frequency.
pronunciation lexicon
A database of pronunciations maintained by a speech recognition or text-to-speech engine. An engine may allow an application to collect new or corrected pronunciations from the end-user.
pronunciation rule
A rule followed by a text-to-speech engine to convert text into phonemes.
The inflection, timing and accent of speech.
Each speech recognition engine supports one or more recognition modes, each of which conforms to a different code set or data set. For example, each language (or dialect) supported by the engine will have a different mode.
A sequence of words or phonemes that an engine analyzed while attempting to recognize an utterance.
recognition rule
A rule followed by a speech recognition engine using a context-free grammar to recognize speech.
A graph generated by a speech recognition engine that depicts the recognition paths explored by the engine in recognizing an utterance.
The number of levels of rules in a context-free grammar.
The database in which configuration information is stored. The database takes the place of most configuration and initialization files for Microsoft® Windows® and new Windows-based programs.
See speech recognition results object.
See pronunciation rule and recognition rule.
Microsoft Speech application programming interface. A set of routines, protocols, and tools that enable programmers to build speech-enabled applications for Microsoft Windows platforms.
SNR (signal-to-noise ratio)
The amount of power, measured in decibels (dB), by which a signal exceeds the amount of channel noise at the same point of transmission. It provides an indication of the clarity or accuracy with which communication can take place.
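Expressed as a formula, the ratio in decibels is ten times the base-10 logarithm of signal power divided by noise power. The sketch below assumes both powers are given in the same linear unit (for example, watts); the function name `snr_db` is illustrative.

```python
import math


def snr_db(signal_power, noise_power):
    """Return the signal-to-noise ratio in decibels for two linear power values."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("power values must be positive")
    return 10 * math.log10(signal_power / noise_power)


print(snr_db(100.0, 1.0))  # prints: 20.0
```

The same calculation, applied to output power over input power, gives the gain of a device in decibels.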
The end-user who utters the speech to be recognized by an application. Training performed by a speaker may be stored in a speaker profile.
The engine trains itself to recognize the user's voice while the user performs ordinary tasks.
The engine requires the user to train it to recognize his or her voice.
The engine does not require training. Speaker-independent engines typically start with an accuracy above 95 percent for most users (those who speak without accents).
All of the information the engine has about the speaker, such as a data header, languages for which training has been done, known patterns of speech and the language model, how specific words are pronounced, phonetic training, speaker ID, and speaker preferences.
The ability of a computer to understand the spoken word for the purpose of receiving command and data input from the speaker.
An OLE Component Object Model dynamic-link library (DLL) or executable file (.exe) that performs recognition from a digital-audio stream. Speech recognition engines are supplied by vendors who specialize in the software.
Enumerates the engines that are available to an application.
An engine typically provides an assortment of modes that can be used to recognize speech in different languages, dialects, and audio-sampling rates.
speech-recognition results object
Provides detailed information about a speech recognition event.
speech-recognition sharing object
Enumerates shared engine-audio source pairs, or creates new ones.
The engine looks for subwords—usually phonemes—and then performs further pattern recognition on those.
The text-to-speech engine synthesizes the glottal pulse from human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape and tongue position.
See text-to-speech control tags.
Microsoft Telephony application programming interface. A set of routines, protocols, and tools that enable programmers to build telephony applications for Microsoft Windows platforms.
Computer hardware and software that perform functions traditionally performed by telephone equipment, such as voice mail or fax services.
Technologies for converting textual (ASCII) information into synthetic speech output. Used in voice-processing applications requiring production of broad, unrelated, and unpredictable vocabularies, such as products in a catalog or names and addresses. This technology is appropriate when system design constraints prevent the more efficient use of speech concatenation alone.
text-to-speech control tags
Instructions that can be embedded in text sent to a text-to-speech engine to improve the prosody of the spoken text.
An OLE Component Object Model dynamic-link library (DLL) or executable file (.exe) that provides functionality for converting text to digital-audio speech. Text-to-speech engines are supplied by vendors who specialize in the software.
Enumerates the text-to-speech modes provided by all of the engines available to the application.
Analogous to voice quality or personality. Every text-to-speech mode is different, and each allows for different properties such as timbre, accent, language and digital-audio sampling rate.
The point below which an utterance is rejected as unrecognized.
The process of speaking a series of pre-selected phrases for the engine. This provides the engine with more information about the voice of the speaker and can improve speech recognition.
A character set, represented on Windows as 16-bit units, that supersedes ASCII and allows characters from virtually any language to be represented in a text string. The Unicode character set contains a subset for International Phonetic Alphabet (IPA) phonemes.
Anything heard by the engine as a finite series of sounds that the engine attempts to recognize as speech.
A set of words used in a grammar. A speech recognition engine typically supports several different sizes of vocabulary, which determine the words that the engine can recognize in a given state.
A word or phrase associated with a voice menu. When an engine recognizes a voice command, it notifies the application that owns the voice menu containing the command.
Voice Command site
A speech recognition mode and audio source that together serve as a source of Voice Command input.
A list of voice commands to which an application can respond. A voice menu must be active before an engine can recognize its commands.
A text-to-speech mode and an audio destination that together serve as a destination for Voice Text output.
VU (Volume Units) Meter
An indicator that displays the volume of sound being received by the microphone or through the line-in port. Optimum reception is achieved when the meter registers in the middle area.
The engine compares the incoming digital-audio signal against a prerecorded template of the word.
An atomic Unicode text string. A "word" can have several related vernacular words (such as "Los Angeles") within it because the vernacular words are always used together.
The degree of isolation between words required for the engine to recognize a word.
A series of words may be spoken in a continuous utterance, but the engine recognizes only one word or phrase.