Grammars: Purpose and Structure
Grammars are structures that contain single words, complex phrases, or lists of words or complex phrases. These grammar structures use Extensible Markup Language (XML) elements and plain text to match human patterns of speech. Use grammar structures in command-and-control scenarios, in which the user speaks orders, commands, responses, or requests to a speech application.
Grammars form the guidelines that the application must use to recognize orders that a user might issue to it. A grammar contains an ordered list of words or phrases that the application uses to recognize user speech. Unless the words or phrases are defined in the grammar structure, the application cannot recognize the user's speech commands.
A very simple application can limit spoken commands to single words, such as "open" or "print." In this case, a grammar is not much more than a list of words. However, many applications require more complex commands or sentences. Users expect to speak to computers in normal, natural-sounding sentences. For example, a ticket ordering application must accept "I want to order two tickets for the 10 P.M. show." This application must also recognize and respond to variations of the same phrase: "I want to buy," "I'd like to buy," or even the more impolite "gimme two tickets."
Voice commands require flexibility in accepting statements, but at the same time, grammars must impose limits on the application. For example, although the statement "my mother is sick," might imply an urgent need to buy an airline ticket, it is unreasonable for the ticket ordering application to process it as a purchase request.
See the following sections to learn more about grammars:
Purpose of Grammars
A grammar does the following:
- Limits Vocabulary—The grammar contains only the exact words or phrases an application needs to match for a successful user response recognition. An application
might need to recognize only a few words that appear in a grammar structure, so the speech recognition (SR) engine does not need to
search the entire dictionary. Explicitly providing words in a grammar also improves
recognition accuracy, because the SR engine must process speech only to the
extent of confirming a match.
Grammars are often referred to as context-free grammars (CFG). The words or phrases do not need a surrounding context to assist recognition. Providing context helps, but is not required. The SR engine is less likely to recognize a nonsensical command such as "horn swaggle" than the command "open the file." A good grammar user interface allows for common or naturally spoken commands.
- Filters Response Recognition—The SR engine processes all audio signals it receives, regardless of what is contained in the grammars. The engine determines what the word is, and then matches the word or phrase with the word or phrase defined in the grammar. The advantage of a grammar is that the SR engine returns a successful recognition event only if the grammar is matched. The grammar filters the results to the applications. Otherwise, the application would receive many additional recognition results, few of which have meaning to the application.
- Matches Speech—The grammar matches the speech input for a particular application. Although grammar structures need to be flexible and accommodate a multitude of phrases and phrasing, grammar structures also need to restrict the user's speech to a specific situation or task. Each application has its own natural language. A coffee ordering system, for example, concentrates on language used to order coffee, not language used to order airline tickets. Developers need to tailor grammar structures to serve the application's specific purpose or objective, because the application can act only on statements that the grammar defines.
- Identifies Rules—Grammar structures use rules or entities to define and order the component words of potential user utterances. Rules defining commonly used utterances can be referenced repeatedly by other rules within the containing grammar, or by rules contained within other grammars.
Another type of grammar structure is a grammar library, which is a predefined grammar file that contains a number of simple rules, complex sets of interrelated rules, or a combination of both that applications can use to recognize specific types of information. For example, the Microsoft Speech Application SDK (SASDK) grammar library contains a Date ruleset that developers can use when implementing a speech application that requires recognizing calendar dates spoken by a user.
As previously noted, a grammar can be composed of many rules. A voice interface for an application generally contains one rule for each menu, menu item, or dialog box that is accessed directly using speech commands or responses. The combination of those rules forms the grammar. However, a statement can match only one rule at a time. Each rule is given an ID. When a successful recognition occurs, the SR engine processes the rule ID as part of the recognition result. The SR engine uses the SemanticItem or listen values defined using Speech Control Editor to process rule IDs and pass this information back to the speech application. For example, the command "open a file" matches only one rule, presumably the "file open" command rule. If the application must sort the results, for example with a series of case or switch statements, it can match on the rule ID instead of on each spoken word. Although the application can match each spoken word, it most likely sorts the results using the rule ID.
Grammars are tools for content identification. For example, a customer can say any of the following: I would like a coffee, I'd like coffee, get me a coffee, or coffee please. In all four cases, the phrase is different, but the intent is the same: the customer wants coffee. Grammars can define all combinations of this intent in a single rule. The rule is identified by a unique name. It makes no difference which phrase in the rule is actually spoken. If the spoken phrase is defined within that rule, the rule is considered successfully matched by the application. The SR engine returns the recognition back to the application with a single rule name. The application uses that name for processing the coffee order. Instead of requiring the application to detect all words in each variation of the phrase, the SR engine and the grammar determine that ahead of time and return only what the application expects: the rule name. See Designing Grammar Rules for more information about implementing rules in grammar structures.
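The coffee order described above can be sketched as a single rule. This is a minimal illustration rather than a definitive grammar: the rule name ruleOrderCoffee is a hypothetical name chosen for this sketch, and it uses the one-of and item elements that are described later in this topic.

```xml
<grammar root="ruleOrderCoffee" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <!-- ruleOrderCoffee is a hypothetical rule name used only for this sketch -->
  <rule id="ruleOrderCoffee" scope="public">
    <!-- Any one of these phrasings matches the same rule -->
    <one-of>
      <item>I would like a coffee</item>
      <item>I'd like coffee</item>
      <item>get me a coffee</item>
      <item>coffee please</item>
    </one-of>
  </rule>
</grammar>
```

Whichever of the four phrases the user speaks, the match is reported against the same rule, so the application does not need to inspect the individual words.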
- Provides Semantic Markup Language—Grammars provide the basis of the Semantic Markup Language (SML). SML is used inside the recognition results and allows the application to identify and parse the returned text. An SML output is an XML-formatted output that contains the grammar element, SML. The grammar element SML can have zero, one, or more child elements, depending on whether the input grammar contains markup for semantic interpretation. Script expressions contained in tag elements generate semantic values for items and referenced rules contained in a parent rule.
For every utterance that activates a grammar, the SML output always contains the recognized text and a confidence score for the full utterance. However, using semantic interpretation can increase the granularity of the SML output to obtain confidence scores and semantic values at the rule level. For more information, see SML Output Overview.
Grammars are based on the W3C Speech Recognition Grammar Specification Version 1.0 format (W3C SRG specification), which defines the structure of grammars and grammar rules using XML markup. The grammar compiler transforms the XML elements that define grammar elements into a binary format used by SR engines. This compiling process is performed either before or during application run time. For specific information about XML, see the Extensible Markup Language (XML) specification.
XML provides a flexible structure for describing the list of words or phrases defined in grammars. XML allows developers to use attributes, elements, and plain text to further identify and define text elements, which makes the grammar file easy to maintain and organize. Text elements identify a grammar, and developers can organize the text elements into lists, strings, and numbers. Organizing the text element structures makes them reusable among other grammars.
The basic unit of a grammar is the rule. Grammars must contain at least one rule. A rule defines a pattern or sequence of words or phrases. If the user's statement matches that pattern, the rule is matched by the application. A rule is defined by the contents of a rule element. A rule element can contain other elements including references to other rule elements. The following grammar defines a single rule using a rule element that contains a single item element.
<grammar root="ruleColors" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <rule id="ruleColors" scope="public">
    <item>red</item>
  </rule>
</grammar>
In the previous example, the rule element (identified as ruleColors) contains a single item element that contains the text "red." If the user says red, the SR engine matches the utterance to the grammar, and returns a successful recognition to the application. Any other utterance spoken by the user does not match the grammar and returns a false recognition.
A rule must contain at least one element. An element represents an utterance made by the user. Sequencing elements allows grammar designers to create the patterns needed for the command. This sequence can be as simple as the previous example, ruleColors, or as complex as the Solitaire card game that the Grammar Example: Solitaire demonstrates.
Developers sequence elements such as item elements, variations of item elements, and references to other rules (including those from other grammars) in a particular order, so that grammars can offer rich selections and possibilities of word combinations. For more information about elements, see Grammar XML.
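A reference to a rule in another grammar might look like the following sketch. The file name colors.grxml and the rule name rulePaint are hypothetical; the sketch assumes that colors.grxml is a separate grammar file that exposes a public rule named ruleColors. The ruleref element itself is described later in this topic.

```xml
<grammar root="rulePaint" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <!-- rulePaint is a hypothetical rule name used only for this sketch -->
  <rule id="rulePaint" scope="public">
    <item>paint it</item>
    <!-- Assumption: colors.grxml is a separate grammar file with a public rule ruleColors -->
    <ruleref uri="colors.grxml#ruleColors"/>
  </rule>
</grammar>
```

The part of the URI before the number sign identifies the grammar file, and the part after it identifies the rule within that file. A reference to a rule in the same grammar uses only the number sign and the rule name.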
The following information describes some commonly used grammar elements:
- item—Contains any legal rule expansion. A legal rule expansion can consist of a word or other entity that can be spoken, a ruleref element,
a tag element, or any logical combination of these. In the previous example, the item element contains a single
rule expansion consisting of the single word "red".
When an item element contains a combination of rule expansions (for example, a combination of words), the sequence in which the contents of the item element are listed must match the sequence in which the content occurs in the input for recognition to be successful. For example, in the following grammar, the input must contain the phrase "metallic red" for recognition to be successful:
<grammar root="ruleColors" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <rule id="ruleColors" scope="public">
    <item>metallic red</item>
  </rule>
</grammar>
- one-of—Contains a set of alternative legal rule expansions and increases the flexibility of the grammar
by requiring that the input matches only one of the alternatives. For example, in the following grammar, for recognition to be
successful, the input must contain the initial phrase "I would like the car in." But the full input may be completed by any of
the three color words, "red," "white," or "green."
<grammar root="ruleColors" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <rule id="ruleColors" scope="public">
    <item>I would like the car in</item>
    <one-of>
      <item>red</item>
      <item>white</item>
      <item>green</item>
    </one-of>
  </rule>
</grammar>
- ruleref—Specifies a pointer to another rule with one or many elements that also requires recognition for a successful validation or recognition of the current rule.
Rules are referenced in a grammar using ruleref elements. A ruleref element can also reference one of three special rules, NULL, VOID, and GARBAGE, which are named in its special attribute instead of in a uri attribute. NULL defines a rule that is automatically matched without the user speaking. VOID defines a rule that can never be matched. GARBAGE defines a rule that is matched until the next rule is matched or until the end of spoken input.
The following example defines a rule element identified as ruleColors for a color selection. Another rule, buyShirt, then uses the ruleref element to reference ruleColors twice.
<grammar root="buyShirt" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <!-- buyShirt is the root rule, so the full phrase is the grammar's entry point -->
  <rule id="buyShirt" scope="public">
    <item>Get me a <ruleref uri="#ruleColors"/> shirt and a <ruleref uri="#ruleColors"/> tie</item>
  </rule>
  <!-- ruleColors is defined once and referenced twice by buyShirt -->
  <rule id="ruleColors" scope="public">
    <one-of>
      <item>red</item>
      <item>white</item>
      <item>green</item>
    </one-of>
  </rule>
</grammar>
The customer requests a color item twice, but the grammar only needs to define ruleColors once.
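The special rules NULL, VOID, and GARBAGE mentioned earlier are referenced through the special attribute that the W3C SRG specification defines for the ruleref element. The following sketch shows GARBAGE only. The rule name ruleCarColor is hypothetical, and how an SR engine handles GARBAGE can vary, so treat the matching behavior described in the comments as an assumption to verify against the engine in use.

```xml
<grammar root="ruleCarColor" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics-ms/1.0">
  <!-- ruleCarColor is a hypothetical rule name used only for this sketch -->
  <rule id="ruleCarColor" scope="public">
    <item>I would like</item>
    <!-- GARBAGE is intended to absorb speech here (for example, "the car in")
         until the color rule that follows can be matched -->
    <ruleref special="GARBAGE"/>
    <ruleref uri="#ruleColors"/>
  </rule>
  <rule id="ruleColors" scope="private">
    <one-of>
      <item>red</item>
      <item>white</item>
      <item>green</item>
    </one-of>
  </rule>
</grammar>
```

Replacing GARBAGE with NULL would make that position match without any speech, and replacing it with VOID would make the containing rule impossible to match.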