Microsoft Grammar Tools in IVR Development
An interactive voice response (IVR) application is a Web application that allows a caller to use voice and touch-tone inputs to obtain access to information or services over a telephone connection.
IVR applications are developed using VoiceXML, a W3C-standard voice browser markup language. They consist of a series of "dialogs" between the caller and the VoiceXML application. The caller’s response to one dialog, which is interpreted by an automated speech recognition (ASR) engine, usually leads to another dialog, and then another, and so on, until the caller has reached his or her objective. If the series of dialogs fails at some point, the VoiceXML application logic transfers the caller to a live operator.
The purpose of this paper is to describe how tools developed by Microsoft can be used in the development of IVR applications. The paper includes the following sections:
The importance of grammars in VoiceXML dialogs, which discusses the central role of "grammars" in VoiceXML development and describes Microsoft’s grammar tools.
VoiceXML dialogs, which describes the components of a VoiceXML application and how they fit together.
The VoiceXML development cycle, which describes the development process and shows where the Microsoft grammar tools can be used.
In each dialog, the application "prompts" the caller with a question. The caller responds with an answer. A key component of the VoiceXML application, called a "grammar," lists the possible responses that the dialog expects. The speech recognition engine compares what the caller said to the expected responses described in the grammar.
A typical VoiceXML application uses one or more grammars in every individual dialog. The syntax of grammars used with the Microsoft speech recognition engine must conform to the XML form of the W3C’s Speech Recognition Grammar Specification (SRGS) Version 1.0 standard (see http://www.w3.org/TR/2004/REC-speech-grammar-20040316/). Such grammars are called SRGS grammars.
|Non-SRGS grammars exist, but they cannot be used in VoiceXML applications that depend on the Microsoft speech recognition engine.|
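For reference, a minimal SRGS grammar in the XML form looks like the following. The rule name and the city names are illustrative only; the namespace and the version, mode, and root attributes follow the SRGS 1.0 standard:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A minimal SRGS 1.0 grammar in XML form. The root rule matches
     exactly one of the listed city names. -->
<grammar version="1.0" mode="voice" root="city"
         xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US">
    <rule id="city">
        <one-of>
            <item>New York</item>
            <item>Boston</item>
            <item>Chicago</item>
        </one-of>
    </rule>
</grammar>
```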
The grammars in a given dialog are compiled when the VoiceXML application is run and are made available to the speech recognition engine in their compiled form.
Accurate voice recognition is, of course, essential to a successful IVR VoiceXML application.
Modern speech recognition engines are very sophisticated and can be trained to recognize the general speech of particular individuals. With a large number of anonymous and varied callers, however, such a training approach would be extremely difficult and time consuming. In a VoiceXML application, grammars are used to severely limit the number of words that the speech recognition engine must consider, thereby presenting the speech recognition engine with a more tractable problem.
As an example, suppose a dialog expects a yes or no answer. The dialog’s grammar (every dialog has one or more grammars) informs the speech recognition engine to expect "yes" or "no," along with a number of alternate ways a caller might say them (yeah, yup, right, nah, nope, wrong, and so forth). As a result, the speech recognition engine does not have to determine which word in a vocabulary of tens of thousands was spoken—it simply has to decide which among a handful of words was spoken.
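Such a yes-or-no grammar might be sketched in SRGS as follows. The alternates shown are illustrative; a production grammar would normally also map each alternate to a canonical "yes" or "no" value:

```xml
<grammar version="1.0" mode="voice" root="yesNo"
         xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US">
    <rule id="yesNo">
        <one-of>
            <!-- Affirmative alternates -->
            <item>yes</item>
            <item>yeah</item>
            <item>yup</item>
            <item>right</item>
            <!-- Negative alternates -->
            <item>no</item>
            <item>nah</item>
            <item>nope</item>
            <item>wrong</item>
        </one-of>
    </rule>
</grammar>
```

Because the recognizer only has to choose among these eight alternates, its task is far simpler than open-vocabulary recognition.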
The development of a grammar, which is done by speech engineers, is a very sophisticated endeavor. Other aspects of VoiceXML development are much more straightforward.
In summary, grammars are the keys to successful voice recognition, and therefore are the keys to a successful IVR application.
Microsoft grammar tools
Microsoft has two command-line tools for Windows that a speech engineer can use when developing, testing, and improving grammars:
The Microsoft SRGS grammar validation tool
This is a command-line tool that checks a grammar for conformance to the SRGS standard and reports any errors it finds. The tool also checks for a group of conditions that are not errors but can cause performance problems. When such conditions are found, they are reported as warnings. Examples of such conditions include:
- Ambiguous branches.
- Cascading phrases.
- Normalization errors.
Errors represent things that would prevent the grammar from being loaded and used by the Microsoft speech recognition engine. Warnings represent things that may cause undesired runtime behavior—they do not prevent the grammar from being used but indicate issues that the speech engineer may want to evaluate.
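As an illustration of an ambiguous branch (a hypothetical example, not actual tool output), consider a rule in which the same phrase can match more than one alternate, so the recognizer cannot tell which branch produced the match:

```xml
<grammar version="1.0" mode="voice" root="account"
         xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US">
    <rule id="account">
        <one-of>
            <item>checking</item>
            <item>savings</item>
            <!-- Duplicate of the first branch: an utterance of
                 "checking" matches two branches, so the branches
                 are ambiguous and would be reported as a warning. -->
            <item>checking</item>
        </one-of>
    </rule>
</grammar>
```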
The Microsoft grammar tuning tool
This is a command-line tool that provides "tuning" information. Tuning is a speech engineering procedure for improving the performance of grammars by analyzing the speech recognizer’s response to caller utterances.
Utterances, or transcriptions of utterances, are passed to the tool together with a grammar, and a recognition result is returned.
The tool has two main components:
- An execution module.
- An analysis module.
Inputs to the grammar tuning tool are files in EMMA format (EMMA: Extensible MultiModal Annotation Markup Language Standard, see http://www.w3.org/TR/2009/REC-emma-20090210/). These files can include four types of data:
- Text data—a set of phrases that are built by hand and represent expected caller utterances, to be used in lieu of real-world data.
- Audio data—laboratory created files of expected caller utterances, to be used in lieu of real-world data.
- Audio data—recordings of the utterances of real-world callers.
- Text data—transcriptions of the recorded utterances of real-world callers.
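A transcription input file might be sketched in EMMA roughly as follows. The element and attribute names follow the EMMA 1.0 standard; the identifier and the token string are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<emma:emma version="1.0"
           xmlns:emma="http://www.w3.org/2003/04/emma">
    <!-- One transcribed caller utterance. -->
    <emma:interpretation id="utt1"
                         emma:medium="acoustic"
                         emma:mode="voice"
                         emma:tokens="yes I want to fly to New York"/>
</emma:emma>
```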
The grammar tuning tool’s execution module processes the EMMA test files to produce speech recognition results. The analysis module then compares the speech recognition results with the input utterances and produces a report.
The grammar tuning tool can be used at many different points in the VoiceXML development process. These are some of its uses:
- Tuning a grammar to refine confidence levels.
- Checking that the grammar recognizes a given phrase.
- Comparing the performance of two grammars.
VoiceXML applications are composed of a number of dialogs, short conversations between a caller and the application.
In each dialog, the application "prompts" the caller with a question. The caller responds with an answer. A speech recognition engine compares what the caller said to the expected responses described in the dialog’s grammar. If what the caller said is interpreted as being one of the expected responses, there is said to be a "match." Another section of the dialog, labeled "filled," contains logic that determines what happens when a match occurs.
Here is the basic form of a dialog in VoiceXML code:
<field name="myDialog">
    <prompt>
        <!-- Speech to be delivered to the caller in the form of a question. -->
    </prompt>
    <grammar>
        <!-- Words or phrases expected in the caller's response. -->
    </grammar>
    <filled>
        <!-- Logic that determines what to do when a grammar match is found. -->
    </filled>
</field>
In a VoiceXML application, each <field> element contains a single dialog between the caller and the application. The complete application is composed of a number of such dialogs, linked together in a "call flow."
The principal components of a dialog are the <prompt>, <grammar>, and <filled> elements, each of which is a child of the <field> element:
The <prompt> element
This element contains the audio that is delivered to the caller by the VoiceXML application. It may be a pre-recorded audio file or text-to-speech (TTS).
|There may be more than one <prompt> element per dialog.|
The <grammar> element
This element contains a list of acceptable caller responses that the application expects and can understand. A grammar may be embedded as code in the application or may simply be a pointer to an external file. It assists the speech recognition engine in interpreting the caller’s response to the prompts.
|There may be more than one <grammar> element per dialog.|
The <filled> element
This element contains instructions for what to do if the speech recognition engine finds a match between the caller’s response and the grammar’s choices. Possible actions include transitioning to another dialog in the call flow, obtaining information from a database to relay to the caller, booking a ticket, charging a credit card, and many others.
|There can only be one <filled> element per dialog.|
A working example
Here is an example of a small but complete VoiceXML application that includes a single dialog. The caller is prompted (using TTS) for an answer that should be "yes" or "no." The caller's response, if it matches one of the grammar choices, is placed in the yesOrNo variable. In the <filled> section, the VoiceXML application handles the return from the grammar match. In this case, the application just tells the caller what their answer was, using the value of the yesOrNo variable.
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
    <form id="mainDialog">
        <field name="yesOrNo">
            <prompt>
                You want to fly to New York. Is that correct?
            </prompt>
            <grammar version="1.0" mode="voice" root="top">
                <rule id="top">
                    <one-of>
                        <item>yes</item>
                        <item>no</item>
                    </one-of>
                </rule>
            </grammar>
            <filled>
                <prompt> thank you </prompt>
                <prompt> you said <value expr="yesOrNo"/> </prompt>
            </filled>
        </field>
    </form>
</vxml>
There are roughly seven stages in a speech application development project. The stages are not all completely independent of one another and may be done in parallel. The stages are:
- Planning
- Voice user interface (VUI) design
- Grammar development
- Application development
- Quality assurance testing
- Deployment
- Tuning
Development of a speech application begins with planning, as do all projects. The goals of the planning process are to identify project personnel and determine schedules and budgets. Personnel for a speech development project can include:
A Project Manager who coordinates the project and interfaces between the client and the project team.
An Integration Architect who oversees the technical specifications and implementation of the project, including the interface to the client’s back-end system.
A Voice User Interface (VUI) Designer who designs the VUI and tests it for usability.
A Speech Engineer who develops, analyzes, and optimizes grammars, thereby optimizing the overall accuracy of speech recognition.
An Audio Engineer who records and edits the prompts and sounds used in the application.
A VoiceXML Application Developer who translates the specifications and UI design into VoiceXML code and ensures that it runs in the production environment.
A Quality Assurance (QA) Engineer who develops a test plan and implements it.
During the planning stage:
- The telephony infrastructure needs to be set up: telephone numbers need to be obtained for development, QA, and production.
- Voice and data connectivity needs to be established to a speech recognition engine.
- Servers need to be allocated and a file structure for audio and grammar files should be established.
- The use of vendors (for example, an audio recording service) must be scheduled carefully so that they do not gate the development process.
- An error handling strategy must be developed. Under what circumstances does a caller get transferred to a live operator?
Good voice UI design is critical for the success of an IVR application. In this stage, the flow of the application is determined, prototyped, and tested for usability.
VUI design includes determination of the number of discrete dialogs, call flow logic, specification of audio prompts, and speech recognition requirements.
A good VUI design keeps menus short, avoids extraneous information, and uses prompts that are concise and easy to understand.
Since the application code may not yet exist at this stage, usability testing can be conducted through mock interactions over a telephone connection. The caller interacts with a person reading from scripts, rather than with an application.
This stage is not independent of the VUI design stage. The proposed dialogs between the application and the caller prompt the caller for a series of responses. The grammars enumerate the expected caller responses, to assist the speech recognition engine in identifying what the caller said.
The VUI design must be such that the grammars are simple and work with a variety of caller attributes (age, gender, type of phone used, accent, and so forth). Grammars can be tested with very simple VoiceXML applications (see Testing SRGS 1.0 Grammars With Tellme Studio), so that the grammars can be tested during the VUI design stage. When speech recognition using the grammars proves to be troublesome, it may be necessary to redesign the VUI by changing the prompts.
Both of the Microsoft grammar tools can be used repeatedly in this stage as the grammars are built and refined.
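A grammar under development can be exercised with a one-field wrapper application of roughly this form. The grammar file name is illustrative; the grammar is referenced as an external file so it can be swapped out between test runs:

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
    <form id="grammarTest">
        <field name="testResult">
            <prompt> Say a test phrase now. </prompt>
            <!-- Reference the grammar under test as an external file. -->
            <grammar src="cityGrammar.grxml" type="application/srgs+xml"/>
            <filled>
                <!-- Echo the recognized result back to the tester. -->
                <prompt> I heard <value expr="testResult"/> </prompt>
            </filled>
        </field>
    </form>
</vxml>
```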
In this stage, the VoiceXML code for the IVR application is developed, based on the completed VUI design and grammars. The grammars and audio are integrated into the code, and logging commands are added to provide debugging and testing information in the form of logs.
The QA engineer develops a test plan, including test cases, during this stage.
The quality assurance team tests the application following the test plan developed in the preceding stage.
Errors reported by the QA team are fixed by the application developer and then retested by the QA group, until no critical errors remain.
The Microsoft grammar tuning tool can be used in this stage to help optimize the grammars.
The application is deployed and made available to customers.
When several thousand call logs have been obtained, the application’s interpretation of the responses spoken by the numerous callers is analyzed.
In this stage, the application’s performance is tuned by analyzing call logs, call statistics, and whole call recordings to evaluate the caller experience. The speech engineer and the VUI designer analyze the data and prepare a tuning report. This report contains their recommendations for changes to the application that will optimize performance. These changes can include modifying audio prompts, editing grammars, and adjusting speech recognition settings.
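As an example of adjusting speech recognition settings, VoiceXML defines a confidencelevel property that sets the confidence threshold below which a recognition result is rejected. The value shown here is illustrative:

```xml
<!-- Raise the rejection threshold: recognition results whose
     confidence score falls below 0.6 are treated as no-match. -->
<property name="confidencelevel" value="0.6"/>
```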
The Microsoft grammar tuning tool can be used in this stage to help optimize the grammars.