{ End Bracket }

Building Voice User Interfaces

Alex Acero

The GUI is the interface of choice in scenarios where the user has a large keyboard, mouse, and display. But most cell phones have just a keypad and a small display. Drivers want to get directions without taking their eyes off the road or their hands off the wheel. And in many telephone calls we start by interacting with a machine before we reach a human. Speech recognition is a technology that can enable users to access information more efficiently in these scenarios.

GUIs are deterministic (the same action always produces the same result), whereas voice user interfaces (VUIs) are not (a user can say the same word twice and the system may come up with different answers). Although speech recognition error rates decrease every year thanks to advances in the technology, they will never be zero. Humans carry on conversations even when they don't understand everything because they handle ambiguity gracefully. Likewise, an effective VUI must handle speech recognition errors gracefully. We have been making progress in this area over the last few years.

Voice applications in telephony systems are typically represented as a state machine that takes the caller from one state to another. A banking application could play a WAV file to the user (InitialPrompt="Please say account balance, make a payment, or transaction history") and go to one of three different states depending on the caller's response. The speech recognition engine selects the word sequence $W_i$ that maximizes the posterior probability:

$$\hat{W} = \arg\max_i P(W_i \mid A) = \arg\max_i P(A \mid W_i)\, P(W_i)$$

$A$ represents the acoustic waveform, $P(A \mid W_i)$ is the so-called acoustic model, and $P(W_i)$ is the language model. Acoustic models, estimated with a technique called Hidden Markov Models, are built into the engine. Language models are represented with a finite state grammar (FSG) and must be provided by the developer. Unless prior information is available, developers would set all three choices to be equally likely and write the grammar in the W3C SRGS format:

<one-of>
    <item weight="0.33">Account balance</item>
    <item weight="0.33">Make a payment</item>
    <item weight="0.33">Transaction History</item>
</one-of>
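
To make the maximization concrete, here is a minimal Python sketch, not the engine's actual implementation, that combines hypothetical acoustic log-likelihoods with the uniform grammar weights above and picks the most probable phrase:

import math

# Hypothetical acoustic log-likelihoods log P(A | W_i) for one utterance;
# a real engine derives these scores from its Hidden Markov Model
# acoustic models.
acoustic_scores = {
    "account balance": -12.1,
    "make a payment": -14.8,
    "transaction history": -15.3,
}

# Uniform language model P(W_i) matching the equal grammar weights.
language_model = {phrase: 1 / 3 for phrase in acoustic_scores}

def recognize(acoustic, lm):
    """Return the phrase maximizing log P(A | W) + log P(W)."""
    return max(acoustic, key=lambda w: acoustic[w] + math.log(lm[w]))

print(recognize(acoustic_scores, language_model))  # -> account balance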

In a VUI, developers need to handle issues such as speech recognition errors, barge-in, and discoverability of the options available to callers, none of which are problems in a GUI. For example, if the engine's confidence score does not exceed a threshold, a voice application will request confirmation (ConfirmationPrompt="Do you want to hear your account balance?"). If the confidence score is lower still, often because the caller said something other than the three choices, the system reprompts the user (MumblePrompt="Just say one of the following choices:..."). If the caller does not respond, perhaps because she does not know what to say, the system tries again, possibly with a different prompt.
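
A minimal sketch of this two-threshold logic might look as follows; the threshold values, function name, and prompt strings here are illustrative assumptions, not the platform's actual API:

# Illustrative thresholds; tuned per application, not fixed by the platform.
ACCEPT_THRESHOLD = 0.90   # act on the result without confirming
CONFIRM_THRESHOLD = 0.45  # below this, treat the utterance as a mumble

def handle_result(result, confidence):
    """Map one recognition result to the next dialog action (hypothetical)."""
    if result is None:
        # Silence: the caller may not know what to say, so try again
        # with a prompt that lists the choices.
        return "Reprompt: Please say account balance, make a payment, ..."
    if confidence >= ACCEPT_THRESHOLD:
        return f"Transition to state: {result}"
    if confidence >= CONFIRM_THRESHOLD:
        # ConfirmationPrompt, as in the example above.
        return f"Confirm: Do you want to hear your {result}?"
    # MumblePrompt: the caller likely said something outside the grammar.
    return "Reprompt: Just say one of the following choices: ..."

print(handle_result("account balance", 0.62))
# -> Confirm: Do you want to hear your account balance?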

To simplify authoring, a graphical dialog designer (see Figure) allows developers to choose from a palette of high-level controls that encapsulate best practices of VUI development, such as handling mumbles and confirmations, without the need for custom code. Such controls include QA, Statement, Get & Confirm, Menu, Recorded Message, and Navigable List. Each control has properties specifying all the necessary prompts and recognition branches.
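
A rough sketch of the shape of such a control, using hypothetical class and property names that mirror the prompts above rather than the designer's actual object model:

from dataclasses import dataclass, field

@dataclass
class QAControl:
    """Hypothetical stand-in for a designer QA control: one question plus
    its error-handling prompts and recognition branches."""
    initial_prompt: str
    confirmation_prompt: str
    mumble_prompt: str
    # Maps each recognizable phrase to the dialog state it leads to.
    branches: dict = field(default_factory=dict)

main_menu = QAControl(
    initial_prompt="Please say account balance, make a payment, "
                   "or transaction history",
    confirmation_prompt="Do you want to hear your account balance?",
    mumble_prompt="Just say one of the following choices: ...",
    branches={
        "account balance": "AccountBalance",
        "make a payment": "MakePayment",
        "transaction history": "TransactionHistory",
    },
)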

Unfortunately, callers' answers are not always covered by simple grammars like the one in the previous example. Since manually building robust FSGs is difficult, we have developed technology that can automatically generate the grammar by adapting the system's probabilistic dictation grammar with example sentences provided by the developer (balance, account balance, and checking account balance, for instance, all map to the state <account_balance>). The compiled probabilistic FSG is complex but covers most sentences a caller might say. Since the system might recognize "I want to chat my balance" when the caller actually said "I want to check my balance," a statistical classifier can optionally be built to determine the caller's intention. In this example, the word balance is strong evidence that the caller wants the account balance.
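
The classification step can be illustrated with a toy word-evidence model, a naive Bayes-style stand-in for the actual statistical classifier, trained on the developer's example sentences:

import math
from collections import Counter

# Developer-provided example sentences mapped to states, as described above;
# the make_payment examples are invented for illustration.
examples = {
    "account_balance": ["balance", "account balance",
                        "checking account balance"],
    "make_payment": ["payment", "make a payment", "pay my bill"],
}

# Per-intent word counts; the product's actual classifier is more
# sophisticated than this toy model.
counts = {intent: Counter(w for s in sents for w in s.split())
          for intent, sents in examples.items()}
vocab = {w for c in counts.values() for w in c}

def classify(utterance):
    """Return the intent whose example words best explain the text."""
    words = utterance.lower().split()
    def score(intent):
        c, total = counts[intent], sum(counts[intent].values())
        # Add-one smoothing: misrecognized words such as "chat" reduce
        # but do not eliminate an intent's score.
        return sum(math.log((c[w] + 1) / (total + len(vocab)))
                   for w in words)
    return max(counts, key=score)

# "balance" is strong evidence for account_balance, even though the
# engine heard "chat" instead of "check".
print(classify("I want to chat my balance"))  # -> account_balance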

Our goal is to create tools that simplify the development of voice applications much like Visual Basic® simplified the development of GUI applications. More information is available at Speech Technology.

Alex Acero is a Research Area Manager at Microsoft Research. He directs speech recognition activities and has contributed to technology shipped in Microsoft speech products.