{ End Bracket }
Building Voice User Interfaces
Alex Acero


The GUI is the interface of choice for scenarios where the user has a large keyboard, mouse, and display. But most cell phones have just a keypad and a small display. Drivers want to get directions without taking their eyes off the road or their hands off the wheel. And we make many telephone calls in which we start interacting with a machine before we reach a human. Speech recognition is a technology that can enable users to access information more efficiently in those scenarios.
GUIs are deterministic (the same action will always produce the same result), whereas voice user interfaces (VUIs) are not (the user can say the same word twice and the system can come up with different answers). Although speech recognition error rates decrease every year due to advances in the technology, they will never be zero. Humans carry on conversations even when they don't understand everything, because they handle ambiguity gracefully. Likewise, an effective VUI must handle speech recognition errors gracefully. We've been making progress in this area over the last few years.
Voice applications in telephony systems are typically represented as a state machine that takes the caller from one state to another. A banking application could play a WAV file to the user (InitialPrompt="Please say account balance, make a payment, or transaction history") and go to one of three different states depending on the caller's response. The speech recognition engine selects the choice of words W_i that maximizes the posterior probability:
$$\hat{W} = \arg\max_{W_i} P(W_i \mid A) = \arg\max_{W_i} \frac{P(A \mid W_i)\, P(W_i)}{P(A)}$$
Here A represents the acoustic waveform, P(A | W_i) is the so-called acoustic model, and P(W_i) is the language model. Acoustic models, which are based on Hidden Markov Models, are built in. Language models are represented with a finite state grammar (FSG) and need to be provided by the developer. Unless prior information is available, developers would set all three choices to be equally likely, and write the grammar in the W3C format:
<one-of>
    <item weight="0.33">Account balance</item>
    <item weight="0.33">Make a payment</item>
    <item weight="0.33">Transaction History</item>
</one-of>
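To make the selection rule concrete, here is a minimal Python sketch of picking the choice with the highest posterior probability by combining an acoustic score with a language-model prior. It is an illustration only, not the engine's implementation: the function name pick_choice and the acoustic log-likelihood values are hypothetical, and a real engine computes acoustic scores internally from the waveform.

```python
import math

# Language-model priors P(W), mirroring the equal weights in the grammar above.
PRIORS = {
    "account balance": 1 / 3,
    "make a payment": 1 / 3,
    "transaction history": 1 / 3,
}


def pick_choice(acoustic_log_likelihoods, priors=PRIORS):
    """Return (best_choice, posteriors) using Bayes' rule.

    acoustic_log_likelihoods maps each choice W to log P(A | W).
    These numbers are assumed inputs for the sketch.
    """
    # log P(A | W) + log P(W) for every choice in the grammar
    joint = {w: acoustic_log_likelihoods[w] + math.log(p) for w, p in priors.items()}
    # log P(A), computed with the log-sum-exp trick for numerical stability
    top = max(joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    posteriors = {w: math.exp(v - log_evidence) for w, v in joint.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical acoustic scores for one caller utterance
scores = {
    "account balance": -120.0,
    "make a payment": -131.5,
    "transaction history": -127.2,
}
best, posteriors = pick_choice(scores)
print(best)
```

With equal priors the acoustic score alone decides the winner. If usage data showed that most callers ask for their balance, raising that item's weight would tip close calls toward it.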
In a VUI, developers need to handle issues such as speech recognition errors, barge-in, and discoverability of options available to callers, which are not problems in a GUI. For example, if the engine's confidence score does not exceed a threshold, a voice application will request confirmation (ConfirmationPrompt="Do you want to hear your account balance?"). If the confidence score is even lower, often because the caller said something other than the three choices, the system reprompts the user (MumblePrompt="Just say one of the following choices:..."). If the caller does not respond, because she does not know what to say, the system tries again, perhaps with another prompt.
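The confirmation and reprompt behavior described above can be sketched as a small decision function. This is a hedged illustration in Python: the threshold values, the Action names, and the next_action function are assumptions made for the example, not part of any Microsoft speech API.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"      # confident enough to move to the next state
    CONFIRM = "confirm"    # play the ConfirmationPrompt
    REPROMPT = "reprompt"  # play the MumblePrompt
    SILENCE = "silence"    # caller said nothing; try again, perhaps with another prompt


# Hypothetical thresholds; real applications tune these per deployment.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.45


def next_action(result):
    """Decide what the dialog should do with a recognition result.

    result is None when the caller did not respond, otherwise a
    (choice, confidence) tuple with confidence between 0 and 1.
    """
    if result is None:
        return Action.SILENCE, None
    choice, confidence = result
    if confidence >= ACCEPT_THRESHOLD:
        return Action.ACCEPT, choice
    if confidence >= CONFIRM_THRESHOLD:
        return Action.CONFIRM, choice
    # Very low confidence: the caller probably said something outside the grammar.
    return Action.REPROMPT, None


print(next_action(("account balance", 0.62)))
```

Keeping the policy in one function makes it easy to see that every recognition outcome, including silence, leads to a defined next step.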
To simplify authoring, a graphical dialog designer (see Figure) allows developers to choose from a palette of high-level controls that encapsulate best practices of VUI development, such as handling mumbles and confirmations, without the need for custom code. Such controls include QA, Statement, Get & Confirm, Menu, Recorded Message, and Navigatable List. These have properties specifying all the necessary prompts and recognition branches.
Unfortunately, callers' answers are not always covered by simple grammars like the one in the previous example. Since manually building robust FSGs is difficult, we have developed technology that can automatically generate the grammar by adapting the system's probabilistic dictation grammar with example sentences provided by the developer (balance, account balance, and checking account balance, for instance, all map to the state <account_balance>). The compiled probabilistic FSG is complex but covers most possible sentences a caller might say. Since the system might recognize "I want to chat my balance" when the caller actually said "I want to check my balance," a statistical classifier is optionally built to determine the caller's intention. In this example, the word balance is strong evidence for the classifier to conclude that the caller wants the account balance.
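As a rough illustration of how a statistical classifier can recover the caller's intention from a misrecognized sentence, here is a minimal Python sketch. The article does not say which classifier is used; this example assumes a simple naive Bayes model with add-one smoothing. The example sentences for account_balance come from the text above, while those for make_payment and transaction_history, and the names train and classify, are hypothetical.

```python
import math
from collections import Counter

# Example sentences per state. Only the account_balance examples are from
# the article; the others are invented for this sketch.
EXAMPLES = {
    "account_balance": ["balance", "account balance", "checking account balance"],
    "make_payment": ["make a payment", "pay my bill", "payment"],
    "transaction_history": ["transaction history", "recent transactions", "history"],
}


def train(examples):
    """Count words per intent and compute intent priors."""
    word_counts = {
        intent: Counter(w for s in sentences for w in s.lower().split())
        for intent, sentences in examples.items()
    }
    vocab = set().union(*word_counts.values())
    total = sum(len(sentences) for sentences in examples.values())
    priors = {intent: len(s) / total for intent, s in examples.items()}
    return word_counts, vocab, priors


def classify(text, model):
    """Return the intent with the highest naive Bayes score."""
    word_counts, vocab, priors = model
    # Words never seen in training (such as a misrecognized "chat") are ignored.
    words = [w for w in text.lower().split() if w in vocab]
    best_intent, best_score = None, -math.inf
    for intent, counts in word_counts.items():
        n = sum(counts.values())
        score = math.log(priors[intent])
        for w in words:
            # Add-one smoothing keeps unseen (intent, word) pairs from scoring zero.
            score += math.log((counts[w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


model = train(EXAMPLES)
print(classify("I want to chat my balance", model))
```

Even though "chat" is a recognition error, the word "balance" carries enough weight that the sentence still maps to the account balance state.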
Our goal is to create tools that simplify the development of voice applications much like Visual Basic® simplified the development of GUI applications. More information is available at Speech Technology.

Alex Acero is Research Area Manager at Microsoft Research. He directs speech recognition activities and has contributed to technology shipped in Microsoft speech products.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.