Usability Testing

Speech Server 2004 R2

Test systems for usability before deploying them. Well-designed tests nearly always reveal problems, and when those problems surface early in the process, designers can react quickly and efficiently, before the design has been committed to code.

Phases of Usability

Testing can occur during design, during development or after deployment.

In a large system, individual pieces of the system might be tested over the course of the design and production process, ideally while the design can still benefit from the test results.

In the final stages of development, all of the system's components are working together for the first time and usability testing can focus on the interactions of all these components. Usability testing at this stage enables designers and engineers to fine-tune and optimize the system. However, because of the schedule, making major design changes is often difficult or impossible at this point.

After an application has been deployed, detailed statistical data can be compiled from logging recognition rates, user abandonment, transfers to operator and task completion rates. In addition, testing can validate the design and track factors such as user satisfaction and successful automation rates in comparison to previously deployed (DTMF) systems.
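
As a rough illustration of how such statistics might be compiled, the following Python sketch aggregates the four rates named above from a list of call records. The `CallRecord` fields and the `summarize_calls` function are hypothetical names chosen for this example; they do not reflect the actual logging schema of any speech platform, and a real deployment would read these values from its own log format.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical per-call log fields, assumed for illustration only.
    recognition_attempts: int
    recognition_successes: int
    abandoned: bool
    transferred_to_operator: bool
    task_completed: bool


def summarize_calls(calls):
    """Return recognition, abandonment, transfer and task completion rates."""
    total = len(calls)
    if total == 0:
        raise ValueError("at least one call record is required")
    attempts = sum(c.recognition_attempts for c in calls)
    successes = sum(c.recognition_successes for c in calls)
    return {
        # Recognition rate is computed per utterance, the others per call.
        "recognition_rate": successes / attempts if attempts else 0.0,
        "abandonment_rate": sum(c.abandoned for c in calls) / total,
        "transfer_rate": sum(c.transferred_to_operator for c in calls) / total,
        "task_completion_rate": sum(c.task_completed for c in calls) / total,
    }


if __name__ == "__main__":
    sample = [
        CallRecord(4, 3, False, False, True),
        CallRecord(2, 1, True, False, False),
        CallRecord(4, 4, False, True, False),
    ]
    for name, value in summarize_calls(sample).items():
        print(f"{name}: {value:.2f}")
```

Running the same summary over logs from a previously deployed DTMF system would give a like-for-like baseline for the comparison described above, provided both systems log the same events.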

Wizard of Oz Testing

The first phase of usability testing enables designers to evaluate the overall success or failure of their design by using simulations of interactions with callers. Flaws are revealed before money and time are spent on implementation, allowing designers to collect information and evaluate user responses to the system design. This stage of testing is known as Wizard of Oz testing because of the "man behind the curtain" effect, a reference to the novel by L. Frank Baum and the 1939 classic movie. Wizard of Oz testing is a simulation of a caller-system interaction. The "man behind the curtain," usually a trained tester, either operates a prototype version of the system over a phone line or speaks scripted prompts in response to a user's action. The user experience is similar to interacting with a functioning interactive voice response (IVR) system.

One advantage of Wizard of Oz testing is that rapid iterations, particularly minor changes in wording or call flow, are immediately testable. Wizard of Oz testing is also a highly cost-effective way to compare multiple designs. Although there are significant technical limitations, Wizard of Oz testing can be useful for identifying incomprehensible language or prompts that fail in their functional requirements and result in repeated user error.

The drawback of this approach is that it does not uncover errors that arise from system performance and recognition rates. Wizard of Oz testing is therefore not useful for determining the effects of the design's timing functions, discovering errors caused by system delays or generating credible technology statistics. Wizard of Oz tests can also produce a false positive effect, in which a design appears to succeed because a human, rather than a recognizer, is interpreting the caller. Because of this possibility, the standards applied to the results of a Wizard of Oz test should be extremely high. Designers should address any design problems uncovered by the test.


Prototype Testing

Prototypes are either pieces of the system or individual components built on a separate development platform. A prototype is much more similar to the finished speech system, and it gives the caller a higher-fidelity experience than is possible with Wizard of Oz testing. However, prototypes can represent a significant development commitment and can divert resources from the main development effort.

The most important advantage of prototype testing is the ability to test the recognizer within the system design. Designers can measure recognition rates, then evaluate and adjust the grammars, the call flow and the prompt wording accordingly. From the user's perspective, interacting with a prototype is a much better approximation of working with the finished system, but there are still drawbacks. Although prototype testing provides some generalized timing data, real data on the effects of system timings and latencies is not available until the final product is implemented. Timing plays a crucial role in speech systems, and it needs to be measured and evaluated when all the components and subcomponents of a system are up and running. Despite this drawback, prototypes are good for thoroughly testing separate modules of the system and for testing designs that are too complex for a human tester to administer, and they remain consistent over multiple tests and subjects. Barge-in, which affects many aspects of a speech system's design, can also be tested.

Test Objectives

What kinds of general information does the designer want to receive from the design phase tests? Here are some questions that are useful to ask about the application design:

  • Can the user complete a specific task?
  • Are the prompts leading users to speak in-grammar utterances?
  • Is one design more efficient than another?
  • Is the user easily able to learn the system? Evaluating this necessitates testing the same users over the same tasks multiple times.
  • Is the user quickly able to share the mental model that the designers intended?
  • Does the user like working with the system?
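
Some of these questions, such as task completion and the relative efficiency of two designs, lend themselves to simple measurement. The Python sketch below is one possible way to tabulate them; the `(design, completed, seconds)` tuple layout and the `compare_designs` function are assumptions made for this example, not part of any speech toolkit.

```python
from collections import defaultdict
from statistics import mean


def compare_designs(sessions):
    """Summarize test sessions per design.

    sessions: iterable of (design, completed, seconds) tuples, a
    hypothetical record layout assumed for this sketch.
    """
    grouped = defaultdict(list)
    for design, completed, seconds in sessions:
        grouped[design].append((completed, seconds))

    summary = {}
    for design, rows in grouped.items():
        # Only completed tasks count toward the timing figure, so that
        # abandoned attempts do not distort the efficiency comparison.
        completed_times = [secs for ok, secs in rows if ok]
        summary[design] = {
            "sessions": len(rows),
            "completion_rate": len(completed_times) / len(rows),
            "mean_seconds_when_completed": (
                mean(completed_times) if completed_times else None
            ),
        }
    return summary


if __name__ == "__main__":
    observed = [
        ("A", True, 40.0), ("A", True, 60.0), ("A", False, 90.0),
        ("B", True, 30.0), ("B", False, 80.0),
    ]
    for design, stats in compare_designs(observed).items():
        print(design, stats)
```

With the small subject pools typical of usability tests, differences between designs should be read as indications rather than proof. To evaluate learnability, the same tabulation could be grouped by trial number for each returning user instead of by design.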

Some of these questions are highly subjective and can be difficult to answer objectively, without tester bias. Subjects in a laboratory want to please investigators and may not voice their criticisms of the system. Videotape recordings are invaluable for capturing the most impartial and detailed information, because multiple usability analysts can replay and analyze them long after the fact.

Appropriate Test Subjects

Test subjects should always represent the most typical users of an application. If a speech system is designed to be a broad-based, customer-facing application, then the pool of test subjects should contain a cross section of expert and novice users: people with little or no previous IVR experience as well as people who are completely comfortable working with speech systems. Also consider social and economic status and education level. One important and sometimes ignored variable is the system's ability to handle the speech of diverse users.

Use of Follow-up User Questionnaires and Interviews

A post-test questionnaire allows investigators to collect users' impressions of the overall system and capture their feedback on ease of use and likeability of the voice as well as their satisfaction with the experience of using the system to work on tasks. Professional guidelines detail methods for empirically measuring user satisfaction. Often, descriptions are presented in a statement format, with varying levels of response for the user to choose from, for example:

I thought the system was helpful.

Strongly disagree-----------------------------------------Strongly agree


This kind of approach is also useful when capturing impressions of the voice of the system.

For example:

How likely would you be to recommend this to a friend?

Very unlikely--------------------------------------------Very likely
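
Responses on agreement scales like the ones above are commonly coded as numbers so that they can be summarized across subjects. The following Python sketch assumes a 5-point coding (1 for the leftmost label, 5 for the rightmost); the source does not specify the number of scale points, and the `summarize_item` function is a hypothetical helper written for this example.

```python
from collections import Counter
from statistics import mean, median

# Assumed 5-point coding: 1 = "Strongly disagree", 5 = "Strongly agree".
SCALE_MIN, SCALE_MAX = 1, 5


def summarize_item(responses):
    """Summarize the coded responses to one questionnaire statement."""
    if not responses:
        raise ValueError("at least one response is required")
    for value in responses:
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"response {value} is outside the scale")
    counts = Counter(responses)
    return {
        "n": len(responses),
        "mean": mean(responses),
        "median": median(responses),
        # Share of subjects choosing one of the two most favorable points.
        "top_two_share": (counts[SCALE_MAX - 1] + counts[SCALE_MAX])
        / len(responses),
        "distribution": {
            point: counts[point] for point in range(SCALE_MIN, SCALE_MAX + 1)
        },
    }


if __name__ == "__main__":
    helpful = [5, 4, 4, 2, 3, 5, 1, 4]  # illustrative data only
    print(summarize_item(helpful))
```

Reporting the full distribution alongside the mean is a deliberate choice here: with few subjects, one or two strongly negative responses can be more informative than the average.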


Post-test interviews are another good way of collecting user assumptions about the speech system. An investigator can react specifically to the events of the usability test and focus on problems that arose during the user's interaction with the system. If a user was unable to complete a task, the investigator can focus on the problems the user encountered. Why did a user not use a specific feature during the test, such as barge-in or help? Was the feature offered and explained appropriately to the user? Did the feature seem too advanced for the user? Did the user think the feature was unnecessary?