
Programming Models

Speech Server 2007

This content is no longer actively maintained. It is provided as is, for anyone who may still be using these technologies, with no warranties or claims of accuracy with regard to the most recent product version or service release.

With Speech Server, developers can use one of two programming models to develop voice response applications: the managed code programming model and the Web-based programming model. With Speech Server Developer Tools, developers can implement managed code voice response applications using Microsoft Windows Workflow Foundation technology and implement Web-based voice response applications using Speech Application Language Tags (SALT) technology or VoiceXML technology. For information about Windows Workflow Foundation, see Windows Workflow Foundation.

The Speech Server programming models and implementation technologies are illustrated in the following diagram.


Using Speech Server Developer Tools, you can build the following types of voice response applications:

  • Managed code workflow applications.
  • Web-based SALT or VoiceXML applications.
  • Managed code workflow applications that wrap a Web-based application.

The integration of Windows Workflow Foundation technology in Speech Server enables developers to create workflow-based voice response applications that run in a managed execution environment hosted by ASP.NET. Workflows support a compositional approach to creating applications and model complex, real-world processes as a set of activities. Simple activities implement the execution of specific tasks, while composite activities implement the execution of one or more child activities. Activities can be executed by people or by system functions and can be used to control application flow, concurrency, synchronization, exception handling, and interactions with other applications.

Authoring voice response applications using workflows offers several distinct benefits, including the following:

  • Application source code is centralized and can be compiled into a single assembly because the code is managed.
  • Applications can be more easily extended because they are compositional.
  • Applications are easier to modify because they are more transparent.
  • Because the tools are graphical, simple applications can be developed rapidly, even by developers who have little familiarity with C# or Visual Basic.
  • Application processes, dependencies on external data or processes, and user interactions are easier to manage because all application code is server-side only.

The Web-based programming model enables developers to easily construct Web-based voice response applications using both simple and complex form-filling dialogs with various forms of confirmation strategies. This model envisions speech applications as being based in a distributed Web environment and as comprising two tiers:

  • A logic and data tier that processes business rules and stores and accesses data.
  • A presentation tier that interacts with speech and DTMF input and produces speech output. This is also referred to as the voice user interface (VUI).

Microsoft provides a rich set of APIs to address the logic and data tier. Speech Server adds support for developing Speech Application Language Tags (SALT) and VoiceXML applications to address the presentation tier.

For SALT application development, Speech Server provides both tools and APIs for authoring and running the VUI. The SALT APIs are divided into two layers:

  • The SALT layer. This is the low-level layer that provides a small set of XML elements, with associated attributes and Document Object Model (DOM) object properties, events, and methods that apply a speech interface to Web pages.
  • The Speech Controls layer. This is a high-level layer that provides ASP.NET with advanced controls for authoring SALT applications. On a Web page built with Speech Controls, the Web server translates each Speech Control tag into client-side SALT, as well as the necessary script to tie the application together.

Two additional layers complete the SALT programming model: the RunSpeech and Application Controls layers. RunSpeech is the algorithm that sits directly beneath the Speech Controls and drives the form-filling process, dictating the progress of an application's execution based on the state of the application's semantic items. Application Controls are composites of Speech Controls (together with specialized recognition grammars and prompts) designed to collect particular types of information, and as such, this layer sits on top of the Speech Controls layer.

For VoiceXML application development, Speech Server provides support for a host of application features, markup, and object properties defined by the VoiceXML 2.0 specification, with additional support for several features of the VoiceXML 2.1 specification. These are enumerated in Supported Features in VoiceXML Applications. Speech Server also supports the creation of VoiceXML Web-based voice response application projects in Visual Studio, as well as the integration of existing VoiceXML applications.

SALT and VoiceXML programming each have unique characteristics. Characteristics of the SALT programming approach include:

  • Use of a small set of standardized markup tags.
  • Use of well-known markup and scripting languages.
  • A somewhat steeper learning curve than VoiceXML.

Characteristics of the VoiceXML programming approach include:

  • Use of a large set of markup tags.
  • Use of a more universally accepted standard.
  • Application intent that is more transparent from the markup.

Managed code workflow programming offers developers a simpler alternative to the Web-based programming model. For example, Web-based application programming requires the use of client-side JScript and raises issues such as postback handling. In contrast, workflow application data is much simpler to manage because code execution and data processing both take place on the server.

In general, voice response application development using workflows is simpler because of the following strengths of the workflow programming model:

  • Programming flexibility
  • Improved activation and flow management
  • The ability to break out of the dialog flow

Programming Flexibility

By focusing primarily on activities, managed code workflow programming enables developers to wrap portions of a voice response application workflow in a higher level activity (raising the level of abstraction and improving the maintainability of the application) and then create reusable components. In this model, a workflow can be seen as a hierarchical structure of component activities, where the workflow is a container of component activities. These component activities can themselves become a container with their own children or remain simple elements (primitives) of the workflow.
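
The hierarchical structure described above can be modeled in a few lines. The following is a minimal, language-neutral sketch in Python, not the Speech Server or Windows Workflow Foundation API: the Activity and CompositeActivity classes and the activity names are hypothetical and exist only to illustrate how primitives can be wrapped in a reusable higher-level activity.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Activity:
    """A primitive activity: a single named task with no children (hypothetical)."""
    name: str

    def run(self, log: list[str]) -> None:
        # A real activity would play a prompt or recognize speech;
        # this sketch only records that it ran.
        log.append(self.name)


@dataclass
class CompositeActivity(Activity):
    """A container activity that runs its children in order (hypothetical)."""
    children: list[Activity] = field(default_factory=list)

    def run(self, log: list[str]) -> None:
        log.append(f"begin {self.name}")
        for child in self.children:
            child.run(log)
        log.append(f"end {self.name}")


def build_get_account_number() -> CompositeActivity:
    """Wrap two primitives in one reusable, higher-level activity."""
    return CompositeActivity(
        "GetAccountNumber",
        [Activity("AskAccountNumber"), Activity("ConfirmAccountNumber")],
    )


def build_workflow() -> CompositeActivity:
    """The workflow itself is a container of component activities."""
    return CompositeActivity(
        "BankingWorkflow",
        [Activity("Welcome"), build_get_account_number(), Activity("Goodbye")],
    )
```

Calling build_workflow().run(log) with an empty list visits the tree depth-first, so the reusable GetAccountNumber component can be dropped into any other workflow without changes.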

The Dialog Activities layer of the Speech Server API offers a number of classes of primitive objects. A few examples include:

  • The QuestionAnswerActivity class for implementing a single question-answer turn in a dialog.
  • The StatementActivity class for playing a single prompt.
  • The RecordAudioActivity class for playing a prompt and initiating the recording of the user response.

The Dialog Activities layer also provides the FormFillingDialogActivity class, which can be used together with the QuestionAnswerActivity, GetAndConfirmActivity, and ValidatorActivity classes to implement mixed-initiative dialogs. With mixed-initiative dialogs, users can optionally provide multiple pieces of information in the same answer. FormFillingDialogActivity automates the selection of application questions to ask for new information or confirm information already recognized, depending on what information is already known and the recognition confidence of that information. FormFillingDialogActivity implements an algorithm equivalent to RunSpeech, which is the algorithm that interprets pages produced by Speech Controls.
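
The selection logic of a form-filling dialog can be approximated as follows. This is a simplified sketch in Python under stated assumptions, not the actual FormFillingDialogActivity or RunSpeech implementation: the Slot class, the next_action and fill functions, and the 0.8 confidence threshold are all hypothetical, and the policy shown (confirm low-confidence values first, then ask for missing ones) is only one plausible ordering.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

CONFIRM_THRESHOLD = 0.8  # hypothetical confidence cutoff, chosen for illustration


@dataclass
class Slot:
    """One piece of information the dialog must collect (hypothetical)."""
    name: str
    value: Optional[str] = None
    confidence: float = 0.0
    confirmed: bool = False


def next_action(slots: list[Slot]) -> Optional[tuple[str, str]]:
    """Pick the next dialog turn, or return None when the form is complete.

    Values recognized with low confidence are confirmed before any
    new question is asked.
    """
    for slot in slots:
        needs_confirmation = (
            slot.value is not None
            and not slot.confirmed
            and slot.confidence < CONFIRM_THRESHOLD
        )
        if needs_confirmation:
            return ("confirm", slot.name)
    for slot in slots:
        if slot.value is None:
            return ("ask", slot.name)
    return None


def fill(slots: list[Slot], answer: dict[str, tuple[str, float]]) -> None:
    """Apply one user answer, which may carry several slots at once."""
    for slot in slots:
        if slot.name in answer:
            slot.value, slot.confidence = answer[slot.name]
            slot.confirmed = False
```

Because fill accepts several slots in one answer, a caller who says both an origin and a destination in a single utterance skips the questions the dialog would otherwise have asked, which is the mixed-initiative behavior described above.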

Improved Activation and Flow Management

Every container manages the dialog flow between its child components. When a workflow is first activated, it finds the child activity that needs to run first and activates that child. The container waits for the child to finish running and then finds the next child to run. This process continues until the container finds no more child components to run. Flow can be dynamic, as in a semantically driven (goal-driven) dialog for form-filling, or procedural, as in a finite-state machine model.
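
The activation loop described above, and the difference between procedural and goal-driven flow, can be sketched as follows. This Python model is illustrative only and does not reflect how Speech Server schedules activities internally: the Child class, the two selector functions, and run_container are hypothetical names.

```python
from __future__ import annotations
from typing import Callable, Optional


class Child:
    """A child activity; running it records its name in a trace (hypothetical)."""

    def __init__(self, name: str,
                 is_needed: Callable[[dict], bool] = lambda state: True):
        self.name = name
        self.is_needed = is_needed  # goal test used only by the goal-driven selector
        self.done = False

    def run(self, state: dict) -> None:
        state.setdefault("trace", []).append(self.name)
        self.done = True


def procedural(children: list[Child], state: dict) -> Optional[Child]:
    """Finite-state style: the next child is the first one that has not run."""
    return next((c for c in children if not c.done), None)


def goal_driven(children: list[Child], state: dict) -> Optional[Child]:
    """Semantically driven: skip children whose goal is already satisfied."""
    return next((c for c in children if not c.done and c.is_needed(state)), None)


def run_container(children: list[Child], state: dict,
                  select: Callable[[list[Child], dict], Optional[Child]]) -> list[str]:
    """Activate one child at a time until the selector finds none left."""
    while (child := select(children, state)) is not None:
        child.run(state)  # the container waits for the child to finish
    return state.get("trace", [])
```

With the procedural selector every child runs in declaration order. With the goal-driven selector, a child such as Child("AskCity", lambda s: "city" not in s) is skipped when the state already contains a city.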

Ability to Break Out of the Dialog Flow

Some dialog scenarios require the ability to interrupt the dialog flow and switch to a different task in the application. User commands, such as a request to start over, are a good example of this need.

In the SALT Web-based voice response application programming model, commands are driven by the dialog flow and form-filling. If the user stops the dialog, the dialog must continue from the point where it stopped. If the developer gives the user the ability to escape a dialog and then reset and restart the dialog from the beginning, the developer must carefully design the application so that it backtracks through all the semantic items that have been set up to that point in the dialog and resets each one. If the developer gives the user the ability to escape the dialog and cause the application to perform a different task, the developer must also design the application to backtrack and evaluate the state of all semantic items that have been set before determining how to proceed.

Using a workflow, developers can program an application so that if the user stops the dialog, the application can play a completely independent dialog, perform tasks that are independent of the state of semantic items in the current dialog, or reset the current dialog altogether.
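
One way to picture this is to treat a user command as an event that is handled outside the form-filling logic. The following Python sketch is an assumption-laden illustration, not Speech Server code: the UserCommand exception, the ask and run_dialog functions, and the "start over" and "operator" command phrases are all hypothetical, and caller input is scripted as a list of strings.

```python
from __future__ import annotations


class UserCommand(Exception):
    """Raised when the caller speaks a command instead of an answer (hypothetical)."""

    def __init__(self, command: str):
        super().__init__(command)
        self.command = command


def ask(slot: str, scripted_input: list[str], items: dict) -> None:
    """Consume one scripted caller utterance for the given slot."""
    utterance = scripted_input.pop(0)
    if utterance in ("start over", "operator"):
        raise UserCommand(utterance)
    items[slot] = utterance


def run_dialog(slots: list[str], scripted_input: list[str]) -> dict:
    """Fill every slot, handling commands independently of slot state."""
    items: dict = {}
    while True:
        try:
            for slot in slots:
                if slot not in items:
                    ask(slot, scripted_input, items)
            return items
        except UserCommand as cmd:
            if cmd.command == "start over":
                # Reset the whole dialog in one step; no backtracking
                # through individual semantic items is needed.
                items.clear()
            elif cmd.command == "operator":
                # Switch to an independent task, ignoring slot state.
                return {"transferred": True}
```

The command handler never inspects which slots were filled before the interruption, which is the contrast with the SALT model drawn above.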

The managed code APIs are contained in the Microsoft.SpeechServer namespace, which contains several secondary namespaces. These namespaces and brief descriptions of their purposes are listed in the following table.

SpeechServer API

Namespace Description


For low-level control of core speech services such as creating hosted application containers, creating telephony and conference sessions, manipulating caller information, controlling logging, and controlling recording.


For controlling dialogs. Important classes include FormFillingDialogActivity, QuestionAnswerActivity, StatementActivity, GetAndConfirmActivity, and SemanticItem.


For voice and DTMF recognition. Important classes include RecognizeCompletedEventArgs, DigitDetectedEventArgs, and RecognitionResult.


For programmatically building World Wide Web Consortium (W3C) Speech Recognition Grammar Specification (SRGS) grammars. Important classes include SrgsDocument, SrgsItem, and SrgsRule.


For text-to-speech (TTS) output. Important classes include PromptBuilder and PromptStyle.
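
To make the idea of building an SRGS grammar programmatically more concrete, the following sketch produces a small W3C SRGS XML grammar using only the Python standard library. It does not use or imitate the SrgsDocument, SrgsRule, or SrgsItem classes mentioned in the table; the build_choice_grammar function and the sample rule are hypothetical, and only the element names and namespace come from the SRGS 1.0 specification.

```python
from __future__ import annotations
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"   # SRGS 1.0 namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:lang


def build_choice_grammar(rule_id: str, choices: list[str],
                         lang: str = "en-US") -> str:
    """Return an SRGS grammar whose root rule matches any one of the choices."""
    ET.register_namespace("", SRGS_NS)  # emit SRGS as the default namespace
    grammar = ET.Element(f"{{{SRGS_NS}}}grammar", {
        "version": "1.0",
        "root": rule_id,
        f"{{{XML_NS}}}lang": lang,
    })
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule",
                         {"id": rule_id, "scope": "public"})
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    for choice in choices:
        # Each alternative the caller may say becomes one <item>.
        ET.SubElement(one_of, f"{{{SRGS_NS}}}item").text = choice
    return ET.tostring(grammar, encoding="unicode")
```

For example, build_choice_grammar("city", ["Seattle", "Boston"]) returns a grammar with a single public rule named city that accepts either city name.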