Using Speech Synthesis in UCMA 3.0: Working with SSML (Part 3 of 4)

01/20/2015

Summary: Combine the capabilities of Microsoft Unified Communications Managed API (UCMA) 3.0 Core SDK with Microsoft Speech Platform SDK to make synthesized speech in your application more natural sounding. Use Speech Synthesis Markup Language (SSML) to insert pauses, increase or decrease the speech volume, expand abbreviations correctly, and pronounce words or phrases phonetically. Part 3 discusses how to add SSML markup to normalize text in many ways, add emphasis, play an external audio file, load an external SSML file, or fine-tune pronunciations using Universal Phone Set (UPS) phonemes.

Applies to: Microsoft Unified Communications Managed API (UCMA) 3.0 Core SDK | Microsoft Speech Platform SDK

Published: August 2011 | Provided by: Mark Parker, Microsoft | About the Author

Contents

Speech Normalization Using the say-as Element
Speaking with Emphasis
Using an External Audio File
Using an External SSML File
Speaking Phonetically
Part 4
Additional Resources

This article is the third in a four-part series of articles on how to use speech synthesis in a Microsoft Unified Communications Managed API (UCMA) 3.0 application.

Speech Normalization Using the say-as Element

The Speech Synthesis Markup Language (SSML) say-as element can be used to speak normalized text in several different ways, depending on the value of the interpret-as attribute.

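The following example shows the declarations for two variables that are used in the remaining examples in Part 3. The examples also call the Speak method on a speech synthesizer instance named _speechSynthesizer, which is not declared in this part and is assumed to have been created and configured already.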

PromptBuilder prompt = new PromptBuilder();
String str;

Speaking a Phone Number

Setting the interpret-as attribute to “telephone” causes the text within the say-as element to be spoken as a telephone number.

The following example speaks a string of numeric digits as a telephone number. In this example, the telephone number is rendered as “four two five, five five five, zero one nine nine.”

// Speak a telephone number.
str = "For more information, call <say-as interpret-as=\"telephone\">425-555-0199</say-as>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();
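
The string passed to AppendSsmlMarkup is XML, so text that comes from user input or a database needs to be escaped before it is placed inside a say-as element. The following sketch shows one way to do that with System.Xml.Linq. The SsmlText class and its SayAs method are hypothetical helpers written for this illustration; they are not part of UCMA or the Microsoft Speech Platform SDK.

```csharp
using System;
using System.Xml.Linq;

public static class SsmlText
{
    // Hypothetical helper: wraps text in a say-as element.
    // XElement escapes characters such as & and < in the text and attribute value.
    public static string SayAs(string interpretAs, string text)
    {
        var element = new XElement("say-as",
            new XAttribute("interpret-as", interpretAs),
            text);

        // DisableFormatting keeps the markup on one line with no added white space.
        return element.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // Prints: <say-as interpret-as="telephone">425-555-0199</say-as>
        Console.WriteLine(SayAs("telephone", "425-555-0199"));

        // Prints: <say-as interpret-as="letters">R&amp;D</say-as>
        Console.WriteLine(SayAs("letters", "R&D"));
    }
}
```

The returned string can be concatenated into the markup that is passed to AppendSsmlMarkup. Any literal text placed around the element needs the same escaping if it can contain reserved XML characters.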

Speaking the Time

Setting the interpret-as attribute to “time” causes the text within the say-as element to be spoken as a clock time.

The following example renders the time portion as “three fifty two P.M.”

// Speak the time.
str = "The plane arrives at <say-as interpret-as=\"time\">3:52</say-as> P.M.";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking a Number as a Fraction

Setting the interpret-as attribute to “number” causes the text within the say-as element to be spoken as a number. The number can take different forms, including an integer, a fraction, a floating-point number, or a Roman numeral.

The following example renders “3/4” as “three-fourths.”

// Speak a fraction.
str = "The recipe calls for <say-as interpret-as=\"number\">3/4</say-as> of a cup of milk";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking an Address

Setting the interpret-as attribute to “address” causes the text within the say-as element to be spoken as an address. Commonly used abbreviations such as street (St.), avenue (Ave.), or boulevard (Blvd.) are normalized to their complete expanded forms. Abbreviations for state names are normalized to their complete forms. The following example normalizes the state abbreviation OH to Ohio.

// Speak an address.
str = "The Reds play in Cincinnati <say-as interpret-as=\"address\">OH</say-as>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

The following example normalizes the string “St. Paul St.” to “Saint Paul Street.” The speech synthesizer can determine from context that the first “St.” should be pronounced “Saint” and the second “St.” should be pronounced “Street.”

// Speak an address.
str = "I live on <say-as interpret-as=\"address\">St. Paul St.</say-as>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking a Number as an Ordinal

Setting the interpret-as attribute to “ordinal” causes the number within the say-as element to be spoken as an ordinal value. For example, “1” is normalized to “first.” The following example speaks the digit 2 as “second.”

// Speak an ordinal number.
str = "We are going on the <say-as interpret-as=\"ordinal\">2</say-as> of June";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking a Date in Month and Day Format

Setting the interpret-as attribute to “date_md” causes the number within the say-as element to be spoken as a date in month and day format, with the number for the day normalized as an ordinal value. The following example normalizes the string 8.21 to “August 21st.”

// Speak as a date - month and day.
str = "His birthday is <say-as interpret-as=\"date_md\">8.21</say-as>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking a Date in Year Format

Setting the interpret-as attribute to “date:y” causes the number within the say-as element to be spoken as a year. The following example normalizes the year 2011 to “two thousand eleven.”

// Speak a number as a year.
str = "In the year <say-as interpret-as=\"date:y\">2011</say-as>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking an Acronym

Setting the interpret-as attribute to “letters” causes the text within the say-as element to be spelled out letter by letter. The following example speaks the abbreviation “URL” as “U,” “R,” “L.”

// Speak an acronym.
str = "Uniform resource locator is abbreviated as <say-as interpret-as=\"letters\">URL</say-as>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Speaking with Emphasis

The emphasis element can be used to increase the level of stress with which the contained text is spoken.

Note

The amount of emphasis can vary from voice to voice.

The following example uses the break element to insert a short pause in the spoken sentence, and puts emphasis on the phrase “right away.”

// Speak with emphasis.
str = "We need to get this done <break strength=\"weak\"/> <emphasis>right away</emphasis>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

Using an External Audio File

The AppendAudio method on the PromptBuilder class can be used to append prerecorded audio to a prompt. The following example causes the audio in CHIMES.WAV to be played.

// Speak a prompt that includes an audio file.
String currDirPath = Environment.CurrentDirectory;
prompt.AppendText("Listen for the sound of the chimes.");
prompt.AppendAudio(currDirPath + "\\CHIMES.WAV");
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();
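
The example above builds the file path by string concatenation and assumes that the file exists. The following sketch builds the path with Path.Combine and checks for the file before it is handed to the prompt. The PromptFiles class and its Resolve method are hypothetical names used only for this illustration. The article does not say how a missing audio file is reported at run time, so the early check is a defensive assumption rather than a requirement of the API.

```csharp
using System;
using System.IO;

public static class PromptFiles
{
    // Hypothetical helper: builds the full path and fails early if the file is missing.
    public static string Resolve(string directory, string fileName)
    {
        // Path.Combine inserts the directory separator when it is needed.
        string fullPath = Path.Combine(directory, fileName);
        if (!File.Exists(fullPath))
        {
            throw new FileNotFoundException("Prompt file not found.", fullPath);
        }
        return fullPath;
    }

    public static void Main()
    {
        try
        {
            string wavPath = Resolve(Environment.CurrentDirectory, "CHIMES.WAV");
            Console.WriteLine("Audio file found: " + wavPath);
            // In the article's example, the next call would be:
            // prompt.AppendAudio(wavPath);
        }
        catch (FileNotFoundException ex)
        {
            Console.WriteLine("Missing file: " + ex.FileName);
        }
    }
}
```

The same helper could be used for the external SSML file that is loaded in the next section.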

Using an External SSML File

For complex prompts with many changes in speaking rate or volume, it might be convenient to place the SSML markup in a separate file. The AppendSsml method on the PromptBuilder class can be used to load such a file into the prompt before the prompt is spoken.

The following example shows how to load an external SSML file into a prompt.

// Speak a prompt that is loaded from a file.
String currDirPath = Environment.CurrentDirectory;
prompt.AppendSsml(XmlReader.Create(currDirPath + "\\ssml.xml"));
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

The following example shows the contents of the SSML file that is loaded in the previous example. The prosody element specifies the pitch, rate, or volume at which the text is spoken. The text in this file is spoken as “Your order for three books will be shipped July twenty-first.”

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns:ssml="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<sentence>
  <prosody volume="70">Your order for <break time="500ms" /></prosody>
  <prosody rate="-20%" volume="100"><emphasis>3 <break time="250ms" /> books</emphasis></prosody>
  <prosody volume="70"> <break time="500ms" /> will be shipped <say-as interpret-as="date_md">7.21</say-as></prosody>
</sentence>
</speak>
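
Because the SSML file is edited by hand, a mismatched tag is an easy mistake to make. The following sketch shows a simple check that could be run on the markup before it is passed to AppendSsml. The SsmlFileCheck class is a hypothetical helper written for this illustration. It confirms only that the text is well-formed XML with a speak root element; it does not validate element or attribute names against the SSML specification.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class SsmlFileCheck
{
    // Hypothetical pre-flight check: returns null when the markup is well-formed XML
    // with a speak root element; otherwise returns a description of the problem.
    public static string Check(string ssml)
    {
        try
        {
            XDocument doc = XDocument.Parse(ssml);
            string rootName = doc.Root.Name.LocalName;
            if (rootName != "speak")
            {
                return "Root element is '" + rootName + "', expected 'speak'.";
            }
            return null;
        }
        catch (XmlException ex)
        {
            // Thrown for unclosed or mismatched tags and other syntax errors.
            return "Not well-formed XML: " + ex.Message;
        }
    }

    public static void Main()
    {
        string good = "<speak version=\"1.0\" xml:lang=\"en-US\">" +
                      "<prosody volume=\"70\">Hello</prosody></speak>";
        // The prosody element is never closed in this string.
        string bad = "<speak version=\"1.0\"><prosody volume=\"70\">Hello</speak>";

        Console.WriteLine(Check(good) ?? "OK");
        Console.WriteLine(Check(bad) ?? "OK");
    }
}
```

For a file on disk, the contents could be read with File.ReadAllText and passed to Check before the XmlReader is created.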

Speaking Phonetically

The following example uses the AppendSsmlMarkup method on the PromptBuilder instance to speak phonemes for the word “measure,” which contains a ZH sound. For a list of phonemes that are used in English, see Phoneme Table for English (United States).

// Speak some phonemes - "measure" 
str = "In good <phoneme alphabet=\"x-microsoft-ups\" ph=\"M EH . ZH AX RA\">alternate text</phoneme>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

The following example uses phonemes to pronounce the phrase “Rodin’s Thinker.” This example presents the pronunciation of the voiceless “th” sound (TH) as it is used in words such as “think,” “thin,” and “thaw.”

// Speak some phonemes - "Rodin's Thinker".
str = "<phoneme alphabet=\"x-microsoft-ups\" ph=\"RA O + UH . S1 D AE N Z\">alternate text</phoneme>";
str += "<phoneme alphabet=\"x-microsoft-ups\" ph=\"TH IH NG . K RA\">alternate text</phoneme>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();

The following example uses phonemes to pronounce the word “there.” This example presents the pronunciation of the voiced “th” sound (DH) as it is used in words such as “then,” “there,” and “other.”

// Speak some phonemes - "there".
str = "Here and <phoneme alphabet=\"x-microsoft-ups\" ph=\"DH EH RA\">alternate text</phoneme>";
prompt.AppendSsmlMarkup(str);
_speechSynthesizer.Speak(prompt);
prompt.ClearContent();
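
When several words need custom pronunciations, writing each phoneme element by hand becomes repetitive. The following sketch keeps the UPS strings in a lookup table and wraps matching words automatically. The PronunciationTable class is a hypothetical helper written for this illustration; the UPS strings are copied from the examples above. Unlike those examples, it places the word itself inside the phoneme element instead of the placeholder “alternate text,” on the assumption that readable element content is more useful if the ph attribute cannot be used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security;
using System.Xml.Linq;

public static class PronunciationTable
{
    // UPS phoneme strings taken from the article's examples.
    private static readonly Dictionary<string, string> Phones =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "measure", "M EH . ZH AX RA" },
            { "there", "DH EH RA" }
        };

    // Hypothetical helper: wraps each known word in a phoneme element
    // and XML-escapes every other word.
    public static string Markup(string sentence)
    {
        IEnumerable<string> parts = sentence.Split(' ').Select(word =>
        {
            string phones;
            if (Phones.TryGetValue(word, out phones))
            {
                return new XElement("phoneme",
                    new XAttribute("alphabet", "x-microsoft-ups"),
                    new XAttribute("ph", phones),
                    word).ToString(SaveOptions.DisableFormatting);
            }
            return SecurityElement.Escape(word);
        });
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        // Prints: Here and <phoneme alphabet="x-microsoft-ups" ph="DH EH RA">there</phoneme>
        Console.WriteLine(Markup("Here and there"));
    }
}
```

This sketch splits on single spaces only, so a word followed by punctuation (for example, “there.”) is not matched. A production version would need a more careful tokenizer.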

Part 4

Using Speech Synthesis in UCMA 3.0: Code Listing and Conclusion (Part 4 of 4)

Additional Resources

For more information, see the following resources:

About the Author

Mark Parker is a programming writer at Microsoft whose current responsibility is the UCMA SDK documentation. Mark previously worked on the Microsoft Speech Server 2007 documentation.