Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

prosody Element (Microsoft.Speech)

Specifies the pitch, contour, range, rate, duration, and volume for speaking the contained text.

Syntax

<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"> </prosody>

Attributes

Attribute

Description

pitch

Optional. Indicates the baseline pitch for the contained text. This value may be expressed in one of three ways:

  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600Hz.

  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example, +80Hz or -2st. The "st" unit denotes a semitone, which is half of a tone (a half step) on the standard diatonic scale.

  • An enumeration value, from among the following: x-low, low, medium, high, x-high, or default.
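The three forms of the pitch value can be illustrated with a minimal SSML sketch. The sentence text is invented for illustration, and the audible result depends on the selected voice, which may limit or substitute the requested values.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
 xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-US">

  <!-- Absolute value: a number followed by "Hz" -->
  <s><prosody pitch="600Hz">This sentence requests an absolute pitch.</prosody></s>

  <!-- Relative value: a signed number followed by "Hz" or "st" -->
  <s><prosody pitch="-2st">This sentence requests a pitch two semitones lower.</prosody></s>

  <!-- Enumeration value -->
  <s><prosody pitch="x-high">This sentence requests the highest named pitch.</prosody></s>

</speak>
```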

contour

Optional. Represents changes in pitch for speech content as an array of targets at specified time positions in the speech output. Each target is a pair of parameters enclosed in parentheses, for example:

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each pair specifies the location of the pitch change as a percentage of the duration of the contained text (a number followed by "%"). The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value as described for the pitch attribute.

range

Optional. A value that represents the range of pitch for the contained speech content. This value may be expressed using the same absolute values, relative values, or enumeration values described for the pitch attribute.

rate

Optional. Indicates the speaking rate of the contained text. This value may be expressed in one of two ways:

  • A relative value, expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate. A value of .5 results in a halving of the rate. A value of 3 results in a tripling of the rate.

  • An enumeration value, from among the following: x-slow, slow, medium, fast, x-fast, or default.
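As a rough illustration of the two forms of the rate value, the following sketch applies a multiplier to one sentence and an enumeration value to another. The sentence text is invented for illustration, and the actual speaking rate depends on the selected voice.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
 xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-US">

  <!-- Relative value: a multiplier of the default rate (0.5 requests half speed) -->
  <s><prosody rate="0.5">This sentence should be spoken at about half the default rate.</prosody></s>

  <!-- Enumeration value -->
  <s><prosody rate="fast">This sentence should be spoken faster than the default rate.</prosody></s>

</speak>
```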

duration

Optional. A value in seconds or milliseconds for the period of time that should elapse while the speech synthesis (TTS) engine reads the contents of the element. For example, 2s or 1800ms.

volume

Optional. Indicates the volume level of the speaking voice. This value may be expressed in one of three ways:

  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.

  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example, +10 or -5.5.

  • An enumeration value, from among the following: silent, x-soft, soft, medium, loud, x-loud, or default.
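The three forms of the volume value might be used as in the sketch below. The sentence text is invented for illustration, and the resulting loudness depends on the selected voice and on the audio output device.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
 xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-US">

  <!-- Absolute value: a number from 0.0 to 100.0 -->
  <s><prosody volume="75">This sentence requests three quarters of full volume.</prosody></s>

  <!-- Relative value: a signed number that changes the current volume -->
  <s><prosody volume="+10">This sentence requests a volume increase of ten.</prosody></s>

  <!-- Enumeration value -->
  <s><prosody volume="soft">This sentence requests a soft voice.</prosody></s>

</speak>
```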

Note

Well-formed XML requires attribute values to be enclosed in quotation marks. For example, <prosody volume="90"> is a well-formed element, but <prosody volume=90> is not.

Remarks

Because prosodic attribute values can vary over a wide range, the speech synthesis (TTS) engine interprets the assigned values as a suggestion of what the actual prosodic values of the selected voice should be. The engine limits or substitutes values that are not supported. Examples of unsupported values are a pitch of 1 MHz or a volume of 120.

Note

The speech synthesis engines for the Microsoft Speech Platform do not support the contour, range, or duration attributes at this time. Setting values for these attributes will produce no change in the synthesized speech output.

Example

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
 xmlns="http://www.w3.org/2001/10/synthesis"
 xml:lang="en-US">

  <s>
    Your order for <prosody pitch="+1st" rate="-10%" volume="90"> 8 books and 1 reading lamp </prosody> 
    will be shipped tomorrow.
  </s>

</speak>