June 2019

Volume 34 Number 6

[Speech]

Text-To-Speech Synthesis in .NET

By Ilia Smirnov | June 2019

I often fly to Finland to see my mom. Every time the plane lands in Vantaa airport, I’m surprised at how few passengers head for the airport exit. The vast majority set off for connecting flights to destinations spanning all of Central and Eastern Europe. It’s no wonder, then, that when the plane begins its descent, there’s a barrage of announcements about connecting flights. “If your destination is Tallinn, look for gate 123,” “For flight XYZ to Saint Petersburg, proceed to gate 234,” and so on. Of course, flight attendants don’t typically speak a dozen languages, so they use English, which is not the native language of most passengers. Considering the quality of the public announcement (PA) systems on the airliners, plus engine noise, crying babies and other disturbances, how can any information be effectively conveyed?

Well, each seat is equipped with headphones. Many, if not all, long-distance planes have individual screens today (and local ones have at least different audio channels). What if a passenger could choose the language for announcements and an onboard computer system allowed flight attendants to create and send dynamic (that is, not pre-recorded) voice messages? The key challenge here is the dynamic nature of the messages. It’s easy to pre-record safety instructions, catering options and so on, because they’re rarely updated. But we need to create messages literally on the fly.

Fortunately, there’s a mature technology that can help: text-to-speech synthesis (TTS). We rarely notice such systems, but they’re ubiquitous: public announcements, prompts in call centers, navigation devices, games, smart devices and other applications are all examples where pre-recorded prompts aren’t sufficient or where storing digitized waveforms is impractical due to memory limitations (text read by a TTS engine takes far less space to store than the equivalent digitized waveform).

Computer-based speech synthesis is hardly new. Telecom companies invested in TTS to overcome the limitations of pre-recorded messages, and military researchers have experimented with voice prompts and alerts to simplify complex control interfaces. Portable synthesizers have likewise been developed for people with disabilities. For an idea of what such devices were capable of 25 years ago, listen to the track “Keep Talking” on the 1994 Pink Floyd album “The Division Bell,” where Stephen Hawking says his famous line: “All we need to do is to make sure we keep talking.”

TTS APIs are often provided along with their “opposite”—speech recognition. While you need both for effective human-computer interaction, this exploration is focused specifically on speech synthesis. I’ll use the Microsoft .NET TTS API to build a prototype of an airliner PA system. I’ll also look under the hood to understand the basics of the “unit selection” approach to TTS. And while I’ll be walking through the construction of a desktop application, the principles here apply directly to cloud-based solutions.

Roll Your Own Speech System

Before prototyping the in-flight announcement system, let’s explore the API with a simple program. Start Visual Studio and create a console application. Add a reference to System.Speech and implement the method in Figure 1.

Figure 1 System.Speech.Synthesis Method

using System.Speech.Synthesis;
namespace KeepTalking
{
  class Program
  {
    static void Main(string[] args)
    {
      var synthesizer = new SpeechSynthesizer();
      synthesizer.SetOutputToDefaultAudioDevice();
      synthesizer.Speak("All we need to do is to make sure we keep talking");
    }
  }
}

Now compile and run. Just a few lines of code and you’ve replicated the famous Hawking phrase.

When you were typing this code, IntelliSense opened a window with all the public methods and properties of the SpeechSynthesizer class. If you missed it, press Ctrl+Space, or type a dot after the object name to bring it back (or see bit.ly/2PCWpat). What’s interesting here?

First, you can set different output targets: an audio file, a stream, or even null. Second, you have both synchronous (as in the previous example) and asynchronous output. You can also adjust the volume and the rate of speech, pause and resume it, and receive events.
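For example, here’s a quick sketch of these options (the output path and the rate and volume values are my own choices, not from the original sample). It writes the result to a WAV file instead of the speakers and speaks asynchronously, printing a message when the prompt completes (add using System; to the Figure 1 program):

var synthesizer = new SpeechSynthesizer();
// Write the result to a WAV file instead of the default audio device.
synthesizer.SetOutputToWaveFile(@"C:\temp\keep-talking.wav");
synthesizer.Rate = -2;    // Slightly slower than normal; the range is -10 to 10.
synthesizer.Volume = 80;  // The range is 0 to 100.
// Speak asynchronously and get notified when the prompt completes.
synthesizer.SpeakCompleted += (s, e) => Console.WriteLine("Done.");
synthesizer.SpeakAsync("All we need to do is to make sure we keep talking");
Console.ReadKey(); // Keep the console application alive until synthesis finishes.

You can also select voices. This feature is important here, because you’ll use it to generate output in different languages. But what voices are available? Let’s find out, using the code in Figure 2.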

Figure 2  Voice Info Code

using System;
using System.Speech.Synthesis;
namespace KeepTalking
{
  class Program
  {
    static void Main(string[] args)
    {
      var synthesizer = new SpeechSynthesizer();
      foreach (var voice in synthesizer.GetInstalledVoices())
      {
        var info = voice.VoiceInfo;
        Console.WriteLine($"Id: {info.Id} | Name: {info.Name} |
          Age: {info.Age} | Gender: {info.Gender} | Culture: {info.Culture}");
      }
      Console.ReadKey();
    }
  }
}

On my machine running Windows 10 Home, the output from Figure 2 is:

Id: TTS_MS_EN-US_DAVID_11.0 | Name: Microsoft David Desktop |
  Age: Adult | Gender: Male | Culture: en-US
Id: TTS_MS_EN-US_ZIRA_11.0 | Name: Microsoft Zira Desktop |
  Age: Adult | Gender: Female | Culture: en-US

There are only two English voices available. What about other languages? Each voice takes up disk space, so additional voices aren’t installed by default. To add them, navigate to Start | Settings | Time & Language | Region & Language and click Add a language, making sure to select Speech in the optional features. While Windows supports more than 100 languages, only about 50 of them have TTS voices. You can review the list of supported languages at bit.ly/2UNNvba.

After restarting your computer, a new language pack should be available. In my case, after adding Russian, I got a new voice installed:

Id: TTS_MS_RU-RU_IRINA_11.0 | Name: Microsoft Irina Desktop |
  Age: Adult | Gender: Female | Culture: ru-RU

Now you can return to the first program and add these two lines instead of the synthesizer.Speak call:

synthesizer.SelectVoice("Microsoft Irina Desktop");
synthesizer.Speak("Всё, что нам нужно сделать, это продолжать говорить");

If you want to switch between languages, you can insert SelectVoice calls here and there. But a better way is to add some structure to speech. For that, let’s use the PromptBuilder class, as shown in Figure 3.

Figure 3 The PromptBuilder Class

using System.Globalization;
using System.Speech.Synthesis;
namespace KeepTalking
{
  class Program
  {
    static void Main(string[] args)
    {
      var synthesizer = new SpeechSynthesizer();
      synthesizer.SetOutputToDefaultAudioDevice();
      var builder = new PromptBuilder();
      builder.StartVoice(new CultureInfo("en-US"));
      builder.AppendText("All we need to do is to keep talking.");
      builder.EndVoice();
      builder.StartVoice(new CultureInfo("ru-RU"));
      builder.AppendText("Всё, что нам нужно сделать, это продолжать говорить");
      builder.EndVoice();
      synthesizer.Speak(builder);
    }
  }
}

Notice that you have to call EndVoice; otherwise, you’ll get a runtime error. Also, I used CultureInfo as another way to specify a language. PromptBuilder has lots of useful methods, but I want to draw your attention to AppendTextWithHint. Try this code:

var builder = new PromptBuilder();
builder.AppendTextWithHint("3rd", SayAs.NumberOrdinal);
builder.AppendBreak();
builder.AppendTextWithHint("3rd", SayAs.NumberCardinal);
synthesizer.Speak(builder);

Another way to structure input and specify how to read it is to use Speech Synthesis Markup Language (SSML), a cross-platform recommendation developed by the W3C Voice Browser Working Group (w3.org/TR/speech-synthesis). Microsoft TTS engines provide comprehensive support for SSML. This is how to use it:

string phrase = @"<speak version=""1.0""
  https://www.w3.org/2001/10/synthesis""
  xml:lang=""en-US"">";
phrase += @"<say-as interpret-as=""ordinal"">3rd</say-as>";
phrase += @"<break time=""1s""/>";
phrase += @"<say-as interpret-as=""cardinal"">3rd</say-as>";
phrase += @"</speak>";
synthesizer.SpeakSsml(phrase);

Notice that this snippet calls SpeakSsml on the SpeechSynthesizer class instead of Speak.

Now you’re ready to work on the prototype. This time create a new Windows Presentation Foundation (WPF) project. Add a form with a couple of buttons for prompts in two different languages, then wire up click handlers in the code-behind, as shown in Figure 4.
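Figure 4 shows only the code-behind, so here’s a minimal MainWindow.xaml sketch that matches its click handlers (the button labels and layout are my own choices, not from the original project):

<Window x:Class="GuiTTS.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        Title="Announcements" Height="150" Width="300">
  <StackPanel Margin="10">
    <Button Content="English" Margin="5" Click="PromptInEnglish"/>
    <Button Content="Russian" Margin="5" Click="PromptInRussian"/>
  </StackPanel>
</Window>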

Figure 4 The MainWindow Code-Behind

using System.Collections.Generic;
using System.Globalization;
using System.Speech.Synthesis;
using System.Windows;
namespace GuiTTS
{
  public partial class MainWindow : Window
  {
    private const string en = "en-US";
    private const string ru = "ru-RU";
    private readonly IDictionary<string, string> _messagesByCulture =
      new Dictionary<string, string>();
    public MainWindow()
    {
      InitializeComponent();
      PopulateMessages();
    }
    private void PromptInEnglish(object sender, RoutedEventArgs e)
    {
      DoPrompt(en);
    }
    private void PromptInRussian(object sender, RoutedEventArgs e)
    {
      DoPrompt(ru);
    }
    private void DoPrompt(string culture)
    {
      var synthesizer = new SpeechSynthesizer();
      synthesizer.SetOutputToDefaultAudioDevice();
      var builder = new PromptBuilder();
      builder.StartVoice(new CultureInfo(culture));
      builder.AppendText(_messagesByCulture[culture]);
      builder.EndVoice();
      synthesizer.Speak(builder);
    }
    private void PopulateMessages()
    {
      _messagesByCulture[en] =
        "For connecting flight 123 to Saint Petersburg, please proceed to gate A1";
      _messagesByCulture[ru] =
        "Для пересадки на рейс 123 в Санкт-Петербург, пожалуйста, пройдите к выходу A1";
    }
  }
}

Obviously, this is just a tiny prototype. In real life, PopulateMessages will probably read from an external resource. For example, a flight attendant can generate a file with messages in multiple languages by using an application that calls a service like Bing Translator (bing.com/translator). The form will be much more sophisticated and dynamically generated based on available languages. There will be error handling and so on. But the point here is to illustrate the core functionality.

Deconstructing Speech

So far we’ve achieved our objective with a surprisingly small codebase. Let’s take an opportunity to look under the hood and better understand how TTS engines work.

There are many approaches to constructing a TTS system. Historically, researchers have tried to discover a set of pronunciation rules on which to build algorithms. If you’ve ever studied a foreign language, you’re familiar with rules like “Letter ‘c’ before ‘e,’ ‘i,’ ‘y’ is pronounced as ‘s’ as in ‘city,’ but before ‘a,’ ‘o,’ ’u’ as ‘k’ as in ‘cat.’” Alas, there are so many exceptions and special cases—like pronunciation changes in consecutive words—that constructing a comprehensive set of rules is difficult. Moreover, most such systems tend to produce a distinct “machine” voice—imagine a beginner in a foreign language pronouncing a word letter-by-letter.

For more natural-sounding speech, research has shifted toward systems based on large databases of recorded speech fragments, and these engines now dominate the market. Commonly known as concatenative unit-selection TTS, these engines select speech samples (units) based on the input text and concatenate them into phrases. They usually work in two stages, closely resembling a compiler: first, parse the input into an internal list- or tree-like structure with phonetic transcription and additional metadata; then synthesize sound from this structure.

Because we’re dealing with natural languages, these parsers are more sophisticated than those for programming languages. Beyond tokenization (finding the boundaries of sentences and words), a parser must correct typos, identify parts of speech, analyze punctuation, and decode abbreviations, contractions and special symbols. Parser output is typically split into phrases or sentences and formed into collections of word descriptors, each carrying metadata such as part of speech, pronunciation, stress and so on.
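To make this more concrete, here’s a purely illustrative C# shape for such a structure (no real engine exposes exactly this; the type and member names are mine):

using System.Collections.Generic;

// Illustrative only: one possible shape for parser output.
enum PartOfSpeech { Noun, Verb, Adjective, Determiner, Preposition, Other }

class WordToken
{
  public string Text;            // Normalized spelling, e.g. "doctor" for "Dr."
  public PartOfSpeech Pos;       // Resolved part of speech ("project" as noun or verb)
  public string[] Phonemes;      // Filled in later by the grapheme-to-phoneme stage
  public int StressedSyllable;   // Which syllable carries the stress
}

class Sentence
{
  public List<WordToken> Words = new List<WordToken>();
  public bool IsQuestion;        // Punctuation hint, used later for intonation
}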

Parsers are responsible for resolving ambiguities in the input. For example, what is “Dr.”? Is it “doctor” as in “Dr. Smith,” or “drive” as in “Privet Drive?” And is “Dr.” a sentence because it starts with an uppercase letter and ends with a period? Is “project” a noun or a verb? This is important to know because the stress is on different syllables.

These questions are not always easy to answer and many TTS systems have separate parsers for specific domains: numerals, dates, abbreviations, acronyms, geographic names, special forms of text like URLs and so on. They’re also language- and region-specific. Luckily, such problems have been studied for a long time and we have well-developed frameworks and libraries to lean on.

The next step is generating pronunciation forms, such as tagging the tree with sound symbols (like transforming “school” to “s k uh l”). This is done by special grapheme-to-phoneme algorithms. For languages like Spanish, some relatively straightforward rules can be applied. But for others, like English, pronunciation differs significantly from the written form. Statistical methods are then employed along with databases for known words. After that, additional post-lexical processing is needed, because the pronunciation of words can change when combined in a sentence.
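As a toy illustration (nothing like a production algorithm, but it shows the shape of the problem), a grapheme-to-phoneme stage can be thought of as an exceptions dictionary backed by fallback letter-to-sound rules, such as the “c” rule mentioned earlier:

using System.Collections.Generic;

static class GraphemeToPhoneme
{
  // Exceptions dictionary; real systems hold hundreds of thousands of entries.
  static readonly Dictionary<string, string[]> Lexicon =
    new Dictionary<string, string[]>
    {
      ["school"] = new[] { "s", "k", "uh", "l" },
      ["talking"] = new[] { "t", "ao", "k", "ih", "ng" }
    };

  public static string[] ToPhonemes(string word)
  {
    word = word.ToLowerInvariant();
    if (Lexicon.TryGetValue(word, out var phones))
      return phones;

    // Crude letter-to-sound fallback; real engines use statistical models here.
    var result = new List<string>();
    for (int i = 0; i < word.Length; i++)
    {
      char c = word[i];
      if (c == 'c')
        result.Add(i + 1 < word.Length && "eiy".IndexOf(word[i + 1]) >= 0 ? "s" : "k");
      else
        result.Add(c.ToString());
    }
    return result.ToArray();
  }
}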

While parsers try to extract all possible information from the text, there’s something that’s so elusive that it’s not extractable: prosody or intonation. While speaking, we use prosody to emphasize certain words, to convey emotion, and to indicate affirmative sentences, commands and questions. But written text doesn’t have symbols to indicate prosody. Sure, punctuation offers some context: A comma means a slight pause, while a period means a longer one, and a question mark means you raise your intonation toward the end of a sentence. But if you’ve ever read your children a bedtime story, you know how far these rules are from real reading.

Moreover, two different people often read the same text differently (ask your children who is better at reading bedtime stories—you or your spouse). Because of this you cannot reliably use statistical methods since different experts will produce different labels for supervised learning. This problem is complex and, despite intensive research, far from being solved. The best programmers can do is use SSML, which has some tags for prosody.

Neural Networks in TTS

Statistical or machine learning methods have for years been applied in all stages of TTS processing. For example, Hidden Markov Models are used to create parsers producing the most likely parse, or to perform labeling for speech sample databases. Decision trees are used in unit selection or in grapheme-to-phoneme algorithms, while neural networks and deep learning have emerged at the bleeding edge of TTS research.

We can treat an audio recording as a time series of waveform samples. By building an auto-regressive model, it’s possible to predict the next sample from the previous ones. On its own, such a model generates speech-like babbling, like a baby learning to talk by imitating sounds. If we further condition this model on the audio transcript or on the pre-processing output of an existing TTS system, we get a parameterized model of speech. The model’s output describes a spectrogram for a vocoder, which produces the actual waveform. Because this process is generative and doesn’t rely on a database of recorded samples, the model has a small memory footprint and allows its parameters to be adjusted.

Because the model is trained on natural speech, the output retains all of its characteristics, including breathing, stresses and intonation (so neural networks can potentially solve the prosody problem). It’s possible also to adjust the pitch, create a completely different voice and even imitate singing.

At the time of this writing, Microsoft is offering its preview version of a neural network TTS (bit.ly/2PAYXWN). It provides four voices with enhanced quality and near instantaneous performance.

Speech Generation

Now that we have the tree with metadata, we turn to speech generation. Early TTS systems tried to synthesize signals by combining sinusoids. Another interesting approach was constructing a system of differential equations describing the human vocal tract as several connected tubes of different diameters and lengths. Such solutions are very compact, but unfortunately sound quite mechanical. So, as with musical synthesizers, the focus gradually shifted to solutions based on samples, which require significant space but sound essentially natural.

To build such a system, you need many hours of high-quality recordings of a professional actor reading specially constructed text. This text is split into units, labeled and stored in a database. Speech generation then becomes a task of selecting the proper units and gluing them together.
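To give a flavor of the selection step, here’s a deliberately simplified greedy sketch (real engines score many more features and optimize the whole phrase with a Viterbi-style search; the cost terms here are my own simplification):

using System;
using System.Collections.Generic;
using System.Linq;

class Unit
{
  public string Phoneme;      // Label of the stored speech sample
  public double Pitch;        // Average fundamental frequency of the sample
  public double StartEnergy;  // Loudness at the start boundary
  public double EndEnergy;    // Loudness at the end boundary
}

static class UnitSelector
{
  // For each target phoneme, greedily pick the database unit that minimizes
  // a target cost (distance from the requested pitch) plus a join cost
  // (energy mismatch with the previously chosen unit).
  public static List<Unit> Select(
    IEnumerable<string> targetPhonemes, double targetPitch, ILookup<string, Unit> database)
  {
    var chosen = new List<Unit>();
    foreach (var phoneme in targetPhonemes)
    {
      Unit previous = chosen.Count > 0 ? chosen[chosen.Count - 1] : null;
      Unit best = database[phoneme]
        .OrderBy(u => Math.Abs(u.Pitch - targetPitch) +
          (previous == null ? 0 : Math.Abs(u.StartEnergy - previous.EndEnergy)))
        .FirstOrDefault();
      if (best != null)
        chosen.Add(best);
    }
    return chosen;
  }
}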

Because you’re concatenating recorded samples rather than synthesizing speech from scratch, you can’t significantly adjust parameters at run time. If you need both male and female voices, or must provide regional accents (say, Scottish or Irish), each has to be recorded separately. The text must be constructed to cover all the sound units you’ll need, and the actors must read in a neutral tone to make concatenation easier.

Splitting and labeling are also non-trivial tasks. They used to be done manually, taking weeks of tedious work. Thankfully, machine learning is now being applied here as well.

Unit size is probably the most important parameter of such a TTS system. Obviously, by using whole sentences, we could produce the most natural sound, even with correct prosody, but recording and storing that much data is impossible. Can we split the text into words? Probably, but how long would it take for an actor to read an entire dictionary? And what database size limitations are we facing? On the other hand, we cannot just record the alphabet; that’s sufficient only for a spelling bee contest. So units are usually groups of two or three letters. They’re not necessarily syllables, because groups spanning syllable borders can often be joined together more smoothly.

Now for the last step. Given a database of speech units, we need to deal with concatenation. Alas, no matter how neutral the intonation was in the original recording, connecting units still requires adjustments to avoid jumps in volume, frequency and phase. This is done with digital signal processing (DSP). It can also be used to add some intonation to phrases, such as raising or lowering the generated voice for assertions or questions.
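The simplest such adjustment is a short crossfade at the seam between two units. Here’s a bare-bones version for PCM sample buffers (real DSP also matches pitch and phase; this only smooths volume):

using System;

static class Dsp
{
  // Concatenate two sample buffers with a linear crossfade over
  // 'overlap' samples to avoid an audible jump at the seam.
  // Assumes both buffers are at least 'overlap' samples long.
  public static float[] Crossfade(float[] left, float[] right, int overlap)
  {
    var result = new float[left.Length + right.Length - overlap];
    Array.Copy(left, result, left.Length - overlap);

    for (int i = 0; i < overlap; i++)
    {
      float t = (float)i / overlap; // Ramps from 0 to 1 across the seam.
      result[left.Length - overlap + i] =
        left[left.Length - overlap + i] * (1 - t) + right[i] * t;
    }

    Array.Copy(right, overlap, result, left.Length, right.Length - overlap);
    return result;
  }
}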

Wrapping Up

In this article I covered only the .NET API. Other platforms provide similar functionality. macOS has NSSpeechSynthesizer in Cocoa with comparable features, and most Linux distributions include the eSpeak engine. All of these APIs are native, so you access them from languages like C#, C++ or Swift. For cross-platform ecosystems like Python, there are bridges such as Pyttsx, but they usually come with certain limitations.

Cloud vendors, on the other hand, target wide audiences, and offer services for most popular languages and platforms. While functionality is comparable across vendors, support for SSML tags can differ, so check documentation before choosing a solution.

Microsoft offers a Text-to-Speech service as part of Cognitive Services (bit.ly/2XWorku). It not only gives you 75 voices in 45 languages, but also allows you to create your own voices. For that, the service needs audio files with a corresponding transcript. You can write the text first and then have someone read it, or take an existing recording and write its transcript. After uploading these datasets to Azure, a machine learning algorithm trains a model for your own unique “voice font.” A good step-by-step guide can be found at bit.ly/2VE8th4.

A very convenient way to access Cognitive Speech Services is by using the Speech Software Development Kit (bit.ly/2DDTh9I). It supports both speech recognition and speech synthesis, and is available for all major desktop and mobile platforms and most popular languages. It’s well documented and there are numerous code samples on GitHub.
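As a quick taste, synthesizing a prompt with the Speech SDK comes down to just a few lines. This sketch assumes the Microsoft.CognitiveServices.Speech NuGet package; the subscription key and region are placeholders you get from the Azure portal:

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class CloudTts
{
  static async Task Main()
  {
    // Placeholders: substitute your own subscription key and service region.
    var config = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
    using (var synthesizer = new SpeechSynthesizer(config))
    {
      // Speaks through the default audio device using the service's default voice.
      var result = await synthesizer.SpeakTextAsync(
        "For connecting flight 123 to Saint Petersburg, please proceed to gate A1");
      if (result.Reason != ResultReason.SynthesizingAudioCompleted)
        Console.WriteLine($"Synthesis failed: {result.Reason}");
    }
  }
}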

TTS continues to be a tremendous help to people with special needs. For example, check out linka.su, a Web site created by a talented programmer with cerebral palsy to help people with speech and musculoskeletal disorders, autism, or those recovering from a stroke. Knowing from personal experience what limitations they face, the author created a range of applications for people who can’t type on a regular keyboard, can select only one letter at a time, or can just touch a picture on a tablet. Thanks to TTS, he literally gives a voice to those who don’t have one. I wish that we all, as programmers, could be that useful to others.


Ilia Smirnov has more than 20 years of experience developing enterprise applications on major platforms, primarily in Java and .NET. For the last decade, he has specialized in the simulation of financial risks. He holds three master’s degrees, as well as FRM and other professional certifications.

Thanks to the following Microsoft technical expert for reviewing this article: Sheng Zhao (Sheng.Zhao@microsoft.com)
Sheng Zhao is a principal group software engineering manager with STCA Speech in Beijing.

