Experimenting with Web API Speech Synthesis

In this article we will explore the Web API and its cross-platform capabilities in speech synthesis. Partly out of curiosity, and partly as a throwback to the time when it was fun to make the computer read text out loud just because it could, I wanted to explore how well modern browsers can generate human-like speech.

In the process I created a simple React app which demonstrates these capabilities.


Check out the application here.


The application is a progressive web app (PWA), including a service worker and app manifest, so you can also test it on your mobile devices and pin it to your home screen for easy access. As a PWA, it is also available if your device goes offline.

Why Web API?

One might say that trying to do this in a browser is a bit of an overreach, but consider these benefits:

  • No API quotas
  • Simple interface
  • Offline support
  • No change in speech quality after losing connection
  • Multiple voice options to choose from
  • Customizable parameters
  • Cross-platform solutions are easy to implement

Cons

  • Voice quality is not the best (I’m looking at you, Microsoft David)

That is really it. With all the benefits outweighing the cons, choosing this approach was a no-brainer.

Speech Synthesis Basics

Before you can start using speech synthesis, you must verify that the current device in fact does support this feature:

const hasVoice = 'speechSynthesis' in window;
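
A slightly more defensive version of this check could look like the sketch below. The helper name `supportsSpeechSynthesis` and its `scope` parameter are my own and not part of the Web API. Passing the global object in keeps the check usable in places where `window` does not exist, such as server-side rendering.

```javascript
// Hypothetical helper: checks that both the speech controller and the
// utterance class exist. `scope` defaults to globalThis, which is
// `window` in a browser.
function supportsSpeechSynthesis(scope = globalThis) {
  return (
    typeof scope === 'object' &&
    scope !== null &&
    'speechSynthesis' in scope &&
    typeof scope.SpeechSynthesisUtterance === 'function'
  );
}

// Illustrative variable name; true only where synthesis is available.
const canSpeak = supportsSpeechSynthesis();
```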

Unlike with speech recognition, with synthesis you do not need any additional permissions from the user. Simply checking that the ability to generate speech exists is sufficient to proceed.

To generate an utterance, we must create an instance of SpeechSynthesisUtterance. This is the Web API class that represents a single speech request: the text to speak and how to speak it.

let utterance = new SpeechSynthesisUtterance();

For the utterance, we can customize the parameters to generate different types of speech.

You can change the rate, pitch, volume, and language. The actual words to be spoken out loud are defined using the text property.

utterance.rate = 1; // 0.1 to 10
utterance.pitch = 1; // 0 to 2
utterance.volume = 1; // 0 to 1
utterance.text = 'Hello World';
utterance.lang = 'en-US';
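
If these values come from user input, such as sliders, it may be worth clamping them to the ranges above before assigning them. Here is a minimal sketch, assuming the hypothetical helpers `clamp` and `applySpeechSettings`. It works on any utterance-like object, so nothing here is specific to the app described in this article.

```javascript
// Ranges taken from the comments in the snippet above.
const RANGES = {
  rate: [0.1, 10],
  pitch: [0, 2],
  volume: [0, 1],
};

// Hypothetical helper: keep `value` within [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: copy settings onto an utterance-like object.
// Finite numbers are clamped; anything else (undefined, NaN) is skipped
// so the utterance keeps its current value.
function applySpeechSettings(utterance, settings) {
  for (const [name, [min, max]] of Object.entries(RANGES)) {
    if (Number.isFinite(settings[name])) {
      utterance[name] = clamp(settings[name], min, max);
    }
  }
  if (typeof settings.text === 'string') utterance.text = settings.text;
  if (typeof settings.lang === 'string') utterance.lang = settings.lang;
  return utterance;
}
```

In a browser you would call it as `applySpeechSettings(new SpeechSynthesisUtterance(), { rate: 25, text: 'Hello World' })`, which leaves the rate at 10.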

To customize voices you may access the list of available voices on the device:

const voices = window.speechSynthesis.getVoices();
utterance.voice = voices[0]; // select one of voices
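
One caveat: in some browsers, Chrome among them, the voice list is populated asynchronously, so getVoices() can return an empty array the first time it is called. The voiceschanged event fires once the list is ready. The sketch below waits for the voices and then picks one by language. `loadVoices` and `pickVoice` are hypothetical helper names, `synth` stands for window.speechSynthesis, and the two-second timeout is an arbitrary assumption.

```javascript
// Hypothetical helper: resolves with the voice list once it is non-empty,
// or with whatever is available after `timeoutMs`.
function loadVoices(synth, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    // Guarded in case the object does not expose event listeners.
    const canListen = typeof synth.addEventListener === 'function';
    const finish = () => {
      clearTimeout(timer);
      if (canListen) synth.removeEventListener('voiceschanged', finish);
      resolve(synth.getVoices());
    };
    const timer = setTimeout(finish, timeoutMs);
    if (canListen) synth.addEventListener('voiceschanged', finish);
  });
}

// Hypothetical helper: first voice whose language tag starts with the
// given prefix, else the platform default voice, else the first voice.
function pickVoice(voices, langPrefix) {
  // Normalise tags written with an underscore (en_US) to en-us.
  const normalise = (tag) => tag.toLowerCase().replace('_', '-');
  const wanted = normalise(langPrefix);
  return (
    voices.find((v) => normalise(v.lang).startsWith(wanted)) ||
    voices.find((v) => v.default) ||
    voices[0] ||
    null
  );
}
```

In a browser this could be used as `loadVoices(window.speechSynthesis).then((voices) => { utterance.voice = pickVoice(voices, 'en'); })`.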

Lastly, to have the computer speak the text, we call speak() with the utterance instance as a parameter:

speechSynthesis.speak(utterance);
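
Since speak() only queues the utterance and returns right away, it can be handy to wrap it in a promise that settles when the speech has finished or failed. The sketch below does that with the utterance's onend and onerror handlers. `speakText` is a hypothetical helper, and `synth` and `Utterance` are passed in as parameters (in a browser they would be window.speechSynthesis and SpeechSynthesisUtterance) so the function does not depend on globals.

```javascript
// Hypothetical helper: speaks `text` and resolves when speech ends,
// or rejects with the error code reported by the browser.
function speakText(text, { synth, Utterance, voice = null, lang = 'en-US' }) {
  return new Promise((resolve, reject) => {
    const utterance = new Utterance(text);
    utterance.lang = lang;
    if (voice) utterance.voice = voice;
    utterance.onend = () => resolve();
    utterance.onerror = (event) =>
      reject(new Error(`Speech failed: ${event.error}`));
    synth.speak(utterance);
  });
}
```

In a browser the call would be `speakText('Hello World', { synth: window.speechSynthesis, Utterance: SpeechSynthesisUtterance })`, which makes it straightforward to chain several sentences one after another.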

Runtime

So now we have our basic speech generator working. The question becomes: how does it perform on different devices? We found that while speech synthesis is supported on all the platforms we tested (iOS, Android, and Windows), there are major differences in the number and types of voices available on each platform, including support for foreign languages. This selection of voices is significant because it greatly impacts the actual quality of the synthesized output.

On iOS 12, the list of voices appears as a list of names. You may then cycle through each option and look for the best one for your use case. iOS also has good synthesis options for the specific foreign languages that we tested. Out of all the platforms, this appeared to be the best for multilingual use cases.

On Android Pie, the list of voices returned many options and appeared to have a vast range of support. However, when changing the voice, the synthesis would always result in the same voice being used. It may be an oversight in the code, but that would be strange because the same code works correctly on the other two devices that were tested.

On Microsoft Windows 10, the list of voices includes the dated Microsoft David voice, as well as some newer variations of US and UK English. There are some foreign-language voices, but the selection is significantly narrower on Windows 10 than on the mobile devices that were tested.
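
To compare platforms yourself, it helps to condense the voice list into a per-language summary. The sketch below groups voice names by language tag and by whether the voice is marked as a local service, which hints at what will keep working offline. `summarizeVoices` is a hypothetical helper. It accepts any array of objects with the `name`, `lang` and `localService` properties that voice objects expose.

```javascript
// Hypothetical helper: group voice names by language tag, split into
// voices that run on the device (local) and those that do not (remote).
function summarizeVoices(voices) {
  const byLang = {};
  for (const voice of voices) {
    const lang = voice.lang || 'unknown';
    if (!byLang[lang]) byLang[lang] = { local: [], remote: [] };
    const bucket = voice.localService ? 'local' : 'remote';
    byLang[lang][bucket].push(voice.name);
  }
  return byLang;
}
```

Running `console.log(summarizeVoices(window.speechSynthesis.getVoices()))` on each device gives a quick picture of how the selections differ.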


Try the application and see what options you get here.


Uses

Considering voice-driven apps, and possibly pairing speech synthesis with speech recognition, using browser-only support seems like a perfectly viable option. The big drawback would be getting the application to listen to user input continuously, which as far as I know is not yet possible without explicit user action.
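
As a rough illustration of such a pairing, the sketch below listens for a single phrase and speaks it back. `echoOnce` is a hypothetical helper and `scope` stands for `window`. It assumes the browser exposes the recognition class as either SpeechRecognition or the prefixed webkitSpeechRecognition, and it is meant to be called from a click handler, in line with the explicit user action mentioned above.

```javascript
// Hypothetical helper: recognise one phrase, then read it back aloud.
// Returns false when recognition or synthesis is unavailable.
function echoOnce(scope) {
  const Recognition = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  if (!Recognition || !('speechSynthesis' in scope)) return false;

  const recognition = new Recognition();
  recognition.lang = 'en-US';
  recognition.onresult = (event) => {
    // First alternative of the first result holds the recognised text.
    const transcript = event.results[0][0].transcript;
    scope.speechSynthesis.speak(new scope.SpeechSynthesisUtterance(transcript));
  };
  recognition.start();
  return true;
}
```

Wiring it up could be as simple as `button.addEventListener('click', () => echoOnce(window))`, where `button` is whatever element your page uses to start listening.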

Other use cases for speech synthesis include accessibility apps, and of course experimenting with these voices is just plain fun.

Feel free to share your Web API speech synthesis experiences below. We would love to hear some more stories and hear how you have used speech synthesis.

If you enjoyed this article, please share it with your friends!