The commercial and clinical impact of speech synthesis technology
Researchers from the Centre for Speech Technology Research developed open-source speech synthesis technology that is widely used for commercial and clinical purposes
What was the problem?
Clinical conditions such as Motor Neurone Disease can result in patients losing their ability to speak over time. Computer-based communication aids provide only a small range of often inappropriate voices which do not reflect people's identity.
This is because text-to-speech (TTS) technology (the automatic conversion of written language into speech) traditionally relies on playing back previously recorded speech sounds. This approach requires many hours' worth of high-quality recordings for any given voice, making it impractical for most people to use their own voices as the basis for a communication aid.
What did we do?
The Centre for Speech Technology Research (CSTR) is a pioneer in the field of speech synthesis. It has made significant advances in this field since the 1980s, resulting in several software tools which are freely available online.
The two main tools are the Festival software toolkit and the HTS software toolkit.
Festival provides a complete TTS framework. It implements both stages of a TTS system: text analysis and waveform (sound) generation, with the waveform generation stage playing back recorded speech sounds. From its first release in 1996 through updates as recent as 2007, Festival was continually improved, with advances including better intonation, larger lexicons, and more accurate letter-to-sound prediction.
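The two-stage pipeline described above can be sketched in a few lines. The toy lexicon, unit database, and function names below are invented for illustration and are not Festival's actual API:

```python
# Toy sketch of the two TTS stages: text analysis (mapping text to a
# phoneme sequence via a lexicon) followed by concatenative waveform
# generation (joining prerecorded unit samples).

# Stage 1: text analysis -- a toy lexicon mapping words to phonemes.
LEXICON = {"hello": ["hh", "ax", "l", "ow"], "world": ["w", "er", "l", "d"]}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        # Real systems also predict pronunciations for unseen words
        # (letter-to-sound rules); here unknown words are skipped.
        phonemes.extend(LEXICON.get(word, []))
    return phonemes

# Stage 2: waveform generation -- play back stored recordings per unit.
# Each "recording" is a short list of samples standing in for real audio.
UNIT_DB = {"hh": [0.1, 0.2], "ax": [0.3], "l": [0.4, 0.5], "ow": [0.6],
           "w": [0.7], "er": [0.8], "d": [0.9]}

def synthesise(text):
    waveform = []
    for ph in text_to_phonemes(text):
        waveform.extend(UNIT_DB[ph])  # concatenate recorded units
    return waveform

print(synthesise("hello world"))
```

The key limitation the article notes follows directly from this design: the unit database must contain high-quality recordings of every sound, which is why many hours of recordings are needed per voice.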
The HTS toolkit is based on a statistical model of synthesis, the Hidden Markov Model (HMM), and is used together with Festival. Rather than playing back recorded speech sounds for waveform generation, its statistical model can take a sample of a speaker’s voice and generate new speech sounds to match, which also allows the software to create different speaking styles. Because the statistical model is adaptive, even speech samples that are low quality, short in duration, or of disordered speech can be used to create normal-sounding, personalised synthetic speech.
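As a rough numeric illustration of the adaptive idea (not HTS's actual algorithm, which adapts HMM Gaussian parameters with techniques such as MLLR or MAP estimation), one can picture each parameter of a pre-trained average voice being pulled towards statistics estimated from a short target-speaker sample. All numbers below are invented:

```python
# Simplified sketch of speaker adaptation in statistical synthesis:
# model parameters trained on many speakers are interpolated towards
# statistics estimated from a short target-speaker recording.

def adapt(average_voice, target_stats, weight=0.7):
    """Interpolate each model parameter towards the target speaker.

    `weight` controls how strongly the limited target data overrides
    the average voice -- a crude stand-in for the data-dependent
    weighting that MAP-style adaptation applies.
    """
    return [(1 - weight) * a + weight * t
            for a, t in zip(average_voice, target_stats)]

average_voice = [1.0, 2.0, 3.0]   # parameters pooled from many speakers
target_stats  = [2.0, 2.0, 1.0]   # estimated from a short recording

adapted = adapt(average_voice, target_stats)
print(adapted)  # each parameter moves 70% of the way to the target
```

Because only a few statistics need to be estimated from the target speaker, a short or imperfect recording is enough, which is what makes the approach viable for patients whose speech is already deteriorating.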
What happened next?
Festival and HTS have been released as open source under permissive licences. This has enabled the tools to be widely used as a research and development framework across industry and academia: typically, over half of the papers at speech synthesis conferences, both industrial and academic, are based on research using the Festival and HTS toolkits.
CSTR’s speech synthesis technology has the unique ability to repair and reconstruct disordered speech. Communication products based on the Festival and HTS toolkits have been developed to assist people with speech disorders. This work led to the creation of the Speak:Unique project (previously known as the Voicebanking Research Project).
The Speak:Unique project is a collaboration between CSTR (itself a partnership between Informatics and Linguistics and English Language at Edinburgh), the Anne Rowling Regenerative Neurology Clinic, and the Euan MacDonald Centre for Motor Neurone Disease.
The project allows researchers to provide individuals with a synthesised voice which is as similar to their own voice as possible. Once a patient has recorded their voice, researchers take the features of voices from people of the same age, sex and region, and mix these together to create an average voice. This average voice can then be combined with the individual's recorded voice in order to produce a huge range of personalised speech from only a very limited set of recordings.
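The averaging step can be pictured as an element-wise mean over donor feature vectors. The vectors and function below are invented for illustration; real systems average model parameters (spectral and prosodic statistics), not raw numbers, and the resulting average voice is then adapted with the individual's own recording as described above:

```python
# Toy illustration of building an "average voice" from donors matched
# on age, sex and region, as in the Speak:Unique workflow described
# above. Feature vectors here are invented placeholder numbers.

def average_voice(donor_features):
    """Element-wise mean of donor feature vectors."""
    n = len(donor_features)
    return [sum(col) / n for col in zip(*donor_features)]

donors = [
    [1.0, 4.0, 2.0],   # donor A (same age/sex/region as the patient)
    [3.0, 2.0, 2.0],   # donor B
    [2.0, 3.0, 5.0],   # donor C
]

avg = average_voice(donors)
print(avg)  # [2.0, 3.0, 3.0]
```

Pooling many matched donors is what makes a short, limited patient recording sufficient: most of the voice model comes from the donors, and only the personal characteristics come from the patient.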
So far the team has built a voicebank of over 1,200 voice "donors", providing the data needed to reconstruct more voices. Scottish Government-funded trials saw the programme rolled out across Edinburgh and the Lothians, with further expansion planned over the coming year.
"I mean, you are your voice, aren't you? […] It’s so your personality as well. It’s a huge thing to be able to still communicate and people know that it’s you that’s doing the talking and not a machine."
Festival has formed the basis of several commercial products. It has also led to several companies being formed from CSTR, including Rhetorical Systems (and its descendant Phonetic Arts), CereProc, and Speech Graphics. This research has also formed the basis of other technologies that are licensed to a wide range of companies, such as the Combilex dictionary system and voice databases. In addition, major corporations such as AT&T, Google, Nuance, and Microsoft make regular use of Festival and HTS.
Almost every industrial researcher in the field has used or is familiar with both Festival and HTS.
About the researcher
Professor Simon King (Professor of Speech Processing)
Engineering and Physical Sciences Research Council
France Telecom R&D UK Ltd
Medical Research Council
Motor Neurone Disease Association