Enabling people to have a natural conversation with a machine has been a long-standing goal of human-machine interaction. We have recently witnessed a significant evolution in human-computer interaction, especially with voice assistants such as Google Home and Amazon Alexa. However, even with these advancements, we still find ourselves frustrated when speaking to a machine over the phone.
To help combat these frustrations (and as part of further evolution!) Google has announced Google Duplex, a new technology for conducting natural conversations to complete “real-life” tasks over the phone, such as scheduling appointments or making reservations.
Introducing Google Duplex
The system makes a conversation with a computer as natural as possible, allowing people to speak normally, as if they were speaking to another person, without having to adapt to the machine on the other end. However, Duplex can only carry out natural conversations in a domain it has been trained on. It cannot carry out a general conversation…yet!
Conducting Natural Conversations
There are various challenges in conducting a natural conversation. Natural language is hard to understand, emulating natural behaviour is complex, latency expectations require fast processing, and generating natural-sounding speech with the right intonation is difficult.
Google use a combination of a concatenative text-to-speech (TTS) engine and a synthesis TTS engine (using Tacotron and WaveNet) to control intonation depending on the circumstance.
The system also sounds more natural because it incorporates speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or when adding synthetic waits. Got all that? In essence, this allows the system to signal in a natural way that it is still processing.
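To make the "synthetic wait" idea concrete, here is a minimal sketch of how a filler could be placed in front of a reply that took a while to compute. It is not how Duplex does it: the filler list, the half-second threshold and the function name are all assumptions made up for illustration.

```python
import random

# Hypothetical values for illustration only; Duplex's real settings are not public.
FILLERS = ("hmm", "uh", "um")
LATENCY_THRESHOLD_S = 0.5


def with_disfluency(response: str, processing_time_s: float, rng: random.Random) -> str:
    """Prefix a filler when the reply took longer than the threshold to produce."""
    if processing_time_s <= LATENCY_THRESHOLD_S:
        # Fast reply: say it straight away, no filler needed.
        return response
    # Slow reply: a filler signals that the system is still processing.
    return f"{rng.choice(FILLERS)}, {response}"


rng = random.Random(0)
print(with_disfluency("the table is booked for 7pm", 0.2, rng))
print(with_disfluency("the table is booked for 7pm", 1.4, rng))
```

The first call returns the reply untouched, while the second gets one of the fillers in front of it.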
Inside Google Duplex
Google Duplex’s conversations sound real and wholesome due to advances in understanding, interacting, timing and speaking.
At the core of Google Duplex is a recurrent neural network (RNN) designed to cope with the challenges above. It was built using TensorFlow Extended (TFX). Google trained Duplex’s RNN on a collection of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, the history of the conversation and the parameters of the conversation. Google trained their understanding model separately for each task, but leveraged the shared collection of data across tasks. Finally, Google used hyperparameter optimization from TFX to further improve the model.
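As a rough picture of those four input sources coming together, the sketch below bundles them into a single record that a model could consume. Every name here (`TurnInput`, `build_model_input`, the field names, the history cut-off) is hypothetical, and the real feature pipeline is not described in any detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TurnInput:
    """Inputs for one conversational turn; all field names are hypothetical."""
    asr_text: str                                              # transcript from speech recognition
    audio_features: List[float]                                # numeric features from the audio
    history: List[str] = field(default_factory=list)           # earlier utterances in the call
    parameters: Dict[str, str] = field(default_factory=dict)   # details of the task


def build_model_input(turn: TurnInput, max_history: int = 3) -> dict:
    """Bundle the four input sources into one record for a downstream model."""
    return {
        "text": turn.asr_text.lower().strip(),           # normalise the transcript
        "audio": list(turn.audio_features),              # copy so the caller's list is untouched
        "context": turn.history[-max_history:],          # keep only the most recent turns
        "task": dict(sorted(turn.parameters.items())),   # stable key order
    }


turn = TurnInput(
    asr_text=" Do you have anything on Friday? ",
    audio_features=[0.12, 0.87],
    history=["Hi, I'd like to book a haircut."],
    parameters={"service": "haircut", "time": "afternoon"},
)
print(build_model_input(turn))
```

A real system would turn the text and parameters into numeric vectors before they reach the RNN; the dictionary is only there to show what goes in.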
The Google Duplex system is capable of carrying out sophisticated conversations and it completes the majority of its tasks independently, without any human involvement. The system has a self-monitoring capability. This allows it to recognise the tasks it cannot complete on its own. In these cases, it signals to a human operator, who can complete the task. Amazing eh?
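One simple way to picture self-monitoring is a rule that counts how many turns went badly and escalates once that count gets too high. The sketch below does exactly that. The thresholds, the `TurnResult` type and the rule itself are assumptions for illustration, since Google has not published how Duplex decides to hand off.

```python
from typing import List, NamedTuple

# Hypothetical thresholds for illustration only.
CONFIDENCE_FLOOR = 0.6
MAX_FAILED_TURNS = 2


class TurnResult(NamedTuple):
    understood: bool    # did the system make sense of the other speaker?
    confidence: float   # how sure it was, between 0 and 1


def should_hand_off(turns: List[TurnResult]) -> bool:
    """Return True when enough turns failed that a human operator should take over."""
    failed = sum(
        1 for t in turns
        if not t.understood or t.confidence < CONFIDENCE_FLOOR
    )
    return failed >= MAX_FAILED_TURNS


call = [TurnResult(True, 0.9), TurnResult(False, 0.4), TurnResult(True, 0.3)]
if should_hand_off(call):
    print("Signalling a human operator to complete the task")
```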
Google used real-time supervised training to train the system in a new domain. In the Duplex system, experienced operators act as the instructors. By monitoring the system as it makes phone calls in a new domain, they can affect the behavior of the system in real time. They continue to do this until the system performs at Google’s desired quality level. After this, the system can make calls autonomously.
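The "supervise until it is good enough" loop can be sketched as a check over a log of supervised calls. This is only an illustration under stated assumptions: the window size, the allowed intervention rate and the function name are invented, and Google’s actual quality criteria are not public.

```python
from typing import List, Optional


def supervised_calls_needed(interventions: List[bool],
                            window: int = 5,
                            max_intervention_rate: float = 0.2) -> Optional[int]:
    """Return how many supervised calls pass before the system could run alone.

    interventions[i] is True when the operator had to step in on call i.
    Supervision could stop once the most recent `window` calls have an
    intervention rate at or below `max_intervention_rate`. Returns None
    if the log never reaches that level.
    """
    for end in range(window, len(interventions) + 1):
        recent = interventions[end - window:end]
        if sum(recent) / window <= max_intervention_rate:
            return end
    return None


# True = the operator corrected the system during that call.
log = [True, True, False, True, False, False, False, False]
print(supervised_calls_needed(log))
```

Using a sliding window rather than the overall average means early mistakes stop counting against the system once it has improved.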
Allowing people to interact with technology as naturally as they interact with each other has been a long-standing goal. Google Duplex takes a big step in this direction by making human-computer interaction as natural as possible.
Google hope that these technology advances will ultimately contribute to a meaningful improvement in people’s experience in day-to-day interactions with computerised systems.
Perhaps one day we’ll be conversing with machines as if they were friends or family, with all the linguistic nuances that brings…maybe…
…in the meantime check out these examples of Duplex making phone calls, using different voices. It’s scary how realistic they are already!