A toddler meanders unsteadily through the living room, pausing by a sleek black cylinder in the corner. “Alexa,” he says in a high-pitched voice. “Play children music.” The cylinder acknowledges the request, despite the muffled pronunciation, and the music starts.
Alexa, Amazon’s cloud-based speech recognition software and the brain of its black cylindrical Echo loudspeaker, has been a big hit around the world – though its youngest users simply take it for granted. Children will grow up alongside it, just as Alexa will evolve, as the AI powering it learns to answer more and more questions, and – perhaps – one day even converses freely with people.
But anyone older than 10 will know that it hasn’t always been like that. Speech recognition software has come a long, long way to where we are today. Echo is slimmer than a beer glass, but the first speech recognition machines – developed during the middle of the 20th Century – nearly took up an entire room.
Humans have long wanted to speak to machines – or at least make them talk to us. “Voice enables unbelievably simple interaction with technology – the most natural and convenient user interface, and the one we all use every day,” says Jorrit Van der Meulen, VP at Amazon Devices and Alexa EU. “Voice is the future.”
Back in 1773, Russian scientist Christian Kratzenstein, a professor of physiology in Copenhagen, seemed to be thinking along the same lines. He built a peculiar device that produced sounds similar to human vowels using resonance tubes connected to organ pipes. Just over a decade later, Wolfgang von Kempelen in Vienna created a similar Acoustic-Mechanical Speech Machine. And in the early 19th Century, English inventor Charles Wheatstone improved on von Kempelen's system with resonators made out of leather. Their configuration could be changed or controlled by hand to produce different speech-like sounds.
Then in 1881, Alexander Graham Bell, his cousin Chichester Bell and Charles Sumner Tainter built a rotating cylinder with a wax coating, with a stylus that would cut vertical grooves, responding to incoming sound pressure. The invention paved the way for the first recording machine, the "Dictaphone", patented in 1907. The idea was to get rid of stenographers by using the machine to record dictation of notes and letters for a secretary, so that they could later be typed offline. The invention took off, with more and more offices around the globe sporting a secretary with a clunky earpiece, listening to the recordings and transcribing them.
But all those baby steps kept machines passive – until “Audrey”, the Automatic Digit Recognition machine, came along in 1952. Made by Bell Labs, the huge machine occupied a six-foot-high relay rack, consumed substantial power and had streams of cables. It could recognise the fundamental units of speech sounds, which are called phonemes.
Back then, computing systems were extremely expensive and inflexible, with limited memory and computational speed. But regardless, Audrey could recognise the sound of a spoken digit – zero to nine – with more than 90% accuracy, at least when uttered by its developer HK Davis. It worked with 70-80% accuracy for a few other designated speakers, but far less well with voices it was unfamiliar with. “This was an amazing achievement for the time, but the system required a room full of electronics, with specialised circuitry to recognise each digit,” says Charlie Bahr of Bell Labs Information Analytics.
Because Audrey could recognise only voices of designated speakers, its use was limited: for instance, it could offer voice dialling by, say, toll operators, but it wasn’t really a necessity because in most cases manual push-button dialling of numbers was cheaper and easier. Audrey was an early bird – it preceded general purpose computers, and although it was not used in production systems, “it showed that speech recognition could be made practical”, says Bahr.
But there was another goal. “I believe Audrey was initially developed to reduce bandwidth, the volume of data travelling over the wires,” says Bahr’s colleague Larry O’Gorman of Nokia Bell Labs. Recognised speech would require much less bandwidth than the original sound waves. But as telephone switches became digital in the 1970s and 80s, they enabled faster and cheaper call routing, while staying dependent upon an operator recognising a person’s request to dial a number. So, in the 1970s and 80s, a huge effort in Bell Labs’ speech research was to simply do the following: recognise zero to nine digits, and ‘yes’ or ‘no’. “With recognition of these 12 words, the telephone system was able to complete the transition to machine-only telephony,” says O’Gorman.
Audrey was not the only kid on the block, though. In the 1960s, several Japanese teams worked on speech recognition, the most notable efforts being a vowel recogniser from the Radio Research Lab in Tokyo, a phoneme recogniser from Kyoto University, and a spoken-digit recogniser from NEC Laboratories.
At the 1962 World's Fair, IBM showcased its "Shoebox" machine, able to understand 16 spoken English words. There were other efforts in the US, UK and the Soviet Union, where researchers invented the dynamic time-warping (DTW) algorithm and used it to build a recogniser capable of working with a 200-word vocabulary. But all these systems were mostly based on template matching, where individual words are matched against stored voice patterns.
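To make template matching with dynamic time-warping concrete, here is a minimal sketch in Python. It is an illustration only, not a reconstruction of any historical system: the "voice patterns" are made-up lists with one number per time frame, and the names `dtw_distance`, `recognise` and `TEMPLATES` are hypothetical. The idea is that DTW lets a slowly spoken word still line up with a stored pattern of the same shape.

```python
def dtw_distance(a, b):
    """Dynamic time-warping distance between two sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # A frame may be stretched, squeezed, or matched one-to-one.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def recognise(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(
        templates, key=lambda label: dtw_distance(utterance, templates[label])
    )


# Hypothetical stored patterns: one invented feature value per time frame.
TEMPLATES = {
    "one": [1, 3, 5, 3, 1],
    "two": [5, 4, 2, 4, 5],
}

# The same shape as "one", but spoken at half speed (every frame doubled).
slow_one = [1, 1, 3, 3, 5, 5, 3, 3, 1, 1]
print(recognise(slow_one, TEMPLATES))
```

Because every word needs its own stored pattern, a scheme like this scales poorly as the vocabulary grows, which is one way to read the limits of template matching described above.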
The most significant leap forward of the time came in 1971, when the US Department of Defense’s research agency Darpa funded five years of a Speech Understanding Research programme, aiming to reach a minimum vocabulary of 1,000 words. A number of companies and academic institutions, including IBM, Carnegie Mellon University (CMU) and Stanford Research Institute, took part in the programme. That’s how Harpy, built at CMU, was born.
Unlike its predecessors, Harpy could recognise entire sentences. “We don’t want to look things up in dictionaries – so I wanted to build a machine to translate speech, so that when you speak in one language, it would convert what you say into text, then do machine translation to synthesise the text, all in one,” says Alexander Waibel, a computer science professor at Carnegie Mellon who worked on Harpy and another CMU machine, Hearsay-II.
Moving from single words to phrases wasn’t easy. “With sentences, you get words flowing into each other, you get a lot of confusion and don’t know where the words end and where they begin. So you have things like ‘euthanasia’, which could be ‘youth in Asia’,” says Waibel. “Or if you say ‘Give me a new display’ it could be understood as ‘give me a nudist play’.”
All in all, Harpy recognised 1,011 words – approximately the vocabulary of an average three-year-old – with reasonable accuracy, thus achieving Darpa’s original goal. It “became a true progenitor to more modern systems”, says Jaime Carbonell, director of the Language Technologies Institute at CMU, being “the first system that successfully used a language model to determine which sequences of words made sense together, and thus reduce speech recognition errors”.
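How a language model can settle such ambiguities is easiest to see with a toy example. The sketch below is not Harpy's actual method, which the article does not detail; it is a simple word-pair (bigram) model with add-one smoothing, trained on a tiny invented corpus. `CORPUS`, `train` and `score` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Invented training sentences; a real system would use far more text.
CORPUS = [
    "give me a new display",
    "show me a new display",
    "give me a new keyboard",
    "we saw a play",
]


def train(sentences):
    """Count how often each word follows each other word."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)


def score(sentence, contexts, bigrams, vocab_size):
    """Add-one-smoothed bigram probability of a whole sentence."""
    words = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        probability *= (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return probability


contexts, bigrams, vocab_size = train(CORPUS)
candidates = ["give me a new display", "give me a nudist play"]
best = max(candidates, key=lambda s: score(s, contexts, bigrams, vocab_size))
print(best)
```

Both candidates could sound alike to a recogniser, but only one is made of word pairs the model has seen before, so it scores higher.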
In the years that followed, speech recognition systems evolved further. In the mid-1980s, IBM built a voice-activated typewriter dubbed Tangora, capable of handling a 20,000-word vocabulary. IBM’s approach was based on a hidden Markov model, which adds statistics to digital signal processing techniques. The method makes it possible to predict the most likely phonemes to follow a given phoneme.
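As a rough illustration of the hidden Markov model idea, the sketch below uses the standard Viterbi algorithm to find the most probable phoneme sequence behind a series of coarse acoustic observations. Nothing here comes from Tangora itself: the two phonemes, the "noisy" and "voiced" labels and every probability are invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state that preceded s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Start from the likeliest final state and follow the back-pointers.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: phoneme "s" is mostly hiss, "iy" is mostly voiced.
STATES = ("s", "iy")
START_P = {"s": 0.6, "iy": 0.4}
TRANS_P = {"s": {"s": 0.3, "iy": 0.7}, "iy": {"s": 0.2, "iy": 0.8}}
EMIT_P = {
    "s": {"noisy": 0.9, "voiced": 0.1},
    "iy": {"noisy": 0.2, "voiced": 0.8},
}

heard = ["noisy", "voiced", "voiced"]
print(viterbi(heard, STATES, START_P, TRANS_P, EMIT_P))
```

The transition table plays the role described above, saying which phoneme is likely to follow which, while the emission table ties each phoneme to the sounds actually measured.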
IBM’s competitor Dragon Systems came up with its own approach, and technological advances finally pushed speech recognition far enough that it could find its first applications – such as dolls that kids could train to speak. But still, despite these successes, all the programs at the time used discrete dictation, meaning the user had to pause… after… every… word. In 1990, Dragon released the first consumer speech recognition product, Dragon Dictate, for a whopping $9,000. Then in 1997 Dragon NaturallySpeaking appeared – the first continuous speech recognition product.
“Before that time, speech recognition products were limited to discrete speech, meaning that they could only recognise one word at a time,” says Peter Mahoney, senior vice president and general manager of Dragon, Nuance Communications. “By pioneering continuous speech recognition, Dragon made it practical for the first time to use speech recognition for document creation.” Dragon NaturallySpeaking recognised speech at about 100 words per minute – and it is still used today, for instance, by many doctors in the US and the UK to document their medical records.
In the last 10 years or so, machine learning techniques loosely based on the workings of the human brain have allowed computers to be trained on huge datasets of speech, enabling excellent recognition across many people using many different accents.
Still, the technology stalled until Google released its Google Voice Search app for the iPhone. Google’s trick was to use cloud computing to process the data received by its app. Suddenly, publicly available voice recognition had massive amounts of computing power at its disposal. Google was able to run large-scale data analysis for matches between the user's words and the huge number of human-speech examples it had amassed from billions of search queries. In 2010, Google added "personalised recognition" to Voice Search on Android phones, and Voice Search to its Chrome browser in mid-2011. Apple quickly offered its own version, called Siri, while Microsoft called its AI Cortana, named after a character in the popular Halo video game franchise.
So what’s next? “Within speech processing, the most mature technology is speech synthesis,” says O’Gorman. “Machine voices now are largely indistinguishable from a human’s. But automatic speech recognition is still far less successful than the human ear in many situations.” While the speech of a person speaking clearly in an environment with little noise can be recognised automatically, the so-called “cocktail-party effect” – where humans can understand a single speaker in the din of a party – is still beyond any state-of-the-art technology. Even with Alexa, in a noisy room you have to make sure you’re right near the black cylinder and speak to it clearly and loudly.
Amazon’s attempt at voice recognition was inspired by the Star Trek computer, says Van der Meulen, with the aim of creating a computer in the cloud that’s controlled entirely by your voice – so that you could converse with it in a natural way. Sure, the magic of Hollywood still has the edge on today’s technology, but, says Van der Meulen, “we’re in a golden age of Machine Learning and AI. We’re still a long way from being able to do things the way humans do things, but we’re solving unbelievably complex problems every day.”