BBC Future
In Depth

Building Babel: Lost in machine translation

(Copyright: SPL)


Scientists have been trying to automatically translate languages for almost as long as computers have been in existence. So why is it so hard?

Earlier this year, the Malaysian Ministry of Defence unveiled its glossy new website, designed to show off its military prowess and high standards to the world. Unfortunately, nobody had bothered to check the English translations.

One section said that the Malaysian government had taken “drastic measures to increase the level of any national security threat” after the country's independence in 1957. Another page suggested women should not wear items that “poke out the eye”, an apparent translation of a rule that women should not wear revealing clothing.

Initially it was just sniggering Malaysians who passed the gaffes around on social media, but the chortles soon became global, prompting the Defence Minister to admit that the ministry had used the free online tool Google Translate. He subsequently ordered the new military site to be removed.

The episode was embarrassing for the Malaysian ministry, but it also provides an object lesson in the limitations of today’s machine translation technology. Despite billions of pounds of research and massive demand from businesses, politicians and the military, not to mention tourists, the technology is still only stuttering along.

According to Phil Blunsom, a lecturer and machine translation researcher at the University of Oxford, the field has made a lot of progress. But a time when a computer can match the interpretive skills of a professional is “still a long way off”.

So why is it so hard to automatically translate texts?

Scientists and academics have been trying to automate translation for almost as long as computers have been in existence. In the 1940s and 1950s it was widely assumed that automated translation would become easy once the vocabulary and grammatical rules of a language had been codified, according to Dr Blunsom. But attempts over the next forty years to make computers learn languages in this way were largely unsuccessful, unless the range of words they were expected to translate was very limited.

"The main problem is that language is too complex," explains Philipp Koehn, a machine translation researcher at the University of Edinburgh School of Informatics. "Language is always ambiguous, so you can’t always use rules, and new vocabulary is always coming in, so you need someone to continually maintain those rules." What it boils down to is that there are simply too many possible rules for them all to be written down, and there are also too many exceptions to those rules, he adds.

Then in the 1980s, computer giant IBM carried out pioneering research into the use of words in sentences. Specifically, its researchers examined the relative frequency of different groups of three words occurring in a sentence. For example, they noted that "going to go" occurs far more frequently than "going too, go" or "going two go". So although the three phrases sound almost identical, the first is statistically most likely to be correct.

This apparently simple insight had huge repercussions, opening up a new statistical approach to translation.
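The three-word frequency idea can be illustrated with a few lines of Python. This is only a minimal sketch, not IBM's actual method: the tiny corpus, the function names `trigram_counts` and `most_likely`, and the decision to ignore punctuation are all assumptions made for illustration.

```python
from collections import Counter


def trigram_counts(sentences):
    """Count every run of three consecutive words in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts


def most_likely(candidates, counts):
    """Pick the candidate three-word phrase seen most often in the corpus."""
    return max(candidates, key=lambda c: counts[tuple(c.lower().split())])


# Hypothetical toy corpus; a real system would count over millions of sentences.
corpus = [
    "I am going to go home",
    "she is going to go soon",
    "we are going too",
]
counts = trigram_counts(corpus)

# Three phrases that sound alike (punctuation dropped for simplicity).
candidates = ["going to go", "going too go", "going two go"]
best = most_likely(candidates, counts)
print(best)
```

With so little text the comparison is trivial, but the same counting over a large body of text is what lets a system prefer the phrasing people actually use.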

"The vast majority of research into machine translation is now pursuing the statistical approach," says Dr Blunsom.

Online services such as Google Translate and Yahoo! Babel Fish both use statistical machine translation techniques – although Yahoo!'s system is best described as a hybrid approach that makes heavy use of rules, as well as statistics.

More than words

The statistical translation approach works by analysing parallel corpora – bodies of text that have already been translated from one language to another. Put simply, the translation system looks for a word or phrase in one language that crops up whenever a particular word or phrase appears in the other language. If it spots "un chien noir" in French every time the phrase "a black dog" occurs in English, then it stores these two phrases together in a "phrase table".
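The phrase-table idea can be sketched in Python as follows. This is a deliberately naive illustration, not how Google Translate or any production system works: the three sentence pairs, the names `ngrams` and `build_phrase_table`, the three-word phrase limit and the `min_count` threshold are all assumptions. It pairs an English phrase with a French phrase only when the two always appear in the same translated sentence pairs.

```python
from collections import Counter


def ngrams(words, max_len=3):
    """Return the set of all phrases of 1 to max_len consecutive words."""
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def build_phrase_table(pairs, min_count=2):
    """Pair phrases that occur in exactly the same sentence pairs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for english, french in pairs:
        en_phrases = ngrams(english.lower().split())
        fr_phrases = ngrams(french.lower().split())
        src_count.update(en_phrases)
        tgt_count.update(fr_phrases)
        for e in en_phrases:
            for f in fr_phrases:
                pair_count[(e, f)] += 1

    table = {}
    for (e, f), n in pair_count.items():
        # Keep the pair only if neither phrase ever appears without the other.
        if n >= min_count and n == src_count[e] == tgt_count[f]:
            table.setdefault(e, set()).add(f)
    return table


# Hypothetical sentence-aligned parallel corpus.
pairs = [
    ("a black dog barks", "un chien noir aboie"),
    ("I see a black dog", "je vois un chien noir"),
    ("a black cat sleeps", "un chat noir dort"),
]
table = build_phrase_table(pairs)
print(sorted(table["a black dog"]))
```

On a corpus this small, "a black dog" is matched with several overlapping French candidates, including "un chien noir" and shorter fragments of it. Telling those candidates apart takes far more text and a way of scoring how likely each pairing is, rather than the all-or-nothing test used here.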

BBC © 2013 The BBC is not responsible for the content of external sites.
