‘I have a dream’
Scientists have encoded short messages in DNA for years using simple ciphers. When sequencing pioneer Craig Venter’s team implanted a bacterium with a fully synthesised genome in 2010, they added their names and several famous quotes into the fabricated DNA. The messages were coded using combinations of three bases to signify each character – for example, AGT stood for the letter B.
But encoding longer messages – say, a book or a video file – is far more difficult. DNA can only be synthesised and read as small fragments of around 200 base pairs or smaller, so larger chunks of information must be broken down before they are encoded. When those fragments are synthesised, you get a messy soup containing millions of copies of each one. So, every piece needs an identifier that reveals where it fits into the overall message – “I’m the first fragment” or “I’m the 765th”.
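The indexing idea can be sketched in a few lines of Python. This is a toy illustration only: the function names and the tiny fragment size are assumptions made for the example, and a real scheme would write the index in DNA letters alongside the data.

```python
import random

def split_with_index(message: str, size: int) -> list[tuple[int, str]]:
    """Cut the message into pieces of at most `size` characters,
    tagging each piece with its position in the sequence."""
    return [(i // size, message[i:i + size])
            for i in range(0, len(message), size)]

def reassemble(fragments: list[tuple[int, str]]) -> str:
    """Sort the pieces by their index and join them back together."""
    return "".join(piece for _, piece in sorted(fragments))

fragments = split_with_index("ACGTACGTTGCA", 5)
random.shuffle(fragments)  # mimic the unordered soup of fragments
assert reassemble(fragments) == "ACGTACGTTGCA"
```

Because every piece carries its own position, the order in which the fragments are read back does not matter.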
Working on a similar principle, Birney and Goldman chose five files representing a range of formats and (mostly) material of great cultural value. A PDF of the classic 1953 paper in which James Watson and Francis Crick described DNA’s double helix was an obvious choice. The duo originally wanted Shakespeare’s complete works but they underestimated the size of the Bard’s output, so they settled for just the 154 sonnets in ASCII text. A 26-second MP3 clip of Martin Luther King’s “I have a dream” speech filled the audio slot after the duo ruled out Lady Gaga. A copy of the cipher used to encode the data was a practical choice, and a JPEG picture of the EBI was the lone concession to narcissism.
Birney and Goldman also devised a more complex cipher than Church’s. First, they converted binary data into base-three, replacing every byte – a string of 8 zeroes and ones – with a corresponding string of 5 or 6 zeroes, ones and twos. (Five digits alone would not be enough: they allow only 243 combinations, and a byte can take 256 values.) Next, they replaced these numbers with DNA letters, using a code where the meaning of each letter depends on the one before it. For example, A means 1 if it follows a G, but 0 if it follows a T and 2 if it follows a C.
Why so complicated? Because in this code, no letter ever appears twice in a row. Repetitive strings of bases – such as AAAAAAA – are the bane of both DNA synthesisers and sequencers. If you can avoid them, your error rate plummets.
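A rough Python sketch shows how a code like this rules out repeats. Two things here are assumptions rather than details from the team’s description: the full lookup table (filled in so that it agrees with the three examples of A given above), and the starting letter “A”. For simplicity, the sketch also writes every byte as a fixed six base-three digits.

```python
# For each previous base, the base that encodes 0, 1 and 2 (in that order).
# A base never maps to itself, so no letter can appear twice in a row.
# Assumed table: it matches "A is 1 after G, 0 after T, 2 after C".
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}

def byte_to_trits(b: int, width: int = 6) -> list[int]:
    """Write a byte (0-255) as `width` base-three digits, most significant first."""
    digits = []
    for _ in range(width):
        b, remainder = divmod(b, 3)
        digits.append(remainder)
    return digits[::-1]

def trits_to_dna(trits: list[int], start: str = "A") -> str:
    """Encode each digit as a base chosen according to the previous base."""
    prev, out = start, []
    for t in trits:
        prev = NEXT_BASE[prev][t]
        out.append(prev)
    return "".join(out)

def dna_to_trits(dna: str, start: str = "A") -> list[int]:
    """Reverse the encoding by looking up each base against its predecessor."""
    prev, out = start, []
    for base in dna:
        out.append(NEXT_BASE[prev].index(base))
        prev = base
    return out

trits = [t for b in b"DNA" for t in byte_to_trits(b)]
dna = trits_to_dna(trits)
assert dna_to_trits(dna) == trits                  # decoding recovers the digits
assert all(x != y for x, y in zip(dna, dna[1:]))   # no base is ever repeated
```

Even a run of identical digits comes out as a rotating sequence of letters, which is the point of making each letter’s meaning depend on its neighbour.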
Still, there would be mistakes. “We had to go in saying we were going to make errors,” says Birney. “It’s a disaster to think your technology won’t have errors.” Church got 11 mistakes out of 5.2 million letters – hardly catastrophic, but Birney and Goldman wanted none. So, they built redundancies into their code. They broke the five files into more than 153,000 fragments, each 117 letters long. Each string overlaps with four others, so that every bit of information is repeated four times. If any fragment is synthesised wrongly or cannot be read, its contents can be pieced together from at least three others.
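The effect of overlapping fragments can be illustrated with a small Python sketch. The window of 100 letters and the step of 25 are assumptions chosen to give fourfold coverage, not a statement of the team’s exact layout, and the majority vote used for rebuilding is just one simple way to reconcile the copies.

```python
from collections import Counter

def overlapping_fragments(seq: str, window: int = 100, step: int = 25):
    """Slide a window along the sequence; each fragment keeps its start position."""
    return [(start, seq[start:start + window])
            for start in range(0, len(seq) - window + 1, step)]

def coverage(seq_len: int, fragments) -> list[int]:
    """Count how many fragments contain each position."""
    counts = [0] * seq_len
    for start, piece in fragments:
        for i in range(start, start + len(piece)):
            counts[i] += 1
    return counts

def rebuild(seq_len: int, fragments) -> str:
    """Recover each position by majority vote across the fragments covering it."""
    votes = [Counter() for _ in range(seq_len)]
    for start, piece in fragments:
        for offset, base in enumerate(piece):
            votes[start + offset][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "?" for v in votes)

seq = "ACGT" * 100
fragments = overlapping_fragments(seq)
assert max(coverage(len(seq), fragments)) == 4   # interior letters appear four times

# Ruin one fragment completely; the three intact copies outvote it.
damaged = list(fragments)
damaged[5] = (damaged[5][0], "N" * 100)
assert rebuild(len(seq), damaged) == seq
```

In this sketch only the letters near the two ends of the sequence are covered fewer than four times.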