In 2001, the Human Genome Project gave us an almost complete draft of the 3 billion letters in our DNA. We joined an elite club of species that have had their genome sequenced, one that is growing with every passing month.
These genomes contain the information necessary for building their respective owners, but it’s information that we still struggle to parse. To date, no one can take the code from an organism’s genes and predict all the details of its shape, behaviour, development, physiology – the collection of traits known as its phenotype. And yet, the basis of those details is there, all captured in stretches of As, Cs, Gs and Ts. “Cells know pretty reliably how to do this,” says Leonid Kruglyak from Princeton University. “Every time you start with a chicken genome, you get a chicken, and every time you start with an elephant genome, you get an elephant.”
As our technologies and understanding advance, will we eventually be able to look at a pile of raw DNA sequence and glean all the workings of the organism it belongs to? Just as physicists can use the laws of mechanics to predict the motion of an object, can biologists use fundamental ideas in genetics and molecular biology to predict the traits and flaws of a body based solely on its genes? Could we pop a genome into a black box, and print out the image of a human? Or a fly? Or a mouse?
Not easily. In complex organisms, some traits can be traced back to specific genes. If, for instance, you’re looking at a specific variant of the MC1R gene, chances are you’ve got a mammal in front of you, and it has red hair. Indeed, people have predicted that some Neanderthals were red-heads for precisely this reason. “But beyond that, predicting [if something is] a mouse or a whale or an armadillo, we still wouldn’t do well,” says Kruglyak.
Bernhard Palsson from the University of California, San Diego agrees. “Sequencing a woolly mammoth will not predict its properties,” he says. “But you might be able to do a lot better with bacteria.” Their simpler and smaller genomes should in theory make it easier to predict the basic features of their metabolism, or whether they grow using oxygen or not. But even though we can sequence a bacterial genome in under a day, and for just $80, we would still struggle to determine important traits, like how good a disease-causing microbe is at infecting its host.
Even finding all the genes in a small genome is hard. Earlier this year, scientists discovered a new gene in a flu virus whose genome consists of just 14,000 letters (small enough to fit into 100 tweets) and had been sequenced again and again. So it should be unsurprising that our own genome, with 3 billion letters, is full of errors and gaps, despite ostensibly being “complete”. In May, another group showed that the reference human genome is missing a gene that may have shaped the evolution of our large brains. “There’s no genome that is completely understood even in terms of the genes within it,” says Markus Covert from Stanford University. “Typically, no function is known for a fourth to a fifth of the genes.”
Genes encode the instructions for assembling proteins, molecular machines that perform vital jobs in our cells. A protein is a long chain of amino acids, and we can predict that chain with perfect precision. But the chain also folds, origami-like, into a complex three-dimensional shape, and the shape dictates everything that the protein does, from the chemical reactions it speeds up to the other molecules it sticks to. Discerning those shapes is laborious work, involving growing pure crystals of the proteins, and bombarding them with X-rays. And despite having hundreds of these structures, even the most powerful computers struggle to accurately compute a protein’s shape from the DNA sequence that produces it. “I see that challenge as the stifling one,” says Palsson.
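The easy half of that problem, going from DNA letters to a chain of amino acids, is mechanical enough to sketch in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the article: the input is the coding strand, it is already in the right reading frame, it has no introns, and it uses the standard genetic code. The names `CODON_TABLE` and `translate` are made up for this example. Nothing comparably short exists for the hard half, predicting how the chain folds.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# Each letter is a one-letter amino acid code; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Read a coding sequence three letters at a time until a stop codon."""
    dna = dna.upper()
    chain = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the chain ends here
            break
        chain.append(amino_acid)
    return "".join(chain)


# A toy 12-letter "gene": ATG (M), GCC (A), TGG (W), TAA (stop).
print(translate("ATGGCCTGGTAA"))  # MAW
```

The lookup table has just 64 entries, which is why the chain itself can be read off a gene so reliably. The shape that chain folds into is a different matter.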
Protein-coding genes make up just 1.5% of our genome; the rest includes a lot of what is thought to be useless junk with no discernible function. But it also contains regulatory sequences that control when, where and how our genes are used. We need to identify these if we’re ever to predict how a genome leads to a living, breathing organism. The technology for doing that is being developed, and the ENCODE project – the Encyclopedia of DNA Elements – has put it to good use, compiling a catalogue of the various regulatory sequences in our own genome. But ENCODE involved 442 scientists intensely running experiments for a decade, and even its unprecedented catalogue is incomplete.
And even if we had all this information – every gene, protein structure, and regulatory sequence – we’d still need to figure out how it all works together, and how it interacts with its environment. We would need patterns: when and where different genes are activated as an organism develops. We would need timings: how quickly chemical reactions take place in a cell, and how proteins speed up that process.
Here, our metaphors let us down. Science writers like to compare the genome to a textbook or a blueprint. That conveys the fact that it stores information, but glosses over its buzzing, dynamic nature – proteins docking on and off to control the activity of genes, huge stretches of DNA that fold and unfold to reveal or hide their sequences, parasitic jumping genes that copy themselves and hop throughout the genome... None of our information stores – not sheet music, not recipe books – are this intricate.
This hasn’t stopped some scientists from trying to simulate this intricacy. In July, Covert announced that he had created a rough simulation of an entire organism – a single-celled microbe called Mycoplasma genitalium. Covert’s model simulates how all of the bacterium’s 525 genes are used, the proteins they produce, how quickly the proteins act, how they interact, and more. It is not completely accurate, but it captures much of M. genitalium’s lifestyle. Two colleagues wrote that the project “should be commended for its audacity alone”.
Still, the simulation was hard-won. At 525 genes, M. genitalium has the smallest genome outside of viruses (humans have 20,000-25,000 genes, by comparison), pared down to extreme minimalism by its life as a parasite. It may be one of the simplest living things we can imagine, but modelling this microbe still took around 1,900 experiments and a lot of borrowed knowledge. “Around half of our model comes from experiments that were done in other bacteria,” says Covert. “There’s no way [the genome] would have been predictive by itself.”
Covert also needed to factor in M. genitalium’s environment. It lives only in the stable environment of our urethra, with no light and a steady temperature. “But even then, it occasionally sees the immune system coming after it and there’s no way of modelling that,” says Covert.
The influence of the environment becomes even more crucial for more complex free-ranging organisms. Temperature and acidity affect how proteins behave. The food that an organism consumes, the infections that plague it, and the competitors it interacts with, all affect how it develops, and how its genes are used. Many of these factors leave marks on the genome itself – “epigenetic” tags that dictate the deployment of genes, and can be passed on to the next generation. The environment clearly matters. When making predictions from a genome, the elephant in the room is the room.
Still, Covert’s approach shows one way forward – the dawn of virtual biology. You could sequence a genome, construct a model or simulation, compare that to the real organism, work out the flaws in the model, and rectify those flaws with further experiments. Rinse and repeat. Eventually, you would have a zoo of models. If you have a new genome, start by comparing it to one of the existing simulations and work from there. It’s not quite the black box we envisaged, but it’s something.
If scientists are trying to find fungi or bacteria that can perform a specific job – say, clean up hazardous waste, or produce certain nutrients – it would be valuable to identify such organisms from their genomes alone. “We can use the sequencing to look for phenotypes that are relevant for our objective,” says Jens Nielsen from the Chalmers University of Technology in Sweden. And if that objective is to artificially design new life-forms, as folks like Craig Venter are trying to do, then prediction becomes essential, rather than wishful. “You’d worry about side effects and you’d want computational tools that can avoid them,” says Covert. “When we talk about rationally designing a new organism, you’d want to predict a phenotype.”
“I doubt we’d ever get to 100% prediction because biology is so variable,” says Nielsen. But Kruglyak adds, “I don’t think that in principle, there are any showstoppers that would make it impossible. It would just take a whole lot more work and continued technological development beyond what we can do today.”