In 2001, the Human Genome Project gave us an almost complete draft of the 3 billion letters in our DNA. We joined an elite club of species that have had their genome sequenced, one that is growing with every passing month.
These genomes contain the information necessary for building their respective owners, but it’s information that we still struggle to parse. To date, no one can take the code from an organism’s genes and predict all the details of its shape, behaviour, development, physiology – the collection of traits known as its phenotype. And yet, the basis of those details is there, all captured in stretches of As, Cs, Gs and Ts. “Cells know pretty reliably how to do this,” says Leonid Kruglyak from Princeton University. “Every time you start with a chicken genome, you get a chicken, and every time you start with an elephant genome, you get an elephant.”
As our technologies and understanding advance, will we eventually be able to look at a pile of raw DNA sequence and glean all the workings of the organism it belongs to? Just as physicists can use the laws of mechanics to predict the motion of an object, can biologists use fundamental ideas in genetics and molecular biology to predict the traits and flaws of a body based solely on its genes? Could we pop a genome into a black box, and print out the image of a human? Or a fly? Or a mouse?
Not easily. In complex organisms, some traits can be traced back to specific genes. If, for instance, you’re looking at a specific variant of the MC1R gene, chances are you’ve got a mammal in front of you, and it has red hair. Indeed, people have predicted that some Neanderthals were redheads for precisely this reason. “But beyond that, predicting [if something is] a mouse or a whale or an armadillo, we still wouldn’t do well,” says Kruglyak.
Bernhard Palsson from the University of California, San Diego agrees. “Sequencing a woolly mammoth will not predict its properties,” he says. “But you might be able to do a lot better with bacteria.” Their simpler and smaller genomes should in theory make it easier to predict the basic features of their metabolism, or whether they grow using oxygen or not. But even though we can sequence a bacterial genome in under a day, and for just $80, we would still struggle to determine important traits, like how good a disease-causing microbe is at infecting its host.
Even finding all the genes in a small genome is hard. Earlier this year, scientists discovered a new gene in a flu virus whose genome consists of just 14,000 letters (small enough to fit into 100 tweets) and has been sequenced again and again. So it should be unsurprising that our own genome, with 3 billion letters, is full of errors and gaps, despite ostensibly being “complete”. In May, another group showed that the reference human genome is missing a gene that may have shaped the evolution of our large brains. “There’s no genome that is completely understood even in terms of the genes within it,” says Markus Covert from Stanford University. “Typically, no function is known for a fourth to a fifth of the genes.”
Genes encode the instructions for assembling proteins, molecular machines that perform vital jobs in our cells. A protein is a long chain of amino acids, and we can predict that chain from a gene’s sequence with perfect precision. But the chain also folds, origami-like, into a complex three-dimensional shape, and the shape dictates everything that the protein does, from the chemical reactions it speeds up to the other molecules it sticks to. Discerning those shapes is laborious work, involving growing pure crystals of the proteins, and bombarding them with X-rays. And although we have hundreds of these structures, even the most powerful computers struggle to accurately compute a protein’s shape from the DNA sequence that produces it. “I see that challenge as the stifling one,” says Palsson.
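The easy half of that problem, reading off a protein’s amino-acid chain from a gene, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard genetic code, expects a coding-strand DNA sequence with no introns, and starts reading at the first letter. The names `CODON_TABLE` and `translate` are made up for this example.

```python
# Build the standard genetic code: 64 three-letter codons, each mapped to a
# one-letter amino-acid code ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into a chain of amino-acid letters.

    Assumes the reading frame starts at the first letter; translation ends
    at the first stop codon, and any trailing partial codon is ignored.
    """
    dna = dna.upper()
    chain = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the chain ends here
            break
        chain.append(amino_acid)
    return "".join(chain)


print(translate("ATGGCCTGA"))  # ATG -> M, GCC -> A, TGA -> stop
```

The lookup is the whole trick, which is why the chain is predictable. Nothing in it says how that chain will fold, and that is the part Palsson calls stifling.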