The mapping of the human genome in draft form in 2000 was a turning point in biology. But the announcement really marked the start, rather than the end, of the practical use of genomic information.
Several large-scale projects in the intervening years have mined particular aspects of the genome and combined them with other sources of information. The HapMap project, for example, looked at how common single-nucleotide polymorphisms, or SNPs--changes in a single base--varied among a few selected populations. These studies formed the basis for genome-wide association studies to identify DNA regions associated with various diseases.
Another project, called ENCODE, for ENCyclopedia Of DNA Elements, focused on cataloguing the various types of protein-coding and regulatory structures in the genome. By correlating these with biochemical measures of functional activity and the way the DNA is organized in the nucleus, the researchers got a broad view of how the expression of genes is regulated. They also compared the DNA sequences with those of closely and distantly related organisms, to illuminate how the function of the DNA is related to its evolution.
With some 200 co-authors from 80 different institutions, the ENCODE project rivals some big particle-physics experiments for scope and complexity. Yet the pilot phase of ENCODE selected "only" 1% of the genome--around 30 million of its roughly 3 billion bases--for detailed study. An overview of the results appeared in Nature in June 2007. The data are publicly available, and researchers continue to publish papers on aspects of the work. In addition, follow-up work is aimed at analyzing the entire genome.
Among the profound conclusions from the pilot phase is that most of the genome is transcribed into RNA, even though only 1.5% or so codes for protein and only about 5% seems to be clearly functional. This suggests that much of the regulation of genetic activity may be occurring, not at the level of transcription, but at the level of RNA.
The researchers also found that the organization of the chromosomes in the nucleus, in particular the wrapping of the DNA around histones to form nucleosomes, predicts the locations where transcription begins. These results reinforce the known importance of nucleosome positioning in regulating genetic activity along the genome.
Some of the researchers looked at various measures of biochemical activity along the DNA, such as binding to proteins that are known to be active in regulation. Their hope was that these assays would serve to identify regions with a biological function in the cell.
Other ENCODE researchers compared the sequences with corresponding sequences from other organisms--both close relatives like mice and distant eukaryotic relatives like yeast. According to a longstanding assumption, the degree of similarity of these sequences, showing how resistant they are to changes from neutral evolution, should also reflect their biological importance.
These studies revealed two surprises. First, not all biochemically active sequences are evolutionarily constrained. This might mean that the biochemical tests don't measure things that are important to the cell after all. Second, and more puzzling, some of the constrained sequences had no obvious function.
I wrote a story for Science this summer (subscribers only, sorry) discussing possible reasons why evolution and importance don't always track one another.
ENCODE and other large-scale studies will continue to supply us with extensive, detailed information about the genome. The story is only just beginning.