This year, 2015, is the 20th anniversary of our work in sequencing the first genome in history from a living species (1) and the 15th anniversary of the White House announcement with President Clinton of the first draft of the human genome sequence. Our team published its results in the journal Science in 2001. Our more recent study on the first species with a completely synthetic genome (2) has helped to prove that DNA—and our genome—is the software of life.
The 6 billion letters of the human genome sequence do deserve philosophical, as well as scientific, scrutiny. In the sequence of the genome are encoded the instructions for building and maintaining the complex structure and physiology of each human being. Everything we know or hope to understand about health and disease must be understood in the context of the genome. Since the initial publication of the human genome sequence in 2001, this understanding has begun to emerge.
Our first human genome sequence article, featured here, represented a significant scientific milestone not just because it was the first human genome sequence. The method used to sequence the genome, the whole-genome shotgun (WGS) strategy, was at the time considered radical, unproven, and likely to fail, to name just a few of the critiques (3). We had developed WGS and used it to sequence the first genome in history, Haemophilus influenzae (1), and then used that strategy for the next 5 years on multiple microbe and parasite genomes. However, the only large, complex genome sequenced at the time using our method (4) was that of the fruit fly, Drosophila melanogaster, which we sequenced as a test of the assembly algorithms before tackling the human genome. This shotgun-sequencing strategy was faster and cheaper and resulted in a higher-quality sequence than could be obtained in any other way.
In contrast to our WGS approach, the federal Human Genome Project was conceived as a 15-year, $3 billion project. It settled on a clone-by-clone sequencing approach, on the assumption that the genome was far too large and complex for any other approach, and because the task of creating clone maps of the genome could be distributed over multiple centers in several countries.
To reconstruct the human sequence at Celera, we took advantage of 350 new capillary DNA sequencing machines. The machines could be run in a single production environment to generate 25 million sequence reads of around 600 bp each over a 9-month period for a cost of approximately $100 million. Concurrently, we developed efficient computational algorithms that enabled us to take a whole-genome approach to assembly from the millions of sequence fragments (4).
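The scale of these figures can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a haploid genome of about 3 billion bp (half of the 6 billion letters mentioned above), treats reads as uniformly and randomly placed, and uses the standard Lander-Waterman approximation for the fraction of bases covered. The function names are hypothetical, and none of this represents the assembly algorithms themselves.

```python
import math


def fold_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base is sequenced (total read bases / genome size)."""
    return num_reads * read_length_bp / genome_size_bp


def expected_fraction_covered(coverage: float) -> float:
    """Lander-Waterman approximation: probability that a given base is hit
    by at least one read, assuming uniformly random read placement."""
    return 1.0 - math.exp(-coverage)


# Figures quoted in the text.
NUM_READS = 25_000_000
READ_LENGTH_BP = 600
SEQUENCING_COST_USD = 100_000_000

# Assumption for illustration: haploid genome of ~3 billion bp.
HAPLOID_GENOME_BP = 3_000_000_000

coverage = fold_coverage(NUM_READS, READ_LENGTH_BP, HAPLOID_GENOME_BP)
covered = expected_fraction_covered(coverage)
cost_per_read = SEQUENCING_COST_USD / NUM_READS

print(f"Total bases sequenced: {NUM_READS * READ_LENGTH_BP:,}")
print(f"Approximate fold coverage: {coverage:.1f}x")
print(f"Expected fraction of bases covered: {covered:.4f}")
print(f"Approximate cost per read: ${cost_per_read:.2f}")
```

Under these assumptions the reads amount to roughly 5-fold coverage of the haploid genome, which is why an assembler can expect most bases to be represented in several overlapping fragments.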
When we set out to write the manuscript, we wanted to provide as comprehensive an analysis of the human genome as possible, as we had attempted 5 years earlier with the first view of the entire genome of a species (1). When we started our analysis, most basic characteristics of the human genome were not yet known—including, remarkably, any good sense of the number of protein-coding genes. It came as a surprise that the number is closer to 20 000 than the 100 000 to 300 000 estimated earlier. (In comparison, Drosophila has 13 600 protein-coding genes (4).) Our initial comparison of the worm, fly, and human genomes showed that a majority of human genes had orthologs in more experimentally tractable systems, but also that certain protein domains and gene families had expanded in the human lineage, leading to more complex neuronal function, tissue-specific developmental regulation, and hemostasis and immune systems.
Only about 1% of the human genome codes for proteins, which means finding the coding regions and correctly linking their exons together is challenging. Expressed sequence tags (5) and other complementary DNA sequences mapped to the genome sequence showed that alternative splicing was relatively common in human genes, and that the resulting diversity of transcripts yielded protein variants that often carried out tissue-specific functions. More recently, it has come to be appreciated that many classes of non–protein-coding RNAs play important roles in regulation of gene expression.
Every human is unique. What emerged was a picture of a genome in action, not a static, fixed “reference.” Although every human shares about 99.9% of his or her DNA with every other human (and more than 95% with chimpanzees), the differences matter in defining traits and characteristics and risks for disease and environmental sensitivity. By sequencing a mixture of 5 different people, we were able to get a first glimpse of those differences. Whereas most appear to be evolutionarily neutral, we found a highly nonrandom distribution of genetic variants, suggesting that mutation rates and selection vary across the genome. Despite extensive investment in maps of single nucleotide polymorphisms and genome-wide association studies, the genetic basis underlying most traits is incompletely known, even for such simple and clearly genetic features as height.
Our WGS sequencing of the human genome was so successful that >99% of all genomes sequenced to date have used it. Today, each new Illumina X Ten sequencing machine we use at Human Longevity Inc. is equivalent to 1350 Celera-sized sequencing facilities. We can now shotgun-sequence human genomes at a rate of several thousand per month, for a cost approaching only $1000/genome, proving that the methods we introduced 2 decades ago have more than stood the test of time.
This Citation Classic is dedicated to 2 of our colleagues who made essential contributions to sequencing the first human genome: Jeannine Gocayne and Richard J. Mural, both of whom died from cancer.
Featured article: Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science. 2001;291:1304–51. (This article has been cited more than 7000 times since publication.)
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.
Authors' Disclosures or Potential Conflicts of Interest: No authors declared any potential conflicts of interest.
- Received for publication June 12, 2015.
- Accepted for publication June 30, 2015.
- © 2015 American Association for Clinical Chemistry