Revolutionary genome sequencing technologies--the $1,000 genome.
The ability to sequence complete genomes and the free dissemination of the sequence data have dramatically changed the nature of biological and biomedical research. Sequence and other genomic data have the potential to lead to remarkable improvement in many facets of human life and society, including the understanding, diagnosis, treatment and prevention of disease; advances in agriculture, environmental science and remediation; and the understanding of evolution and ecological systems.
The ability to sequence many genomes completely has been made possible by the enormous reduction of the cost of sequencing in the past two decades, from tens of dollars per base in the 1980s to a few cents per base today. However, even at current prices, the cost of sequencing a mammalian-sized genome is tens of millions of dollars and, accordingly, we must still be very selective when choosing new genomes to sequence. In particular, we remain very far away from being able to afford to use comprehensive genomic sequence information in individual health care. For this, and many other reasons, the rationale for achieving the ability to sequence entire genomes very inexpensively is very strong.
There are many areas of high priority research to which genomic sequencing at dramatically reduced cost would make vital contributions. 1) Expanded comparative genomic analysis across species, which will yield great insights into the structure and function of the human genome and, consequently, the genetics of human health and disease. Studies to date that have been able to compare small regions of several genomes, and "draft" versions of full genomes, have clearly demonstrated the need for much more complete data sets. While some of the needed data will be obtained over the next two or three years using existing DNA sequencing technology, and while costs will continue their gradual decline, the cost of current approaches to sequence acquisition will continue to limit the amount of useful data that can be produced. 2) Studies of human genetic variation and the application of such information to individual health care, which will also require much cheaper sequencing technology. Today, genetic variation must be assessed by genotyping the relatively few known differences at a relatively small number of loci within the human population. A richer and better characterized catalog of such variable sites is being generated to support more detailed and powerful analyses.
While these methods are, and will become even more, powerful and likely to provide a significant amount of important new information, they are nevertheless only a surrogate for determining the full, contiguous sequence of individual human genomes, and are not as informative as sequencing would be. For example, current genotyping methods are likely to miss rare differences between people at any particular location in the genome and have limited ability to determine long-range information (e.g., genomic rearrangements). Therefore, new methods based on complete genomic sequencing will be needed to use genomic information for individual health care in the most effective manner possible. 3) While the genomes of a few agriculturally important animals and plants have been sequenced, the most informative studies will require comparisons between different individuals, different domesticated breeds and several wild variants of each species. 4) Sequence analysis of microbial communities, many members of which cannot be cultured, would provide a rich source of medically and environmentally useful information. And accurate, rapid sequencing may also be the best approach to microbial monitoring of food and the environment, including rapid detection and mitigation of bioterrorism threats.
Given the broad utility and high importance of dramatically reducing DNA sequencing costs, the National Human Genome Research Institute (NHGRI) is launching two parallel technology development programs. The first has the objective of reducing the cost of producing a high quality sequence of a mammalian-sized genome by two orders of magnitude (see accompanying RFA, HG-04-002). The goal of the second program, described in this RFA, is the development of technology to sequence a genome for a cost that is reduced by four orders of magnitude. For both programs, the cost targets are defined in terms of a mammalian-sized genome, about 3 gigabases (Gb), with a target sequence quality equivalent to, or better than, that of the mouse assembly published in December 2002 [Nature 420:520 (2002)].
The ultimate goal of this program is to obtain technologies that can produce assembled sequence (i.e., de novo sequencing). However, an accompanying shorter-term goal is to obtain highly accurate sequence data at the single base level, i.e., without assembly information, that can be overlaid onto a reference sequence for the same organism (i.e., re-sequencing). This could be achieved, for example, with short reads that have no substantial information linking them to other reads. While the sequence product of this kind of technology would lack some important information, such as information about genomic rearrangements, it would nevertheless potentially be available more rapidly and produce data of great value for certain uses in studying disease etiology and in individualized medicine. Therefore, both programs' objectives include a balanced portfolio of projects developing both de novo and re-sequencing technologies.
State-of-the-art technology (i.e., fluorescence detection of dideoxynucleotide-terminated DNA extension reactions resolved by capillary array electrophoresis [CAE]) allows the determination of sequence "read" segments approximately 1000 nucleotides long. If all of the DNA in a 2-3 Gb genome were unique, it would be possible to determine the sequence of the entire genome by generating a sufficient number (millions) of randomly-overlapping thousand-base reads and aligning them by their overlaps. However, the human and the majority of other interesting genomes contain a substantial amount of repetitive DNA (short [tens to thousands of nucleotides], nearly or completely identical sequences present in multiple [tens to thousands of] copies). To cope with the complexities of repetitive DNA elements and to assemble the thousand-base reads in the correct long-range order across the genome, current genomic sequencing methods involve a variety of additional strategies, such as the sequencing of both ends of cloned DNA fragments, use of libraries of cloned fragments of different lengths, incorporation of map information, achievement of substantial redundancy (multiple reads of each nucleotide from overlapping fragments) and application of sophisticated assembly algorithms to align and filter the read information.
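To give a sense of the scale implied by "millions" of reads, the number of reads needed for a given redundancy can be estimated as coverage times genome size divided by read length. The following is a minimal sketch under idealized assumptions: reads are taken to be uniformly random and of equal length, repeats and failed reads are ignored, and the function names are illustrative only. The 7.7-fold coverage value is the figure quoted for the mouse draft assembly elsewhere in this announcement.

```python
import math


def reads_needed(genome_size_bases: int, read_length_bases: int, fold_coverage: float) -> int:
    """Estimate the reads required so that total bases read = coverage * genome size."""
    if genome_size_bases <= 0 or read_length_bases <= 0 or fold_coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(fold_coverage * genome_size_bases / read_length_bases)


def expected_uncovered_fraction(fold_coverage: float) -> float:
    """Idealized (Poisson) estimate of the fraction of bases hit by no read."""
    return math.exp(-fold_coverage)


if __name__ == "__main__":
    # Assumed inputs: a 3 Gb genome, 1000-base reads, 7.7-fold redundancy.
    n = reads_needed(3_000_000_000, 1_000, 7.7)
    print(f"reads needed: about {n / 1e6:.1f} million")
    print(f"idealized uncovered fraction: {expected_uncovered_fraction(7.7):.6f}")
```

Under these assumptions the estimate comes to roughly 23 million reads. Real projects need additional strategies, as described above, because repetitive DNA breaks the uniform-sampling idealization.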
The "gold standard" for genomic sequencing is 99.99% accuracy (not more than one error per 10,000 nucleotides) with essentially no gaps (http://www.genome.gov/10000923). At present, the final steps in achieving that very high sequence quality cannot be automated and require substantial hand-crafting. However, recent experience suggests that the majority of comparative sequence information can be obtained from automatically generated sequence assemblies that have been variously identified as "high-quality draft" or "comparative grade." Therefore, while the ultimate goal is sequencing technology that produces perfect accuracy, the goal of the current program is to develop technology for producing automatically generated sequence of at least the quality of the mouse draft genome sequence that was published in December 2002 [Nature 420:520 (2002)].
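The accuracy figures used in this announcement can be related to one another through the Phred convention, under which a quality score Q corresponds to a per-base error probability of 10^(-Q/10). The sketch below assumes that the "Q20" notation quoted for the mouse assembly follows that convention; the function names are illustrative and not part of any particular software package.

```python
import math


def phred_from_error_rate(error_probability: float) -> float:
    """Convert a per-base error probability to a Phred-style quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_rate_from_phred(quality: float) -> float:
    """Convert a Phred-style quality score back to a per-base error probability."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # One error per 10,000 bases (99.99% accuracy), the "gold standard" above.
    print(phred_from_error_rate(1.0 / 10_000))
    # Q20, the per-base threshold quoted for the mouse draft coverage.
    print(error_rate_from_phred(20))
```

On this scale, one error per 10,000 bases corresponds to Q40, while a Q20 base has an error probability of one in 100.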
Emerging technologies, collectively characterized as sequencing-by-synthesis or sequencing-by-extension, may be able to achieve large numbers of sequence reads by extending very large numbers of different DNA templates simultaneously, but generally only for a few tens of bases as currently practiced. Even if it is possible to extend these reads to several hundred bases, it will still be necessary to link those reads to achieve long-range sequence contiguity. For some purposes, long-range sequence contiguity may not be required. For example, the resequencing of genomes (determination of the DNA sequence for many individuals of a species after a reference sequence for that species has been determined), such as might be used for medical diagnostic purposes, could be achieved by aligning individual reads on the reference sequence. However, short reads, particularly ones with lower per-base quality, can be very difficult to align given the nature of repetitive DNA and of closely-related gene families in complex genomes. Also, chromosomal rearrangements may be difficult to detect without high quality sequence information bridging the breakpoints with enough sequence to know in which repeat the breakpoint lies. The determination of single nucleotide polymorphisms (SNPs) and their phase (for haplotypes) also requires contiguity of varying length. The ultimate goal and a high priority for the NHGRI's sequencing technology development efforts, as exemplified in these two RFAs, continues to be de novo, assembled sequence. However, because of the value of resequencing for many future purposes, these RFAs also solicit the development of very inexpensive technology for very high quality re-sequencing (without assembly).
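The difficulty of placing short reads in repetitive sequence can be illustrated with a toy example. The sketch below uses exact string matching only, which is a deliberate simplification and not how production aligners work; the reference string, the reads and the function name are all made up for illustration.

```python
def exact_match_positions(reference: str, read: str) -> list[int]:
    """Return every 0-based position at which the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward one base so that overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical reference containing two copies of the repeat "GATTACA".
reference = "CCC" + "GATTACA" + "TTTGG" + "GATTACA" + "AAC"

short_read = "GATTACA"      # lies entirely within the repeat
longer_read = "GATTACATTT"  # extends into unique flanking sequence

if __name__ == "__main__":
    print(exact_match_positions(reference, short_read))
    print(exact_match_positions(reference, longer_read))
```

The short read matches both copies of the repeat and so cannot be placed unambiguously, whereas the read that reaches into unique flanking sequence matches only one location. This is the sense in which read length and linking information matter even when a reference sequence is available.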
Most investigators interested in reducing DNA sequencing costs anticipate that a few additional two-fold decreases in cost can yet be achieved with the current CAE-based technology, with a realistic lower limit of perhaps $5 million per mammalian-sized genome. However, it is likely that this efficiency will only be achieved in a few very large, well-capitalized, experienced, automated laboratories. To achieve the broadest benefit from DNA sequencing technology for biology and medicine, systems that are not only substantially more efficient but also more usable by the average research laboratory are needed.
One set of current technology development efforts is aimed at increasing parallel sample processing while integrating the sample preparation and analysis steps on a single platform. Thus, in one approach, lithography is used to create a large number of microchannels on a single device and to integrate an efficient sample injector with each separation channel. Chambers for on-chip DNA amplification, cycle sequencing reactions and sample clean-up have been also developed, and experiments to integrate these steps, an approach that effectively places much of the actual process and process control onto the device, are being conducted in several laboratories. Attendant improvements in separation polymers and in fluorescent dyes will facilitate these developments. As these approaches are based largely on the experience of currently successful high-throughput CAE-based methods, they have potential to produce cost savings in the range of several factors of two beyond the CAE-based system itself. They also have the potential to widen the user base for the technology, as the infrastructure and knowledge needed to conduct relatively high-throughput sequencing, or clinical diagnostic sequencing, would be substantially reduced and simplified.
Other approaches to improving sequencing technology involve methods that are independent of the Sanger dideoxynucleotide chain termination reaction or of electrophoretic separation of the termination products. Two methods that were proposed in the early days of the HGP involve the use of mass spectrometry and sequencing by hybridization. Both methods have been pursued, with some limited success for sequencing, but substantial success for other types of DNA analysis. Both continue to hold additional potential utility for sequencing, although certain inherent limitations will need to be overcome.
More recently, additional methodologies have been investigated. These may be classified into two approaches. One is sequencing-by-extension, in which template DNA is elongated stepwise and each extension product is detected. Extension is generally achieved by the action of a polymerase that adds a deoxynucleotide, followed by detection of a fluorescent or chemiluminescent signal; the cycle is then repeated. Modifications of this approach rely on other types of enzymes and detection of hybridization of labeled oligonucleotides. To obtain sufficient throughput, the method is implemented at a high level of multiplexing, e.g., by arraying large numbers of sequencing extension reactions on a surface. A key factor in this general approach is the manner in which the fluorescent signal is generated and the system requirements thus imposed. Depending on the specific approach, challenges of template extension methods include the synthesis of labeled nucleotide analogues; identification of processive polymerases that can incorporate nucleotide analogs with high fidelity; discrimination of fluorescent nucleotides that have been incorporated into the growing chain from those present in the reaction mix (background); distinction of subsequent nucleotide additions from previous ones; accurate enumeration of homopolymer runs (multiple sequential occurrence of the same nucleotide); maintenance of synchrony among the multiple copies of DNA being extended to generate a detectable signal, or achievement of sensitivity that detects extension of individual DNA molecules; and development of fluidics, surface chemistry, and automation to build and run the system. Current efforts to develop such methods have produced, at best, short sequence reads (less than or equal to 100 bases), so a continuing challenge is to extend read length and develop sequence assembly strategies. 
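The homopolymer challenge mentioned above can be made concrete. When a run of identical nucleotides is incorporated in a single extension step, the method must infer the length of the run from signal intensity rather than from a count of discrete events. The sketch below simply shows the run-length representation that such a method would need to recover; it is an illustration with a made-up sequence and a hypothetical function name, not a model of any specific chemistry.

```python
from itertools import groupby


def homopolymer_runs(sequence: str) -> list[tuple[str, int]]:
    """Collapse a sequence into (base, run length) pairs, in order."""
    return [(base, sum(1 for _ in group)) for base, group in groupby(sequence)]


def expand_runs(runs: list[tuple[str, int]]) -> str:
    """Rebuild the sequence from (base, run length) pairs."""
    return "".join(base * count for base, count in runs)


if __name__ == "__main__":
    example = "AAATCCG"  # made-up sequence
    runs = homopolymer_runs(example)
    print(runs)
    # Miscounting a single run changes the inferred sequence length.
    miscounted = [(b, n - 1 if (b, n) == ("A", 3) else n) for b, n in runs]
    print(expand_runs(miscounted))
```

An error of one in the estimated length of a run appears in the output as an insertion or deletion, which is why accurate enumeration of runs bears directly on per-base accuracy.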
NHGRI anticipates that the state of the art for this approach is sufficiently advanced that, with additional investment, it may be possible to achieve proof of principle or even early commercialization for genome-scale sequencing within five years. It is anticipated that the cost of genome sequencing with this technology could be reduced by two orders of magnitude from today's costs. It is important to note that sequencing by extension is one prototype for achieving these time and cost goals, but other technological approaches may also be viable. Reaching this goal is the subject of a parallel RFA, HG-04-002 (http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04-002.html).
A second alternative to CAE sequencing seeks to read out the linear sequence of nucleotides without copying the DNA and without incorporating labels, relying instead on extraction of signal from the native DNA nucleotides themselves. The most familiar model for this approach, but almost certainly not the only way to achieve 10,000-fold reduction in sequencing costs, is nanopore sequencing, first introduced in the mid-1990s. Generally, this approach requires a sensor, perhaps comparable in size to the DNA molecule itself, that interacts sequentially with individual nucleotides in a DNA chain and distinguishes between them on the basis of chemical, physical or electrical properties. Optimal implementation of such a method would analyze intact, native genomic DNA molecules isolated from biological, medical or environmental samples without amplification or modification, and would provide very long sequence reads (tens of thousands to millions of bases) rapidly and at sufficiently high redundancy to produce assembled sequence of high quality. NHGRI anticipates that the science and technology needed to reduce sequencing costs by four orders of magnitude, whether by the nanopore or some other approach, will require substantial basic research and development, and may take as long as ten years to achieve. Such a sustained research program is the subject of this RFA.
The goal of research supported under this RFA is to develop new or improved technology to enable rapid, efficient genomic DNA sequencing. The specific goal is to reduce sequencing costs by at least four orders of magnitude--$1000 serves as a useful target cost for a mammalian-sized genome because the availability of complete genomic sequences at that cost would revolutionize biological research and medicine. New sensing and detection modalities will likely be needed to achieve these goals. New fabrication technologies may also be required. It is therefore anticipated that proposals responding to this RFA will need to involve fundamental and engineering research conducted by multidisciplinary teams of investigators. The guidance for budget requests accommodates the formation of groups having investigators at several institutions, in cases where that is needed to assemble a team of the appropriate balance, breadth and experience.
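The arithmetic behind the cost targets can be written out explicitly. The sketch below takes the $1000 target and the four-orders-of-magnitude reduction from the text and derives the baseline they imply. That implied baseline is an inference made here for illustration, not a figure stated in this announcement, and the function names are hypothetical. The genome size of 3 Gb is the value used throughout the RFA.

```python
GENOME_SIZE_BASES = 3_000_000_000  # about 3 Gb, as defined in the RFA


def implied_baseline_cost(target_cost_usd: int, orders_of_magnitude: int) -> int:
    """Baseline cost implied by a target cost and a stated fold reduction."""
    return target_cost_usd * 10 ** orders_of_magnitude


def cost_per_genome_base(genome_cost_usd: float, genome_size_bases: int = GENOME_SIZE_BASES) -> float:
    """Cost per base of finished genome (not per base of raw read)."""
    return genome_cost_usd / genome_size_bases


if __name__ == "__main__":
    baseline = implied_baseline_cost(1_000, 4)   # implied by this RFA's goal
    two_orders_target = baseline // 10 ** 2      # same baseline, two-order reduction
    print(f"implied baseline: ${baseline:,}")
    print(f"two-order target: ${two_orders_target:,}")
    print(f"cost per genome base at $1000: ${cost_per_genome_base(1_000):.2e}")
```

On these assumptions the implied baseline is $10 million per genome, and a two-order reduction from the same baseline would be $100,000. Because each base must be read several times to reach the required quality, the cost per raw base read would need to be lower still than the per-genome-base figure.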
The scientific and technical challenges inherent in achieving a 10,000-fold reduction in sequencing costs are clearly daunting. Achieving this goal may require research projects that entail substantial risk. That risk should be balanced by an outstanding scientific and management plan designed to achieve the very high payoff goals of this solicitation.
Although the ultimate goal of this RFA is to develop full-scale sequencing systems, independent research on essential components will also be considered to be responsive. However, it will be important for applicants proposing research on system components or concepts to describe how the knowledge gained as a result of their project would be incorporated into a full system that they might subsequently propose to develop, or that is being developed by other groups. Such independent proposals are an important path for pursuing novel, high risk/high pay-off ideas.
Research conducted under this RFA may include development of the computational tools associated with the technology, e.g., to extract sequence information, including signal processing, and to evaluate sequence quality and assign confidence scores. It may also address strategies to assemble the sequence from the information being obtained from the technology or by merging the sequence data with information from parallel technology. However, this RFA will not support development of sequence assembly software independent of technology development to obtain the sequence.
The quality of sequence to be generated by the technology is of paramount importance for this solicitation. Two major factors contributing to genomic sequence quality are per-base accuracy and contiguity of the assembly. Much of the utility of comparative sequence information will derive from characterization of sequence variation between species, and between individuals of a species. Therefore, per-base accuracy must be high enough to distinguish polymorphism at the single-nucleotide level (substitutions, insertions, deletions). Experience and resulting policy have established a target accuracy of not more than one error per 10,000 bases. All applications in response to this RFA, whether to develop resequencing or de novo sequencing technologies, must propose achieving per-base quality at least to this standard.
Assembly information is needed for determining sequence of new genomes, and ultimately also for genomes for which a reference sequence exists, to detect rearrangements, insertions and deletions. Rearrangements are known to cause diseases; knowledge of rearrangement can reveal new biological mechanisms. The phase of single nucleotide polymorphisms to define haplotypes is important in understanding and diagnosing disease. Achieving a high level of sequence contiguity will be essential to achieve the full benefit from the use of sequencing for individualized medicine, e.g., to evaluate genomic contributions to risk for specific diseases and syndromes, and drug responsiveness. Nevertheless, it is recognized that perfect sequence assembly from end to end of each chromosome is unlikely to be achievable with most technologies in a fully automated fashion and without adding considerable cost. Therefore, for the purpose of this solicitation, grant applications proposing technology development for de novo sequencing shall describe how they will achieve, for about $1000, a draft-quality assembly that is at least comparable to that represented by the mouse draft sequence produced by December 2002: 7.7-fold coverage, 6.5-fold coverage in Q20 bases, assembled into 225,000 sequence contigs connected by at least two read-pair links into supercontigs [total of 7,418 supercontigs at least 2 kb long], with N50 length for contigs equal to 24.8 kb and for supercontigs equal to 16.9 Mb [Nature 420:520 (2002)].
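The N50 statistic used in the assembly benchmark above can be computed directly from a list of contig or supercontig lengths. Under the usual definition, N50 is the length L such that contigs of length L or greater together contain at least half of all assembled bases. The following is a minimal sketch; the function name and the toy lengths are illustrative and are not taken from the mouse assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig (or supercontig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    if any(length <= 0 for length in lengths):
        raise ValueError("lengths must be positive")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty positive input")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up lengths, total 54
    print(n50(toy_contigs))
```

For the toy input, the three longest contigs (10, 9 and 8) account for 27 of 54 bases, so the N50 is 8. A proposed technology could be compared with the benchmark by computing the same statistic on its own assemblies.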
The grant applications will be evaluated, and funding decisions made, in such a way as to develop a balanced portfolio that has strong potential to develop both robust re-sequencing and de novo sequencing technologies. If the estimate that achieving the goal of $1000 de novo genome sequencing incorporating substantial assembly information will require about 10 years to achieve is correct, then re-sequencing technologies might be expected to be demonstrated in a shorter time. Grant applications that present a plan to achieve high quality re-sequencing while on the path to high quality de novo sequencing will receive high priority.
The major focus of this RFA is on the development of new technologies for detection of nucleotide sequence. However, any new technology will eventually have to be effectively incorporated into the entire sequencing workflow, starting with a biological sample and ending with sequence data of the desired quality, and this issue should be addressed. Given that sample preparation requirements are a function of the detection method and the sample detection method affects the way in which output data are handled, these aspects of the problem are clearly relevant and should be addressed in an appropriate timeframe. However, NHGRI is interested in seeing that the most critical and highest-risk aspects of the project, on which the rest of the project is dependent, are addressed and proven as early as possible.
Practical implementation issues related to workflow and process control for efficient, high quality, high-throughput DNA sequencing should be considered early. Some technology development groups lack practical experience in high throughput sequencing, and in testing of methods and instruments for robust, routine operation. Applicants may therefore wish to include such expertise as they develop their suite of collaborations and capabilities.
The goal of this research is to develop technology to produce sequence from entire genomes. It is conceivable that sequence from selected important regions (e.g., all of the gene regions) could be determined in the near future, using more conventional technologies, at very low cost. However, that is not the purpose of this initiative, and grant applications that propose to meet the cost targets by sequencing only selected regions of a genome will be considered unresponsive.
This RFA will use NIH R21, R21/R33, R01 and P01 award mechanism(s). As an applicant you will be solely responsible for planning, directing, and executing the proposed project.
Applicants may request an R01 or P01 (depending on the organization of the proposed project) if sufficient preliminary data are available to support such an application. A fully integrated management and research plan should use the R01 mechanism. The P01 mechanism should be used if multiple projects under different leadership must proceed in parallel; however, the issue of synergy in a multi-focal effort is of great importance and must be addressed in the application.
Applicants requiring support to demonstrate feasibility may apply for either an R21 pilot/exploratory project or an R21/R33 award, which offers single submission and evaluation of both a feasibility/pilot phase (R21) and an expanded development phase (R33) in one application. The R21/R33 should be used when both quantitative milestones for the feasibility demonstration, and a research plan for the follow-on research, can be presented. The transition from the R21 award to the R33 award will be expedited by administrative review. The R21 alone is appropriate when the possible outcomes of the proposed feasibility study are unclear and it is not possible to propose sufficiently clean-cut and quantitative milestones for administrative evaluation, nor would it be possible to describe the R33 phase of the research in sufficient detail to allow adequate initial review.
This RFA uses just-in-time concepts. It also uses the modular budgeting as well as the non-modular budgeting formats (see http://grants.nih.gov/grants/funding/modular/modular.htm). Specifically, if you are submitting an application with direct costs in each year of $250,000 or less, use the modular budget format. Otherwise follow the instructions for non-modular budget research grant applications. This program does not require cost sharing as defined in the current NIH Grants Policy Statement at http://grants.nih.gov/grants/policy/nihgps_2001/part_i_1.htm. However, cost-sharing is permitted as a component of institutional commitment.
Applications must be prepared using the PHS 398 research grant application instructions and forms (rev. 5/2001). Applications must have a Dun and Bradstreet (D&B) Data Universal Numbering System (DUNS) number as the Universal Identifier when applying for federal grants or cooperative agreements. The DUNS number can be obtained by calling (866) 705-5711 or through the web site at http://www.dunandbradstreet.com/. The DUNS number should be entered on line 11 of the face page of the PHS 398 form. The PHS 398 document is available at http://grants.nih.gov/grants/funding/phs398/phs398.html in an interactive format. For further assistance contact GrantsInfo, 301-435-0714, e-mail: GrantsInfo@nih.gov.
The Center for Scientific Review (CSR) will not accept any application in response to this RFA that is essentially the same as one currently pending initial review, unless the applicant withdraws the pending application. However, when a previously unfunded application, originally submitted as an investigator-initiated application, is to be submitted in response to an RFA, it is to be prepared as a NEW application. That is, the application for the RFA must not include an Introduction describing the changes and improvements made, and the text must not be marked to indicate the changes from the previous unfunded version of the application.
Letters of intent must be received by 14 September 2004. Applications are due by 14 October 2004. The earliest anticipated start date is 1 June 2005.
Contact: Jeffery A. Schloss, Division of Extramural Research, NHGRI, Bldg 31, Rm B2B07, Bethesda, MD 20892-2033 USA, 301-496-7531, fax: 301-480-2770, e-mail: firstname.lastname@example.org.
Reference: RFA No. RFA-HG-04-003
|Title Annotation:||Fellowships, Grants, & Awards|
|Publication:||Environmental Health Perspectives|
|Date:||May 15, 2004|