Developing genome and exome sequencing for candidate gene identification in inherited disorders: an integrated technical and bioinformatics approach.
The introduction of NGS has created a sea change in biomedical research by providing new tools to rapidly and cost-effectively perform large-scale automated sequencing. Questions that were previously impossible or too costly to pursue are now tractable. The wave of basic research questions investigated with NGS, coupled with improvements in platform chemistries and workflows, has paved the way for the translation of NGS into the clinical diagnostic realm. Notably, in the past 3 years, diagnostic laboratories have begun to develop and implement NGS-based diagnostic assays. These include (1) multigene panels for a variety of inherited disorders and oncology conditions (2-7); (2) human leukocyte antigen locus characterization (8-10); (3) pathogen genome sequencing for identification and assessment of resistance (11,12); (4) exome sequencing for candidate gene discovery in inherited disorders and characterization of the mutational landscape in tumors (13-16); and (5) whole genome sequencing. (17-19) In complement to the high-throughput NGS platforms that have been used to support these initial endeavors, more recently commercialized NGS instrument platforms with faster turnaround times are accelerating the dissemination of NGS into the clinical laboratory domain. Each laboratory adopting NGS has been challenged with learning new chemistries and instrumentation as well as entering into the complex realm of NGS data analysis.
In this review, we focus on the utilization of NGS technology to analyze human exomes and whole genomes, specifically for the purposes of identifying candidate disease-causing genes in inherited disorders. The use of exome and genome sequencing is gaining considerable momentum in scenarios for which clinical phenotypes and family studies suggest a genetic etiology and available genetic testing has been noninformative. Whereas most reports to date describe the application of exome or genome sequencing in a clinical research setting, a handful of clinical laboratories are now using exome or genome sequencing to support clinical diagnostics. In this review, the Illumina NGS technology is highlighted, as the authors have greatest experience with this sequencing method and it has been used in most published exome and genome studies. The review is structured to first describe technical aspects of exome and genome sequencing, followed by a discussion of the bioinformatics considerations and challenges associated with this approach. Subsequently, examples from the literature are highlighted and we conclude with a discussion of translational considerations for clinical diagnostics.
EXOME AND GENOME SEQUENCING: AN INTEGRATED TECHNICAL AND BIOINFORMATICS PROCESS
Exome and Genome Library Preparation.--DNA libraries need to be generated for NGS and the process of generating libraries for sequencing on the Illumina platform is illustrated in Figure 1. The initial preparatory steps are the same for genome and exome libraries and include genomic DNA fragmentation and conversion of fragments into an oligonucleotide adapter-tagged library. Multiple methods for genomic DNA fragmentation are used and include nebulization, sonication, restriction enzyme digestion, chemical methods, and sonication by adaptive focused acoustics (Figure 1, options 1 through 6). The resulting DNA fragments are then enzymatically end repaired before ligation of platform-specific adapter oligonucleotides. Enzymatic end repair includes creating blunt ends from the fragmented DNA, which are then 5'-phosphorylated, and a 3'-adenine (A) overhang is added. Platform-specific adapters with a complementary 3'-thymine (T) overhang are ligated to the adenylated fragments. The multistep end repair and adapter ligation process is manually intensive. Performing these process steps on liquid handling platforms can increase library preparation throughput while maintaining library quality. As an example of this approach, we use Beckman Coulter Genomics' SPRI-TE (Danvers, Massachusetts) instrument for library preparation, which automates enzymatic end repair and adapter ligation steps and offers an optional size selection step to create a library with a defined fragment-size distribution. After adapter ligation, a polymerase chain reaction (PCR) step with primers containing "tails" complementary to the adapters is performed to increase library concentration before exome enrichment or genome sequencing.
Figure 1, A through C, shows representative gel electrophoresis results on an Agilent Bioanalyzer (Santa Clara, California) with sheared human genomic DNA generated by sonication with adaptive focused acoustic technology commercialized by Covaris (Woburn, Massachusetts), followed by adapter ligation and PCR. The size of the PCR product increases from the size of the adapter-ligated library by the addition of tails (from the PCR primers), which also contain sequences necessary for (1) annealing to complementary oligonucleotides immobilized on the Illumina flow cell surface and (2) binding to sequencing primers. This size shift is approximately 50 bp and also serves as a control for adapter ligation.
A new technology for library preparation, termed Nextera (developed by Epicentre, Madison, Wisconsin, and acquired by Illumina), simultaneously fragments DNA and introduces adapter sequences in a process termed tagmentation. This technology uses a transposase enzyme complexed with a transposon, which, in this case, contains Illumina adapters (product data sheet). The transposase fragments DNA in an enzyme concentration- and incubation time-dependent manner and adapter sequences are inserted at the cut site. Reaction conditions can be adjusted to generate a fragment-size distribution suitable for Illumina sequencing. The adapter-tagged fragments generated by the Nextera method are then PCR amplified with tailed primers containing Illumina-specific flow cell annealing and sequencing primer sequences and optional indexing (or bar coding) sequences. Indexing multiple libraries for pooling and simultaneous sequencing leverages instrument capacity and reduces costs, as discussed further below. Most library preparation methods currently require input of 1 μg to several micrograms of genomic DNA. The amount of input DNA can be reduced to 50 ng with Nextera technology, (20) which can be very useful for applications with limited amounts of available DNA.
Exome Enrichment.--For exome enrichment, PCR-amplified library preparations are hybridized in solution with a pool of biotinylated exon-specific capture probes. Complexes composed of hybridized probe and complementary library fragments are captured on streptavidin-coated paramagnetic beads, unbound fragments are washed away, and the bead-bound library is used as template for PCR amplification, yielding an exome-enriched library. Before sequencing, genome- or exome-enriched libraries can be size selected by gel purification to obtain a narrow fragment-size distribution (typically 50-100 bp) to facilitate data analysis for insertions and deletions. An aliquot of the library is subjected to quantitative PCR with adapter-specific primers to accurately determine concentration of library fragments that can generate clusters on the Illumina flow cell (Figure 1).
Several commercial vendors offer in-solution exome capture reagents, including Agilent, Roche NimbleGen (Madison, Wisconsin), and Illumina, each differing in targeted capture areas, capture probe sequence composition, and performance characteristics. (21-23) Agilent's SureSelect capture kit targets 50 Mb of the genome with biotinylated RNA probes designed from the following databases: Consensus Coding Sequence (24) (CCDS), RefSeq, (25) miRBase, (26) GENCODE, (27) and the Rfam database, (28) which contains sequence information for RNA families and RNA genes. The recently released version 3.0 of Roche NimbleGen's SeqCap EZ Human Exome Library in-solution capture kit targets 64 Mb of the genome with DNA probes designed from CCDS.2, RefSeq, Vega, (29) GENCODE, Ensembl, (30) miRBase, and the snoRNABase. (31) The Illumina TruSeq probe set uses DNA probes to capture 62 Mb of sequence with probes designed from CCDS, RefSeq, RefSeq plus (RefSeq exons plus 5' and 3' untranslated regions, microRNA, and noncoding RNA sequences), GENCODE, and predicted microRNA targets.
Illumina Sequencing: Concept.--Several platforms with the throughput capacity for genome and exome sequencing are commercially available and include the Illumina Genome Analyzer and HiSeq series and Life Technologies' SOLiD instruments. For a more detailed description of Illumina, SOLiD, and other platforms used for additional NGS applications, the reader is referred to several publications. (6,32-38) Illumina's HiSeq 2000 instrument with version 3 (v3) chemistry has the capacity to simultaneously perform sequencing in 2 flow cells to yield approximately 600 Gb of sequence with 2X100-base length reads per 12-day run. In this configuration, 2 human genomes per 8-lane flow cell (with 1 genome library distributed in 4 lanes) can be sequenced with average read depth coverage of 30-fold. In comparison, 2 indexed exome samples can be sequenced per lane to yield a 100- to 200-fold average read depth coverage. Coverage requirements for genome and exome sequencing are further discussed below.
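The coverage figures quoted above can be checked with back-of-envelope arithmetic. The following Python sketch is illustrative only: the 16 lanes per run (2 flow cells of 8 lanes), the 3100-Mb genome size, and especially the usable_fraction values (a hypothetical catch-all for reads lost to quality filtering, duplication, and off-target capture) are assumptions that are not taken from the text.

```python
def mean_coverage(run_yield_gb, lanes_per_run, lanes_used, samples_per_lane,
                  target_size_mb, usable_fraction):
    """Estimate average read depth for one sample.

    usable_fraction is a hypothetical parameter covering reads lost to
    filtering, duplicates, and (for exomes) off-target capture.
    """
    bases_per_lane = run_yield_gb * 1e9 / lanes_per_run
    sample_bases = bases_per_lane * lanes_used / samples_per_lane
    return sample_bases * usable_fraction / (target_size_mb * 1e6)

# One genome spread over 4 of the 16 lanes of a 600-Gb run (assumed 65% usable)
genome_depth = mean_coverage(600, 16, 4, 1, 3100, 0.65)

# Two indexed exomes sharing 1 lane, 62-Mb target (assumed 50% usable, on target)
exome_depth = mean_coverage(600, 16, 1, 2, 62, 0.5)

print(round(genome_depth), round(exome_depth))
```

Under these assumed fractions the estimates land near 30-fold for the genome and within the 100- to 200-fold range for each exome; different assumptions about read loss would shift both numbers.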
Before sequencing, adapter-ligated DNA fragment libraries are hybridized under limiting dilution conditions to complementary oligonucleotides on the surface of the glass flow cell (Figure 2). Each hybridized library fragment is then bridge amplified to generate a clonal DNA fragment "cluster" composed of approximately 1000 copies. Sequencing of clonal clusters proceeds in a cyclic manner with reversible dye-terminator chemistry, allowing only 1 complementary base to be incorporated at a time into the growing strand. Each of the 4 bases is covalently attached to a spectrally unique fluorophore, and high-sensitivity imaging optics are used to capture the fluorescent output of each base post incorporation. Base calls are made according to which fluorophore is detected at each cycle for each cluster on the flow cell. Therefore, each DNA fragment is converted into 1 cluster, which is sequenced in a progressive, cyclic manner to yield a strand whose length is dependent on the number of sequencing cycles. The output of sequencing 1 cluster is a single, composite read. In paired-end sequencing, the same cluster is also sequenced from the opposite end, creating read 2 of the paired-end read.
Illumina Sequencing: Signal to Noise Processing.--Each NGS platform is prone to its own characteristic sequencing errors secondary to its unique chemistry, and these errors need to be considered when analyzing and interpreting sequencing results. For Illumina sequencing, single-nucleotide substitution errors can occur, and several factors contributing to error generation are summarized in Figure 3. (39) First, since each base is incorporated individually within a growing DNA strand, base incorporation can become out of phase within a clonal cluster if 1 base is skipped (phasing) or multiple bases are incorporated in a single cycle (prephasing), resulting in nonuniform fluorescence within a clonal cluster. (39) Second, increasing background fluorescence during the analytic run leads to a decreased signal to noise ratio. Third, errors will be introduced if a cluster is mixed, that is, if more than 1 unique adapter-ligated fragment is colocalized at the same spot on the flow cell. Finally, there is overlap in the emission spectra of each of the 4 fluorophores, which can make it difficult to determine which base was incorporated (fluorophore cross-talk), a phenomenon exacerbated when clonal clusters are physically close to each other.
Different software applications are available for base calling on the Illumina platform and each one corrects for at least a subset of error sources. (39) The chastity filter that comes with the Illumina platform removes clusters of low purity. The Illumina application Bustard corrects or filters base calls for cross-talk, phasing, and prephasing, then designates the base with the highest signal intensity, which is used to estimate a quality (Q) score [$Q = -10 \log_{10}(e)$] for the base call. The Q score is logarithmically related to error probability (e) and is operationally analogous to the Phred quality score used in Sanger sequencing. (40,41) For example, a base with a Q30 score has a 1:1000 probability of being called incorrectly, and a base with a Q20 score has a 1:100 chance of being called incorrectly. The Q score is calculated for each base along the sequence read and is used as a standard quality metric for downstream data analysis.
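The logarithmic relationship between the Q score and the error probability can be made concrete with a short conversion in Python. This is a minimal sketch; the function names are illustrative and do not correspond to any Illumina software.

```python
import math

def phred_to_error(q):
    """Convert a Q score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(e):
    """Convert an error probability (0 < e <= 1) to a Q score."""
    return -10 * math.log10(e)

for q in (20, 30):
    print(f"Q{q}: 1 in {round(1 / phred_to_error(q))} chance of an incorrect call")
```

Each 10-point increase in Q corresponds to a 10-fold decrease in the probability of an incorrect base call.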
Analyzing NGS reads to develop a list of variants in relationship to a reference sequence is a multistep process. A schematic of the pipeline we use, and will discuss below, is shown in Figure 4.
Aligning Reads to a Reference Sequence.--Once sequence base call files have been generated, they are converted to a common file format (one used by many groups is the FASTQ file format) for subsequent analyses and storage. Millions to billions of reads with base-associated Q scores comprise the FASTQ file from exome and genome sequencing, respectively. Multiple alignment-to-reference programs are available including the open-source software Burrows-Wheeler Aligner (42-44) and Novoalign (45) (Figure 4). The initial alignment process involves mapping reads to a best-fit location on the reference sequence. This step associates each read with another quality score, termed the mapping quality score, and mapped reads for the entire data set are stored in a binary alignment file format called BAM. Mapped reads can be visualized and a popular viewer for inspecting reads is the open-source Integrative Genomics Viewer. (46,47) An example of reads from a human genome data set aligned to the reference is shown in Figure 5. In this example, the individual has a heterozygous cytosine to thymine (C>T) change in the VSIG4 gene, as confirmed by Sanger sequencing. The viewer shows coverage over each nucleotide with a change from the reference shown by a colored box. Aligned reads can also be interrogated on the viewer for mapping quality by scrolling over the gray boxes within the viewer. The nucleotides of the reference sequence are shown along the bottom of the viewer along with the amino acid sequence of exons.
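To illustrate how base-associated Q scores are carried in a FASTQ file, the following Python sketch reads 4-line FASTQ records and decodes the quality string. It assumes the Phred+33 ASCII encoding (some older Illumina pipelines used an offset of 64, hence the parameter), and the example record is invented for illustration.

```python
def parse_fastq(lines, offset=33):
    """Yield (read_id, sequence, [Q scores]) from an iterable of FASTQ lines.

    Assumes unwrapped 4-line records; offset=33 is an assumption about the
    quality encoding and should be checked for each data set.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header}")
        yield header[1:], seq, [ord(c) - offset for c in qual]

# Invented single-read example
example = ["@read1", "ACGT", "+", "II5#"]
records = list(parse_fastq(example))
print(records)
```

For real data sets of millions to billions of reads, established libraries and streaming from compressed files would be preferable to this line-by-line sketch.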
Initial and Refined Variant Calling.--Variant calling is the step wherein nucleotides in aligned sequence reads are used to infer the presence and zygosity of variants. In the current workflow, this produces single-nucleotide polymorphisms (SNPs) and insertions and deletions (indels) in a file format termed Variant Call Format (VCF). The Genome Analysis Toolkit (GATK) (48-50) and SAMtools (51,52) are 2 examples of commonly used programs to accomplish this task (Figure 4). To call variants, the ratio of reference to alternate allele bases in the reads is considered, along with other read and alignment parameters including overall read coverage, base quality, and read mapping scores. The simplest variant calling method is a "threshold" method that assesses and calls variants when parameters, such as variant read percentage, fall within a fixed range. GATK and SAMtools, however, use a statistical method to calculate the most likely genotype at each alignment position and provide a variant quality score, which is an estimate of the algorithm's "confidence" in the variant call. One significant factor that improves these calculated variant qualities is high read coverage. The higher the read coverage of a variant, the less prone the variant call is to sampling error, which can distort the true ratio of reference to alternate alleles.
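The contrast between a threshold method and a statistical method, and the effect of read coverage on confidence, can be sketched in Python. This is a deliberately simplified toy model, not the algorithm implemented in GATK or SAMtools: it assumes a single uniform per-base error rate, independent reads, and only 3 diploid genotypes, and the threshold range and error rate are arbitrary illustrative values.

```python
import math

def call_by_threshold(ref_reads, alt_reads, het_range=(0.2, 0.8)):
    """Call a genotype from the alternate-allele read fraction alone."""
    depth = ref_reads + alt_reads
    if depth == 0:
        return "no-call"
    frac = alt_reads / depth
    if frac < het_range[0]:
        return "hom-ref"
    if frac <= het_range[1]:
        return "het"
    return "hom-alt"

def call_by_likelihood(ref_reads, alt_reads, error=0.01):
    """Pick the most likely genotype under a simple binomial read model.

    Returns (genotype, confidence), where confidence is the Phred-scaled
    likelihood ratio between the best and second-best genotype.
    """
    alt_fraction = {"hom-ref": 0.0, "het": 0.5, "hom-alt": 1.0}
    loglik = {}
    for genotype, f in alt_fraction.items():
        # Chance that a read shows the alternate base, allowing for errors
        p_alt = f * (1 - error) + (1 - f) * error
        loglik[genotype] = (alt_reads * math.log(p_alt)
                            + ref_reads * math.log(1 - p_alt))
    ranked = sorted(loglik, key=loglik.get, reverse=True)
    best, second = ranked[0], ranked[1]
    confidence = 10 * (loglik[best] - loglik[second]) / math.log(10)
    return best, confidence

low = call_by_likelihood(3, 3)     # 6 reads, balanced alleles
high = call_by_likelihood(30, 30)  # 60 reads, same allele ratio
print(low, high)
```

Both calls return a heterozygous genotype, but the confidence value at 60-fold depth is 10 times that at 6-fold depth in this model, which mirrors the point that higher coverage makes a call less prone to sampling error.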
Alignment and mapping accuracy differ between algorithms, and empirical comparisons of these algorithms are useful when deciding upon a bioinformatics approach. A trade-off exists between computational speed and mapping accuracy, which can lead to initial alignments with false positive and false negative variants that can be corrected with additional processing. After the initial alignment and variant calling are complete, it is therefore recommended that the alignment be refined to improve the accuracy of the data by (1) local realignment, (2) removing PCR duplicates, and (3) recalibrating base quality scores (Figures 4 and 6). An important source of false positives and false negatives comes from misalignment of reads around indels. (49) Short reads produced by NGS instruments are difficult to map when the reads contain indels. Often the reads are aligned in the appropriate genomic location but may be shifted by a few bases owing to the indel, thus potentially causing the zygosity of the indel to be incorrectly called and introducing additional flanking false positive variant calls. One extensively used, open-source realignment algorithm for indels is in the GATK toolkit. (49) An example is shown in Figure 6, A, where the zygosity of the 3-bp deletion is unclear and several potential variants are near the deletion (top panel). After the refinement (Figure 6, A, bottom panel), the deletion is clearly homozygous and the nearby potential variants are no longer present. Sanger sequencing indicates a homozygous 3-bp deletion and lack of nearby variants, confirming the accuracy of the refined alignment.
A second aspect to data refinement is removing PCR duplicates, which are reads that have the same start and end points. Duplicates arise from sequencing of identical fragments generated by PCR during library preparation. Polymerase chain reaction errors can be introduced and propagated through unequal amplification of the library fragment template, which can lead to false positives or incorrect variant zygosity calling. Removal of PCR duplicates before variant calling is performed with PICARD (53) or SAMtools. Of the duplicate reads, only the read with the highest combined base quality is used (Figure 6, B). When implementing PCR duplicate removal, 10% to 15% of reads are removed from exome data sets and approximately 6% of reads are removed from genome data sets. This difference in percentage of reads removed is due to the greater number of PCR cycles performed in exome versus genome library preparation protocols (eg, 18 versus 10 cycles of PCR, respectively).
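The duplicate-removal rule described above (reads sharing start and end points, with only the read of highest combined base quality retained) can be sketched in Python. The dictionary-based read representation is hypothetical and chosen for brevity; this is not how PICARD or SAMtools are implemented, and production tools also take read orientation and mate position into account.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, end): the highest summed base quality.

    reads: iterable of dicts with keys 'name', 'chrom', 'start', 'end',
    and 'quals' (a list of per-base Q scores). Illustrative structure only.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["end"])
        kept = best.get(key)
        if kept is None or sum(read["quals"]) > sum(kept["quals"]):
            best[key] = read
    return list(best.values())

reads = [
    {"name": "r1", "chrom": "chrX", "start": 100, "end": 200, "quals": [30, 30, 20]},
    {"name": "r2", "chrom": "chrX", "start": 100, "end": 200, "quals": [35, 35, 30]},
    {"name": "r3", "chrom": "chrX", "start": 150, "end": 250, "quals": [30, 30, 30]},
]
unique_reads = remove_duplicates(reads)
print([r["name"] for r in unique_reads])
```

Here r1 and r2 share coordinates, so only r2 (the higher combined quality) is kept alongside r3.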
A third aspect to refining alignments is recalibrating base quality scores with the GATK toolkit (54) (Figure 6, C). Quality score recalibration results in more accurate quality scores, which are closer to the probability of an incorrect base call. Here, the alignment itself is used to estimate the actual base call error rate by tabulating base mismatch with the reference, with known or expected variant regions excluded (eg, dbSNP (55) and putative variant calls for the alignment in question). Bases are categorized by parameters including position along the read, preceding bases in the read, and instrument-assigned quality scores. Base quality score is then updated from the alignment error rates for all bases in the same category. As shown in Figure 6, C, the base quality scores generally increase after recalibration (54) (J.D.D.; K.V.V., unpublished data, January 2011). Each position in a read is interrogated with this software and all reads in the BAM file are recalibrated. Once all 3 steps are completed, variants are called again from the refined BAM file to produce a second, refined VCF file that can be used for further analysis. An additional strategy to reduce false positive variant calls includes a statistically sophisticated algorithm called variant quality recalibration available in GATK, (49) which attempts to determine the relationship between variant error likelihood and several variant parameters.
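The idea behind base quality score recalibration, namely replacing the instrument-assigned Q score with an empirical one derived from mismatch rates within a category of bases, can be sketched in Python. This is a simplified illustration rather than the GATK implementation: it categorizes bases only by reported Q score and read cycle, assumes known variant sites have already been excluded from the input, and uses add-one smoothing as an arbitrary choice to avoid taking the logarithm of zero.

```python
import math
from collections import defaultdict

def recalibration_table(observations):
    """Build a table mapping (reported_q, cycle) to an empirical Q score.

    observations: iterable of (reported_q, cycle, is_mismatch) tuples for
    aligned bases at sites not known or suspected to be variant.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [mismatches, total]
    for reported_q, cycle, is_mismatch in observations:
        bucket = counts[(reported_q, cycle)]
        bucket[0] += int(is_mismatch)
        bucket[1] += 1
    table = {}
    for category, (mismatches, total) in counts.items():
        error = (mismatches + 1) / (total + 2)  # add-one smoothing (assumption)
        table[category] = -10 * math.log10(error)
    return table

# 998 bases reported as Q30 at cycle 1, of which 1 mismatches the reference
obs = [(30, 1, False)] * 997 + [(30, 1, True)]
table = recalibration_table(obs)
print(round(table[(30, 1)], 1))
```

In this invented example the bases reported as Q30 have an empirical quality of about Q27, so their scores would be adjusted downward; in real data the adjustment can go in either direction depending on the category.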
Variant Annotation.--Variants are annotated for further analysis after the initial and refined alignments are complete. ANNOVAR (56,57) and snpEff (58) may both be used for variant annotation and are designed to run on Linux or UNIX platforms. Annotated variant information is customizable and can include base change from reference; zygosity; location of the change within the gene (ie, exon, intron, splicing); genomic, complementary DNA, or protein position (ie, g./c./p. numbers); and variant classification (ie, synonymous, nonsynonymous, missense, indel). Other parameters often included in VCF files, such as read coverage depth, mapping quality, and base quality scores, can also be included in annotation analysis. Structural variants including large indels (greater than approximately 40 to 50 bp) are difficult to identify in exome data unless all break points are in well-covered regions. Structural variant detection requires additional specialized software and is generally prone to much higher miscall rates than SNP and small indel calling software. (59-64) Accurate identification of variants remains a challenge with exome and genome NGS data owing to the difficulty of mapping short reads and limitations of current bioinformatics algorithms. Quality control metrics for variant accuracy can be developed by confirming variants with an alternative method such as Sanger sequencing. At the exome and genome scale, where tens of thousands to millions of variants are called, respectively, one measure of accuracy is to compare NGS variants to those generated by SNP microarrays. Concordance between NGS and SNP arrays is approximately 98% to 99% in exome and genome data (22,35,65,66) (our unpublished results). In addition to serving as a quality control metric, genomic microarray data can aid in candidate gene discovery and will be further discussed below.
Infrastructure and Time Requirements for Exome and Genome Data Analysis.--Processing the raw data files generated by NGS into VCF files with high-quality variants requires personnel with bioinformatics expertise and computing power beyond the capacity of standard personal computers. In the authors' setting, conversion of raw sequencing files into FASTQ files, performance of initial and refined alignments, and variant calling and annotation are performed on workstations configured with 2X6 central processing units with 96 gigabytes (GB) random access memory and 3.33 GHz processor speed. Converting exome and genome data into FASTQ files takes 1 hour or 8 hours, respectively, using 24 threads on these workstations. Initial and refined alignments of exome data require approximately 12 hours and of genome data, approximately 3 days, while variant calling for exomes and genomes takes 3 or 24 hours, respectively. The time required for variant annotation is variable (less than or greater than 1 hour depending on the number of annotation parameters). To reduce processing time, we are currently supplanting individual workstations with a dedicated server and storage space housed at our university's Center for High Performance Computing. This will allow us to process our data on a 16-blade cluster with 2 interactive nodes and 14 compute nodes. The interactive nodes have 24 GB of memory, the compute nodes have 48 GB, and all 16 nodes have 500 GB local hard disk memory and 2.8 GHz processor speed. Our clinical research group includes 2 full-time PhD-level scientists, 2 full-time bioinformaticians, and a technician, who work in a coordinated fashion to generate, process, and analyze exome and genome data.
Technical and Bioinformatics Considerations for Exome and Genome Sequencing
Read Coverage in Exome and Genome Sequencing.--In genome sequencing, the coverage (or number of aligned reads) for each sequenced base is lower than in exome sequencing. Current average coverage across the exome is 100- to 200-fold, assuming 2 exomes are indexed, pooled, and run in a single lane of the Illumina HiSeq 2000 with v3 chemistry. By comparison, the average coverage of a human genome is 30-fold when 1 genome is run in 4 lanes of a flow cell. These coverage differences can translate to different sets of variants called on the same sample. Clark et al (22) (2011) compared genome and exome results from the same individual by using 3 different exome capture reagents, followed by Illumina sequencing. When they restricted their analyses of the genome sequence data set to the captured regions of the exome data sets for Agilent, NimbleGen, and Illumina exome capture probes, they observed that 35 448; 30 097; and 42 633 variants, respectively, were called in common. Additional findings were that (1) variants called in exome sequencing can be missed in genome sequencing owing to lower coverage and base quality scores in the genome data and (2) variants unique to genome sequencing that were targeted in the exome capture, but had low to zero read coverage, were due to enrichment failure at these positions. (22) The authors also found that average variant quality scores for variants in the exome data set were higher than in the genome data set, noting that the average coverage of the exome data was 2 to 3 times that of the genome data. The work of Clark et al (22) highlights the idea that variant quality scores are impacted by read coverage and relates to a study in 2011 by Ajay et al, (67) who sequenced a human genome to an average read coverage of 102-fold with Illumina chemistry. 
They used this data set to determine metrics for accurate variant calling and suggested that sequencing efficiency be based on the portion of the genome in which variant calls can be determined robustly, or the "callable portion." Parameters used to determine the callable portion of the genome included base quality scores, mapping quality scores, and confidence scores as defined by the variant's quality relative to its read depth (or coverage) measurement. Using Illumina Genome Analyzer and HiSeq chemistry available in early 2010, they determined that an average coverage depth of 50X was required to accurately call genotypes for approximately 94% of the callable genome; with 30X average read coverage, approximately 90% of the genome was callable. (67) Here, accuracy of the variants called in the callable portion of the genome was determined by concordance with genotype calls from array data. Importantly, they showed that as the coverage uniformity increased with ever-improving sequencing chemistries and software, the average mapping depth necessary to accurately call 95% of the callable genome decreased. (67) Extending these findings to diagnostic translation implies that it will be critical to empirically determine read coverage requirements for a given sequencing platform to achieve accurate variant calling. Samples for exome sequencing can be indexed or "bar coded" during library preparation by including unique sequences into adapter oligonucleotides, which allows multiple libraries to be sequenced on the same flow cell lane. Reads generated from indexed and pooled samples are assigned to their sample of origin by subsequent bioinformatics deconvolution. The number of samples pooled for indexing strategies should take into account the resulting reduction in per sample average read coverage, as well as the effect of areas of low read coverage, which can preclude accurate variant calling.
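The bioinformatics deconvolution of indexed, pooled reads can be sketched in Python as a lookup of each read's index sequence against the set of indexes used in the lane. This is an illustrative sketch, not the vendor's demultiplexing software: the index sequences, sample names, and the tolerance of 1 mismatch are assumptions chosen for the example.

```python
def demultiplex(reads, index_to_sample, max_mismatches=1):
    """Assign reads to samples by index sequence.

    reads: iterable of (index_seq, read_seq) pairs.
    index_to_sample: dict mapping each index sequence to a sample name.
    Reads matching no index, or more than one, go to 'undetermined'.
    """
    bins = {sample: [] for sample in index_to_sample.values()}
    bins["undetermined"] = []
    for index_seq, read_seq in reads:
        matches = [
            sample for idx, sample in index_to_sample.items()
            if len(idx) == len(index_seq)
            and sum(a != b for a, b in zip(idx, index_seq)) <= max_mismatches
        ]
        target = matches[0] if len(matches) == 1 else "undetermined"
        bins[target].append(read_seq)
    return bins

indexes = {"ATCACG": "exome_A", "CGATGT": "exome_B"}  # illustrative values
pooled = [("ATCACG", "AAA"), ("CGATGA", "CCC"), ("GGGGGG", "TTT")]
bins = demultiplex(pooled, indexes)
print({sample: len(seqs) for sample, seqs in bins.items()})
```

Because every pooled sample shares the lane's output, the per-sample read count (and hence average coverage) falls roughly in proportion to the number of samples pooled.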
Exome Versus Genome Data: Area Sequenced.--As there is no target enrichment in genome sequencing sample preparation, the data generated by this approach include coding, intronic, untranslated regions, and intergenic regions, while exome sequence data only include the regions that are captured by probes. The difference between the 2 data sets is readily visualized by comparing genome to exome data (Figure 7). Figure 7, A, depicts a subset of exons in the well-captured RET gene. The genome data are complete across the entire region, while the exome data have reads only in genomic locations that are enriched by the capture probes. In exome data, there are some regions that do not have probes and others that are poorly captured during library preparation. This translates to missing sequence in the final data set, as demonstrated by the ABCF1 gene in Figure 7, B. This gene is incompletely enriched, with only 5 of the 19 exons shown covered by the capture probes. The genome sequencing data, in contrast, are complete in this region. The absence of probes in exome capture reagents is primarily due to the difficulty of designing unique probes for certain coding regions. If genes or exons of interest are not captured or are inadequately enriched by exome capture, then other approaches, such as Sanger sequencing assays, can be designed to cover these genes.
Next-Generation Sequencing Limitations: Repetitive Sequences and GC Bias.--Two important technical limitations in NGS, which impact genome and exome sequencing, are homologous sequences and guanine-cytosine (GC) bias. Highly repetitive sequences, from a few bases to millions of bases, constitute about 50% of the human genome and include interspersed repeats (highly similar sequences that are spatially separated in the genome) and tandem repeats (repeats adjacent to each other). (68) During hybridization of exome capture probes to human genomic DNA, highly homologous (from pseudogenes or gene families) or repetitive (interspersed or tandem repeat) sequences can be cocaptured and, therefore, coenriched along with targets of interest. There are 3 ways that alignment programs can address repetitive or highly homologous sequences: (1) discard reads in the region of the repetitive sequence, (2) align to the region with the fewest mismatches (best match), and (3) report all alignments. (68) The challenge of aligning repetitive or homologous sequences is encountered in both exome and genome sequencing, and the best strategy for correct read alignment would be to generate accurate reads longer than common types of repeats (which can span a few hundred to thousands of bases) in the genome, (68) which is currently not feasible with available NGS platforms (with the exception that the longer read lengths of Pacific Biosciences' (Menlo Park, California) single-molecule sequencing platform can span some types of repetitive regions). To mitigate this hurdle, paired-end sequencing is leveraged by aligners, using mate-pair information to accurately align short reads generated from repeat sequences. (68) Constructing libraries with longer insert sizes and using average read depth differences to detect repeats are 2 other ways to improve alignment accuracy. (68)
Another technical consideration with exome sequencing is that coding regions with high or low GC content are captured less efficiently under single temperature hybridization conditions used in current protocols. GC bias is known to affect the efficiency of PCR and hybridization of oligonucleotide probes (22) and is, therefore, an inherent source of bias for exome capture methods. The difficulty of capturing targets with high or low GC content was described by Clark et al (22) and is inherent to all 3 exome capture platforms. Figure 8, A, shows that GC bias is particularly pronounced for capture of the first exon of human genes. This GC bias translates to an overrepresentation of first exons with low or no coverage in exome data sets. Figure 8, B, illustrates this point and shows an example of a gene, MAZ, that has a probe designed to exon 1; however, the capture efficiency of this exon is greatly reduced compared to that of the other exons in the gene.
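One simple way to anticipate which capture targets may be affected by GC bias is to compute the GC fraction of each target sequence and flag the extremes. The Python sketch below does this; the 35% and 65% cutoffs and the example sequences are arbitrary illustrative values, not thresholds reported in the studies cited above.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)

def flag_gc_extremes(targets, low=0.35, high=0.65):
    """Return names of targets whose GC fraction falls outside [low, high].

    targets: dict mapping a target name to its sequence. The cutoffs are
    assumptions and would need empirical tuning per capture reagent.
    """
    return [name for name, seq in targets.items()
            if not low <= gc_fraction(seq) <= high]

# Invented sequences for illustration
targets = {
    "exon1": "GCGGCCGCGGGC",   # GC rich, as is common for first exons
    "exon2": "ATGCATGCATGC",   # balanced
    "exon3": "ATATTTAATATA",   # AT rich
}
flagged = flag_gc_extremes(targets)
print(flagged)
```

Targets flagged this way are candidates for coverage review or for supplementary assays such as Sanger sequencing.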
EXOME AND GENOME SEQUENCING FOR CANDIDATE GENE DISCOVERY: FROM VARIANT LIST TO CANDIDATE GENE(S)
Once a list of variants has been generated and annotated, the process of identifying candidate gene(s) can begin. With genome sequencing data sets, approximately 3 to 3.5 million positions will differ from the reference sequence, depending on the ethnicity of the subject. In exome sequencing, 15 000 to 20 000 changes from the reference will be observed in coding regions. Whether starting with genome or exome data sets, most investigators initially limit analyses to variants in coding regions and those in close flanking proximity to splice sites, as these are the most "interpretable" portion of the genome. Owing to the enormity of these data sets, bioinformatics tools are required to narrow down variant lists to a small subset of variants in candidate genes. In this evolving area, 2 main categories of bioinformatics approaches are used: heuristic filtering methods and statistical prediction algorithms or a combination thereof. These approaches can be complemented by incorporating array data that can be used to focus the search space for variants to specific chromosomal regions as described below.
Heuristic Filtering Methods
With an annotated variant list, a typical first step toward causative gene discovery in family studies is to apply heuristic filters based on suspected disease inheritance patterns, disease frequency, and assumptions about the candidate variant (Figure 9). (21) In a presumed "rare" inherited disorder, one first assumption is that the causative variant is not represented in public databases such as dbSNP, the 1000 Genomes Project, (69,70) or in-house control databases. When applied as a filter, this initial assumption removes variants present in these databases and typically reduces an exome variant list by about 95%, from approximately 20 000 to approximately 1000 variants. An important consideration with this filter is that some known rare pathogenic variants and more common variants linked to disease by genome-wide association studies are present in dbSNP. As an additional caveat, a subset of the subjects from the 1000 Genomes Project are likely carriers for genetic disease or have genetic diseases of low penetrance or later age of onset. An alternative assumption for filtering is that common variants are nonpathogenic and can be separated into different frequency bins, such as 0% to 1% and 1% to 5%, by minor allele frequency within the context of dbSNP, the 1000 Genomes Project, or in-house control databases. This approach can be problematic in the case of a compound heterozygous set of variants where one variant may be more common in the population than the other. Another option is to remove all variants above a certain minor allele frequency, for example, 5% or greater.
Although most known highly penetrant, disease-causing variants are present at a frequency of less than 1% in the population, deleterious variants can be present at higher frequencies. Evidence for this is shown in a recent study using exons captured from 942 genes in 697 samples. (71) The authors used multiple approaches to investigate the functional spectrum of variants found in different frequency bins with these data. First, they assessed the predicted consequence of the variant on protein function and found that 63% of missense and 78% of nonsense variants were in the less-than-1-percent frequency bin, leaving a proportion of this class of variants in frequency bins higher than 1%. They found a similar pattern when they assessed the consequence of the variant with the impact on protein function prediction algorithms Sorting Intolerant from Tolerant (SIFT) (72,73) and Polymorphism Phenotyping (PolyPhen). (74) Using this metric, they found that 72% of damaging and 63% of possibly damaging variants were found in the less-than-1-percent frequency bin. Therefore, when filtering based on variant frequency, it is important to consider variants with frequencies greater than 1% to avoid missing a causative variant.
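The frequency-based filtering discussed above can be sketched in a few lines of Python. This is a minimal illustration and not a published pipeline: the record layout (a dict with `gene`, `variant`, and `maf` keys), the function names, and the example records are all assumptions made for illustration, and `maf` is set to `None` when a variant is absent from the databases consulted. The bin boundaries and the 5% cutoff are the ones mentioned in the text.

```python
from typing import Optional


def frequency_bin(maf: Optional[float]) -> str:
    """Assign a variant to a frequency bin by minor allele frequency (MAF)."""
    if maf is None:
        return "novel"      # not found in any reference database
    if maf < 0.01:
        return "0%-1%"
    if maf < 0.05:
        return "1%-5%"
    return ">=5%"


def remove_common(variants, max_maf=0.05):
    """Drop variants whose MAF is at or above max_maf; keep novel variants."""
    return [v for v in variants if v["maf"] is None or v["maf"] < max_maf]


# Hypothetical example records (not real data).
variants = [
    {"gene": "GENE_A", "variant": "c.100A>G", "maf": None},
    {"gene": "GENE_B", "variant": "c.200C>T", "maf": 0.004},
    {"gene": "GENE_C", "variant": "c.300G>A", "maf": 0.03},
    {"gene": "GENE_D", "variant": "c.400T>C", "maf": 0.20},
]

kept = remove_common(variants)
bins = {v["gene"]: frequency_bin(v["maf"]) for v in variants}
```

Keeping the 1% to 5% bin rather than discarding everything above 1% reflects the observation that a proportion of predicted damaging variants lies outside the less-than-1-percent bin.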
The next filtering steps serve to further reduce the list of potential candidate variants. Here we present options for next steps in the heuristic filtering approach, but we suggest that the exact steps used for any given family study be determined empirically. General filters include (1) examining only genes previously implicated in the patient's disease phenotype, (2) examining intersects (ie, shared variants) and differences between affected and unaffected individuals, based on pedigree information, (3) incorporating linkage or identity-by-descent information from genomic microarray analyses, and (4) applying filters on the basis of assumptions about the candidate variant (eg, zygosity, variant classification, or predictions of pathogenicity) (Figure 9). (21) First, genes previously implicated in the patient's disorder can be evaluated initially for causal variants. If the list is not very extensive, this can be done by manual read inspection in a viewer such as the Integrative Genomics Viewer, with examination of variants that fit the postulated inheritance pattern. In the case of a disorder with many known or candidate genes of interest, it may be more efficient to convert the list into a genome browser track that can be used during filtering to specifically highlight variants in these genes. Second, intersect and difference filters are useful if multiple affected and/or unaffected individuals are sequenced. When applying these filters, it is important to consider the zygosity of the variant and the suspected inheritance pattern before removing variants at identical chromosomal positions between individuals. For example, in a family study with a suspected recessive inheritance pattern, other unaffected family members may be carriers for the disorder; therefore, removing variants at identical chromosomal positions could result in removing the causative variant from the data set.
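The intersect and difference filters, and the zygosity caveat for recessive disorders, can be illustrated with the following sketch. Everything in it is an assumption made for illustration: each individual is represented as a dict mapping a hypothetical `(chromosome, position, alternate allele)` key to a genotype string (`"het"` or `"hom"`), the function names are invented, and the dominant model assumes full penetrance.

```python
def shared_variants(affected):
    """Intersect filter: variant keys present in every affected individual."""
    keys = set(affected[0])
    for calls in affected[1:]:
        keys &= set(calls)
    return keys


def exclude_by_unaffected(candidates, unaffected, model):
    """Difference filter that respects the suspected inheritance model.

    dominant (full penetrance assumed): drop any variant seen in an
    unaffected individual.
    recessive: drop a variant only if an unaffected individual is homozygous
    for it, because unaffected relatives may be heterozygous carriers.
    """
    kept = set()
    for key in candidates:
        genotypes = [calls[key] for calls in unaffected if key in calls]
        if model == "dominant":
            excluded = len(genotypes) > 0
        elif model == "recessive":
            excluded = "hom" in genotypes
        else:
            raise ValueError(f"unknown model: {model}")
        if not excluded:
            kept.add(key)
    return kept


# Hypothetical genotype calls (not real data).
affected_1 = {("chr1", 1000, "T"): "hom", ("chr2", 500, "G"): "het", ("chr3", 42, "A"): "hom"}
affected_2 = {("chr1", 1000, "T"): "hom", ("chr2", 500, "G"): "het"}
parent = {("chr1", 1000, "T"): "het", ("chr2", 500, "G"): "het"}

shared = shared_variants([affected_1, affected_2])
recessive_kept = exclude_by_unaffected(shared, [parent], "recessive")
dominant_kept = exclude_by_unaffected(shared, [parent], "dominant")
```

In this toy family, a naive difference filter (the dominant branch) would discard the variant for which both affected siblings are homozygous and the parent is a carrier, which is exactly the pitfall described above.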
In family studies, identifying regions with structural variation, copy number variation, and regions of loss of heterozygosity shared between affected individuals, which are not present in unaffected individuals, helps to focus the search for causative variants to these regions. In a useful strategy, genomic microarray data can be coupled with exome or genome data to define these regions and to identify locations of identity-by-descent information in family studies. (75) Shared genomic segment analysis (or haplotype phasing) can also be done with either experimental or bioinformatics approaches. (76) An example of a successfully implemented bioinformatics strategy for haplotype phasing leading to candidate gene discovery was demonstrated by Roach et al (77) in 2010. Genome sequence data from a family of 4 with 2 siblings affected by Miller syndrome (Online Mendelian Inheritance in Man [OMIM] No. 263750) and primary ciliary dyskinesia (OMIM No. 608644) were used to computationally determine regions of identical haplotype blocks. By doing so, they reduced their search space to the 22% of the genome identical in the 2 siblings. (77) Assuming a recessive inheritance pattern, they identified compound heterozygous mutations in the coding regions of 4 genes, 2 of which (DHODH and DNAH5) were also uncovered in a separate study involving exome sequencing for the 2 affected individuals. (78) In a recent publication (2011), Browning and Browning (76) describe experimental approaches and algorithms for phasing.
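Restricting the search space to shared genomic segments amounts to an interval lookup. The sketch below assumes that shared regions (for example, from microarray-based identity-by-descent analysis) have already been computed and are supplied as hypothetical `(chromosome, start, end)` tuples with inclusive coordinates; the coordinates and names are invented for illustration and do not correspond to any study cited here.

```python
def in_shared_region(chrom, pos, regions):
    """Return True if a position falls inside any shared region (inclusive)."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


# Hypothetical shared regions and variant positions (not real data).
regions = [("chr1", 1_000, 2_000), ("chr5", 10_000, 15_000)]
variants = [("chr1", 1_500), ("chr1", 2_500), ("chr5", 10_000), ("chr7", 1_500)]

focused = [v for v in variants if in_shared_region(v[0], v[1], regions)]
```

A linear scan is adequate for a handful of regions; for many regions, a sorted list per chromosome with binary search would be a natural refinement.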
Finally, assumptions about the characteristics of the candidate gene and causal variant can be used for further filtering. One assumption is that the causative variant likely causes a change on the protein level, so changes such as nonsense, missense, splicing, and frameshift variants are prioritized. The presumed inheritance pattern of the disorder can also be considered to prioritize either homozygous or heterozygous variants. For example, in a recessive disorder, the causative variant is either homozygous or compound heterozygous, such that genes harboring single heterozygous mutations can be removed. Another assumption is that the causative variant will have a functional effect and is more likely to occur in a conserved versus variable gene region. To assess the functional effect of a variant, prediction programs such as SIFT, Genomic Evolutionary Rate Profiling, (79,80) and PolyPhen are often used and the results incorporated into the filtering and/or prioritization process. Once a candidate gene has been identified, cross-referencing the literature and consulting databases, including the Human Gene Mutation Database, (81) OMIM, (82) and locus-specific databases (eg, http://www.arup.utah.edu/database/; accessed May 9, 2012), may reveal a previously described genotype-phenotype correlation.
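The recessive zygosity filter described above (retain genes with a homozygous variant or with 2 or more heterozygous variants, and remove genes with a single heterozygous variant) can be sketched as follows. The input layout, a list of hypothetical `(gene, zygosity)` pairs, and the function name are assumptions made for illustration. Note that 2 heterozygous variants in one gene are only candidate compound heterozygotes: without phasing or parental genotypes, they could lie on the same chromosome.

```python
from collections import defaultdict


def recessive_candidate_genes(variants):
    """Genes with >=1 homozygous variant or >=2 heterozygous variants."""
    het_counts = defaultdict(int)
    hom_genes = set()
    for gene, zygosity in variants:
        if zygosity == "hom":
            hom_genes.add(gene)
        elif zygosity == "het":
            het_counts[gene] += 1
    compound_het_genes = {g for g, n in het_counts.items() if n >= 2}
    return sorted(hom_genes | compound_het_genes)


# Hypothetical calls for one affected individual (not real data).
calls = [
    ("GENE_A", "het"),                    # single heterozygous variant: removed
    ("GENE_B", "het"), ("GENE_B", "het"),  # candidate compound heterozygote
    ("GENE_C", "hom"),                    # homozygous
]

candidates = recessive_candidate_genes(calls)
```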
Exome sequencing approaches to uncovering pathogenic variants in a research setting have shown considerable utility, as evidenced by successes published in the literature for a variety of primarily Mendelian disorders with recessive, dominant, and de novo inheritance patterns. A heuristic filtering approach was used to identify causative variants in a familial case of the fatal neurodegenerative disease amyotrophic lateral sclerosis (OMIM No. 105400) with an autosomal dominant inheritance pattern. (83) Exome sequencing was performed in 2 affected individuals from this family, and variants in dbSNP and the 1000 Genomes database were removed, followed by intersection of the remaining variants. This resulted in 1978 shared variants and indels. After removal of synonymous, noncoding SNPs and noncoding indels, the authors Sanger sequenced 75 heterozygous SNPs and 13 heterozygous indels in an unaffected family member and removed shared positions to reduce the list to 24 heterozygous SNPs and 9 heterozygous indels. They then removed variants present in a control database of 200 neurologically normal individuals, which left 6 heterozygous SNPs and no heterozygous indels. Of these, 4 SNPs were predicted to be damaging by SIFT and the authors focused on the VCP gene, which contained a variant previously described in a rare disease with overlapping symptoms with amyotrophic lateral sclerosis. Sequencing VCP in additional cases of amyotrophic lateral sclerosis uncovered 4 additional mutations in this gene that were not present in an extensive set of unaffected controls. VCP encodes an ATPase required for proteasomal degradation but the precise molecular mechanism of disease progression in patients harboring these variants remains to be determined.
In an example of successful implementation of exome sequencing in a single patient, which changed clinical treatment, Worthey et al (84) (2011) conducted exome sequencing on a male child with severe, life-threatening inflammatory bowel disease (OMIM No. 266600). Assuming a recessive inheritance pattern, the authors analyzed 66 genes with compound heterozygous variants and excluded all of them by sequence conservation and frequency. They also analyzed 70 homozygous and hemizygous, nonsynonymous variants and focused on 8 novel, potentially damaging (as predicted by PolyPhen) variants. Of highest priority was a variant in the X chromosome gene XIAP. The XIAP gene is known to be important for programmed cell death and the proinflammatory response. The authors conducted functional studies with patient and control peripheral blood mononuclear cells and demonstrated abnormal XIAP function. The combined molecular and functional results led to a treatment decision wherein the patient underwent an allogeneic cord blood progenitor cell transplant, which, at the time of publication, had resolved the patient's inflammatory bowel disease symptoms.
In another example of successful heuristic filtering using genome sequencing, Bainbridge et al (19) (2011) collected genome sequencing data from 2 fraternal twins affected with the recessive movement disorder dopa-responsive dystonia (OMIM No. 128230). The authors called approximately 2.5 million single-nucleotide variants in each individual with approximately 1.6 million shared between the 2 siblings. After removing variants in dbSNP, 9531 shared variants were identified in the coding region of the genome. Focusing on shared, nonsynonymous variants reduced the list to 4605, with 77 of these variants present at a minor allele frequency of less than 0.5%. The authors then looked for genes with homozygous or 2 or more heterozygous variants and cross-referenced a database of genes known to be involved in dystonia. This approach led to the discovery of compound heterozygous variants in the SPR gene, previously associated with dopa-responsive dystonia. The protein product of this gene is an aldo-keto reductase important for the biosynthesis of BH4, (85) which affects production of both dopamine and serotonin. As a result of the molecular diagnosis, the course of treatment was changed to include both the dopamine precursor (L-dopa) and the serotonin precursor (5-hydroxytryptophan). Treatment with 5-hydroxytryptophan improved both patients' symptoms, with no significant side effects.
Sobreira et al (18) (2010) generated genome sequencing data for 1 individual affected with the autosomal dominant condition metachondromatosis (OMIM No. 156250), which features multiple exostoses mainly involving the hands and feet. The authors used linkage data from other family members to identify 6 shared genomic regions in the affected individuals, then searched the coding sequence of these regions for nonsynonymous and nonsense SNPs and indels. Using this strategy, they identified a causative mutation in the PTPN11 gene, a gene that also contained a nonsense variant in the same exon in a second family member affected with this disease. The PTPN11 gene encodes a signaling molecule in the protein tyrosine phosphatase family known to be involved in multiple cellular processes. (86) These are some examples of successful gene discovery cases from the literature. For an additional perspective, the reader is referred to a recent review by Ku et al (13) (2011) that summarizes a subset of cases from the literature that have successfully used NGS approaches for causative gene discovery.
Statistical Modeling and Prediction Methods
While heuristic filtering methods have proven successful in identifying candidate and causative genes in a growing number of disorders, these are limited in that they do not provide any measure of statistical uncertainty for a given variant or candidate gene. In this context, new candidate gene discovery prediction algorithms are being developed (Figure 9). The Variant Annotation, Analysis, and Selection Tool (VAAST) is one such algorithm. (87) Using a multiparameter likelihood equation, VAAST compares allele frequencies between cases, controls, and background data sets, in conjunction with modeling variant severity by amino acid substitution analysis, to provide a list of variants, each associated with a VAAST ranking score and a P value. The P value is a measure of the probability that a variant is statistically significant in a case as compared to the control data set. The utility of VAAST was shown recently in a publication describing the discovery of a causative variant in a previously uncharacterized, rare, dominant, X-linked Mendelian disorder causing infant boys to have an "aged appearance" and cessation of growth after birth, called Ogden syndrome (OMIM No. 300855). (88) The authors performed exome sequencing of only the X chromosome of affected individuals in 2 unrelated families. In family 1, VAAST was applied with the following assumptions: (1) a dominant inheritance model with incomplete penetrance and (2) frequency of 0.1% or less in control data sets. In family 2, a heuristic filtering model was applied. In both cases, the causative variant was found to reside in the NAA10 gene, which codes for a protein responsible for N-terminal acetylation, a common posttranslational protein modification.
Another approach to predicting causative variants, recently reported by Ionita-Laza et al (89) in 2011, describes a statistical method using a weighted-sum approach, which takes into account "background" variation in genes to avoid having large or highly variable genes in the population rank high on the candidate list, can accommodate related or unrelated data sets, can incorporate linkage or functional data, and uses a computational approach to generate a measure of statistical certainty (P value) for individual genes. They suggest combining the weighted-sum algorithm with heuristic filtering to generate a ranked gene list. As a proof of principle, the authors used exome data sets previously published for Miller syndrome, (78) Freeman-Sheldon syndrome, (90) and Kabuki syndrome (91) and showed that their algorithm predicted the same causative variants and genes that were originally uncovered by heuristic filtering strategies. (89) As statistical modeling algorithms become more versatile and user-friendly, they will increasingly complement or, in some cases, supplant heuristic filtering methods by virtue of their ability to generate ranked gene lists with statistical probability measures.
PROVING CAUSALITY OF CANDIDATE GENES FROM EXOME AND GENOME STUDIES
Genetic Analyses of Candidate Genes
Follow-up genetic and/or functional studies are important to establish causality of a candidate gene with predicted deleterious variants (Figure 9). In some cases, additional laboratory testing of the patient informed by the candidate gene can support causality. There are several different categories of variants that may be uncovered during a search for candidate genes. First, the gene and variant may have been previously associated with the patient phenotype. Second, the gene may have been previously implicated in the disease phenotype, while the variant is novel. Screening for this variant in patients with similar signs and symptoms, along with unaffected controls, can be powerful for establishing causality. Third, the gene may not have been previously implicated in the patient phenotype but is supported by its known biological function. Here it is essential to understand the frequency of the variant in an ethnically matched data set and to screen for variants in the gene in unaffected individuals and patients with similar signs and symptoms.
Functional Analyses of Candidate Genes
Genetic screening can provide strong evidence for causality, but in vitro and in vivo functional studies may also be required. As an example, Otto et al (92) (2010) used a combination of loss of heterozygosity mapping and exon capture of 828 candidate genes for causal variant identification in nephronophthisis-related ciliopathies (NPHP-RC), recessive disorders leading to cystic kidney disease. A novel variant in one of the target genes, SDCCAG8, which encodes a protein component of the human centrosomal proteome, was identified. An elegant set of experiments showed that the SDCCAG8 protein colocalizes with other proteins known to be involved in NPHP-RC in mouse renal epithelial cells and demonstrated a physical interaction between the protein product of this gene and the protein product of another known NPHP-RC-causing gene. They then knocked down SDCCAG8 expression in zebrafish, which resulted in a phenocopy of developmental defects characteristic of knocking down other genes involved in NPHP-RC. They further observed phenotypes characteristic of NPHP-RC upon knockdown of SDCCAG8 expression in mouse renal epithelial cells. Taken together, these data strongly supported a role for SDCCAG8 in the pathogenesis of NPHP-RC.
Although NGS allows the study of human genetics at unprecedented speed and detail, the role of model organisms in following up candidate genes is still of critical importance. A recent perspective article in Nature Reviews Genetics (93) addressed the utility of research with model organisms in the age of NGS technologies. According to one author, 3 reasons to continue using model organisms to complement human genetic studies are the genome resources that have been accumulated for these organisms, the bank of literature available on basic biological processes in these model systems, and the limitations of genetic and functional studies in humans. (93) The communities studying model organisms are using NGS to expand the knowledge base about these organisms, which will increase their utility in follow-up studies on candidate human genes.
TRANSLATING EXOME AND GENOME SEQUENCING INTO THE CLINICAL LABORATORY
The encouraging successes of exome and genome sequencing in identifying candidate genes have resulted in translational efforts to bring these approaches into clinical diagnostics. Although adoption is at an early stage, it is reasonable to project a steady growth in this translational effort. Although tremendous progress has been made in the past 7 years since the initial NGS-based publication, translation of NGS into routine clinical diagnostic practice is faced with multiple challenges. The continuing evolution of NGS poses hurdles for clinical laboratories in terms of validating and maintaining diagnostic tests. Next-generation sequencing commercial vendors have been periodically releasing new versions of chemistries and flow cells and decommissioning prior versions. While improving accuracy and decreasing sequencing costs, these periodic upgrades require that clinical laboratories revalidate their processes. In our experience, the technical workflow of library preparation and performance of sequencing can be accomplished by individuals with high-complexity molecular testing experience. In contrast, identifying individuals with the bioinformatics skill sets needed for analysis of exome and genome scale data sets is a unique challenge. As with platform modifications, there has been a steady stream of bioinformatics innovations for analysis of NGS data sets. Multiple algorithms for alignment, variant calling, and annotation now exist, each with its own strengths and weaknesses. Few comparative studies of these algorithms have been performed, and the most widely used algorithms require knowledge of the Linux or UNIX command line. Customization of open-source software to achieve integration with existing laboratory information systems requires expertise in interfacing software. To accomplish these bioinformatics needs, clinical laboratories are either collaborating with academic bioinformatics groups or hiring individuals with relevant expertise.
An additional resource investment that clinical laboratories face is the computational infrastructure required for processing of exome and genome scale data, including dedicated computational servers with accompanying storage space.
From published literature and our experience, successful candidate gene discovery requires individualized (case-by-case) bioinformatics approaches and can require the application of more than 1 algorithm to generate a manageable list of candidate genes. Up-front decisions are required regarding choice of algorithmic approach based on patient phenotype and, in the setting of family studies, potential modes of inheritance and assumptions regarding penetrance. The resulting candidate gene list may or may not contain genes that have been previously observed or reported in the context of the patient's phenotype. A candidate gene may arise whose known biological function appears plausible with respect to the phenotype. Alternatively, it may be difficult to associate any of the genes on the candidate list with the patient phenotype. The pursuit of genetic screening or functional studies, while necessary for establishing causality, requires a research infrastructure and is not commensurate with real-time diagnostic needs. To facilitate candidate gene discovery, there is a compelling need for additional innovations in bioinformatics tools and expansion of normal and disease-associated databases. At present, there are only a few studies and anecdotal statements that provide insight into the diagnostic sensitivity of exome and genome sequencing for candidate gene discovery. (94,95) The National Institutes of Health Undiagnosed Diseases Program reports that 24% of cases were successfully diagnosed molecularly by using a combination of genomic microarrays and exome and genome sequencing in its first 2 years, whereas rates approaching 50% have been reported by Gilissen et al (95) in 2011.
Interpreting the wealth of data produced by these methods, establishing causality of strong candidate genes, and reporting the results back to the physician and patient are outstanding challenges and topics of active discussion in the medical scientific community. As one example, Jonathan Berg and colleagues (96) published an article in 2011 on interpreting NGS results and what to report to the physician and patient. They suggested separating variants identified by exome and genome sequencing into 3 separate "bins" designated (1) "clinically actionable," for variants linked to disease and which have a clinically established treatment or prevention associated with them; (2) "clinically valid but not directly actionable," for variants that may be clinically valid but are not medically actionable and variants that are associated with conditions for which there is no treatment; and (3) "unknown or no clinical significance," for variants that do not fall into the first 2 bins. These variants would be stored for future reference and could be used in the research setting rather than the clinical setting. Moving forward, we can anticipate additional proposals for variant classification and recommendations for variant reporting from professional organizations.
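As a rough illustration of the 3-bin scheme, the sketch below reduces the decision to 2 yes/no questions. This is a deliberate simplification assumed here for illustration: the flags `clinically_valid` and `actionable`, and the function name, are hypothetical, and in practice assigning a variant to a bin requires expert curation rather than 2 boolean inputs.

```python
def assign_bin(clinically_valid: bool, actionable: bool) -> str:
    """Map two hypothetical curation flags onto the 3 reporting bins."""
    if clinically_valid and actionable:
        return "clinically actionable"
    if clinically_valid:
        return "clinically valid but not directly actionable"
    # Anything not falling into the first 2 bins.
    return "unknown or no clinical significance"


# Hypothetical variants with curated flags (not real data).
curated = {
    "variant_1": (True, True),
    "variant_2": (True, False),
    "variant_3": (False, False),
}
report = {name: assign_bin(*flags) for name, flags in curated.items()}
```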
In summary, NGS has fundamentally impacted and changed biomedical research and is now being translated into the clinical diagnostic realm. Just as Sanger sequencing opened the door for single-gene sequencing and was ultimately leveraged to sequence the first human genome in a multiyear effort, NGS has opened a new door that is allowing increasingly widespread sequencing of human genomes, each being sequenced in a matter of days. As with other new technologies that emerge and are translated into the clinical diagnostic arena, a window of time exists wherein adoption is hampered by the lack of reimbursement until sufficient evidence of utility accrues. The translational process is both an exciting and challenging time for clinical laboratories. Continued evolution of NGS will push the dissemination curve, and laboratorians in conjunction with clinicians will need to collaboratively find paths forward for appropriate clinical use of exome and genome sequencing.
The authors would like to thank Kalyan Mallempati, MS, for running the HiSeq2000 instrument to generate the data shown in this review and Shale Dames, MS, for generating the graphics for Figure 2.
(1.) Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376-380.
(2.) Artuso R, Fallerini C, Dosa L, et al. Advances in Alport syndrome diagnosis using next-generation sequencing. Eur J Hum Genet. 2012;20:50-57.
(3.) Gowrisankar S, Lerner-Ellis JP, Cox S, et al. Evaluation of second-generation sequencing of 19 dilated cardiomyopathy genes for clinical applications. J Mol Diagn. 2010;12(6):818-827.
(4.) Jones MA, Bhide S, Chin E, et al. Targeted polymerase chain reaction-based enrichment and next generation sequencing for diagnostic testing of congenital disorders of glycosylation. Genet Med. 2011;13(11):921-932.
(5.) Vasta V, Ng SB, Turner EH, Shendure J, Hahn SH. Next generation sequence analysis for mitochondrial disorders. Genome Med. 2009;1(10):100.
(6.) Voelkerding KV, Dames S, Durtschi JD. Next generation sequencing for clinical diagnostics--principles and application to targeted resequencing for hypertrophic cardiomyopathy: a paper from the 2009 William Beaumont Hospital Symposium on Molecular Pathology. J Mol Diagn. 2010;12(5):539-551.
(7.) Klee EW, Hoppman-Chaney NL, Ferber MJ. Expanding DNA diagnostic panel testing: is more better? Expert Rev Mol Diagn. 2011;11(7):703-709.
(8.) Proll J, Danzer M, Stabentheiner S, et al. Sequence capture and next generation resequencing of the MHC region highlights potential transplantation determinants in HLA identical haematopoietic stem cell transplantation. DNA Res. 2011;18(4):201-210.
(9.) Holcomb CL, Hoglund B, Anderson MW, et al. A multi-site study using high-resolution HLA genotyping by next generation sequencing. Tissue Antigens. 2011;77(3):206-217.
(10.) Erlich RL, Jia X, Anderson S, et al. Next-generation sequencing for HLA typing of class I loci. BMC Genomics. 2011;12:42.
(11.) Serizawa M, Sekizuka T, Okutani A, et al. Genomewide screening for novel genetic variations associated with ciprofloxacin resistance in Bacillus anthracis. Antimicrob Agents Chemother. 2010;54(7):2787-2792.
(12.) Deshpande NP, Kaakoush NO, Mitchell H, et al. Sequencing and validation of the genome of a Campylobacter concisus reveals intra-species diversity. PLoS One. 2011;6(7):e22170. doi:10.1371/journal.pone.0022170.
(13.) Ku CS, Naidoo N, Pawitan Y. Revisiting Mendelian disorders through exome sequencing. Hum Genet. 2011;129(4):351-370.
(14.) Ross JS, Cronin M. Whole cancer genome sequencing by next-generation methods. Am J Clin Pathol. 2011;136(4):527-539.
(15.) Russnes HG, Navin N, Hicks J, Borresen-Dale AL. Insight into the heterogeneity of breast cancer through next-generation sequencing. J Clin Invest. 2011;121(10):3810-3818.
(16.) Wong KM, Hudson TJ, McPherson JD. Unraveling the genetics of cancer: genome sequencing and beyond. Annu Rev Genomics Hum Genet. 2011;12:407-430.
(17.) Lupski JR, Reid JG, Gonzaga-Jauregui C, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med. 2010;362(13):1181-1191.
(18.) Sobreira NL, Cirulli ET, Avramopoulos D, et al. Whole-genome sequencing of a single proband together with linkage analysis identifies a Mendelian disease gene. PLoS Genet. 2010;6(6):e1000991. doi:10.1371/journal.pgen.1000991.
(19.) Bainbridge MN, Wiszniewski W, Murdock DR, et al. Whole-genome sequencing for optimized patient management. Sci Transl Med. 2011;3(87):87re3.
(20.) Parkinson NJ, Maslau S, Ferneyhough B, et al. Preparation of high-quality next-generation sequencing libraries from picogram quantities of target DNA. Genome Res. 2011;22(1):125-133.
(21.) Coonrod EM, Margraf RL, Voelkerding KV. Translating exome sequencing from research to clinical diagnostics [published online ahead of print December 16, 2011]. Clin Chem Lab Med. doi:10.1515/cclm-2011-0841.
(22.) Clark MJ, Chen R, Lam HY, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011;29(10):908-914.
(23.) Majewski J, Schwartzentruber J, Lalonde E, Montpetit A, Jabado N. What can exome sequencing do for you? [published online ahead of print July 5, 2011]. J Med Genet. doi:10.1136/jmedgenet-2011-100223.
(24.) NCBI Consensus CDS Project. CCDS Database. http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi. Accessed January 15, 2011.
(25.) NCBI Reference Sequence (RefSeq). http://www.ncbi.nlm.nih.gov/projects/RefSeq. Accessed January 15, 2011.
(26.) Griffiths-Jones S, Kozomara A. miRBase. http://www.mirbase.org/index.shtml. Accessed January 15, 2011.
(27.) Wellcome Trust Sanger Institute and The National Human Genome Research Institute. GENCODE. http://www.gencodegenes.org/. Accessed April 4, 2012.
(28.) Wellcome Trust Sanger Institute. Rfam database. http://www.sanger.ac.uk/resources/databases/rfam.html. Accessed January 15, 2011.
(29.) Wellcome Trust Sanger Institute. Vega database. http://vega.sanger.ac.uk/index.html. Accessed April 4, 2012.
(30.) EMBL-EBI and Wellcome Trust Sanger Institute. Ensembl Genome Browser. http://uswest.ensembl.org/index.html. Accessed April 4, 2012.
(31.) Weber M, Lestrade L. Laboratoire de Biologie Moleculaire Eucaryote: snoRNABase. http://www-snorna.biotoul.fr/. Accessed April 4, 2012.
(32.) Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin Chem. 2009;55(4):641-658.
(33.) Ansorge WJ. Next-generation DNA sequencing techniques. N Biotechnol. 2009;25(4):195-203.
(34.) Bentley DR. Whole-genome re-sequencing. Curr Opin Genet Dev. 2006;16(6):545-552.
(35.) Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53-59.
(36.) Metzker ML. Sequencing technologies--the next generation. Nat Rev Genet. 2010;11(1):31-46.
(37.) Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet. 2011;52(4):413-435.
(38.) Suzuki S, Ono N, Furusawa C, Ying BW, Yomo T. Comparison of sequence reads obtained from three next-generation sequencing platforms. PLoS One. 2011;6(5):e19534. doi:10.1371/journal.pone.0019534.
(39.) Ledergerber C, Dessimoz C. Base-calling for next-generation sequencing platforms. Brief Bioinform. 2011;12(5):489-497.
(40.) Ewing B, Green P. Base-calling of automated sequencer traces using phred. II: error probabilities. Genome Res. 1998;8(3):186-194.
(41.) Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I: accuracy assessment. Genome Res. 1998;8(3):175-185.
(42.) Burrows-Wheeler Aligner download. sourceforge Web site. http://bio-bwa.sourceforge.net. Accessed January 15, 2011.
(43.) Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-1760.
(44.) Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589-595.
(45.) Novoalign download. Novocraft Web site. http://www.novocraft.com/mail/index.php. Accessed April 4, 2012.
(46.) Integrative Genomics Viewer. Broad Institute Web site. http://www.broadinstitute.org/igv. Accessed January 15, 2011.
(47.) Robinson JT, Thorvaldsdottir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24-26.
(48.) Genome Analysis Toolkit. Broad Institute Web site. http://www.broadinstitute.org/gsa/wiki/index.php/The_Genone_Analysis_Toolkit. Accessed April 4, 2012.
(49.) DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491-498.
(50.) McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-1303.
(51.) SAMtools. sourceforge Web site. http://samtools.sourceforge.net. Accessed April 4, 2012.
(52.) Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-2079.
(53.) Picard: sourceforge.net. sourceforge Web site. http://picard.sourceforge.net/. Accessed April 4, 2011.
(54.) Base quality score recalibration. Broad Institute Web site. http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration. Accessed January 20, 2011.
(55.) dbSNP. National Center for Biotechnology Information Web site. http://www.ncbi.nlm.nih.gov/projects/SNP/. Accessed April 4, 2012.
(56.) ANNOVAR. Open Bioinformatics Web site. http://www.openbioinformatics.org/annovar/. Accessed April 4, 2012.
(57.) Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164. doi:10.1093/nar/gkq603.
(58.) snpEff. Broad Institute Web site. http://www.broadinstitute.org/gsa/wiki/index.php/Adding_Genomic_Annotations_Using_SnpEff_and_VariantAnnotator. Accessed January 21, 2011.
(59.) Chen K, Wallis JW, McLellan MD, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677-681.
(60.) Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009;19(7):1270-1278.
(61.) Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453(7191):56-64.
(62.) Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318(5849):420-426.
(63.) Sindi S, Helman E, Bashir A, Raphael BJ. A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009;25(12):i222-i230.
(64.) Wang J, Mullighan CG, Easton J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods. 2011;8(8):652-654.
(65.) Lyon GJ, Jiang T, Van Wijk R, et al. Exome sequencing and unrelated findings in the context of complex disease research: ethical and clinical implications. Discov Med. 2011;12(62):41-55.
(66.) Pushkarev D, Neff NF, Quake SR. Single-molecule sequencing of an individual human genome. Nat Biotechnol. 2009;27(9):847-850.
(67.) Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH. Accurate and comprehensive sequencing of personal genomes. Genome Res. 2011;21(9):1498-1505.
(68.) Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2011;13(1):36-46.
(69.) 1000 Genomes Project. http://www.1000genomes.org/. Accessed April 4, 2011.
(70.) 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061-1073.
(71.) Marth GT, Yu F, Indap AR, et al. The functional spectrum of low-frequency coding variation. Genome Biol. 2011;12(9):R84.
(72.) SIFT. J. Craig Venter Institute. http://sift.jcvi.org/. Accessed April 4, 2011.
(73.) Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11(5):863-874.
(74.) Sunyaev S, Ramensky V, Koch I, Lathe W III, Kondrashov AS, Bork P. Prediction of deleterious human alleles. Hum Mol Genet. 2001;10(6):591-597.
(75.) Rodelsperger C, Krawitz P, Bauer S, et al. Identity-by-descent filtering of exome sequence data for disease-gene identification in autosomal recessive disorders. Bioinformatics. 2011;27(6):829-836.
(76.) Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nat Rev Genet. 2011;12(10):703-714.
(77.) Roach JC, Glusman G, Smit AF, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636-639.
(78.) Ng SB, Buckingham KJ, Lee C, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42(1):30-35.
(79.) GERP. Sidow Lab at Stanford University Web site. http://mendel.stanford.edu/SidowLab/downloads/gerp/. Accessed April 4, 2012.
(80.) Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010;6(12):e1001025. doi:10.1371/journal.pcbi.1001025.
(81.) The Human Gene Mutation Database. Institute of Medical Genetics in Cardiff. http://www.hgmd.org/. Accessed April 4, 2012.
(82.) Online Mendelian Inheritance in Man. National Center for Biotechnology Information Web site. http://www.ncbi.nlm.nih.gov/omim. Accessed January 15, 2011.
(83.) Johnson JO, Mandrioli J, Benatar M, et al. Exome sequencing reveals VCP mutations as a cause of familial ALS. Neuron. 2010;68(5):857-864.
(84.) Worthey EA, Mayer AN, Syverson GD, et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med. 2011;13(3):255-262.
(85.) SPR gene entry. National Center for Biotechnology Information Web site. http://www.ncbi.nlm.nih.gov/gene/6697. Accessed February 1, 2011.
(86.) PTPN11 gene entry. National Center for Biotechnology Information Web site. http://www.ncbi.nlm.nih.gov/gene/5781. Accessed February 2, 2011.
(87.) Yandell M, Huff C, Hu H, et al. A probabilistic disease-gene finder for personal genomes [published online ahead of print June 23, 2011]. Genome Res. doi:10.1101/gr.123158.111.
(88.) Rope AF, Wang K, Evjenth R, et al. Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. Am J Hum Genet. 2011;89(1):28-43.
(89.) Ionita-Laza I, Makarov V, Yoon S, et al. Finding disease variants in Mendelian disorders by using sequence data: methods and applications. Am J Hum Genet. 2011;89(6):701-712.
(90.) Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461(7261):272-276.
(91.) Ng SB, Bigham AW, Buckingham KJ, et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet. 2010;42(9):790-793.
(92.) Otto EA, Hurd TW, Airik R, et al. Candidate exome capture identifies mutation of SDCCAG8 as the cause of a retinal-renal ciliopathy. Nat Genet. 2010;42(10):840-850.
(93.) Aitman TJ, Boone C, Churchill GA, Hengartner MO, Mackay TF, Stemple DL. The future of model organisms in human disease research. Nat Rev Genet. 2011;12(8):575-582.
(94.) Biesecker LG, Mullikin JC, Facio FM, et al. The ClinSeq Project: piloting large-scale genome sequencing for research in genomic medicine. Genome Res. 2009;19(9):1665-1674.
(95.) Gilissen C, Hoischen A, Brunner HG, Veltman JA. Unlocking Mendelian disease using exome sequencing. Genome Biol. 2011;12(9):228.
(96.) Berg JS, Khoury MJ, Evans JP. Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time. Genet Med. 2011;13(6):499-504.
Emily M. Coonrod, PhD; Jacob D. Durtschi, BS; Rebecca L. Margraf, PhD; Karl V. Voelkerding, MD
Accepted for publication May 14, 2012.
From Research and Development, ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, Utah (Drs Coonrod, Margraf, and Voelkerding and Mr Durtschi); and the Department of Pathology, University of Utah School of Medicine, Salt Lake City (Dr Voelkerding).
The authors have no relevant financial interest in the products or companies described in this article.
Reprints: Karl V. Voelkerding, MD, ARUP Institute for Clinical and Experimental Pathology, 500 Chipeta Way, Salt Lake City, UT 84108 (e-mail: firstname.lastname@example.org).
|Author:||Coonrod, Emily M.; Durtschi, Jacob D.; Margraf, Rebecca L.; Voelkerding, Karl V.|
|Publication:||Archives of Pathology & Laboratory Medicine|
|Date:||Mar 1, 2013|