Evaluation of oligonucleotide sequence capture arrays and comparison of next-generation sequencing platforms for use in molecular diagnostics.
As the demand for multigene sequencing increases, traditional PCR amplification of individual exons will become difficult, owing to the increasing number of reactions required for each patient. Emerging technologies designed to target and capture specific regions of the genome and enrich for target genes may drive the adoption of NGS technologies for clinical testing, because they offer a method for rapidly developing numerous assays simultaneously [reviewed in (2)]. In fact, several companies now offer complete exome capture reagents that have been reported to effectively enrich for most coding exons within the human genome.
We evaluated the ability of NimbleGen 385K Custom Sequence Capture Arrays (3, 4) to enrich patient samples for every exon in a panel of 22 genes, most of which are associated with hereditary colorectal cancer. Five patient samples previously tested for germline mutations in one of 4 colorectal cancer-related genes [MLH1,  mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli); MSH2, mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli); MSH6, mutS homolog 6 (E. coli); or APC, adenomatous polyposis coli] were selected for library generation. Three samples were repeated on different arrays to study chip reproducibility. Captured sequence libraries were analyzed with both the GS-FLX and the GAII, and a subset of targeted exons was sequenced by the Sanger method to validate the NGS-identified variants.
Materials and Methods
SAMPLE SELECTION AND DNA EXTRACTION
Deidentified clinical samples with documented germline mutations in genes associated with hereditary nonpolyposis colorectal cancer (HNPCC) or polyposis syndromes (MLH1, MSH2, MSH6, or APC) were selected for analysis. DNA was isolated from peripheral blood lymphocytes on the Autopure LS with Puregene® chemistry (Qiagen; http://www.qiagen.com) according to the manufacturer's protocol. Samples were quantified and analyzed for quality on the NanoDrop® ND-1000 spectrophotometer (Thermo Scientific; http://www.thermo.com). Only DNA samples with an A260/A280 ratio ≥1.8 were processed further. All samples were brought to a final concentration of 250 ng/µL in 1× Tris-EDTA (10 mmol/L Tris-HCl, 1 mmol/L EDTA, pH 7.0).
LIBRARY PREPARATION AND HYBRIDIZATION
Twenty micrograms of DNA from each sample was sonicated on the Sonics® Vibra-Cell™ sonicator (Sonics & Materials; http://www.sonicsandmaterials.com). Each sample was evaluated by agarose gel electrophoresis and with an Agilent Bioanalyzer (Agilent Technologies; http://www.agilent.com) to ensure a mean fragment size of 500 bp, as per the NimbleGen Sequence Capture protocol. Samples then underwent library preparation for NimbleGen Sequence Capture according to the manufacturer's protocol. In brief, DNA samples were polished to form blunt-ended fragments, and adapters were ligated onto these fragments. The Agencourt® AMPure® system (Beckman Coulter Genomics; http://www.beckmangenomics.com) was used to remove unligated linkers and small (<200 bp) fragments. The concentration of each sample was measured with the NanoDrop ND-1000 spectrophotometer, and Cot-1 DNA was added to 5 µg of each library to block repetitive sequences. Hybridization Buffer and Hybridization Component A (Roche NimbleGen) were added to each sample, and the entire volume was applied to each NimbleGen Custom 385K array. The capture arrays were hybridized at 42 °C for 68-72 h.
On completion of hybridization, each slide was washed according to the manufacturer's protocol. The bound library was eluted in 1.2 mL water at 95 °C. Samples were dried down with the Savant® DNA 120 Speed-Vac Concentrator (Thermo Scientific) and resuspended in 320 µL water. We then performed ligation-mediated PCR (LMPCR) with primers complementary to the adapter sequences to amplify enriched sequences and analyzed post-LMPCR samples with the NanoDrop ND-1000 spectrophotometer to ensure that at least 5 µg of total DNA was available for hybridization.
QUANTITATIVE FLUORESCENCE PCR
Quantitative fluorescence PCR with the LightCycler® 480 Real-Time PCR System (Roche Applied Science; http://www.roche-applied-science.com) was performed on pre- and postenriched libraries to calculate the -fold enrichment at several loci. Exon-specific forward and reverse oligonucleotides (see Table 1 in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol56/issue8), Universal ProbeLibrary probes, and LightCycler 480 Probes Master mix (Roche Applied Science) were used according to the manufacturer's protocol. In addition, the housekeeping gene ACTB (actin, beta) was assessed as a negative selection control. At least 500-fold enrichment of the subset of targeted loci was required before proceeding with NGS.
ROCHE 454 SEQUENCING
Sequencing on the GS-FLX was performed off site by 454 Life Sciences with 5 µg of the enriched libraries. Generation of single-stranded template and emulsion PCR were performed according to the manufacturer's protocol (5) with the standard sequencing kit (LR70), and 4 samples were applied to each picotiter plate.
ILLUMINA GAII SEQUENCING
Because of the short read length generated by the GAII (36-base read protocol), each library was digested with BamHI, which removed all but 7 bases of the adapter sequences. Each library (3 µg) then underwent library preparation for GAII sequencing, with the manufacturer's protocol slightly modified to start with a 500-bp gel-excised fragment instead of a 200-bp fragment so that NimbleGen-captured libraries could transition directly into GAII library preparation. We also eliminated the nebulization step, because our NimbleGen libraries were originally generated from sonicated genomic DNA. To each flow cell, 3.5 pmol of 1 sample was added, and cluster generation was performed according to the manufacturer's protocol. Sequencing was performed for 36 cycles with the sequencing-by-synthesis Illumina Sequencing Kit v1.
VALIDATION OF MUTATIONS AND POLYMORPHISMS BY SANGER SEQUENCING
Primers were designed to amplify each exon, including 20-50 bp of flanking intronic sequence, of MLH1, MSH2, MSH6, and the coding region of the APC gene. All primers were 5' tagged with universal primer sequences (sequences available upon request). Samples were prepared for fluorescence sequencing on the ABI 3730 DNA Analyzer with BigDye® Terminator v1.1 chemistry and the BigDye XTerminator™ Purification Kit (Applied Biosystems; http://www.appliedbiosystems.com). Sequence data were analyzed with Mutation Surveyor® v2.41 software (SoftGenetics; http://www.softgenetics.com).
To calculate the -fold enrichment for each locus, we used the expression $E^{\Delta Cp}$, where $E$ is the efficiency of amplification and $Cp$ is the point at which the generated fluorescence signal rises above that of the background. $\Delta Cp$ was calculated by subtracting the $Cp$ of the captured library from that of the noncaptured library. We used an $E$ value of 2 (maximum PCR efficiency) for all calculations of -fold enrichment.
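The -fold enrichment expression described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the analysis code used in the study; the function name `fold_enrichment` and the Cp values in the example are hypothetical.

```python
def fold_enrichment(cp_noncaptured: float, cp_captured: float,
                    efficiency: float = 2.0) -> float:
    """Return E ** (delta Cp), with delta Cp = Cp(noncaptured) - Cp(captured).

    efficiency defaults to 2, the maximum PCR efficiency assumed in the text.
    """
    delta_cp = cp_noncaptured - cp_captured
    return efficiency ** delta_cp


# Hypothetical example: the captured library crosses background
# 10 cycles earlier than the noncaptured library.
enrichment = fold_enrichment(cp_noncaptured=30.0, cp_captured=20.0)
print(enrichment)  # 1024.0
```

With an assumed efficiency of 2, a difference of 10 cycles corresponds to 1024-fold enrichment, which would pass the 500-fold requirement stated in the methods. Because real amplification efficiency is at most 2, values computed this way are best read as upper estimates.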
GS-FLX sequence reads were aligned to the human genome reference sequence (Build 36, Release 49) with the Roche runMapping v.2.0.1 tool (the runMapping script is part of the Roche 454 software package; 454 Life Sciences; http://www.454.com), with the -nrm flags and -reg option to restrict output to regions defined by the NimbleGen capture *.gff file. Run statistics were extracted from the NewblerMetrics.txt file, and the mean read length was computed by dividing the total number of bases sequenced by the total number of reads sequenced. Events were extracted from the HCDiffs file, and the positional depth of coverage was obtained from the AlignmentInfo.tsv file. The Integrative Genomics Viewer version 1.3 (Broad Institute; http://www.broad.mit.edu/igv) was used with the custom track feature to visualize sequence depth of coverage and positional event mapping. The percent G+C track was computed by means of a moving average over a 1000-bp window.
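A moving-average G+C computation of the kind described above might look like the following sketch. The source does not state how windows were positioned or how sequence ends were handled, so the hypothetical function `gc_moving_average` simply reports the G+C fraction of every full window; a short window is used in the example for readability.

```python
from typing import List


def gc_moving_average(seq: str, window: int = 1000) -> List[float]:
    """G+C fraction of each full window of length `window`, sliding by 1 base."""
    seq = seq.upper()
    if window <= 0 or len(seq) < window:
        return []
    is_gc = [1 if base in "GC" else 0 for base in seq]
    count = sum(is_gc[:window])          # G+C count in the first window
    fractions = [count / window]
    for i in range(window, len(seq)):
        # Slide the window: add the entering base, drop the leaving base.
        count += is_gc[i] - is_gc[i - window]
        fractions.append(count / window)
    return fractions


# Toy sequence with a 4-base window (the study used 1000 bp).
print(gc_moving_average("GGCCAATT", window=4))  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

Multiplying each value by 100 gives a percent G+C track.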
GAII sequences were extracted from .tiff images with the Firecrest and Bustard tools from the Illumina Genome Analyzer Pipeline (version 1.0; http://www.illumina.com). We aligned these sequence reads to the human genome reference sequence (Build 36, Release 49) with MAQ version 0.7.1 (6) and discarded any reads that had multiple best-match alignments. The first 7 bases of each read were trimmed before the alignment to account for residual adapter sequences remaining after the restriction digest. Base substitutions were called by MAQ with a quality value threshold of 20. SOAP version 1.05 (7) was used on reads not mapped by MAQ to detect small insertions and deletions. Perl and shell scripts were used for converting formats, parsing files, and comparing results across sequencing platforms.
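The adapter-trimming step can be illustrated with a short Python sketch. The authors report using Perl and shell scripts, so this is not their implementation; the function name `trim_adapter_prefix` and the example read are hypothetical.

```python
from typing import Tuple

RESIDUAL_ADAPTER_BASES = 7  # bases left on each read after the BamHI digest


def trim_adapter_prefix(seq: str, qual: str,
                        n: int = RESIDUAL_ADAPTER_BASES) -> Tuple[str, str]:
    """Drop the first n bases and their quality characters from one read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n:], qual[n:]


# Hypothetical 36-base read with a placeholder quality string.
read_seq = "GATCCTA" + "ACGT" * 7 + "A"
read_qual = "I" * len(read_seq)
trimmed_seq, trimmed_qual = trim_adapter_prefix(read_seq, read_qual)
print(len(read_seq), len(trimmed_seq))  # 36 29
```

Trimming 7 bases from a 36-base read leaves 29 bases for alignment, which matches the GAII mean read length reported in Table 3.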
All statistical analyses were performed with R. The ANOVA for on-target read mapping used a linear model with sample as a factor and was computed across the 3 sample replicate pairs analyzed on both platforms. Biological variation was associated with the linear model $R^2$ value. Exon depth-of-coverage statistics were based on a positional depth average across the length of the exon. Interplatform Pearson product-moment correlations in exon depth of coverage were calculated with the 7 samples run in common on the 2 platforms. GC content was calculated with EMBOSS explorer's geecee tool (http://emboss.sourceforge.net/apps/cvs/emboss/apps/geecee.html). Sensitivity and positive predictive values were computed by assuming all Sanger-detected events were true positives. Events were totaled across samples, with Sanger-identified events not detected by the GS-FLX or the GAII considered as false negatives. Sensitivity = true positives/(true positives + false negatives); positive predictive value = true positives/(true positives + false positives).
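The two performance metrics defined above translate directly into code. The following Python sketch uses hypothetical function names and hypothetical event counts chosen only to show the arithmetic; the counts are not taken from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """true positives / (true positives + false negatives)."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no Sanger-detected events to evaluate")
    return true_pos / total


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """true positives / (true positives + false positives)."""
    total = true_pos + false_pos
    if total == 0:
        raise ValueError("no NGS-detected events to evaluate")
    return true_pos / total


# Hypothetical counts: 8 Sanger events found by NGS, 2 missed,
# and 2 NGS calls that Sanger sequencing did not confirm.
print(sensitivity(true_pos=8, false_neg=2))                # 0.8
print(positive_predictive_value(true_pos=8, false_pos=2))  # 0.8
```

Specificity is deliberately absent: it needs a true-negative count, which cannot be established when Sanger sequencing is the only reference.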
The samples selected for library preparation and sequencing on the NGS platforms were chosen to have different types of HNPCC-causing point mutations (samples 1-4), including point mutations, small insertions, and small deletions (Table 1). We chose samples with these particular types of mutations to evaluate the ability of each sequencing platform to identify different types of sequence variation. Duplicate libraries were prepared for all samples except sample 1 (no point mutations), for which there was insufficient DNA to prepare 2 libraries. Therefore, an additional sample with no HNPCC-related point mutations was included in the second set of library preparations and hybridizations (samples 5-8).
A custom NimbleGen 385K array was designed to target all exons, including 200 bp of the flanking upstream and downstream intronic sequence, for 22 genes, 21 of which are associated with colorectal cancer. These genes are distributed over a wide size range (6-106 exons). Probes are approximately 60-80 bp in length and tiled across all target regions except repetitive sequences, with significant redundancy between probes. Our custom array targeted 410 exons, incorporating just over 272 000 bases (see Tables 2 and 3 in the online Data Supplement).
To determine the degree of enrichment of our targeted loci before NGS, we calculated the -fold enrichment for each sample by quantitative fluorescence PCR with aliquots of pre-enriched and postenriched libraries. We observed enrichment levels that varied depending on the locus tested. Data from a positive-control locus included on every NimbleGen custom array indicated 917- to 5793-fold enrichment. Probes specific to a subset of our custom targeted loci also exhibited a variable -fold enrichment for 1 exon in different samples (3848- to 14 412-fold for MSH2 exon 15) as well as for different exons in the same sample (sample 2: 3293-fold for APC exon 1 and 11 347-fold for MSH2 exon 15). These results, although variable, indicated that target sequences from certain exons were consistently captured more efficiently than sequences from other exons (Table 2). Despite this variation, these data met minimum enrichment requirements and therefore indicated successful enrichment of at least a subset of the targeted exons.
From each captured sample library, we submitted 5 µg of sample for GS-FLX sequencing and 3 µg for GAII sequencing. As expected, a single run on the GAII produced approximately 80 times more reads per sample than the GS-FLX, and a high proportion of reads were successfully mapped to the human genome. We mapped 18.5%-62.3% of the GS-FLX reads and 21%-58% of the GAII reads to the target region. An ANOVA revealed that 67% of the GS-FLX variation and 77% of the GAII variation in on-target read percentage was attributable to the biological differences modeled (sample-to-sample differences). The remaining 33% (GS-FLX) and 23% (GAII) of the variation can be thought of as technical variation (i.e., between sample replicates) that arose during the experiment. The off-target reads may be explained by nonspecific sequence binding to the oligonucleotides on the sequence capture arrays. The total coverage of all targeted bases at 5× was 89% for the GS-FLX and 92.6% for the GAII; at 20×, the total coverage was 54.6% for the GS-FLX and 86.2% for the GAII (Table 3). The mean (SD) exon depth of coverage was 30 (21.7) for the GS-FLX, with an intrasample correlation coefficient of 0.71. The mean exon depth of coverage for the GAII was 266 (76.4), with an intrasample correlation coefficient of 0.96. Despite the high mean exon read depth and target region coverage, 4 exons [AXIN2 (axin 2) exon 1, KRAS (v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog) exon 1, and PIK3CA (phosphoinositide-3-kinase, catalytic, alpha polypeptide) exons 13 and 14] had no mapped reads. Another 6 exons had mean exon depths of coverage of <5 in all samples on both platforms.
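The coverage figures quoted at 5-fold and 20-fold depth are fractions of targeted bases whose read depth meets a threshold. A minimal Python sketch of that calculation follows; the function name `fraction_covered` and the per-base depths are hypothetical, and the study itself derived positional depth from the aligner output rather than from a list like this.

```python
from typing import Sequence


def fraction_covered(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of targeted positions with read depth >= min_depth."""
    if not depths:
        raise ValueError("no targeted positions supplied")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


# Hypothetical per-base depths for 8 targeted positions.
toy_depths = [0, 3, 5, 12, 20, 40, 7, 25]
print(fraction_covered(toy_depths, 5))   # 0.75
print(fraction_covered(toy_depths, 20))  # 0.375
```

The same positions can look well covered at the lower threshold and poorly covered at the higher one, which is why both values are reported per platform.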
The interplatform correlation coefficient in read depth of coverage across all sequenced exons was 0.14; however, sample pair-specific correlation coefficients for individual genes varied from -0.12 to 0.98, with a median value of 0.69. Variations in depth of coverage may arise from both the capture process and the sequencing methodologies and may reflect such factors as G+C content, target sequence uniqueness, overlap with repeat regions, and the presence of long homopolymers. As an example, Fig. 1A shows platform-specific coverage for BRAF (v-raf murine sarcoma viral oncogene homolog B1). The correlation between the GS-FLX and GAII platforms in mean exon read depth was 0.45 for one sample. A closer examination shows that exon 1 was poorly covered, with a mean coverage of <2 for both the GS-FLX and the GAII. Low coverage in the first exon of genes was common. GS-FLX data had a mean depth of coverage of 9 in first exons and 32 in all other exons; GAII data had a mean depth of coverage of 42 in first exons and 280 in the other exons. High G+C content may explain the low coverage in the first exons of genes, given that the mean G+C content of exon 1 is 0.69, compared with 0.51 in the other exons. A high G+C content may hinder elution from the capture array or LMPCR amplification in later steps; both mechanisms could explain the absence of sequence data. G+C content does not account for all of the coverage variation in exons, however, as illustrated by PIK3CA exons 13 and 14. These exons have low G+C contents of 0.31 and 0.42, respectively, yet no mapped reads were obtained from either the GS-FLX or GAII platform (data not shown). The absence of mapped reads in these 2 exons may be due to the lack of target region uniqueness. Although probes were designed to capture sequences from these regions, both exons have 100% identical matches to off-target regions on chromosome 22, which caused the read-alignment programs to not map reads to these exons.
These examples illustrate the importance of reviewing multiple factors when determining the likelihood of sufficient on-target read coverage in an NGS capture experiment.
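The interplatform comparison rests on the Pearson product-moment correlation of mean exon depths. The study computed these in R; the plain-Python sketch below only illustrates the statistic, and the function name `pearson` and the exon depths are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mean depths for 5 exons of one gene on each platform.
gs_flx_depth = [2.0, 25.0, 31.0, 40.0, 28.0]
gaii_depth = [1.5, 240.0, 300.0, 310.0, 260.0]
r = pearson(gs_flx_depth, gaii_depth)
print(round(r, 2))
```

Because the coefficient is scale free, the roughly 10-fold difference in absolute depth between platforms does not by itself lower the correlation; only differences in the pattern across exons do.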
[FIGURE 1 OMITTED]
The GS-FLX and the GAII identified a large number of sequence variants within our targeted regions with <100% concordance, even between technical replicates (Fig. 1). This result could be due to a number of variables, including efficiency of capture, capture of homologous sequences that cause false-positive findings upon alignment, or artifacts of the sequencing methods themselves. To assess the validity of variants identified by NGS, we sequenced 4 well-characterized genes (MLH1, MSH2, MSH6, and APC) in each sample by the Sanger method. Variants detected by the GS-FLX and GAII methods were then compared with the Sanger sequence variants. If we assume all variants detected by Sanger sequencing are true events, the GS-FLX had a 65% sensitivity and a 50% positive predictive value, and the GAII had a 67% sensitivity and a 62% positive predictive value. We did not calculate specificity because it would be incorrect to assume Sanger sequencing detects all true events; thus, we could not quantify the true-negative rate.
Variants missed by the GS-FLX (false negatives) were often in regions containing poor sequence coverage (insufficient reads) and homopolymers. This led to the failure of the GS-FLX to identify a germline mutation in MSH2 known to cause Lynch syndrome (IVS5+3A>T, samples 2 and 6) (Table 4). Poor capture, inferred from the low read depth at this position, could have contributed to the absence of detection because this mutation was also missed by the GAII in one of the replicates; however, the presence of the homopolymer likely contributed to this result. The remaining 2 germline mutations were identified by GS-FLX sequencing. In another instance, a polymorphism was not called as a high-quality difference because only reverse reads aligned to the reference sequence at that position.
The GAII correctly identified 2 of the 3 germline mutations but, as mentioned above, failed to identify 1 germline mutation in MSH2 owing to poor read coverage. The GAII was also unable to detect several polymorphisms correctly identified by the GS-FLX and the Sanger method. Most of these variants were not reported in our GAII data because of sequences with low quality scores at that position or poor read coverage. We also considered that nearby BamHI restriction sites might have hindered the detection of polymorphisms; however, no such sites were identified near the undetected variants (data not shown). We likewise investigated variants that were found by Sanger sequencing but were not identified by either the GS-FLX or the GAII. Insufficient read depth, in both platforms, was the primary reason these variants were not detected. This result almost certainly reflects the oligonucleotide array capture and not the sequencing platforms (Table 4).
Finally, several variants were identified by NGS technologies and not by the Sanger method; however, these variants all shared several characteristics, suggesting that they are false-positive mutation calls. First, all of these variants were identified by the GS-FLX and not the GAII. Second, with 1 exception in a homopolymer region, they are not known germline mutations or polymorphisms. Third, none were confirmed in the corresponding replicate sample, and, fourth, most were found at low frequencies, although the frequency of 1 variant did approach 36% in homopolymer regions of suitable read depth (Table 5). These observations led us to hypothesize that these variants were likely false-positive calls. The possibility of low-level somatic mosaicism cannot be completely ruled out; however, the lack of reproducibility between replicates and the failure of the GAII to detect these variants suggest false-positive calls as a more likely explanation. To be inclusive, we used a 10% minimum variant read threshold for event detection. A low threshold can lead to increased false-positive detection and affect performance metrics. If a 30% minimum variant read threshold had been required to detect heterozygous alleles, only 1 false-positive call, attributable to the presence of a homopolymer, would have been made. This requirement would have reduced the total true positives detected by 3 but would have led to an overall improvement in GS-FLX performance compared with Sanger sequencing.
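The effect of the minimum variant read threshold can be shown with a small Python sketch. The four calls below are taken from the "Percent with variant" column of Table 5; the function name `filter_calls` and the tuple layout are hypothetical and do not reflect the software used in the study.

```python
from typing import List, Tuple

# (description, percent of reads carrying the variant), values from Table 5.
Call = Tuple[str, float]

table5_subset: List[Call] = [
    ("sample 1 MLH1 IVS16-12", 25.0),
    ("sample 2 APC IVS3+9_+15", 35.7),
    ("sample 2 APC 5068", 10.5),
    ("sample 3 MSH6 3828", 10.0),
]


def filter_calls(calls: List[Call], min_percent: float) -> List[Call]:
    """Keep calls whose variant read percentage meets the threshold."""
    return [call for call in calls if call[1] >= min_percent]


print(len(filter_calls(table5_subset, 10.0)))  # 4
print(filter_calls(table5_subset, 30.0))
# [('sample 2 APC IVS3+9_+15', 35.7)]
```

Raising the threshold from 10% to 30% removes all but the homopolymer-associated call in this subset, which mirrors the trade-off described in the text: fewer false positives at the cost of some true heterozygous calls with low variant read fractions.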
The NGS technologies, as demonstrated by numerous research laboratories and publications, have tremendous potential because of their output capabilities in terms of bases sequenced per run; however, because >99% of a patient's genome currently is not targeted by most clinical laboratory tests, ways of isolating only the desired material are necessary for the use of such methods.
PCR amplification, our current method of choice, has drawbacks in that many reactions are required to sequence an entire gene. Likewise, it is well known that sequence variants located beneath annealed primers can cause allele dropout. The oligonucleotide capture arrays are advantageous because they allow the simultaneous development of a large number of assays in a matter of weeks, compared with the months to possibly years required to design and develop the assays with traditional PCR amplicon-based methods. In addition, this technology does not rely on the use of primers complementary to genomic sequences, thereby eliminating the potential for polymorphisms and rare variants to cause allele dropout.
The oligonucleotide arrays were unable to capture certain exons, however, particularly those high in G+C content and/or those corresponding to repetitive elements. New elution protocols that may aid in the coverage of such regions have been released since these experiments were performed; however, postelution amplification steps may still present an opportunity for G+C bias. Although we observed a reproducible capture profile across multiple samples for most exons, some displayed high variation in capture efficiencies, even between technical replicates of the same sample. In a clinical laboratory, such variation may necessitate the use of backup methods, such as PCR amplification followed by traditional Sanger sequencing, for regions that evade capture.
Likewise, reliance on hybridization technologies also leads to the capture of homologous and/or repetitive sequences, such as pseudogenes. Alignment of these nontarget sequences may lead to the identification of artificial mutations (false positives). Traditional PCR is still the method of choice for dealing with pseudogenes because primers can be designed to take advantage of sequence differences and to amplify only the functional copy of interest. Improved design algorithms that have the potential to improve capture efficiency and specificity are available; we are currently testing such arrays and how their results compare with those generated by this first generation of sequence capture arrays. Likewise, customized design assistance is also available in cases of difficult genomic regions.
Aside from array-based technologies, several capture methods have become commercially available within the past few years. Among these methods are Agilent's SureSelect in-solution hybrid capture method (8), RainDance Technologies' microdroplet PCR-based enrichment (9), and Febit's microfluidics-based HybSelect technology (10). The Agilent and Febit methodologies for sequence capture are similar to NimbleGen arrays in that they rely on hybrid selection, whereas the RainDance methodology is a highly multiplexed PCR reaction. All of these technologies could potentially suffer from one or more of the drawbacks observed with hybrid chip selection; however, all are likely to evolve over time and may transition from useful research tools to clinically useful techniques as they become cheaper and further refined.
With respect to the NGS technologies we evaluated, each has a unique chemistry and therefore different strengths and weaknesses. We confirmed a well-known drawback of the GS-FLX platform with respect to a difficulty in and around homopolymer sequences, but as the output of GS-FLX increases, which has already been observed with the Titanium technology, many read depth-related errors should be corrected. The difficulty with sequencing homopolymers is not likely to be resolved in the near future because of the chemistry used in this process; however, improvements to the mutation detection software in these regions could help.
In general, the GAII overcame poor read depth merely because of the higher number of reads generated by the instrument; however, this platform was not without its own limitations. The short lengths of the reads generated by the GAII make the detection of insertions and deletions difficult because only a few mismatches can be tolerated when aligning to the reference sequence. With the rapidly increasing read lengths and output, however, along with improved bioinformatics methods for short-read data, these limitations will become less of an issue.
In conclusion, we have demonstrated that oligonucleotide arrays for reducing genomic complexity are successful at enriching multiple genomic loci simultaneously but that some exons resist capture. The GS-FLX and GAII platforms have differences in their ability to detect certain types of variants, which must be taken into account; however, as the cost and error rates of these technologies continue to decrease while the output continues to increase, we foresee their implementation for clinical diagnostics in the near future. More experience with all of the platforms is necessary, however, to address the shortcomings we have identified.
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.
Authors' Disclosures of Potential Conflicts of Interest: Upon manuscript submission, all authors completed the Disclosures of Potential Conflict of Interest form. Potential conflicts of interest:
Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: M.J. Ferber, Mayo Clinic.
Expert Testimony: None declared.
Role of Sponsor: The funding organizations played a direct role in the final approval of the manuscript.
Acknowledgments: We acknowledge Bruce Eckloff and Yan Asmann for assistance with GAII sequencing and data analysis, respectively. We are also grateful to Luke Dannenberg and Dan Burgess at Roche/NimbleGen for their help with the sequence capture experiments and for a critical reading of the manuscript. Finally, we thank the Mayo Clinic's Research Computing Facility for their computing support.
(1.) Voelkerding KV, Dames SA, Durtschi JD. Next generation sequencing: from basic research to diagnostics. Clin Chem 2009;55:641-58.
(2.) ten Bosch JR, Grody WW. Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn 2008;10:484-92.
(3.) Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007; 39:1522-7.
(4.) Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007;4:903-5.
(5.) Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376-80.
(6.) Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851-8.
(7.) Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24:713-4.
(8.) Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009;27:182-9.
(9.) Kiss MM, Ortoleva-Donnelly L, Beer NR, Warner J, Bailey CG, Colston BW, et al. High-throughput quantitative polymerase chain reaction in picoliter droplets. Anal Chem 2008; 80:8975-81.
(10.) Bau S, Schracke N, Kranzle M, Wu H, Stahler PF, Hoheisel JD, et al. Targeted next-generation sequencing by specific capture of multiple genomic loci using low-volume microfluidic DNA arrays. Anal Bioanal Chem 2009;393:171-5.
Nicole Hoppman-Chaney,* Lisa M. Peterson, Eric W. Klee, Sumit Middha, Laura K. Courteau, and Matthew J. Ferber
Departments of Laboratory Medicine and Pathology, and Health Science Research, Mayo Clinic, Rochester, MN.
 Nonstandard abbreviations: NGS, next-generation DNA sequencing; GS-FLX, Roche 454 Genome Sequencer FLX; GAII, Illumina Genome Analyzer II; HNPCC, hereditary nonpolyposis colorectal cancer; LMPCR, ligation-mediated PCR.
Human genes: MLH1, mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli); MSH2, mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli); MSH6, mutS homolog 6 (E. coli); APC, adenomatous polyposis coli; ACTB, actin, beta; AXIN2, axin 2; KRAS, v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog; PIK3CA, phosphoinositide-3-kinase, catalytic, alpha polypeptide; BRAF, v-raf murine sarcoma viral oncogene homolog B1.
* Address correspondence to this author at: 971 Hilton, 200 First St. S.W., Rochester, MN 55905. Fax 507-284-0043; e-mail hoppmanchaney.nicole@mayo.edu.
Received February 18, 2010; accepted May 26, 2010.
Previously published online at DOI: 10.1373/clinchem.2010.145441
Table 1. HNPCC point mutations previously identified in each sample.

| Sample no. | Gene | Primary mutation |
|---|---|---|
| 1 | APC | None |
| 2, 6 | MSH2 | IVS5+3A>T |
| 3, 7 | MSH6 | 3702_3703insAGAA |
| 4, 8 | MLH1 | 1852_1854delAAG |
| 5 | APC | None |

Table 2. Fold enrichment for select loci.

| Sample | APC exon 1 | MLH1 exon 12 | MSH2 exon 15 | MSH6 exon 4 | Positive control (a) | ACTB (b) |
|---|---|---|---|---|---|---|
| 1 | 2210 | 4110 | 3848 | ND (c) | 917 | 0 |
| 2 | 3293 | 6562 | 11347 | 9508 | 2837 | 0 |
| 3 | 6252 | 11386 | 14412 | 13777 | 1859 | 1 |
| 4 | 3848 | 7858 | 9675 | 8135 | 5793 | 607 |
| 5 | 2478 | 4956 | 4925 | 5997 | 3315 | 0 |
| 6 | 3432 | 6339 | 9345 | 7538 | 3848 | 0 |
| 7 | 2750 | 3259 | 4482 | 4640 | 2896 | 0 |
| 8 | 2731 | 4270 | 6427 | 6747 | 3061 | 0 |

(a) Positive control locus included on every Sequence Capture chip.
(b) Negative control locus (not targeted for enrichment).
(c) ND, not determined.

Table 3. Sequencing run metrics.

| Platform | Sample | High-quality reads, n | Mapped to human genome, % | On target, % | Mean read length, bases (a) | Bases covered (5×), % | Bases covered (20×), % |
|---|---|---|---|---|---|---|---|
| GS-FLX | 1 (b) | 218292 (b) | 96.6 | 18.5 | 240 | 93.0 | 67.7 |
| GS-FLX | 2 | 51912 | 88.9 | 49.2 | 209 | 82.3 | 33.8 |
| GS-FLX | 3 (b) | 171083 (b) | 96.5 | 63.2 | 243 | 95.3 | 88.0 |
| GS-FLX | 4 | 53787 | 88.1 | 50.7 | 202 | 84.6 | 34.5 |
| GS-FLX | 5 | 79459 | 93.9 | 49.8 | 220 | 92.5 | 63.1 |
| GS-FLX | 6 | 58225 | 93.2 | 47.5 | 220 | 85.9 | 40.3 |
| GS-FLX | 7 | 81880 | 93.0 | 54.0 | 218 | 90.9 | 62.3 |
| GS-FLX | 8 | 80541 | 92.5 | 41.5 | 215 | 88.0 | 46.9 |
| GAII | 1 | 6482719 | 93.7 | 21.1 | 29 | 92.3 | 83.3 |
| GAII | 2 | 6955700 | 92.6 | 45.2 | 29 | 92.8 | 87.9 |
| GAII | 3 | 6532133 | 92.7 | 57.6 | 29 | 92.7 | 86.9 |
| GAII | 4 | 6548201 | 92.9 | 48.9 | 29 | 92.6 | 87.0 |
| GAII | 5 | N/A (c) | N/A | N/A | N/A | N/A | N/A |
| GAII | 6 | 6991435 | 90.9 | 45.2 | 29 | 93.1 | 87.9 |
| GAII | 7 | 7233645 | 88.8 | 58.2 | 29 | 92.9 | 87.7 |
| GAII | 8 | 6775382 | 91.1 | 37.4 | 29 | 92.1 | 85.4 |

(a) The mean read length on GAII is 36 - 7 = 29. Seven bases of residual adapter sequence were trimmed prior to data analysis.
(b) Samples 1 and 3 were run 3 times each on the GS-FLX because of insufficient reads. Data shown are the sums of all 3 runs.
(c) N/A, not applicable.

Table 4. Sanger sequencing-validated variants missed by one or both NGS methods.

| Sample | Gene | Position | Reference base | Modified base | Detected by | dbSNP (a) | Comment |
|---|---|---|---|---|---|---|---|
| 2 | MSH6 | IVS4-101 | G | C | GS-FLX | rs2072447 | Present; low quality score (b) |
| 8 | MSH6 | IVS2+52 | T | A | GS-FLX | rs3136282 | Present; low quality score (b) |
| 2 | MSH2 | IVS5+3 | A | T | GAII | Germline mutation | Homopolymer |
| 2 | MSH2 | IVS9-91 | G | T | GAII | rs3732182 | Homopolymer |
| 7 | MSH2 | IVS10+12 | G | A | GAII | rs3732183 | Reverse reads only (c) |
| 7 | MSH2 | IVS9-91 | G | T | GAII | rs3732182 | Homopolymer |
| 7 | MSH2 | IVS1+9 | C | G | GAII | rs2303426 | Low depth of coverage (2) |
| 8 | MSH2 | IVS1+9 | C | G | GAII | rs2303426 | Low depth of coverage (3) |
| 1 | MSH6 | 116 | G | A | Neither | rs1042821 | Low read depth |
| 4 | MSH6 | IVS1+22 | C | G | Neither | rs55927047 | Few/0 reads |
| 4 | MSH6 | 186 | C | A | Neither | rs1042820 | Few/0 reads |
| 6 | MSH2 | IVS5+3 | A | T | Neither | Germline mutation | Homopolymer; low GAII quality score (b) |
| 8 | MSH6 | IVS1+22 | C | G | Neither | rs55927047 | Few/0 reads |
| 8 | MSH6 | 186 | C | A | Neither | rs1042820 | Few/0 reads |

(a) dbSNP, single-nucleotide polymorphism database.
(b) Was present in multiple reads, but the Q_phred quality score was <20.
(c) Only reads from the reverse direction (not the forward direction) aligned at this position; hence, this variant was identified but as a low-quality variant reported in alldiffs and not HCdiffs.

Table 5. Variants detected by NGS and not by Sanger sequencing.

| Sample | Gene | Position | Reference base | Modified base | Identified by | Reads, n | Percent with variant | dbSNP, (a) comments |
|---|---|---|---|---|---|---|---|---|
| 1 | MLH1 | IVS16-12 | T | DEL | GS-FLX | 32 | 25.0 | T₅, no SNP |
| 2 | APC | IVS3+9_+15 | TAAAAAG | AAAA | GS-FLX | 14 | 35.7 | (T₄A₅G₂), no SNP |
| 2 | APC | 5068 | GTAGG | INS | GS-FLX | 38 | 10.5 | No homopolymer, no SNP |
| 2 | APC | 7063 | T | A | GS-FLX | 27 | 11.1 | No homopolymer, no SNP |
| 2 | MLH1 | 2111-2112 | TG | GAT | GS-FLX | 20 | 15.0 | No homopolymer, no SNP |
| 2 | MSH6 | 2469 | TAA | DEL | GS-FLX | 32 | 12.5 | No homopolymer, no SNP |
| 3 | APC | IVS3+9_+10 | TA | DEL | GS-FLX | 37 | 13.5 | T₄A₅, no SNP |
| 3 | APC | 5234 | A | DEL | GS-FLX | 101 | 12.9 | A₅, no SNP |
| 3 | MLH1 | 588 | A | DEL | GS-FLX | 86 | 12.8 | A₆, SNP |
| 3 | MLH1 | 2019 | T | DEL | GS-FLX | 82 | 11.0 | T₄, no SNP |
| 3 | MSH6 | 404 | T | INS | GS-FLX | 70 | 15.7 | No homopolymer, no SNP |
| 3 | MSH6 | 3828 | A | DEL | GS-FLX | 100 | 10.0 | No homopolymer, no SNP |
| 4 | APC | 2982 | T | C | GS-FLX | 49 | 10.2 | A₄, no SNP |
| 4 | APC | 6847 | A | G | GS-FLX | 28 | 14.3 | No homopolymer, no SNP |
| 4 | APC | 6903 | ATCA | GATCC | GS-FLX | 27 | 11.1 | No homopolymer, no SNP |
| 4 | MSH6 | 2443 | C | G | GS-FLX | 39 | 10.3 | No homopolymer, no SNP |
| 6 | APC | 331 | A | G | GS-FLX | 27 | 11.1 | No homopolymer, no SNP |
| 6 | APC | 4279 | CC | A | GS-FLX | 28 | 10.7 | C₃, no SNP |
| 7 | APC | 8042 | C | DEL | GS-FLX | 50 | 14.0 | C₅, no SNP |
| 7 | APC | 8108 | A | G | GS-FLX | 45 | 11.1 | A₄, no SNP |
| 7 | APC | 8341 | G | A | GS-FLX | 33 | 12.1 | No homopolymer, no SNP |
| 7 | MSH2 | 1546 | AG | DEL | GS-FLX | 19 | 15.8 | No homopolymer, no SNP |
| 8 | APC | 2524 | GAT | TGGATCC | GS-FLX | 37 | 10.8 | No homopolymer, no SNP |
| 8 | APC | 3652 | A | T | GS-FLX | 34 | 11.8 | No homopolymer, no SNP |
| 8 | APC | 3719 | GT | TTC | GS-FLX | 30 | 10.0 | No homopolymer, no SNP |
| 8 | APC | 7032 | AC | GAA | GS-FLX | 43 | 14.0 | No homopolymer, no SNP |
| 8 | MSH2 | 2587 | T | A | GS-FLX | 25 | 12.0 | No homopolymer, no SNP |
| 8 | MSH6 | 2202 | G | C | GS-FLX | 30 | 10.0 | No homopolymer, no SNP |
| 8 | MSH6 | 2542 | A | TT | GS-FLX | 25 | 12.0 | A₃, no SNP |

(a) dbSNP, single-nucleotide polymorphism database; DEL, deletion; INS, insertion.
|Title Annotation:||Molecular Diagnostics and Genetics|
|Author:||Hoppman-Chaney, Nicole; Peterson, Lisa M.; Klee, Eric W.; Middha, Sumit; Courteau, Laura K.; Ferber,|
|Date:||Aug 1, 2010|