Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: assessing chance correlation and prediction confidence.


Class prediction using "omics" data is playing an increasing role in toxicogenomics, diagnosis/prognosis, and risk assessment. These data are usually noisy and represented by relatively few samples and a very large number of predictor variables (e.g., genes of DNA microarray data or m/z peaks of mass spectrometry data). These characteristics manifest the importance of assessing potential random correlation and overfitting of noise for a classification model based on omics data. We present a novel classification method, decision forest (DF), for class prediction using omics data. DF combines the results of multiple heterogeneous but comparable decision tree (DT) models to produce a consensus prediction. The method is less prone to overfitting of noise and chance correlation. A DF model was developed to predict presence of prostate cancer using a proteomic data set generated from surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The degree of chance correlation and prediction confidence of the model was rigorously assessed by extensive cross-validation and randomization testing. Comparison of model prediction with imposed random correlation demonstrated biologic relevance of the model and the reduction of overfitting in DF. Furthermore, two confidence levels (high and low confidences) were assigned to each prediction, where most misclassifications were associated with the low-confidence region. For the high-confidence prediction, the model achieved 99.2% sensitivity and 98.2% specificity. The model also identified a list of significant peaks that could be useful for biomarker identification. DF should be equally applicable to other omics data such as gene expression data or metabolomic data. The DF algorithm is available upon request. Key words: bioinformatics, chance correlation, class prediction, classification, decision forest, prediction confidence, prostate cancer, proteomics, SELDI-TOF. Environ Health Perspect 112:1622-1627 (2004). doi: 10.1289/txg.7109 available via http://dx.doi.org/ [Online 5 August 2004]

**********

Recent technologic advances in the fields of "omics," including toxicogenomics, hold great promise for the understanding of the molecular basis of health and disease, and toxicity. Prospective further advances could significantly enhance our capability to study toxicology and improve clinical protocols for early detection of various types of cancer, disease states, and treatment outcomes. Classification methods, because of their power to unravel patterns in biologically complex data, have become one of the most important bioinformatics approaches investigated for use with omics data. Classification uses supervised learning techniques (Tong et al. 2003b) to fit the samples into the predefined categories based on patterns of omics profiles or predictor variables (e.g., gene expressions in DNA microarray). The fitted model is then validated using either a cross-validation method or an external test set. Once validated, the model could be used for prediction of unknown samples.

A number of classification methods have been applied to microarray gene expression data (Ben-Dor et al. 2000; Simon et al. 2003; Slonim 2002), including artificial neural networks (Khan et al. 2001), K-nearest neighbor (Olshen and Jain 2002), Decision Tree (DT; Zhang et al. 2001), and support vector machines (SVMs; Brown et al. 2000). Some of the same methods have been applied similarly to proteomic data generated from surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) for molecular diagnostics (Adam et al. 2002; Ball et al. 2002). For example, Petricoin et al. (2002a, 2002b) developed classification models for early detection of ovarian and prostate cancers (PCAs) on the basis of SELDI-TOF MS data using a genetic algorithm-based SVM.

Omics data present challenges for most classification methods because a) the number of predictor variables normally far exceeds the sample size and b) most data are unfortunately very noisy. Consequently, optimizing a classification model inherently risks overfitting the noise, a result that is difficult to overcome for most classification methods (Slonim 2002). Furthermore, many existing classification methods require predetermination of a set of predictor variables, thereby introducing additional complexity and bias that could adversely affect both model fitting and validation (Ambroise and McLachlan 2002).

In this article a novel classification method, Decision Forest (DF), is proposed for developing classification models using omics data. A DF model is developed by combining multiple distinct but comparable DT models to achieve a more robust and better prediction (Tong et al. 2003a). DF does not require predetermination of predictor variables before model development and is less prone to overfitting of noise. Developing a statistically sound model that fits the data is straightforward with most classification methods, but assuring that the model can accurately classify unknown samples with a known degree of certainty poses a significant challenge. In DF, an extensive cross-validation and randomization testing procedure was implemented, which provides two critical measures to assess a fitted model's ability to predict unknown samples: the confidence level of predictions and the degree of chance correlation. DF is demonstrated in an application to distinguish PCA samples from normal samples on the basis of a SELDI-TOF MS data set. The results indicate that the reported DF model could be useful for early detection of PCA.

Materials and Methods

Proteomics Data Set

A proteomic data set reported by Adam et al. (2002) is used in this study. The data set consists of SELDI-TOF MS spectra for 326 samples, which were generated using the IMAC-3 chip (Ciphergen Biosystems, Inc., Fremont, CA). Of the 326 serum samples used, 167 samples were from PCA patients, 77 from patients with benign prostatic hyperplasia (BPH), and 82 from healthy individuals. The samples were subsequently divided into two classes for this study, cancer samples (167 PCA samples) versus noncancer samples (159 samples including both BPH and healthy individuals) (Qu et al. 2002). Each sample was characterized by 779 peaks of a spectrum. These peaks were determined in the mass range of 2,000-40,000 Da and provided by the original authors (Adam et al. 2002) for this study. All these peaks were used as predictor variables without preselection to develop the DF model.

Decision Tree

A DT model was developed using a variant of the classification and regression tree (CART) method (Breiman et al. 1995), which consists of two steps--tree construction and tree pruning (Clark and Pregibon 1997). In the tree construction process the algorithm identifies the best predictor variable to divide the samples in the parent node into two child nodes. The split maximizes the homogeneity of the sample population in each child node (e.g., one node is dominated by the cancer samples, and the other is populated with the noncancer samples). Then, the child nodes become parent nodes for further splits, and splitting continues until samples in each node are either in one classification category or cannot be split further to improve the quality of the DT model. To avoid overfitting the training data, the tree is then cut down to a desired size using tree cost-complexity pruning (Clark and Pregibon 1997). At the end of the process, each terminal node contains a certain percentage of cancer samples. This percentage specifies the probability that a sample in the node is a cancer sample. In this study the cutoff 0.5 was used to distinguish cancer samples from noncancer samples. If a terminal node contains a proportion of cancer samples (p) > 50% (i.e., p > 0.5), all the samples in this terminal node are designated as cancer samples and p is the probability value assigned to every sample in this terminal node. Similarly, samples are noncancer if the probability is < 0.5.
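
The terminal-node rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name classify_leaf is hypothetical, and the handling of a node with p exactly equal to 0.5 (assigned to noncancer here) is an assumption, because the text specifies only p > 0.5 and p < 0.5.

```python
def classify_leaf(n_cancer, n_total, cutoff=0.5):
    """Return (label, p) shared by every sample in one terminal node.

    p is the proportion of cancer samples in the node. Ties at the
    cutoff go to "noncancer" (an assumption, not stated in the text).
    """
    if n_total <= 0:
        raise ValueError("a terminal node must contain at least one sample")
    if not 0 <= n_cancer <= n_total:
        raise ValueError("n_cancer must lie between 0 and n_total")
    p = n_cancer / n_total
    label = "cancer" if p > cutoff else "noncancer"
    return label, p


# A node holding 9 cancer samples out of 10 is called "cancer" with p = 0.9.
print(classify_leaf(9, 10))
```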

Decision Forest

DF is a consensus modeling technique, where the results of multiple DT models are combined to produce a more accurate prediction than any of the individual independent DT models. Because combining several identical DT models produces no gain, the rationale behind DF is to develop multiple DT models that are heterogeneous with comparable quality. "Heterogeneity" emphasizes each DT model's unique contribution to the combined prediction, which is accomplished by developing each DT model based on a distinct set of predictor variables. "Comparable quality" ensures each DT model's equal weight in combining prediction, which requires each DT model to have similar accuracy of prediction. Thus, the development of a DF model consists of three steps (Tong et al. 2003a): a) develop a DT model, b) develop the next DT model based on only the predictor variables that are not used in the previous DT model(s), and c) repeat the first two steps until no additional DT models can be developed. In this process the misclassification rate for each DT model is controlled at a fixed level (3-5%) to ensure the comparable quality of individual DT models. The same classification call in DT is used for determining a sample's classification based on the mean probability value of all DT models used in DF.

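The three-step procedure and the mean-probability consensus can be sketched as follows. This is a minimal sketch, not the DF implementation: fit_tree is a hypothetical callable standing in for the CART variant, and the max_error threshold is an assumed way of enforcing the "comparable quality" requirement.

```python
from statistics import mean


def build_forest(variables, fit_tree, max_error=0.05):
    """Grow trees on disjoint variable sets until no acceptable tree remains.

    fit_tree(available) is a hypothetical callable that returns
    (predict_fn, used_variables, error_rate), or None if no tree can be fit.
    max_error is an assumed cap on the misclassification rate.
    """
    available = set(variables)
    forest = []
    while available:
        result = fit_tree(available)
        if result is None:
            break
        predict_fn, used, error_rate = result
        used = set(used) & available
        if not used or error_rate > max_error:
            break
        forest.append(predict_fn)
        available -= used  # later trees may not reuse these variables
    return forest


def forest_probability(forest, sample):
    """Consensus prediction: mean of the per-tree cancer probabilities."""
    return mean(tree(sample) for tree in forest)
```

Removing each tree's variables from the pool before fitting the next tree is what makes the ensemble deterministic for a given data set, in contrast to ensembles built by random resampling.
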
Randomization Test for Chance Correlation

Because proteomic data usually contain a large number of predictor variables with a relatively small number of samples, it is possible that the patterns identified by a classification model could be simply due to chance. Thus, we used a randomization test to assess the degree of chance correlation. In this method the predefined classification of the samples was randomly scrambled to generate 2,000 pseudo-data sets (Good 1994). A DF model was developed for each pseudo-data set, and the results were then compared with the DF model from the real data set to determine the degree of chance correlation.
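
The label-scrambling step can be illustrated with the short sketch below. The function name, the seed argument, and the 1/0 encoding of cancer versus noncancer are assumptions made for the example; only the counts (167 cancer, 159 noncancer, 2,000 pseudo-data sets) come from the text.

```python
import random


def scrambled_label_sets(labels, n_sets=2000, seed=0):
    """Yield n_sets random permutations of the class labels.

    Each permutation keeps the class sizes but breaks any real link
    between a sample's spectrum and its label.
    """
    rng = random.Random(seed)
    for _ in range(n_sets):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        yield shuffled


# 167 cancer (1) and 159 noncancer (0) samples, as in the data set above.
labels = [1] * 167 + [0] * 159
pseudo = next(scrambled_label_sets(labels, n_sets=1))
print(len(pseudo), sum(pseudo))  # 326 167
```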

Model Validation

A common approach for assessing the predictivity of a classification model is to randomly split the available samples into a training set and a test set. The predictivity of a fitted model using all the samples is estimated based on the prediction accuracy for the test set. Arguably, the cross-validation method could be considered as an extension of this external validation procedure and might offer an unbiased way to assess the predictivity of a model from a statistical point of view (Hawkins et al. 2003). In this procedure a fraction of samples in the data set are excluded and then predicted by the model produced using the remaining samples. When each sample is left out one at a time, and the process repeated for each sample, this is known as leave-one-out cross-validation (LOO). If the data set is randomly divided into n groups with approximately equal numbers of samples, and the process is carried out for each group, the procedure is called leave-n-out cross-validation (LNO). Because LOO gives a minimal perturbation to the data set and therefore might not detect overfitting of a model, the leave-10-out cross-validation (L10O) is commonly used for classification models.

It is important to point out that the LNO results vary for each run because the partition of the data set is changing in a random manner (except for the LOO procedure). The variation increases as the number of left-out samples increases (i.e., n decreases with n > 1). Care must be taken when interpreting the results derived from only one pass through an LNO process, which could lead to a conclusion that might not represent the true predictivity of the fitted model due to chance. Rather, the mean of many passes through the LNO process should well approximate the predictivity of the fitted model. In this study an extensive L10O procedure was implemented in DF, where the L10O process was repeated 2,000 times using randomly divided data sets in each run. The choice of 2,000 runs is based on our previous experience of the point at which reliable statistics are reached (Tong et al. 2003a). In this validation process a total of 20,000 pairs of training and test sets were generated, and each sample was predicted by 2,000 different models. The results derived from this process provide an unbiased statistic for evaluating the predictivity of a fitted model.
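
The repeated leave-10-out partitioning can be sketched as a generator of training/test index pairs. The function name and the round-robin assignment of shuffled samples to groups are assumptions for illustration; with 10 groups and 2,000 runs the generator yields the 20,000 pairs mentioned above, and each sample is held out exactly once per run.

```python
import random


def repeated_folds(n_samples, n_groups=10, n_runs=2000, seed=0):
    """Yield (train_indices, test_indices) for repeated n-group cross-validation.

    Every run reshuffles the samples, splits them into n_groups groups of
    nearly equal size, and holds out each group once.
    """
    rng = random.Random(seed)
    for _ in range(n_runs):
        order = list(range(n_samples))
        rng.shuffle(order)
        for g in range(n_groups):
            test = order[g::n_groups]  # every n_groups-th sample, offset g
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


# Two runs over 326 samples give 2 x 10 = 20 train/test pairs.
print(sum(1 for _ in repeated_folds(326, n_runs=2)))  # 20
```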

Results

DF was applied to the proteomic data set for distinguishing cancer from noncancer. The fitted DF model for the data set contains four DT models, each with a comparable number of misclassifications, ranging from 12 to 14 (i.e., 3.7-4.3% error rate; Table 1). The number of misclassifications is significantly reduced as the number of DT models combined to form a DF model increases (Figure 1). The four-tree DF model gave 100% classification accuracy. However, it is important to note that a statistically sound fitted model provides limited indication of whether the identified pattern is biologically relevant or is solely due to chance. Neither does such a fitting result provide validation of the model's capability for predicting unknown samples that were not included in the training set used for model development. It is important to carry out a rigorous validation procedure to characterize the fitted model with respect to the degree of chance correlation and the level of confidence for predicting unknown samples.

Assessment of Chance Correlation

We compared the predictive accuracy for the left-out samples in the 2,000 L10O runs of the real data set (total of 20,000 pairs of training and test sets) with those derived from the L10O run for each of the 2,000 pseudo-data sets (total of 20,000 pairs of training and test sets). The distributions of the prediction accuracy of every pair for both real and pseudo-data sets are plotted in Figure 2. The distribution of prediction accuracy of the real data set centers around 95%, whereas that of the pseudo-data sets is near 50%. The real data set has a much narrower distribution compared with the pseudo-data sets, indicating that the training models generated from the L10O procedure for the real data set give consistent and high prediction accuracy with their corresponding test sets. In contrast, the prediction results of each pair of training and test sets in the L10O process for the pseudo-data sets varied widely, implying a large variability of signal:noise ratio among these training models. Importantly, there is no overlap between the two distributions, indicating that a statistically and biologically relevant DF model could be developed using the real data set.

Assessment of Prediction Confidence

DF assigned a probability value for each prediction, where samples with the probability value ≥ 0.5 were designated as cancer samples, whereas others were designated as normal samples. Figure 3 provides two sets of information derived from the 2,000 L10O runs over 10 equal probability intervals between 0 and 1: a) the number of left-out samples predicted in each bin and b) the misclassification rate in each bin. Analysis shows that the 0.7-1.0 interval has a concordance of 99.2% for the cancer samples (0.8% false positives), whereas the 0.0-0.3 interval has a concordance of 98.2% for the noncancer samples (1.8% false negatives). These two probability ranges accounted for 79.7% of all left-out samples. The vast majority of misclassifications occur in the 0.3-0.7 probability range, where the average prediction accuracy was only 78.9% but which, fortunately, accounted for only 20.3% of all left-out samples. Therefore, we defined both the predicted probability ranges of 0.0-0.3 and 0.7-1.0 as the high-confidence (HC) region, whereas the predicted probability range of 0.3-0.7 was considered the low-confidence (LC) region.
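
The assignment of a class call and a confidence region to a predicted probability can be sketched as below. The function names are hypothetical, and placing the boundary values 0.3 and 0.7 themselves in the HC region is an assumption, since the text does not say which region the interval endpoints belong to.

```python
def confidence_region(p, low=0.3, high=0.7):
    """Return "HC" for p in [0, low] or [high, 1], otherwise "LC".

    Treating the endpoints low and high as high confidence is an assumption.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return "LC" if low < p < high else "HC"


def predict_with_confidence(p, cutoff=0.5):
    """Return (class call, confidence region) for a DF mean probability."""
    label = "cancer" if p >= cutoff else "noncancer"
    return label, confidence_region(p)


print(predict_with_confidence(0.85))  # ('cancer', 'HC')
print(predict_with_confidence(0.55))  # ('cancer', 'LC')
```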

Comparison of DF with DT

Table 2 summarizes the statistical results of the 2,000 L10O runs for both DF and DT. Overall, the DF model increases prediction accuracy by about 5% compared with the DT model, from 89.4 to 94.7%. In the HC region, the DF model increases prediction accuracy over the DT model by 8%, from 90.7 to 98.7%; in the LC region the increase is 15%, from 63.8 to 78.9%.
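
As a consistency check, the overall DF accuracy can be recovered from the regional figures reported in this article, assuming that overall accuracy is the average of the HC and LC accuracies weighted by the share of left-out samples in each region (79.7% and 20.3%). The function name is hypothetical; the same check is not possible for DT because its regional shares are not given here.

```python
def pooled_accuracy(regions):
    """Weighted mean accuracy over (share_of_samples, accuracy_percent) pairs."""
    total_share = sum(share for share, _ in regions)
    return sum(share * acc for share, acc in regions) / total_share


# DF: 79.7% of predictions in HC at 98.7%, 20.3% in LC at 78.9%.
df_regions = [(0.797, 98.7), (0.203, 78.9)]
print(round(pooled_accuracy(df_regions), 1))  # 94.7
```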

Biomarker Identification

In addition to developing a predictive model for proteomic diagnostics, identification of potential biomarkers is another important use of the SELDI-TOF MS technology (Diamandis 2003). Each DT model in DF determines a sample's classification through a series of rules based on selection of predictor variables. Thus, it is expected that the DF-selected variables could be useful as a starting point for biomarker identification.

There were two lists of model-selected variables derived from DF, one used in fitting (the fitting-variable list; Table 1) and the other used by at least one of the models in the 2,000-L10O process (the L10O-variable list). The L10O-variable list contained 323 unique variables, which included all variables in the fitting-variable list. Given that the sample population is different among the models in the 2,000 L10O runs, the number of models selecting a particular variable should tend to increase in direct proportion to the biologic relevance of the variable. There were 46 variables that were selected > 10,000 times in the 2,000-L10O process (Table 3), including all 12 m/z peaks identified by Qu et al. (2002) using boosted decision stump feature selection based on a slightly larger data set. The two-group t-test results indicated that 32 of 46 high-frequency variables have p-values < 0.001 (Table 3). Selection of 23 variables from Table 3 that were used in both fitting and L10O with p < 0.001 appears to be a reasonable approach to choosing a set of proteins for biomarker identification.
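
Counting how often each variable is selected across the cross-validation models can be sketched as follows. The function names and the toy peak labels are hypothetical, and counting a variable at most once per model is an assumption; under that assumption, a threshold of 10,000 corresponds to selection by more than half of the 20,000 models.

```python
from collections import Counter


def selection_counts(models_variables):
    """Count, for each variable, the number of models that selected it.

    models_variables is an iterable with one collection of variable names
    per model; a variable is counted at most once per model.
    """
    counts = Counter()
    for used in models_variables:
        counts.update(set(used))
    return counts


def high_frequency(counts, min_count):
    """Return the variables selected by more than min_count models, sorted."""
    return sorted(v for v, c in counts.items() if c > min_count)


# Toy example with four models and hypothetical peak labels.
models = [{"p1", "p2"}, {"p1"}, {"p1", "p3"}, {"p2"}]
print(high_frequency(selection_counts(models), min_count=2))  # ['p1']
```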

Discussion

We developed a classification model for early detection of PCA on the basis of SELDI-TOF MS data using DF. DF is an ensemble method, where each prediction is a mean value of all the DT models combined to construct the DF model. The idea of combining multiple DT models implicitly assumes that a single DT model could not completely represent important functional relationships between predictor variables (m/z peaks in this study) and the associated outcome variables (PCA in this study), and thus different DT models are able to capture different aspects of the relationship for prediction. Given a certain degree of noise always present in omics data, optimizing a DT model inherently risks overfitting the noise. DF minimizes overfitting by maximizing the difference among individual DT models. The difference is achieved by constructing each individual DT model using a distinct set of predictor variables. Noise cancellation and corresponding signal enhancement are apparent when comparing the results from DF and DT. DF outperforms DT in all statistical measures in the 2,000 L10O runs. Whether DT performs better than other similar classification techniques depends on the application domain and the effectiveness of the particular implementation. However, Lim and Loh (1999) compared 22 DT methods with nine statistical algorithms and two artificial neural network approaches across 32 data sets and found no statistical difference among the methods evaluated. Thus, the better performance of DF than DT implies that the unique ensemble technique embedded in DF could also be superior to some other classification techniques for class prediction using omics data.

Combining multiple DT models to produce a single model has been investigated for many years (Bunn 1987, 1988; Clemen 1989; Zhang et al. 2003). Evaluating different ways of developing the individual DT models to be combined has been a major focus, and these approaches have all been reported to improve ensemble predictive accuracy. One approach is to grow individual DT models based on different portions of samples randomly selected from the training set using resampling techniques. However, resampling using a substantial portion of samples (e.g., 90%) tends to result in individual DT models that are highly correlated, whereas using a less substantial portion of samples (e.g., 70%) tends to result in individual DT models of lower quality. Either highly correlated or lower-quality individual DT models can reduce the combining benefit that might otherwise be realized. The individual DT models can also be generated using more robust statistical resampling approaches such as bagging (Breiman 1996) and boosting (Freund and Schapire 1996). However, it is understood that boosting, which uses a function of performance to weight incorrect predictions, is inherently at risk of overfitting the noise associated with the data, which could result in a worse prediction from an ensemble model (Freund and Schapire 1996). Another approach to choosing an ensemble of DT models centers on random selection of predictor variables (Amit and Geman 1997). One popular algorithm, random forests, has been demonstrated to be more robust than a boosting method (Breiman 1999). However, in an example of classification of naive in vitro drug treatment samples based on gene expression data, Gunther et al. (2003) showed reduced prediction accuracy of random forests (83.3%) compared with DT (88.9%).

It is important to note that the aforementioned techniques rely on random selection of either samples or predictor variables to generate individual DT models. In each repeat the individual DT models of the ensemble are different; thus, the biologic interpretation of the ensemble is not straightforward. Furthermore, these methods need to grow a large number of individual DT models (> 400) and could be computationally expensive. In contrast, the difference in individual DT models is maximized in DF such that a best ensemble is usually realized by combining only a few DT models (i.e., four or five). Importantly, because DF is reproducible, the variable relationships are constant in their interpretability for biologic relevance.

Omics data, as we stress in this article, normally have a limited number of samples and a large number of predictor variables. Furthermore, the noise associated with both categorical dependent variables and predictor variables is usually unknown. It is consequently imperative to verify that the fitted model is not a chance correlation. To assess the degree of chance correlation of the PCA model, we computed a null distribution of prediction with 2,000 L10O runs based on 2,000 pseudo-data sets derived from a randomization test. The null hypothesis was tested by comparing the null distribution with the DF predictions in 2,000 L10O runs using the actual training data set. The degree of chance correlation in the predictive model can be estimated from the overlap of the two distributions (Figure 2). Generally speaking, a data set with an unbalanced sample population, small sample size, and/or low signal:noise ratio would tend to produce a model with a distribution overlapping the null distribution. For the PCA model, the distributions are spaced far apart with no overlap, indicating that the model is biologically relevant.
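
One simple way to put a number on the overlap between the real and null accuracy distributions is sketched below. This is an illustrative measure chosen for the example, not the procedure used in this study: it reports the fraction of null (scrambled-label) accuracies that reach the lowest accuracy obtained with the real labels, so a value of 0 means the two distributions do not overlap on that side.

```python
def overlap_fraction(real_acc, null_acc):
    """Fraction of null accuracies at or above the lowest real accuracy.

    real_acc and null_acc are non-empty sequences of per-test-set accuracies
    from the real data and from the scrambled-label pseudo-data sets.
    """
    floor = min(real_acc)
    return sum(a >= floor for a in null_acc) / len(null_acc)


# Toy values: real accuracies near 0.95, null accuracies near 0.50.
print(overlap_fraction([0.93, 0.95, 0.97], [0.45, 0.50, 0.55, 0.60]))  # 0.0
```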

A model fitted to omics data has minimal utility unless it can be generalized to predict unknown samples. The ability to generalize the model is an essential requirement for diagnostics and prognostics in medical settings and/or risk assessment in regulation. Commonly, test samples are used to verify the performance of a fitted model. Such external validation, while providing a sense of real-world application, must incorporate assurance that samples set aside for validation are representative. Setting aside only a small number of samples might not provide the ability to fully assess the predictivity of a fitted model, which in turn could result in the loss of valuable additional data that might improve the model. Besides, one rarely enjoys the luxury of setting aside a sufficient number of samples for use in external validation in omics research because in most cases data sets contain barely enough samples to create a statistically robust model in the first place. Therefore, an extensive L10O procedure is embedded in DF that can provide an unbiased and rigorous way to assess the fitted model's predictivity within the available samples' domain without the loss of samples set aside for a test set.

A model's ability to predict unknown samples depends directly on the nature of the training set. In other words, predictive accuracy for different unknown samples varies according to how well the training set represents the given samples. Therefore, it is critical to be able to estimate the degree of confidence for each prediction, which could be difficult to derive from external validation. In DF the information derived from the extensive L100 process permits assessment of the confidence level for each prediction. For the PCA model the confidence level for predicting unknown samples was assessed based on the distribution of accuracy over the prediction probability range for the left-out samples in the 2,000 L100 runs. We found that the sensitivity and specificity of the model were 99.2% and 98.2% in the HC region, respectively, with an overall concordance of 98.7%. In contrast, a much lower prediction confidence of 78.9% was obtained in the LC region, indicating that these predictions need to be further verified by additional methods. Generally, the number of samples within the HC region compared with the LC region depends on the signal:noise ratio in the data set. For noisy data, more unknown samples will be predicted in the LC region, and the proportion could be as high as 40-50% (results not shown). For the PCA data set some 80% of the left-out samples predicted in the 2,000 L100 runs were in the HC region, indicating that the data set has a high signal:noise ratio.
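
As a rough illustration of this kind of per-region summary, the sketch below pools left-out predictions and reports sensitivity, specificity, and concordance separately for the HC and LC regions. The cutoffs (a probability of at most 0.3 or at least 0.7 counts as HC), the 0.5 decision threshold, the function name, and the toy records are all assumptions made for the example; the region definition given elsewhere in the article should be substituted where it differs.

```python
def confidence_summary(records, low=0.3, high=0.7):
    """Summarize left-out predictions by confidence region.

    records: iterable of (probability_of_cancer, true_label) pairs, with
             true_label 1 for cancer and 0 for noncancer.
    Returns {"HC": {...}, "LC": {...}} with counts and rates per region.
    """
    regions = {"HC": [], "LC": []}
    for prob, label in records:
        # Probabilities strictly between the cutoffs are low confidence.
        region = "LC" if low < prob < high else "HC"
        regions[region].append((prob >= 0.5, label == 1))

    summary = {}
    for name, pairs in regions.items():
        tp = sum(1 for pred, truth in pairs if pred and truth)
        tn = sum(1 for pred, truth in pairs if not pred and not truth)
        fp = sum(1 for pred, truth in pairs if pred and not truth)
        fn = sum(1 for pred, truth in pairs if not pred and truth)
        n = len(pairs)
        summary[name] = {
            "n": n,
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "specificity": tn / (tn + fp) if tn + fp else None,
            "concordance": (tp + tn) / n if n else None,
        }
    return summary


# Hypothetical left-out predictions pooled over cross-validation runs.
pooled = [(0.95, 1), (0.88, 1), (0.10, 0), (0.05, 0), (0.80, 0),
          (0.55, 1), (0.45, 1), (0.60, 0), (0.35, 0)]
result = confidence_summary(pooled)
print(result["HC"]["concordance"], result["LC"]["concordance"])  # 0.8 0.5
```

The fraction of records that fall in the HC region (here 5 of 9) plays the role of the roughly 80% figure quoted for the PCA data set.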

A number of classification methods reported in the literature require selection of the relevant or informative predictor variables before modeling is actually performed. This is necessary because the method could be susceptible to noise without this procedure, and the computational cost is prohibitive for iterative variable selection during cross-validation. Although these are otherwise effective methods, they could produce what is called "selection bias" (Simon et al. 2003). Selection bias occurs when the model's predictive performance is assessed using cross-validation where only the preselected variables are included. Because of selection bias, cross-validation could significantly overstate prediction accuracy (Ambroise and McLachlan 2002), and external validation becomes mandatory to assess a model's predictivity. In contrast, model development and variable selection are integral in DF. DF avoids selection bias during cross-validation because the model is developed at each repeat by selecting the variables from the entire set of predictor variables. The cross-validation thereby provides a realistic assessment of the predictivity of a fitted model. Given the trend of ever-decreasing computational expense, carrying out exhaustive cross-validation is increasingly attractive, particularly when scarce sample data can be used for training as opposed to external testing. Of course, external validation is still strongly recommended when the amount of data suffices, in which case the cross-validation process will still enhance the rigor of the validation.
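
The effect of selection bias can be reproduced on pure noise. The sketch below is a hedged illustration, not the DF algorithm: it uses a simple nearest-centroid classifier as a stand-in, random Gaussian "peaks" with labels unrelated to the data, and a mean-difference filter to pick variables. All names, sizes, and the seed are assumptions chosen for the example. Selecting variables once on all samples and then cross-validating typically reports accuracy well above chance, whereas repeating the selection inside each fold does not.

```python
import random


def top_features(X, y, k):
    """Indices of the k features with the largest class-mean difference."""
    def score(j):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_predict(X_train, y_train, x, feats):
    """Nearest-centroid prediction restricted to the chosen features."""
    def centroid(label):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        return [sum(r[j] for r in rows) / len(rows) for j in feats]

    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(feats, c))

    return 1 if dist(centroid(1)) < dist(centroid(0)) else 0


def cv_accuracy(X, y, k, folds, select_inside):
    """k-fold accuracy; features chosen inside each fold or once on all data."""
    n = len(X)
    global_feats = top_features(X, y, k)  # sees the left-out samples: biased
    correct = 0
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        feats = top_features(Xtr, ytr, k) if select_inside else global_feats
        correct += sum(centroid_predict(Xtr, ytr, X[i], feats) == y[i]
                       for i in test)
    return correct / n


random.seed(0)
n_samples, n_features = 40, 1000
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels unrelated to X: pure noise

biased = cv_accuracy(X, y, k=10, folds=5, select_inside=False)
unbiased = cv_accuracy(X, y, k=10, folds=5, select_inside=True)
print(f"preselected variables: {biased:.2f}; selection inside folds: {unbiased:.2f}")
```

Because the labels carry no information, any accuracy clearly above 0.5 in the first setting comes from the left-out samples having influenced the variable selection.
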
Table 1. Summary of the four DT models combined for developing the DF
model (n = number of misclassifications).

                       DT model 1  DT model 2  DT model 3  DT model 4
                        (n = 12)    (n = 13)    (n = 14)    (n = 14)

Variables (m/z peaks)    9,656       8,067       6,542       7,692
used in each DT model    8,446       8,356       7,934       6,756
                         5,074       5,457       7,195       9,593
                         6,797       2,144       4,497       9,456
                         8,291       7,885       4,080       5,978
                         9,720       7,024       6,199       3,780
                         3,486       7,771       7,481       2,794
                         4,191       3,897       5,586       7,844
                         4,653       4,757       6,099       5,113
                                     6,890       7,070      28,143
                                     2,014      24,400       2,982
                                     9,149       2,887       6,443
                                                 7,054       7,820
                                                 4,475       4,580
                                                 4,537
                                                 7,409
                                                 7,054

Table 2. Comparison of statistics between DF and DT
models in prediction of the left-out samples in
the 2,000 L100 runs.

Prediction accuracy       DF (%)    DT (%)

Overall accuracy           94.7      89.4
Accuracy in HC region      98.7      90.7
Accuracy in LC region      78.9      63.8

Table 3. List of m/z peaks used more than 10,000
times in the 2,000-L100 process, where 23 peaks
are used in fitting with p < 0.001.

m/z peak (Da) Frequency   p-Value

7,934 (a)     30,203      < 0.001
9,149 (a)     26,482      < 0.001
7,984 (b)     25,171      < 0.001
8,296 (a)     24,793      < 0.001
3,897 (a)     23,754      < 0.001
9,720 (a,c)   22,630      < 0.001
7,776 (a)     21,723        0.003
7,024 (a,c)   21,718      < 0.001
5,074 (a)     20,800      < 0.001
8,446 (a)     20,620      < 0.001
9,656 (a,c)   20,479      < 0.001
6,542 (a,c)   20,219      < 0.001
8,067 (a,c)   20,058      < 0.001
7,692 (a)     19,982        0.004
6,797 (a,c)   19,587      < 0.001
8,356 (a,c)   19,429      < 0.001
7,054 (a)     19,333        0.010
6,099 (a)     19,265        0.004
5,586 (a)     18,103      < 0.001
7,820 (a,c)   17,918        0.359
6,756 (a)     17,668      < 0.001
9,593 (a)     17,615      < 0.001
7,844 (a)     17,611        0.089
4,191 (a)     17,387      < 0.001
3,486 (a)     17,290      < 0.001
4,451 (b)     17,041        0.459
4,079 (a,c)   16,790        0.020
9,456 (a)     16,767      < 0.001
4,653 (a)     16,674        0.002
7,195 (a)     15,832      < 0.001
7,885 (a,c)   15,388      < 0.001
8,277 (b)     15,388      < 0.001
6,072 (b)     15,093      < 0.001
3,963 (b,c)   14,434      < 0.001
3,780 (a)     14,139        0.014
4,291 (b)     13,540      < 0.001
4,102 (b)     13,294        0.001
4,858 (b)     13,076        0.003
6,949 (b,c)   12,555      < 0.001
3,280 (b)     11,808      < 0.001
6,991 (b,c)   11,281        0.122
2,144 (a)     11,110      < 0.001
9,100 (b)     10,578      < 0.001
7,652 (b)     10,159        0.005
5,457 (a)     10,139      < 0.001
6,914 (b)     10,073      < 0.001

(a) Used in fitting.

(b) Not used in fitting.

(c) Reported by Qu et al. (2002).


REFERENCES

Adam BL, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, et al. 2002. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res 62:3609-3614.

Ambroise C, McLachlan GJ. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99:6562-6566.

Amit Y, Geman D. 1997. Shape quantization and recognition with randomized trees. Neural Comput 9:1545-1588.

Ball G, Mian S, Holding F, Allibone RO, Lowe J, Ali S, et al. 2002. An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers. Bioinformatics 18:395-404.

Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. 2000. Tissue classification with gene expression profiles. J Comput Biol 7:559-583.

Breiman L. 1996. Bagging predictors. Machine Learning 24:123-140.

Breiman L. 1999. Random Forests. Technical Report 567. Berkeley, CA: Department of Statistics, University of California.

Breiman L, Friedman J, Olshen R, Stone C, Steinberg D, Colla P. 1995. CART: Classification and Regression Trees. Stanford, CA:Salford System.

Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, et al. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97:262-267.

Bunn DW. 1987. Expert use of forecasts: bootstrapping and linear models. In: Judgemental Forecasting (Wright G, Ayton P, eds). New York:Wiley, 229-241.

Bunn DW. 1988. Combining forecasts. Eur J Operational Res 33:223-229.

Clark LA, Pregibon D. 1997. Tree-based models. In: Modern Applied Statistics with S-Plus. (Venables WN, Ripley BD, eds). 2nd ed. New York:Springer-Verlag.

Clemen RT. 1989. Combining forecasts: a review and annotated bibliography. Int J Forecasting 5:559-583.

Diamandis EP. 2003. Point: proteomic patterns in biological fluids: do they represent the future of cancer diagnostics? Clin Chem 49:1272-1275.

Freund Y, Schapire R. 1996. Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning (Saitta L, ed). San Francisco:Morgan Kaufmann Publishers, 148-156.

Good P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York:Springer-Verlag.

Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP. 2003. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles. Proc Natl Acad Sci USA 100:9608-9613.

Hawkins DM, Basak SC, Mills D. 2003. Assessing model fit by cross-validation. J Chem Inf Comput Sci 43:579-586.

Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7:673-679.

Lim T-S, Loh W-Y. 2000. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning 40(3):203-218.

Olshen AB, Jain AN. 2002. Deriving quantitative conclusions from microarray expression data. Bioinformatics 18:961-970.

Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, et al. 2002a. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359:572-577.

Petricoin EF III, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, et al. 2002b. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst 94:1576-1578.

Qu Y, Adam BL, Yasui Y, Ward MD, Cazares LH, Schellhammer PF, et al. 2002. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin Chem 48:1835-1843.

Simon R, Radmacher MD, Dobbin K, McShane LM. 2003. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95:14-18.

Slonim DK. 2002. From patterns to pathways: gene expression data analysis comes of age. Nat Genet 32(suppl):502-508.

Tong W, Hong H, Fang H, Xie Q, Perkins R. 2003a. Decision forest: combining the predictions of multiple independent decision tree models. J Chem Inf Comput Sci 43:525-531.

Tong W, Welsh WJ, Shi L, Fang H, Perkins R. 2003b. Structure-activity relationship approaches and applications. Environ Toxicol Chem 22:1680-1695.

Zhang H, Yu CY, Singer B. 2003. Cell and tumor classification using gene expression data: construction of forests. Proc Natl Acad Sci USA 100:4168-4172.

Zhang H, Yu CY, Singer B, Xiong M. 2001. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci USA 98:6730-6735.

Address correspondence to W. Tong, Center for Toxicoinformatics, Division of Biometry and Risk Assessment, NCTR, 3900 NCTR Rd., HFT 020, Jefferson, AR 72079 USA. Telephone: (870) 543-7142. Fax: (870) 543-7662. E-mail: wtong@nctr.fda.gov

The authors declare they have no competing financial interests.

Received 22 March 2004; accepted 5 August 2004.
COPYRIGHT 2004 National Institute of Environmental Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2004, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Title Annotation:Toxicogenomics
Author:Petricoin, Emanuel F.
Publication:Environmental Health Perspectives
Date:Nov 15, 2004
Words:5700