Crunching the bio-numbers.
The session was part of the Atlantic Symposium on Computational Biology and Genome Informatics, which was one of 11 conferences, symposia, and workshops convened under the umbrella of the 7th Joint Conference on Information Sciences, held 26-30 September 2003 in Research Triangle Park, North Carolina. The conference was sponsored by the NIEHS, the Association for Intelligent Machinery, Duke University, the journal Information Sciences, and the Harbin Institute of Technology in China.
According to session chair Bjorn Olsson, a lecturer in computer sciences at the University of Skövde, Sweden, the presentations showed that the field of bioinformatics is coming to terms with the capabilities it has to offer. "Going back a few years, when this type of data was completely new and everyone was excited about it, I think people were fumbling in the dark a bit about what to do with all of this data," he says. "It's becoming more clear now what directions we can go in."
Simon Lin, manager of the Duke Bioinformatics Shared Resource, led off the session with the results of a recently completed study in which he and colleagues propose an improved method of data classification for proteomics research based on matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry, an enhanced version of the original technology used for protein identification. Peaks in raw MALDI-TOF data, which indicate the proteins present in a biosample, must be brought into registration to correct for random fluctuations before they can be used to classify samples (for example, as diseased or nondiseased). Lin and his team took a new algorithmic approach to the registration problem, employing a statistical model based on the chemical analysis of normal mixtures.
Applying this method to an existing data set of 11 tissue samples from cancerous lungs and 11 samples from healthy lungs achieved a classification rate of 90.9%, with a false-positive rate of 0%. Further, the researchers correctly identified two previously known lung cancer protein markers in the cancerous samples, and discovered seven novel markers worthy of further investigation for biologic relevance.
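The reported figures can be checked with a little arithmetic. The sketch below assumes that "classification rate" means overall accuracy and that the 0% false-positive rate means no healthy sample was called cancerous; the per-class breakdown it derives is an inference from the totals, not a result stated by the researchers.

```python
# Assumption: "classification rate" = correctly classified / all samples.
cancer_samples = 11
healthy_samples = 11
total = cancer_samples + healthy_samples

reported_rate = 0.909      # 90.9%, as reported
false_positives = 0        # 0% false-positive rate, as reported

# Number of correct calls implied by the reported rate.
correct = round(reported_rate * total)

# With no false positives, every healthy sample was classified correctly.
true_negatives = healthy_samples - false_positives
true_positives = correct - true_negatives
false_negatives = cancer_samples - true_positives

classification_rate = correct / total
false_positive_rate = false_positives / healthy_samples

print(f"correct calls: {correct} of {total}")
print(f"cancer samples missed (inferred): {false_negatives}")
print(f"classification rate: {classification_rate:.1%}")
```

Under these assumptions, the 90.9% figure corresponds to 20 of 22 samples classified correctly, with both errors being cancerous samples labeled as healthy.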
Improving the accuracy with which diseased and nondiseased samples are classified is one of the ongoing challenges in bioinformatics. "Decision trees" help researchers classify samples by applying sequential tests of individual attributes. The result of each test determines which test, or branch, should be applied next, until a final classification is reached. Olsson presented work in which he and his colleagues applied the decision tree algorithm C4.5 to microarray-based gene expression data in order to induce decision trees for identification of breast cancer patients.
Using the expression values of the 108 genes identified in the literature as breast cancer-related as input to the decision tree algorithm, the team analyzed gene expression data from 75 women, 53 of whom had been diagnosed with breast cancer. The decision tree method achieved 89% accuracy in classifying samples, based on their gene expression data. Olsson also described the potential utility of decision tree algorithms to study signaling pathways based on gene expression data, as well as to discover additional cancer-related genes.
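The sequential-test idea behind a decision tree can be sketched in a few lines. The example below only applies a finished tree to one sample; it does not show how C4.5 induces a tree from data. The gene names, thresholds, and expression values are invented for illustration and are not taken from Olsson's study.

```python
# Hypothetical tree: each internal node tests one gene's expression value.
# Gene names and thresholds are made up for illustration only.
TREE = {
    "gene": "GENE_A", "threshold": 1.5,
    "low": {"label": "healthy"},
    "high": {
        "gene": "GENE_B", "threshold": 0.8,
        "low": {"label": "healthy"},
        "high": {"label": "cancer"},
    },
}


def classify(node, sample):
    """Follow one branch per test until a leaf supplies the class label."""
    while "label" not in node:
        value = sample[node["gene"]]
        # The outcome of this test decides which test (branch) comes next.
        node = node["high"] if value > node["threshold"] else node["low"]
    return node["label"]


sample = {"GENE_A": 2.1, "GENE_B": 0.3}  # hypothetical expression values
print(classify(TREE, sample))
```

Because every classification is a readable path of tests, a tree like this can be inspected gene by gene, which is part of what makes the approach attractive for exploring expression data.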
From decision trees, graduate student Tao Shi of the Department of Human Genetics at the University of California, Los Angeles, took the audience into the woods as he described the use of "random forest" predictors to derive information from microarray data. The random forest approach, which uses a suite of decision trees, can be used to detect clusters in the data, a vital and informative but often difficult step in accurate classification. Shi showed that the random forest approach could help meet the challenge of using gene expression data to classify tumor types, which is increasingly important in molecular biology efforts to characterize cancer subtypes.
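The "suite of decision trees" idea can be illustrated with a deliberately simplified sketch. It assumes one-level trees (stumps), each trained on a bootstrap resample and a single randomly chosen feature, combined by majority vote. Real random forests grow deeper trees and sample features at every split, and the sketch covers only classification, not the cluster detection Shi described. The data and all function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Find the threshold and polarity on one feature with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for polarity in (True, False):
            errors = sum(
                int((row[feature] >= threshold) == polarity) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, polarity)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, polarity = stump
    return int((row[feature] >= threshold) == polarity)


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest


def predict_forest(forest, row):
    """Each stump votes; the majority class wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data: two made-up expression features, label 1 = tumor type X, 0 = other.
rows = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict_forest(forest, (0.95, 0.95)), predict_forest(forest, (0.05, 0.05)))
```

Averaging many trees trained on different resamples is what makes the combined prediction more stable than that of any single tree.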
Rounding out the session, mathematician Takeharu Yamanaka and research fellow Fred Parham of the NIEHS Laboratory of Computational Biology and Risk Analysis presented a new method of analyzing gene expression to infer genetic interactions, which can help identify signal transduction pathways crucial to the sequence of biochemical events that control cellular function. The method uses a Bayesian network, a type of mathematical framework useful for representing known or hypothesized causal relationships. The network takes measured levels of messenger RNA from different genes as its input and uses conditional assumptions to represent the influence one gene has on another. "By using the Bayesian network," says Yamanaka, "we can incorporate statistical thought into the analysis, unlike present methods." Parham adds, "Also, with the Bayesian networks, we can hopefully ... say not just that this group of genes is related, but we can also see the causality."
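How a Bayesian network encodes the influence of one gene on another can be shown with the smallest possible case: two genes joined by one directed edge. The probabilities below are invented for illustration, and the sketch covers only inference on a fixed network, not the step of learning the network structure from expression data that the NIEHS method addresses.

```python
# Hypothetical two-node network: gene A -> gene B.
# All probabilities are made-up values for illustration.
p_a_high = 0.3                      # P(A is highly expressed)
p_b_high_given_a = {
    "high": 0.9,                    # P(B high | A high)
    "low": 0.2,                     # P(B high | A low)
}

# Marginal probability that B is high, summing over the states of A.
p_b_high = (p_a_high * p_b_high_given_a["high"]
            + (1 - p_a_high) * p_b_high_given_a["low"])

# Bayes' rule: how likely is A to be high once B is observed to be high?
p_a_high_given_b_high = p_a_high * p_b_high_given_a["high"] / p_b_high

print(f"P(B high) = {p_b_high:.2f}")
print(f"P(A high | B high) = {p_a_high_given_b_high:.3f}")
```

The conditional table attached to gene B is where the directed influence lives: changing it changes how strongly an observation of one gene shifts belief about the other.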
As today's automated, high-throughput instruments routinely churn out masses of data that would have been unimaginable just a decade ago, innovations in bioinformatics such as those presented at the session will be required to bring method to the madness, and ultimately help deliver the improvements in human and environmental health promised by molecular biology.