Crunching the bio-numbers.
The quest is on to extract useful information from the growing mountain of data from today's high-throughput, high-tech biology, and there is a constant demand for new data-mining techniques that are faster and smarter. The work presented at a recent symposium session on microarray and gene expression analysis shows how bio-number crunchers are contributing many ingenious new approaches to finding the scientific needles in the raw data haystacks.
The session was part of the Atlantic Symposium on Computational Biology and Genome Informatics, which was one of 11 conferences, symposia, and workshops convened under the umbrella of the 7th Joint Conference on Information Sciences, held 26-30 September 2003 in Research Triangle Park, North Carolina. The conference was sponsored by the NIEHS, the Association for Intelligent Machinery, Duke University, the journal Information Sciences, and the Harbin Institute of Technology in China.
According to session chair Bjorn Olsson, a lecturer in computer sciences at the University of Skövde, Sweden, the presentations showed that the field of bioinformatics is coming to terms with the capabilities it has to offer. "Going back a few years, when this type of data was completely new and everyone was excited about it, I think people were fumbling in the dark a bit about what to do with all of this data," he says. "It's becoming more clear now what directions we can go in."
Simon Lin, manager of the Duke Bioinformatics Shared Resource, led off the session by presenting the results of a study he and colleagues recently completed proposing an improved method of data classification in proteomics-based research using matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry, an enhanced version of the original technology used for protein identification. Peaks from MALDI-TOF raw data (which exhibit the proteins extant in a biosample) must be brought into registration to correct for random fluctuations before they can be used to classify samples (for example, as diseased or nondiseased). Lin and his team used a new algorithmic approach to the registration problem, employing a statistical model based on the chemical analysis of normal mixtures to achieve registration.
Applying this method to an existing data set of 11 tissue samples from cancerous lungs and 11 samples from healthy lungs achieved a classification rate of 90.9%, with a false-positive rate of 0%. Further, the researchers correctly identified two previously known lung cancer protein markers in the cancerous samples, and discovered seven novel markers worthy of further investigation for biologic relevance.
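The reported figures can be checked with a little arithmetic. The sketch below assumes that "classification rate" means overall accuracy across all 22 samples. Under that assumption, 90.9% corresponds to 20 of 22 samples classified correctly, and a false-positive rate of 0% means both errors would be cancerous samples called healthy. The individual counts are inferred for illustration, not reported by the researchers.

```python
def classification_summary(tp, fn, fp, tn):
    """Return (accuracy, false-positive rate, sensitivity) from confusion counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    false_positive_rate = fp / (fp + tn)
    sensitivity = tp / (tp + fn)
    return accuracy, false_positive_rate, sensitivity


# Inferred counts (an assumption): 11 cancerous and 11 healthy samples,
# no healthy sample flagged as cancerous, two cancerous samples missed.
acc, fpr, sens = classification_summary(tp=9, fn=2, fp=0, tn=11)
print(f"accuracy={acc:.1%}  false-positive rate={fpr:.1%}  sensitivity={sens:.1%}")
```

If the assumption holds, the method missed about 18% of the cancerous samples while raising no false alarms among the healthy ones.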
Improving the accurate classification of diseased versus nondiseased samples is one of the ongoing challenges in the world of bioinformatics. "Decision trees" help researchers classify samples by offering sequential tests of individual attributes. The result of each test determines which test, or branch, should be applied next, until a final classification is reached. Olsson presented work in which he and his colleagues applied the decision tree algorithm C4.5 to microarray-based gene expression data in order to induce decision trees for identification of breast cancer patients.
Using the expression values of the 108 genes identified in the literature as breast cancer-related as input to the decision tree algorithm, the team analyzed gene expression data from 75 women, 53 of whom had been diagnosed with breast cancer. The decision tree method achieved 89% accuracy in classifying samples, based on their gene expression data. Olsson also described the potential utility of decision tree algorithms to study signaling pathways based on gene expression data, as well as to discover additional cancer-related genes.
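The sequential-test idea behind a decision tree can be sketched in a few lines. This is not C4.5 and not the tree Olsson's team induced: the gene names, thresholds, and labels below are invented for illustration, and the tree is written by hand rather than learned from data.

```python
# Hypothetical tree: each internal node tests one gene's expression value
# against a threshold; each leaf carries a final classification.
TREE = {
    "gene": "GENE_A",
    "threshold": 1.5,
    "low": {"label": "non-cancer"},
    "high": {
        "gene": "GENE_B",
        "threshold": 0.8,
        "low": {"label": "cancer"},
        "high": {"label": "non-cancer"},
    },
}


def classify(node, expression):
    """Follow one branch per test until a leaf (a node with a label) is reached."""
    while "label" not in node:
        value = expression[node["gene"]]
        node = node["high"] if value > node["threshold"] else node["low"]
    return node["label"]


sample = {"GENE_A": 2.1, "GENE_B": 0.3}  # made-up expression values
print(classify(TREE, sample))  # cancer
```

An algorithm such as C4.5 builds a structure like `TREE` automatically, choosing at each step the attribute and threshold that best separate the training samples.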
From decision trees, graduate student Tao Shi of the Department of Human Genetics at the University of California, Los Angeles, took the audience into the woods as he described the use of "random forest" predictors to derive information from microarray data. The random forest approach, which uses a suite of decision trees, can be used to detect clusters in the data, a vital and informative but often difficult step in accurate classification. Shi showed that the random forest approach could help meet the challenge of using gene expression data to classify tumor types, which is increasingly important in molecular biology efforts to characterize cancer subtypes.
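To give a feel for how a "suite of decision trees" reaches a verdict, here is a minimal sketch of the voting idea. It is heavily simplified and is not Shi's method: each "tree" is a single-split stump, the data are made up, and the clustering use described above is not shown. Each stump is trained on a bootstrap resample of the data using one randomly chosen feature, and the forest predicts by majority vote.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Split one feature at the midpoint between the two class means."""
    means = {}
    for cls in (0, 1):
        values = [row[feature] for row, y in zip(rows, labels) if y == cls]
        means[cls] = sum(values) / len(values)
    threshold = (means[0] + means[1]) / 2
    high_class = 1 if means[1] > means[0] else 0
    return feature, threshold, high_class


def predict_stump(stump, row):
    feature, threshold, high_class = stump
    return high_class if row[feature] > threshold else 1 - high_class


def train_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap resample and a random feature."""
    forest = []
    n = len(rows)
    while len(forest) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        boot_labels = [labels[i] for i in idx]
        if len(set(boot_labels)) < 2:
            continue  # resample if one class is missing
        boot_rows = [rows[i] for i in idx]
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump(boot_rows, boot_labels, feature))
    return forest


def predict_forest(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up two-gene expression values; label 1 = tumor type A, 0 = tumor type B.
rows = [[0.2, 0.9], [0.5, 0.1], [1.0, 0.4], [5.1, 5.8], [5.6, 5.0], [6.0, 5.5]]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels, n_trees=25, rng=random.Random(0))
print(predict_forest(forest, [5.5, 5.2]), predict_forest(forest, [0.5, 0.4]))
```

Because every tree sees a slightly different version of the data, the combined vote tends to be more stable than any single tree.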
Rounding out the session, mathematician Takeharu Yamanaka and research fellow Fred Parham of the NIEHS Laboratory of Computational Biology and Risk Analysis presented a new method of analyzing gene expression to infer genetic interactions, which can help identify signal transduction pathways crucial to the sequence of biochemical events that control cellular function. The method uses a Bayesian network, a type of mathematical framework useful for representing known or hypothesized causal relationships. The network measures levels of messenger RNA from different genes and uses conditional assumptions to represent the influence one gene has on another. "By using the Bayesian network," says Yamanaka, "we can incorporate statistical thought into the analysis, unlike present methods." Parham adds, "Also, with the Bayesian networks, we can hopefully ... say not just that this group of genes is related, but we can also see the causality."
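The conditional relationships at the heart of a Bayesian network can be illustrated with the smallest possible case: two genes, with an arrow from gene A to gene B. The sketch below is an illustration under stated assumptions, not the NIEHS method. The observations are invented, expression is crudely discretized into "high" and "low," and the direction of the arrow is assumed rather than inferred from the data.

```python
from collections import Counter

# Made-up paired observations: (gene A level, gene B level) in eight samples.
observations = (
    [("high", "high")] * 3 + [("high", "low")]
    + [("low", "low")] * 3 + [("low", "high")]
)


def conditional_table(pairs):
    """Estimate P(B = b | A = a) from counts of (a, b) pairs."""
    joint = Counter(pairs)
    parent = Counter(a for a, _ in pairs)
    return {(a, b): joint[(a, b)] / parent[a] for (a, b) in joint}


def marginal_b(pairs, b):
    """P(B = b) = sum over a of P(A = a) * P(B = b | A = a)."""
    table = conditional_table(pairs)
    parent = Counter(a for a, _ in pairs)
    n = len(pairs)
    return sum(parent[a] / n * table.get((a, b), 0.0) for a in parent)


table = conditional_table(observations)
print(table[("high", "high")])            # 0.75
print(table[("low", "high")])             # 0.25
print(marginal_b(observations, "high"))   # 0.5
```

In this toy data set, knowing that gene A is highly expressed raises the estimated probability that gene B is highly expressed from 0.5 to 0.75. A full Bayesian network chains many such conditional tables together, and learning which arrows belong in the network is the hard part.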
As today's automated, high-throughput instruments routinely churn out masses of data that would have been unimaginable just a decade ago, innovations in bioinformatics such as those presented at the session will be required to bring method to the madness, and ultimately help deliver the improvements in human and environmental health promised by molecular biology.