Model selection in genomics. (Editorials)
With the discovery of DNA, the completion of genome sequencing of a number of organisms, and the advent of powerful high-throughput measurement technologies such as microarrays, it is now commonly said that biology has gone through a revolution. But I also have heard it said that biology is only about to go through a scientific revolution, much as physics did in the 17th century. In messianic hopes, people foretell the coming of the Newton of biology, but it is up to us, the scientific community, to set the stage for that to happen.
Both views are valid, each in its own sense. The discovery of DNA and the more recent development of powerful new technologies have certainly revolutionized our understanding of the inner workings of life and allowed us to probe deep into the machinery of living organisms, much as the Copernican system and Galileo's telescope helped revolutionize astronomy. It was Sir Isaac Newton, however, who placed science on a solid footing by formalizing existing knowledge in terms of mathematical models and universal laws. In some sense, this was the real scientific revolution because it permitted prediction of physical phenomena in a general setting, as opposed to simply describing individual observations. The difference is profound. Whereas a mathematical equation can adequately describe a given set of observations, it may lack the universality needed for making predictions. Kepler's equations pertained to planets in our solar system; Newton's laws could be used to predict what would happen to two arbitrary bodies anywhere in the universe. The universality of a scientific theory coupled with mathematical modeling allows us to make testable predictions. This ability will have a profound effect on the field of biology.
The hallmarks of a great scientific theory are universality and simplicity. Newton's law of gravity is a case in point. The fact that the force of attraction between any two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them is both universal and simple. These issues are especially important today in the rapidly evolving field of genomics, where formal mathematical and computational methods are becoming indispensable. So what should be our guiding principles, our beacons of scientific inquiry? One such fundamental principle underpinning all scientific investigation is Ockham's razor, also called the "law of parsimony."
Consider the following, seemingly straightforward problem. We are presented with a set of data, represented as pairs of numbers (x, y). In each pair, the first number (x) is an independent variable and the second number (y) is a dependent variable. The problem is to choose whether to fit a line (of the form y = a + bx) or a parabolic function (of the form y = a + bx + cx²). The knee-jerk response might be as follows: Let's fit the parabolic function, since the linear function is clearly a special case of it, just by letting c = 0; thus, the parabola will always provide a better fit to our data set. After all, if it so happens that our data points are arranged on a line, the estimation of parameters (a, b, and c) will simply reveal that c is indeed equal to zero and the parabolic function will reduce to a linear one. Thus, it would seem, three "adjustable" parameters are better than two. Of course, such reasoning could be taken ad absurdum if we had freedom to choose as many parameters as we like. Thus, there must be a tradeoff. Although three parameters surely provide a better fit to the data, the model becomes more complex and so we sacrifice simplicity. But why is that bad?
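The first half of that knee-jerk reasoning can be checked directly. The sketch below (synthetic data invented for illustration; NumPy assumed available) fits both candidate models by least squares and confirms that the parabola, which contains the line as the special case c = 0, never fits the same data worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a true linear trend plus noise.
x = np.linspace(0, 10, 20)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=x.size)

# Least-squares fits of the two candidate models.
line = np.polyfit(x, y, deg=1)    # y = a + bx
parab = np.polyfit(x, y, deg=2)   # y = a + bx + cx^2

# Residual sum of squares (goodness-of-fit) for each model.
rss_line = np.sum((y - np.polyval(line, x)) ** 2)
rss_parab = np.sum((y - np.polyval(parab, x)) ** 2)

# The parabola can always reproduce the line exactly (c = 0),
# so its RSS on the same data is never larger than the line's.
assert rss_parab <= rss_line + 1e-9
```

The fitted quadratic coefficient comes out close to zero here, echoing the argument in the text, yet the extra parameter still buys a (spurious) improvement in fit.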
To give a general answer: by making a model overly complex, we forfeit predictive accuracy. A complex model may be able to describe the observed data very well, but will it accurately predict future instances? For example, if the data contain random fluctuations or noise, an excessively complex model will "overfit" the data along with the noise and will obviously provide a poor fit to future (unseen) data. The chief goal of model selection is to find the right balance between simplicity and goodness-of-fit.
Consider gene expression-based cancer classification. The basic idea is simple: Take a number of tumor samples of a known type, measure expressions of thousands of genes for each one, and on the basis of these observations, construct a classifier (model) that will predict the tumor type when presented with an unknown sample. A fundamental question is "What type of classifier should we choose?" This is a crucial step in model selection (in machine learning, the model is called the "hypothesis space"). The next step--actually selecting a particular classifier from the model class (i.e., selecting a particular hypothesis)--is fairly well understood, as it involves the estimation of parameters.
As discussed, it would be unwise to devise an overly complex classifier, consisting of hundreds or thousands of parameters, especially in light of the rather small sample sizes (number of tumors) available, which are typically below 100. Such a classifier may have extremely small or even no error on the seen data but may exhibit very high error on unseen data. Hence, its predictive accuracy would be very poor.
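A hedged sketch of why this is dangerous: with far more parameters than samples, a linear classifier can achieve zero error on the seen data even when the labels carry no information at all. The numbers below (50 samples, 1,000 genes, random labels) are invented to mimic the sample-size regime described above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setting: 50 tumor samples, 1000 gene-expression
# features, and class labels assigned completely at random.
n_samples, n_genes = 50, 1000
X = rng.normal(size=(n_samples, n_genes))
y = rng.choice([-1.0, 1.0], size=n_samples)

# A linear classifier with one weight per gene has far more
# parameters than samples, so least squares fits the labels exactly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ w) == y)
print(train_acc)  # 1.0 on the seen data, despite zero real signal
```

Since the labels are pure noise, the classifier's expected accuracy on unseen samples is no better than chance; the perfect training score is an illusion of the model's complexity.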
So, suitable criteria or methods are needed that would help us strike the right balance between simplicity and goodness-of-fit, such that predictive accuracy can be maximized. Fortunately, the recent statistical literature is replete with such approaches, including the Bayesian information criterion (BIC) and Akaike's information criterion (AIC), among others.
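Criteria such as BIC make the simplicity/fit tradeoff explicit: the score rewards goodness-of-fit but charges a penalty per free parameter. The sketch below applies the standard Gaussian-likelihood form of BIC, k·ln(n) + n·ln(RSS/n), to the line-versus-parabola problem on synthetic linear data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data generated from a true linear model.
n = 30
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

def bic(deg):
    # BIC = k*ln(n) + n*ln(RSS/n), where k = deg + 1 free parameters.
    k = deg + 1
    rss = np.sum((y - np.polyval(np.polyfit(x, y, deg), x)) ** 2)
    return k * np.log(n) + n * np.log(rss / n)

scores = {deg: bic(deg) for deg in (1, 2, 5)}
best = min(scores, key=scores.get)
print(best)  # usually 1: the parameter penalty favors the true, simpler model
```

The richer models always shrink the residuals a little, but the ln(n) penalty per extra parameter typically outweighs that gain, so the criterion selects the line: Ockham's razor in quantitative form.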
In the field of toxicogenomics, issues related to prediction and model selection are of vital importance. For example, toxicogenomic biomarkers should reliably predict toxic effects to help us develop safer drugs and chemicals and understand molecular mechanisms of pathogenesis. Models of genetic networks and gene expression-based classifiers are expected to consistently predict a cell's response to a stressful challenge and to classify unknown compounds. A keen awareness of Ockham's razor will help guide us on our quest to understand the nature of living systems and their behavior under various environmental conditions.
Cancer Genomics Laboratory
The University of Texas M.D. Anderson Cancer Center
Houston, Texas, USA
Ilya Shmulevich is an assistant professor at the Cancer Genomics Laboratory at The University of Texas M. D. Anderson Cancer Center. He is an associate editor of the Toxicogenomics Section of Environmental Health Perspectives. His research interests include computational genomics, systems biology, nonlinear signal and image processing, and computational learning theory.