NPCRIT: a novel protein coding region identifying tool using decision tree classifier with trust-region method & parallel scan algorithm.
Many of the challenges in biology are now challenges in computing. Bioinformatics , the application of computational techniques to analyze the information associated with bimolecules on a large scale, has now firmly established itself as a discipline in molecular biology. Bioinformatics is a management information system for molecular biology.
Bioinformatics encompasses everything from data storage and retrieval to the identification and presentation of features within data, such as finding genes within DNA sequence, finding similarities between sequences, structural predictions . Analyzing the coding regions is not the scope of this research.
Finding the coding regions is difficult because, Genomes are small are not continuous (0.1-10.1bp).The coding density  in eukaryotes is less than 50% till date (Only 5% of genes constitutes of coding regions). From the fig 1 we now that DNA is arranged in the form of exons and introns. Introns form maximum part of DNA and extons form the minor part of DNA. But these extons contain the small protein coding regions.
[FIGURE 1 OMITTED]
But, Proteins attach to the DNA and help the strands coil up into a chromosome when the cell gets ready to divide. The point where the proteins add to the strands is very important to this research. This NPCRIT tool uses a Sequential PRM-based protein folding algorithm for finding the point where these proteins add to the ladder.
We also need to find the closet neighbor hood stride, which is the main reason for the cause of dynamism in this research project. We use trust region method for finding the optimized stride . Once we find the neighboring strides we use parallel scan algorithm for processing the sequence to find the coding region.
For better understanding of the specified objectives, we presented the need and difficulty of finding the protein coding regions in Section I, in the Section II we introduced the concept of induction of decision trees, in the section III new trust-region method was discussed, Section IV covers the parallel scan algorithm and section V discusses protein folding algorithm and Section VI gives details regarding data and method and section VII was the strength of the paper which discusses briefly on the experimental results.
Induction of Decision Trees
Decision tree algorithms have been described in many places, for example ID3 and C4.5
(Quinlan 1993), CART (Breiman et al. 1984), and OC1 (Murthy et al. 1993; 1994). The basic algorithm will only be summarized here. All decision tree systems operate using the same underlying algorithm. The input is a set S of examples; in the current study, S consists of a set of non overlapping DNA subsequences.
Identification of protein coding region in DNA
The algorithm is:
Input: DNA Sequence
Output: Decision Tree
1. Split the set S using a test on one or more features, f1, f2, ..., fd.(Using Parallel Scan Algorithm, Section IV)
2. Check the results of the split. If every partition is pure (all examples in the Partition belong to the same class), then stop. Label each leaf node with the name of the class there.(Using the PRM-based protein folding algorithm, Section V)
3. Recursively Calculate the nearest neighbor (Optimized neighbor using Trust Region Method, Section III)
4. Recursively split any partitions that are not pure.
The result of this algorithm is a decision tree with test nodes and leaf nodes. The leaf nodes contain class labels. Decision trees vary primarily in how they choose a OsplitO of the data, and in how they prune the trees they produce.
The data should be formatted as a set of features and a class label; for example, this study includes 21 coding measures (the features), such as dicodon frequencies, hexamer frequencies, and length of the longest open reading frame. Each of these features is a number computed on a fixed-length window of the DNA sequence . There are only two class labels: coding and noncoding.
Given this numeric input, a decision tree algorithm must decide how to use the features to OsplitO the data. For example, if a feature is named Ocodon usage,O one possible test might be codon usage < 2.45. We call this a univariate test because it only uses one feature (variable). With a univariate test, there are only N-1 distinct tests to consider for N examples. These would occur at the midpoints between successive feature values. With N examples and D features, we have D(N-1) such tests to consider. Each test is scored by a goodness measure. The simplest such measure is the one that counts the number of features that would be misclassified by the test (Heath et al. 1993). More sophisticated tests measure the entropy of the initial set and the subsets produced after splitting (Quinlan 1993), or use statistical properties such as the probabilities of each class on either side of the split. (Note that a test always splits the data into two subsets.) The Otwoing ruleO (Breiman et al. 1984) tries to create tests that maximize the purity of both subsets while at the same time splitting the data roughly in half. For definitions of these and other goodness measures, see Murthy et al. (1994). Another, more complicated type of test is the linear discriminant, which was used by Fickett and Tung (1992). In the context of decision trees, this kind of test is called an oblique split. Instead of testing a single feature such as codon usage, these tests include a multivariate combination of the features f1, f2, ..., fd, in the form: a1f1 + a2f2 + ... + adfd [sup.3] ... ad+1
This test, while much more powerful than the simpler univariate test, is also more expensive to compute. However, recent work on randomized decision tree algorithms has resulted in very efficient methods for finding these oblique tests (Murthy et al. 1994). Our NPCRIT system always finds the best axis-parallel split first, and then considers oblique splits and uses those if they work better. The system can be set to use axis-parallel splits only for even greater efficiency.
One interesting result that came out of the current study was that virtually all goodness measures are designed so as to maximize, in one form or another, the overall accuracy of the tree. Thus if one class is much larger than another, as is the case with Fickett and TungOs benchmark training sets, decision trees tend to optimize accuracy on the larger class. To correct this, we developed a new goodness measure called Oaverage accuracy. Average accuracy measures the goodness of a split by simply computing the accuracy of each class for a given test, and then averaging the results across all classes. In order to use all the other goodness measures, though, we needed to produce training sets in which the sizes of the classes were roughly equal. From each of the benchmark data sets, therefore, we extracted a subset of the training data with equal class distributions, and used just the subset for building trees.
Most decision tree systems provide some means for pruning the decision tree after it is built. The initial algorithm takes some training data and builds a complete tree, one that correctly classifies every example in the training set. When the data is noisy, this tree usually OoverfitsO the training data and performs poorly on additional data. A smaller tree often performs better, and offers the advantage of being simpler and faster to use. Our decision tree system uses a pruning technique called cost complexity pruning. First it divides the training data into two sets, the training set (T) and the pruning set (P).
After building a decision tree on T, it then produces a succession of smaller trees by looking at every non-leaf node in the tree, and measuring the cost complexity of that node. Cost complexity considers the number of examples that would be misclassified (the cost) if that node were made into a leaf, and it also measures the complexity of the subtree rooted at that node. These two are combined into the cost complexity measure; for details see Breiman et al. (1984).
The node whose cost complexity is greatest is then deleted or Opruned,O and the process continues. This continues until the tree is pruned down to a single node. The system then examines the series of increasingly smaller trees and computes their accuracies on the pruning set P. In the simplest pruning model, called SE-0 pruning, it chooses the tree with the highest accuracy on the pruning set. The SE-1 pruning method computes the standard error (SE) of the accuracies of all the trees and then chooses the smallest tree whose accuracy is within one standard error of maximal. For this study we considered SE-0, SE-0.5, and SE-1 pruning.
[FIGURE 2 OMITTED]
New Trust-Region Method
We introduce a new trust-region method for unconstrained optimization where the radius of the corresponding neighbors DNS stride update is computed using the model information at the current iterate rather than at the preceding one. The update is then performed according to how well the current model retrospectively predicts the value of the objective function at last iterate. Global convergence torts- and second-order critical point is proved under classical assumptions and preliminary numerical experiments on cuter problems indicate that the new method is very competitive.
Trust-region methods are well-known techniques in nonlinear no convex programming, whose concept has matured over more than thirty years (for an extensive coverage, see Conn, Gould and Toint, 2000). In such methods, one considers a model mk of the objective function which is assumed to be adequate in a _trust region_, which is a neighbourhood of the current iterate [x.sub.k]. This neighbourhood is often represented by a ball in some norm, whose radius k is then updated from iteration k to iteration k + 1 by considering how well [m.sub.k] predicts the objective function value at iterate [x.sub.k+1]. In retrospect, this might seem unnatural since the new radius _k+1 will determine the region in which a possibly updated [modelm.sub.k+1] is expected to predict the value of the objective function around xk+1. Our aim in this paper is to propose a variant of the trust-region algorithm that determines _k+1 according to how well mk+1 predicts the value of the objective function at xk, thereby synchronizing the radius update with the change in models. This paper uses the theoretical properties and practical numerical potential of the new trust region algorithm.
Parallel Scan Algorithm
An important primitive for (data) parallel computing is the scan operation, also called prefix sum which takes an associated binary operator and an ordered set [a1, ..., an] of n elements and returns the ordered set example, plus scan([1,0,3,4,5,6,7]) = [1,3,7,0,4,1,6,3]. Fig 3,4 show all the possible transitions of the strides from one to four.
Notice that computing the scan of an n-element array requires n - 1 serial operations. Assuming we have n DNA sequence, each with one element of the array.
[FIGURE 3 OMITTED]
'To' denote the first sequence stretch of DNA sequence and we have to find the next transition based on the dependency vector. Third cell may depend on one to get transformed to four, depending on the revisiting rule of Cellular Automata. This rule also denotes the transition form one state to another is unique with respect to the state and input at that state.
[FIGURE 4 OMITTED]
Protein Folding Algorithm
The DNA is organized into stretches of genes, stretches where proteins attach to coil the DNA into chromosomes, stretches that "turn a gene on" and "turn a gene off", and large stretches whose purpose is not yet known to scientists. Proteins attach to the DNA and help the strands coil up into a chromosome when the cell gets ready to divide. The genes carry the instructions for making all the thousands of proteins that are found in a cell. The proteins in a cell determine what that cell will look like and what jobs that cell will do.
PRM-based protein folding algorithm for finding the point where these proteins add to the ladder. In contrast to most current folding algorithms, it uses very few energy parameters. Remaining conformational space of a simplified real-space representation of chains to find a minimum energy of an exceedingly simple potential function. This algorithm returns 'r' which is the point where these proteins attach DNA strand.
Input: Protein's native [q.sub.native] Output: A roadmap R of the protein's potential energy landscape Generate n nodes biased towards [q.sub.native] for each q [member of] R do Let [N.sub.k](q) be the k closest neighbors of q for q' [member of] [N.sub.k](q) do if the local planner can find a path between q and q' then add edge (q, q') to R end if end for end for return r
Data and Method
The data used for this study are the human DNA data collected by Fickett and Tung. All the sequences are taken from GenBank in May 1992. Fickett and Tung have provided the 21 different coding measures that they surveyed and compared.
The benchmark human data include three different datasets. For the first dataset, non-overlapping human DNA sequences of length 54 have been extracted from all human sequences, with shorter pieces at the ends discarded.
Every sequence is labeled according to whether it is entirely coding, entirely non-coding, or mixed, and the mixed sequences (i.e., overlapping the exon-intron boundaries) are discarded.
The dataset also includes the reverse complement of every sequence. This means that one-half of the data is guaranteed to be from the non-sense strand of the DNA.
In the next section we will give the experimental results for finding this coding region for all sequence lengths.
The below tables shows the predictive accuracy of different algorithms on both coding and non-coding DNA sequences. In this section we present the results of NPCRIT classifier for Fickett and Tung's dataset. Values are given for the percentage accuracy on test set coding sequences and the percentage accuracy on test set non coding sequences.
[GRAPHIC 1 OMITTED]
[GRAPHIC 2 OMITTED]
The drawback of Database Search is it fixes the gene and predicts the protein coding region. Splicing algorithm only focuses on the length 108 DNA sequence. The Supervised CA is best for predicting 54 and 108 length DNA sequences regions. The UN supervised FMACA is used for predicting regions with DNA codon measure less than 56bp
But NPCRIT is used to find coding regions for all sequence lengths, and it accommodates dynamism by not fixing the gene at the attachment of proteins. The percentage accuracy reported was 75%. Graphs (1, 2) (Data base Search, Splicing Algorithm, Supervised CA, Un Supervised FMACA,NPCRIT) shows that NPCRIT can be used to identify protein coding regions and the results obtained were comparable with other standard algorithms.
NPCRIT (Fig: 5) overcome all the disadvantages of previous standard algorithms like fixing the position of the gene and static order of the DNA sequence. The average accuracy reported is 75%.
[FIGURE 5 OMITTED]
This paper presents the application of parallel algorithms in finding the protein coding regions. The proposed NPCRIT tool may be very much useful to solve many other bioinformatics problems like protein structure prediction, RNA structure prediction, promoter region identification, etc. .
I am grateful to Dr. P. Venkata Narasaiah Director, S.R.K Institute of Engineering and Technology, for his consistent support at every stage of my research work. I specially thank our President Sri B.S Apparao & Secretary Sri B.S. Sri Krishna for providing necessary infrastructure. I am indebted to my colleagues for providing wonderful atmosphere to work with.
 P. Kiran Sree, I. Ramesh Babu, "Identification of Protein Coding Regions in Genomic DNA Using Unsupervised FMACA Based Pattern Classifier" in International Journal of Computer Science & Network Security with ISSN: 1738-7906 Volume Number: Vol. 8, No. 1, 2008.
 P. Kiran Sree, R. Ramachandran, "Identification of Protein Coding Regions in Genomic DNA Using Supervised Fuzzy Cellular Automata" in International journal of Advances in Computer Science and Engineering, with ISSN: 0973-6999, Vol: 1, 2008.
 Eric E. Snyder, Gary D. Stormo, "Identification of Protein Coding Regions In Genomic DNA". ICCS Transactions 2002.
 Stephane Audic * and Jean-Michel Claverie, "Self-identification of protein-coding regions in microbial genomes", Structural and Genetic Information Laboratory, Centre National de la Recherche Scientifique-EP. 91, 2002.
 P. Flocchini, F. Geurts, A. Mingarelli, and N. Santoro (2000), "Convergence and Aperiodicity in Fuzzy Cellular Automata: Revisiting Rule 90," Physica D.
 P. Maji and P.P. Chaudhuri (2004), "FMACA: A Fuzzy Cellular Automata Based Pattern Classifier, "Proceedings of 9th International Conference on Database Systems, Korea, pp. 494-505, 2004.
 C.G. Langton (2000), "Self-reproduction in cellular automata," Physica D, vol. 10, pp. 135-144.
 T. Toffoli (1998), "Reversible computing," in Automata, Languages and Programming, ed. J.W. De Bakker and J. Van Leeuwen, pp. 632-644.
 G. Vichniac (1994), "Simulating physics with cellular automata," Physica D, vol. 10, pp. 96-115.
 S. Chattopadhyay, S. Adhikari, S. Sengupta, and M. Pal (2000), "Highly regular, modular, and cascadable design of cellular automata-based pattern classifier," IEEE Trans. Very Large Scale Integr. Syst., vol.8, no.6,
 J. Fickett (1982), "Recognition of protein coding regions in dna sequences,"Nucleic Acids Res., vol. 10, pp. 5303-5318.
 R. Farber, A. Lapedes, and K. Sirotkin (1992), "Determination of eukaryotic protein coding regions using neural networks and information theory," J. Mol. Biol., vol. 226, pp. 471-479.
 P. Maji and P. P. Chaudhuri (2004), "Fuzzy Cellular Automata For Modeling Pattern Classifier," Accepted for publication in IEICE.
 F. Bastin1, V. Malmedy (2) "A Retrospective Trust-Region Method for Unconstrained Optimization" Proc. Natl. Acad. Sci., USA, vol. 88, pp. 11261-11265.
P. Kiran Sree
Associate Professor, Dept. of Computer Science, S.R.K Institute of Technology, Enikepadu, Vijayawada, India
E-mail : email@example.com, Mobile: +919959818274.
P. Kiran Sree received his B.Tech. in Computer Science & Engineering from J.N.T.U. and M.E. in Computer Science & Engineering from Anna University. He has published many technical papers; both in international and national Journals. His areas of interests include Parallel Algorithms, Artificial Intelligence, Compile Design and Computer Networks. He also wrote books on Analysis of Algorithms, Theory of Computation and Artificial Intelligence. He is now associated with S.R.K. Institute of Technology, Vijayawada.
Table 1: Predictive Accuracy for length 54 human DNA Sequence Algorithm Coding Non Coding Dicodon Usage 61% 57% Bayesian 51% 46% SUCA 78% 72% UN FMACA 79% 72% NPCRIT 79.5% 72.8% Table 2: Predictive Accuracy for length 108 human DNA sequence Algorithm Coding Non Coding Dicodon Usage 58% 50% Bayesian 45% 36% SUCA 74% 69% UN FMACA 75% 70% NPCRIT 76% 73% Table 3: Predictive Accuracy for length 252 human DNA sequence Algorithm Coding Non Coding Dicodon Usage 65% 54% Bayesian 50% 44% SUCA 71% 70% UN FMACA 73% 72% NPCRIT 74.5% 76.0%
Table 1: The values of the parameters in avian-human influenza epidemic model Parameter Value Description c 26.5 the rate at which new birds are born b 5 death rate of susceptible and infected birds m 5 the additional death rate mediated by avian influenza [omega] 2 the rate at which avian influenza is contracted from an average bird individual [lambda] 3 the rate at which new humans are born [mu] 0.015 death rate of susceptible and infected humans d 1 the additional death rate mediated by avian influenza [[beta].sub.1] 0.2 the rate at which avian influenza is contracted from an average bird individua [[beta].sub.2] 0.003 the rate at which mutant avian influenza is contracted from an average human individual [gamma] 0.01 the recovery rate [epsilon] 0.001 the mutation rate [alpha] 0.06 the additional death rate mediated by mutant avian influenza
|Printer friendly Cite/link Email Feedback|
|Author:||Sree, P. Kiran|
|Publication:||International Journal of Biotechnology & Biochemistry|
|Date:||Jan 1, 2008|
|Previous Article:||X-ray crystallographic studies and structure activity relationship of fungicides.|
|Next Article:||Quality assessment of gene expression data for an affymetrix platform with the two sample t-tests statistical analysis.|