Analysis and prediction of exon, intron, intergenic region and splice sites for A. thaliana and C. elegans genomes.

1. INTRODUCTION

With the completion of genome sequencing projects, increasing effort has been devoted to understanding the functional elements encoded in a genome [1,2,3,4,5,6]. Annotation of gene structure in eukaryotic genomes currently involves both computational and experimental approaches [7,8,9,10]. Driven by the explosion of genome data and the need to analyze draft data quickly, gene-finding programs have proliferated, particularly those designed for specific organisms [11,12,13,14,15]. However, their accuracy is still far from satisfactory [16].

Gene prediction methods can generally be classified as composition-based or similarity-based. Composition-based methods, also called ab initio gene-finding methods, have two important aspects: the type of information used and the algorithm. Most types of information measure codon usage bias, base compositional bias between codon positions, splice site signals, or periodicity in base occurrence. Several sophisticated algorithms that deduce the presence of a gene feature from signal and content information have been devised, including GenScan [17], Fgenes [18], Genie [19] and MZEF [20]. Although these programs gave some satisfactory results, a considerable proportion of missing or incorrect exons and over-predictions were found when they were tested on an experimentally validated dataset of genomic sequences [21]. In addition, most ab initio gene prediction programs rely on a large number of parameters. For example, GeneMark needs 12,288 parameters [22], which can lead to unreliable predictions for small genomes [23]. Similarity-based methods such as Genewise [24] and Procrustes [25] predict a gene by relying on homologous sequences.
These methods show high sensitivity and specificity for genes whose sequences are closely related to the known input sequence, but some species-specific genes are likely to be missed [7]. To improve prediction, programs that combine protein sequence similarity with ab initio gene-finding algorithms, such as GenomeScan [26], were proposed. Despite great progress, experiments have highlighted errors in the various predictions and indicated that both types of gene prediction programs are currently unable to determine whole gene structures consistently [27].

Although programs for splice site and gene structure recognition have reached a high level of performance on internal coding exons, standard splice sites might not be sufficient for defining introns in genomes [28]. Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. Distinguishing intergenic regions from introns should be very useful for understanding the features of noncoding and regulatory regions. In addition, finding first exons remains a challenge, except where true full-length mRNA sequences are available. Unfortunately, most of the available mRNA sequences are incomplete at their 5' ends and do not provide information about first exons. Recognizing exons, introns and intergenic DNA at the same time is therefore very helpful for gene recognition. In particular, earlier algorithms have found it difficult to distinguish introns from intergenic sequences.

In this paper, our goal is to provide a new computational method for predicting gene structure based on the least increment of diversity algorithm (LIDA). The diversity measure was first introduced and employed in biological classification [29]. Derived from information theory, it is an information description on a state space that measures the whole uncertainty and total information of a system.
To compare the similarity of two sources, one defines the increment of diversity (ID) as the difference between the diversity measure of the mixed system and the sum of the diversity measures of the two systems. It can be proved that the higher the similarity of two sources, the smaller the ID, so the increment of diversity of two sources is essentially a measure of their similarity level.

Here, according to the theory of diversity, we first predict coding exons, introns and intergenic sequences of A. thaliana and C. elegans based on an analysis of the compositional differences near splice sites and in conserved sequence segments of the three kinds of sequences (exons, introns and intergenic sequences) in the complete genomes of these two model organisms. Subsequently, three kinds of coding exons (first coding exons, internal coding exons and last coding exons) are predicted using the least increment of diversity algorithm. This may be useful for improving the prediction of splice sites.

2. EXPERIMENTAL

2.1. Data Sample

The A. thaliana and C. elegans genomic DNA sequences were obtained from GenBank. The coding exons, introns and intergenic sequences were extracted from these genomes. According to the length distribution, we divided all sequences of one chromosome into three subsets. The ranges of the three subsets are 30-200 bp, 200-500 bp and >=500 bp for exon and intron sequences, and 30-2000 bp, 2000-5000 bp and >=5000 bp for intergenic sequences. From the complete A. thaliana genome, 15609 first coding exons, 67408 internal coding exons and 15791 last coding exons were extracted. From the complete C. elegans genome, 10904 first coding exons, 87743 internal coding exons and 11035 last coding exons were extracted. Subsequences of 9 bases flanking the 5' boundary sites (from site -5 to site +4) and the 3' boundary sites (from site -4 to site +5) were also extracted from these genome sequences.

2.2. Least Increment of Diversity Algorithm (LIDA)

Because the increment of diversity (ID) can measure the increment of whole uncertainty (or information) between two data sources, it has been widely applied in bioinformatics, for example in protein structural class prediction [30], prediction of the subcellular location of apoptosis proteins [31] and secretory protein prediction [32]. To improve prediction capability, ID combined with other predictive models has been applied to exon/intron splice site prediction [33], human Pol II promoter prediction [34] and protein predictions [35,36,37,38,39,40,41,42]. For the reader's convenience, the theory of diversity is introduced as follows.

Definition 1. For a state space $X\{n_1, n_2, \ldots, n_s\}$ consisting of $s$ information symbols, if $n_i$ indicates the number of occurrences of the $i$-th state, then the diversity measure for the diversity source $X: [n_1, n_2, \ldots, n_s]$ is defined as [30]

$$D(X) = D(n_1, n_2, \ldots, n_s) = N \log N - \sum_{i=1}^{s} n_i \log n_i \tag{1}$$

where $N = \sum_{i=1}^{s} n_i$. It is easily proved that the diversity equals $N$ times the information entropy [43].

Definition 2. If there are two sources of diversity in the same space of $s$ dimensions, $X: [n_1, n_2, \ldots, n_s]$ and $Y: [m_1, m_2, \ldots, m_s]$, we may define the increment of diversity as

$$\Delta(X, Y) = D(X + Y) - D(X) - D(Y) \tag{2}$$

where $D(X+Y)$ is the measure of diversity of the mixed source $X+Y: [n_1 + m_1, n_2 + m_2, \ldots, n_s + m_s]$. Note that $\Delta(X, Y)$ is a function of two sources. It is easily proved that the increment of diversity [Eq. (2)] is nonnegative and symmetric. Therefore, $\Delta(X, Y)$ is regarded as a quantitative measure of the similarity level of two independent systems.

2.3. Prediction of Exon, Intron and Intergenic Sequence

A DNA sequence can be represented by a diversity source $X: [S_i, N_{jk}, M_{lk}]$, where $S_i$ is the absolute frequency of the $i$-th trinucleotide in the sequence ($i = 1, 2, \ldots, 4^3$); $N_{jk}$ is the absolute frequency of base $k$ at the $j$-th position from the beginning of the 5' boundary ($j = 1, 2, \ldots, 15$); and $M_{lk}$ is the absolute frequency of base $k$ at the $l$-th position from the end of the 3' boundary ($l = -1, -2, \ldots, -15$). By calculating these 184 ($4^3 + 15 \times 4 + 15 \times 4$) parameters for the exons, introns and intergenic sequences in the standard sets (training sets), we deduce three standard sources of diversity $X_\xi: [n^\xi_1, n^\xi_2, \ldots, n^\xi_{184}]$ in the state space of 184 dimensions (here $\xi = e, i, g$ indicates the exon, intron and intergenic sequence, respectively). Three standard measures of diversity follow from Eq. (1), namely

$$D(X_\xi) = N_\xi \log N_\xi - \sum_{k=1}^{184} n^\xi_k \log n^\xi_k \tag{3}$$

where $N_\xi = \sum_{k=1}^{184} n^\xi_k$ ($\xi = e, i, g$).

Suppose that $X$ is a DNA sequence whose class is to be predicted. In the same state space, the measure of diversity of sequence $X$ can be expressed as

$$D(X) = M \log M - \sum_{k=1}^{184} m_k \log m_k \tag{4}$$

where $M = \sum_{k=1}^{184} m_k$. The increments of diversity between the diversity source $X: [m_1, m_2, \ldots, m_{184}]$ and the three standard diversity sources $X_\xi: [n^\xi_1, n^\xi_2, \ldots, n^\xi_{184}]$ ($\xi = e, i, g$) are

$$\Delta(X, X_\xi) = D(X + X_\xi) - D(X) - D(X_\xi) \quad (\xi = e, i, g) \tag{5}$$

Sequence $X$ is predicted to belong to the class for which the corresponding increment of diversity has the minimum value, which can be formulated as

$$\Delta(X_\xi, X) = \mathrm{Min}\{\Delta(X_e, X), \Delta(X_i, X), \Delta(X_g, X)\} \tag{6}$$

where $\xi$ can be $e$, $i$ or $g$ and the operator Min takes the minimum of the values in the braces. The $\xi$ in Eq. (6) then gives the sequence class to which the predicted sequence $X$ should belong.

2.4. Prediction of Three Kinds of Coding Exons

For each coding exon, the following three kinds of codon positions are investigated to select optimal parameters.

1) The three bases before the 5' boundary sites of exons (acceptor sites) and after the 3' boundary sites of exons (donor sites) are chosen as information parameters of the diversity source.

AGA GCA ↑ ATG G.... A TGC ↑ GTA AGA

2) The three bases after the 5' boundary sites of exons (acceptor sites) and before the 3' boundary sites of exons (donor sites) are chosen as information parameters of the diversity source.

AGA GCA ↑ ATG G.... A TGC ↑ GTA AGA

3) The six bases flanking the 5' boundary sites of exons (acceptor sites) and the 3' boundary sites of exons (donor sites) are chosen as information parameters of the diversity source.

AGA GCA ↑ ATG G.... A TGC ↑ GTA AGA

(where ↑ indicates the 5' or 3' exon boundary sites)

By calculating the absolute frequencies of the four bases at the above positions near the splice sites of first coding exons, internal coding exons and last coding exons, we deduce three standard sources of diversity $X_\xi: \{N^\xi_{ja} \mid j = 1, 2, 3;\ a = A, C, G, T\}$ in the state space of 12 dimensions (here $\xi = f, i, l$ corresponds to first coding exon, internal coding exon and last coding exon, respectively).
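The definitions in Eqs. (1), (2) and (6) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the convention that 0 log 0 = 0, and the toy three-base segments are assumptions introduced here. The text does not spell out how the two exon boundaries are combined into the 12-dimensional source, so the sketch simply takes one three-base segment per exon (3 positions x 4 bases = 12 counts).

```python
import math

BASES = "ACGT"

def diversity(counts):
    """Eq. (1): D = N log N - sum_i n_i log n_i, taking 0 log 0 as 0 (assumption)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """Eq. (2): Delta(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors."""
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)

def positional_counts(segments):
    """Absolute frequency of each base at each position of equal-length segments.

    Assumes every segment contains only A, C, G, T.
    """
    length = len(segments[0])
    counts = [0] * (4 * length)
    for seg in segments:
        for j, base in enumerate(seg):
            counts[4 * j + BASES.index(base)] += 1
    return counts

def classify(query, standards):
    """Eq. (6)/(10): the class whose standard source gives the least increment."""
    return min(standards, key=lambda label: increment_of_diversity(query, standards[label]))

# Hypothetical toy training segments, not data from the paper.
standards = {
    "first": positional_counts(["ATG", "ATG", "ATG", "GTG"]),
    "internal": positional_counts(["CAG", "CAG", "TAG", "CAG"]),
}
query = positional_counts(["ATG"])
print(classify(query, standards))
```

The same `classify` function would apply unchanged to the 184-dimensional sources of Section 2.3, since Eqs. (1) and (2) only require that the query and the standard sources share one state space.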
Then, three standard measures of diversity for the three kinds of coding exons can be calculated by Eq. (1), namely

$$D(X_\xi) = N_\xi \log N_\xi - \sum_{k=1}^{12} n^\xi_k \log n^\xi_k \tag{7}$$

where $N_\xi = \sum_{k=1}^{12} n^\xi_k$ ($\xi = f, i, l$). Suppose that $S$ is an exon whose class is to be predicted. In the same state space, its measure of diversity can be expressed as

$$D(S) = M \log M - \sum_{k=1}^{12} m_k \log m_k \tag{8}$$

where $M = \sum_{k=1}^{12} m_k$. According to Eq. (2), the increments of diversity between source $S$ and the three standard sets are

$$\Delta(S, X_\xi) = D(S + X_\xi) - D(S) - D(X_\xi) \quad (\xi = f, i, l) \tag{9}$$

Exon $S$ is predicted to belong to the class for which the corresponding increment of diversity has the minimum value, which can be formulated as

$$\Delta(X_\xi, S) = \mathrm{Min}\{\Delta(X_f, S), \Delta(X_i, S), \Delta(X_l, S)\} \tag{10}$$

where $\xi$ can be $f$, $i$ or $l$ and the operator Min takes the minimum of the values in the braces. The $\xi$ in Eq. (10) then gives the class to which the predicted coding exon $S$ should belong.

3. RESULTS

3.1. Evaluating Predicted Performance of Proposed Method

To evaluate the correct prediction rate and reliability of a predictive method, the sensitivity ($S_n$), specificity ($S_p$) and correlation coefficient (CC) are defined by

$$S_n = \frac{TP}{TP + FN}$$

$$S_p = \frac{TP}{TP + FP}$$

$$CC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TN + FN)(TP + FN)(TN + FP)}}$$

For a given sequence class $\xi$, TP denotes the number of sequences correctly predicted to be in class $\xi$ (true positives), FP denotes the number of sequences incorrectly predicted to be in class $\xi$ (false positives), TN denotes the number of sequences correctly predicted not to be in class $\xi$ (true negatives), and FN denotes the number of sequences incorrectly predicted not to be in class $\xi$ (false negatives). Sensitivity shows the rate of correct prediction. Specificity shows the confidence level of the predictive method. The correlation coefficient (CC) reflects the overall performance of the prediction algorithm.

3.2. The Prediction of Exon, Intron and Intergenic Sequence

Approximately 1/2 of the sequences in each subset are randomly chosen by computer program as the standard set (training set), and the remaining 1/2 form the test set. To eliminate the dependence of the predictive results on the training dataset, the standard set (training set) is randomly selected 10 times. The numbers of known coding exons, introns and intergenic sequences are shown in Table 1. Based on Eq. (6), the three classes of sequences are predicted using the 184 information parameters. To compare the prediction quality of different information parameters, we also run our algorithm to predict exons, introns and intergenic sequences using only the 64 trinucleotides. The comparison of test-set results between the 64 and 184 signal parameters for A. thaliana (A) and C. elegans (C) is shown in Table 2.

3.3. The Prediction of Three Kinds of Coding Exons

For predicting the three types of coding exons, a total of 1000 first coding exons, 1000 internal coding exons and 1000 last coding exons are randomly selected as training sets from the gene sequences of A. thaliana and C. elegans. The remaining sequences are regarded as the test sets. To eliminate the dependence of the predictive results on the training dataset, this selection procedure is repeated 10 times. According to Eq. (10), the three types of coding exons are predicted using the different information parameters. The results are shown in Table 3. As seen from Table 3, the first parameter-choosing method achieves the best results among the three kinds of parameters.

4. DISCUSSION

The recognition results for exons, introns and intergenic sequences show that the $S_n$, $S_p$ and CC values with 184 parameters are generally higher than those with 64 signals. For A. thaliana (A) and C. elegans (C), the average correct prediction rates on the standard sets are 88.6% and 88.2%, and those on the test sets are 93.6% and 88.4%, respectively. The overall correct prediction rates are 91.1% and 88.4%, respectively. To further evaluate the performance of the proposed method, exons, introns and intergenic sequences of D. melanogaster and S. cerevisiae were predicted using the 184 parameters. Overall accuracies of 92.28% and 94.88% were achieved for D. melanogaster and S. cerevisiae, respectively. We also applied LIDA to predict coding regions and intergenic sequences of E. coli, achieving an overall accuracy of 92.88%. Despite great progress, however, gene prediction based entirely on DNA analysis is still far from perfect.
In a recent comparison of gene-prediction programs, the best algorithms in two well-annotated regions achieved sensitivities (a measure of the ability to detect true positives) and specificities (a measure of the ability to discriminate against false positives) of less than 95% and 90%, respectively, for different genomes [44,45]. In our method, three kinds of sequences (exons, introns and intergenic sequences) are predicted simultaneously. Taking random effects into account, the correct prediction rate for three kinds of sequences is only 2/3 of that for two kinds of sequences (exons and introns). That is, if two types of sequences are predicted simultaneously, the rate of correct prediction by chance is 1/2; if three types are predicted simultaneously, it is 1/3. For example, a 90% correct prediction rate for two types of sequences is equivalent to only 60% for three types. Thus, a given correct prediction rate in our results corresponds to a higher rate than the same figure obtained by methods that predict only two kinds of sequences.

The results for the three types of coding exons indicate that, among the three selections, the sensitivity ($S_n$), specificity ($S_p$) and correlation coefficient (CC) are best overall when the three bases before the 5' boundary sites of exons and after the 3' boundary sites of exons are used. In particular, the correlation coefficient (CC) is clearly higher with the first choosing method than with the second and third methods. This is consistent with the highly conserved sequences near the ends of introns and the conserved GT-AG rule. The three kinds of coding exons have not been studied by other methods. In addition, from the statistical analysis of sequences in the regions near splice sites, we find special preferences for certain bases. The results show that the sequence near the splice sites is strongly conserved.
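A positional base preference of this kind can be tallied from aligned boundary segments, such as the 9-base subsequences described in Section 2.1. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the input segments are invented for illustration and are not data from the paper, and all segments are assumed to have equal length.

```python
from collections import Counter

def position_base_frequencies(segments):
    """Relative frequency of A, C, G, T at each position of aligned segments."""
    length = len(segments[0])
    table = []
    for j in range(length):
        column = Counter(seg[j] for seg in segments)
        total = sum(column.values())
        table.append({base: column.get(base, 0) / total for base in "ACGT"})
    return table

def most_frequent_bases(table):
    """The most frequent base at each position (first in ACGT order on ties)."""
    return [max(freqs, key=freqs.get) for freqs in table]

# Invented example segments, for illustration only.
example = ["AGGT", "AGGT", "CGGT", "AGAT"]
freqs = position_base_frequencies(example)
print(most_frequent_bases(freqs))
```

Comparing such tables between genomes, position by position, is one simple way to look for the species-specific preferences that the discussion describes.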
Apart from the GT-AG rule, there is a strong bias toward base G at site -4 from the 3' terminus of introns in the A. thaliana genome, whereas base T is favored at the same site in the C. elegans genome. The stop codons of the two model species are biased toward TAA, and the bases GT and AT are favored at the two sites after the stop codon in the A. thaliana and C. elegans genomes, respectively. This may be a signal for stopping translation. Base A is favored at positions -4, -2 and -1 before translation start sites, and bases G and A, respectively, are favored at the 4th site after translation start sites (TSS). These biases may be related to translation start signals. In addition, the base bias at the first site of the 5' terminus of internal coding exons and last coding exons differs between the A. thaliana and C. elegans genomes: base G is favored in A. thaliana and base A in C. elegans. Further statistics on the base pairs in the boundary regions of exons show that first coding exons and internal coding exons in both genomes generally end with AG. Internal coding exons and last coding exons generally start with GT in the A. thaliana genome, but with AT in the C. elegans genome. This may provide additional information for splice sites. These results may be very useful for improving the correct prediction rate of splice sites.

5. CONCLUSIONS

This paper proposed a novel algorithm, the increment of diversity (ID), for gene structure prediction. The algorithm can be derived from information entropy. It is well known that mutual information describes how to extract information about b from source a if the conditional probability p(b|a) is known [33]. ID, however, differs from mutual information: it describes the increment of complexity between two information sources. Our prediction results also show that ID is a promising method.

doi: 10.4236/jbise.2009.26053

6. ACKNOWLEDGEMENTS

The authors thank Professor C. J. Benham and Dr. H. Q. Wang at UC Davis for helpful discussions. The work was supported by National Science Foundation of China, No. 30560039.

Received 18 June 2008; revised 31 May 2009; accepted 8 June 2009.

REFERENCES

[1] J. L. Ashurst and J. E. Collins, (2003) Gene annotation: Prediction and testing, Annu. Rev. Genomics Hum. Genet., 4, 69-88.
[2] M. Nowrousian, C. Wurtz, S. Poggeler, and U. Kuck, (2004) Comparative sequence analysis of Sordaria macrospora and Neurospora crassa as a means to improve genome annotation, Fungal Genetics and Biology, 41, 285-292.
[3] E. Eden and S. Brunak, (2004) Analysis and recognition of 5'UTR intron splice sites in human pre-mRNA, Nucleic Acids Res., 32, 1131-1142.
[4] M. Kozak, (2006) Rethinking some mechanisms invoked to explain translational regulation in eukaryotes, Gene, 382, 1-11.
[5] H. A. Meijer and A. A. M. Thomas, (2002) Control of eukaryotic protein synthesis by upstream open reading frames in the 5'-untranslated region of an mRNA, Biochem. J., 367, 1-11.
[6] F. B. Guo and X. J. Yu, (2007) Re-prediction of protein-coding genes in the genome of Amsacta moorei entomopoxvirus, Journal of Virological Methods, 146, 389-392.
[7] F. B. Guo and C. T. Zhang, (2006) ZCURVE_V: A new self-training system for recognizing protein-coding genes in viral and phage genomes, BMC Bioinformatics, 7, 9.
[8] Y. H. Qiao, J. L. Liu, C. G. Zhang, X. H. Xu, and Y. J. Zeng, (2005) SVM classification of human intergenic and gene sequences, Mathematical Biosciences, 195, 168-178.
[9] V. Brendel, L. Xing, and W. Zhu, (2004) Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus, Bioinformatics, 20, 1157-1169.
[10] S. Karlin, J. Mrázek, and A. J. Gentles, (2003) Genome comparisons and analysis, Current Opinion in Structural Biology, 13, 344-352.
[11] S. Gopal, G. A. M. Cross, and T. Gaasterland, (2003) An organism-specific method to rank predicted coding regions in Trypanosoma brucei, Nucleic Acids Res., 31, 5877-5885.
[12] S. D. Schlueter, Q. Dong, and V. Brendel, (2003) GeneSeqer@PlantGDB: Gene structure prediction in plant genomes, Nucleic Acids Res., 31, 3597-3600.
[13] J. E. Moore and J. A. Lake, (2003) Gene structure prediction in syntenic DNA segments, Nucleic Acids Res., 31, 7271-7279.
[14] J. Wang, et al., (2003) Vertebrate gene predictions and problem of large genes, Nature Reviews Genetics, 4, 741-749.
[15] F. Gao and C. T. Zhang, (2004) Comparison of various algorithms for recognizing short coding sequences of human genes, Bioinformatics, 20, 673-681.
[16] M. Q. Zhang, (2002) Computational prediction of eukaryotic protein-coding genes, Nature Reviews Genetics, 3, 698-709.
[17] C. Burge and S. Karlin, (1997) Prediction of complete gene structures in human genomic DNA, J. Mol. Biol., 268, 78-94.
[18] V. V. Solovyev, A. A. Salamov, and C. B. Lawrence, (1995) Identification of human gene structure using linear discriminant functions and dynamic programming, Proc. Int. Conf. Intell. Syst. Mol. Biol., 3, 367-375.
[19] M. G. Reese, D. Kulp, H. Tammana, and D. Haussler, (2000) Genie-Gene finding in Drosophila melanogaster, Genome Res., 10, 529-538.
[20] S. Rogic, A. K. Mackworth, and F. B. Ouellette, (2001) Evaluation of gene-finding programs on mammalian sequences, Genome Res., 11, 817-832.
[21] M. Q. Zhang, (1997) Identification of protein coding regions in human genome by quadratic discriminant analysis, Proc. Natl. Acad. Sci. USA, 94, 565-568.
[22] J. Besemer, A. Lomsadze, and M. Borodovsky, (2001) GeneMarkS: A self-training method for prediction of gene starts in microbial genomes, implications for finding sequence motifs in regulatory regions, Nucleic Acids Res., 29, 2607-2618.
[23] F. B. Guo, H. Y. Ou, and C. T. Zhang, (2003) ZCURVE: A new system for recognizing protein-coding genes in bacterial and archaeal genomes, Nucleic Acids Res., 31, 1780-1789.
[24] E. Birney and R. Durbin, (2000) Using GeneWise in the Drosophila annotation experiment, Genome Res., 10, 547-548.
[25] M. S. Gelfand, et al., (1996) Gene recognition via spliced sequence alignment, Proc. Natl. Acad. Sci. USA, 93, 9061-9066.
[26] R. F. Yeh, L. P. Lim, and C. B. Burge, (2001) Computational inference of homologous gene structures in the human genome, Genome Res., 11, 803-816.
[27] I. M. Meyer and R. Durbin, (2004) Gene structure conservation aids similarity based gene prediction, Nucleic Acids Res., 32, 776-783.
[28] L. P. Lim and C. B. Burge, (2001) A computational analysis of sequence features involved in recognition of short introns, Proc. Natl. Acad. Sci. USA, 98, 11193-11198.
[29] R. R. Laxton, (1978) The measure of diversity, J. Theor. Biol., 70, 51-67.
[30] Q. Z. Li and Z. Q. Lu, (2001) The prediction of the structural class of protein: Application of the measure of diversity, J. Theor. Biol., 213, 493-502.
[31] Y. L. Chen and Q. Z. Li, (2007) Prediction of the subcellular location of apoptosis proteins, J. Theor. Biol., 245, 775-783.
[32] Y. C. Zuo and Q. Z. Li, (2009) Using K-minimum increment of diversity to predict secretory proteins of malaria parasite based on groupings of amino acids, Amino Acids, DOI 10.1007/s00726-009-0292-1.
[33] L. R. Zhang and L. F. Luo, (2003) Splice site prediction with quadratic discriminant analysis using diversity measure, Nucleic Acids Res., 31, 6214-6220.
[34] J. Lu and L. F. Luo, (2005) Human Pol II promoter prediction, Prog. Biochem. Biophys., 32, 1185-1191.
[35] H. Lin and Q. Z. Li, (2007) Predicting conotoxin super-family and family by using pseudo amino acid composition and modified Mahalanobis discriminant, Biochem. Biophys. Res. Commun., 354, 548-551.
[36] H. Lin and Q. Z. Li, (2007) Using pseudo amino acid composition to predict protein structural class: Approached by incorporating 400 dipeptide components, J. Comput. Chem., 28, 1463-1466.
[37] F. M. Li and Q. Z. Li, (2008) Using pseudo amino acid composition to predict protein subnuclear location with improved hybrid approach, Amino Acids, 34, 119-125.
[38] X. Z. Hu and Q. Z. Li, (2008) Prediction of the β-hairpins in proteins using support vector machine, Protein J., 27, 115-122.
[39] H. Lin, (2008) The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition, J. Theor. Biol., 252, 350-356.
[40] X. Z. Hu, Q. Z. Li, and C. L. Wang, (2009) Recognition of beta-hairpin motifs in proteins by using the composite vector, Amino Acids, DOI 10.1007/s00726-009-0299-7.
[41] W. Chen and L. Luo, (2009) Classification of antimicrobial peptide using diversity measure with quadratic discriminant analysis, J. Microbiol. Methods, DOI 10.1016/j.mimet.2009.03.013.
[42] Y. Feng and L. Luo, (2008) Use of tetrapeptide signals for protein secondary-structure prediction, Amino Acids, 35, 607-614.
[43] L. Luo, (2006) Information biology: Hypotheses on coding information quantity, Acta Scientiarum Naturalium Universitatis NeiMongol, 37, 285-294.
[44] Z. Wang, Y. Z. Chen, and Y. X. Li, (2004) A brief review of computational gene prediction methods, Geno. Prot. Bioinfo., 2, 216-221.
[45] L. Stein, (2001) Genome annotation: From sequence to biology, Nature Rev. Genet., 2, 493-503.

Hao Lin (1,2) *, Qian-Zhong Li (1), Cui-Xia Chen (1,3)
(1) Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot, China;
(2) Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China;
(3) CapitalBio Corporation, Beijing, China;
Correspondence should be addressed to Hao Lin. Email: hlin@uestc.edu.cn

Table 1. The length distribution of three kinds of sequences in the chromosomes of the two model species.

Standard set
Genome               Class        1st subset  2nd subset  3rd subset   Total
A. thaliana Chr1~4   Exon              15229        4723        2126   22728
                     Intron            16130        3183         919   20329
                     Intergenic         6109        2525        1109    9747
C. elegans Chr1~6    Exon              10507        4896        1002   16739
                     Intron            12181        2859        2283   17354
                     Intergenic         5023        1446        1109    7617

Test set
Genome               Class        1st subset  2nd subset  3rd subset   Total
A. thaliana Chr1~4   Exon              14982        4868        2417   22267
                     Intron            16181        3405         870   20456
                     Intergenic         6742        2490        1105   10337
C. elegans Chr1~6    Exon              12214        4809        1034   18057
                     Intron            13217        2935        2317   18469
                     Intergenic         5483        1598        1086    8167
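
The three length subsets in Table 1 follow the ranges given in Section 2.1. A minimal sketch of that binning is shown below. The function name is hypothetical, and two details are assumptions because the text does not state them: sequences shorter than 30 bp are treated as excluded, and a length that falls exactly on a boundary (for example 200 bp) is placed in the higher subset.

```python
def length_subset(length, kind):
    """Return 1, 2 or 3 for the length subset, or None if shorter than 30 bp.

    kind is "exon", "intron" or "intergenic". Boundary handling is an
    assumption: each range is treated as half-open, [low, high).
    """
    if kind == "intergenic":
        low, high = 2000, 5000
    else:
        low, high = 200, 500
    if length < 30:
        return None
    if length < low:
        return 1
    if length < high:
        return 2
    return 3

print(length_subset(150, "exon"), length_subset(3000, "intergenic"))
```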

Table 2. The results for test set with 64 and 184 signals of A. thaliana and C. elegans.

A. thaliana
No. of signals   Class        Sn (%)          Sp (%)         CC (%)
64               Exon         85 (95, 98)     94 (96, 95)    83 (92, 93)
                 Intron       85 (81, 73)     89 (91, 83)    78 (80, 73)
                 Intergenic   86 (92, 83)     65 (78, 80)    69 (79, 75)
184              Exon         84 (91, 94)     96 (98, 98)    84 (90, 91)
                 Intron       98 (98, 99)     88 (87, 79)    88 (88, 85)
                 Intergenic   88 (90, 84)     89 (94, 95)    86 (90, 86)

C. elegans
No. of signals   Class        Sn (%)          Sp (%)         CC (%)
64               Exon         73 (78, 88)     89 (95, 95)    70 (74, 89)
                 Intron       92 (75, 67)     87 (78, 87)    81 (66, 57)
                 Intergenic   66 (65, 78)     53 (41, 50)    50 (39, 47)
184              Exon         73 (76, 84)     92 (98, 98)    73 (76, 88)
                 Intron       99 (99, 100)    90 (85, 93)    91 (88, 92)
                 Intergenic   79 (85, 87)     65 (63, 90)    65 (67, 85)

The number outside the brackets is the predicted result for the 1st subset. The two numbers in brackets are the predicted results for the 2nd subset and the 3rd subset, respectively.
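
The Sn, Sp and CC columns follow the definitions in Section 3.1, applied to one class at a time against all other classes. A minimal sketch of that calculation is given below. The function names and the toy label lists are assumptions for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the paper.

```python
import math

def one_vs_rest_counts(true_labels, predicted_labels, cls):
    """Count TP, FP, TN, FN for class cls against all other classes."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == cls and truth == cls:
            tp += 1
        elif pred == cls:
            fp += 1
        elif truth == cls:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def evaluate(tp, fp, tn, fn):
    """Sn = TP/(TP+FN), Sp = TP/(TP+FP), CC as defined in Section 3.1."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc

# Toy labels (e = exon, i = intron, g = intergenic), invented for illustration.
truth = ["e", "e", "e", "i", "i", "g", "g", "g"]
pred = ["e", "e", "i", "i", "i", "g", "e", "g"]
print(evaluate(*one_vs_rest_counts(truth, pred, "e")))
```

Note that Sp as defined here is the fraction of predicted positives that are correct, which is the usual convention in gene-finding evaluations.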

Table 3. The results of prediction for three kinds of exons in A. thaliana and C. elegans genomes.

A. thaliana
Method                   Class of exon          Sn (%)   Sp (%)   CC (%)
First choosing method    First coding exon          86       74       76
                         Internal coding exon       93       93       77
                         Last coding exon           82       96       87
Second choosing method   First coding exon          90       54       63
                         Internal coding exon       68       95       55
                         Last coding exon           89       56       64
Third choosing method    First coding exon          86       57       64
                         Internal coding exon       74       94       58
                         Last coding exon           88       62       69

C. elegans
Method                   Class of exon          Sn (%)   Sp (%)   CC (%)
First choosing method    First coding exon          86       70       75
                         Internal coding exon       96       97       81
                         Last coding exon           87       98       89
Second choosing method   First coding exon          82       33       45
                         Internal coding exon       62       96       38
                         Last coding exon           87       34       48
Third choosing method    First coding exon          86       40       52
                         Internal coding exon       74       96       50
                         Last coding exon           88       49       61
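
The results in Table 3 come from the resampling protocol of Section 3.3, in which a fixed number of exons per class is drawn at random for training, the rest are used for testing, and the draw is repeated 10 times. A minimal sketch of such a protocol follows. The function names, the fixed seed and the small integer list standing in for exon records are all assumptions; the paper does not describe its random number generation or how the 10 runs are combined.

```python
import random

def split_train_test(items, n_train, rng):
    """Randomly pick n_train items for training; the rest form the test set."""
    chosen = set(rng.sample(range(len(items)), n_train))
    train = [x for k, x in enumerate(items) if k in chosen]
    test = [x for k, x in enumerate(items) if k not in chosen]
    return train, test

def repeated_splits(items, n_train, repeats=10, seed=0):
    """Generate `repeats` independent random train/test splits."""
    rng = random.Random(seed)
    return [split_train_test(items, n_train, rng) for _ in range(repeats)]

# Stand-in data: 20 integers in place of exon records, 5 drawn for training.
splits = repeated_splits(list(range(20)), n_train=5)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```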