Printer Friendly

New methods for DNA sequence similarity analysis.

1. INTRODUCTION

DNA is the nucleic acid that contains the genetic information for all biological systems. It consists of a long sequence of nucleotide blocks particularly: Adenine (A), Thymine (T), Cytosine (C) and Guanine (G). The arrangement of these four bases determines the exact function and coding capacity of the DNA sequence and helps in distinguishing between individuals.

With the rapid increase in the number of DNA sequences and the need to analyze these sequences and to find the similarity between these sequences, several techniques have been suggested. These techniques either allow insertions and deletions of elements in the DNA sequence [e.g. BLAST, DIALIGN, etc.] or treat the DNA sequences as complete (i.e. with no insertions or deletions) [e.g. Guo et al. 2008, Guohua 2011, etc.].

The first description of a sequence similarity search method that allows insertions, deletions, and gaps was published in [Needleman, 1970] where a computer program for finding similarities in the amino acid sequences of two proteins was developed.

Basic Local Alignment Search Tool (BLAST) is one of the most commonly used web tools for comparing primary biological sequence information whether proteins or DNA sequences. One problem that may occur with web tools is the semantic type mismatch in scientific workflows. This problem was tackled in [Kheiredine and Denis, 2007] and a similarity search on DNA sequences was applied that guarantees semantic type correctness in scientific workflows.

Another computer program that is used for multiple sequence alignment is DIALIGN. This program combines both local and global alignment features and uses dynamic programming in its algorithm [DIALIGN]. In order to improve the performance of this program, a hardware accelerator was built and combined with the program. Experiments showed a clear progress in the retrieval of alignments for large biological sequences. [Azzedine et al., 2010].

Several other tools were built whether for DNA sequences or proteins [Dominic et al., 2008; Elizabeth, 2008; and Boris, 2009]. A study that compares several tools was conducted experimentally in [Petri, 2008] shows that new variations of old algorithms were efficient in practice. In addition, the authors mentioned that the algorithms' efficiency comparison depends on the performance of the computer or the hardware that can impact significantly this performance.

However, the problem with the previous methods is that they usually involve assigning scores that differ from one algorithm to another during the alignment. In addition, the time and space complexity of these methods increase with the increase in the sequence length. Therefore, other techniques that don't take insertions or deletions under consideration are considered. These techniques are either based on graphical representation [e.g. Guohua et al. 2009, Jie et al. 2010, etc.], mathematical equations [e.g. Guo et al. 2008, Vinga and Almeida 2003, etc.], or vectors [e.g. Liao et al., 2005].

The graphical representation is one of the methods that usually don't use insertions or deletions. Using this method, DNA sequences should be encoded in a way that can be represented as a graph. As a result visual inspection of the similarity of the resulting graphs can be concluded easily. In [Milan et al. ,2003] the authors suggested a method of encoding DNA sequences to be represented graphically. In the proposed method, the authors represented each nucleotide as a unique number in a way that the representation avoids loss of information compared with other 2-D representations in which the curve that represents the DNA overlaps and intersects with itself.

In [Guohua, 2009], the four DNA nucleotides were denoted by four vectors within the first and the fourth quadrant in the two dimensional Cartesian coordinate system. In this representation T and C were assigned to the first quadrant while A and G were assigned to the fourth quadrant. The DNA sequence was read base by base. These bases were plotted as succeeding points on the graph along with their corresponding vectors. The resulting graph forms a unique representation for each DNA sequence with no degeneracy. This allows a visual inspection of similarity between DNA sequences.

A novel graphical representation of DNA sequences that comes from the biological knowledge that the four DNA bases can be classified as: R = {A; G} Y={C; T}, M = {A; C} and K = {G; T}, W = {A; T} and S= {G;C} according to their chemical properties were suggested in [Qi et al. 2006]. A mapping was constructed using the previous classification. The resulting points were connected producing a curve which was used to analyze the similarity between sequences [Qi et al. 2006].

In [Vinga and Almeida, 2003], the authors presented an alternative technique for the similarity analysis of DNA sequences. The technique uses the primitive discrimination substrings of sequence S and Q to define a new discrimination measure DM(S, Q) to analyze the similarity such that the smaller the discrimination measure is, the more similar the species are. The approach does not require sequence alignment and graphical representation, and besides, the authors claimed that it is fully automatic. The whole operation process utilizes the entire information contained in the DNA sequences and does not require any Human intervention.

Another mathematical method of analyzing DNA similarity is presented in [Jihong et al., 2011]. In their paper, the authors proposed the Quasi--Multi-Quadrics (Quasi-MQ EMD) method using Multi-Quadrics radial basis function MQ-RBF quasi-interpolation for approximating the extreme envelop, and use it to divide the nonlinear signal sequence corresponding to the DNA sequences into a set of well-behaved IMFs and a residue. The resulting residues are compared in order to conclude the similarity of different DNA sequences.

Ying and Wang [Guo et al. 2003] proposed a new method to analyze the similarity/dissimilarity of DNA sequences based on the graphical representation proposed by Milan et al. (2003). The Euclidean distance formula is used to find the distance between sequences. It was calculated after smoothing the original curve and calculating its curvature.

A novel mathematical metric that is based on graph theory was suggested in [Xingqin et al. 2011]. A weighted directed graph was built for each DNA sequence. A vector that was generated from the adjacency matrix of the graph was assigned to each DNA sequence. The vector was used to measure the similarity between DNA sequences based on both ordering and frequency of nucleotides.

Another method that is based on LZ complexity and dynamic programming to analyze DNA sequences is presented in [Xiaodong et al. 2011], Using this method, each DNA sequence is divided into a word set with the LZ complexity. Analysis of the similarity between DNA sequences is conducted by measuring shared information among their word-sets based on the dynamic programming algorithm. The research result were compatible with literature results.

This paper presents several methods, graphical and statistical, for finding similarity between DNA sequences with a comparison between them.

2. DATA COLLECTION

This paper considers the complete coding sequence of beta-Globin genes from 7 different species, which are relatively conservative and were studied in the literature [Guohua et al. 2009, Qi et al. 2007]. They are respectively Human [GenBank: U01317], Opossum[GenBank: J03643], Gallus [GenBank: V00409], Lemur [GenBank: M15734], Mouse [GenBank: V00722], Rabbit [GenBank: V00882] and Rat [GenBank: X06701]. The DNA sequence for Human, as an example, is shown in Figure 1.
Figure 1. The coding sequence of P-Globin gene of Human[Guohua
et al.,2009].

Human ACCESSION U01317; REGION: join(62187 ... 62278;
62409 ... 62631; 63482 ... 63610) Exon1 1 ... 92; Exon2 93 ...
315; Exon3 316 ... 444;
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTG
CCCT GTGGGGCAAGGT GAACGT GGAT GAAGTT GGT GGTG
AGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCA
GAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGA
AAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGA
CAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCAC
TGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCC
TGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGG
CAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAA
GTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATC
ACTAA


3. A NEW 2D GRAPHICAL REPRESENTATION OF DNA SEQUENCES SIMILARITY

In order to represent DNA sequences graphically, they should be encoded numerically. The novel 2-D graphical encoding presented by (Milan, 2003), is also used here in our paper. In it, four horizontal lines are drawn separated by unit distances, on which dots representing the bases constituting the considered sequences are placed. Adjacent dots are then connected with lines forming a curve that illustrates the DNA segment considered.

Figure 2 shows the novel 2-D graphical representation of the DNA segment of the first 10 bases, ATGGTGCACC, of the coding sequence of the first Exon of Human beta-Globin gene.

[FIGURE 2 OMITTED]

After representing DNA sequences numerically, to compare between two DNA sequences and see the regions where there is similarity and the regions where there is dissimilarity, the following method is followed.

Let X=[x.sub.1][x.sub.2] ... [x.sub.n] denotes the first DNA sequence, Y=[y.sub.1][y.sub.2] ... [y.sub.n] denotes the second DNA sequence.

Then the two vectors that represent X and Y numerically are:

V1=([w.sub.1],[w.sub.2] ... [w.sub.n]), V2=([z.sub.1],[Z.sub.2],.[Z.sub.n]) where

1<=[w.sub.i],[x.sub.i]<=4 and 1<=i<=n.

V1-V2=([w.sub.1][z.sub.1], [w.sub.2]-[z.sub.2], ..., [w.sub.n]-[Z.sub.n]).

The similarity vector S=([s.sub.1],[s.sub.2],.[s.sub.n]), can be constructed such that:

[S.sub.i]=[S.sub.i-1] + 5 if [w.sub.i]-[Z.sub.i]=0,

[S.sub.i]=[S.sub.i-1] -5 otherwise, where 1<=i<=n and

[S.sub.0]=0.

Figure 3 shows the results of the comparison between: (Human and Rabbit), and (Human and Opossum) where the curve shows regions that are increased which indicates similarity at these regions, and other regions that are decreased which indicates dissimilarity. The similarity (increasing regions) indicates a biological relationship between these two species and suggests an evolutionary relation between them.

While the dissimilarity that appears in some positions on the curve indicates numerous mutations. The curve atpart a of figure 3 has more increasing regions than decreasing regions which indicates a high similarity between Human and Rabbit. While part b of Figure 3 demonstrates a similarity between Human and Opossum until around point 80 where a diversion occurs indicating a high dissimilarity between these two species.

The same can be done to all species. Figure 4 shows the similarity between Human and other species. The resulting figures agree with previous researches results [Guohua et al., 2009].

[FIGURE 3 OMITTED]

[FIGURE 4 OMITTED]

Our method has several advantages over other methods such as:

The similarity/dissimilarity regions can be easily inspected visually. The complexity of the method is linear which is much better than other algorithms that require higher complexity. The resulting similarity figure (such as Figure 4) doesn't overlap or intersect itself. In addition, the method doesn't involve any alignment process.

4. SIMILARITY BASED ON CORRELATION COEFFICIENT

Another method that can be used to check the similarity/dissimilarity between DNA sequences which is not based on graphical representation is shown here. This method depends on the correlation coefficient measure to decide the percent of similarity between sequences. Pearson's correlation coefficient is used here. The Pearson product-moment correlation coefficient(r) for two sets of values, x and y, is given by the formula:

r = [[SIGMA](x - [bar.x])(y - [bar.y])/[square root of [SIGMA] [(x - [bar.x]).sup.2]] [square root of [SIGMA] [(y - [bar.y]).sup.2]]]

where [bar.x] and [bar.y] are the sample means of the two arrays of values.

If the value of r is close to +1, this indicates a strong positive correlation, and if r is close to -1, this indicates a strong negative correlation.

The correlation coefficient gives a number that is between 1 and -1. Taking the absolute value of the correlation, the resulting number is between 0 and 1 with 1 indicating the most similarity and 0 indicating the least similarity.

Table 1 shows the results of calculating the correlation coefficient between the 7 species after representing them numerically using the same representation used above for the graphical method. Note that Human and Rabbit have a similarity of 0.85 while Human and Opossum has a similarity indicator of 0.18 which agrees with the previous graphical results shown above (i.e. Figures: 3 and 4). Table 1 indicates that six species (all except Opossum) are more similar to each other than the Opossum which is the most dissimilar to others. In addition, the most similar pairs of species are Human-Rabbit, Human-Lemur, and Rat-Mouse. These results agree with previous research results [Guohua et al.,2009].

5. SIMILARITY BASED ON GIBBS FREE ENERGY

The Gibbs free energy is one of the measures that best represents the hybridization reaction between the sequence and its Watson-Crick complement. It depends on the DNA sequence itself in addition to several thermodynamic parameters of the environment of the sequence such as the temperature, pressure, acidity of the solvent, ... The more negative the Gibbs energy, the more stable the duplex formed [Phan et al., 2009]. Several methods exist to calculate the Gibbs energy and the one presented in [SantaLucia et al., 1998] will be used here.

Table 2 shows the values of the free energy between the DNA sequence and the complement of the other DNA sequences for the seven species mentioned above. The lower the value of the free energy, the more stable is the structure. Note that the lowest values, which indicate the most similarity, are between the DNA sequence and its complement. Then after investigating the values in the table, note that the most similar pairs of species are: Human-Rabbit, Rat-Mouse, and Human-Lemur which agrees with the results obtained in Table 1.

6. INDIVIDUAL REPRESENTATION OF DNA SEQUENCE NUCLEOTIDES

In this paper, efficient methods are proposed to find the similarity between DNA sequences based on graphical representation and statistical analysis. In the graphical-based method, the similarity can be shown by drawing all DNA sequences under test on the Cartesian graph (X and Y axis) and visually inspecting the distance between the represented lines. The closer the lines to each other, the more similar the DNA sequences are.

In an update to this method, instead of representing all the DNA sequences graphically, in our proposal each nucleotide is represented separately for all sequences. As a result, four graphs will be generated (one for each nucleotide; A, C, G, T). The similarity between DNA sequences will be concluded from the similarity in these four graphs. The same dataset used previously will be used here for comparison and testing.

6.1 Graphical-based Individual Nucleotides Analysis Results

To test the similarity with respect to the A nucleotide alone graphically, A is represented by 1 while other nucleotides are represented as 0. The similarity in the "A" nucleotide between seven species is shown in Figure 5.

[FIGURE 5 OMITTED]

The Figure represents a stacked column type, which means that the length of the column represents how many species have an "A" nucleotide in that position. For example, a column that has a height equal to 7 means all species are similar in having an "A" value in that position, and a column with height 1 means that only one specie has an A in that position and all other species are different from it in terms of the A nucleotide.

The similarity with respect to nucleotides C, G and T can be produced similarly.

6.2 Count-based Individual Nucleotides Analysis Results

Another way of comparison between species is to check the count or percent of each nucleotide in each specie. Table 3 represents the count of nucleotides (A, C, G and T) in each DNA sequence for the seven species; this adds value to the graphical representation mentioned above and as shown below, it indicates a degree of similarity among species.

In order to check the similarity between species based on table 3, the absolute value of the difference in the count of each nucleotide is summed for each pair of species. Totals are shown in table 4. The least the number is means the difference between species is less which indicates more similarity. Results indicate that Lemur-Human, Rabbit-Lemur, Rabbit-Human, and Rat-Mouse are the most similar species, while Opossum is the most dissimilar to all other species which agrees with previous suggested methods.

7. A COMPARISON BETWEEN THE USED METHODS

Previous methods, graphical and statistical, used above agree with their results according to the similarity of 6 species and the dissimilarity of the Opossum with other species. In addition, there is a strong agreement in the results of the correlation coefficient based method with the free energy based method and the count-based method regarding the degree of similarity of the seven species

Graphical methods show visually the regions that are similar and the regions that are not similar, while statistical methods indicate the similarity/dissimilarity of the sequences as a whole and both gave compatible results.

8. CONCLUSION

DNA sequence similarity helps researchers in several areas of the field of bioinformatics. Four techniques for testing the similarity between DNA sequences are demonstrated. The similarity is checked based on a graphical representation, correlation coefficient, Gibbs free energy, and the count of individual nucleotides in each sequence.

The techniques were tested on the beta-Globin gene for seven species. Similarity/dissimilarity results agree between the four techniques. Furthermore, current results show a match with previous research results. Advantages of these techniques over other techniques in the literature are simplicity and efficiency.

REFERENCES

Boukerche, Azzedine; Correa, Jan M.; Melo, Alba Cristina; Jacobi, Ricardo P. (2010) "A Hardware Accelerator for the Fast Retrieval of DIALIGN Biological Sequence Alignments in Linear Space", IEEE Transactions on Computers, 59(6): 808-821.

BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi, Accessed Jan. 2012.

Vishnepolsky, Boris; Pirtskhalava, Malak (2009) "ALIGN MTX--An optimal pairwise textual sequence alignment program, adapted for using in sequence-structure alignment"; Computational Biology and Chemistry, 33(3):235-238. DIALIGN, http://dialign.gobics.de/, Accessed Jan. 2012.

Rose, Dominic; Hertel, Jana; Reiche, Kristin; Stadler, Peter F.; Hackermuller, Jorg (2008) "NcDNAlign: Plausible multiple alignments of non-protein-coding genomic Sequences"; Genomics, 92: 65-74.

Dong, Elizabeth; Smith, Jarrod; Heinze, Sten; Alexander, Nathan; Meiler, Jens (2008) "BCL:: Align--Sequence alignment and fold recognition with a custom scoring function online"; Gene; 422(1-2): 41-46.

Ying, Guo; Wang, Tian-ming (2008) "A new method to analyze the similarity of the DNA sequences"; Journal of Molecular Structure: THEOCHEM 853: 62-67;

Huang, Guohua; Liao, Bo; Li, Yongfan; Yu, Yougui (2009) "Similarity studies of DNA sequences based on a new 2D graphical representation", Biophysical Chemistry 143: 55-59.

Huang, Guohua; Zhou, Houqing; Li, Yongfan; Xu, Lixin (2011) "Alignment-free comparison of genome sequences by a new numerical characterization; Journal of Theoretical Biology" 281(1):107-112.

Feng, Jie; Hu, Yong; Wan, Ping; Zhang, Aibing; Zhao, Weizhong (2010) "New method for comparing DNA primary sequences based on a discrimination measure", Journal of Theoretical Biology 266:703-707

Zhang, Jihong; Wang, Renhong; Bai, Fenglan; Zheng, Junsheng (2011) "A Quasi-MQ EMD method for similarity analysis of DNA sequences", Applied Mathematics Letters 24(12): 2052-2058

Derouiche, Kheiredine; Nicole, Denis A. (2007) "Semantically Resolving Type Mismatches in Scientific Workflows", OTM 2007 Workshops, Springer-Verlag Berlin Heidelberg, Part I, LNCS 4805:125-135.

Liao, Bo; Tan, Mingshu; Ding, Kequan (2005) "A 4D representation of DNA sequences and its application". Chem.Phys. Lett., 402: 380-383.

Randic, Milan; Vracko, Marjan; Lers, Nella; Plavsic, Dejan (2003) "Novel 2-D graphical representation of DNA sequences and their numerical characterization", Chemical Physics Letters 368:1-6.

Needleman, Saul B.; Wunsch, Christian D. (1970) "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443-53.

Kalsi, Petri; Peltola, Hannu; Tarhio, Jorma (2008) "Comparison of Exact String Matching Algorithms for Biological Sequences". In: Proc. BIRD '08, 2nd International Conference on Bioinformatics Research and Development (ed.

M. Elloumi et al.). Communications in Computer and Information Science 13, Springer, 417-426.

Phan, Vinhthuy; Garzon, Max(2009) "On codeword design in metric DNA spaces", Natural Computing: an international journal 8(3):571588.

Qi, Dai; Liu, Xiaoqing; Wang, Tianming (2006) "A novel 2D graphical representation of DNA sequences and its application", Journal of Molecular Graphics and Modelling 25:340-344

Qi, Xiao-Qin; Wen, Jie; Qi, Zhao-Hui (2007) "New3D graphical representation of DNA sequence based on dual nucleotides", Journal of Theoretical Biology 249(4): 681-690.

SantaLucia, Jr. John(1998) "A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics", Proceedings of the National Academy of Sciences, 95:1460-1465.

Vinga, Susana; Almeida, Jonas (2003) "Alignmentfree sequence comparison--a review". Bioinformatics, 19(4) :513-523.

Guo, Xiaodong; Dai, Qi; Han, Bin; Zhu, Lei; Li, Lihua (2011)"Similarity Analysis of DNA Sequences Based on LZ Complexity and Dynamic Programming Algorithm" : Bioinformatics and Biomedical Engineering, (iCBBE) 2011 5th International Conference, 1-4.

Qi, Xingqin; Wu, Qin; Zhang, Yusen; Fuller, Eddie; Zhang, Cun-Quan (2011) "A Novel Model for DNA Sequence Similarity Analysis Based on Graph Theory", Evolutionary Bioinformatics:(7): 149-158.

Maryam Nuser, Izzat Alsmadi and Heba Al-Shaek Salem

Computer Information Systems department

Yarmouk University, Irbid, Jordan

mnuser@yu.edu.jo,ialsmadi@yu.edu.jo and joreehm@yahoo.com
Table 1 the correlation coefficient between the seven
DNA sequences.

          Gallus   Human   Lemur   Mouse   opossum   Rabbit   Rat

Gallus    1        0.65    0.60    0.62    0.18      0.6      0.62
Human              1       0.83    0.78    0.18      0.85     0.73
Lemur                      1       0.73    0.13      0.78     0.74
Mouse                              1       0.18      0.73     0.83
opossum                                    1         0.19     0.12
Rabbit                                               1        0.70
Rat                                                           1

Table 2. The free energy between the DNA sequence
and complement of other DNA sequences

          Gallus   Human   Lemur   Mouse   opossum   Rabbit   Rat

Gallus    -670     -155    -120    -139    -61       -166     -112
Human              -659    -402    -315    -54       -481     -307
Lemur                      -655    -289    -19       -384     -256
Mouse                              -659    -31       -321     -463
Opossum                                    -584      -28      -20
Rabbit                                               -662     -248
Rat                                                           -643

Table 3. The count of each nucleotide in DNA sequences for
the seven species

Species/Nucleotides    A     T     C     G

Human                  88    106   114   136
Lemur                  85    112   108   139
Rat                    99    109   111   125
Opossum                140   132   76    96
Gallus                 90    91    144   119
Mouse                  95    102   120   127
Rabbit                 94    110   104   136

Table 4. Total number of the count difference of each
nucleotide for the seven species.

          Gallus   Human   Lemur   Mouse   opossum    Rabbit   Rat

Gallus    0        64      82      48      182        80       66
Human              0       18      26      156        20       28
Lemur                      0       44      150        18       34
Mouse                              0       150        34       22
Opossum                                    0          136      128
Rabbit                                                0        24
Rat                                                            0
COPYRIGHT 2012 University of the West of Scotland, School of Computing
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2012 Gale, Cengage Learning. All rights reserved.

 
Article Details
Printer friendly Cite/link Email Feedback
Title Annotation:deoxyribonucleic acid
Author:Nuser, Maryam; Alsmadi, Izzat; Salem, Heba Al-Shaek
Publication:Computing and Information Systems
Geographic Code:1USA
Date:Oct 1, 2012
Words:3704
Previous Article:Exploration of cloud computing adoption for e-learning in higher education.
Next Article:Knowledge management and SMEs' competitiveness in Nigeria: a conceptual analysis.
Topics:

Terms of use | Privacy policy | Copyright © 2018 Farlex, Inc. | Feedback | For webmasters