Printer Friendly

DNA base data hiding algorithm.


In order to protect data through the unsecure networks like the Internet, using various types of data protection is necessary. One of the famous ways to protect data through the Internet is data hiding. Because of the increasing number of Internet users, utilizing data hiding or Steganographic techniques is inevitable. Eliminating the role of the intruder and authorizing the receiver, are eventual goals of these techniques.

Thus, the role of data hiding has become more eminent nowadays. Before employing biological properties of DNA sequences, usually embedding a secret message into the host images was the traditional way of data hiding [1, 2, 3, 4, and 5].

Unfortunately, this had some liabilities. The most important ones was the detection of the distortions of the image when the host image changed to some degrees. This spot was the best spot to start the wholly detection of the secret message through the image. With advent of biological aspects of DNA sequences to the computing areas, new data hiding methods have been proposed by researchers, based on DNA sequences [6, 7, 8, 9, and 10]. The key portion of their work is, utilizing biological characteristics of DNA sequences.

In order to convert binary data into amino acids as a DNA sequence, the base pairing rules must be used. Synthesizing nucleotides in real environment (biology) is done in constant rules:

* Purine Adenine (A) always pairs with the pyrimidine Thymine (T)

* Pyrimidine Cytosine (C) always pairs with the purine Guanine (G)

Always, those rules are done naturally because the opportunities to synthesize hydrogen bonds between A and T (two bonds), and also between C and G (three bonds) is different, basically (hydrogen bonds have been shown with dotted lines in Fig1). These concepts are named Watson-Crick base pairing rules when they discovered DNA's fundamental structure as a Noble prize [13].


In binary computing area, it is possible to change the natural rules by own decision. For example, in biology A is synthesized to T while we can assume A to C or A to G, and so on, as we prefer. Increasing the complexity of the algorithm is the main purpose of the changing the rules. In this paper, the authors consider A = 00, T = 01, C = 10, and G = 11 to convert binary message to DNA sequences.

A way to increase the complexity is complementary pair rule.

Complementary pair rule is a unique equivalent pair which is assigned to every nucleotides base pair. As an example, complementary rule is applied on strand in below [16]:

Complementary rule: ((AC) (CG) (GT) (TA))

DNA strand: AATGC

Applying complementary rule on DNA strand: CCATG

Increasing the complexity is the main purpose of using those rules, in this paper. It means that, finding the original message by intruder needs extra calculations because there are four basic alphabets therefore four likelihood of complementary rule for every DNA sequences. So, the final number of possible those rules are 4 x 3 x 2 x 1 = 24. On the other hand, the possibility to happen a correct guess is 1/24. Extra information and findings for some basic definitions can be obtained in molecular biology reference books [11, 12, and 13].


For the purpose of clarifying the algorithm in current paper, introducing some backgrounds of the knowledge is necessary [8, 11, 12, and 13]. The most important part of each DNA base data hiding algorithm is, manipulating four letters which has been called as nucleotides in biology. The letters are A, C, G, and T. Any composition from them will make a sequence. For instance, two DNA sequences have been appeared in [16]. They mentioned sequences from European Bioinformatics Institute (which is known as EBI Database) [15] for the purpose of extracting DNA sequences of Litmus and Balsaminaceae. So, Litmus with 154 nucleotides and Balsaminaceae with 2283 are shown in below, respectively:





Even though data can be shown by DNA nucleotides, but representing to binary was emerged in 1999 by Rauhe and et al. [17]. They represented numbers by using binary DNA sequences in figure 1. The creativity of their work was in how they could separate binary sequences from each other.


All binary sequences are illustrated in form of s{0|1}e. An arbitrary DNA bits has been concatenated between two terminations, s and e, as start and end, respectively. Annealing and ligation are the way of concatenating DNA bits between s and e. Both of terminations and DNA bits have been made by annealing complementary oligonucleotides. In order to concatenate on both sides, having sticky ends (A, [bar.A], X, [bar.Y]) are necessary. The sticky end A([bar.A]) acts as a variable for correct concatenation of bits and terminations. For subsequent cloning, the sticky ends X and [bar.Y] should be used.

In 2000 Leier and et al. [8] brought a robust technique by utilizing a special key strand. They called it, primer. Primer has key role to decrypt a coded strand. Using a public DNA strand was utilized as a reference sequence in their DNA based encryption technique. In this scheme, the receiver must also be informed about the reference sequence. Namely, the receiver will receive a selected primer and an encrypted strand. The intruder is not able to decrypt the binary data without knowing about both of primer and reference strand, certainly. A primer is a complementary subset from a sort of DNA strand. Normally, the primer is called a short substring. For example, assume S is a DNA strand:

S= "ATGCTTAGTTCCATCGGAGACTAATGGCCTA" and "two primers ATCAA and GATTAC". So "ATCAA and GATTAC are complementary substring of TAGTT and CTAATG", correspondingly. Definitely, a complementary rule is needed to handle the manipulations, correctly. For this reason, they defined a complementary rule which is A-T, T-A, C-G, and finally G-C. Accomplishing those states in biology is done by a series of chemical mechanisms to combine primers with a DNA strand. Indicating the right position of the primers also is shown by a fluorescent chemical substance. When hybridization occurs among primers and substrings of the reference DNA reference sequence, it will become bright, obviously. Except bright places, remaining are dim sections. The exact message of bright portion is binary data '1' and naturally, a dim portion is referred to the binary data '0'. So, according to the above, the proper output of the hybridization is:


Therefore, the final secret message in form of binary is "01010". If the sender prefers to complicate that proposed technique, it can send one part of primers to the legitimate receiver. For the purpose of recovering primers, the receiver must apply PCR. PCR is another chemical scheme which has been known as Polymerase Chain Reaction, scientifically. It can recover primers correctly [16].

Considering a certain substring from DNA sequence as a character was proposed by Peterson in 2001 [9]. By substituting three successive nucleotides as a character, he could hide data in DNA sequence, appropriately. For instance, 'A'=GGC, 'B'=ATG, and so on. So, for this scheme we face to 64 symbols that can be possibly encrypted.

The main liability of this method is the frequency for both of 'E' and 'I' in an English message. Because of those holes, an intruder can apply a cryptanalysis technique base on frequencies of the most repetitive letters in English and subsequently extracts the secret message, simply.

The next scheme in 2002 [18] needs some backgrounds before explaining. An arrangement of nucleotides determines a protein. The responsibility of the proteins is almost every activity in cells. Transcription is the process by which RNA is created, an intermediary copy of the instruction contained in DNA. Naturally, RNA has four bases. They are adenine (A), cytosine (C), uracil (U) and guanine (G). The RNA copy is related to mRNA (abstract of messenger RNA) by throwing away the coming among sequences of RNA. On mRNA, each codon has three nucleotides for the purpose of representing which amino acid must be assigned in the next position. All different amino acids have been listed in Table 1. They are: Phe, Leu, Ile, Val, Ser, Pro, Thr, Ala, Tyr, His, Gln, Asn, Lys, Asp, Glu, Cys, Trp, Arg, Met and Gly. Each codon binds a set of three nucleotides into an anticodon which is called tRNA molecule (transfer RNA). The responsibility of the tRNA is, translating nucleic acid into protein. Totally, there are forty separate tRNA molecules which each one has a binding for one of amino acids, as represented in Table 1.

Apposite tRNA is bound to codon on the mRNA. When the STOP codon was seen, the translation is completed, and then a protein is released. Because of the codon redundancy, Shimanovsky et al. [18] utilized from that feature to hide data in mRNA. In general, each mRNA codon has been composed of three nucleotides. The likely nucleotides are: 'U', 'C', 'A', and 'G'. Therefore, there are many different probable combinations to form each mRNA codon, whereas in Table 1, there are only twenty different amino acids. It is because of some codons might be mapped to the same amino acids. For instance, the codons 'UUA', 'CUU', 'CUA' and 'UUG' are mapped to the same amino acid Leu.

Feasibility of embedding information in the mRNA codon is because of the redundancy. For example, if the codon must be encrypted with 'UUA', while the secret message if four, it is possible to use the codon 'UUG' in order to replace the original. It is feasible because 'UUG' is the fourth codon of all the codons whose mapping amino acid is Leu. Even though previous replacement does not have any effect on the results of transcription, but it modifies the nucleotides of the original sequence, that might probably trigger unidentified consequences.

Although the previous beginning schemes utilize distinct biological features for hiding data in DNA sequences, they are not cost-effective and efficient to employ. In this paper, we apply a scheme to hide data in DNA strands based on a software point of view to improve them.


This section is divided into two phases. The first one is, embedding and the second one is, extracting the original message. At the end of the method, a snapshot of the project will appear.

3.1 Phase1: Embedding Secret Message

In order to explain embedding phase, separating the phases into some successive and vivid sub-phases, is the best way of proposing current method. In below, sub-phases have been shown, respectively.


Obviously, there is an original message M which the sender decides to send via a network to another client who is called receiver. So, there are three sub-phases to provide the final form of M which is M"' and send it to receiver. The first sub-phase is, converting by DNA base pairing rules. The product is M'. M' contains nucleotides sequences. By applying DNA base pairing rules, the message can convert from binary to DNA sequence. Not only DNA base pairing helps to encrypt the message from binary to DNA sequence but also it is applied to decrypt the secret message to original one, truly.

The next (second) sub-phase is, applying complementary rules. Increasing the complexity is the real and exact purpose of this step. By applying the complementary rules, the new form of the M' which is M" emerges. Now, M" is appeared. As mentioned before, both of sender and receiver have a DNA reference sequence from a large number of possibilities base on EBI [14] or NCBI [15] database. It means that, they have selected the same DNA reference sequence, exactly. The exact role of the third sub-phase is, extracting the index of each couple nucleotides in DNA reference sequence, numerically. When all the indexes have been extracted, M'" has been made, properly. M'" is precisely the secret message with some changes through the embedding phase. Now, sender can send the message (M'") to receiver.

Clarification of the current phase is continued by demonstrating an example, step by step. In this example, assume original message M=100111000011 should be sent to the receiver.

DNA Reference Sequence:

[AT.sub.1][CG.sub.2][AA.sub.3][TT.sub.4][CG.sub.5][CG.sub.6][CT.sub.7][GA.sub.8][GT.sub.9] [CA.sub.10][CA.sub.11][AT.sub.12][TC.sub.13][GC.sub.14][GC.sub.15][TG.sub.16][AG.sub.17] [TG.sub.18][AA.sub.19][CC.sub.20]


[Sub-phase.sub.1](A = 00, T = 01, C = 10, G = 11): M' = CTGAAG

[Sub-phase.sub.2]((AC)(CG)(GT)(TA)): M" = GATCCT

[Sub-phase.sub.3](Indexes)]: M"' = 8137

Now, embedding phase is finally completed. Then, sender sends 8,13,7 to the receiver. In the next section, the receiver will apply the extracting phase for extracting the original message by using three consecutive phases.

3.2 Phase2: Extracting Original Message

Now, receiver takes the secret message in form of some numbers. For the purpose of extracting the original message from DNA reference sequence, phase two with its sub-phases will extract the original message, correctly.


So, the first sub-phase manipulates the M'". Because of the nature of the secret message which is some sorts of numbers (exact positions (indexes) of the original message on DNA reference sequence), extracting the message starts by finding the indexes on DNA reference sequence one by one according to the numbers which sender has sent in form of the current secret message. M" is the exact product of the first sub-phase. Consequently, the second sub-phase applies complementary rules on M" in order to extracting M', correctly. The importance of the M' is the form of it. M' is the last form of message, based on DNA nucleotides. Converting the M' to the M is the third sub-phase. Transforming from DNA nucleotides to the binary is the responsibility of the last sub-phase. Now, the receiver has truly extracted the original message M.

Those steps are demonstrated through the example in below:

DNA Reference Sequence:

[AT.sub.1][CG.sub.2][AA.sub.3][TT.sub.4][CG.sub.5][CG.sub.6][CT.sub.7][GA.sub.8][GT.sub.9] [CA.sub.10][CA.sub.11][AT.sub.12][TC.sub.13][GC.sub.14][GC.sub.15][TG.sub.16][AG.sub.17] [TG.sub.18][AA.sub.19][CC.sub.20]


[Sub-phase.sub.1](Indexes): M" = GATCCT

[Sub-phase.sub.2]((AC)(CG)(GT)(TA)): M' = CTGAAG

[Sub-phase.sub.3](A = 00, T = 01, C = 10, G = 11): M=100111000011

So, the receiver extracted the original message, accurately by using a simple algorithm. In the next section, security and liabilities of the algorithm will inspect, briefly.

3.3 Presenting Results

The final picture of the design phases of the prototype is presenting results. The GUI contains two separated parts. The first (left) is, embedding secret message and second (right) is, extracting secret message. Through the each part, there are some sub-phases to complete the intention of the parts. In the left side, in order to complete the embedment, first, original message decoded to the binary. The second step is, converting binary message to the DNA strands by rules: A = 00, T = 01, C = 10 and G = 11.

The results of this step can be seen in first output textbox. As we mentioned, we use complementary rules to increase the complexity by DNA basic synthetic rules which it says only A-C, C-G, G-T and T-A can have a chance to synthesize together. Then, DNA reference sequence is seen which both of Alice and Bob have shared between themselves. The usage of DNA reference sequence is, finding the index of each couple of nucleotide. At the end of the embedment phase, the secret message can be seen in group box of the secret message. Indexes are represented by numbers. Then Alice clicks the button "send to Bob", for sending the secret message to Bob. Now, extraction phase is started.

In side of extraction, Bob must extract the secret message. So, Bob clicks the extraction button and the software starts to recover the secret message. First, indexes are found through the DNA reference sequence (which had shared between Alice and Bob) and corresponding to each index, its couple nucleotide is extracted. The output is a DNA sequence. By applying complementary rules, the previous DNA sequence is converted to the right DNA sequence after binary coding in embedment phase. After that, for each nucleotide, the corresponding binary symbol put to the binary original message textbox. As we know, every eight bits represents a character. So, by separating all the eight bits, the final characters will appear correctly.


In terms of security, each intruder must be aware from the following information, correctly. Without this fundamental information, possibility of extracting original message is near to zero, scientifically. They are:

* DNA reference sequence: there are 163 million DNA reference sequence on EBI database. Therefore, the likelihood of making a doing well conjecture by attacker is 1/24.

* Binary coding rule: as mentioned, the sender is free to select any equivalent binary form for every nucleotide. It means that, A can be '00', '01', '10', or '11'; C can be '00', and so on. In other words, all the binary coding rules are 4 x 3 x 2 x 1 = 24. So, the likelihood of making correct guess by attacker is 1/24.

* Complementary pairing rule: like binary coding rule, there is 4 x 3 x 2 x 1 = 24 complementary alphabet among basic nucleotides. Therefore, the possibility of making successful attack is 1/24.

Eventually, the final probability of making a correct and successful guess by attacker is [1/[163 x [10.sup.6]]] x [1/24] x [1/24].


At the end of the project, it is sensible which data hiding with DNA sequences is a potential and promising way of hiding secret message within DNA sequences. Different type of environment to manipulate the data was the best part of the project. It means that, regardless the basic concepts of Steganography (robustness, capacity, visibility, and performance), utilizing biology concepts could bring new properties to solve ancient difficulties of algorithm such as complexity besides performance. At the beginning of the project, the most important question which reader thought about it was how it is possible to convert binary data to biology and genetics area? As we showed, DNA coding technology helped us to transform binary bits to DNA nucleotides well.

Considering DNA characteristics brings new ideas in data hiding. DNA sequences are potential to implement new data hiding techniques or even transforming previous schemes to new one. In this paper, a reference DNA sequence has been shared between sender and receiver. Not only this DNA reference sequence can be retrieved from EBI [15] or NCBI [14] databases but it can also be simply selected from any database. Therefore, by considering any sort of database, there are 163 million targets to select it. Virtually, guessing the correct DNA sequence by attacker is unachievable.

The crucial feature of the DNA sequences is visibility. Finding secret message in a DNA sequence is difficult because the visibility of the sequences is very low. Therefore, attacker cannot find out whether this sequence is a fake or not. In comparison with previous techniques, such as in images, not only implementing this method is not difficult but also it is formidable to detect, as well.



[1.] C.C. Chang, C.C. Lin, C.S. Tseng, W.L. Tai, Reversible hiding in DCT-based compressed images, Information Sciences 177 (2007)

[2.] C.C. Chang, W.C. Wu, Y.H. Chen, Joint coding and embedding techniques for multimedia images, Information Sciences 178 (2008)

[3.] C.H. Huang, J.L. Wu, Fidelity-guaranteed robustness enhancement of blind-detection watermarking schemes, Information Sciences 179 (2009)

[4.] H.H. Tsai, D.W. Sun, Color image watermark extraction based on support vector machines, Information Sciences 177 (2007)

[5.] H.W. Tseng, C.P. Hsieh, Prediction-based reversible data hiding, Information Sciences 179 (2009)

[6.] C.C. Chang, T.C. Lu, Y.F. Chang, R.C.T. Lee, Reversible data hiding schemes for deoxyribonucleic acid (DNA) medium, International Journal of Innovative Computing, Information and Control 3 (2007)

[7.] C.T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots, Nature 399 (1999)

[8.] A. Leier, C. Richter, W. Banzhaf, H. Rauhe, Cryptography with DNA binary strands, Bio Systems 57 (2000)

[9.] I. Peterson, Hiding in DNA, Muse (2001)

[10.] B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002)

[11.] B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, J.D. Watson, Molecular Biology of the Cell, Garland Publishing, New York & London, (1994)

[12.] A.L. Lehninger, D.L. Nelson, M.M. Cox, Principles of Biochemistry, Worth, New York, (2000)

[13.] J. Watson, N. Hopkins, J. Roberts, J. Steitz, Molecular Biology of the Gene, fourth ed., Benjamin Cummings, Menlo Park, CA, (1987)

[14.] National Center for Biotechnology Information,

[15.] European Bioinformatics Institute,

[16.] H.J. Shiu, K.L. Ng, J.F. Fang, R.C.T. Lee, C.H. Huang, Information Sciences, Volume 180, Issue 11, 1 June(2010)

[17.] Rauhe, H., Vopper, G., Banzhaf, W., Howard, J.C., (1999)

[18.] B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002)

Mohammad Reza Abbasy, Pourya Nikfard, Ali Ordi, and Mohammad Reza Najaf Torkaman

Advanced Informatics School (AIS), International Campus, Universiti Teknologi Malaysia (UTM), Kuala Lumpur.

{ramohammad2, npourya2, oali2, rntmohammad2}
Table 1. The mapping of codon to amino acid [18].

UUU [right arrow] Phe
UUA [right arrow] Leu
CUU [right arrow] Leu
CUA [right arrow] Leu
AUU [right arrow] Ile
AUA [right arrow] Ile
GUU [right arrow] Val
GUA [right arrow] Val
UUC [right arrow] Phe
UUG [right arrow] Leu
CUC [right arrow] Leu
CUG [right arrow] Leu
AUG [right arrow] Ile
AUG [right arrow] Start
GUC [right arrow] Val
GUG [right arrow] Val
UCU [right arrow] Ser
UCA [right arrow] Ser
CCU [right arrow] Pro
CCA [right arrow] Pro
ACU [right arrow] Thr
ACA [right arrow] Thr
GCU [right arrow] Ala
GCA [right arrow] Ala
UCC [right arrow] Ser
UCG [right arrow] Ser
CCC [right arrow] Pro
CCG [right arrow] Pro
ACC [right arrow] Thr
ACG [right arrow] Thr
GCC [right arrow] Ala
GCG [right arrow] Ala
UAU [right arrow] Tyr
UAA [right arrow] Stop
CAU [right arrow] His
CAA [right arrow] Gln
AAU [right arrow] Asn
AAA [right arrow] Lys
GAU [right arrow] Asp
GAA [right arrow] Glu
UAC [right arrow] Tyr
UAG [right arrow] Stop
CAC [right arrow] His
CAG [right arrow] Gln
AAC [right arrow] Asn
AAG [right arrow] Lys
GAC [right arrow] Asp
GAG [right arrow] Glu
UGU [right arrow] Cys
UGA [right arrow] Stop
CGU [right arrow] Arg
CGA [right arrow] Arg
AGU [right arrow] Ser
AGA [right arrow] Arg
GGU [right arrow] Gly
GGA [right arrow] Gly
UGC [right arrow] Cys
UGG [right arrow] Trp
CGC [right arrow] Arg
CGG [right arrow] Arg
AGC [right arrow] Ser
AGG [right arrow] Arg
GGC [right arrow] Gly
GGG [right arrow] Gly
COPYRIGHT 2012 The Society of Digital Information and Wireless Communications
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2012 Gale, Cengage Learning. All rights reserved.

Article Details
Printer friendly Cite/link Email Feedback
Author:Abbasy, Mohammad Reza; Nikfard, Pourya; Ordi, Ali; Torkaman, Mohammad Reza Najaf
Publication:International Journal of New Computer Architectures and Their Applications
Date:Jan 1, 2012
Previous Article:Reality of aligning IT with business in Czech organizations.
Next Article:Method for generating DC-DC converters with required characteristics.

Terms of use | Privacy policy | Copyright © 2020 Farlex, Inc. | Feedback | For webmasters