Analysis of an immune algorithm for protein structure prediction.
protein folding Noun 1. protein folding - the process whereby a protein molecule assumes its intricate three-dimensional shape; "understanding protein folding is the next step in deciphering the genetic code"
folding simulation is to determine the native state of a protein from its amino acid amino acid (əmē`nō), any one of a class of simple organic compounds containing carbon, hydrogen, oxygen, nitrogen, and in certain cases sulfur. These compounds are the building blocks of proteins. sequence. In this paper we describe the development and application of an Immune Algorithm (IA) to find the lowest energy conformations for the 2D (square) HP lattice bead protein model. Here we introduce a modified chain growth constructor to produce the initial population, where intermediate infeasible structures are recorded, thereby reducing the risk of attempting to perform wasteful point mutations during the mutation phase. We also investigate various approaches for population diversity tracking, ultimately allowing a greater understanding of the progress of the optimization.
Keywords: HP lattice bead model, immune algorithm, population diversity tracking, protein modelling
Povzetek: V clanku je opisan razvoj in izvedba imunskega algoritma (IA) za iskanje najnizje energijske strukture za 2D (kvadratne) HP mrezno nanizanega modela proteina.
Predicting the 3-dimensional secondary and tertiary structure tertiary structure
The three-dimensional structure of a protein or nucleic acid.
The three-dimensional structure of a protein or nucleic acid. of a protein molecule Noun 1. protein molecule - any large molecule containing chains of amino acids linked by peptide bonds
molecule - (physics and chemistry) the simplest structural unit of an element or compound from its (primary structure) amino acid sequence alone is an important problem in chemical biology Chemical biology is a scientific discipline spanning the fields of chemistry and biology that involves the application of chemical techniques and tools, often compounds produced through synthetic chemistry, to the study and manipulation of biological systems. . Under certain physiological conditions, the amino acid chain will reliably fold into a specific native state (biologically active conformation con·for·ma·tion
One of the spatial arrangements of atoms in a molecule that can come about through free rotation of the atoms about a single chemical bond. ). The protein folding problem is the search for this native state for a given sequence of amino acid residues. The reliability of protein folding is said to be dominated by the presence of a "folding funnel The folding funnel hypothesis is a specific version of the energy landscape theory of protein folding, which assumes that a protein's native state corresponds to its free energy minimum under the solution conditions usually encountered in cells. " on the folding energy landscape since systematic or random searching is clearly infeasible for large numbers of amino acids . Therefore, discovering the nature of the folding energy landscape is necessary to develop a better understanding of the folding dynamics .
Many protein models have been developed, ranging from simple, minimalist models such as the HP lattice bead model , to more complicated and computationally expensive A computationally expensive algorithm is one that, for a given input size, requires a relatively large number of steps to complete; in other words, one with high computational complexity. models such as off-lattice interpretations. The most common lattice structures are 2D square and 3D cubic. More computationally intense models include the dynamical lattice and all-atom models, both introducing more complicated fitness functions.
In this work, the standard HP lattice bead model has been incorporated into an immune algorithm. Despite the minimalistic approach employed by this model, it has been shown to belong to the "NP-Hard" set of problems . Monte Carlo Monte Carlo (môNtā` kärlō`), town (1982 pop. 13,150), principality of Monaco, on the Mediterranean Sea and the French Riviera. , chain growth algorithms , simulated annealing simulated annealing - A technique which can be applied to any minimisation or learning process based on successive update steps (either random or deterministic) where the update step length is proportional to an arbitrarily set parameter which can play the role of a temperature. , genetic algorithms [5, 8, 9], ant colony optimization The ant colony optimization algorithm (ACO), introduced by Marco Dorigo in his PhD thesis, is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs.  and more recently immune algorithms  have been developed by many researchers as heuristic A method of problem solving using exploration and trial and error methods. Heuristic program design provides a framework for solving the problem in contrast with a fixed set of rules (algorithmic) that cannot vary.
1. and approximate solutions for this and other computationally hard problems.
2.1 The HP lattice bead model
In this work, the standard HP lattice bead model is embedded in a 2-dimensional square lattice, restricting bond angles to only a few discrete values . Interactions are only counted between topological neighbours, that is between beads (representing amino acids) that lie adjacent to each other on the lattice, but which are not sequence neighbours . The energies corresponding to the possible topological interactions are as follows:
[[epsilon].sub.HH] = -1.0 [[epsilon].sub.HP] = 0.0 [[epsilon].sub.PP] = 0.0 (1)
By summing over these local interactions, the energy of the model protein can be obtained:
E = [summation over (i<j)] [[epsilon].sub.ij][[DELTA].sub.ij] (2)
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII ASCII or American Standard Code for Information Interchange, a set of codes used to represent letters, numbers, a few symbols, and control characters. Originally designed for teletype operations, it has found wide application in computers. ].
The HP lattice model recognises only the hydrophobic hydrophobic /hy·dro·pho·bic/ (-fo´bik)
1. pertaining to hydrophobia (rabies).
2. not readily absorbing water, or being adversely affected by water.
3. interaction as the driving force in protein folding, with many native structures protecting the hydrophobic core with polar residues, resulting in a compact arrangement . This idea reflects the repulsive nature of the interactions between the hydrophobic residues and the surrounding water molecules .
2.2 The coordinate system
A previous study has illustrated how a local coordinate system offers better performance than a global one for studying protein folding . In this work, a local coordinate system is used to define the folding conformation of the model proteins, that is the position of bead j is defined relative to beads (j - 1) and (j - 2). As the energy is identical for rotationally related structures, the bond between the first two beads lies along the z-axis, with these beads having coordinates (0,0) and (1,0) respectively. As a result, the search spaced is halved. The bond joining the [(j - 1)th and jth beads can be left, right or straight ahead relative to the bond joining the [(j - 2)th and [(j - 1)th bead, corresponding to an integer representation of 0, 1 and 2 respectively. The protein conformation is therefore expressed as a conformation vector, containing a list of 0's, 1's and 2's.
For this study, a set of well investigated protein benchmark sequences have been considered: the tortilla HP benchmark sequences . They range in length from eighteen to fifty beads and are listed in Table 1. The table also includes the energy, [E.sup.*], of the putative global minimum (or conformations, since all of these structures have degenerate global minima) for each sequence.
3 The immune algorithm
An immune algorithm  is inspired by the clonal selection principle employed by the human immune system immune system
Cells, cell products, organs, and structures of the body involved in the detection and destruction of foreign invaders, such as bacteria, viruses, and cancer cells. Immunity is based on the system's ability to launch a defense against such invaders. . In this process, when an antigen enters the body, B and T lymphocytes are able to clone upon recognition and bind to it . Many clones are produced in response and undergo many rounds of somatic somatic /so·mat·ic/ (so-mat´ik)
1. pertaining to or characteristic of the soma or body.
2. pertaining to the body wall in contrast to the viscera.
adj. hypermutation. The higher the fitness of a B cell to the available antigens, the greater the chance of cloning. Cells have a certain life expectancy Life Expectancy
1. The age until which a person is expected to live.
2. The remaining number of years an individual is expected to live, based on IRS issued life expectancy tables. , allowing a higher specific responsiveness for future antigenic attack .
The IA presented here includes the aging, cloning and selection operators used in a previous study by Cutello et al. , with modified constructor and mutation operators. The constructor employs a backtracking (algorithm) backtracking - A scheme for solving a series of sub-problems each of which may have multiple possible solutions and where the solution chosen for one sub-problem may affect the possible solutions of later sub-problems. algorithm that records some of the possible mutations by testing bead placement during chain growth. These possibilities are exploited and updated during the mutation process, preventing an infeasible conformation from occurring based on the preceding self-avoiding structure for a particular point in the model protein chain. In retaining this information, infeasible mutations are not explored, allowing a greater number of constructive mutations to be investigated. Figure 1 illustrates the stages involved in placing two consecutive beads during the chain growth phase. Before committing a bead to the lattice, all possible directions are explored, 1(a), and from the valid options available, a random choice is made, 1(b). Again all possible choices are investigated, marking any infeasible options (note that choosing left will not result in a self avoiding conformation), 1(c), and a valid choice is selected from the remaining options, 1(d). Any remaining valid choices are left unmarked for use in the first mutation phase after the initial chain growth. Once a valid mutation has been made, the entire structure is reconstructed as before marking any infeasible directions as a result of the new conformation vector.
In the basic IA set up, there are a maximum of 10,000,000 fitness evaluations, with the maximum number of generations set to 500,000. In order to estimate the optimal combination of parameters, we adopted the procedure used by Cutello et al., whereby the maximum B-Cell age and the number of clones were each varied from 1 to 10. Population sizes examined were 10, 25, 50, 100 and 200. This provided a combination of 500 different parameter sets for each sequence, which was applied to all the benchmark sequences up to 25 beads in length. All fitness evaluations for the best success rates were collated and graded for overall performance. As a result of this preliminary testing, the results presented below were obtained using a maximum B-Cell age of 4, 3 clones and a population size of 10. All results quoted are averaged over 30 independent runs.
[FIGURE 1 OMITTED]
4.1 Algorithm comparison
With CPU time being hardware dependent, the number of fitness evaluations (together with the percentage success rate) have been used to assess the efficiency of the algorithm, as shown in Table 2 for the benchmark sequences.
It is apparent from Table 2 that, although the use of memory B-Cells  hinders the discovery of global minima for some of the smaller sequences, it enhances the search for the larger, more difficult to find sequences. The memory ability allows mid to high fitness conformations to remain in the population for a longer number of generations. For larger sequences, this allows a more detailed exploration in certain areas of the potential energy surface, permitting the memory B-Cells to converge towards the global solution much sooner. In contrast, for smaller sequences the mid to high fitness range is much smaller, thereby preventing a rapid exploration of the potential energy surface by retaining unfavourable segments of local structure for a larger number of generations. Generally, the use of memory B-Cells allows a more diverse inspection of the potential energy surface, due to a greater number of the degenerate conformations being found. This is achieved as favourable fragments of local structure are not rapidly disposed of during the retirement process, hindering efficiency as a consequence.
The algorithm presented here shows promising results, being comparable to the work of Cutello et al. . While our success rates for the larger sequences (e.g. HP-48) are a little lower, in some cases our number of fitness evaluations show an improvement.
4.2 Analysis of global minima
The compact structural arrangement present in all global minima (GMs) is apparent from the example GMs shown in Fig. 2. With the driving force being the hydrophobic topological contact, it can be seen that compact hydrophobic cores give rise to high fitness conformations. Inspection of the HP-48 global minimum (i) allows us to understand the poor success rate for this sequence. The 5 x 5 hydrophobic core presents a problem to the IA (or other optimization algorithms ) in achieving convergence, as a single misplaced hydrophobic bead will result in only a metastable met·a·sta·ble
Of, relating to, or being an unstable and transient but relatively long-lived state of a chemical or physical system, as of a supersaturated solution or an excited atom. conformation. The problem does not exist for the HP-50 sequence (j), due to the presence of two small hydrophobic cores coupled by a chain of hydrophobic beads, which explains the higher success rate and fewer average structure evaluations necessary for HP-50, compared with HP-48 and (when using memory B-cells) even the much shorter HP-36 sequence . The work of Cutello et al. supports this idea , as similar magnitudes of the number of fitness evaluations for these problematic sequences can be seen, with a much lower success rate for HP-48 than for any other instance.
4.3 Tracking population diversity
The much larger populations required to ensure population diversity can be problematic for both GAs and IAs. In this section, a single run, with population size 200 for sequence HP-20a has been analyzed. The global minimum was found in generation 28, at which point the algorithm was terminated due to meeting the search criteria. In order to help us understand the progress of the optimization and ultimately to improve the methodology, monitoring population diversity and the progress of the algorithm is beneficial.
Figure 3(a) assigns a colour to each of the three possible direction decisions (corresponding to alleles in a genetic sense) made when placing each successive bead. It can be seen that initial structure generation, using the IA's constructor, is indeed statistically uniform, showing the frequency of left (grey), right (light blue) and straight ahead (dark blue) choices at each locus of the model protein chain to be very similar. In contrast, Fig. 3(b) illustrates how this statistical distribution is skewed in the final population (generation 28), in that the IA has concentrated its search to a much narrower region of the potential energy surface. It should also be noted that position 6 in the chain has a very low frequency of the straight ahead choice (dark blue), because (for most population members) previous direction decisions preclude (for structural and/or energetic reasons) this choice from being made at this chain position.
[FIGURE 2 OMITTED]
Figure 4 shows a graphical representation of the initial and final populations of the calculation. By plotting the conformation vector for each population member, the population can be quickly compared for diversity. Population members are ordered by descending fitness and the colour scheme is similar to the allele frequency distribution shown in Fig. 3, but with white replacing grey for the left choice. It is clear that initially the population has high diversity (in agreement with the allele frequency plot shown above), with the algorithm preserving favourable regions of local structure (corresponding to schemata in a GA sense) as the calculation converges. More detailed analysis of the final population shows that there are often correlations (or anti-correlations) between directions at specific loci loci
[L.] plural of locus.
loci Plural of locus, see there , with certain combinations giving rise to favourable energies or infeasible structures, respectively.
For simple protein models such as the HP lattice bead model, the Hamming distance ([d.sub.H], which is the number of bit differences between two conformation vectors) can be used as a simple measure of similarity between structures in the population. Figure 5(a) plots the frequency of the Hamming distances between all pairs of structures in the population as a function of generation. (As the population size is 200, there are a total of 19,900 pair Hamming distances). It can be seen how the diversity of the population changes as the calculation approaches the global minimum (which is found in generation 28). Combining this with a plot of the best, worst and average fitnesses in the population, as a function of generation (Fig. 5(b)), it should be noted that structural diversity shows a more uniform spread (beginning around generation 20) as favourable segments of local structure begin to dominate the population, with the search focussing on a much more concentrated area of the potential energy surface. It is also evident that the population diversity drastically decreases during the final stages of the calculation, not just in the final generation. This confirms that the calculation has not discovered the global minimum by chance, but a directed search strategy has been employed.
[FIGURE 3 OMITTED]
[FIGURE 4 OMITTED]
[FIGURE 5 OMITTED]
Although implementation of a modified constructor for use in the mutation phase of the IA has not always given greater success rates (especially for more challenging sequences), it has allowed for a more efficient search to be performed in some cases, showing a descrease in the number of fitness evaluation performed. The use of population diversity tracking allows a greater understanding of the algorithm's ability to explore areas of the potential energy surface of these simple model proteins. Areas of favourable local structure along the chain can be assessed, illustrating the important allele allele (əlēl`): see genetics.
Any one of two or more alternative forms of a gene that may occur alternatively at a given site on a chromosome. combinations that give rise to the determination of global minima. We are currently applying these approaches to more realistic protein models.
AJB AJB America’s Job Bank
AJB African Journal of Biotechnology
AJB Amt für Jugend und Berufsberatung (German: office for youth and vocational guidance)
AJB American Journal of Botany
AJB Australian Journal of Botany thanks Dr Benjamin Curley for programming assistance and the University of Birmingham Due to Birmingham's role as a centre of light engineering, the university traditionally had a special focus on science, engineering and commerce, as well as coal mining. It now teaches a full range of academic subjects and has five-star rating for teaching and research in several for PhD funding. Calculations were performed on the University of Birmingham's BlueBEAR 1500+ processor cluster, funded by the EPSRC EPSRC Engineering & Physical Sciences Research Council (UK) (under SRIF SRIF
somatotropin release-inhibiting factor 3) and the University of Birmingham.
Received: June 23, 2008
 K.M. Merz (ed.). The Protein Folding problem and Structure Prediction. Birkhauser, 1994.
 N. Krasnogor, W.E. Hart, J. Smith, and D. Pelta Pel´ta
n. 1. (Antiq.) A small shield, especially one of an approximately elliptic form, or crescent-shaped.
2. (Bot.) A flat apothecium having no rim. . Protein Structure Prediction Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. Its aim is the prediction of the three-dimensional structure of proteins from their amino acid sequences, sometimes including additional relevant information such as with Evolutionary Algorithms. In Proc. 1999 International Genetic and Evolutionary Computation Conference (GECC GECC General Education Core Curriculum
GECC General Electric Credit Corporation
GECC Group Enabled Cluster Compiler
GECC Geelong Ethnic Communities Council
GECC Glen Ellyn Children's Chorus (Glen Ellyn, Illinois) 099), Orlando, Florida, USA, 1999, pages 1569-1601.
 G.A. Cox and R.L. Johnston. Analyzing Energy Landscapes for Folding Model Proteins. J. Chem. Phys., 124(20):204714, 2006.
 T. Beutler and K. Dill. A Fast Conformational Search Strategy for Finding Low Energy Structures of Model Proteins. Protein Sci. 5(10):2037-2043, 1996.
 R. Unger and J. Moult. Genetic Algorithms for Protein Folding Simulations. J. Mol. Biol., 231:75-81, 1993.
 R. Ramakrishnan, B. Ramachandran, and J.F. Pekny. A Dynamic Monte Carlo Algorithm for Exploration of Dense Conformational Spaces in Heteropolymers. J. Chem. Phys., 106(6):2418-2425, 1997.
 S. Kirkpatrick and C.D. Gellat. Optimization by Simulated Annealing. Science, 220(4598):671-680, 1983.
 W. Liu and B. Schmidt. Mapping of Genetic Algorithms for Protein Folding onto Computational Grids. IEICE IEICE Institute of Electronics, Information and Communication Engineers (Japan)
IEICE Institute of Electronics Information and Communication Engineers T. Inf. Syst., 89(2):589-596, 2006.
 J. Song, J. Cheng, T. Zeng, and J. Mau. A Novel Genetic Algorithm for HP Model Protein Folding. In Proc. Sixth International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT'05), Dalian, China, 2005, pages 935-937.
 A. Shmygelska and H.H. Hoos. An Improved Ant Colony Optimization Algorithm for the 2D HP Protein Folding Problem. Lect. Notes Comput. Sci., 2671:400-412, 2003.
 V. Cutello, G. Niscosia, M. Pavone, and J. Timmis. An Immune Algorithm for Protein Structure Prediction on Lattice Models. IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields. T. Evol. Comput., 11(1): 101-117, 2007.
 W.E. Hart, S. Istrail. HP Benchmarks. www.cs. sandia.gov/tech_reports/compbio/ tortilla-hp-benchmarks.html.
 L.N. de Castro and J. Timmis. Artifical Immune Systems and Their Applications. Springer-Verlag, 1999.
Andrew J. Bennett, Roy L. Johnston, and Eleanor Turpin
University of Birmingham, Birmingham, United Kingdom
Jun Q. He
Aberystwyth University, Aberystwyth, United Kingdom
Table 1: Benchmark HP sequences used in the present study . The lowest energies that have been found for these sequences are indicated by E *. Name Length E * Sequence HP-18a 18 -9 PH[P.sub.2]HP[H.sub.3]P[H.sub.2]P[H.sub.5] HP-18b 18 -8 HPHP[H.sub.3][P.sub.3][H.sub.4][P.sub.2][H.sub.2] HP-18c 18 -4 [H.sub.2][P.sub.5][H.sub.2][P.sub.3]H[P.sub.3]HP HP-20a 20 -10 [H.sub.3][P.sub.2][(HP).sub.2]H[P.sub.2] [(HP).sub.2]H[P.sub.2]H HP-20b 20 -9 HPH[P.sub.2][H.sub.2]PH[P.sub.2]HP[H.sub.2] [P.sub.2]HPH HP-24 24 -9 [H.sub.2][P.sub.2][(H[P.sub.2]).sub.6][H.sub.2] HP-25 25 -8 [P.sub.2]H[P.sub.2][([H.sub.2][P.sub.4]).sub.3] [H.sub.2] HP-36 36 -14 [P.sub.3][H.sub.2][P.sub.2][H.sub.2][P.sub.5] [H.sub.7][P.sub.2][H.sub.2][P.sub.4][H.sub.2] [P.sub.2]H[P.sub.2] HP-48 48 -23 [P.sub.2]H[([P.sub.2][H.sub.2]).sub.2][P.sub.5] [H.sub.10][P.sub.6][([H.sub.2][P.sub.2]).sub.2] H[P.sub.2][H.sub.5] HP-50 50 -21 [H.sub.2][(PH).sub.3]P[H.sub.4]P [(H[P.sub.3]).sub.3]P[(H[P.sub.3]) .sub.2]HP[H.sub.4][(PH).sub.4]H Table 2: Comparison of the percentage success and average number of structure evaluations with and without using memory B-Cells Sequence No Memory B-Cells Memory B-Cells No. No. % Success Evaluations % Success Evaluations HP-18a 100 89,578 100 117,251 HP-18b 100 40,167 100 200,740 HP-18c 100 87,761 100 72,270 HP-20a 100 26,207 100 312,405 HP-20b 100 15,221 100 30,414 HP-24 100 26,580 100 49,616 HP-25 100 79,042 100 95,123 HP-36 63 4,867,993 90 3,082,014 HP-48 3 6,318,721 3 4,195,086 HP-50 50 4,904,031 96 853,706