Solving Protein Folding Problem using Elitism-Based Compact Genetic Algorithm

Proteins are vital components of living cells. A number of diseases, such as Alzheimer's disease, cystic fibrosis and Mad Cow disease, have been shown to result from the malfunctioning of proteins. Problem statement: The protein folding problem is the problem of predicting the optimal 3D molecular structure of a protein, or tertiary structure, which is an indicator of its proper function. Approach: An enhancement of the persistent elitist compact genetic algorithm (pe-cGA) was made to minimize the energy of a protein, which indicates how far the protein is from its optimal 3D structure. Energy was calculated using the Empirical Conformational Energy Program for Peptides (ECEPP) package. Results: Experiments were performed on the Met-enkephalin protein. The enhanced algorithm reached an energy of -7.378 in 140,000 iterations, surpassing the Distributed Genetic Algorithm (DGA), which reached the same energy in 700,000 iterations. A comparison was also made with the Breeder Genetic Algorithm (BGA), which did not reach this energy at all. Conclusions/Recommendations: The results show that the enhanced algorithm is superior to DGA and BGA, and that it offers a computational alternative to costly laboratory methods and an efficient means for solving organic docking problems.


INTRODUCTION
Proteins are fundamental components of all living cells. The bacteria that infect us, the plants and animals we eat, the hemoglobin that carries oxygen to our tissues, the insulin that signals our bodies to store excess sugar, the antibodies that fight infection, the actin and myosin that allow our muscles to contract, and the collagen that makes up our tendons and ligaments (and even much of our bones) are all examples of proteins. To make proteins, ribosomes string together amino acids into long, linear chains. Like shoelaces, these chains loop about each other in a variety of ways (i.e., they fold). But, as with a shoelace, only one of these many ways allows the protein to function properly. Yet lack of function is not always the worst scenario.
Recent discoveries have shown that some diseases (Alzheimer's disease, Cystic fibrosis, Mad Cow disease, and many cancer types) are the result of misfolded proteins. Also, protein misfolding is behind many of the unexpected difficulties biotechnology companies encounter when trying to produce human proteins in bacteria.
A misfolded protein can actually poison the cells around it, so a misfolded protein can be worse than one that merely fails to function.
Predicting the molecular structure of a protein (the polypeptide's native conformation) given only its amino acid sequence is not an easy task, but it has numerous potential applications [1]. This structure prediction problem is commonly referred to as the protein folding problem. Efforts to solve it nearly always assume that the native conformation corresponds to the global minimum free energy state of the system. Given this assumption, a necessary step in solving the problem is the development of efficient global energy minimization techniques. This is a difficult optimization problem because of the non-linear and multi-modal nature of the energy function.
The motivation of this work is to find the optimal 3D structure of a protein (the angles of its amino acids), which can then be used in treatment, by means of an Estimation of Distribution Algorithm (EDA) [2].

MATERIALS AND METHODS
The algorithm: Estimation of Distribution Algorithm (EDA): Instead of using traditional recombination and mutation operators, an Estimation of Distribution Algorithm (EDA) [1] generates the offspring population from an estimated probabilistic model of the parent population. EDAs express the interrelations among variables explicitly through the joint probability distribution associated with the individuals selected at each generation. The probability distribution is estimated from a database of selected individuals of the previous generation, and offspring are then generated by sampling this distribution. Neither crossover nor mutation is applied in EDAs. However, estimating the joint probability distribution associated with the database of selected individuals is not an easy task. The flow chart of EDA is shown in Fig. 1.

The simplest EDAs treat all the variables in a problem as uni-variate (independent). In the Uni-variate Marginal Distribution Algorithm (UMDA) [3], the joint probability distribution is factorized as a product of independent uni-variate marginal distributions.
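The select, estimate and sample loop of a uni-variate EDA can be illustrated with a minimal UMDA sketch. It assumes a binary encoding and a fitness to be maximized; the function name `umda`, its parameters (`pop_size`, `n_select`, `generations`) and the clamping of the marginals are illustrative choices, not details taken from [3].

```python
import random

def umda(fitness, n_bits, pop_size=100, n_select=50, generations=50, seed=0):
    """Minimal UMDA sketch for bit strings (maximizes `fitness`)."""
    rng = random.Random(seed)
    p = [0.5] * n_bits  # one independent marginal per variable
    best = None
    for _ in range(generations):
        # Sample offspring from the product of the uni-variate marginals.
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n_bits)]
               for _ in range(pop_size)]
        pop.sort(key=fitness, reverse=True)
        if best is None or fitness(pop[0]) > fitness(best):
            best = pop[0]
        # Re-estimate each marginal from the selected (best) individuals only.
        selected = pop[:n_select]
        p = [sum(ind[i] for ind in selected) / n_select for i in range(n_bits)]
        # Assumption: clamp the marginals so that sampling never freezes.
        p = [min(max(q, 1.0 / n_bits), 1.0 - 1.0 / n_bits) for q in p]
    return best, p

# Usage on the OneMax toy problem (fitness = number of ones).
best, p = umda(sum, n_bits=20)
print(sum(best), [round(q, 2) for q in p])
```

No crossover or mutation operator appears anywhere in the loop; all variation comes from sampling the estimated distribution.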

Bi-variate dependencies:
To account for pair-wise interactions among variables, the population-based Mutual Information Maximization for Input Clustering (MIMIC) algorithm [4], Combining Optimizers with Mutual Information Trees (COMIT) [5] and the Bivariate Marginal Distribution Algorithm (BMDA) [6] were introduced. These algorithms assume at most second-order dependencies among variables.

Multiple dependencies:
The factorization of the joint probability is calculated as a product of marginal distributions of variable size. Each of these marginal distributions covers the variables contained in the same group (variables that are strongly related) together with the probability distribution associated with them.
In this study, multiple dependencies are used because all the variables (protein angles) are strongly related.
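The three levels of dependency described above correspond to three ways of factorizing the joint distribution of a chromosome x = (x_1, ..., x_l). The notation below is introduced here for illustration only: π(i) denotes the single variable on which x_i is allowed to depend, and S_1, ..., S_m are disjoint groups of strongly related variables.

```latex
% No dependencies (e.g. UMDA): product of uni-variate marginals
p(x) = \prod_{i=1}^{l} p(x_i)

% Bi-variate dependencies: each variable depends on at most one other variable
p(x) = \prod_{i=1}^{l} p\left(x_i \mid x_{\pi(i)}\right)

% Multiple dependencies: product of marginals over groups of variable size
p(x) = \prod_{g=1}^{m} p\left(x_{S_g}\right),
\qquad S_1 \cup \dots \cup S_m = \{1, \dots, l\}, \quad S_g \cap S_h = \emptyset \ (g \neq h)
```

In the bi-variate case, a variable with no parent contributes its plain marginal p(x_i).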

Elitism-based Compact Genetic Algorithm (ECGA):
There are two elitism-based compact Genetic Algorithms (cGAs): the persistent elitist compact genetic algorithm (pe-cGA) and the non-persistent elitist compact genetic algorithm (ne-cGA) [11]. The aim is to design efficient compact-type GAs by treating them as Estimation of Distribution Algorithms (EDAs) for solving difficult optimization problems without compromising on memory and computation costs. The idea is to deal with issues connected with lack of memory, an inherent disadvantage of cGAs, by allowing a selection pressure that is high enough to offset the disruptive effect of uniform crossover. The point is to properly reconcile the cGA with elitism. The pe-cGA finds a near-optimal solution (i.e., a winner) that is maintained as long as other solutions (i.e., competitors) generated from the probability vector are no better. It attempts to adaptively alter the selection pressure according to the degree of problem difficulty by employing only the pair-wise tournament selection strategy.
Fig. 2: Pseudo code of the pe-cGA
Step 1. Initialize probability vector: for i := 1 to l do p[i] := 0.5;
Step 2. Generate one chromosome from the probability vector: if the first generation then ...
...
Step 5. Check if the probability vector has converged. Go to Step 2, if it is not satisfied.
Step 6. The probability vector represents the final solution.

The ne-cGA further improves the performance of the pe-cGA by avoiding strong elitism that may lead to premature convergence. It may seem that the ne-cGA always gives better results, and this is true for some problems. In this work, however, our experimental results showed that the pe-cGA is better and more suitable for the protein folding problem.
The pseudo code of the pe-cGA is shown in Fig. 2. In our experiments, however, the pe-cGA alone did not give good results, so we enhanced it to obtain better ones.
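As a rough illustration of the pe-cGA, the sketch below follows the commonly described form of the algorithm: one new chromosome per generation competes with a persistent elite, and the probability vector is shifted by 1/n towards the winner wherever the two differ. It assumes a binary encoding and a fitness to be maximized; the name `pe_cga`, the virtual population size `n` (assumed even) and the iteration cap `max_iters` are illustrative and are not taken from Fig. 2.

```python
import random

def pe_cga(fitness, n_bits, n=100, max_iters=20000, seed=0):
    """Persistent elitist cGA sketch for bit strings (maximizes `fitness`)."""
    rng = random.Random(seed)
    # Probabilities are stored as integer counts: p[i] = counts[i] / n.
    counts = [n // 2] * n_bits            # initialize p[i] := 0.5

    def generate():
        return [1 if rng.random() * n < c else 0 for c in counts]

    elite = generate()                    # first generation: initial elite
    elite_fit = fitness(elite)
    for _ in range(max_iters):
        new = generate()                  # one new chromosome per generation
        new_fit = fitness(new)
        # Pair-wise tournament: the elite persists unless strictly beaten.
        if new_fit > elite_fit:
            winner, loser = new, elite
            elite, elite_fit = new, new_fit
        else:
            winner, loser = elite, new
        # Shift the probability vector towards the winner where they differ.
        for i in range(n_bits):
            if winner[i] != loser[i]:
                delta = 1 if winner[i] == 1 else -1
                counts[i] = min(n, max(0, counts[i] + delta))
        # Converged when every probability is exactly 0 or 1.
        if all(c in (0, n) for c in counts):
            break
    p = [c / n for c in counts]
    return p, elite                       # p represents the final solution

# Usage on the OneMax toy problem (fitness = number of ones).
p, elite = pe_cga(sum, n_bits=16)
print(p, elite)
```

Only the probability vector and the single elite string are kept in memory, which is what makes the method "compact".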
Enhancement over pe-cGA: In this study, the pe-cGA is applied to the protein folding problem with two modifications: mutation, and keeping the best solution found so far.

Fig. 3: Mutation
The first modification is the addition of mutation. Mutation will be performed by adding a secondary tournament to each cycle that compares the performance of the current champion string with a mutated version of itself. If the mutated version wins, it replaces the old champion as the elite string for the next tournament. Note that our implementation of mutation presumes elitism. This allows periodic sampling of individuals around the champion independent of the current state of the genome probability vector. We can more formally describe this operation by modifying the standard cGA to contain elitism and by adding a new step as described in the pseudo code of Fig. 3.
We also found that taking the probability vector as the final solution does not yield the best result, so the second and final modification is to take as the final solution the best individual (in our case, the one with minimum energy) over all generations.
The complete algorithm is shown in Fig. 4.
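The two modifications can be sketched on top of a pe-cGA loop as follows. This is a hedged illustration, not the algorithm of Fig. 4: it assumes a binary encoding, bit-flip mutation with a hypothetical rate `p_mut`, and a stand-in `toy_energy` function in place of ECEPP. Energy is minimized, as in the paper.

```python
import random

def enhanced_pe_cga(energy, n_bits, n=100, max_iters=5000, p_mut=0.05, seed=0):
    """pe-cGA with a mutation tournament; returns the best individual found."""
    rng = random.Random(seed)
    counts = [n // 2] * n_bits            # p[i] = counts[i] / n, starts at 0.5

    def generate():
        return [1 if rng.random() * n < c else 0 for c in counts]

    elite = generate()
    elite_e = energy(elite)
    for _ in range(max_iters):
        # Primary tournament: elite against a sample from the probability vector.
        new = generate()
        new_e = energy(new)
        if new_e < elite_e:
            winner, loser = new, elite
            elite, elite_e = new, new_e
        else:
            winner, loser = elite, new
        for i in range(n_bits):
            if winner[i] != loser[i]:
                delta = 1 if winner[i] == 1 else -1
                counts[i] = min(n, max(0, counts[i] + delta))
        # Modification 1: secondary tournament, elite against a mutated copy
        # of itself; the probability vector is not involved in this step.
        mutant = [b ^ 1 if rng.random() < p_mut else b for b in elite]
        mutant_e = energy(mutant)
        if mutant_e < elite_e:
            elite, elite_e = mutant, mutant_e
        if all(c in (0, n) for c in counts):
            break
    # Modification 2: return the best individual rather than the probability
    # vector. With persistent elitism the elite is the lowest-energy
    # individual seen in any generation.
    return elite, elite_e

# Hypothetical stand-in for an energy evaluator: mismatches against a target.
TARGET = [1, 0] * 8

def toy_energy(bits):
    return sum(b != t for b, t in zip(bits, TARGET))

best, best_e = enhanced_pe_cga(toy_energy, n_bits=16)
print(best, best_e)
```

Because the mutant is derived from the champion rather than from the probability vector, the neighbourhood of the champion keeps being sampled even after the vector has nearly converged.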
The algorithm was implemented in C/C++ using Microsoft Visual Studio. The Empirical Conformational Energy Program for Peptides (ECEPP) package was used to evaluate the energy of proteins.

RESULTS
The target protein in this study is Met-enkephalin [12], which consists of five amino acids.
Since we search for the structure that gives the minimum energy, the fitness of an individual is calculated in terms of energy: the best individual is the one with the minimum energy. We used an energy evaluator called the Empirical Conformational Energy Program for Peptides (ECEPP) [13] to evaluate the energy of each individual.
A comparative study is made with two other algorithms (described below) that solve the same problem on the Met-enkephalin protein using ECEPP. The comparison is based on the best fitness (minimum protein energy) that each algorithm reached with respect to the overhead (number of energy evaluations) needed to reach this result.
In the Distributed Genetic Algorithm (DGA) [14], the total population is divided into sub-populations, each of which is often called an "island". In each sub-population, normal genetic operations are performed for several generations. After a certain number of generations, some of the individuals are chosen and moved to another island. This operation is called "migration". Because the population size in each island is small, early convergence may occur within an island. However, the migration operation prevents early convergence and maintains the diversity of the solutions during the search.
In the Breeder Genetic Algorithm (BGA) [15], at each generation the T% best individuals within the current population of N elements are selected. T% is called the truncation rate, and its typical values are within the range 10 to 50%. The selected individuals are randomly recombined and their offspring are mutated, so as to generate a new population of N-1 elements. The best individual of the old population is then added to the new population, and the cycle of life continues. By doing so, the best individuals are treated as super-individuals and mated together, in the hope that this leads to a fitter population.
Table 1 shows the performance of DGA and BGA against our enhanced ECGA. The number of evaluations needed is recorded for each algorithm at minimum energy. Figure 5 emphasizes the difference in performance between the two algorithms and our proposed one. Our proposed algorithm requires less overhead than the other two algorithms.
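For reference, one generation of the BGA as described above (truncation selection, random recombination, mutation, and re-insertion of the old best individual) can be sketched as follows. The bit-string encoding, uniform recombination and bit-flip mutation are simplifying assumptions made here for illustration; the function name `bga_generation` and its parameters are hypothetical and do not come from [15].

```python
import random

def bga_generation(pop, energy, truncation=0.3, p_mut=0.05, rng=random):
    """One BGA generation on bit strings; lower `energy` is better."""
    n = len(pop)
    ranked = sorted(pop, key=energy)
    # Truncation selection: keep the T% best individuals as parents.
    parents = ranked[:max(2, int(truncation * n))]
    offspring = []
    while len(offspring) < n - 1:
        a, b = rng.sample(parents, 2)     # random mating among the selected
        # Assumed uniform recombination followed by bit-flip mutation.
        child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
        child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
        offspring.append(child)
    # N-1 offspring plus the best individual of the old population.
    return offspring + [ranked[0]]

# Usage: a few generations on a toy energy (number of ones, to be minimized).
rng = random.Random(0)
pop = [[rng.randint(0, 1) for _ in range(12)] for _ in range(20)]
for _ in range(10):
    pop = bga_generation(pop, sum, rng=rng)
print(min(sum(ind) for ind in pop))
```

Re-inserting the old best individual guarantees that the best energy in the population never gets worse from one generation to the next.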
We can also observe that the BGA did not reach a fitness of -7.378. The DGA reached this fitness, but we do not know exactly how many evaluations it needed; we can note, however, that our algorithm reached this fitness with less than half the number of evaluations that the DGA needed to reach a fitness of 0.
These results show that the Elitism-based Compact Genetic Algorithm is very effective in solving the protein folding problem.

DISCUSSION
From the formulation of the protein folding problem, the angles that describe the protein structure in 3D are strongly inter-related and dependent. This is directly addressed by Estimation of Distribution Algorithms (EDAs), and consequently by the enhanced algorithm, which models the interactions between chromosomes in terms of a probability distribution vector. Thus, the enhanced algorithm moves progressively towards the optimal 3D structure of interacting angles by generating individuals that conform to the probability distribution of higher-fitness individuals.
Moreover, it appears that the enhanced algorithm correctly balances the exploration and exploitation needed for this problem. The Distributed Genetic Algorithm (DGA) relies more on exploration than on exploitation. The Breeder Genetic Algorithm (BGA) did not reach the energy reached by the enhanced algorithm at all. In addition, the overhead incurred by DGA and BGA is greater than that of the enhanced algorithm.

CONCLUSION
In this study, the molecular structure of the Met-enkephalin protein is predicted. The structure is always assumed to be the global minimum free energy state of the system. For this optimization problem, an enhancement of the Elitism-based Compact Genetic Algorithm (ECGA) is made to minimize the protein energy. Results show that the enhanced ECGA has little overhead in terms of the number of evaluations needed.