Genetic Algorithm Based Probabilistic Motif Discovery in Unaligned Biological Sequences

Finding motifs in biosequences is one of the most important primitive operations in computational biology. A motif discovery algorithm must satisfy several computational requirements, such as memory space and computational complexity. To overcome the complexity of motif discovery, we propose an alternative solution that integrates a genetic algorithm with the Fuzzy ART machine learning approach and eliminates the multiple sequence alignment process. Problem statement: More than a hundred methods have been proposed for motif discovery in recent years, representing a large variation with respect to both algorithmic approaches and the underlying models of regulatory regions. The aim of this study was to develop an alternative solution for motif discovery that benefits from both data mining and genetic algorithms while eliminating the cost of multiple sequence alignment. Approach: A genetic algorithm based probabilistic motif discovery model was designed to solve the problem. The proposed algorithm was implemented in Matlab and tested with large DNA sequence data sets and synthetic data sets. Results: The proposed model was compared with the existing method in terms of the time taken to find motifs and the length of the motifs found. The proposed method finds a motif of length 11 in 18 sec and a motif of length 15 in 24 sec, whereas the existing method finds a motif of length 11 in 34 sec. On these measures, the proposed method outperforms the popular existing method. Conclusion: In this study, we proposed a model that discovers motifs in a large set of unaligned sequences in considerably less time, and the motifs found were also longer.


INTRODUCTION
Recent growth in bioinformatics has drawn many researchers' attention to this area. Biologists, computer scientists, and researchers from various other fields have contributed studies that aim to extract more benefit from biological data. Motif discovery is one such use of biological data, and naturally it is among the most popular topics in bioinformatics. Motif discovery can be described as follows: given a sample of sequences, can we find the unknown pattern that is implanted at different positions in those sequences [1]? The importance of these patterns for biology comes from the role of motifs at protein-DNA binding sites. Furthermore, finding similar sequences can help reveal unknown evolutionary relationships between different species.

MATERIALS AND METHODS
Numerous studies have sought solutions to motif discovery, and many algorithms have been developed to improve on the popular existing tools in terms of performance, motif length and/or other considerations. Stine et al. [5] employed a genetic algorithm in their structured Genetic Algorithm (St-GA) to search for and discover highly conserved motifs among upstream sequences of co-regulated genes. Liu et al. [6] also employed a genetic algorithm to find potential motifs in the regions of the Transcription Start Site (TSS). Pan et al. [7] developed the MacosFSpan and MacosVSpan algorithms to mine maximal frequent sequences in biological data. MacosFSpan and MacosVSpan underline the inefficiency of apriori-like algorithms and seek a mining solution that works better on biological datasets [7], whereas Liu et al. [6] combine a genetic algorithm approach with multiple sequence alignment tools to discover motifs. St-GA [5] works in a similar fashion and also requires multiple sequence alignment. Among the existing works, the most recognized include the Multiple EM for Motif Elicitation (MEME) system [2][3][4]. The work in [9] proposed a top-down mining method called ToMMS, which is a promising approach for mining long sequential patterns. Classical mining methods use a bottom-up strategy: they find shorter frequent itemsets first and proceed step by step to the largest frequent itemset. In contrast, the top-down strategy starts from a predetermined length and searches downward through shorter lengths until it finds a frequent itemset, which is then clearly the largest frequent one. The only weak point of the top-down strategy is that specifying its starting point requires user knowledge. Baloglu and Kaya [12] proposed a GA-based top-down data mining approach for finding motifs in biosequences, combining a genetic algorithm with the top-down data mining method.
The aim of this study is to develop an alternative solution for motif discovery that benefits from both data mining and genetic algorithms while eliminating the cost of multiple sequence alignment. This computational cost is also emphasized in [5], which suffers from the use of the time-consuming BLAST [8]. Because a combination of a machine learning approach and a genetic algorithm is not time consuming, we considered not only the computational cost of alignment and how to eliminate it, but also the most efficient way to handle the mining part of our approach. However, one of the motivations of motif discovery is to find longer motifs, since short ones are of little use. This requirement motivates the design of a hybrid model for motif discovery. In this study, we use a hybrid model of GA with Fuzzy ART for motif discovery.
Our solution combines a genetic algorithm with Fuzzy ART to discover motifs in biological sequence datasets. The approach has two main stages. First, we use the genetic algorithm to find candidate motifs: two or more potentially matching motif regions M1, M2...Mn of length 'W' are selected in one or more gene sequences.
Second, a Fuzzy ART (FART) neural network is trained to recognize the 'n' previously found potential motifs M1, M2...Mn as 'n' different classes. All possible segments of window length 'W' in the sequences are then classified using the trained FART network, and the detected motifs are grouped into 'n' groups based on their class labels. This yields n sets of potential motifs in the sequence. If necessary, the expected motif length is changed and the search continues.
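The second stage can be illustrated with a minimal Fuzzy ART style sketch in Python. It is not the authors' Matlab implementation: the one-hot encoding with complement coding, the fast-learning rule (each seed motif becomes one category weight), the frozen weights during classification, and the values of `vigilance` and `alpha` are all assumptions made for illustration, as are the function names.

```python
ALPHABET = "ACGT"

def encode(window):
    """One-hot encode a DNA window, then append its complement (complement coding)."""
    a = [1.0 if base == symbol else 0.0 for base in window for symbol in ALPHABET]
    return a + [1.0 - x for x in a]

def fuzzy_and_norm(x, w):
    """L1 norm of the component-wise minimum (the fuzzy AND of x and w)."""
    return sum(min(xi, wi) for xi, wi in zip(x, w))

def train(seeds):
    """Fast learning from uncommitted nodes: each seed's code becomes one category."""
    return [encode(seed) for seed in seeds]

def classify(window, weights, vigilance=0.9, alpha=0.001):
    """Return the index of the best category passing the vigilance test, else None."""
    x = encode(window)
    norm_x = sum(x)
    # Rank categories by the choice function |x AND w_j| / (alpha + |w_j|).
    order = sorted(
        range(len(weights)),
        key=lambda j: fuzzy_and_norm(x, weights[j]) / (alpha + sum(weights[j])),
        reverse=True,
    )
    for j in order:
        # Vigilance test: the match ratio must reach the threshold.
        if fuzzy_and_norm(x, weights[j]) / norm_x >= vigilance:
            return j
    return None

def group_windows(sequence, seeds, vigilance=0.9):
    """Classify every window of length W and group start positions by class label."""
    w = len(seeds[0])
    weights = train(seeds)
    groups = {j: [] for j in range(len(seeds))}
    for start in range(len(sequence) - w + 1):
        label = classify(sequence[start:start + w], weights, vigilance)
        if label is not None:
            groups[label].append(start)
    return groups
```

With this encoding, a window that matches a seed in m of W positions has a match ratio of 0.5 + m/(2W), so for W = 6 a vigilance of 0.9 accepts windows with at most one mismatch. The vigilance would need retuning whenever the expected motif length is changed.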
The proposed GA based motif discovery algorithm: Step 1: The initial population consists of two sets of strings represented as binary numbers. The members are selected randomly; each is a bit string that represents the size and location of two or more subsequences. The population is $P = \{p_1, p_2, \ldots, p_s\}$, where $s$ is the size of the population. The two sets of strings point to two sets of dissimilar locations in the sequence $G$.
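One possible chromosome layout for Step 1 is sketched below in Python. The paper does not give the exact bit layout, so this sketch assumes a fixed window length `w` and a chromosome that stores only the two start positions as fixed-width binary fields; all function names are hypothetical.

```python
import random

def bits_needed(n_positions):
    """Number of bits required to index n_positions start sites."""
    return max(1, (n_positions - 1).bit_length())

def to_bits(value, nbits):
    """Fixed-width binary representation of value as a list of 0/1 integers."""
    return [int(c) for c in format(value, "0{}b".format(nbits))]

def random_chromosome(seq_len, w, rng):
    """Bit list encoding two dissimilar start positions of length-w windows."""
    n_positions = seq_len - w + 1
    nbits = bits_needed(n_positions)
    a, b = rng.sample(range(n_positions), 2)  # two distinct locations
    return to_bits(a, nbits) + to_bits(b, nbits)

def decode(chromosome, seq_len, w):
    """Map a chromosome back to two valid start positions."""
    nbits = len(chromosome) // 2
    n_positions = seq_len - w + 1
    halves = (chromosome[:nbits], chromosome[nbits:])
    # The modulo keeps positions valid after crossover or mutation.
    return tuple(int("".join(map(str, h)), 2) % n_positions for h in halves)

def initial_population(seq_len, w, size, seed=0):
    """Random initial population of `size` chromosomes."""
    rng = random.Random(seed)
    return [random_chromosome(seq_len, w, rng) for _ in range(size)]
```

For a sequence of 100 bases and `w = 11` there are 90 start positions, so each position needs 7 bits and each chromosome is 14 bits long.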
Step 2: Evaluation: After the generation is formed, the first step is to compute the fitness value of each member of the population $P = \{p_1, p_2, \ldots, p_s\}$. The fitness of each member depends on the similarity of its corresponding subsequence pair. For each chromosome, the fitness evaluation applies a suitable gene subsequence matching policy, such as the Hamming distance or a more sophisticated score-matrix-based distance measure.
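A minimal Python sketch of a Hamming-based fitness is given below. It scores a pair of windows by the number of matching positions; the zero score for overlapping windows is an assumption added here to avoid trivially similar pairs and is not stated in the paper.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("subsequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def fitness(sequence, start_a, start_b, w):
    """Similarity of two length-w windows: w minus their Hamming distance."""
    if abs(start_a - start_b) < w:
        return 0  # assumed penalty: overlapping windows are not a useful pair
    sub_a = sequence[start_a:start_a + w]
    sub_b = sequence[start_b:start_b + w]
    return w - hamming_distance(sub_a, sub_b)
```

For example, in `"TTACGTAGGCCACGTTGAA"` the windows of length 6 at positions 2 (`ACGTAG`) and 11 (`ACGTTG`) differ in one base, so `fitness(seq, 2, 11, 6)` is 5. A score-matrix-based measure could replace `hamming_distance` without changing the rest of the algorithm.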
Step 4: Create a new population: After evaluation, a new population is created from the current generation using three operators: reproduction, crossover, and mutation.
The size of the population is fixed with regard to the convergence factors. This process also takes previously selected potential motifs into account.

Reproduction:
The two chromosomes (strings) with the best and second-best fitness are allowed to survive and produce offspring in the next generation; that is, the two best-matching subsequence pairs are selected as the new parents. One-cut-point crossover is used: a cut point is selected randomly and the right-hand parts of the two parents are exchanged to produce the offspring. The crossover point can also be chosen selectively, taking the convergence factors into account.
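Reproduction and one-cut-point crossover can be sketched in Python as follows. Chromosomes are assumed to be lists of bits, and `fitness_fn` is a hypothetical callable that scores a chromosome; the selective choice of the cut point based on convergence factors is not modeled here.

```python
import random

def select_two_best(population, fitness_fn):
    """Return the chromosomes with the best and second-best fitness."""
    ranked = sorted(population, key=fitness_fn, reverse=True)
    return ranked[0], ranked[1]

def one_cut_point_crossover(parent_a, parent_b, rng):
    """Swap the right-hand parts of two parents at one random cut point."""
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the string
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```

Because the cut point lies strictly inside the string, each child inherits at least one bit from each parent.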
Step 5: Mutation: After crossover, mutation is performed. The mutation level is selected with the convergence factor in mind. In this process one or more genes are altered with a probability equal to the mutation rate.
• A sequence of random numbers $r_1, \ldots, r_k$ is generated, where $k$ is the number of bits in the whole population.
• If $r_i$ is 1, the $i$-th bit in the whole population is flipped from 1 to 0 or from 0 to 1.
• The reproduced chromosomes are not subject to mutation, so they are restored after the mutation process.
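The mutation step above can be sketched in Python as a per-bit flip. Instead of mutating every chromosome and restoring the reproduced ones afterwards, this sketch simply skips them, which has the same effect; `elite_indices` is a hypothetical parameter naming the positions of the reproduced chromosomes in the population.

```python
import random

def mutate_population(population, mutation_rate, elite_indices, rng):
    """Flip each bit with probability mutation_rate, leaving elites untouched."""
    mutated = []
    for idx, chromosome in enumerate(population):
        if idx in elite_indices:
            mutated.append(list(chromosome))  # reproduced chromosomes are kept
            continue
        mutated.append(
            [bit ^ 1 if rng.random() < mutation_rate else bit for bit in chromosome]
        )
    return mutated
```

The function returns a new population and leaves the input lists unmodified.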
The output of a single iteration of the genetic algorithm is a new population. Go to Step 2.
The procedure can be repeated for any desired number of iterations. In every generation, the best value of the objective function over the population is computed.
Ultimately, two final sets of potential motifs, $P = \{p_1, p_2, \ldots, p_s\}$ and $Q = \{q_1, q_2, \ldots, q_s\}$, are obtained. Using either of these two sets as seeds, other similar patterns in the sequence $G$ can be identified with a sliding window operation. Because this operation involves only one pass, matching the $s$ subsequences at every possible window position (presuming a uniform motif length within $P$ or $Q$), this technique consumes less time than brute-force approaches to motif discovery. In summary, the method first detects a set of potential motifs in a sequence and then, using the detected motifs as seeds, finds all similar patterns in the overall sequence by the sliding window operation. The proposed model for motif discovery using the genetic algorithm is depicted in Fig. 1.
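The sliding window operation can be sketched in Python as a single pass over the sequence. The mismatch threshold `max_mismatches` and the use of the Hamming distance as the matching policy are assumptions for illustration; the seeds are assumed to share one length.

```python
def sliding_window_matches(sequence, seeds, max_mismatches):
    """One pass over the sequence, comparing each window with every seed motif."""
    w = len(seeds[0])
    if any(len(seed) != w for seed in seeds):
        raise ValueError("all seed motifs must have the same length")
    hits = []
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        # Smallest Hamming distance between this window and any seed.
        best = min(sum(x != y for x, y in zip(window, seed)) for seed in seeds)
        if best <= max_mismatches:
            hits.append((start, window, best))
    return hits
```

For the sequence `"TTACGTAGGCCACGTTGAA"` with the single seed `"ACGTAG"` and `max_mismatches=1`, the pass returns `[(2, "ACGTAG", 0), (11, "ACGTTG", 1)]`. The cost is proportional to the number of windows times the number of seeds times the motif length, with no alignment step.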

RESULTS
A dataset of 300 E. coli promoter sequences was used for the experiments. This dataset was previously used by Baloglu and Kaya [12]. The proposed GA based motif discovery model (depicted in Fig. 1) was implemented in Matlab on Windows XP on a standard desktop PC (Intel Pentium 2 GHz, 512 MB RAM). Matlab's built-in toolbox was not used, to allow customization; instead, a custom genetic algorithm model was developed from scratch to solve the motif discovery problem. The developed system detected potential motifs in a remarkably short time.
The GA parameters that gave the best system performance were found by trial and error. Real data sets were used to test the proposed model. The experiments use the E. coli gene sequence file (EcoliPromoters1_300.seq), as used in [12]. Each sequence in the dataset is 100 bases long, and the sum of the lengths of all sequences in the dataset is 30300. The system was programmed to discover 5 motifs. The following GA parameters were assumed:
• Total population size: 100
• Total number of generations: 20
• Mutation level: 0.2
• Crossover rate: 0.20
Table 1 shows the performance of the GA based method in terms of the time taken and the length of the motif found. Fig. 2 plots the performance of the GA based method in terms of time and motif length. It is evident that the Top-Down GA method outperforms the basic motif discovery methods such as MEME and the Gibbs algorithm. Fig. 3 depicts the same comparison. Table 2 and Fig. 3 show the performance results of the various genetic algorithm based techniques.

DISCUSSION
The proposed GA based Fuzzy ART method is compared with the implemented GA based exhaustive search method and the Top-Down GA method to show that the proposed motif discovery algorithm outperforms the existing techniques.
From Table 2 and Fig. 4, it is clear that the proposed GA based Fuzzy ART method outperforms the GA based exhaustive search and Top-Down GA methods.
To overcome the complexity of motif discovery, an integration of a genetic algorithm and the Fuzzy ART mining approach is proposed, which eliminates the multiple sequence alignment process. From the experimental results, it can be inferred that the proposed combination of genetic algorithm and Fuzzy ART mining outperforms other well-known motif discovery algorithms, such as MEME, the Gibbs Sampler and the genetic algorithm. The results obtained were promising, and the proposed model yielded improved performance over the brute-force approaches. Five likely motifs were detected within a minute using the proposed model from a sequence of length 30300. The same model can be applied to detect motifs in any sequence other than a gene sequence, such as time series data. Accordingly, this research does not focus on the biological significance of the detected motifs; this can be addressed in future work.
A comparative study of the time taken and the motif length found was carried out on popular methods such as [3,17,19], together with the GA based exhaustive search and the proposed GA based FART. The GA based FART method outperformed the others by a considerable margin. Overall, the speed and motif length achieved by the proposed method were found to be satisfactory. Even though the proposed model discovers a given number N of motifs of length L, the problem of discovering all motifs of all possible lengths remains unaddressed and can be considered a scope for enhancement.

CONCLUSION
To find motifs in DNA sequences, the proposed GA based model was designed and implemented in MATLAB under the Windows operating system on standard desktop computers. The performance of the proposed model was tested with very large synthetic numeric data sets and DNA sequence data sets. Several tests were run on the model and, overall, significant results were achieved. Compared with existing approaches, the performance of the proposed model was very encouraging. In the 30300-character gene sequence, it detected 5 probabilistic motifs in less than a minute. The proposed model discovers only a given number N of motifs of length L each; issues such as finding all motifs of all possible lengths remain open and can be addressed in future work. The same model can be applied to detect motifs in any sequence other than a gene sequence, such as time series data. For this reason, this research did not address the biological significance of the detected motifs. Future research in the biological domain, where the biological significance of motifs is of particular concern, can address this issue.