Genome Sequence Analysis: A Survey

Problem statement: Sequence analysis problems are NP-hard and call for solutions that are as close to optimal as possible. Interesting problems include duplicate sequence detection, sequence matching by relevance, sequence analysis using approximate comparison, either in general or with tools such as MATLAB, and multilingual sequence analysis. The usefulness of these operations is highlighted and future expectations are described. Approach: This study describes the concepts, tools, methodologies and algorithms used for sequence analysis. Sequences contain valuable information that needs to be mined for useful purposes, and modelling an optimal solution requires considerable care. Similarity and alignment cannot be addressed with a single technique or algorithm; better performance is achieved by combining different concepts. Results: We compared different approaches using exemplary data and found that ClustalW2 is a fairly good tool for analysis. We assigned weight values to the relevant features and obtained a score of 95 for comparison and 45 for alignment. Conclusion: Different techniques and approaches were evaluated and compared.


INTRODUCTION
Sequences are logical units that contain vital information. Consider, for instance, biological sequences, which are composed of nucleotide bases denoted A (adenine), T (thymine), G (guanine) and C (cytosine). The structure and position of these bases in a sequence influence the traits and inherited characteristics of a species.
Mining useful information from the vast repositories of sequence data yields interesting results about genes and their functional properties. A main focus of biologists is to differentiate species on the basis of these functional characteristics, and many solutions have been proposed that claim to give optimal results. Direct (exact) matching against sequence repositories is inefficient and may give slow and inaccurate results, so going beyond the exact match is necessary.
Modern computational technology and hardware have made it relatively easy for scientists to obtain accurate results. This is especially visible in DNA microarray technology and in image data-set comparison techniques, where very large volumes of genetic data are compared approximately and quickly.
The data is spread over chips and relevancy is determined. Other tools such as MATLAB, TRADOES and EBMT are now broadly used for sequence manipulation. FASTA and BLAST are also very popular among biological researchers for sequence comparison. Many tools have been developed for the analysis not only of genetic sequences but also of corpora sequences, where lexical analysis explores the hidden resources in these structures. Global alignment tools have replaced local ones, and multiple alignment techniques have made it possible to learn more about the diversity of functional properties of species from their sequences.
There is interest in mining association rules from genetic and lexical data. These rules help to understand the patterns in the data, and further exploration may lead to results that could not be obtained by applying queries. Queries only generate views of the datasets within a confined domain, according to predefined rules expressed as queries. Later solutions present query enhancement techniques, but these are not as effective as generating rules directly from the datasets.
Scientists now use the latest systems in biotechnology to store genetic data, employing data warehousing techniques and analyzing DNA sequences. Their use is not limited to computation; they can help solve many complex biological problems.
The most comprehensive use of these computer-aided techniques falls in the field of molecular medicine, which is itself a broad field that involves physical, biological and chemical methods for the depiction of molecular structure. Another important aspect of genome analysis is building evolutionary models and phylogenetic tree structures. Fig. 1 describes the sequence analysis hierarchy. At the top of this hierarchy, general sequence analysis indicates that sequences may be of different kinds: for instance genetic sequences, protein sequences, multi-language sequences, corpora and other mathematical sets of occurrences of events, characters or symbols. In genome sequence analysis, biologists today pay keen attention to alignment and microarray analysis, because alignment reveals interesting facts about diversity in species, genetic relationships between species and the degree of relevancy, that is, how different or similar one creature is from another. Microarray technology gives collective and close results for sequence analysis and is regarded as a technology of the future.

MATERIALS AND METHODS
Sequence comparison: Sequence comparison is a method in which two or more sequences are searched for domain-specific patterns, which first requires an alignment procedure. In bioinformatics, two kinds of alignment are commonly distinguished: local and global.
Local alignment looks for the best-matching regions within the sequences, while global alignment spans the sequences over their whole length (Fig. 2).
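
The difference between the two kinds of alignment can be illustrated with a minimal Python sketch. It is not taken from any of the tools discussed here; it uses the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty, and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen only for illustration.

```python
def _substitution(x, y, match, mismatch):
    # Score for aligning character x with character y.
    return match if x == y else mismatch


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: only gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: only gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + _substitution(a[i - 1], b[j - 1], match, mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over any pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + _substitution(a[i - 1], b[j - 1], match, mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

For the hypothetical strands TTACGTTT and GGACGTGG, the local score rewards the shared region ACGT, while the global score is pulled down by the mismatched flanks on both sides.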

Sequence analysis tools:
The following are a few of the tools developed for sequence analysis.

EMBOSS:
This tool has been developed to compare two sequences. It has two parts: Needle, which is used when the comparison is required over the whole length of both sequences, and Water, which reports region-wise similarity between the strands.

CLUSTALW 2:
It provides a meaningful sequence match for both DNA and protein sequences, shows the degree of similarity and difference between strands in a visual environment, and also provides an evolutionary relationship between the sequences. It is intended to be a fast and accurate multiple sequence alignment tool. It requires a supported format for the input strands, can also take input from the command line, and can provide interactive results for both protein and DNA strands.

MAFFT:
MAFFT is a tool designed for the alignment of sequences using Fast Fourier Transforms. It is claimed to be a high-quality multiple alignment tool that produces results quickly. A notable feature of this tool is its gap extension setting; it also requires a specific format for the input strands.

MUSCLE:
MUSCLE is a multiple sequence alignment tool that compares sequences by log-expectation. It is claimed to perform better than CLUSTALW2 or T-COFFEE. It also requires the strands to be in a specific format and can generate an output tree that helps in understanding the alignment.

T-COFFEE:
It is also a multiple sequence alignment program. It can combine alignments derived from other alignment programs, so it acts as a refinement over other tools. It can build the alignment pair by pair, producing both global and local alignments.
The phase-wise alignments can then be combined into a final, refined multiple alignment.
Exemplary comparison of tools: Suppose we have a genetic data file that contains sequences of human and mouse.
The genetic data for mouse is sequenced as in [22]. Running the specimen data on EBI CLUSTALW2 gives an alignment score of 2031, and both sequences contain the same number of characters.
From this it is clear that the pair-wise alignment of the human and mouse sequences is shown with the symbol (*) where a match is found and the symbol (.) where characters are mismatched. The overall score is 95 for the sequence pair.
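
A match line of this kind can be produced from two aligned rows with a few lines of Python. The sketch below follows the convention described in the text, (*) for a match and (.) for a mismatch. Treating a gap character '-' as a mismatch, and the function names themselves, are assumptions made for illustration and not the behaviour of any particular tool.

```python
def match_line(row_a, row_b):
    """Return '*' for identical columns and '.' for differing ones."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Assumption: a column containing a gap ('-') counts as a mismatch.
    return "".join(
        "*" if x == y and x != "-" else "."
        for x, y in zip(row_a, row_b)
    )


def percent_identity(row_a, row_b):
    """Share of identical columns, as a percentage of the alignment length."""
    line = match_line(row_a, row_b)
    return 100.0 * line.count("*") / len(line) if line else 0.0
```

A percent-identity figure of the kind that alignment tools report can be derived from the same line by counting the (*) columns.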
Executing the same data set with EBI Align, we get a gap penalty of 10 and an extension penalty of 0.5.
The sequence lengths are the same, identity is 95.9% and similarity is 97.6%, and gaps and score are 0.0% and 1693 respectively. Similarity is shown by vertical lines and differences by (.). MAFFT takes both strands as input and applies default gap penalties: gap extension is set to 0.123 and gap open to 1.53.
Kalign sets the gap open penalty (gpo) to 11.0 and the gap extension penalty (gpe) to 0.85. The alignment is not marked with symbols and no clear identifiers are shown, so the viewer has to concentrate more when reading the alignment, but it is considered to be much better than MAFFT.
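
The gap open and gap extension values quoted above are parameters of an affine gap penalty. As a rough sketch, assuming the common convention in which the first position of a gap costs the opening penalty and every further position costs the extension penalty (individual tools may define this slightly differently), the cost of a single gap can be computed as follows. The dictionary simply restates the values reported in the text.

```python
# (gap open, gap extension) as reported in the text for each tool.
REPORTED_PENALTIES = {
    "EBI Align": (10.0, 0.5),
    "MAFFT": (1.53, 0.123),
    "Kalign": (11.0, 0.85),
}


def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one contiguous gap of `length` positions.

    Assumed convention: open once, then extend for each further position.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def gap_costs(length):
    """Cost of a gap of the given length under each reported parameter set."""
    return {
        tool: affine_gap_cost(length, gap_open, gap_extend)
        for tool, (gap_open, gap_extend) in REPORTED_PENALTIES.items()
    }
```

Under this convention a long gap is only slightly more expensive than a short one, which is why the extension penalty is typically much smaller than the opening penalty.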
MUSCLE, run on the same data sets, reports no gap penalty or gap extension; instead it shows the similarities and differences in the alignment using colors. T-COFFEE generates an alignment score of 61 without mentioning the gap penalty or gap extension, and displays the aligned results using the symbol (*) for similarity and (.) for difference.

Table 1 gives the comparative analysis of the various tools run on the same data set. The differences in the results show that each tool approaches the NP-hard problem of sequence alignment in its own way: some tools generate a visual alignment and others give numerical scores.

Table 2 shows the criteria in terms of the features of the tools. A scoring scheme measures the cumulative performance of each tool. Local, global and multiple alignment are each given a weight of 0.15 out of 1. Visual representation of the alignment is a strong feature of an alignment tool and is given a weight of 0.2. Similarly, general score and phylogenetic tree depiction are weighted 0.15 and 0.1. Another important feature of an alignment tool is gap consideration, which is given a weight of 0.1. For each tool, a percentage is calculated from the sum of its feature ratings (1 = absent, 2 = average, 3 = present). Table 2 shows that CLUSTALW2 performs well: it has average features for local and global alignment, full features for multiple alignment, full features for visual representation of strands, full features for score and gap consideration, and a very powerful feature in the phylogenetic tree representation of the aligned specimen data. This last feature is lacking in all the other tools, which makes CLUSTALW2 significant compared to the others.
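
The scoring scheme can be sketched in Python as a weighted sum of the feature ratings. The weights and the CLUSTALW2 ratings are those stated in the text; the normalisation (dividing by the maximum attainable weighted sum to obtain a percentage) is an assumption, since the text does not spell out the exact formula, and the feature keys are hypothetical names.

```python
# Feature weights as stated in the text (they sum to 1).
WEIGHTS = {
    "local": 0.15,
    "global": 0.15,
    "multiple": 0.15,
    "visual": 0.20,
    "score": 0.15,
    "phylo_tree": 0.10,
    "gap": 0.10,
}

MAX_RATING = 3  # 1 = absent, 2 = average, 3 = present


def cumulative_score(ratings):
    """Weighted rating as a percentage of the best possible weighted rating."""
    weighted = sum(WEIGHTS[feature] * ratings[feature] for feature in WEIGHTS)
    return 100.0 * weighted / (MAX_RATING * sum(WEIGHTS.values()))


# Ratings for CLUSTALW2 as described in the text.
clustalw2 = {
    "local": 2, "global": 2, "multiple": 3,
    "visual": 3, "score": 3, "phylo_tree": 3, "gap": 3,
}
```

Calling cumulative_score(clustalw2) gives roughly 90 under this assumed normalisation. The figure is illustrative only and is not taken from Table 2.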
A review of previous work: Bansal [1] presents a useful idea in the form of a framework that treats multiple sequences as an abstract data type and integrates the information gathered through it. This information is helpful for generating a phylogenetic tree. The authors developed a generic high-level language library for complex analysis of multiple sequences. They derived groups of amino acids in homologous proteins that share common properties, and identified constrained columns that conserve common properties despite mutations that result in different types of amino acids in the column. A PROLOG tool is applied to the proposed framework. Because a high level of abstraction is used for the alignment of sequences through the PROLOG tool, the approach is sometimes not well suited to generating standard optimal results, and the overall comparison is not clearly visible [1]. Kappen [2] described an annotation technique for comparison between mouse chromosome 9 and human chromosome 15. The draft sequence data were obtained from genetic databases, and a complex map containing 14 genes was presented as a genome map. The framework described in the study for data interpretation and demonstration can be quite helpful for generating more complex maps, provided the time constraint is kept in mind, and the ideas may lead towards automated genome annotation techniques [2]. A very useful feature of this approach was the comparative use of information from the human and mouse species, which provided a framework for the discovery of three previously unknown genes. The limitation of this framework is that it requires more labor, in the form of critical evaluation, before any prediction can be accepted, and the focus must be on smaller regions of the map to obtain more refined results [2].
Nahar [3] presented a web-based tool that provides comparative genome sequence analysis. The tool is interactive, and the user can interact with its different parts to view and monitor results. The claim is that the idea is novel: DNA sequences are not analyzed directly but with the help of self-adjusting maps, which can suggest possible evolutionary explanations for certain results. The authors describe as a strong advantage that the tool is highly interactive in the visual identification of horizontally transferred genes, a functionality that is not available in other techniques or tools. One weakness of the idea is that the user is sometimes not interested in the maps and simply wants approximate final results without interacting with the application interfaces. A second is that, because the tool is web-based, it may not be possible to measure the actual time complexity of the application.
Chang [4] proposes a package of integrated comparative analysis for the comparison of different genomes. The framework performs efficient gene identification and functional annotation, plots numerous measures for all positions in a long DNA sequence and can perform whole-genome comparison [4]. It also proposes a cross-species pathway comparison with customized starting and ending points of pathways. The idea is sound, as it covers both section-wise and whole-genome analysis, but the example illustrated in the study does not cover the complete idea, and a deeper understanding may reveal aspects that remain hidden [4]. More tools for comparative analysis may be required. Cornell [5] proposed a data warehouse, the Genome Information Management System (GIMS), that incorporates both genomic sequence and functional data [5]. The warehouse is explained using yeast genome data as an example. It can answer many useful queries and serves as a basis for future exploration by creating a large data warehouse with genomic and functional features. The claim is that this framework will provide more effective analysis of genomes together with their functional properties; it focuses on the development of data management and analysis techniques for use with multiple genome data sets. If comprehensive storage is available, genomic data warehousing is an appealing idea that can replace conventional approaches to genomic analysis [5]. A minor weakness is that more effort and work are required to construct the genomic warehouse.
Ahmed [6] proposed an algorithm that was evaluated experimentally in a distributed grid environment and that offers high scalability at low computational cost [6]. Because the difficulty of multiple sequence alignment and comparison grows with sequence length, a parallel approach that works on parts of the sequences and then integrates them can lead to better approximate results, so the main focus is on the use of grid computing for large biological data. The algorithm was studied in three different distributed environments: a single cluster environment, a single cluster grid environment and a multi-cluster grid environment [6]. The approach essentially requires a distributed environment with many additional resources, which may be costly compared to traditional approaches. Agrawal [7] proposes a heuristic approach for multiple sequence alignment. The author claims that the dynamic programming algorithm involves computational complexity that makes it slow and inefficient, compares the proposed algorithm with CLUSTALW, which takes O(N^2 n^2) time, and claims that the modified technique runs in O(N log2(N n^2)). The proposed approach also makes the alignment process more dynamic, as the order in which sequences are added to the multiple sequence alignment depends on the multiple sequence alignment already computed [7]. The claim is not supported with examples and results; more study is required to obtain solid results.
Cai [8] described a comprehensive evolutionary computational approach for multiple sequence alignment. The authors represented a set of 17 clusters of orthologous groups of proteins, compared the results with the standard results from CLUSTALW and found the proposed results better than the standard approach [8]. One major weakness of the idea is that the current implementation uses a fixed-parameter tractable algorithm for gap 0-1 alignments, which is not feasible for finding alignments when the number of sequences is much larger than 15. The comparison is therefore good at small scale but not efficient for large-scale sequences.
Liu Weiguo [9] proposed a streaming approach for multiple sequence alignment based on PC graphics hardware. Using modern graphics processing units for high-performance computing at low cost makes it possible to obtain more refined results. The authors reformulated dynamic-programming-based alignment as a streaming algorithm within the constraints of computer graphics hardware. The proposed technique is efficient, and its only weakness is its dependence on graphics hardware primitives: suitable graphics hardware is mandatory for applying and running the approach.
Zhao [10] presents an improved Ant Colony algorithm that is a more refined form of the previous technique. The authors claim that their modified approach can operate on genomic sequences of any length, whereas the traditional Ant Colony approach uses fixed-length sequences, and that it can avoid the local optimum problem, so that the proposed technique gives robust and efficient results. The weakness of this approach is that searching for a small chunk in larger sequences may give erroneous results, which suggests that using it for multiple sequence alignment would be less useful than traditional approaches [10]. Arslan [11] described an improved algorithm for the multiple sequence alignment problem. The approach considers two layers, each of which corresponds to part of the dynamic programming matrix for the alignment of the given sequences, and computes each layer differently using dynamic programming. In this way the proposed approach is claimed to be much more efficient than the traditional approach that uses weighted automata, and to perform much better than other approaches.
Davidson [12] describes an approach that integrates dynamic programming with a heuristic at a minimal amount of additional overhead. The idea is that the dynamic programming matrix is traversed along anti-diagonals and the computation is bounded to exclude partitions of the matrix that cannot contain optimal paths, so the heuristic prunes unnecessary paths from the matrix and presents an optimal solution to the problem [12]. A second benefit of this approach is efficient use of memory through a divide-and-conquer technique, at the cost of some additional computation. The weaknesses of this approach are that implementing it for a matrix of arbitrary dimension will be much more difficult than for the two-dimensional case, and that more dissimilar sequences can give bad results.
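
The general idea of excluding parts of the dynamic programming matrix can be illustrated with a much simpler, swapped-in technique: a fixed diagonal band. The Python sketch below is not the method of [12], which bounds the computation adaptively along anti-diagonals; it only skips every cell that lies more than `band` positions away from the main diagonal. The scoring values and the function name are assumptions for illustration.

```python
def banded_global_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score restricted to cells with |i - j| <= band."""
    if abs(len(a) - len(b)) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")  # marks cells outside the band
    prev = [j * gap if j <= band else neg_inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [neg_inf] * (len(b) + 1)
        if i <= band:
            curr[0] = i * gap
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):  # only cells inside the band
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]
```

With a wide enough band the result equals the unrestricted global score. With a narrow band the work per row drops from the full row length to at most 2 * band + 1 cells, but an alignment that needs to stray outside the band is missed, which mirrors the remark that dissimilar sequences can give bad results.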
Rashid [13] presents a fast dynamic-programming-based sequence alignment algorithm that uses a reduced amino acid alphabet to transform protein sequences into sequences of integers and uses n-grams to reduce the length of each sequence; a traditional approach is then used to obtain the similarity measure between two sequences [13]. The results of the proposed approach appear quite satisfactory compared to traditional approaches. Another benefit is that it requires less space than traditional approaches, as it shortens the sequences each time, but it also involves computational overhead.
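
The two preprocessing steps can be sketched in Python as follows. The grouping of the 20 amino acids into five classes is a hypothetical example and not the reduced alphabet used in [13]; collapsing non-overlapping n-grams into single integers is likewise an assumption about how the length reduction might work.

```python
# Hypothetical reduced alphabet: five classes covering the 20 amino acids.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DEGP"]
CODE = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def reduce_sequence(protein):
    """Map each residue to the integer code of its class."""
    return [CODE[aa] for aa in protein.upper()]


def ngram_encode(codes, n=2, base=len(GROUPS)):
    """Collapse each non-overlapping run of n codes into one integer.

    Each run is read as a base-`base` number. A trailing run shorter
    than n is encoded the same way, so it may collide with a full run.
    """
    encoded = []
    for start in range(0, len(codes), n):
        value = 0
        for code in codes[start:start + n]:
            value = value * base + code
        encoded.append(value)
    return encoded
```

With n = 2 the encoded sequence is about half as long as the original, so a subsequent dynamic programming comparison fills a table with roughly a quarter of the cells.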
Agrawal [14] claims better performance from a modification of the iterative approach that incorporates multiple parameter sets. Preliminary experiments indicate that using multiple parameter sets gives significantly better performance than using a single parameter set or a simple match/mismatch scoring scheme. The authors generate a family of matrices at various distances, and multiple matrices for different conservation rates are used to obtain an optimal alignment. The only weakness of this approach is that using too many parameters may degrade performance.

RESULTS AND DISCUSSION
The following techniques serve as building blocks for comparative genome sequence analysis. The methods are discussed below and their comparative analysis is presented in Tables 3 and 4:
• Dynamic programming method as an extension
• Progressive methods
• Iterative methods

Dynamic programming methods as an extension:
The dynamic programming method [29,30] used for global alignment of a pair of sequences can be extended to multiple sequence alignment. The limitation of this method is that it cannot efficiently align many sequences: as the number of sequences grows, its performance degrades considerably.
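
The degradation can be made concrete by counting the work of a full dynamic programming table. For k sequences the table has one dimension per sequence, and each cell depends on 2^k - 1 neighbouring cells. The short Python sketch below only counts cells and neighbours; the function names are illustrative.

```python
def dp_table_cells(lengths):
    """Number of cells in a full DP table for sequences of these lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra row/column for the empty prefix
    return cells


def neighbours_per_cell(k):
    """Predecessor cells examined per cell when aligning k sequences."""
    return 2 ** k - 1
```

For sequences of length 100, two sequences need 10,201 cells while three already need 1,030,301, so each added sequence multiplies the table size by about the sequence length.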
Progressive methods: Progressive methods [28] use the dynamic programming method to build the MSA (Multiple Sequence Alignment), starting with the most closely related sequences and then progressively adding less related sequences to the initial alignment. The drawback of progressive methods is that they depend on the initial pair-wise sequence alignment. The very first sequences must be very closely related: if they are, there will be few errors, but if they are not, there will be more errors.
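
The ordering step of a progressive method can be sketched in Python. This is a simplified stand-in, not the procedure of any specific tool: relatedness is estimated with a Jaccard similarity over k-mers instead of a pair-wise alignment score, and sequences are added greedily rather than by following a guide tree. All names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """All substrings of length k that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets; a cheap proxy for relatedness."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def progressive_order(seqs, k=3):
    """Order in which sequences would be added: closest pair first."""
    names = list(seqs)
    if len(names) < 2:
        return names
    sim = {
        frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    first = max(combinations(names, 2), key=lambda pair: sim[frozenset(pair)])
    order = list(first)
    remaining = [name for name in names if name not in order]
    while remaining:
        # Add the sequence most similar to anything already placed.
        nxt = max(
            remaining,
            key=lambda name: max(sim[frozenset((name, m))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Because the first pair fixes the starting alignment, a poor choice at this step propagates to every later addition, which is the dependence on the initial pair-wise alignment described above.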
Iterative methods of MSA: Iterative methods [29] attempt to correct the problem raised by progressive methods by repeatedly realigning subgroups of sequences and then aligning these subgroups into a global alignment [29,30]. The programs MultiAlin and DIALIGN align multiple sequences using these methods [30].

Area | Approach | Strengths | Limitations | References
[...] analysis | [...] human and mouse genomes | Presented genome maps that can be quite useful for comprehension of complex maps with limited constraints, discovered three unknown genes | Requires more labor in critical evaluation, focus should be made on smaller regions in map | [2,25,36]
Comparative analysis | Web based tool for comparative genome sequence analysis | Interactive tool for ease of user, introduced self adjusting maps for visualization | User may not be interested in maps and wish to get some approximate final results, web based tool may not reflect time complexity accurately | [3,4,23,27]
Sequence analysis | Genome sequence analysis tool for visualization | Depicts efficient gene identification and functional annotations, performs whole genome comparison | Needs more illustration of idea rather than an example to reveal the hidden truths | [3,20,34]
Data [...] | Proposed an approach for building [...] | Incorporates both genomic and [...] | Efforts and labor involved in [...] | [...]

Table 4 (continued):
Area | Approach | Strengths | Limitations | References
Sequence alignment | Solution of sequence alignment problem | Requires less time and space to solve the sequence analysis problem, is also suitable for other local and global sequence optimization problems, can handle both small and large sequences | Internal calculations are complex and computational work is involved | [13,17,31,36]
Sequence alignment | Multi sequence alignment using fuzzy logic | Enhances the performance of genetic algorithm, the probability of three operations of genetic algorithm are quite fast and accurate and align more sequences more efficiently | Local search may degrade the system performance, scoring matrix and space scores concept are traditional | [18,29]
Sequence alignment | Sequence matching using fuzzy logic | The assembler designed can work with low quality data, the performance measures of assembler were found more accurate than other assemblers | Relies on enhanced fuzzy logic technique and fuzzy approximate methodology | [17,18,32]
Sequence alignment | Multi sequence alignment using recursive technique | Can operate on set of sequences with local, global and multi alignment, recursive in nature, certain degree of performance can be evaluated at all levels | Dividing the system into smaller blocks can bring computational overhead, the local alignment phenomenon should not be addressed with multiple one | [19,31]
Sequence alignment | Sequence comparison using Matlab histogram comparison | Novel idea for sequence comparison, Matlab brings accurate results | May require more space for pixel calculation and image comparisons | [20,32]
Sequence alignment | Duplicate sequence detection | Genetic databases may contain redundant sequence information, the algorithm can overcome redundant sequence structure | Requires more time in sequence pattern matching due to huge size of genetic data | [21]

CONCLUSION
Bioinformatics is a rapidly emerging field of research, and genome sequence analysis is an interesting and challenging task that needs great attention and focus. The analysis reveals promising relationships between species. We are now able to find genetic similarities and differences between apparently different and diverse creatures. Microarray technology, phylogenetic tree construction and many other alignment and analysis tools have helped biologists greatly.

Future expectations:
Genome sequence analysis will help biologists to devise genetic therapies and solutions for genetic disorders. It will also open ways to explore genetic diversity in species; a very challenging goal will be to uncover the wealth of biological information hidden in genetic data. A good generalization of these concepts will help in areas of molecular medicine, which could provide more generic and refined medicines for curing diseases. This is a genomic revolution, and the next decade will reveal the real work and achievements of biologists.