DNA Sequence Optimization Based on Continuous Particle Swarm Optimization for Reliable DNA Computing and DNA Nanotechnology

Problem statement: In DNA-based computation and DNA nanotechnology, the design of good DNA sequences has turned out to be an essential problem and one of the most practical and important research topics. The DNA sequence design problem is fundamentally a multi-objective problem that can be evaluated using four objective functions, namely H measure, similarity, continuity and hairpin. Approach: There are several ways to solve a multi-objective problem; however, in order to evaluate the correctness of the PSO algorithm in DNA sequence design, the problem is converted into a single-objective problem. Particle Swarm Optimization (PSO) is proposed to minimize the objective, subject to two constraints: melting temperature and GC content. A model is developed to represent the DNA sequence design problem in terms of PSO computation. Results: The optimization process is implemented with 20 particles, and the average values and standard deviations over 100 runs are reported along with a comparison to other existing methods. Conclusion: The results verify that PSO can suitably solve the DNA sequence design problem using the proposed method and model, comparatively better than other approaches.


INTRODUCTION
A nucleic acid is a macromolecule composed of chains of monomeric nucleotides. In biochemistry, these molecules carry genetic information or form structures within cells. The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA, in particular, is universal in living things, as it is found in all cells and in many viruses. DNA is a polymer strung together from a series of monomers called nucleotides, the building blocks of nucleic acids. Each nucleotide contains a sugar (deoxyribose), a phosphate group and one of four bases: Adenine (A), Thymine (T), Guanine (G), or Cytosine (C). A single-stranded DNA consists of a series of nucleotides. Two single strands are held together by hydrogen bonds between pairs of bases, following Watson-Crick complementarity, to form a duplex, or double-stranded DNA. By convention, a DNA sequence is read from its 5'-end (the phosphate end) to its 3'-end (the hydroxyl end), and the two strands of a duplex run in opposite directions.
DNA has certain unique properties, such as self-assembly and self-complementarity, which make it able to store an enormous amount of data and perform massively parallel reactions. With a view to exploiting these attractive features for computation, the field of DNA computation research was initiated [1]. Usually, in DNA computing, the calculation process consists of several chemical reactions, and the success of the wet lab experiment depends on the DNA sequences used. Thus, DNA sequence design is one of the approaches to achieving high computation accuracy and has become one of the most practical and important research topics in DNA computing.
The necessity of DNA sequence design appears not only in DNA computation, but also in other biotechnology fields, such as the design of DNA chips for mutational analysis and for sequencing [2]. For these approaches, sequences are designed such that each element uniquely hybridizes to its complementary sequence, but not to any other sequence. Due to the differences in experimental requirements, however, it seems impossible to establish an all-purpose library of sequences that effectively caters for the requirements of all laboratory experiments [3]. Since the design of DNA sequences depends on the protocol of the biological experiment, a method for the systematic design of DNA sequences is highly desirable [4].
The ability of a DNA computer to perform calculations using specific biochemical reactions between different DNA strands, through Watson-Crick complementary base pairing, affords a number of useful properties such as massive parallelism and a huge memory capacity [5]. However, due to the technological difficulty of biochemical experiments, the in vitro reactions may result in incorrect or undesirable computation. Sometimes, DNA computers fail to generate identical results for the same problem and algorithm. Furthermore, some DNA strands or sequences could be wasted because of undesirable reactions. To overcome these drawbacks, much work has focused on improving the reliability (correctness) and efficiency (economy) of DNA computing [6].
In this chapter, DNA sequences are designed based on Particle Swarm Optimization (PSO) [7]. Even though DNA sequence design is a multi-objective problem, it is converted into a single-objective problem using the weighted sum method. The weighted sum method scalarizes a set of objectives into a single objective by premultiplying each objective with a user-supplied weight. This method is the simplest and most widely used classical approach. However, the values of the weights are difficult to determine: they depend on the importance of each objective in the context of the problem and on a scaling factor [8].
DNA sequence design: In DNA computing, perfect hybridization between a sequence and its base-pairing complement is important to retrieve the information stored in the sequences and to operate the computation processes. For this reason, a set of good DNA sequences, each of which forms a stable duplex with its complement, is required. It is also important to ensure that two sequences are not complements of one another.
Hartemink et al. [12] designed sequences for programmed mutagenesis using the exhaustive search method "SCAN". Although the method was successful, it required considerable computational time. Penchovsky and Ackermann [13] implemented a random search algorithm to design DNA sequences. Binary information was encoded in DNA strands and a twelve-bit DNA library was demonstrated.
Furthermore, Tanaka et al. [14] proposed some sequence fitness criteria and generated the sequences using simulated annealing [15] . The objective is to find proper combinations of the proposed fitness functions in order to find more promising solutions. Marathe et al. [16] chose a dynamic programming approach to design DNA sequences based on Hamming distance. A dynamic programming based algorithm for the selection of sequences with a given free energy was also presented.
Feldkamp et al. [17] used a directed graph to design DNA sequences. The nodes in the graph represent base strands, where each node can be extended into 4 strands that can appear as successors in a longer sequence. Then, by travelling the graph from root to leaf, DNA sequences are formed. This approach can also find a set of orthogonal DNA sequences within a predefined error rate quickly.
Frutos et al. [18] and Arita and Kobayashi [5] developed a template-map strategy to choose a huge number of dissimilar sequences while having to design only a significantly smaller number of templates and maps. Deaton et al. [19,20] used genetic algorithms to generate a set of unique DNA sequences using Hamming distance for measuring the uniqueness of DNA sequences and found better sequences than Adleman's original sequences [3] .
Arita et al. [1] developed a DNA sequence design system using a genetic algorithm with three fitness criteria. Self-complementary sequences were designed for the Whiplash model and the results were compared to those of a random generate-and-test algorithm. Shin et al. [21] developed NACST/Seq, which implements multi-objective evolutionary optimization to generate sets of DNA sequences. NACST/Seq generated DNA sequences that satisfied all of its criteria, namely H measure, similarity, hairpin and continuity.
Guangzhou et al. [22] designed DNA sequences using PSO based on four fitness criteria. Twenty 20-mer DNA strands were connected one by one in the same direction to form a single 400-mer DNA strand. This strand is treated as a particle, and 10 particles form a swarm. However, the employed model cannot exploit a high-dimensional search space, and two of the fitness criteria, GC content and melting temperature, are not suitable as objective functions. Zhao et al. [23] implemented a multi-objective PSO to design DNA sequences based on three fitness criteria. However, the study did not describe how DNA sequence design was modeled in PSO, and several parameters of the algorithm were not explained. Pareto front solutions were not employed.
Compared to previous works [22,23], this chapter proposes a different model of the DNA sequence design problem. The length and number of sequences can be chosen by the user, and each particle in this model carries a set of DNA sequences. The dimensions in the PSO computation represent strands of DNA sequences, and a continuous search space is used.

Objectives and constraints in DNA sequence design:
The objective of the DNA sequence design problem is basically to obtain a set of DNA sequences where each sequence is unique, that is, cannot hybridize with other sequences in the set. In this work, two objective functions, namely H measure and similarity, are chosen to estimate the uniqueness of each DNA sequence. Two additional objective functions, hairpin and continuity, are used to prevent secondary structure in a DNA sequence. GC content and melting temperature are used as constraints, with ranges set by user preference. The formulations for all objectives and constraints can be found in [24].
DNA sequence design is actually a multi-objective optimization problem. However, in this chapter, the problem is converted into a single-objective problem, formulated as follows:

$$\min f_{DNA} = \sum_i w_i f_i \quad (1)$$

subject to the $T_m$ and GC content constraints, where $f_i$ is the objective function for each $i \in \{$H measure, similarity, hairpin, continuity$\}$ and $w_i$ is the weight for each $f_i$. In this study, the weights are defined by the user.
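The weighted-sum scalarization described above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_sum` and the objective values are hypothetical placeholders, since the actual objective formulations are given in [24] and are not reproduced here.

```python
from typing import Mapping


def weighted_sum(objectives: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Scalarize several objective values into one: the sum of w_i * f_i."""
    return sum(weights[name] * value for name, value in objectives.items())


# Hypothetical objective values for one candidate set of sequences
# (illustrative numbers only, not results from the paper).
example_objectives = {
    "h_measure": 60.0,
    "similarity": 55.0,
    "hairpin": 3.0,
    "continuity": 0.0,
}

# All weights set to 1, matching the default used in the experiments.
unit_weights = {name: 1.0 for name in example_objectives}

f_dna = weighted_sum(example_objectives, unit_weights)
print(f_dna)
```

With unit weights the scalarized value is simply the sum of the four objectives; changing a weight shifts how strongly that objective steers the search.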

Particle swarm optimization: Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart in 1995 [7]. This method finds an optimal solution by simulating the social behavior of bird flocking. The PSO algorithm consists of a group of individuals named "particles". Each particle is a potential solution to an n-dimensional problem. The group can reach a solution effectively by using both the common information of the group and the information owned by each particle itself. The particles change their state by "flying" around in an n-dimensional search space according to their updated velocities until a relatively unchanging state has been reached, or until computational limits are exceeded.
PSO has been successfully applied to solve many optimization problems, such as power system design [25] , data classification [26] , robotic applications [27] , decision making for stock market [28] and simulation and identification of emergent systems [29] .
With reference to the original PSO, each particle knows its best value so far (pbest), its velocity and its position. Additionally, each particle knows the best value in its neighborhood (gbest). A particle modifies its position based on its current velocity and position. The velocity of each particle is calculated using:

$$v_i^{k+1} = \omega v_i^k + c_1 r_1 (pbest_i - s_i^k) + c_2 r_2 (gbest^k - s_i^k)$$

Where:
$v_i^k$ = The velocity of particle i at iteration k
$s_i^k$ = The position of particle i at iteration k
$pbest_i$ = The best position found by particle i
$gbest^k$ = The best position found by the particle's neighborhood or the entire swarm
$c_1$ and $c_2$ = The cognitive and social coefficients, respectively, used to bias the search of a particle toward its own best experience (pbest) and the best experience of the whole swarm (gbest)
$\omega$ = The inertia weight, which is employed to control the impact of the previous history of velocities on the current velocity of each particle

The parameter $\omega$ regulates the trade-off between the exploration and exploitation abilities of the swarm. Large values of $\omega$ facilitate exploration and searching new areas, whereas small values of $\omega$ steer the particles toward a more refined search. The velocity equation includes two random parameters, $r_1$ and $r_2$, to ensure good exploration of the search space and to avoid entrapment in local optima.
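As a rough illustration of the velocity update, the sketch below applies the usual form, inertia term plus cognitive term plus social term, to one particle. The function name `update_velocity` is hypothetical, and drawing fresh random numbers r1 and r2 in [0, 1) for every dimension is an assumption; the text does not say whether they are drawn per dimension or per particle.

```python
import random
from typing import List, Optional


def update_velocity(
    v: List[float],
    s: List[float],
    pbest: List[float],
    gbest: List[float],
    w: float,
    c1: float,
    c2: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the new velocity: w*v + c1*r1*(pbest - s) + c2*r2*(gbest - s)."""
    rng = rng or random.Random()
    new_v = []
    for d in range(len(v)):
        # Assumption: r1 and r2 are redrawn for each dimension.
        r1, r2 = rng.random(), rng.random()
        cognitive = c1 * r1 * (pbest[d] - s[d])  # pull toward the particle's own best
        social = c2 * r2 * (gbest[d] - s[d])     # pull toward the swarm's best
        new_v.append(w * v[d] + cognitive + social)
    return new_v


# Example: one particle with three dimensions and illustrative coefficients.
velocity = update_velocity(
    v=[0.0, 0.0, 0.0],
    s=[100.0, 200.0, 300.0],
    pbest=[110.0, 190.0, 300.0],
    gbest=[120.0, 180.0, 300.0],
    w=0.9, c1=2.0, c2=2.0,
    rng=random.Random(0),
)
print(velocity)
```

When a particle already sits on both pbest and gbest, only the inertia term remains, which is why the inertia weight controls how long a particle keeps drifting past a known good position.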
The modified position vector, $s_i^{k+1}$, is obtained using:

$$s_i^{k+1} = s_i^k + v_i^{k+1}$$

The standard PSO algorithm applies the velocity and position updates to every particle at each iteration, re-evaluating the objective and updating pbest and gbest, until a stopping criterion is met.

Optimization of DNA sequence based on PSO: For the DNA sequence design application, the proposed approach is based on the basic PSO algorithm. A DNA sequence is represented in binary, where A, C, G and T are encoded as $00_2$, $01_2$, $10_2$ and $11_2$, respectively. A sequence consists of several bases, and one sequence represents one dimension. For each dimension, the range of the search space depends on the length of the DNA sequence (l) and runs from 0 to $4^l - 1$. For a 5-mer nucleic acid sequence, the range of the search space is from 0 to $4^5 - 1$ = 1023, in decimal, which represents the sequences AAAAA to TTTTT. For a 10-mer DNA sequence, the range is from 0 to $4^{10} - 1$ = 1048575, and for a 20-mer DNA sequence, the range is from 0 to $4^{20} - 1$ = 1099511627775.
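The encoding above amounts to reading a sequence as a base-4 number with A=0, C=1, G=2 and T=3. The following sketch, with hypothetical helper names `index_to_sequence` and `sequence_to_index`, converts between an integer position and a sequence of a given length, and shows that the largest index for length l is 4^l − 1.

```python
BASES = "ACGT"  # A=00, C=01, G=10, T=11 (two bits per base)


def index_to_sequence(index: int, length: int) -> str:
    """Decode an integer in [0, 4**length - 1] into a DNA sequence."""
    if not 0 <= index <= 4 ** length - 1:
        raise ValueError("index outside the search space for this length")
    bases = []
    for _ in range(length):
        index, digit = divmod(index, 4)  # peel off the last base-4 digit
        bases.append(BASES[digit])
    return "".join(reversed(bases))


def sequence_to_index(sequence: str) -> int:
    """Encode a DNA sequence as its base-4 integer value."""
    index = 0
    for base in sequence:
        index = index * 4 + BASES.index(base)
    return index


print(index_to_sequence(0, 5), index_to_sequence(4 ** 5 - 1, 5))
print(sequence_to_index("TGATC"))
print(4 ** 10 - 1, 4 ** 20 - 1)
```

Because each base occupies two bits, a 20-mer fits comfortably in a 64-bit integer, so each dimension of a particle can be stored as a single number.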
To obtain a set of 3 DNA sequences, for example, 3 dimensions should be used in a search space. Therefore, each particle in the search space carries 3 DNA sequences. In this study, 20 particles are employed and randomly initialized in the search space.
The basic PSO was developed for continuous-valued search spaces. In order to eliminate the fractional parts that arise in the computation from the random values and coefficient factors, the floating-point values are rounded to the nearest integers. The integers are converted to binary, and the binary representations are converted into sequences. For example, the decimal number 908.8 is rounded to $909_{10}$, which equals $1110001101_2$ in binary and is converted into the DNA sequence "TGATC". The values of the constraints are 30%-80% for GC content and 50 °C-80 °C for $T_m$. The $T_m$ was computed based on the Nearest-Neighbor (NN) method [30]. Table 1 shows the values of the PSO control parameters used in the experiments. In this study, a decreasing inertia weight is used.
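The rounding step and the GC content constraint can be sketched as follows. The helper names are hypothetical, and clamping out-of-range positions to the search-space bounds is an assumption, since the text does not say how such positions are handled. The $T_m$ constraint is omitted because the nearest-neighbor parameters of [30] are not reproduced here.

```python
BASES = "ACGT"  # A=00, C=01, G=10, T=11


def position_to_sequence(x: float, length: int) -> str:
    """Round a continuous position to the nearest integer and decode it."""
    upper = 4 ** length - 1
    # Assumption: positions outside [0, 4**length - 1] are clamped to the bounds.
    index = min(max(int(round(x)), 0), upper)
    bases = []
    for _ in range(length):
        index, digit = divmod(index, 4)
        bases.append(BASES[digit])
    return "".join(reversed(bases))


def gc_content(sequence: str) -> float:
    """Percentage of bases in the sequence that are G or C."""
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


def satisfies_gc(sequence: str, low: float = 30.0, high: float = 80.0) -> bool:
    """Check the GC content constraint (30%-80% in the experiments)."""
    return low <= gc_content(sequence) <= high


seq = position_to_sequence(908.8, 5)
print(seq, gc_content(seq), satisfies_gc(seq))
```

For the worked example in the text, 908.8 rounds to 909 and decodes to "TGATC", which contains one G and one C out of five bases and therefore falls inside the allowed GC range.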

RESULTS
The results of the proposed approach are compared with existing approaches, taken from Deaton et al. [19], Guangzhou et al. [22] and Zhao et al. [23]. For each comparison, 100 runs were performed by PSO, and the average performance is reported in terms of the mean value and the standard deviation of the objective function evaluations. Results for all of the aforementioned comparisons are reported in Tables 3, 4 and 5. Table 2 summarizes the parameter values for the objectives and constraints of the DNA sequence design problem. Since there are several ways to determine the weights in Eq. 1 and the weights depend on user preference, in this experiment the weights are all set to the default value of 1.
The PSO method is first compared with results given in [19] , which were obtained using a genetic algorithm. The method produced 7 good sequences with the length of 20-mer. Results of the two algorithms are compared in Table 3 and Fig. 1. PSO reached lower values in the total objectives, compared to the GA. The sequences generated by PSO surpassed the sequences from the GA in three objectives.
Sequences designed by PSO show lower values of H measure, continuity and similarity, while sequences from Deaton et al. [19,20] are better than those of PSO in the hairpin objective. Fig. 2 shows that the fitness function $f_{DNA}$ converges after 390 iterations.
The PSO method is then compared with the results given in Guangzhou et al. [22], which were also obtained using PSO. The method from [22] produced 20 good sequences with the length of 20-mer. Results of the two algorithms are compared in Table 4 and Fig. 3. The total values over all the objectives for the proposed approach were not satisfactory, as PSO [22] obtained better values. However, the sequences generated by the proposed PSO surpassed the sequences from PSO [22] in two objectives. Sequences designed by the proposed PSO show lower values of hairpin and similarity, while sequences from PSO [22] are better in the H measure and continuity objectives. Fig. 4 shows that the fitness function $f_{DNA}$ converges after 307 iterations.

Table 5 and Fig. 5 compare the results of PSO with the Multi-Objective PSO (MOPSO) given in [23]. The sequences generated by MOPSO also comprise 7 DNA sequences of 20-mer length, similar to the sequences generated by SA. PSO significantly outperformed MOPSO for two objectives, namely H measure and continuity. However, the sequences obtained from MOPSO showed lower values in similarity and hairpin. For the total over all objectives, PSO achieved a better minimum value than MOPSO. The convergence pattern of PSO is illustrated in Fig. 6. The particles converge within 800-820 iterations.

Table 3: Comparison of the sequences in [19] and the sequences generated by PSO
Fig. 1: Average fitness comparison results between [19] and the proposed approach, with 7 sequences and length of 20 mer
Table 4: Comparison of the sequences in [22] and the sequences generated by PSO
Fig. 3: Average fitness comparison results between [22] and the proposed approach, with 20 sequences and length of 20 mer
Fig. 5: Average fitness comparison results between Zhao et al. [23] and the PSO method, with 7 sequences and length of 20 mer

CONCLUSION
This study presented an application of particle swarm optimization to DNA sequence design. PSO was implemented with four objectives, namely H measure, similarity, continuity and hairpin, subject to two constraints, GC content and $T_m$. The problem was converted into a single-objective problem using weight aggregation.
A model to implement PSO for DNA sequence design was presented, where each particle searches for the minimum value of the objective function in an n-dimensional search space. Each particle carries n sequences, where the sequences are represented by binary strings.
The results of the PSO were compared to results from a GA, MOPSO and another PSO model. It was shown that PSO can generate better or comparable sequences in several objectives relative to the other systems. However, the proposed approach has to be improved and explored further. Future research will include improvements of the method by considering a multi-objective PSO such as the Vector Evaluated PSO (VEPSO).