DNA Code Word Design for DNA Computing with Real-time Polymerase Chain Reaction

Problem statement: A number of DNA computing models for solving mathematical graph problems, such as the Hamiltonian Path Problem (HPP), the Traveling Salesman Problem (TSP), and the Shortest Path Problem (SPP), have been proposed and demonstrated. Normally, the DNA sequences used for the computation should be carefully designed in order to reduce the errors that could occur during a computation. We have previously proposed a readout method for DNA computing, tailored specifically to the HPP, that uses real-time Polymerase Chain Reaction (PCR). The DNA sequences were designed according to a procedure in which DNASequenceGenerator was employed to generate the sequences required for the experiment. The drawback of that approach is that a pool of DNA sequences needs to be generated by DNASequenceGenerator before the selection is done manually, based on several design constraints. Hence, an automatic and systematic approach is needed to generate the DNA sequences based on the design constraints. Approach: In this study, a generate-and-test approach was proposed for the same problem, subject to several design constraints. The generate-and-test algorithm consists of two main levels. The first level considers the basic constraints of DNA sequence design, which are melting temperature, GC-percentage, similarity, continuity, hairpin, and H-measure. The second level adds specific constraints formulated from five rules that were used in the previous study. A generated sequence is chosen only if it satisfies all the basic and specific constraints. Results: Sequences designed by the generate-and-test approach have a higher H-measure value than sequences generated by DNASequenceGenerator. However, the generated sequences show lower values for similarity and for the additional constraints than sequences designed by DNASequenceGenerator. Conclusion: The generated DNA sequences were better than the sequences obtained from DNASequenceGenerator.


INTRODUCTION
Since the discovery of PCR [1], numerous applications have been explored, primarily in the life sciences and medicine, and importantly, in DNA computing as well. The subsequent innovation of real-time PCR has rapidly gained popularity and plays a crucial role in molecular medicine and clinical diagnostics [2]. All real-time amplification instruments require a fluorescence reporter molecule for detection and quantitation, whose signal increase is proportional to the amount of amplified product. Although a number of reporter molecules currently exist, the mechanism of the TaqMan hydrolysis probe has been found to be very suitable for the design and development of a readout method for DNA computing.
A TaqMan DNA probe is a modified, non-extendable, dual-labeled oligonucleotide. The 5' and 3' ends of the oligonucleotide are terminated with an attached reporter, such as FAM, and a quencher fluorophore dye, such as TAMRA, respectively, as shown in Fig. 1 [3]. Upon laser excitation at 488 nm, the FAM fluorophore in isolation emits fluorescence at 518 nm. Given the proximity of the TAMRA quencher, however, based on the principle of Fluorescence Resonance Energy Transfer (FRET), the excitation energy is not emitted by the FAM fluorophore, but rather is transferred along the sugar-phosphate backbone to TAMRA. As TAMRA emits this absorbed energy at a significantly longer wavelength (580 nm), the resulting fluorescence is not observable in Channel 1 of real-time PCR instruments [4].

Fig. 1: Illustration of the structure of a TaqMan DNA probe. Here, R and Q denote the reporter and quencher fluorophores, respectively.

The combination of dual-labeled TaqMan DNA probes with forward and reverse primers is required for a successful real-time PCR. As PCR is a repeated cycle of three steps (denaturation, annealing, and polymerization), a TaqMan DNA probe will anneal to a site within the DNA template between the forward and reverse primers during the annealing step, if a subsequence of the DNA template is complementary to the sequence of the DNA probe. During polymerization, Thermus aquaticus (Taq) DNA polymerase will extend the primers in the 5'-3' direction. At the same time, the Taq polymerase also acts as a "scissor" that degrades the probe via cleavage, thus separating the reporter from the quencher, as shown in Fig. 2 [5], where R and Q denote the reporter and quencher dyes, respectively. This separation subsequently allows the reporter to emit its fluorescence [6]. This process occurs in every PCR cycle and does not interfere with the exponential accumulation of PCR product. As a result of PCR, the amount of DNA template increases exponentially, which is accompanied by a proportionate increase in the overall fluorescence intensity emitted by the reporter group of the excised TaqMan probes. Hence, the intensity of the fluorescence measured at the end of each PCR polymerization step is correlated with the total amount of PCR product, which can then be detected and visualized using a real-time PCR instrument.
The DNA sequence design problem arises in molecular self-assembly applications and DNA computing [7,8], as well as in general-purpose laboratory applications such as probe selection for DNA microarrays and primer design for PCR [9,10]. A procedure for DNA sequence design for DNA computing with output visualization based on real-time PCR has been proposed in [11]. The procedure utilized DNASequenceGenerator [12], a graph-based approach for designing a set of good DNA sequences. In this graph-based approach, each node in the graph represents a base strand, and its child nodes are the four strands that can appear as its successors in a longer sequence. DNA sequences are then designed by traversing the graph from root to leaf. This approach can also quickly find a set of orthogonal DNA sequences within a predefined error rate. Some modified rules were used for designing probes and primers, based on DNASequenceGenerator, for real-time PCR implementations [11]. As those modified rules are not available in DNASequenceGenerator, a major drawback of this procedure is the manual filtering required to obtain the desired DNA sequences. All the calculations have to be carried out manually, which is a time-consuming process.
In this study, we propose a different procedure, based on a generate-and-test approach, to obtain all the DNA sequences according to user requirements. By adding specific constraints for the real-time PCR implementation, a good set of DNA sequences can be generated automatically. Generate-and-test approaches have been used before to design DNA sequences. For example, Pechovsky and Ackermann designed DNA sequences with a random search algorithm [13]. They encoded binary information in DNA strands and demonstrated a twelve-bit DNA library. Tanaka et al. [14] developed an encoding design method based on DNA free energy and used random generate-and-test to filter out the good DNA sequences.

Output Visualization of DNA Computation based on Real-Time PCR: A readout method tailored specifically to HPP in DNA computing based on real-time PCR has been proposed, which employs a hybrid in vitro-in silico approach [11,15]. In the in vitro phase, $O(|V|^2)$ TaqMan-based real-time PCR reactions are performed in parallel to investigate the ordering of pairs of nodes in the Hamiltonian path of a $|V|$-node instance graph, in terms of relative distance from the DNA sequence encoding the known start node. The resulting relative orderings are then processed in silico, which efficiently returns the complete Hamiltonian path. Previously, graduated PCR, which was originally demonstrated by Adleman [16], was employed to perform such operations.

Notation and Basic Principle: The notation $v_1(a)v_2(b)v_3(c)v_4(d)$ denotes a double-stranded DNA (dsDNA), which contains the base-pair subsequences $v_1$, $v_2$, $v_3$, and $v_4$, respectively. Here, the values in parentheses $(a, b, c, d)$ indicate the length of each respective base-pair subsequence. For instance, $v_1(20)$ indicates that the length of the double-stranded subsequence $v_1$ is 20 base-pairs (bp). When convenient, a dsDNA may also be represented without indicating segment lengths (e.g., $v_1v_2v_3v_4$). A reaction denoted by TaqMan($v_0$,$v_k$,$v_l$) indicates that real-time PCR is performed using forward primer $v_0$, reverse primer $v_l$, and TaqMan probe $v_k$. Based on the proposed approach, there are two possible reaction conditions regarding the relative locations of the TaqMan probe and the reverse primer. The first condition occurs when the TaqMan probe specifically hybridizes to the template between the forward and reverse primers, while the second occurs when the reverse primer hybridizes between the forward primer and the TaqMan probe. As shown in Fig. 3, these two conditions result in different amplification patterns during real-time PCR, given the same DNA template (i.e., assuming that they occur separately, in two different PCR reactions). The higher fluorescent output of the first condition is a typical amplification plot for real-time PCR. In contrast, the relatively lower fluorescent output of the second condition reflects linear rather than exponential amplification of the template: because of the 'unfavourable' hybridization position of the reverse primer, fewer TaqMan probes are cleaved by the DNA polymerase. Thus, TaqMan($v_0$,$v_k$,$v_l$) = YES if an amplification plot similar to the first condition is observed, while TaqMan($v_0$,$v_k$,$v_l$) = NO if an amplification plot similar to the second condition is observed.
The Readout Approach: Let the output of an in vitro computation of an HPP instance of the input graph be represented by a 120-bp dsDNA $v_0(20)v_2(20)v_4(20)v_1(20)v_3(20)v_5(20)$, where the Hamiltonian path $V_0 \to V_2 \to V_4 \to V_1 \to V_3 \to V_5$ begins at node $V_0$, ends at node $V_5$, and contains intermediate nodes $V_2$, $V_4$, $V_1$, and $V_3$, respectively. Note that in practice, only the identities of the starting and ending nodes and the presence of all intermediate nodes are known in advance to characterize a solving path. The specific order of the intermediate nodes within such a path is unknown.
The first part of the approach, which is performed in vitro, consists of $[(|V|-2)^2-(|V|-2)]/2$ real-time PCR reactions, each denoted by TaqMan($v_0$,$v_k$,$v_l$) for all $k$ and $l$ such that $0<k<|V|-2$, $1<l<|V|-1$, and $k<l$. For this example instance, in which the DNA template is the dsDNA $v_0v_2v_4v_1v_3v_5$, these 6 reactions, along with their outputs in terms of "YES" or "NO", are as follows:
TaqMan($v_0$,$v_1$,$v_2$) = NO (1)
TaqMan($v_0$,$v_1$,$v_3$) = YES (2)
TaqMan($v_0$,$v_1$,$v_4$) = NO (3)
TaqMan($v_0$,$v_2$,$v_3$) = YES (4)
TaqMan($v_0$,$v_2$,$v_4$) = YES (5)
TaqMan($v_0$,$v_3$,$v_4$) = NO (6)
Note that the overall process consists of a set of parallel real-time PCR reactions, and thus requires $O(1)$ laboratory steps for in vitro amplification. The accompanying SPACE complexity, in terms of the required number of capillary tubes, is $O(|V|^2)$. Clearly, only one forward primer is required for all real-time PCR reactions, while the numbers of reverse primers and TaqMan probes required with respect to the size of the input graph are each $|V|-3$.
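The reaction set for the example can be enumerated with a short script. This is only an illustrative sketch: it assumes the idealized rule stated above (YES exactly when the probe $v_k$ lies between the forward primer $v_0$ and the reverse primer $v_l$ on the template), and the function names are hypothetical, not taken from the original work.

```python
from itertools import combinations


def expected_taqman(template_order, k, l):
    """Idealized output of TaqMan(v0, vk, vl) for a known template.

    Assumption: the result is YES exactly when probe vk sits upstream of
    reverse primer vl, i.e. between the forward primer v0 and vl.
    """
    return "YES" if template_order.index(k) < template_order.index(l) else "NO"


def enumerate_reactions(template_order):
    """Map every pair (k, l) of intermediate nodes with k < l to YES/NO."""
    intermediates = sorted(template_order[1:-1])
    return {
        (k, l): expected_taqman(template_order, k, l)
        for k, l in combinations(intermediates, 2)
    }


# Template v0 v2 v4 v1 v3 v5 from the running example.
path = [0, 2, 4, 1, 3, 5]
reactions = enumerate_reactions(path)
for (k, l), result in reactions.items():
    print(f"TaqMan(v0,v{k},v{l}) = {result}")
```

For $|V| = 6$ the script lists $[(4)^2 - 4]/2 = 6$ reactions, matching the count given in the text.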
After all real-time PCR reactions are completed, the in vitro output is subjected to an algorithm for in silico information processing, which produces the satisfying Hamiltonian path of the HPP instance in $O(n^2)$ TIME, where $n$ denotes the number of vertices. In this algorithm, an array N[0…|V|-1] that stores all the nodes of the Hamiltonian path is defined. In addition, an array of aggregation values A[1…|V|-2], which is used to locate the position of each node in the Hamiltonian path, is also defined. The node array N is first initialized to N = {0,?,?,?,?,5}, since the start and the end of the path are known in advance. Next, the aggregation array A is initialized to A = {1,1,1,1}. During the loop operations of the algorithm, the values in array A are increased at each iteration step. The aggregation value A[i] is then used for indexing the node array for each value of k.
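A minimal sketch of such an in silico step is given below. It follows the description above (node array N, aggregation array A initialized to 1, values of A increased inside a double loop), but the exact update rule is an assumption made for illustration: a YES for the pair (k, l) is taken to push node l one position later, and a NO to push node k one position later. The function and variable names are hypothetical.

```python
def reconstruct_path(num_nodes, taqman):
    """Recover the Hamiltonian path from pairwise YES/NO readouts.

    taqman maps (k, l), with k < l over the intermediate nodes
    1 .. num_nodes - 2, to "YES" or "NO".
    """
    last = num_nodes - 1
    # Aggregation array A: every intermediate node starts at position 1.
    position = {node: 1 for node in range(1, last)}
    for k in range(1, last):
        for l in range(k + 1, last):
            if taqman[(k, l)] == "YES":
                position[l] += 1  # vk precedes vl (assumed update rule)
            else:
                position[k] += 1  # vl precedes vk (assumed update rule)
    # Node array N: start and end nodes are known in advance.
    nodes = [None] * num_nodes
    nodes[0], nodes[last] = 0, last
    for node, pos in position.items():
        nodes[pos] = node
    return nodes


# Readouts of the six example reactions for the template v0 v2 v4 v1 v3 v5.
example_outputs = {
    (1, 2): "NO", (1, 3): "YES", (1, 4): "NO",
    (2, 3): "YES", (2, 4): "YES", (3, 4): "NO",
}
print(reconstruct_path(6, example_outputs))  # [0, 2, 4, 1, 3, 5]
```

The double loop visits each pair of intermediate nodes once, which is consistent with the $O(n^2)$ bound stated in the text.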

Experiment:
The experiment on output visualization of DNA computing based on real-time PCR consists of two phases. The first phase is the preparation of the input molecules, and the second phase is the real-time PCR experiment, in which 6 different TaqMan reactions were performed simultaneously on the real-time PCR machine.
After completion, amplification via PCR was performed using the same protocol as POA. The PCR product was subjected to gel electrophoresis and the resultant gel image was captured, as shown in Fig. 4. The 120-bp band in lane 2 shows that the input molecules were successfully generated. Afterwards, the DNA of interest was extracted and used as the template in the real-time PCR experiment. The real-time PCR reaction involves primers (Proligo, Japan), TaqMan probes (Proligo, Japan), and LightCycler TaqMan Master (Roche Applied Science, Germany). The sequences of the forward primers, reverse primers, and TaqMan probes are shown in Tables 2 and 3 and are derived from the generated sequences in Table 1. The resulting real-time PCR amplification plots are shown in Fig. 5.
The DNA Sequence Design: In order to design a good sequence for each subsequence $v_0$, $v_1$, $v_2$, $v_3$, $v_4$, and $v_5$, five rules were considered for designing the DNA sequences of the primers and probes for real-time PCR, according to [17], which is based on the implementation of the Primer Express® software (Applied Biosystems) [18]. Those rules are as follows:
• Melting temperature for primers should be between 58 and 60°C, and melting temperature for probes should be 10°C higher
• Primers should be 15-30 bases in length
• GC content of primers and probes should ideally be 30-80%
• For primers, runs of identical nucleotides should be avoided. This is especially true for G, where runs of 4 or more Gs are not allowed. Further, the total number of G and C in the last five nucleotides at the 3' end of the primer should not exceed 2
• For probes, there should be more C than G, and no G at the 5' end
In this research, a DNA sequence design algorithm based on a generate-and-test approach is applied to obtain a set of good DNA sequences. The generate-and-test algorithm consists of two main levels. The first level considers the basic constraints of DNA sequence design. The second level adds specific constraints formulated from the 5 rules listed above.
Basic Constraints: Normally, a set of sequences for DNA computing is designed subject to six constraints [19]. Those constraints are GC-percentage, melting temperature, continuity, hairpin, H-measure, and similarity.
GC-percentage is the percentage of nucleotides in a strand or duplex that are G or C bases. Since GC base pairs are held together by three hydrogen bonds while AT base pairs are held together by only two, double-stranded DNA with a high GC content is often more stable than DNA with a high AT content. The GC-percentage is simple to calculate; only the length of the sequence, that is, the number of nucleotides (or base pairs), and the number of G and C bases are needed. The GC-percentage formula is given by Eq. 1:
$$\mathrm{GC\%} = \frac{\text{number of G and C}}{\text{length of DNA sequence}} \times 100\% \quad (1)$$
Melting temperature, $T_m$, is the temperature at which 50% of the DNA strands are in denatured form and 50% are in double-helical form. The simplest model for estimating the melting temperature of DNA duplexes simply counts the base pairs. This model, which is known as the Wallace 2-4 rule [20], is formulated as follows:
$$T_m = 2\,(\text{number of AT pairs}) + 4\,(\text{number of GC pairs}) \quad (2)$$
An improved method to calculate melting temperature, given by Eq. (3), includes the consideration of the salt concentration [21]. In Eq. (3), $C_t/4$ is replaced by $C_t$ for self-complementary molecules, and $\Delta H$ and $\Delta S$ are calculated by summing the nearest-neighbor enthalpy and entropy changes over the entire hybrid. The values of the enthalpies and entropies have been determined experimentally [22-25]. In this research, $T_m$ is calculated with the nearest-neighbor method using the Breslauer parameters.
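Equations 1 and 2 translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the nearest-neighbor model that the study actually uses for $T_m$ is not reproduced here because it depends on published parameter tables.

```python
def gc_percentage(seq):
    """Eq. 1: percentage of bases in seq that are G or C."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(seq):
    """Eq. 2 (Wallace 2-4 rule): 2 degrees per A/T plus 4 degrees per G/C."""
    seq = seq.upper()
    at_count = sum(base in "AT" for base in seq)
    gc_count = sum(base in "GC" for base in seq)
    return 2 * at_count + 4 * gc_count


example = "ATGC" * 5  # a 20-mer with 10 G/C bases and 10 A/T bases
print(gc_percentage(example))  # 50.0
print(wallace_tm(example))     # 60
```

The Wallace rule is a rough estimate intended for short oligonucleotides, which is why the text moves on to the nearest-neighbor method for the actual design.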

Basic definitions:
Basic definitions similar to those in [26] are used to formulate the hairpin, continuity, H-measure, similarity, and the additional constraints. The alphabet $\Lambda$ consists of the single nucleotides, and a gap (blank) is denoted by $-$. The threshold function $T$ is defined as:
$$T(i,j) = \begin{cases} i, & i > j \\ 0, & \text{otherwise} \end{cases}$$
For a given sequence $x \in \Lambda^*$ of length $l$, the number of non-blank nucleotides is defined as:
$$\mathrm{length}_{nb}(x) = \sum_{i=1}^{l} nb(x_i)$$
where:
$$nb(a) = \begin{cases} 1, & a \in \Lambda \\ 0, & \text{otherwise} \end{cases}$$
and a shift of a sequence $x$ by $i$ bases is denoted as follows:
$$\mathrm{shift}(x,i) = \begin{cases} (-)^{i}\, x_1 \cdots x_{l-i}, & i \ge 0 \\ x_{-i+1} \cdots x_l\, (-)^{-i}, & i < 0 \end{cases}$$
Furthermore, a single DNA sequence can be reversed by applying $\mathrm{swap}(x_i, x_{l+1-i})$ for $1 \le i \le \lfloor l/2 \rfloor$.

H-measure: The H-measure computes how many nucleotides of two sequences are complementary, including position shifts, and is used to prevent cross-hybridization of the two sequences. The H-measure for a given set of sequences $\Sigma$ is formulated from the pairwise values, following [26].

Similarity: The similarity measure computes the similarity of two given sequences in the same direction, including position shifts, and is used to keep each sequence as unique as possible. The formulation for similarity uses the same approach as the H-measure:
$$\mathrm{Sim}(x,y) = \max_{-l < i < l} \sum_{j=1}^{l} eq\big(x_j, \mathrm{shift}(y,i)_j\big)$$
where $eq(a,b)$ equals 1 if $a = b$ and 0 otherwise. The similarity for a given set of sequences $\Sigma$ is likewise formulated from the pairwise values, following [26].

Hairpin: In [23], the hairpin measure, as formulated in Eq. (16), calculates the probability that a single-stranded DNA forms a secondary structure. It uses the term:
$$\mathrm{pinlen}(p,r) = \min(p,\; l-p-r)$$

Continuity: If the same base occurs continuously in a sequence, the sequence could form an unexpected structure. This can be measured by continuity, which is defined in terms of:
$$c(a,i) = \begin{cases} n, & \text{if } \exists\, n \text{ s.t. } eq(a_i, a_{i+j}) = 1 \text{ for } 1 \le j < n,\; eq(a_i, a_{i+n}) = 0 \\ 0, & \text{otherwise} \end{cases}$$

Specific or Additional Constraints: Based on the design rules for the DNA sequences, probes, and primers of the real-time PCR implementation [11], three constraints for DNA sequence design are defined as follows:
• LAST5GC3 is the total number of G and C in the last five nucleotides at the 3' end of the DNA sequence.
• CMINUSG is the total number of C minus the total number of G.
• G5END is an indicator of whether there is a G at the 5' end of the DNA sequence. G5END = 0 if there is no G, and G5END = 1 if a G exists at the 5' end.
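The pairwise measures and the three additional constraints can be sketched as follows. Several points are assumptions made for illustration rather than the formulation used in the study: sequences are written 5' to 3'; shifting pads with the gap symbol and keeps the length fixed; similarity is the maximum number of equal positions over all shifts; and the H-measure is approximated in the same style by counting complementary positions against the reversed second sequence. All function names are hypothetical.

```python
GAP = "-"
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def shift(seq, i):
    """Shift seq by i positions (|i| < len(seq)), padding with gaps."""
    n = len(seq)
    if i >= 0:
        return GAP * i + seq[:n - i]
    return seq[-i:] + GAP * (-i)


def similarity(x, y):
    """Maximum number of identical aligned bases over all shifts of y."""
    n = len(x)
    return max(
        sum(a == b for a, b in zip(x, shift(y, i)))
        for i in range(-n + 1, n)
    )


def h_measure(x, y):
    """Assumed simplified H-measure: maximum number of complementary
    aligned bases between x and the reversed y, over all shifts."""
    n = len(x)
    reversed_y = y[::-1]
    return max(
        sum(COMPLEMENT[a] == b for a, b in zip(x, shift(reversed_y, i)))
        for i in range(-n + 1, n)
    )


def last5gc3(seq):
    """Number of G and C among the last five bases at the 3' end."""
    return sum(base in "GC" for base in seq[-5:])


def cminusg(seq):
    """Total number of C minus total number of G."""
    return seq.count("C") - seq.count("G")


def g5end(seq):
    """1 if the base at the 5' end is G, otherwise 0."""
    return 1 if seq[0] == "G" else 0


probe = "GACCTTAGCA"
print(last5gc3(probe), cminusg(probe), g5end(probe))  # 2 1 1
print(similarity("AAAA", "TTTT"), h_measure("AAAA", "TTTT"))  # 0 4
```

In the last line the two sequences share no base (similarity 0) yet are fully complementary (H-measure 4), which is the distinction the two measures are meant to capture.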

Generate-and-Test Algorithms:
A generate-and-test approach is implemented to generate a set of DNA sequences subject to the basic and the specific (additional) constraints. The flowchart of the generate-and-test approach is shown in Fig. 6. All the constraints can be divided into two types. Constraints of the first type are computed from a single DNA sequence; these are hairpin, continuity, GC-percentage, melting temperature, and all the additional constraints. Constraints of the second type are computed from two DNA sequences; these are H-measure and similarity.
The algorithm starts by initializing all the constraint values based on user requirements. The first sequence is generated randomly. This sequence has to satisfy the melting temperature, GC-percentage, continuity, hairpin, and all the additional constraints before it is stored in the DNA set for the next iterations. After the first sequence is obtained, the second sequence is randomly generated, and it must satisfy all the constraints of the first type. In addition, it must satisfy the H-measure and similarity constraints, calculated against the first sequence, before it is stored in the set. Each subsequent sequence must likewise satisfy all the constraints of the first type and is compared with every sequence already in the set by calculating the H-measure and similarity. The iteration continues until the required number of DNA sequences is obtained.
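The loop described above can be sketched as follows. This is a simplified illustration, not the authors' Visual C++ implementation: the melting temperature and hairpin checks are omitted, the pairwise measures are the assumed shift-and-count forms, continuity is approximated by the longest run of identical bases, and every threshold in PARAMS is a hypothetical value chosen only so that the example runs.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hypothetical thresholds, for illustration only.
PARAMS = {"gc_min": 40.0, "gc_max": 60.0, "max_run": 3, "max_last5gc3": 2,
          "min_cminusg": 1, "max_h": 10, "max_sim": 10}


def shift(seq, i):
    """Shift seq by i positions, padding with gaps and keeping its length."""
    n = len(seq)
    return "-" * i + seq[:n - i] if i >= 0 else seq[-i:] + "-" * (-i)


def max_matches(x, y, match):
    """Largest number of matching aligned positions over all shifts of y."""
    n = len(x)
    return max(sum(match(a, b) for a, b in zip(x, shift(y, i)))
               for i in range(-n + 1, n))


def similarity(x, y):
    return max_matches(x, y, lambda a, b: a == b)


def h_measure(x, y):
    return max_matches(x, y[::-1], lambda a, b: COMPLEMENT[a] == b)


def longest_run(seq):
    """Length of the longest run of identical consecutive bases."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def passes_single(seq, p):
    """First-type constraints: computed from one sequence only."""
    gc = 100.0 * sum(b in "GC" for b in seq) / len(seq)
    return (p["gc_min"] <= gc <= p["gc_max"]
            and longest_run(seq) <= p["max_run"]
            and sum(b in "GC" for b in seq[-5:]) <= p["max_last5gc3"]
            and seq.count("C") - seq.count("G") >= p["min_cminusg"]
            and seq[0] != "G")


def passes_pair(seq, accepted, p):
    """Second-type constraints: checked against every stored sequence."""
    return all(h_measure(seq, s) <= p["max_h"]
               and similarity(seq, s) <= p["max_sim"] for s in accepted)


def generate_and_test(count, length, p, seed=0, max_attempts=100_000):
    """Draw random candidates and keep those that pass both levels."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_attempts):
        if len(accepted) == count:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if passes_single(candidate, p) and passes_pair(candidate, accepted, p):
            accepted.append(candidate)
    return accepted


sequences = generate_and_test(6, 20, PARAMS)
for s in sequences:
    print(s)
```

Checking the single-sequence constraints first is a deliberate ordering: they are cheap and reject most candidates before the more expensive pairwise comparisons are run. The attempt limit guards against thresholds so strict that no set of the requested size exists.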

RESULT
The generate-and-test algorithm was implemented in Visual C++ 6.0 and run on a computer with a 3.0 GHz Intel P4 processor and 2 GB of RAM. The results obtained are compared with the sequences generated in [11], where DNASequenceGenerator [12] was employed. All the variables for the generation of six DNA sequences based on HPP are listed in Table 4. The comparison is shown in Table 5 and Fig. 7. For the H-measure, similarity, LAST5GC3, and CMINUSG, the constraint is specified as a maximum allowed value. For example, if the H-measure value is set to 8, the H-measure of the DNA sequences cannot exceed 8. The same approach is applied to similarity, LAST5GC3, and CMINUSG.

DISCUSSION
Sequences designed by the generate-and-test algorithm have a higher average H-measure value than sequences generated by DNASequenceGenerator. However, our sequences show a lower average similarity value than sequences designed by DNASequenceGenerator. The previous sequences designed by DNASequenceGenerator have an average hairpin value of zero and an average continuity value of two. For the additional constraints, our sequences have lower average values than the sequences produced by DNASequenceGenerator.

Fig. 7: Comparison results of [11] and the proposed generate-and-test algorithm.

Table 5: Comparison results of the sequences in [11] and the sequences based on the generate-and-test algorithm.

CONCLUSION
This study showed that the generate-and-test algorithm can be used to design a set of DNA sequences for a DNA computing readout method based on real-time PCR. Sequences generated by this algorithm are comparable to sequences designed by DNASequenceGenerator. By using the 6 basic constraints for DNA sequence design together with the additional constraints proposed in this research, a set of DNA sequences generated by generate-and-test can be used in DNA computing with output visualization based on real-time PCR. The generate-and-test algorithm is easy to apply and is suitable for designing DNA sequences subject to many constraints when no optimization process is needed to generate a set of good DNA sequences. However, the DNA sequences generated by the generate-and-test algorithm are not an optimized solution, because they depend on threshold values set by the user. Hence, in future studies, advanced optimization algorithms, such as Ant Colony Optimization and Particle Swarm Optimization, could offer advantages in designing good DNA sequences.