A Gaussian Process Regression Model for the Traveling Salesman Problem

Problem statement: The Traveling Salesman Problem (TSP) is a famous NP-hard problem, and many approaches have been proposed to date for solving it. We consider a TSP tour as an independent variable and its corresponding distance as a dependent variable. If a predictive function can be formulated from arbitrary sample tours, the optimal tour may be predicted from this function. Approach: In this study, a combined procedure of the Nearest Neighbor (NN) method, Gaussian Process Regression (GPR) and iterated local search is proposed to solve a deterministic symmetric TSP with a single salesman. The first sample tour is constructed by the nearest neighbor algorithm and is used to construct other tours by random 2-exchange swaps. These tours and their total distances are the training data for a Gaussian process regression model. The GPR solution is further improved with the iterated 2-opt method. In the numerical experiments, our algorithm is tested on many TSP instances and compared with the Genetic Algorithm (GA) and the Simulated Annealing (SA) algorithm. Results: The proposed method can find good TSP tours within a reasonable computational time for a wide range of TSP test problems; in some cases, it outperforms GA and SA. Conclusion: Our proposed algorithm is promising for solving the TSP.


INTRODUCTION
The Traveling Salesman Problem (TSP) is one of the most well-known NP-hard problems in the field of combinatorial optimization. The objective is to find the shortest Hamiltonian cycle among n_c cities, where the salesman visits each of the n_c cities exactly once and then returns to the starting city. Although its mathematical formulation is simple, the TSP is difficult because it is combinatorial and the number of solutions grows exponentially with the number of cities.
TSP applications can be found in many fields, including job sequencing on a single machine and assignment problems (Gilmore and Gomory, 1964), material handling in a warehouse (Ratliff and Rosenthal, 1983), genome rearrangement (Sankoff and Blanchette, 1997), the drilling of printed circuit boards (Duman and Or, 2004) and transportation and logistics problems (Rodriguez and Ruiz, 2012).
Several algorithms have been designed to solve the TSP. An exact approach is to try all possible permutations (order combinations), but this brute-force method takes far more computational time than the cutting plane method (Dantzig et al., 1954) or the branch and bound method (Little et al., 1963). Other heuristics and approximation algorithms include the nearest neighbor (greedy) algorithm (Bellmore and Nemhauser, 1968), the Lin-Kernighan heuristic (Lin and Kernighan, 1973; Helsgaun, 2000) and the k-opt heuristic (Helsgaun, 2009).
The key contributions of this study: We propose an algorithmic approach for solving a deterministic, symmetric TSP with a single salesman. The method integrates Gaussian Process Regression (GPR) with the Nearest Neighbor algorithm (NN) and improves the resulting solution with the iterated 2-opt method. The numerical experiments show that our approach yields TSP tours within 1-12% of the optimal solutions for instances with up to 2103 nodes.
The study is organized as follows: First, the literature survey on GPR applications and a conceptual framework of the GPR approach are described. Then, our GPR algorithm for solving the TSP is explained and the comparison methods are briefly described. Next, the experimental results of all algorithms are presented and discussed. Lastly, the work is summarized in the conclusion.

MATERIALS AND METHODS
Literature survey on GPR applications: Gaussian process regression is known as a probabilistic approach to regression, valued for its practical and theoretical simplicity and excellent generalization ability (Rasmussen and Williams, 2006). GPR has been employed in many fields, particularly in machine learning: to estimate the depth of a point in space from its image positions in two different cameras (Sinz et al., 2004); to learn non-parametric motion and observation models for sequential state estimation, applied to the problem of tracking an autonomous micro-blimp (Ko et al., 2007); and to find near-optimal sensor placements as an instance of the art-gallery problem (Krause et al., 2008). For traffic problems, GPR has been applied to predict the traveling time along an arbitrary traffic path in downtown Kyoto, Japan (Ide and Kato, 2009).
The notation in this paper is as follows: Vectors are represented by lowercase Roman symbols, such as x_s, and all vectors are assumed to be column vectors. The transpose of a vector or matrix is denoted by a superscript T. Matrices are represented by capital Roman symbols, such as K, and the (i, j) element of a matrix K is K_ij. The identity matrix is I. Finally, an estimated value is denoted with a hat, e.g., ŷ.

Standard Linear Regression (SLR):
The standard linear regression model with Gaussian noise is y = f(x) + ε, with the linear model given by Eq. 1:

f(x) = xᵀw    (1)

where x is the input vector and w is the vector of weights (parameters to be estimated) of the linear model (Rasmussen and Williams, 2006). The noise ε follows an independent and identically distributed Gaussian distribution with zero mean and noise variance σ_n², that is, ε ~ N(0, σ_n²).
Gaussian process: A Gaussian Process (GP) is a collection of random variables, any finite number of which has a joint Gaussian distribution; it is specified by its mean function and covariance function (Rasmussen and Williams, 2006). The mean function m(x) and covariance function k(x, x') of a real process f(x) are defined as:

m(x) = E[f(x)]
k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]

and we write f(x) ~ GP(m(x), k(x, x')).

Gaussian process regression: Gaussian process regression is a model to estimate the value of a dependent variable, or response, from observations of dependent variables at certain values of the independent variables (Rasmussen and Williams, 2006). A training set T_s of n observations is denoted T_s = {(x_i, y_i) | i = 1, ..., n}, where x_i is an input vector (in our case, a traveling tour) of n_c cities and y_i is a response (the total distance of the tour). The input vectors of all n observations are aggregated into the n_c × n matrix X and the total distances into the column vector y, so we can write T_s = (X, y). The graphical model of GPR is shown in Fig. 1 (Rasmussen and Williams, 2006). In the GPR model, a Gaussian basis function φ(x_i) is specified; it maps an n_c-city input vector x_i into an N_c-dimensional feature space. Let the matrix Φ(X) be the aggregation of columns φ(x_i) for all inputs in the training set. The function f(x) in the SLR model (Eq. 1) then becomes f(x) = φ(x)ᵀw, where w is the vector of weight parameters.
Under this Gaussian noise model, the probability density of the observations given the parameters, factored over all cases in the training set, is:

p(y | X, w) = N(Xᵀw, σ_n² I)

where the weights have the zero-mean Gaussian prior w ~ N(0, Σ_p). The posterior distribution over the weights in this Bayesian linear model is computed by Bayes' rule; it is Gaussian with mean w̄ and covariance matrix A⁻¹, given as Eq. 2:

p(w | X, y) ~ N(w̄, A⁻¹),  w̄ = σ_n⁻² A⁻¹ X y,  A = σ_n⁻² X Xᵀ + Σ_p⁻¹    (2)
The predictive distribution in Eq. 3 is also Gaussian, with mean given by the posterior mean of the weights from (2) multiplied by the test input x_s, and variance given by a quadratic form of x_s with the posterior covariance matrix (Rasmussen and Williams, 2006):

p(f_s | x_s, X, y) = N(σ_n⁻² x_sᵀ A⁻¹ X y, x_sᵀ A⁻¹ x_s)    (3)

Substituting the feature map φ(x) for x, the predictive distribution becomes Eq. 4:

p(f_s | x_s, X, y) = N(σ_n⁻² φ(x_s)ᵀ A⁻¹ Φ y, φ(x_s)ᵀ A⁻¹ φ(x_s)),  A = σ_n⁻² Φ Φᵀ + Σ_p⁻¹    (4)

On the right-hand side of (4), the inverse of A is needed to make a prediction, which may be inconvenient if the feature space is large. However, this term can be rewritten as Eq. 5:

f̄_s = φ(x_s)ᵀ Σ_p Φ (K + σ_n² I)⁻¹ y,  K = Φᵀ Σ_p Φ    (5)

where x_i and x_s are in the training set and the test set, respectively. Let k(x_i, x_s) = φ(x_i)ᵀ Σ_p φ(x_s); it is called a covariance function or a kernel for prediction. Similarly, the covariance matrix K is defined as k(X, X) and each of its elements is determined by the covariance function of a pair of training inputs, K_ij = k(x_i, x_j).
With independent and identically distributed Gaussian noise ε of variance σ_n², the prior covariance of the noisy observations becomes K(X, X) + σ_n² I. The joint distribution of the observed response values and the response at the test location is:

[y ; f_s] ~ N(0, [K(X, X) + σ_n² I,  K(X, x_s) ; K(x_s, X),  K(x_s, x_s)])

Thus, the predictive function value is Eq. 6:

f̄_s = K(x_s, X) [K(X, X) + σ_n² I]⁻¹ y    (6)

Covariance function: Many choices of covariance function are available for a GP, for example, the Matérn class, the squared exponential, the rational quadratic and the radial basis covariance functions (Rasmussen and Williams, 2006). In this study, the Squared Exponential (SE) covariance function is chosen because it is the most widely used kernel; it is given by Eq. 7:

k(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / (2ℓ²)) + σ_n² δ(x_i, x_j)    (7)

where σ_f² is the signal variance, δ(x_i, x_j) is the Kronecker delta function, which equals 1 if and only if i = j and 0 otherwise, and ℓ is the characteristic length-scale.
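The predictive mean of Eq. 6 with the SE kernel of Eq. 7 can be sketched in a few lines of Python with numpy (an illustrative sketch, not the GPML toolbox the authors used; the hyperparameter defaults match the paper's initial settings, except that the noise term `sn2` is kept small but nonzero here, an assumption made for numerical stability):

```python
import numpy as np

def se_kernel(A, B, ell=2.0, sf2=1.0):
    """Squared exponential covariance (noise-free part of Eq. 7):
    k(x_i, x_j) = sf2 * exp(-||x_i - x_j||^2 / (2 ell^2))."""
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return sf2 * np.exp(-d2 / (2.0 * ell**2))

def gp_predict_mean(X, y, x_s, ell=2.0, sf2=1.0, sn2=1e-6):
    """GPR predictive mean (Eq. 6): f_s = K(x_s, X) [K(X, X) + sn2 I]^{-1} y.
    X holds one training input per row; x_s is a single test input."""
    K = se_kernel(X, X, ell, sf2) + sn2 * np.eye(len(X))
    k_s = se_kernel(x_s[None, :], X, ell, sf2)
    return float(k_s @ np.linalg.solve(K, y))
```

With a near-zero noise variance the predictive mean interpolates the training data, so evaluating at a training input returns (approximately) its observed response.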
Hyperparameters: The hyperparameters of the GPR model are θ = (σ_f², ℓ, σ_n²). They are estimated by maximizing the log marginal likelihood, Eq. 8:

log p(y | X, θ) = −(1/2) yᵀ (K + σ_n² I)⁻¹ y − (1/2) log |K + σ_n² I| − (n/2) log 2π    (8)

The partial derivatives of (8) with respect to θ are driven to zero by a gradient-based search. This gives a vector of optimized hyperparameters.
Tour construction and representation: Sample TSP tours are needed as input to GPR. The nearest neighbor algorithm (Nuhoglu, 2007) constructs the first TSP tour, which is then used to construct other tours by the 2-exchange method. The nearest neighbor algorithm starts at a chosen starting city and repeatedly selects the closest unvisited city until all cities are included in the tour; the total distance of the resulting tour depends on the chosen starting city (Laporte, 1992). The 2-exchange swapping strategy randomly selects two cities in the tour and swaps them to obtain a new tour (Larranaga et al., 1999).
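The two construction steps above can be sketched as follows (an illustrative Python sketch, not the authors' MATLAB code; `dist` is assumed to be a symmetric distance matrix with cities indexed from 0):

```python
import random

def nearest_neighbor_tour(dist, start=0):
    """Build a tour by repeatedly visiting the closest unvisited city."""
    n = len(dist)
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        current = tour[-1]
        closest = min(unvisited, key=lambda c: dist[current][c])
        tour.append(closest)
        unvisited.remove(closest)
    return tour

def two_exchange(tour, rng=random):
    """Return a new tour with two randomly chosen cities swapped."""
    i, j = rng.sample(range(len(tour)), 2)
    new_tour = tour[:]
    new_tour[i], new_tour[j] = new_tour[j], new_tour[i]
    return new_tour

def tour_length(tour, dist):
    """Total distance of the closed tour (returning to the start city)."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
```

Repeated calls to `two_exchange` on the nearest neighbor tour would produce the sample tours that, with their lengths, form the GPR training set.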
A tour can be represented in many ways, e.g., path representation, binary string representation, binary matrix representation and edge recombination crossover in binary representation (Larranaga et al., 1999). In this work, the binary string representation is selected to encode all tours for the GPR input because it is simple and performs well in test prediction. Each city in a tour is encoded as a string of ⌈log₂ n_c⌉ bits and a complete tour then becomes a string of n_c⌈log₂ n_c⌉ bits (Larranaga et al., 1999); for example, the tour 4→2→1→3 is represented by (011001000010).
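This encoding can be sketched as follows (an illustrative Python sketch, not the authors' code; it assumes each city number c is written as the binary representation of c − 1 in a fixed number of bits, since 3 bits per city reproduces the example string for the tour 4→2→1→3):

```python
def encode_tour(tour, bits):
    """Encode a tour (1-based city numbers) as one binary string,
    writing city c as the binary representation of c - 1 in `bits` bits."""
    return "".join(format(city - 1, f"0{bits}b") for city in tour)

def decode_tour(bitstring, bits):
    """Inverse of encode_tour: split into `bits`-wide chunks, add 1 back."""
    return [int(bitstring[k:k + bits], 2) + 1
            for k in range(0, len(bitstring), bits)]
```

The encoded strings (as 0/1 vectors) become the rows of the GPR input matrix X.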
Tour improvement: Many algorithms have been proposed to improve a tour, for instance, the r-opt algorithm, the Lin-Kernighan heuristic, simulated annealing and tabu search (Laporte, 1992). We consider the iterated local search with the 2-exchange neighborhood, or the iterated 2-opt algorithm, because 2-opt is one of the most famous simple local searches (Johnson and McGeoch, 1997; Lourenco et al., 2010). In the 2-opt algorithm, two edges are deleted from a given tour, breaking it into two paths, and those paths are reconnected with the two other possible edges. If the new tour is shorter, it becomes the current tour. These steps are repeated until no improvement can be made (i.e., a local optimum is reached). The 2-opt algorithm searches the neighborhood by starting at the first node and continuing with the remaining nodes until all nodes are examined (Nuhoglu, 2007). This confines the search to a single local optimum, and the solution can be improved by iterating the whole search process, often leading to far better solutions (Lourenco et al., 2010). In this study, the iterated 2-opt procedure is modified from the 2-opt algorithm (Nuhoglu, 2007) by iterating the whole process until a better local optimum can no longer be found.
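The 2-opt descent described above can be sketched as follows (an illustrative Python sketch, not the authors' MATLAB implementation; `dist` is assumed to be a symmetric distance matrix, and reconnecting edges (a, b) and (c, d) as (a, c) and (b, d) corresponds to reversing the segment between them):

```python
def two_opt(tour, dist):
    """2-opt descent: reverse the segment between two edges whenever doing so
    shortens the closed tour, repeating until a local optimum is reached."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a node on the closed tour
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # Replace edges (a, b) and (c, d) with (a, c) and (b, d)
                # if that strictly shortens the tour.
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

The iterated variant would restart this descent from the best tour found so far until no better local optimum appears.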
Proposed GPR Algorithm: Our GPR algorithm for a deterministic TSP with a single salesman is divided into three phases: I, II and III.
Phase I: Prepare the training dataset T_s and the test input.

Input: Distance matrix D, where D_ij is the distance from city i to city j.
• Construct the first tour with the nearest neighbor algorithm and other tours from it with the random 2-exchange swapping strategy
• Encode all tours as binary strings and aggregate them into X
• Calculate the total distance of each tour and aggregate the distances into y
• Find the tour with the minimum total distance in X and encode it as the binary test vector x_s in (3)

Output: Binary matrix of tours X, vector of observed total distances y and binary test vector x_s.

Phase II: Use the GPR model to predict the length of an optimal tour.

Input: Binary matrix of tours X, vector of observed total distances y and binary test vector x_s.
• Initialize the hyperparameters (ℓ, σ_f², σ_n²)
• Compute the squared exponential covariance function of every possible pair (x_i, x_j) by (7)
• Compute the log marginal likelihood for GPR as specified in (8), then compute f_s by (6)
• Compute the squared difference between each value in y and the value of f_s and identify the index i of the minimum, that is, i = argmin_i (y_i − f̄_s)²
• Take the binary tour at index i in X as the predictive tour and decode it into the vector of city numbers x*

Output: Estimated optimal tour x* and its length y*.

Phase III: Apply the iterated 2-opt algorithm to improve x* from Phase II.

Input: Estimated optimal tour x*.
1. Start with x* and its first edge
2. Select an edge (a, b), search for another edge (c, d) and remove both, breaking the tour into two paths
3. Calculate the sum of the lengths of the two removed edges and the sum of the lengths of edges (a, c) and (b, d)
4. Reconnect the tour with the modified edges only if their total length is less than that of the removed edges
5. Set the obtained tour as the initial tour and repeat until no improvement is possible (a local optimum is reached)
6. Keep the best tour and set it as the initial tour for the next iteration
7. Repeat Steps 2-6 until no improvement can be made, then return the resulting tour and its total distance

Output: Resulting tour x* and its total distance y*.

Comparing methods:
Classical Genetic Algorithm (GA): GA is a well-known search heuristic for solving optimization problems using techniques inspired by natural evolution (Chatterjee et al., 1996). It involves three basic steps: evaluation, crossover and mutation. In the first step, a population of individual chromosomes is reproduced and good chromosomes (based on their objective function or "fitness" value) are selected for the next generation with some probability. Next, the crossover step randomly selects pairs of chromosomes that survived into the next generation and mates them to produce new chromosomes. The mutation step randomly chooses a chromosome in the new generation completed by the crossover and mutates it at a particular point, yielding a new population. The whole process is iterated until the stopping criteria are reached. In this study, we use the GA from a MATLAB toolbox (Kirk, 2007), where the main parameters are the population size (pop_size), probability of crossover (p_c), probability of mutation (p_m) and the stopping criterion on the number of iterations.
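A minimal permutation GA of this shape can be sketched as follows (an illustrative Python sketch, not the MATLAB toolbox of Kirk (2007); tournament selection, order crossover and swap mutation are standard operator choices assumed here, since the toolbox's exact operators are not described in the text):

```python
import random

def order_crossover(p1, p2, rng):
    """OX crossover: copy a slice from p1, fill the rest in p2's order."""
    n = len(p1)
    i, j = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = p1[i:j + 1]
    fill = [c for c in p2 if c not in child]
    for idx in range(n):
        if child[idx] is None:
            child[idx] = fill.pop(0)
    return child

def genetic_algorithm(dist, pop_size=30, p_c=0.9, p_m=0.2, iters=200, seed=0):
    """Toy GA for the TSP: tournament selection, order crossover, swap mutation."""
    rng = random.Random(seed)
    n = len(dist)
    length = lambda t: sum(dist[t[k]][t[(k + 1) % n]] for k in range(n))
    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    best = min(pop, key=length)
    for _ in range(iters):
        new_pop = []
        while len(new_pop) < pop_size:
            # Tournament selection of two parents by fitness (tour length).
            p1 = min(rng.sample(pop, 3), key=length)
            p2 = min(rng.sample(pop, 3), key=length)
            child = order_crossover(p1, p2, rng) if rng.random() < p_c else p1[:]
            if rng.random() < p_m:  # swap mutation
                a, b = rng.sample(range(n), 2)
                child[a], child[b] = child[b], child[a]
            new_pop.append(child)
        pop = new_pop
        best = min(pop + [best], key=length)  # keep the best tour seen so far
    return best, length(best)
```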
Simulated Annealing Algorithm (SA): SA is a probabilistic method based on the process of material annealing in metallurgy (Laporte, 1992). For the TSP, SA starts from a given initial tour (when the temperature is high) and a schedule for gradually decreasing the temperature. It generates a new tour by randomly swapping two cities in the current tour and calculates the difference ΔE in tour length between the current tour and the new tour. If the new tour is better than the current tour, it is accepted as the current tour; otherwise, the new tour is accepted with probability:

P = exp(−ΔE / T)

where T is the current temperature, which is decreased by a cooling rate in each iteration. These steps are repeated until the stopping criteria are reached. In this paper, we use the SA from a MATLAB toolbox (Seshadri, 2006), where the main parameters are the initial temperature (T_init), end temperature (T_end), cooling rate (T_cool) and the stopping criterion on the number of iterations.
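The SA loop above, with acceptance probability exp(−ΔE/T), can be sketched as follows (an illustrative Python sketch, not the MATLAB toolbox of Seshadri (2006); the parameter values are placeholders):

```python
import math
import random

def simulated_annealing(dist, tour, t_init=100.0, t_end=1e-3, t_cool=0.95, seed=0):
    """Toy SA for the TSP: random city swaps, accepting worse tours with
    probability exp(-dE / T) while the temperature cools geometrically."""
    rng = random.Random(seed)
    n = len(tour)
    length = lambda t: sum(dist[t[k]][t[(k + 1) % n]] for k in range(n))
    current, current_len = tour[:], length(tour)
    best, best_len = current[:], current_len
    T = t_init
    while T > t_end:
        # Propose a neighbor by swapping two random cities.
        candidate = current[:]
        i, j = rng.sample(range(n), 2)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        dE = length(candidate) - current_len
        # Accept improvements always, deteriorations with probability exp(-dE/T).
        if dE < 0 or rng.random() < math.exp(-dE / T):
            current, current_len = candidate, current_len + dE
        if current_len < best_len:
            best, best_len = current[:], current_len
        T *= t_cool  # geometric cooling schedule
    return best, best_len
```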

Numerical experiments:
Our GPR algorithm is modified from the GPML toolbox (Rasmussen and Nickisch, 2010), which is implemented in MATLAB. We test all algorithms on 60 TSP problems from the TSP library (Reinelt, 1995), with the number of cities ranging from 16 to 2103. The computation is done on a PC with a 2 GHz Intel(R) Pentium Dual CPU and 1 GB of memory. For the GPR algorithm, we use 50 sample tours and the initial hyperparameters ℓ = 2, σ_f² = 1 and σ_n² = 0 (chosen by trial and error). Because our algorithm needs a training dataset which we create randomly, we repeat the GPR algorithm on each TSP instance nine times and average the total distances. Our GPR algorithm integrates GPR with the Nearest Neighbor algorithm (NN) and the iterated 2-opt method, hence the name NN+GPR+Iterated 2-opt.
The performance measure for all algorithms is the percentage deviation of the total distance of the resulting tour (ŷ) from that of the actual optimal solution (y*) for each instance, given by:

deviation (%) = (ŷ − y*) / y* × 100

The computational time is also considered.

RESULTS
The results of the proposed algorithm for each instance, including the best solution from nine trials, the average performance over nine trials, the 95% confidence interval, the percentage deviation of its solution from the optimal solution and the average running time, are shown in Table 1. The percentage deviations are less than 12% and the average running time ranges from 2 to 4,520 sec. For the gr17 problem, the GPR algorithm finds an optimal solution within 3 sec. In addition, for the ulysses16, ulysses22, bays29, swiss42, pr107 and si535 problems, the algorithm finds solutions within 1% of the optimal solutions. The worst deviation from an optimal solution occurs for the rat783 instance, at approximately 12%.
The summary results of the proposed algorithm and the comparison algorithms for each instance, including the percentage deviation of their solutions from the optimal solutions and the average running time, are shown in Table 2. Comparison plots of the deviation of all three algorithms' solutions from the optimal solutions and of the average running time on a log scale are provided in Fig. 2 and 3, respectively. Compared with GA and SA, the proposed algorithm finds better solutions for 39 of the 60 TSP instances. Its average running times on 49 of the TSP instances are also less than one second. Although the proposed algorithm spends more running time than both comparison algorithms on the remaining 11 TSP instances, the solutions obtained are better.
As the size of the test problem grows, our proposed approach continues to perform well while the performance of GA and SA deteriorates, as shown in Fig. 2. Figure 3 shows that the NN+GPR+Iterated 2-opt algorithm spends more running time than GA and SA when the problem size exceeds 535 cities, but its solution quality is better.

DISCUSSION
We combine the nearest neighbor method, Gaussian process regression and the iterated 2-opt method. The approach adopts the NN method to construct the first sample tour, uses it to construct other sample tours by the 2-exchange method and treats these tours as input to the GPR for predicting the optimal tour. The solutions from GPR alone are not very good, lying far from the optimal solutions, so an improvement procedure is called for. The iterated 2-opt method provides this local search, and the combined approach is called "NN+GPR+Iterated 2-opt." The numerical experiments show that it performs well on a set of 60 TSP instances.
Moreover, we compare the proposed method with two well-known methods, the Genetic Algorithm (GA) and the Simulated Annealing algorithm (SA). The experimental results show that the NN+GPR+Iterated 2-opt algorithm yields better overall solution quality than GA and SA, even though there are some TSP instances where it is not the winner. Our algorithm also consumes less overall running time than GA and SA; although it spends more running time than both algorithms on some TSP instances, it obtains better solution quality on those instances. Thus, in this study, the NN+GPR+Iterated 2-opt algorithm is the best of the three approaches.

CONCLUSION
In this study, we propose an algorithm based on Gaussian Process Regression (GPR) for predicting the optimal tour of the deterministic Traveling Salesman Problem (TSP) with a single salesman. The algorithm formulates the TSP as a GPR model where the response is the length of a traveling tour and the predictor is the tour itself, encoded from the city numbers. The NN+GPR approach embedded with the iterated 2-opt algorithm achieves a reasonable trade-off between computational time and solution quality.
The results indicate that NN+GPR+Iterated 2-opt performs well on a set of 60 TSP instances, although it consumes more running time than the two comparison algorithms (the genetic algorithm and the simulated annealing algorithm) for some TSP instances.

ACKNOWLEDGMENT
The researchers would like to thank Associate Professor Dr. Peerayuth Charnsethikul, Department of Industrial Engineering at Kasetsart University, for his constructive comments and suggestions. Also, the first author gratefully acknowledges the financial support from Prince of Songkla University for this study.