NELDER-MEAD METHOD WITH LOCAL SELECTION USING NEIGHBORHOOD AND MEMORY FOR STOCHASTIC OPTIMIZATION

We consider the Nelder-Mead (NM) simplex algorithm for optimization of discrete-event stochastic simulation models. We propose new modifications of NM to reduce computational time and to improve the quality of the estimated optimal solutions. Our means include utilizing past information on already-seen solutions, expanding the search space to their neighborhood and using adaptive sample sizes. We compare the performance of these extensions on six test functions with 3 levels of random variation. We find that using past information reduces computational effort by up to 20%. The adaptive modifications need up to 70% more resources than their non-adaptive counterparts but give better-quality solutions. We recommend the adaptive algorithms that use memory, with or without a neighborhood structure.


INTRODUCTION
Optimization via Simulation (OvS) is the problem of finding the set of input variables, or decision variables, that gives the maximum or minimum objective function value. In addition, simulation optimization aims at minimizing the computational resources spent while maximizing the information obtained in a simulation experiment (Carson and Maria, 1997). We are interested in OvS problems that have stochastic objective functions and continuous decision variables; see Alon et al. (2005), Henderson and Nelson (2006), Olafsson and Kim (2002) and Swisher et al. (2004) for OvS surveys.
Many OvS tools are developed for unconstrained continuous problems. Most of them are based on the random search method that takes objective function values from a set of sample points and uses that information to select the next points. Various techniques differ in the choice of sampling strategies (Andradottir, 2006). A point-based strategy involves sampling points in a neighborhood of the current solution, e.g., the Stochastic Ruler (Alrefaei and Andradottir, 2005) and the Simulated Annealing (Press et al., 2007). A set-based strategy generates a set of candidate solutions from a subset of the feasible region, e.g., the Nested Partitions Method (Shi and Olafsson, 2009) and the Nelder-Mead Simplex (Nelder and Mead, 1965). A population-based strategy creates a collection of candidate solutions using some properties of the previously visited solutions; for example, the Genetic Algorithm (Holland, 2000) and the Evolutionary Strategies (Beyer and Schwefel, 2002).
We focus on the Nelder-Mead (NM) simplex algorithm (Nelder and Mead, 1965), which was originally developed for unconstrained deterministic optimization. It demonstrates wide versatility and ease of use, such that it is implemented in MATLAB as the function fminsearch. The NM is also robust with respect to small random variations in the observed objective function values; therefore, it is used for optimizing stochastic problems as well (Tomick, 1995; Humphrey and Wilson, 2000). However, when the variability in the objective function values is sufficiently large, the NM may terminate before reaching the global optimum. For this reason, Barton and Ivey (1996) propose the algorithm RS9, which improves NM performance for stochastic problems by increasing the shrink parameter and recalculating every point of the shrink simplex.
In this study, we propose variants of the NM by utilizing past information and/or proximate points. Specifically, we incorporate:
• Information collected since the search begins and
• A search neighborhood
With numerical experiments, we show that our algorithms provide better solutions while requiring less computational effort than the original NM.
Generally, an OvS problem can be defined as follows. The objective is to determine an optimal solution, x*, that minimizes the unknown objective function µ: Θ → ℝ over a continuous feasible region Θ, where x ∈ Θ is a vector of d decision variables and x is called a solution. The objective function µ(x) cannot be observed directly; thus, it is estimated with stochastic simulation, i.e., Equation 1:

µ(x) = E[G(x, ξ_x)]  (1)

where G(x, ξ_x) is a simulation output evaluated at x and ξ_x represents the randomness of the simulation. The objective function value is estimated by the sample mean of m simulation outputs, Equation 2:

µ̂(x) = (1/m) Σ_{i=1}^{m} G(x, ξ_x^(i))  (2)

This study is organized as follows: Section 2 introduces the original form of the NM simplex algorithm, its existing variants and our extensions and describes the design of the numerical experiments. Section 3 shows the numerical results. Section 4 discusses the results. Finally, we conclude in Section 5.
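As a sketch of the sample-mean estimator µ̂(x) in Equation 2, the following computes µ̂(x) from m noisy simulation outputs. The `noisy_paraboloid` model below is a hypothetical stand-in for a discrete-event simulation, not part of the original study:

```python
import numpy as np

def estimate_objective(simulate, x, m=6, seed=None):
    """Sample-mean estimate of mu(x) from m noisy outputs G(x, xi)."""
    rng = np.random.default_rng(seed)
    outputs = [simulate(x, rng) for _ in range(m)]
    return float(np.mean(outputs))

# Hypothetical simulation model: a paraboloid observed with Gaussian noise.
def noisy_paraboloid(x, rng, sigma=0.75):
    return 1.0 + float(np.sum(np.asarray(x) ** 2)) + rng.normal(0.0, sigma)

# mu([0.5, -0.5]) = 1.5; the estimate converges to it as m grows.
mu_hat = estimate_objective(noisy_paraboloid, [0.5, -0.5], m=400, seed=0)
```

In the algorithms below, m = 1 reproduces the default fminsearch behavior, while RS9 uses m = 6.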

Original NM
The first of the simplex methods is due to Spendley et al. (1962) for deterministic problems. They assume that any point in the domain of search can be constructed by taking a linear combination of the edges adjacent to any given vertex. The original simplex method consists of reflecting one vertex through the centroid of the opposite face. Sometimes a sequence of reflections brings the search back to where it starts. Nelder and Mead (1965) add expansion and contraction moves to accelerate the search, and a shrink step is introduced to halve the lengths of the edges adjacent to the current best vertex, in case none of the other steps brings acceptable improvement to the original simplex. Figure 1 illustrates the 2-dimensional trial points for a simplex consisting of x_0, x_1 and x_2. The solid-line simplex is the initial simplex. The other line styles show the various simplex operations: the expansion point is x_2^E, the reflection point is x_2^R, the internal and external contraction points are x_2^{C_i} and x_2^{C_e}, respectively, and C is the centroid of the 2 best points. By the default setting of fminsearch (the NM implementation in MATLAB), a single simulation output (m = 1 in Equation 2) is the estimate of the objective function µ̂(x).
The overall logical steps of the NM algorithm are shown in Fig. 2 and can be explained in more detail as follows.

Initialization: Create an Initial Simplex
• Select a starting point x_0 ∈ Θ, a vector of d dimensions
• Form an initial simplex of d+1 points by defining Equation 3:

x_i = x_0 + s_i e_i,  i = 1, 2, …, d  (3)

where s_i are the user-specified initial step sizes and e_i is the i-th unit vector.

Step 1: Calculate the Reflection Point

The worst point on the simplex (recall that we consider a minimization problem, so the worst point x_d is the one with the highest sample mean) is replaced with another point that has a lower objective function value. Let x_d^R be the reflection of the worst point x_d through the centroid C of the d best points. These points are computed as:

C = (1/d) Σ_{i=0}^{d−1} x_i,  x_d^R = C + α(C − x_d)

where α > 0 is a reflection parameter, typically 1.

Step 2: Update the Simplex

Figure 3 shows an initial simplex with dashed lines and an updated simplex with solid lines. The updated simplex depends on how µ̂(x_d^R) ranks among the estimated values at the current simplex points. If µ̂(x_0) ≤ µ̂(x_d^R) < µ̂(x_{d−1}), the reflection point replaces the worst point, as shown in Fig. 3a. Then go to Step 4.
If µ̂(x_d^R) < µ̂(x_0), the search continues in the same direction by calculating the expansion point:

x_d^E = C + γ(x_d^R − C)

where γ > 0 is an expansion parameter, typically 2. Then µ̂(x_d^E) is estimated from m simulation outputs by Equation 2 and count = count + m. The expansion point is accepted when it improves over the best point in the simplex, x_0, as shown in Fig. 3b; otherwise, the reflection point is accepted. Go to Step 4.

If the reflection point does not improve over x_{d−1}, a contraction point is calculated instead: the external contraction x_d^{C_e} = C + β(x_d^R − C) when µ̂(x_{d−1}) ≤ µ̂(x_d^R) < µ̂(x_d), or the internal contraction x_d^{C_i} = C − β(x_d − C) otherwise, where 0 < β < 1 is a contraction parameter, typically 0.5. If the contraction point improves the simplex, as shown in Fig. 3c, accept it and go to Step 4; otherwise, go to Step 3.
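The initialization and the reflection and expansion trial points can be sketched in a few lines of NumPy. This is a minimal illustration of Equation 3 and Steps 1-2, assuming the simplex rows are kept sorted best-first (so the last row is the worst point):

```python
import numpy as np

def initial_simplex(x0, s):
    """Form the d+1 simplex points x_i = x_0 + s_i * e_i (Equation 3)."""
    x0 = np.asarray(x0, dtype=float)
    simplex = [x0.copy()]
    for i in range(x0.size):
        xi = x0.copy()
        xi[i] += s[i]
        simplex.append(xi)
    return np.array(simplex)

def reflect_and_expand(simplex, alpha=1.0, gamma=2.0):
    """Trial points of Steps 1-2, assuming rows sorted best-first."""
    C = simplex[:-1].mean(axis=0)            # centroid of the d best points
    x_refl = C + alpha * (C - simplex[-1])   # reflection of the worst point
    x_exp = C + gamma * (x_refl - C)         # expansion continues the move
    return x_refl, x_exp

S = initial_simplex([0.0, 0.0], s=[1.0, 1.0])   # rows: [0,0], [1,0], [0,1]
xr, xe = reflect_and_expand(S)
```

For the simplex above, with [0,1] as the worst point, the centroid of the two best points is [0.5, 0], the reflection lands at [1, −1] and the expansion at [1.5, −2].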

Step 3: Shrink the Simplex
If neither the reflection point nor the contraction point provides improvement, the simplex is shrunk toward the best point x_0, as shown in Fig. 3d. Compute the new simplex as follows (Equation 4):

x_i = x_0 + τ(x_i − x_0),  i = 1, 2, …, d  (4)

where τ is a shrink parameter, typically 0.5. The objective function values of these d new points are estimated from m simulation outputs each and count = count + md. Then go to Step 4.
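The shrink step of Equation 4 is a one-line array operation. The sketch below also shows the effect of Barton and Ivey's larger shrink parameter τ = 0.9 (discussed below), which keeps the simplex larger after a shrink:

```python
import numpy as np

def shrink(simplex, tau=0.5):
    """Shrink all non-best points toward the best point x_0 (Equation 4):
    x_i <- x_0 + tau * (x_i - x_0), i = 1..d."""
    x0 = simplex[0]
    shrunk = simplex.copy()
    shrunk[1:] = x0 + tau * (simplex[1:] - x0)
    return shrunk

S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S2 = shrink(S, tau=0.5)   # edges halved
S3 = shrink(S, tau=0.9)   # edges only slightly reduced
```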

Step 4: Re-Order the Simplex Points in Ascending Order
Then let j = j+1.
End while. Return x* and µ̂(x*).

Barton and Ivey (1996) adapt the NM algorithm to accommodate stochastic variations in the objective function values. From empirical results, they observe that because the NM algorithm relies on the ranks of the objective function values at the simplex vertices, it can make progress in the presence of relatively small randomness that does not change the rank of the function values at the simplex points. However, if the variations in the function values are large enough, they affect the relative rank of the simplex vertices and mislead the algorithm. Barton and Ivey (1996) recommend a shrink coefficient (τ in Equation 4) of 0.9 instead of the usual 0.5 to lessen the extent of reduction after a shrink. This change improves performance effectively in cases where the original NM fails. Resampling the best point after a shrink reduces the frequency of contraction, but this strategy is not effective in improving algorithm performance.

Moreover, Barton and Ivey (1996) apply another stopping criterion for stochastic problems, as suggested by Dennis and Woods (1987). The search terminates when the simplex size is sufficiently small (Equation 5):

max_{1≤i≤d} ‖x_i − x_0‖ / max(1, ‖x_0‖) ≤ ε_x  (5)

where ‖·‖ is the Euclidean norm. Besides the stopping criterion in Equation 5, their so-called RS9 is the NM with the following modifications: the objective function is estimated with 6 independent simulation outputs (m = 6 in Equation 2); every solution is resampled every time it is encountered (no search history is kept); and the shrink coefficient τ is 0.9. The rescaling operations of the NM algorithm can lead to a too-early termination at a non-optimum if noise is present. Tomick et al. (1995) modify the RS9 further to allow the sample sizes to adjust adaptively to the observed noise in the solution space; the result is called ANRS9. Suppose that m_j is the minimum number of observations taken at each new trial point during the j-th iteration and m_0 = 6; the next sample size m_{j+1} is determined by Equation 6, where b = 1.25 is the factor by which the sample size is increased, d is the dimension of the decision variable x and ⌊x⌋ denotes the largest integer not greater than x.
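The Dennis-and-Woods-style stopping test of Equation 5 can be sketched as follows, assuming the relative-size form with the Euclidean norm and a simplex whose rows are sorted best-first (so row 0 is x_0):

```python
import numpy as np

def simplex_is_small(simplex, eps_x=1e-4):
    """Stop when the largest edge from the best point x_0, scaled by
    max(1, ||x_0||), falls below the tolerance eps_x (Equation 5)."""
    x0 = simplex[0]
    size = max(np.linalg.norm(xi - x0) for xi in simplex[1:])
    return size / max(1.0, np.linalg.norm(x0)) <= eps_x

wide = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tiny = np.array([[0.0, 0.0], [1e-5, 0.0], [0.0, 1e-5]])
```

Here `wide` fails the test while `tiny` passes it, so the search would terminate on `tiny`.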

New Variants of the Nelder-Mead Simplex Algorithm
We are motivated by several general-purpose optimization algorithms for deterministic problems that are based on a neighborhood search; for example, the very large-scale neighborhood search (Pichitlamken and Nelson, 2003) and neighborhood search based on tabu search and complete local search with memory for solving the uncapacitated facility location problem (Pichitlamken et al., 2006). At each iteration, the search moves from the current solution to one of its neighbors that is better than itself and any other solution in the neighborhood. The neighborhood search strategy and statistical selection of the best are used in OvS in Tomick et al. (1995), followed by a framework for OvS in More et al. (1981).
We define the neighborhood N(x) of a solution x = [x_1, x_2, …, x_d] as the set of all already-seen solutions that lie inside the region (Equation 7):

N(x) = {y already visited : ‖y − x‖ ≤ ε}  (7)

where ε is the user-specified maximum neighborhood distance. We exclude any neighbors that lie outside the feasible space Θ, or ones that are farther than ε from the given x.
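A sketch of the neighborhood lookup in Equation 7, assuming visited solutions are stored as tuples in a memory dictionary (the sample data below is hypothetical):

```python
import numpy as np

def neighborhood(x, visited, eps=0.01):
    """All already-seen solutions within Euclidean distance eps of x."""
    x = np.asarray(x, dtype=float)
    return [y for y in visited if np.linalg.norm(np.asarray(y) - x) <= eps]

# Hypothetical memory: solution -> list of recorded simulation outputs.
visited = {(0.0, 0.0): [1.2, 0.9],
           (0.005, 0.0): [1.1],
           (1.0, 1.0): [3.0]}
nbrs = neighborhood([0.0, 0.0], visited, eps=0.01)
```

Only the two solutions within distance 0.01 of the query point are returned; the distant solution (1.0, 1.0) is excluded.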
The aim of using past information is to save simulation effort by avoiding sampling at every encounter. In our implementation of the NM search, we compare the sample averages of all candidates and select the one with the smallest average as the winner.
We propose two NM-based algorithms with memory as follows.

The Nelder-Mead Selection with Memory (NMSM)
Simulation outputs that have been obtained for visited solutions and their candidate solutions are kept in a database, and they replace new sampling. Nevertheless, for already-seen solutions, NMSM adds one simulation output every time a solution is encountered again, so as to protect the search from an unusually good or bad history.
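The memory mechanism can be sketched as follows. Solutions closer than the tolerance are treated as the same solution (here via rounding, an implementation choice of ours), stored outputs are reused on revisits, and one fresh output is added per revisit; the noisy model at the bottom is hypothetical:

```python
import numpy as np

class SolutionMemory:
    """Database of past simulation outputs, keyed by rounded solution vectors."""

    def __init__(self, simulate, m=6, tol=1e-4, seed=None):
        self.simulate = simulate   # stochastic model: simulate(x, rng) -> output
        self.m = m                 # outputs taken on a first visit
        self.tol = tol             # solutions closer than tol are "the same"
        self.rng = np.random.default_rng(seed)
        self.db = {}               # solution key -> list of outputs
        self.count = 0             # total simulation outputs consumed

    def _key(self, x):
        return tuple(int(round(v / self.tol)) for v in x)

    def estimate(self, x):
        key = self._key(x)
        if key in self.db:
            # Revisit: reuse stored outputs, add just one fresh output.
            self.db[key].append(self.simulate(x, self.rng))
            self.count += 1
        else:
            # First visit: take m outputs, as in the plain NM/RS9.
            self.db[key] = [self.simulate(x, self.rng) for _ in range(self.m)]
            self.count += self.m
        return float(np.mean(self.db[key]))

# Hypothetical noisy model for illustration.
noisy_model = lambda x, rng: 1.0 + sum(v * v for v in x) + rng.normal(0.0, 0.1)
mem = SolutionMemory(noisy_model, m=6, seed=0)
first = mem.estimate([0.5, 0.5])   # 6 fresh outputs
second = mem.estimate([0.5, 0.5])  # history reused + 1 fresh output
```

After the two calls, only 7 simulation outputs have been consumed, instead of the 12 that resampling from scratch would cost.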

The Nelder-Mead Selection with Memory and using Neighborhood (NMSMN)
The NMSMN is the NMSM integrated with a neighborhood. It constructs a neighborhood for every vertex of the simplex and estimates their objective function values. The best solution in the neighborhood replaces the original vertex. The advantages of the NMSMN are that it utilizes past information from previous encounters and also augments the search area with the neighborhood. Let N(x_j) be the neighbor set of x_j with ε = 0.01 in Equation 7. Then V_j is the collection of visited solutions and their neighborhoods up to the j-th iteration, where V_0 = ∅ and, if x ∉ V_j, V_j = V_{j−1} ∪ N(x_j). The NM is changed as follows: for x ∈ {x_0, x_1, x_2, …, x_d, x^R, x^E, x^C}, before generating new simulation outputs, we check whether they are already-visited solutions, using the algorithm in Fig. 4.

The Adaptive Nelder-Mead Selection with Memory (ANSM)
The ANSM is the NMSM, but when the search reaches the step of updating the simplex, the sample sizes m_j are adjusted adaptively. Step 4 of the NMSM is modified as follows. Step 4: Re-order the simplex points in ascending order, then compare the updated estimated objective function values of the simplex points to determine m_{j+1} by Equation 6. Let j = j + 1.

The Adaptive Nelder-Mead Selection with Memory and using Neighborhood (ANSMN)
This modification applies the adaptive strategy of the ANSM to the NMSMN. It combines the advantages of utilizing the memory of revisited solutions and their neighbors with efficiently spending resources to ensure further progress toward the minimum objective function value.

Numerical Experiments
Section 2.3.1 describes a set of test functions and their starting solutions. Section 2.3.2 explains the main figures of merit that we use to evaluate and compare the performance of the NM modifications. Section 2.3.3 discusses the empirical test setup.

Test Functions
We test the unconstrained optimization algorithms on a set of six deterministic test functions. Test Functions 1-5 are taken from More et al. (1981); they were also used in Humphrey and Wilson (2000) and Barton and Ivey (1996) for optimization of noisy responses. Test Function 6 is adapted from Neddermeijer et al. (2000). Each of these deterministic test functions has a unique optimum.

Test Function 1: Paraboloid Function
The paraboloid function is defined as:

g(x) = 1 + Σ_{j=1}^{d} x_j²

The starting point is given by x = [d, d, …, d]. The optimal function value g* = 1 is achieved at the point x* = [0, …, 0]. Figure 5 depicts the paraboloid function for the case d = 2. This function is convex, symmetric and has only one minimum point. It is easy to optimize if no noise exists; however, when noise is present, optimization becomes difficult.
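Since g* = 1 for the test functions in this study, the noise levels of the experiments scale directly off g(x*). A sketch of the paraboloid and a noisy-observation wrapper (the function form 1 + Σ x_j² is consistent with g* = 1 at the origin; the wrapper is our illustration):

```python
import numpy as np

def paraboloid(x):
    """Paraboloid test function: g* = 1 at x* = [0, ..., 0]."""
    return 1.0 + float(np.sum(np.asarray(x, dtype=float) ** 2))

def noisy(g, x, level, rng, g_star=1.0):
    """Observe g(x) with zero-mean Gaussian noise of sd = level * g(x*)."""
    return g(x) + rng.normal(0.0, level * g_star)

rng = np.random.default_rng(0)
y = noisy(paraboloid, [2.0, 2.0], level=0.75, rng=rng)  # around g([2,2]) = 9
```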

Test Function 2: Variably Dimensioned Function
The variably dimensioned function is given by:

g(x) = 1 + Σ_{j=1}^{d} (x_j − 1)² + [Σ_{j=1}^{d} j(x_j − 1)]² + [Σ_{j=1}^{d} j(x_j − 1)]⁴

The starting point is given by x = [x_1, x_2, …, x_d], where x_j = 1 − (j/d), j = 1, 2, …, d. The optimal function value g* = 1 is achieved at the point x* = [1, …, 1]. Figure 6 depicts the variably dimensioned function for d = 2. The surface is a U-shaped valley crossing a flat area. There appear to be numerous local minima in the flat region, but only one unique global minimum exists.

Test Function 4: Extended Rosenbrock
The extended Rosenbrock function is defined as:

g(x) = 1 + Σ_{j=1}^{d/2} [100(x_{2j} − x_{2j−1}²)² + (1 − x_{2j−1})²]

Test Function 5: Brown's Almost-Linear Function
The Brown's almost-linear function is given by:

g(x) = 1 + Σ_{i=1}^{d−1} [x_i + Σ_{j=1}^{d} x_j − (d+1)]² + [Π_{j=1}^{d} x_j − 1]²

The solution x = [1/2, …, 1/2] is used as the starting point. The optimal function value g* = 1 is achieved at the point x* = [λ, …, λ, λ^{1−d}], where λ satisfies dλ^d − (d+1)λ^{d−1} + 1 = 0. Humphrey and Wilson (2000) compute the value of λ as 0.5 for d = 2. Figure 9 illustrates the Brown's almost-linear function for d = 2. The function is not linearly separable and has the basic form of a nonlinear least-squares problem.

Test Function 6: Symmetrical Gaussian Function
The symmetrical Gaussian function is defined as:

Search Performance Measures
When the search terminates, the optimal solution can be estimated in at least 3 ways: the solution on hand, the most frequently visited solution, or the solution with the best cumulative average (Banks, 1998; Andradottir, 1999). Our preliminary experiments find that the solution on hand outperforms the other estimates of the optimal solution. Motivated by Humphrey and Wilson (2000), we evaluate search performance via the average of the following performance measures over many replications:

Logarithm of the Total Number of Simulation Outputs
To measure the computational work performed by a simulation optimization procedure, we compute:

L ≡ ln(total number of simulation outputs)

It provides at best a rough indication of the total computational work required by the procedure.

Deviation of the Best Estimated Optimal Function Value from the True Optimal Value
As a measure of the accuracy of the best result delivered by a simulation optimization procedure, we consider the relative deviation D of the estimated optimum from the true optimum (Equation 9). This measure cannot be employed for Test Function 1, since each coordinate of the true optimum of the paraboloid function is equal to zero, which makes the relative deviation undefined.

Empirical Test Setup
Our implementations run in MATLAB by modifying the fminsearch function. The NM coefficients are as follows: α = 1, γ = 2, β = 0.5 and τ = 0.5. The initial step size s_i, as shown in Equation 3, is 10^−4. The minimum deviations ε_x and ε_µ are 10^−4. The maximum budget consumptions N_µ and N_search are 10^5. To estimate the objective function, the sample size m is 6. The maximum neighborhood distance ε is 0.01. The three levels of the standard deviation of the random noise ξ_x are {0.75g(x*), 1.00g(x*), 1.25g(x*)}. The factor b for increasing the sample size is 1.25, as in Tomick et al. (1995).

RESULTS

Table 1 shows the budget consumption of each algorithm until the computation budget is exhausted or until the search is unable to obtain any improvement. Table 2 contains the average deviation in estimating the objective function values, as defined in Equation 9. As expected, when the degree of randomness increases, a given test problem becomes more difficult: most algorithms fare worse and their estimated optimal solutions are farther from the true optimal solutions, whereas all algorithms give smaller D at lower random noise, meaning that µ̂(x*) is not far from µ(x*). In contrast, Test Function 4 is the most difficult to optimize even when the level of randomness is low.
Moreover, using the adaptive strategy with memory gives the smallest D, although the adaptive feature requires more computational effort. For example, for Test Function 1 at all random-noise levels, ANRS9 consumes more computational resources than RS9, but it is rewarded with better estimated optimal solutions, i.e., smaller D. Similar results can be observed for ANSM+RS9 and ANSMN+RS9 in comparison with NMSM and NMSMN, respectively. Considering D, no algorithm decisively wins at all noise levels, but almost all winning algorithms involve the adaptive method. Similarly, when we consider L, no algorithm outperforms completely at all noise levels.
For better comparison, we examine the relative ratios of both L and D between pairs of algorithms. The data are divided into 3 sets. First, comparing the adaptive and non-adaptive methods, e.g., ANRS9/RS9, ANSM+RS9/NMSM+RS9 and ANSMN+RS9/NMSMN+RS9: for most test functions and almost all noise levels, the adaptive methods spend up to 79% more computational effort than the non-adaptive methods, but they give better estimates of the optimal solutions, reducing D by up to 50%. For example, at σ_x = 1.00g(x*) for Test Function 2, all adaptive algorithms (ANRS9, ANSM+RS9 and ANSMN+RS9) yield estimates of the optimal solution closer to the true optimum than the non-adaptive algorithms (RS9, NMSM+RS9 and NMSMN+RS9, respectively); D is reduced by approximately 30%, although they spend more computational effort. There are exceptions: for σ_x = 0.75g(x*) on Test Function 3, the adaptive algorithms with memory (ANSM+RS9 and ANSMN+RS9) do not give better improvement than their non-adaptive counterparts (NMSM+RS9 and NMSMN+RS9, respectively); and for Test Function 4 with σ_x ∈ {0.75g(x*), 1.00g(x*)}, the adaptive algorithms with memory move away from the true optimum, i.e., D increases. For Test Functions 1, 3, 5 and 6, D increases when the standard deviation of the random noise goes up, across all algorithms. For Test Function 2, almost every algorithm also exhibits this pattern, except ANSM+RS9. Second, comparing the memory and non-memory methods, e.g., NMSM/NM, NMSM+RS9/RS9 and ANSM+RS9/ANRS9: the results show that memory deployment saves on resource consumption, spending on average 77% less than the non-memory counterparts over all test functions and noise levels; e.g., NMSM+RS9 and ANSM+RS9 consume about 30% fewer resources than RS9 and ANRS9, respectively.
This is because solutions that are less than ε_x apart are classified as the same solution. If the search revisits already-seen parts of the solution space, it may use the sampling data from the previous visits instead of resampling anew. For most test functions and noise levels, utilizing memory reduces the deviation D by up to 80% relative to not utilizing memory, aside from Test Function 4 with the adaptive methods. For example, on Test Function 5 at all noise levels, NMSM, NMSM+RS9 and ANSMN+RS9 give better optimal solutions, by 6% to 49%, than NM, RS9 and ANRS9, respectively.

The remaining comparison is between the neighbor and non-neighbor methods, e.g., NMSMN/NMSM, NMSMN+RS9/NMSM+RS9 and ANSMN+RS9/ANSM+RS9. For all noise levels and almost all test functions (except Test Function 4), the non-adaptive methods that incorporate the neighbor structure neither save computational effort nor improve the estimates of the optimal solution; e.g., on Test Functions 2 and 6, NMSMN and NMSMN+RS9 give results on D and L that are indistinguishable from NMSM and NMSM+RS9, respectively. In other words, on non-adaptive algorithms the neighbor structure is greedy and can mislead the search to a non-optimum compared with the algorithms without the neighbor structure. On the other hand, ANSMN+RS9 provides a better optimal solution and spends less computational effort than ANSM+RS9, by 32% and 29%, respectively. These results show that using the neighbor structure on adaptive algorithms provides improved estimated optimal solutions. The search with good performance when there is no limitation on computational resources is ANSMN+RS9, because it gives the least D at all noise levels for most test functions, except Test Function 4. If computational resources are limited, NMSM+RS9 performs better.

DISCUSSION
We show that the Nelder-Mead algorithm, which is designed for deterministic optimization, can be modified to accommodate stochastic outputs. Using past information generally decreases computational effort and does not jeopardize performance significantly. Although all adaptive-with-memory algorithms spend more computational resources, they give better optimal solutions compared with their corresponding non-adaptive ones. Exploiting the neighborhood and utilizing memory are helpful for the adaptive algorithms, but not for the non-adaptive ones. Search performance is improved with either ANSM+RS9 or ANSMN+RS9 when resource consumption is unconstrained, or with NMSM or NMSMN for economical resource consumption.

CONCLUSION
Utilizing past information in continuous optimization saves computational resources. To improve the estimated optimal solutions when computational effort is not limited, past information should be incorporated into the adaptive method. For future work, we will extend our algorithms to discrete search spaces and apply simulation to decision making, such as queueing or inventory problems.