A Modified Conjugate Gradient Formula for Back Propagation Neural Network Algorithm

: Problem statement: The Conjugate Gradient (CG) algorithm which usually used for solving nonlinear functions is presented and is combined with the modified Back Propagation (BP) algorithm yielding a new fast training multilayer algorithm. Approach: This study consisted of determination of new search directions by exploiting the information calculated by gradient descent as well as the previous search direction. The proposed algorithm improved the training efficiency of BP algorithm by adaptively modifying the initial search direction. Results: Performance of the proposed algorithm was demonstrated by comparing it with the Neural Network (NN) algorithm for the chosen test functions. Conclusion: The numerical results showed that number of iterations required by the proposed algorithm to converge was less than the both standard CG and NN algorithms. The proposed algorithm improved the training efficiency of BP-NN algorithms by adaptively modifying the initial search direction.


INTRODUCTION
Gradient based methods are one of the most widely used error minimization methods used to train back propagation networks. The BP training algorithm is a supervised learning method for multi-layered feedforward neural networks [1] . It is essentially a gradient descent local optimization technique which involves backward error correction of the network weights. Despite the general success of BP in learning the neural networks, several major deficiencies are still needed to be solved. First, the BP algorithm will get trapped in local minima especially for non-linearly separable problems [2] . Having trapped into local minima, BP may lead to failure in finding a global optimal solution. Second, the convergence rate of BP is still too slow even if learning can be achieved. Furthermore, the convergence behavior of the BP algorithm depends very much on the choices of initial values of connection weights and the parameters in the algorithm such as the learning rate and the momentum.
Improving the training efficiency of neural network based algorithm is an active area of research and numerous papers have been proposed in the literature. Early days BP algorithm saw further improvements. Later, as summarized by Bishop [3] various optimization techniques were suggested for improving the efficiency of error minimization process or in other words the training efficiency. Among these are methods of Fletcher and Powell [4] and the Fletcher-Reeves [5] that improve the conjugate gradient method of Hestenes and Stiefel [6] and the family of Quasi-Newton algorithms proposed by Huang [7] .
It has been recently shown that a BP algorithm using gain variation term in an activation function converges faster than the standard BP algorithm [8][9][10] . However, it was not noticed that gain variation term can modify the local gradient to give an improved gradient search direction for each training iteration. This fact allow us to develop and investigate several important CG-formulas in order to improve the rate of convergence of the proposed type of algorithms.
This study suggests that a simple modification to the search direction can also substantially improve the training efficiency of almost all major (well Known and developed) optimization methods. It was discovered that if the search direction is locally modified by a gain value used in the activation function of the corresponding node, significant improvements in the convergence rates can be achieved irrespective of the optimization algorithm used. Furthermore the proposed algorithm is robust, easy to compute and easy to implement into several well known nonlinear test problems as will be shown later.
The remaining of the study is organized as follows: Materials and methods are proposed first. The outlines and the implementation of the proposed algorithm with CG algorithm have been discussed second. Experimental results are discussed after that and finally, the concluding remarks with short discussion for further research are listed later on.

MATERIALS AND METHODS
First, a novel approach for improving the training efficiency of gradient descent method (BP algorithm) which was presented by Nawi et al. [12] are discussed. Their method modified the initial search direction by changing the gain value adaptively for each node. Following their iterative method for changing the initial search direction using a gain value: Algorithm [12] : Initialize the weight vector with random values and the vector of gain values with one. Repeat the following steps 1, 2 and 3 on an epoch-by-epoch basis until the given error minimization criteria are satisfied.
Step 1: By introducing gain value into activation function, calculate the gradient for weight vector by using Eq. 6 and gradient for gain value by using Eq. 9.
Step 2: Calculate the gradient descent on error with respect to the weights and gains values.
Step 3: Use the gradient weight vector and gradient of gain calculated in step 1 to calculate the new weight vector and vector of new gain values for use in the next epoch.
In general, the objective of a learning process in neural network is to find a weight vector w which minimizes the difference between the actual output and the desired output. Namely: Suppose for a particular input pattern o 0 and let the input layer is layer 0. The desired output is the teacher pattern t = [t 1 …t n ] T and the actual output is L i o , where L denotes the output layer. Define an error function on that pattern as: The overall error on the training set is simply the sum, across patterns, of the pattern error E. Consider a multilayer feed Forward Neural Network (FNN) [1] has one output layer and one input layer with one or more hidden layers. Each layer has a set of units, nodes, or neurons. It is usually assumed that each layer is fully connected with a previous layer without direct connections between layers which are not consecutive. Each connection has a weight. By introducing "gain variation" term in this activation function, then only updating formulas for s 1 δ are changed while others are the same as the standard back propagation. To simplify the calculation, taken from the Eq. 2 we then can perform gradient descent on E with respect to s ij w . The chain rule yields: where, s In particular, the first three factors of (4) indicate that: As we noted that the iterative formula (5) for s j δ is the same as standard BP [11] except for the appearance of the value gain. By combining (4) and (5) yields the learning rule for weights: where, η is a small positive constant called 'step length' or 'learning rate' and the search direction or gradient vector is In this approach, at step n is the calculation for gradient of error g (n) is modified by including the variation of gain value to yield: The gradient descent on error with respect to the gain can also be calculated by using the chain rule as previously described; it is easy to compute as: Then the gradient descent rule for the gain value becomes: At the end of each iteration the new gain value is updated using a simple gradient based method as given by the formula: For more details of this method and its implementations [12] .

CG-type method:
In order to overcome the difficulties of the standard delta rule learning technique, where this turning through the steepest descent direction. Which causes the zigzag problematic, where it increases the turning time. To improve the global rate of convergence of the standard delta rule learning technique, we have suggested the new delta rule based on CG type method. The CG method is an iterative optimization approach, the gradient is n-components vector and the gradient direction is called the steepest ascent or descent (the Fletcher-Reeves method tries to exploit the fact that for a quadratic function of n variables, n linear searches along the mutually conjugate directions will locate the minimum). But this direction has a local property and not a global one. Thus any method which makes use of the gradient vector, can be expected to give the minimum point. The steps of CG algorithm are:

Algorithm (CG-based NN-algorithm):
• Select the initial weights randomly with small values and set p = 1 • Select the next training pair from the training set [T p ] apply the input vector [x p ] to the network input and specify the desired output vector. Then calculate the actual outputs of the network by using these two formulae: = − +β and find the corresponding step size (µ i ) by updating the object (x) to minimize the value of object (x) and adjust the weights of the network in a way that minimizes the error.
When (x i+1 ) is minimum terminate the iterations; if not, continue to repeat step (3) for each vector in the training set until the total error for the entire set is acceptable low, where d i is the update direction of the object at the iteration and d i , the initial direction, is chosen to be the steepest descent direction, µi is the step size along the direction d i ; β i is chosen to insure that the direction vector is conjugate to all previous directions and g i is the gradient of the activation function of the iteration. There is a number of different choices for β i .
In this study, Fletcher and Reeves has been used. It is one of the most commonly used and its formula is: The other type of β i is Al Assady and Al Bayati and Polak and Ribere. The formula of β i of these two types are: Following we are going to investigate and develop a new formula for the extended FR formula which is based on the non-quadratic models, this will increase the rate of convergence of the In fact, these quantities are well-known and derived with respect of quadratic models, certainly the FR formula has a standard FRCGmethod. Now most of the currently used optimization methods use a local quadratic representation of the objective function. But the use of quadratic model may be inadequate to incorporate all the information so that more general models than quadratic are proposed as a basis for CG algorithms.
It is shown that CG methods with T i 1 i 1 g g + + in the numerator of β i has strong global convergence theorems with exact and inexact line searches but has poor performance in practice. On the other hand the CG methods with T i 1 i g y + in the numerator of β i has uncertain global convergence for general non-linear functions, but has good performance in practice. Therefore CG methods has been frequently modified and improved by many authors, for example [14][15][16] have proposed further modifications of the conjugate gradient methods which are based on some non-quadratic models.
Here we generalize the FRCG by considering more general function than quadratic which we call it as quasi-sigmoid function, our goal is to preserve and increase the convergence properties of FR algorithm and to force its performance in practice.

A generalized FRCG algorithm (based on nonquadratic model):
If q(x) is a quadratic function defined by: Where: G = n×n symmetric and positive definite matrix b = Constant vector in R n and c is constant Then we say that f is Defined as a nonlinear scaling of q(x) if the following conditions hold: And: The following proportions are immediately derived from the above conditions: • Every contour line of q(x) is a contour line of f • If x • is minimize of q(x) then it's also a minimize of f In this area there are various published works. The special polynomial case: where, ε 1 , ε 2 scalars are has been investigated by [16] . A rational model has been developed by [15] where: Another rational model was considered by [14] where: A new extended CG-formula: Here we consider the new model function defined by: We call it as quasi sigmoid function where q>0 with further assumption that: From Eq. 20 we have: Return to Eq. 20 and solve it for q assuming that Then we can compute the value of ρ i using function value at two points x k+1 and x k from Eq. 24: Where: Our objective is to investigate an extended FRCG method applied to function F(q(x)) obtained by nonlinear scaling of q(x). The extended FRCG method can be done by modifying the search directions as follows: And for i≥0: Where: Where: The method of FRCG applied to f(x) = q(x) and the extended (ERFCG) method defined by (27-29) and applied to f(x) = F(q(x)) with ρ i defined by (26) generate identical set of directions and identical sequence of points {x i }.
Proof: The prove is by induction for k = 0 we have: Hence the two methods generates the same set of directions.
Our proposed algorithm known as (NEW) begins the minimization process with an initial estimate w 0 and an initial search direction as: Then, for every epoch by using our proposed method in Eq. 7 the search direction at (n+1) th iteration is calculated as: where the scalar β n is to be determined by the requirement that d n and d n+1 must fulfill the conjugacy property [3] and (ρ n ) is defined by Eq. 26.

Outlines of the new algorithm:
Step 1: Initializing the weight vector randomly, the gradient vector g 0 = 0 and gain value as one.
Let the first search direction d 0 = g 0 . Set β 0 = 0, epoch = 1 and n = 1. Let N t is the number of weight parameters Select a Convergence Tolerance CT.
Step 2: At step n, evaluate gradient vector g n (c n ) with respect to gain vector c n and calculate gain vector.
Step 3: Evaluate E(w n ). If E(w n )<CT then STOP training ELSE go to step 4. Step 6: If [(epoch +1)/Nt] = 0 THEN 'restart' the gradient vector with d n = g n-1 (c n-1 ) ELSE go to step 7.
Step 7: Calculate the optimal value for learning rate η* by using line search technique such as: Step 8: Update Step 11: Set n = n+1 and go to step 2.

RESULTS
The performance criterion used in this research focuses on the speed of convergence, measured in number of iterations and CPU time. The test problems used to verify our new proposed algorithm are taken from the open literature by Prechelt [11] . The numerical results have been carried out on a Pentium IV using MATLAB version 7.0. On our test problems, four algorithms have been computed. The first algorithm is the standard CG with Fletcher-Reeves update (traincgf). The other algorithm is a standard CG algorithm computed from the NAG library (CGFR). The third one is the Nawi et al. [12] NN-algorithm computed by using their algorithm with their computer program listed in [12] . Finally, the new proposed ECG based NN algorithm (NEW) which is implanted by modifying NN-algorithm. This algorithm uses a major change in calculating the gradient step by in forcing the value of the new parameter ρ i which is given in Eq. 25 of this study. We have noticed that if the value of this parameter is one, then, this new algorithm will coincide with Nawi et al. [12] algorithm but for positive values the new algorithm produces better numerical results because of the use of the non-quadratic model in the derivation of the CG-formula To compare the performance of the proposed algorithm with respect to others: standard optimization algorithms from the (MATLAB NNtoolbox) network parameters such as network size and architecture (number of nodes, hidden layers etc), values for the initial weights and gain parameters were kept same. For our test problem the neural network had one hidden layer with five hidden nodes and sigmoid activation function was used for all nodes. All algorithms were tested using the same initial weights that were initialized randomly from range [0, 1] and received the input patterns for training in the same sequence.
Default values were used for the heuristic parameters, of the above algorithms, unless stated otherwise. For the purpose of comparison, all tested algorithms were fixed with the values of learning rate = 0.3 and momentum term = 0.4. The initial value used for the gain parameter was one. The results of all these four algorithms will be presented as Table 1 and 2 which summarize the performance of the algorithms for simulations that have reached solution. All algorithms were trained with 100 trials, if an algorithm fails to converge, it is considered that it fails to train the FNN, but its epochs, CPU time and generalization accuracy are not included in the statistical analysis of the algorithms.
Test problem (1) Fisher [13] : Iris classification problem: This is a classical classification dataset made famous by Fisher [13] , who used it to illustrate principles of discriminate analysis.  This is perhaps the best-known database to be found in the pattern recognition literature. Fisher's study is a classic in the field and is referenced frequently to this day. The selected architecture of the FNN is (4-5-3-3) with target error was set as (0.01) and the maximum epochs to (2000). Table 1 shows that the proposed algorithm reached the target error after only about (25) epochs as opposed to the standard CGFR at about (39) epochs and clearly we see that there is an improvement ratio, nearly (3.6), for the number of epochs compare to NN-toolbox and almost (3.681) for the convergence time. The following main computer program used with Nawi et al. [12] routine is: Test problem (2) Fisher [13] : Winconsin breast cancer problem: This dataset was created based on the 'breast cancer Wisconsin' problem dataset from [12] . This problem tries to diagnosis of breast cancer by trying to classify a tumor as either benign or malignant based on cell descriptions gathered by microscopic examination. The selected architecture of the FNN is (9-5-2-2). The target error is set as to (0.015) and the maximum epochs to (2000).
In Table 2, it is worth noticing that the performance of the new algorithm since it take only (33) epochs to reach the target error compare to Nawi et al. [12] (39) epochs and to CGFR at about (65) epochs and worst for traincgf that need about (71) epochs to converge. Still the proposed algorithm outperforms others three algorithms with a considerable improvements.

DISCUSSION
In this study, we have introduced neural network algorithm based on several optimization update. The new algorithm is compared with three well known standard CG and NN algorithms using the Iris Classification Problem and Winconsin Breast Cancer Problem. Our numerical results indicate that the new technique has an improvements of about (10-15%) NOE; CPU and CT tools.

CONCLUSION
In this research, new fast learning algorithm for neural networks which are based on CG updates with adaptive gain training algorithm (NEW) is introduced. The proposed algorithm improved the training efficiency of BP-NN algorithms by adaptively modifying the search direction. The initial search direction is modified by introducing the gain value. The proposed algorithm is generic and easy to implement in all commonly used gradient based optimization processes. The simulation results showed that the proposed algorithm is robust and has a potential to significantly enhance the computational efficiency of the training process.