Rule Extraction from Radial Basis Functional Neural Networks by Using Particle Swarm Optimization

Abstract: Radial basis functional neural networks (RBFNN) provide an outstanding framework for generating rules to solve pattern classification problems. One of the most important factors in designing an RBFNN is finding the centers and spreads. This paper examines rules extracted from RBF networks trained by Particle Swarm Optimization (PSO). The selection of the RBFNN centers, spreads, and network weights can be viewed as a system identification problem. In our simulations, the RBFNN was applied to the PAT, WBC, and IRIS data sets as classification problems to illustrate the new knowledge extraction technique. The results indicate that training an RBFNN with PSO can provide comparable generalization of rules with less training time.


INTRODUCTION
A Radial Basis Functional neural network (RBFNN) is trained to perform a mapping from an m-dimensional input space to an n-dimensional output space. RBFNNs can be used for discrete pattern classification, function approximation, signal processing, control, or any other application that requires a mapping from an input space to an output space. Many recent developments of RBFNNs and their applications can be found in the Neurocomputing special issues on RBFNNs [1,2].
An RBFNN consists of the m-dimensional input x being passed directly to a hidden layer. Suppose there are c neurons in the hidden layer. Each of the c neurons applies an activation function that is a function of the Euclidean distance between the input and an m-dimensional prototype vector; each hidden neuron contains its own prototype vector as a parameter. The output of each hidden neuron is then weighted and passed to the output layer, so the outputs of the network are sums of the weighted hidden layer activations. Figure 1 shows a schematic form of an RBFNN network. It can be seen from this basic architecture that the design of an RBFNN requires several decisions, including the following:
1. How many neurons will reside in the hidden layer?
2. Where will the prototype vectors (centers) of the hidden neurons be located?
3. What function will be used at the hidden units (i.e., what is the function g(·))?
4. What weights will be applied between the hidden layer and the output layer?
The performance of an RBFNN network depends on the number and location (in the input space) of the centers, the shape of the RBFNN functions at the hidden neurons, and the method used for determining the network weights. Some researchers have trained RBFNN networks by selecting the centers randomly from the training data [3]. Some have used unsupervised procedures (such as the k-means algorithm) for selecting the RBFNN centers [2], while others have used supervised procedures for selecting the RBFNN centers [4]. This study is divided into four parts: the first and second parts give the motivation and a detailed description of the Radial Basis Functional Neural Network (RBFNN); the third part reviews a recent optimization technique, namely Particle Swarm Optimization (PSO); and we end with the rule extraction results for different data types for pattern recognition and future avenues.
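To make this architecture concrete, the following minimal sketch (Python/NumPy; the function and variable names are illustrative, not from the paper) computes the output of an RBFNN with Gaussian hidden units:

```python
import numpy as np

def rbfnn_forward(x, centers, sigmas, W):
    """Forward pass of a Gaussian RBF network.

    x       : (m,)   input vector
    centers : (c, m) prototype vectors, one per hidden neuron
    sigmas  : (c,)   spreads (widths) of the Gaussian hidden units
    W       : (c, n) hidden-to-output weights
    """
    # Euclidean distance between the input and every prototype
    dists = np.linalg.norm(centers - x, axis=1)
    # Gaussian activation g(.) of each hidden neuron
    phi = np.exp(-dists ** 2 / (2.0 * sigmas ** 2))
    # Network outputs are weighted sums of the hidden activations
    return phi @ W
```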
Several training methods separate the tasks of prototype determination and weight optimization for classification and rule generation. This trend probably arose because of the quick training that can result from the separation of the two tasks. In fact, one of the primary contributors to the popularity of RBFNN networks was probably their fast training times compared with gradient descent training (including back propagation). From Figure 1 it can be seen that once the prototypes are fixed and the hidden layer function g(·) is known, the network is linear in the weight parameters w. At that point, training the network becomes a quick and easy task that can be solved via linear least squares. (This is similar to the popularity of the optimal interpolative net, which is due in large part to the efficient non-iterative learning algorithms that are available [5,6].) Training methods that separate the tasks of prototype determination and weight optimization often do not use the full input-output data from the training set for the selection of the prototypes. For instance, the random selection method and the k-means algorithm result in prototypes that are chosen without reference to the target outputs in the training set. Although this results in fast training, it clearly does not take full advantage of the information contained in the training set. Gradient descent training of RBFNN networks has proven to be much more effective than these conventional methods [4]. However, gradient descent training can be computationally expensive. This paper extends the results of [4] and formulates a training method for RBFNNs based on Particle Swarm Optimization. The new method proves to be quicker than gradient descent while providing comparable performance.
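Because the network is linear in w once the prototypes and g(·) are fixed, the output weights can be found in closed form. A sketch of this least-squares step, reusing the conventions of the previous snippet:

```python
import numpy as np

def fit_output_weights(X, D, centers, sigmas):
    """Solve for hidden-to-output weights by linear least squares.

    X : (P, m) training inputs; D : (P, n) desired outputs.
    Returns W of shape (c, n) minimizing ||Phi @ W - D||^2.
    """
    # Hidden-layer activation matrix for all P training patterns
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (P, c)
    Phi = np.exp(-dists ** 2 / (2.0 * sigmas ** 2))
    # Linear least squares: the network is linear in the weights
    W, *_ = np.linalg.lstsq(Phi, D, rcond=None)
    return W
```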
Training a neural network is, in general, a challenging nonlinear optimization problem. Various derivative-based methods have been used to train neural networks, including gradient descent [4] , Kalman Filtering [7,8] , and the well-known back-propagation [9]. Derivative-free methods, including genetic programming [10] , learning automata [11] , and simulated annealing [12] have also been used to train neural networks.
Derivative-free methods have the advantage that they do not require the derivative of the objective function with respect to the neural network parameters. They are more robust than derivative-based methods with respect to finding a global minimum and with respect to their applicability to a wide range of objective functions and neural network architectures. However, they typically tend to converge more slowly than derivative-based methods. Derivative-based methods have the advantage of fast convergence, but they tend to converge to local minima. In addition, due to their dependence on analytical derivatives, they are limited to specific objective functions and specific types of neural network architectures.

INTERPRETATION OF RADIAL BASIS FUNCTION NEURAL NETWORKS
The multilayered feedforward network (MFN) is the most widely used neural network model for pattern classification applications. This is because the topology of the MFN allows it to generate internal representations tailored to classify input regions that may be either disjoint or intersecting. The hidden layer nodes of the MFN can form hyperplanes to partition the input space into various regions, and the output nodes can select and combine the regions that belong to the same class. Back propagation (BP) is the most widely used training algorithm for MFNs.
Recently, researchers have begun to examine the use of Radial Basis Function neural networks (RBFNN) for pattern classification problems due to a number of drawbacks of BP-trained networks. Although a BP network produces decision surfaces that effectively separate training examples of different classes, this does not necessarily result in the most plausible or robust classifier. The decision surfaces of BP networks may not take on any intuitive shapes, because regions of the input space not occupied by training data are classified arbitrarily rather than according to proximity to training data. In addition, BP networks have no mechanism to detect that a case to be classified has fallen into a region with no training data. This is a serious drawback when the system under study operates within a wide range of operating and fault conditions.

The RBFNN consists of an input layer made up of source nodes and a hidden layer of sufficiently high dimension. The output layer supplies the response of the network to the activation patterns applied to the input layer. The nodes within each layer are fully connected to the previous layer, as shown in Figure 1. The input variables are each assigned to a node in the input layer and pass directly to the hidden layer without weights. The hidden nodes, or units, contain the radial basis functions and are represented by the bell-shaped curves in the hidden nodes shown in Figure 1.

RBFNN Algorithm: This section describes how we used an RBFNN network to classify the data sets. The RBFNN used here has an input layer, a hidden layer consisting of Gaussian node functions, an output layer, and a set of weights W connecting the hidden layer and the output layer. We denote by x the input vector to the network, where $x = (x_1, x_2, x_3, \ldots, x_D)$ and D is the embedding dimension. We call o the network output vector, $o = (o_1, o_2, o_3, \ldots, o_n)^T$, where n is the number of output nodes, and we have P training patterns. The RBFNN classification problem is to approximate the mapping from the set of inputs to the set of outputs. For an input vector x(t), the output of the j-th output node produced by the RBFNN is given by

$o_j(t) = \sum_{i=1}^{m_{tot}} w_{ij}\,\varphi_i(t), \qquad \varphi_i(t) = \exp\!\left(-\frac{\lVert x(t) - C_i \rVert^2}{2\sigma_i^2}\right)$

where $C_i$ is the center of the i-th hidden node, $\sigma_i$ is the width of the i-th center, and $m_{tot}$ is the total number of hidden nodes. Using vector notation, let $\varphi(t) = (\varphi_1(t), \varphi_2(t), \ldots, \varphi_{m_{tot}}(t))$ and $w_j = (w_{1j}, w_{2j}, \ldots, w_{m_{tot}j})$; the RBFNN output can then be written as $o_j(t) = w_j \varphi^T(t)$.
The cost function of the network for the j-th output is calculated from the error $e_j = d_j - o_j$, where $d_j$ is the desired output. The RBFNN classifier contains four sets of parameters that have to be learned from the examples: the centers $c_i(t)$, the number of centers $m_{tot}$, the variances $\sigma_i$, and the weights $w_{ij}$. We denote the set of all RBFNN centers by $C_{whole}$. In our implementation of the RBFNN, classes do not share centers. Each class's set of centers is trained with a separate PSO clustering run; in each PSO run (corresponding to a different class), only the training vectors for that class are used for clustering, as described in the next section.
Once the RBFNN centers are initialized by PSO, the weights are updated, and then the centers are updated, by gradient descent on the error (as in [4]). The variances are not updated in this experiment, in order to minimize training time, and thus $\sigma_{ij}(t+1) = \sigma_{ij}(t)$. There are several reasons for using an RBFNN in our classification problem. First, many neural networks require nonlinear optimization for training. Second, the internal representation of training data in an RBFNN is intuitive: each RBFNN center approximates a cluster of training data vectors that are close to each other in Euclidean space. When a vector is input to the RBFNN, the centers near that vector become strongly activated, in turn activating certain output nodes.
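The paper's weight and center update equations did not survive in the text, so the following sketch assumes standard LMS/gradient-descent updates on the squared error; the learning rates eta_w and eta_c are illustrative, and the variances are held fixed as stated above:

```python
import numpy as np

def lms_update(x, d, centers, sigmas, W, eta_w=0.01, eta_c=0.001):
    """One assumed gradient step on the squared error e = d - o.

    The original update equations are not given in the text; this is a
    standard LMS-style assumption. Arrays must be float (updated in place).
    """
    diff = centers - x                                        # (c, m)
    phi = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigmas ** 2))
    e = d - phi @ W                                           # output error
    # Center update uses the pre-update weights: each center moves toward x
    # in proportion to its activation and back-propagated error (W @ e)
    centers += eta_c * ((phi * (W @ e)) / sigmas ** 2)[:, None] * (x - centers)
    # Weight update: w_ij <- w_ij + eta_w * phi_i * e_j
    W += eta_w * np.outer(phi, e)
    # Variances sigma are deliberately left unchanged, as in the experiment
    return centers, W
```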
The hypothesis space implemented by these learning machines consists of functions of the form

$f(x) = \sum_{k} w_k\,\Phi_k\!\left(\lVert x - v_k \rVert\right)$

The nonlinear activation function $\Phi_k$ expresses the similarity between any input pattern x and the center $v_k$ by means of a distance measure. Each function $\Phi_k$ defines a region in the input space (its receptive field) on which the neuron produces an appreciable activation value. In the common case when the Gaussian function is used, the center $v_k$ of the function $\Phi_k$ defines the prototype of input cluster k, and the variance $\sigma_k^2$ defines the size of the covered region in the input space.
The local nature of RBFNN networks makes them an interesting platform for performing rule extraction. However, the basis functions overlap to some degree in order to give a relatively smooth representation of the distribution of the training data [13,14]. This overlapping is a shortcoming for rule extraction, and only a few rule extraction methods directed at RBFNNs have been developed [15,16,17].
The rule extraction method for RBFNNs derives descriptions in the form of ellipsoids. Initially, a partition of the input space is made by assigning each input pattern to its closest RBFNN center according to the Euclidean distance function; assigning a pattern to its closest center is equivalent to assigning it to the RBFNN node that gives the maximum activation value for that pattern. From these partitions the ellipsoids are constructed. Next, a class label is assigned to each RBFNN unit's center, using the output value of the RBFNN network for that center. Then, for each node, an ellipsoid is constructed from the data of the associated partition. Once the ellipsoids are determined, they are translated into rules. This procedure generates one rule per node; a minimal sketch is given below.
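The sketch below follows this procedure under the simplifying assumption that each ellipsoid is approximated by axis-aligned per-feature intervals over its partition, which is one common way to turn an ellipsoid into an if-then rule (all names are illustrative):

```python
import numpy as np

def extract_rules(X, node_labels, centers):
    """Derive one if-then rule per RBF node from its data partition.

    X           : (P, m) training inputs
    node_labels : class label assigned to each center (via the network output)
    centers     : (c, m) RBF centers
    """
    # 1. Partition: assign each pattern to its closest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    owner = np.argmin(dists, axis=1)
    rules = []
    for k in range(len(centers)):
        part = X[owner == k]
        if len(part) == 0:
            continue
        # 2. Ellipsoid approximated by per-feature [low, high] intervals
        low, high = part.min(axis=0), part.max(axis=0)
        conds = [f"{low[i]:.3f} <= x{i} <= {high[i]:.3f}"
                 for i in range(X.shape[1])]
        # 3. One rule per node, labeled with the node's class
        rules.append("IF " + " AND ".join(conds) +
                     f" THEN class = {node_labels[k]}")
    return rules
```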

PARTICLE SWARM OPTIMIZATION
Particle Swarm Optimization (PSO) is a population-based stochastic search process modeled after the social behavior of a bird flock [18,19,20]. The algorithm maintains a population of particles, where each particle represents a potential solution to an optimization problem.
In the context of PSO, a swarm refers to a number of potential solutions to the optimization problem, where each potential solution is referred to as a particle. The aim of PSO is to find the particle position that results in the best evaluation of a given fitness (objective) function. Each particle represents a position in $N_d$-dimensional space and is "flown" through this multi-dimensional search space, adjusting its position towards both
• the particle's best position found thus far, and
• the best position in the neighborhood of that particle.
Each particle i maintains the following information:
• $x_i$: the current position of the particle;
• $v_i$: the current velocity of the particle;
• $y_i$: the personal best position of the particle.
Using this notation, a particle's velocity and position are adjusted according to

$v_{i,k}(t+1) = w\,v_{i,k}(t) + c_1 r_{1,k}(t)\,[y_{i,k}(t) - x_{i,k}(t)] + c_2 r_{2,k}(t)\,[\hat{y}_k(t) - x_{i,k}(t)] \qquad (6)$

$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (7)$

where w is the inertia weight, $c_1$ and $c_2$ are the acceleration constants, $r_{1,k}(t), r_{2,k}(t) \sim U(0,1)$, $k = 1, \ldots, N_d$, and $\hat{y}(t)$ is the best position found by the swarm so far. The velocity is thus calculated based on three contributions: 1) a fraction of the previous velocity; 2) the cognitive component, which is a function of the distance of the particle from its personal best position; and 3) the social component, which is a function of the distance of the particle from the best particle found thus far (i.e., the best of the personal bests). The personal best position of the particle is calculated (for a minimization problem with fitness f) as

$y_i(t+1) = \begin{cases} y_i(t) & \text{if } f(x_i(t+1)) \geq f(y_i(t)) \\ x_i(t+1) & \text{otherwise} \end{cases}$

Two basic approaches to PSO exist, based on the interpretation of the neighborhood of particles. Equation (6) reflects the gbest version of PSO where, for each particle, the neighborhood is simply the entire swarm; the social component then causes particles to be drawn toward the best particle in the swarm. In the lbest PSO model, the swarm is divided into overlapping neighborhoods, and the best particle of each neighborhood is determined; in the social component of equation (6), $\hat{y}_k(t)$ is then replaced by the position of $\hat{y}_j$, the best particle in the neighborhood of the i-th particle.
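A direct transcription of updates (6) and (7) for a whole swarm; the parameter values shown are commonly used defaults, not values taken from the text:

```python
import numpy as np

def pso_step(x, v, y, y_hat, w=0.72, c1=1.49, c2=1.49):
    """One gbest PSO update for the whole swarm.

    x, v, y : (s, Nd) positions, velocities, personal best positions
    y_hat   : (Nd,)   global best position
    """
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    # Velocity update: inertia + cognitive + social components, eq. (6)
    v = w * v + c1 * r1 * (y - x) + c2 * r2 * (y_hat - x)
    # Position update, eq. (7)
    x = x + v
    return x, v
```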
The PSO is usually executed with repeated application of equations (6) and (7) until a specified number of iterations has been exceeded. Alternatively, the algorithm can be terminated when the velocity updates are close to zero over a number of iterations.

PSO Clustering: In the context of clustering, a single particle represents the $N_c$ cluster centroid vectors. That is, each particle $x_i$ is constructed as follows:

$x_i = (m_{i1}, \ldots, m_{ij}, \ldots, m_{iN_c}) \qquad (10)$

where $m_{ij}$ refers to the j-th cluster centroid vector of the i-th particle, and $C_{ij}$ denotes the corresponding cluster. Therefore, a swarm represents a number of candidate clusterings for the current data vectors. The fitness of a particle is easily measured as the quantization error

$J_e = \frac{1}{N_c} \sum_{j=1}^{N_c} \left[ \sum_{\forall Z_p \in C_{ij}} d(Z_p, m_{ij}) \Big/ |C_{ij}| \right] \qquad (11)$

where $|C_{ij}|$ is the number of data vectors assigned to cluster $C_{ij}$.

gbest PSO clustering Algorithm
Using the standard gbest PSO, data vectors can be clustered as follows (a code sketch follows the list):
1. Initialize each particle to contain $N_c$ randomly selected cluster centroids.
2. For t = 1 to $t_{max}$ do:
   a) for each particle i do
   b) for each data vector $Z_p$:
      i) calculate the Euclidean distance $d(Z_p, m_{ij})$ to all cluster centroids $C_{ij}$;
      ii) assign $Z_p$ to cluster $C_{ij}$ such that $d(Z_p, m_{ij}) = \min_{c=1,\ldots,N_c} d(Z_p, m_{ic})$;
      iii) calculate the fitness using the quantization error (11);
   c) update the global best and personal best positions;
   d) update the cluster centroids using equations (6) and (7);
where $t_{max}$ is the maximum number of iterations.
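A compact sketch of this loop, reusing the pso_step helper from the previous section; the fitness is the quantization error of equation (11), and all names are illustrative:

```python
import numpy as np

def quantization_error(particle, Z, n_clusters):
    """Quantization error J_e of one particle, eq. (11)."""
    centroids = particle.reshape(n_clusters, -1)
    d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    owner = np.argmin(d, axis=1)                     # nearest centroid per vector
    per_cluster = [d[owner == j, j].mean()
                   for j in range(n_clusters) if np.any(owner == j)]
    return np.mean(per_cluster)

def pso_cluster(Z, n_clusters, swarm_size=20, t_max=100):
    """gbest PSO clustering: each particle holds Nc candidate centroids."""
    n_dim = Z.shape[1]
    # 1. Initialize particles with randomly selected data vectors as centroids
    idx = np.random.choice(len(Z), (swarm_size, n_clusters))
    x = Z[idx].reshape(swarm_size, -1)
    v = np.zeros_like(x)
    y = x.copy()                                     # personal bests
    fy = np.array([quantization_error(p, Z, n_clusters) for p in y])
    for _ in range(t_max):
        y_hat = y[np.argmin(fy)]                     # global best position
        x, v = pso_step(x, v, y, y_hat)              # eqs. (6) and (7)
        fx = np.array([quantization_error(p, Z, n_clusters) for p in x])
        better = fx < fy                             # update personal bests
        y[better], fy[better] = x[better], fx[better]
    return y[np.argmin(fy)].reshape(n_clusters, n_dim)
```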

DISCUSSION
In order to evaluate the performance of the rule extraction algorithm, we carried out a twofold experiment with the PAT [21], WBC, and IRIS data sets. The time for the error to converge with the centers optimized by PSO was compared with the centers optimized by a genetic algorithm; the results are presented in Table 7. The results show that if the RBFNN centers are optimized by PSO, the network takes less time to train (Table 7). In the case of the IRIS data set, the overlap in the extracted rules is better than when training the RBFNN with a genetic algorithm [22]. In the case of WBC, the convergence of the genetic algorithm takes more time than PSO [23]. The PAT [21] data set was included to show that this methodology can handle any type of classification task. The algorithms associated with the extraction method were simulated using MATLAB v6.5.

Simulation environment: In this section we describe and illustrate the use of Particle Swarm Optimization training for the centers of an RBFNN network. We tested the algorithms of the previous sections on the classical PAT data set [21], the Wisconsin breast cancer (WBC) data set, and the IRIS data set.

• PAT database: The PAT data set contains a training set of 450 exemplars and a test set of 430 exemplars, for a total of 880 exemplars.
• WBC database: The WBC data set contains a training set of 400 exemplars and a test set of 299 exemplars, for a total of 699 exemplars.
• IRIS plants database: The IRIS data set contains 50 exemplars from each category, for a total of 150 patterns. We randomly divided the patterns into training and test sets, with the training set containing 34 exemplars from each category; the rest of each category was used for testing.
The input data were normalized by replacing each feature value x by $x \leftarrow (x - \mu_x)/\sigma_x$, where $\mu_x$ and $\sigma_x$ denote the sample mean and standard deviation of that feature over the entire data set. The networks are trained to respond with the target value $y_{ik} = 1$, and $y_{jk} = 0$ for all $j \neq i$, when presented with an input vector $x_k$ from the i-th category.
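A sketch of this normalization step:

```python
import numpy as np

def normalize(X):
    """Replace each feature by (x - mean) / std over the entire data set."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```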
MATLAB m-files were used to generate the simulation results presented in this section. The training algorithms were initialized with prototype vectors randomly selected from the input data on a twofold basis, with the weight matrix W set to 1 and σ initialized to 1.

SIMULATION RESULTS
Fitness convergence: Figures 2, 3, and 4 show the fitness convergence for the PAT, WBC, and IRIS data sets. Error: an indication of error is a key attribute of any simulation result; the errors observed in our simulations for the PAT [21], WBC, and IRIS data sets are shown in Figures 5, 6, and 7, respectively. Tabular data: the centers obtained by the RBFNN in our simulation studies are shown in the tables. Tables 1, 3, and 5 show the centers for the PAT, WBC, and IRIS data sets, respectively; Tables 2, 4, and 6 show the corresponding weights; and Table 7 shows the time required to train the data sets with PSO and the genetic algorithm.

CONCLUSION
The success of a neural network architecture depends heavily on the availability of effective learning algorithms. The theoretical strength of Particle Swarm Optimization (PSO) has been exploited in a wide range of technologies, and this paper demonstrates that RBFNN training is yet another fruitful application of PSO. Our simulations using MATLAB v6.5 verify that initialization of the centers through Particle Swarm Optimization provides better performance. Further research could focus on the application of PSO training to RBFNN networks with alternative forms of the generator function. (Recall that in this paper the deviations and the weight matrix were initialized to ones.) Applying these techniques to large problems, to obtain experimental verification of the computational savings of training the centers by PSO, is left as future work.