ARTIFICIAL BEE COLONY ALGORITHM INTEGRATED WITH FUZZY C-MEAN OPERATOR FOR DATA CLUSTERING

Clustering aims at the unsupervised classification of patterns into different groups. To enhance the quality of the results, emerging swarm-based algorithms have become an alternative to conventional clustering methods. In this study, an optimization method based on a swarm intelligence algorithm is proposed for clustering. The distinguishing feature of the proposed algorithm is that it uses a Fuzzy C-Means (FCM) operator within the Artificial Bee Colony (ABC) algorithm. The FCM operator acts in the scout bee phase of the ABC algorithm: the scout bees are introduced by the FCM operator rather than at random. The experimental results show that the proposed approach improves the quality of the solution. A comparison of the proposed approach with existing algorithms in the literature, using datasets from the UCI Machine Learning Repository, is satisfactory.


INTRODUCTION
Clustering is a data mining technique that is widely studied in several research fields such as statistical pattern recognition, machine learning and information retrieval. Clustering deals with the unsupervised classification of patterns into clusters. Clustering approaches can be divided into partitioning methods, hierarchical methods (Yu et al., 2010; Krinidis and Chatzis, 2010), fuzzy clustering (Dervis and Ozturk, 2010; Suguna, 2011), density-based clustering, artificial neural clustering, statistical clustering, grid-based clustering, mixed approaches and more (Cheng-Fa and Yen, 2007). Among these, partitional and hierarchical clustering algorithms are the two most important approaches in research. Partitional clustering aims directly at obtaining a single partition of the set of items into groups: the dataset is partitioned into a specified number of clusters and the partition is then evaluated against a chosen criterion. Hierarchical clustering is a technique of cluster analysis that constructs a hierarchy of clusters. It first treats each item in the database as a separate cluster and then repeatedly joins clusters according to the inter-cluster distances. Of the two, partitional clustering uses less memory and execution time.
As far as clustering is concerned, Fuzzy C-Means has played a major role, as it has long been in use. Fuzzy clustering, as a soft clustering method, has been widely studied and successfully applied in clustering and classification. Among the fuzzy clustering methods, the Fuzzy C-Means (FCM) algorithm (Dervis and Ozturk, 2010) is the most popular method used in data clustering. Many researchers have addressed data clustering using different methods. Maulik and Bandyopadhyay (2000) experimented with a genetic algorithm based approach to the clustering problem and evaluated its clustering performance. Krishna and Murty (1999) proposed the genetic K-means algorithm for cluster analysis, which defines a mutation operator specific to clustering called distance-based mutation. The main challenge with these clustering algorithms is that they have no optimization function for optimizing the clusters. When redundant data collections are considered, optimization is essential for efficient clustering.

JCS
Later, different optimization algorithms such as genetic algorithms and particle swarm optimization were applied to cluster optimization. Recently, Karaboga and Basturk (2008) proposed the Artificial Bee Colony algorithm for cluster optimization. The Artificial Bee Colony algorithm is based on the foraging behavior of honey bees. Honey bees are among the most closely studied social insects. Their foraging behavior, learning, memory and information sharing have recently been among the most stimulating study areas in swarm intelligence. The ABC algorithm has only recently been introduced into the cluster optimization process, so it has some shortcomings that may affect its performance on particular data. In order to improve the performance, a new algorithm is proposed to replace the existing ABC algorithm (Dervis and Ozturk, 2011; Changsheng et al., 2010; Bahriye and Karaboga, 2012; Suguna, 2011; Visu et al., 2012). In this approach, an ABC algorithm with an FCM operator is introduced to improve the optimization efficiency of the ABC algorithm.
In the normal ABC algorithm, the optimization contains three different stages: the employed bee phase, the onlooker bee phase and the scout bee phase. Each input datum is assigned to either an employed bee or an onlooker bee. The employed bees and the onlooker bees are processed according to the fitness of their solutions. The scout bee is one of the notable elements of the ABC algorithm: it is introduced to the bee colony when a solution is abandoned. In such a case, a random bee is introduced into the bee colony. The proposed approach slightly modifies this conventional scheme using the FCM operator. Instead of random assignment, the scout bee is introduced according to a fuzzy function. In this approach, the scout bee is introduced after every cycle.
The main contributions in this study are:
• A modified ABC algorithm is used for optimizing the clusters
• Processing is concentrated on the onlooker bees
• At the end of every cycle, a scout bee is introduced
• The scout bee is introduced using the FCM operator

Modified ABC Algorithm for Clustering (AB-FF)
The name of the modified ABC algorithm, AB-FF, stands for Artificial Bee Colony with FCM function. The proposed approach is a hybrid algorithm that incorporates the FCM operator into the ABC algorithm. Developing a method that retains the effectiveness of both algorithms is a demanding task, and the proposed algorithm is designed to retain the features of both. The modification to the ABC algorithm is made with the help of the FCM function. The ABC algorithm consists of three phases, i.e., the employed bee phase, the onlooker bee phase and the scout bee phase. Of the three phases, the employed bee phase and the onlooker bee phase are always executed, whereas the scout bee phase is a random phase. So, in the proposed work, the FCM operator is incorporated in the scout bee phase.
In the ABC algorithm (Dervis and Ozturk, 2011), a random solution is generated for the scout bee when an abandoned solution occurs. The random solution is expected to lead to the best solution, but this is not dependable. The proposed approach offers an alternative: in each cycle, a solution for the scout bee is introduced by the FCM operator. The new solution from the FCM operator is generated from the solutions of the employed bee and onlooker bee phases to obtain better optimization results. The fitness function used in the proposed approach is given in Equation 1. The data are initially processed according to the number of clusters, and the centroids are defined for processing according to the ABC algorithm. The dataset D can be considered as a set with n elements. Assume that two clusters are to be generated from the given dataset. Two centroids for the clusters are selected from the dataset D.

Here, c_1 and c_2 represent the centroids of the two clusters, respectively, and i and j represent the distance values between the centroids and the data points. After the distance calculation, each data point is moved into the cluster whose centroid is at the least distance from it. In this way, the data points are grouped into two clusters according to their least distance value. The f_i values of the clusters are calculated from the distance values of the data points. Table 1 lists the clusters with their distance values. The proposed approach finds the f_1 value from the distances according to Equation 3, and the remaining f_i values are obtained in the same way. The fitness function is calculated as the sum of all the f_i values (Equation 4). The fitness value is used to find the most relevant data for the next population. The solutions with better fitness values in the population survive and the solutions with the worst fitness values are rejected.
Of the three phases of the ABC algorithm, the employed bee phase and the onlooker bee phase are the same as those described in the basic ABC algorithm, and the employed bees are considered as the initial population. The major modification is made in the scout bee phase.
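The assignment and fitness steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Java implementation: the function names are hypothetical, the distance is taken to be Euclidean, and each f_i is assumed to be the sum of the distances from the members of cluster i to its centroid, since the exact form of Equations 3 and 4 is not reproduced in the text. Under that assumption a lower total indicates a more compact clustering.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_clusters(points, centroids):
    """Return, for every point, the index of the centroid at the least distance."""
    return [
        min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
        for p in points
    ]


def cluster_distance_sums(points, centroids):
    """Assumed f_i: sum of member-to-centroid distances for each cluster i."""
    labels = assign_to_clusters(points, centroids)
    f = [0.0] * len(centroids)
    for p, label in zip(points, labels):
        f[label] += euclidean(p, centroids[label])
    return f


def fitness(points, centroids):
    """Fitness as the sum of all f_i values (lower means more compact here)."""
    return sum(cluster_distance_sums(points, centroids))


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 12.0)]
    centres = [(0.0, 0.0), (10.0, 10.0)]
    print(assign_to_clusters(data, centres))
    print(cluster_distance_sums(data, centres))
    print(fitness(data, centres))
```

With two centroids, as in the example in the text, each data point ends up in exactly one of the two clusters and the fitness is the sum f_1 + f_2.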

Employed Bee Phase
The ABC algorithm operates on a multidimensional search space in which there are employed bees and onlooker bees. Both types of bees are characterized by their experience in finding the food source. The available data are selected and their corresponding solutions are randomly created using the uniform distribution. The initial population is then selected for the employed bee phase. Consider the solution in Table 2.
This is the solution generated by the initialization process. The employed bees update this solution using Equation 5. These employed bees hold the food locations. Here, the position values are marked with the notation 'I'.

Onlooker Bee Phase
In the above table, E stands for the employed bee variants and O stands for the onlooker bee variants. In the employed bee phase, only the E variants in Table 2 are considered. That is, the solution for the first onlooker bee can be calculated using the solution update formula, in which k and j are random indices and Φ_ij is a randomly produced number in the range [-1, 1]. As per the equation, a new solution is generated. The fitness function is then applied to the new solution to obtain its fitness value. The new fitness value is compared with the previous best value (Table 3). If the new fitness is better than the old, the new solution is selected and the old one is rejected. This process continues until all the employed bees are processed. The employed bee phase can be elaborated by means of a numerical example.

Example 1:
Consider the following distance values:

Here, the ABC algorithm first starts processing the employed bee in the cycle.
The distance values are iteratively changed by changing the indices k and j in Equation (2). Thus, sets of data with different fitness values are generated. Among those sets of distance values, the set with the higher fitness value is selected for the scout bee phase.
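The update with random indices k and j and a random factor Φ_ij in [-1, 1] matches the neighbour search of the basic ABC algorithm, v_ij = x_ij + Φ_ij (x_ij - x_kj), followed by greedy selection. The Python sketch below illustrates that standard rule; it assumes this is the formula intended in the text, and the function names and the convention that a lower cost is better are illustrative choices rather than details taken from the paper.

```python
import random


def neighbour_solution(solutions, i, rng=random):
    """Perturb one random dimension j of solution i using a random partner k != i.

    Implements v_ij = x_ij + phi_ij * (x_ij - x_kj) with phi_ij drawn from [-1, 1].
    """
    x = solutions[i]
    k = rng.choice([idx for idx in range(len(solutions)) if idx != i])
    j = rng.randrange(len(x))
    phi = rng.uniform(-1.0, 1.0)
    v = list(x)  # copy so the original solution is left untouched
    v[j] = x[j] + phi * (x[j] - solutions[k][j])
    return v


def greedy_select(old, new, cost):
    """Keep the new solution only if its cost is strictly lower than the old one."""
    return new if cost(new) < cost(old) else old


if __name__ == "__main__":
    population = [[1.0, 2.0], [3.0, 5.0], [0.0, 1.0]]
    candidate = neighbour_solution(population, 0)
    # Toy cost: sum of coordinates, used only to show the selection step.
    population[0] = greedy_select(population[0], candidate, cost=sum)
    print(population[0])
```

In a clustering setting the cost passed to `greedy_select` would be the fitness of the candidate centroids, so a candidate replaces the stored solution only when it improves on it.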

Scout Bee Phase
In the ABC algorithm, the scout bee is a randomly assigned bee that is introduced if an abandoned solution occurs. In other words, if no new solution is obtained at the end of the cycle, a bee position is randomly assigned to obtain a new solution. In the proposed approach (as shown in Fig. 1), a new method is used to introduce the scout bee in the case of an abandoned solution. Instead of adding a random position to the bee colony, the proposed approach uses the FCM function to introduce the scout bee. The scout bee is produced from the onlookers with the highest fitness, even though their fitness is not better than the previous best. The onlookers with high fitness are sorted in ascending order of their fitness values, giving the set O_i of onlooker bee positions with the highest fitness. A scout bee is generated by processing this set with the FCM function (Equations 6 and 7). The FCM function generates a new position for the scout from the membership values and the centroid values. As mentioned above, only one centroid is defined, since a single scout bee has to be introduced. Unlike the FCM algorithm, the proposed approach conducts a single iteration to obtain the new solution. The main advantage of the proposed approach is that the scout bee is derived from the best fitness values obtained from the last solution in the cycle instead of being assigned randomly. Moreover, the scout bee is introduced in every cycle, which improves the optimal solution and the overall execution time. The pseudocode for the Modified ABC Algorithm for Clustering is given in Fig. 2.
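A single FCM iteration of the kind described here can be sketched as follows. The sketch uses the standard FCM membership and centroid updates with fuzzifier m and is an assumption about what Equations 6 and 7 contain, since they are not reproduced in the text; `fcm_step` and `scout_from_onlookers` are hypothetical names, and starting the single centroid at the first onlooker position is an illustrative choice. Under the standard updates, a single centroid gives every position a membership of 1, so the new scout position reduces to the mean of the selected onlooker positions. The paper's operator may weight the positions differently.

```python
import math


def fcm_step(points, centroids, m=2.0):
    """One iteration of the standard FCM updates.

    Returns (memberships, new_centroids), where memberships[c][p] is the degree
    to which point p belongs to centroid c.
    """
    eps = 1e-12  # guards against division by zero when a point sits on a centroid
    dist = [[max(math.dist(p, c), eps) for p in points] for c in centroids]
    power = 2.0 / (m - 1.0)
    n_c, n_p, dim = len(centroids), len(points), len(points[0])

    memberships = []
    for c in range(n_c):
        row = []
        for p in range(n_p):
            denom = sum((dist[c][p] / dist[k][p]) ** power for k in range(n_c))
            row.append(1.0 / denom)
        memberships.append(row)

    new_centroids = []
    for c in range(n_c):
        weights = [u ** m for u in memberships[c]]
        total = sum(weights)
        new_centroids.append(
            [sum(w * pt[d] for w, pt in zip(weights, points)) / total for d in range(dim)]
        )
    return memberships, new_centroids


def scout_from_onlookers(onlooker_positions, m=2.0):
    """Derive one scout position from the best onlookers with a single FCM step."""
    start = [list(onlooker_positions[0])]  # assumed initial centroid
    _, centroids = fcm_step(onlooker_positions, start, m)
    return centroids[0]


if __name__ == "__main__":
    best_onlookers = [[0.0, 0.0], [2.0, 2.0], [4.0, 1.0]]
    print(scout_from_onlookers(best_onlookers))
```

Because `fcm_step` accepts any number of centroids, the same function can be used to compare the single-centroid scout with a multi-centroid FCM pass on the same positions.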

Results and Performance Evaluation
The proposed approach deals with the clustering of data based on the ABC algorithm. The proposed method incorporates the FCM function into the ABC algorithm to obtain better efficiency. The performance of the proposed approach is evaluated in the following sections under different evaluation criteria. The algorithm is implemented in Java and executed on a computer with a Core i5 processor at 2.1 GHz and 4 GB of RAM.

Dataset Description
The proposed hybrid clustering algorithm is tested on three different datasets and compared with other optimization algorithms in the literature. The three datasets are the Iris, Thyroid and Wine datasets, taken from the UCI Machine Learning Repository.
The Iris dataset: This data set contains 3 categories of 50 objects each, in which each class refers to a type of iris plant. There are 150 instances with four numeric features and no missing attribute values. The attributes of the Iris data set are sepal length in cm, sepal width in cm, petal length in cm and petal width in cm.
The Thyroid dataset: This data set contains 215 patients of three types with respect to human thyroid disease. Each patient is described by 5 different tests, and there are no missing attribute values.
The Wine dataset: This data set contains the results of a chemical analysis of 178 wines grown in the same region in Italy but derived from three different cultivars. The wine type is determined from 13 attributes derived from the chemical analysis of the wine.
The performance of the proposed hybrid method (AB-FF) is plotted here. The evaluation factors considered in the experiments are the number of clusters and the number of cycles of execution. Three test cases are selected for the datasets: the best case, the worst case and the average case. The best case comprises the cycle values at which the algorithm performs optimally, and the worst case comprises the cycle values at which it performs worst. The average case comprises the cycle values at which the hybrid algorithm gives a satisfactory output.

Evaluation Based on Time
The performance of the proposed approach is evaluated on the basis of execution time, starting with the Iris dataset. Three cases are selected: the execution time in the worst case, the average case and the best case. The performance of the proposed approach is plotted in Figures 3 to 5.
Figure 3 shows that as the number of iterations increases, the execution time also increases proportionally. The graph shows that the best case needs a higher execution time across the different cycle iterations. The response of the Iris data in terms of execution time is impressive: the figure illustrates that applying the FCM function gives a remarkable reduction in execution time.
Figure 4 shows the same proportional increase of execution time with the number of iterations, and again the best case needs a higher execution time. In this respect the Wine dataset is no different from the Iris data. The execution time also increases proportionally as the number of clusters increases. The execution times themselves, however, differ markedly from those of the Iris data.
Figure 5 shows the same proportional increase of execution time with the number of iterations for the Thyroid dataset, with a higher execution time needed for the best case. In spite of the larger dataset, the execution time is remarkable: the AB-FF algorithm took less time to execute on the Thyroid dataset at the different levels.

Evaluation Based on Intra-Cluster Distance
Here, the performance of the proposed approach is evaluated on the basis of the intra-cluster distance of the clusters. The intra-cluster distances in the worst case, the average case and the best case are selected as the three parameters for the evaluation. The performance of the proposed approach is plotted in Figures 6 to 8.
Figure 6 shows that, unlike the execution time, the intra-cluster distance decreases proportionally as the number of iterations increases. The intra-cluster distance is lower for the best case, which is evident from the plotted graph.
Figure 7 shows the same decrease of the intra-cluster distance with the number of iterations, and the intra-cluster distance is again lower for the best case. The Wine data become more responsive when the FCM function is introduced into the ABC algorithm, and the enhancement gives improved results for the intra-cluster distance.
Figure 8 shows the same decrease of the intra-cluster distance with the number of iterations, with a lower intra-cluster distance for the best case. The Thyroid dataset is no different from the other two datasets: as the number of cycles increases, the intra-cluster distance reduces.

The reduction in intra-cluster distance arises because the FCM operator gives importance to the membership function. The membership function evaluates the values in fine detail and specifies the centroid. This feature of the FCM operator strengthens the proposed ABC algorithm.

Comparative Analysis and Discussion
Here, we describe the comparative study of the proposed approach against several similar algorithms.

The comparative evaluation is based on the intra-cluster distances of the proposed approach and of the methods compared with it. The methods used for the comparison are the genetic algorithm, the tabu search algorithm, simulated annealing, the ant colony algorithm, K-NM-PSO and the Artificial Bee Colony algorithm. The detailed comparison is given in Table 4, which is compiled with reference to the research conducted by Changsheng et al. (2010). Table 4 compares the proposed AB-FF approach with these similar algorithms, all of which relate to the optimization of the clustering process. Three datasets are selected for the comparison: the Iris dataset, the Thyroid dataset and the Wine dataset. The comparison across the three datasets shows that, in the best cases, the proposed approach gives the minimum intra-cluster distance when compared with the other methods. The analyses show the significance of our approach in the optimization of the clusters.

CONCLUSION
In this study, a cluster optimization methodology based on the Artificial Bee Colony (ABC) algorithm is proposed. The proposed approach is a modified ABC algorithm for cluster optimization. The major modification is made in the scout bee phase of the ABC algorithm. In the scout bee phase, rather than applying a random position to the scout bee, the position is assigned with the help of the Fuzzy C-Means (FCM) operator. The scout bee is introduced after every cycle, which results in a reduced number of cycle iterations. The experiments are carried out on three different datasets, namely the Iris dataset, the Thyroid dataset and the Wine dataset. The comparative analysis shows that the computational results obtained from the modified algorithm are very encouraging in terms of the quality of the solution and the execution time.