Document Clustering Based on Firefly Algorithm

Document clustering is widely used in Information Retrieval; however, existing clustering techniques suffer from the local optima problem in determining the number of clusters, k. Various efforts have been made to address this drawback, including the use of swarm-based algorithms such as Particle Swarm Optimization (PSO) and Ant Colony Optimization. This study explores the adaptation of another swarm algorithm, the Firefly Algorithm (FA), to text clustering. We present two variants of FA: the Weight-based Firefly Algorithm (WFA) and the Weight-based Firefly Algorithm II (WFA II). The difference between the two is that WFA II applies a more restrictive condition in determining the members of a cluster. The proposed FA methods are evaluated using the 20Newsgroups dataset. Experimental results on the clustering quality of the two FA variants are presented and compared against those produced by PSO, K-means and the hybrid of FA and K-means. The results demonstrate that WFA II outperformed WFA, PSO, K-means and FA-Kmeans. This indicates that better clustering can be obtained once the exploitation of a search solution is improved.


Introduction
Text clustering is a data mining technique that automatically groups a large number of documents into meaningful categories, formally known as clusters (Miner et al., 2012). Clustering techniques can be divided into two types: hierarchical and partitional (Jain et al., 1999). Hierarchical clustering constructs a hierarchy of clusters using either a top-down (i.e., divisive) or a bottom-up (i.e., agglomerative) approach (Jain et al., 1999). The top-down approach starts by placing all items in a single cluster and later divides it into smaller groups. Such an operation can be seen in Bisect K-means (Kashef and Kamel, 2009). The bottom-up method, on the other hand, starts from clusters that each contain a single item. The clusters are later merged based on an identified similarity measure (Jain et al., 1999; Forsati et al., 2013). UPGMA is a popular algorithm of this kind (Yujian and Liye, 2010).
Partitional clustering divides a dataset into clusters at a single level (Luo et al., 2009). Center-based clustering is one of the most popular partitional approaches, and K-means (Jain, 2010; Velmurugan and Santhanam, 2010) is one example.
In the K-means algorithm, clusters are formed by minimizing an objective function, namely the sum of the distances between the center of each cluster and the items in that cluster. Even though K-means is simple and efficient for large datasets, it suffers from the local optima problem (Forsati et al., 2013).
Various works have tried to overcome the local optima problem by integrating clustering techniques with Swarm Intelligence (SI) algorithms. Swarm Intelligence is defined as "The emergent collective intelligence of groups of simple agents" (Bonabeau et al., 1999). Examples of Swarm Intelligence algorithms include the Artificial Bee Colony (ABC) (Karaboga and Ozturk, 2011), the Cuckoo Search Optimization algorithm (Zaw and Mon, 2013), Ant Colony Optimization (He et al., 2006) and Particle Swarm Optimization (Cui et al., 2005). These algorithms have been utilized in text clustering; however, they require the number of clusters, k, to be predefined. Determining k is a problem because a user may not have any knowledge about the dataset prior to clustering.
Based on the literature (Sayed et al., 2009; Tan et al., 2011), the problem of determining the number of clusters has been addressed using one of two approaches. The first, the estimation approach, relies on a clustering performance metric. It defines a range for k (a low and a high value), executes the clustering method for each k in that range and measures the performance metric. The maximum or minimum value of the metric, depending on the metric, is chosen to represent the best clusters obtained (Sayed et al., 2009). The second is the swarm-based approach, which mimics the capability of swarms such as ants, bird flocks and bees to solve hard problems (Tan et al., 2011). The swarm-based approach uses swarm-like agents to group data directly, without the need to define the number of clusters. One example of this approach is Dynamic FClust, which is based on flocks of bird-like agents (Saka and Nasraoui, 2010).
In this study, the Firefly Algorithm (FA) (Yang, 2010a; 2010b; Yang and He, 2013), which was introduced by Xin-She Yang in 2008, is utilized for text clustering. Two contributions are made: the Weight-based Firefly Algorithm (WFA) (Mohammed et al., 2014) and the Weight-based Firefly Algorithm II (WFA II). WFA clusters text directly, without the need to define the number of clusters. A benchmark dataset obtained from 20Newsgroups is used to demonstrate the effectiveness of the proposed FA methods. Performance is compared against Particle Swarm Optimization (PSO) (Cui et al., 2005), K-means (Jain, 2010) and a hybrid model of FA and K-means (Rui et al., 2012).
The organization of this study is as follows. Section 2 reviews related work in text clustering, while Section 3 describes the standard Firefly Algorithm. Sections 4 and 5 elaborate on the proposed clustering algorithms and the results are presented in Section 6. A discussion is presented in Section 7 and, finally, the conclusion of the work is given in Section 8.

Related Works
Text clustering organizes similar documents into the same group and dissimilar documents into different groups (Miner et al., 2012). Clustering algorithms can be divided into two main categories: partitional and hierarchical (Jain et al., 1999). K-means is a well-known partitional clustering algorithm because it is efficient, simple and easy to implement (Jain, 2010). Nevertheless, K-means is prone to converging to local optima. This has led researchers to combine K-means with swarm intelligence algorithms in order to search for an optimal solution (Cui et al., 2005; He et al., 2006; Karaboga and Ozturk, 2011; Zaw and Mon, 2013). The pseudo code of K-means (Jain, 2010) is shown in Fig. 1.
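Since the K-means pseudo code is given only as a figure, the following is a minimal Python sketch of the usual assign-and-update loop, written under stated assumptions rather than taken from the cited work: squared Euclidean distance, initial centers sampled at random from the data, and a fixed number of iterations. The function name `kmeans` and its parameters are illustrative.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (tuples of numbers) into k groups.

    Assumptions: squared Euclidean distance, random initial centers
    drawn from the data, fixed iteration count (no convergence test).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, clusters
```

Note that k must be supplied by the caller and that the result depends on the random initial centers, which is the sensitivity the swarm-based methods discussed here try to reduce.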
Swarm intelligence is "the emergent collective intelligence of groups of simple agents" (Bonabeau et al., 1999). Swarm intelligence algorithms are metaheuristic optimization algorithms inspired by the biological behavior of insects or animals. A metaheuristic optimization algorithm is used to find an optimal or near-optimal solution. Such algorithms have proven successful in many hard problems such as speech recognition (Hassanzadeh et al., 2012), image processing (Horng and Jiang, 2010) and text clustering (Cui et al., 2005; He et al., 2006; Karaboga and Ozturk, 2011; Zaw and Mon, 2013). A metaheuristic optimization algorithm has two important components: exploration and exploitation. Exploration searches the space globally to find diverse solutions, while exploitation focuses the search on a specific (local) region. The balance between exploration and exploitation is the key to the success of any optimization algorithm (Boussaïd et al., 2013).
An example of a metaheuristic optimization algorithm is Particle Swarm Optimization (PSO), which was introduced by Kennedy and Eberhart (1995). The basic idea of PSO comes from the flocking and foraging behavior of birds: each bird is abstracted as a "particle" that represents a candidate solution in an n-dimensional search space. Each particle has a fitness value, and its flight direction and distance are determined by its velocity (Hong et al., 2010). The pseudo code of the basic PSO clustering algorithm (Cui et al., 2005) is illustrated in Fig. 2. In Feng et al. (2010), PSO was proposed for divisive clustering, and the results demonstrate that PSO achieves high clustering quality (lower Entropy). Lu et al. (2009) proposed PSO with an objective function based on the extended Jaccard coefficient to maximize the similarity between documents. Their results indicate that PSO performs better than K-means, agglomerative, graph-based and Bisect K-means clustering. PSO has also been integrated with K-means as a way to find the centers of clusters, and the hybrid of PSO and K-means outperformed the single models of PSO and K-means (Cui et al., 2005).
However, a more recent finding (Yang, 2010a) shows that the Firefly Algorithm (FA) is a better swarm-based approach for finding an optimal solution than PSO and the Genetic Algorithm (GA). FA is a younger optimization algorithm and was developed by Xin-She Yang at Cambridge University. It has two important elements: the light intensity and the attractiveness. For optimization problems, the light intensity, I, of a firefly at a particular location, x, is proportional to the objective function, I(x) ∝ f(x). The attractiveness, β, is relative: it changes with the distance between two fireflies (Yang, 2010a; 2010b; Yang and He, 2013).
The Firefly Algorithm has been utilized in many optimization problems such as classification (Nandy et al., 2012; Ming-Huwi et al., 2012), image processing (Horng and Jiang, 2010; Hassanzadeh et al., 2011) and anomaly detection (Adaniya et al., 2013). On clustering benchmark datasets, FA offers a better solution than PSO and Artificial Bee Colony (ABC) (Karaboga and Basturk, 2011). In Banati and Bajaj (2013), FA was employed on unlabeled data (unsupervised) using an objective function that maximizes homogeneity and minimizes heterogeneity. The results demonstrate that FA is a better approach than the Particle Swarm Optimization (PSO) and Differential Evolution (DE) (Storn and Price, 1997) algorithms. Furthermore, in web mining (Rui et al., 2012), K-means was integrated with FA and also with three other swarm algorithms: Wolf, Bat and Cuckoo. Experimental results showed that FA-Kmeans is better than PSO-Kmeans. The pseudo code of the integrated Firefly and K-means algorithm is depicted in Fig. 3.
In this study, we adapt the standard FA, which has been implemented in data clustering (Senthilnath et al., 2011), and present two variants of FA for text clustering. The variants are WFA (Mohammed et al., 2014) and WFA II, which enhances the exploitation of WFA and thereby leads to better clustering. Experiments are conducted to evaluate WFA (Mohammed et al., 2014) and WFA II, and the better of the two is compared against three existing clustering methods: standard Particle Swarm Optimization (PSO) (Cui et al., 2005), K-means (Jain, 2010) and the Firefly Algorithm integrated with K-means (Rui et al., 2012).

Standard Firefly Algorithm
The Firefly Algorithm (FA) is a swarm intelligence optimization algorithm that has been utilized in solving NP-hard problems (Fister et al., 2013). It has two important variables: the light intensity and the attractiveness. The light intensity, I, of a firefly is related to the objective function f(x), where x is the location (position) of the firefly. Every location has a different light intensity. The objective function can be maximized or minimized, depending on the problem. The attractiveness, β, is relative to the light intensity: when two fireflies attract each other, the firefly with the higher intensity attracts the one with the lower intensity, and the value of β changes with the distance between the two fireflies. The attractiveness is given by Equation 1 (Yang, 2010a; Yang and He, 2013):

$$\beta = \beta_0 e^{-\gamma r^2} \tag{1}$$

where β_0 is the attractiveness at distance r = 0 and γ is the absorption coefficient, which takes a value between 0 and 1. The movement of a firefly i towards another firefly j is determined by Equation 2 (Yang, 2010b; Yang and He, 2013):

$$X_i = X_i + \beta_0 e^{-\gamma r_{ij}^2} (X_j - X_i) + \alpha \varepsilon_i \tag{2}$$

where X_i is the position of the first firefly, X_j is the position of the second firefly, r_ij is the distance between them, α is the randomization parameter and ε_i is a vector of random numbers drawn from a Gaussian distribution (Yang, 2010a). The pseudo-code of the standard Firefly Algorithm is shown in Fig. 4.
When the standard FA is applied to clustering (Senthilnath et al., 2011; Rui et al., 2012; Banati and Bajaj, 2013), the number of fireflies is pre-defined and each firefly carries one random solution. This solution encodes k clusters; hence the value of k must be pre-determined. Such an approach is not suitable when we do not have any knowledge about the dataset. This study addresses the shortcoming by proposing WFA, an adaptation of FA to text clustering.
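To make the attraction and movement rules concrete, here is a minimal Python sketch of one sweep of the standard FA as commonly stated by Yang: attractiveness β0·exp(−γr²) and a move of the dimmer firefly towards the brighter one plus a Gaussian perturbation scaled by α. The function names, the default parameter values and the choice to pass brightness in as a precomputed list are assumptions made for illustration.

```python
import math
import random

def distance(xi, xj):
    """Cartesian (Euclidean) distance between two positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Attractiveness at distance r: beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)

def move_towards(xi, xj, beta0=1.0, gamma=1.0, alpha=0.1, rng=random):
    """Move firefly at xi towards the firefly at xj, with Gaussian noise."""
    beta = attractiveness(distance(xi, xj), beta0, gamma)
    return [a + beta * (b - a) + alpha * rng.gauss(0.0, 1.0)
            for a, b in zip(xi, xj)]

def firefly_step(positions, brightness, beta0=1.0, gamma=1.0, alpha=0.1, rng=random):
    """One sweep: every firefly moves towards each brighter firefly."""
    updated = [list(p) for p in positions]
    for i in range(len(positions)):
        for j in range(len(positions)):
            if brightness[j] > brightness[i]:
                updated[i] = move_towards(updated[i], positions[j],
                                          beta0, gamma, alpha, rng)
    return updated
```

In a full optimizer the brightness list would be recomputed from the objective function after every sweep; that outer loop is omitted here.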

Firefly Algorithm for Text Clustering
Clustering of numerical data using FA has been demonstrated in Senthilnath et al. (2011) and Banati and Bajaj (2013). Nevertheless, FA has not been tested on text data. Hence, we proposed WFA (Mohammed et al., 2014), which adapts the standard FA to text clustering. The pseudo code of WFA (Mohammed et al., 2014) is presented in Fig. 5.

Construction of a Vector Space Model
In document clustering, the Vector Space Model (VSM) represents each document as a vector in the vector space (Aliguliyev, 2009). The VSM is constructed in three steps: construction of a Term-Frequency (TF) database, construction of a normalized TF database and creation of a Term Frequency-Inverse Document Frequency (TF-IDF) table. The TF database is created by counting the occurrences of each term in each document and is represented as a matrix. The rows represent terms (words) and the columns represent documents. Each entry is the number of occurrences of the term in the document (the term frequency) (Manning et al., 2008).
The normalized matrix transforms the TF database into a normalized form in which each value lies between 0 and 1. This is achieved by dividing the term frequency by the length of each document, as in Equation 3 (Manning et al., 2008):

$$\text{Normalized TF}_{i,d} = \frac{TF_{i,d}}{\text{Length}_d} \tag{3}$$

where the Length is calculated using Equation 4 (Manning et al., 2008):

$$\text{Length}_d = \sqrt{\sum_{i=1}^{m} V_{i,d}^{2}} \tag{4}$$

where m is the number of terms in the collection, V is the term frequency and d is the document.
Term Frequency-Inverse Document Frequency (TF-IDF) is a technique that has been widely used to represent documents as numerical weights in the vector space (Manning et al., 2008; Forsati et al., 2013). The TF-IDF of each term in a document is equal to the term frequency multiplied by the inverse document frequency, idf, as in Equation 5 (Manning et al., 2008):

$$\text{TF-IDF}_{i,j} = TF_{i,j} \times idf_i \tag{5}$$

The inverse document frequency of a specific term is the logarithm of the number of documents in the collection divided by the number of documents in the collection that contain the term, as shown in Equation 6 (Manning et al., 2008):

$$idf_i = \log \frac{N}{df_i} \tag{6}$$

where N is the number of documents in the collection and df_i is the number of documents that contain term i. The total weight of each document is obtained by summing the TF-IDF of all terms in that document, as in Equation 7:

$$\text{Weight}_j = \sum_{i=1}^{m} \text{TF-IDF}_{i,j} \tag{7}$$

where j indexes the documents and i indexes the terms.
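The three VSM steps and the total document weight can be sketched in Python as follows. This is an illustration under assumptions, not the authors' Matlab code: documents are given as lists of tokens, the document length is taken to be the Euclidean length of the raw term-frequency vector, and the natural logarithm is used for idf (the base is not stated in the text). The function name `build_vsm` is hypothetical.

```python
import math
from collections import Counter

def build_vsm(docs):
    """Return (normalized TF, TF-IDF, total weight) for tokenized documents.

    Assumptions: Euclidean document length, natural-log idf.
    Vectors are sparse dicts mapping term -> value.
    """
    n_docs = len(docs)
    tf = [Counter(doc) for doc in docs]          # raw term frequencies
    norm_tf = []
    for counts in tf:
        # Document length; fall back to 1.0 for an empty document.
        length = math.sqrt(sum(v * v for v in counts.values())) or 1.0
        norm_tf.append({t: v / length for t, v in counts.items()})
    # Document frequency: number of documents containing each term.
    df = Counter(t for counts in tf for t in counts)
    idf = {t: math.log(n_docs / df[t]) for t in df}
    tfidf = [{t: w * idf[t] for t, w in vec.items()} for vec in norm_tf]
    weights = [sum(vec.values()) for vec in tfidf]  # total weight per document
    return norm_tf, tfidf, weights
```

A term that occurs in every document gets idf = 0 and therefore contributes nothing to a document's total weight, so documents rich in distinctive terms end up with the largest weights.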

Similarity Measure between Documents
Similarity between two documents is measured using the cosine function, computed using Equation 8 (Luo et al., 2009):

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert} \tag{8}$$

The document vectors are represented by the normalized term frequency, hence Equation 8 becomes the following (Luo et al., 2009):

$$\cos(d_i, d_j) = \sum_{k=1}^{m} d_{ik} \, d_{jk} \tag{9}$$

where m is the number of terms in the collection and d_ik is the weight of term k in document d_i. The value of cosine similarity lies between 0 and 1. It approaches 1 when two documents are identical and decreases towards 0 as they become less similar.
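As a small illustration, cosine similarity over sparse term-weight vectors can be written in Python as below. Representing a document as a dict from term to weight is an assumption made for this sketch; for vectors that are already normalized to unit length the denominator equals 1 and the function reduces to the plain sum of products.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)
```

Two documents with no terms in common score 0, and a document compared with a scaled copy of itself scores 1.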

Clustering
In WFA (Mohammed et al., 2014), each firefly represents a single document and the initial brightness, I, of a firefly is represented by the total weight of the document, which is calculated using Equation 7. Hence, in Mohammed et al. (2014), we proposed Equation 10 for the initial brightness. In WFA, the center of a cluster is determined on the basis of a single condition. Such an approach produces acceptable results but consumes large computational effort (time and complexity), because a firefly (document) with a brighter light needs to compete with all other fireflies (documents). Figure 6 shows an example of how WFA (Mohammed et al., 2014) finds a center. In Fig. 6, the circles indicate documents of class 1 while the triangles are of class 2. The value shown next to each shape is the total weight of that document. Document A has the brightest light (i.e., 20), meaning that the distance between document A and every other document (both triangles and circles) needs to be determined. Only documents (of either shape) that have a small distance to document A are considered.
The movement of fireflies depends on the distance between the positions of two documents in the search space, which is calculated using the Cartesian distance function (Yang, 2010b) shown in Equation 11:

$$r_{ij} = \lVert X_i - X_j \rVert = \sqrt{\sum_{k=1}^{n} (x_{i,k} - x_{j,k})^{2}} \tag{11}$$

where x_{i,k} is the k-th component of the position X_i and n is the number of dimensions of the search space. This process continues until the number of iterations reaches a pre-determined value. After that, the fireflies are sorted by brightness. The one with the brightest light is identified as the best point, that is, the center of a cluster. Once a center is determined, we find the documents that are similar to it using the cosine similarity measure in Equation 9. Documents with a high similarity value are placed in the first cluster, while those with lower values are placed in a second cluster. Such an approach requires a threshold value; in this study, the threshold is set to 0.15, so the second cluster includes the documents whose similarity to the center is less than 0.15. The process of finding a centroid and its cluster is then repeated on the second cluster, and this is replicated until all documents are grouped.
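The repeated "find a center, split by threshold" procedure described above can be sketched in Python as follows. This is a simplified illustration and not the WFA pseudo code of Fig. 5: it skips the firefly movement and brightness updates entirely and simply ranks documents by a given brightness value. The names `threshold_clustering` and `dot`, and the use of "greater than or equal to" at the threshold, are assumptions.

```python
def dot(u, v):
    """Dot product of sparse unit-length vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def threshold_clustering(vectors, brightness, similarity=dot, threshold=0.15):
    """Repeatedly pick the brightest remaining document as a center and
    group with it every remaining document at least `threshold` similar.

    Returns a list of (center_index, member_indices) pairs.
    """
    remaining = list(range(len(vectors)))
    clusters = []
    while remaining:
        center = max(remaining, key=lambda i: brightness[i])
        members = [i for i in remaining
                   if i == center
                   or similarity(vectors[center], vectors[i]) >= threshold]
        clusters.append((center, members))
        # Documents below the threshold form the pool for the next round.
        remaining = [i for i in remaining if i not in members]
    return clusters
```

Because each round removes at least the center itself, the loop always terminates, and the number of clusters falls out of the data and the threshold rather than being supplied in advance.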

Weight-Based Firefly Algorithm II (WFA II )
The second variant of FA for text clustering is WFA II, which employs a more restrictive condition in identifying the members of a cluster. Such an approach improves the exploitation that happens during a search, hence producing clusters of higher quality. The improvement is made by including a second condition based on the similarity between two competing documents. Figure 7 illustrates the competition between fireflies (documents), which is based on two conditions. First, the brightest firefly attracts the less bright ones, generating a list of potential documents that could be moved towards the center. Second, the documents in the list are evaluated based on their similarity to the center. Only documents with a similarity value greater than 0.3 are moved towards the center using Equation 2. This is followed by updating the light intensity of the firefly (the center) using Equation 12, where β is the attractiveness between two documents. In the equation for β, i and j are documents, Min TFIDF is the minimum document weight derived from the TFIDF values and Max TFIDF is the maximum document weight derived from the TFIDF values. This process continues until the number of iterations reaches a pre-determined value. After that, the fireflies are ranked by brightness and the one with the brightest light is identified as the centroid. The pseudo-code of WFA II is shown in Fig. 8.
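The two conditions can be illustrated with a short Python sketch that selects which documents would be moved towards a given center. It covers only the selection step; the movement itself and the intensity update are left out because their equations are not reproduced in this text. The function names and the sparse-dict document representation are assumptions.

```python
def dot(u, v):
    """Dot product of sparse unit-length vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def wfa2_candidates(center, vectors, brightness, similarity=dot, min_similarity=0.3):
    """Indices of documents that pass both WFA II conditions for `center`.

    Condition 1: the document is less bright than the center.
    Condition 2: its similarity to the center exceeds `min_similarity`.
    """
    attracted = [i for i in range(len(vectors))
                 if i != center and brightness[i] < brightness[center]]
    return [i for i in attracted
            if similarity(vectors[center], vectors[i]) > min_similarity]
```

Compared with a brightness-only rule, the second filter keeps dissimilar documents from being pulled towards a center, which is the restriction the text credits for the improved exploitation.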

Experimental Results
In order to demonstrate the effectiveness of the proposed variants of FA for text clustering, several experiments were performed. The evaluation metrics are the F-measure, Entropy, Purity and ADDC (Average Distance between Documents and Centroid), and the proposed variants are compared against K-means (Jain, 2010), Particle Swarm Optimization (PSO) (Cui et al., 2005) and FA-Kmeans (Rui et al., 2012). All experiments were performed in Matlab on Windows 8 with a 2000 MHz processor and 4 GB of memory. We executed our algorithms, K-means, PSO and FA-Kmeans with five different numbers of iterations. For each setting, we executed the algorithms 20 times and calculated the average value of each performance measure.

Data Sets
We used a dataset obtained from a text collection that has been extensively used by researchers in the area of text clustering. The dataset, denoted 20Newsgroups, was extracted from the UCI machine learning repository (Lichman, 2013). Each chosen document belongs to only one topic. The topics are hardware, baseball and electronics. The dataset contains 300 documents and 2275 terms. A description of the dataset is presented in Table 1.

Evaluation Metrics
The F-measure (Murugesan and Zhang, 2011) depends on the recall and precision values and is widely used in Information Retrieval (Manning et al., 2008). The maximum F-measure of a class is given by Equation 16, where θ_k denotes the class, C_j denotes the cluster, R(θ_k, C_j) is the recall and P(θ_k, C_j) is the precision, defined in Equations 17 and 18 respectively. The total F-measure is the sum of the maximum F-measure of each individual class, weighted according to the class size, as in Equation 19.
Entropy, on the other hand, measures the goodness of clusters and their randomness (Shannon, 1948; Murugesan and Zhang, 2011). It also measures the distribution of classes in each cluster. A clustering solution reaches its best performance when each cluster contains documents from a single class; in this situation, the entropy of the clustering solution is zero. A smaller entropy value indicates better clustering performance. Equation 20 gives the entropy of an output cluster C_j, computed over the probability distribution of the classes in C_j, where C_j is an output cluster of the clustering algorithm, j indexes the output clusters, θ_c is a known class and c indexes the known classes. The total entropy of a clustering algorithm is the sum of the individual cluster entropies weighted according to cluster size, calculated through Equation 21, where N is the number of documents in the collection.
Purity measures clustering quality by the extent to which a cluster contains only one class of data (Murugesan and Zhang, 2011). It can also be defined as the maximal precision value for each class.
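As an illustration of how these external measures are computed from class labels and cluster assignments, here is a Python sketch that uses commonly stated definitions. It is an assumption that these match the paper's Equations 16 to 23 term for term; in particular, the natural logarithm is used for entropy and no normalization by the number of classes is applied. All function names are hypothetical.

```python
import math
from collections import Counter

def total_f_measure(classes, clusters):
    """Class-size-weighted sum of each class's best F-measure over clusters."""
    n = len(classes)
    class_sizes, cluster_sizes = Counter(classes), Counter(clusters)
    joint = Counter(zip(classes, clusters))  # documents of class k in cluster j
    total = 0.0
    for k, size_k in class_sizes.items():
        best = 0.0
        for j, size_j in cluster_sizes.items():
            n_kj = joint[(k, j)]
            if n_kj:
                recall, precision = n_kj / size_k, n_kj / size_j
                best = max(best, 2 * recall * precision / (recall + precision))
        total += size_k / n * best
    return total

def total_entropy(classes, clusters):
    """Cluster-size-weighted sum of per-cluster class entropies (natural log)."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for j, size_j in cluster_sizes.items():
        probs = [cnt / size_j for (k, jj), cnt in joint.items() if jj == j]
        total += size_j / n * -sum(p * math.log(p) for p in probs)
    return total

def total_purity(classes, clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    joint = Counter(zip(classes, clusters))
    best = Counter()
    for (k, j), cnt in joint.items():
        best[j] = max(best[j], cnt)
    return sum(best.values()) / len(classes)
```

For a perfect clustering the three functions return 1, 0 and 1 respectively; putting two equally sized classes into a single cluster gives an F-measure of 2/3, an entropy of ln 2 and a purity of 0.5.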
The purity of a cluster depends on the maximum number of documents shared by a class θ_k and the cluster C_j, and is computed using Equation 22. The cluster purity is calculated using Equation 23. The fourth metric is the Average Distance between Documents and Cluster centroid (ADDC) (Cui et al., 2005; Forsati et al., 2013); in its distance term, m is the number of terms in the dataset and d_i and d_j are two documents in the dataset.
Table 2 tabulates the results for F-measure, Entropy, Purity and ADDC. From the table, WFA II has a higher F-measure than WFA at iterations 1, 2, 5 and 20 (0.5878, 0.5791, 0.5945 and 0.5753). Further, WFA II has higher purity (0.8513, 0.8455 and 0.8285) at iterations 1, 2 and 5, and the smallest Entropy (0.5675, 0.5776 and 0.6245) at the same iterations. Based on the literature, a good clustering solution is one with F-measure and purity values approaching 1 and an Entropy value approaching 0 (Murugesan and Zhang, 2011). Hence, WFA II produces better F-measure, Purity and Entropy values in fewer iterations than WFA. This indicates that WFA II produces clusters of better quality.

Results of Comparison of WFA and WFA II
Further, Table 2 shows that the ADDC value of WFA II is lower than that of WFA. A smaller ADDC value indicates a more compact clustering solution (Forsati et al., 2013). Figure 9 illustrates the quality performance metrics (F-measure, Entropy, Purity and ADDC) for WFA and WFA II. Figure 9a shows the F-measure curves of WFA and WFA II. The curve of WFA II lies above the curve of WFA at iterations 1, 2, 5 and 20, which means that WFA II generates a higher F-measure than WFA. Figure 9b shows the Entropy curves of WFA and WFA II. The curve of WFA II is lower than the curve of WFA at iterations 1, 2 and 5. Figure 9c shows the Purity curves of WFA and WFA II. The curve of WFA II is higher than the curve of WFA at iterations 1, 2 and 5.
In Fig. 9d, the ADDC curve of WFA II is smooth from iteration 1 to the last iteration (i.e., 20), while the WFA curve rises at iteration 5. Furthermore, WFA II produces a smaller ADDC value than WFA. According to Murugesan and Zhang (2011) and Forsati et al. (2013), a better clustering solution is one with high F-measure and Purity values but low Entropy and ADDC values.
Table 3 tabulates the results for F-measure, Entropy, Purity and ADDC for WFA II and three comparative algorithms: K-means (Jain, 2010), PSO (Cui et al., 2005) and FA-Kmeans (Rui et al., 2012). From the table, WFA II has high F-measure values and its best value (0.5945) is obtained at iteration 5. The highest value for K-means, on the other hand, is 0.5018 and is produced at iteration 10. A similar situation is seen in the results generated by PSO, whose best F-measure is also obtained at iteration 10. FA-Kmeans produces the lowest F-measure at most iterations.

Results of Comparison of WFA II and Existing Algorithms
As for Entropy, a value that approaches 0 indicates a better algorithm (Murugesan and Zhang, 2011). Table 3 shows that WFA II has the smallest Entropy at all iterations, with the best value (0.5675) generated at iteration 1. For K-means, PSO and FA-Kmeans, there is not much difference between the Entropy values at most iterations, and the values obtained are larger than those of WFA II.
As with the F-measure, a larger Purity value indicates a better algorithm (Murugesan and Zhang, 2011). Purity is highest for WFA II, whose best value (0.8513) is generated at iteration 1, while PSO generates the worst Purity. Based on the literature (Murugesan and Zhang, 2011; Forsati et al., 2013), a good clustering solution is one with F-measure and Purity values approaching 1 and an Entropy value approaching 0. Hence, WFA II produces better F-measure, Purity and Entropy values than K-means, PSO and FA-Kmeans. This indicates that WFA II produces clusters of better quality.
Further, the ADDC value, which is based on Euclidean distance, examines how well the clustering results satisfy the optimization constraints. As with Entropy, a smaller value indicates a better algorithm (Forsati et al., 2013).
Table 3 shows that the ADDC values of K-means and PSO are lower than that of WFA II; however, the ADDC curve of WFA II is smoother than those of K-means and PSO and does not show any increase. Figure 10 presents the results obtained by the proposed WFA II algorithm, K-means, Particle Swarm Optimization and FA-Kmeans graphically.

Discussion
This section discusses why the proposed WFA II outperformed its competitors (i.e., K-means, PSO and the hybrid FA-Kmeans). WFA II originates from the Firefly Algorithm. It works in a 2D grid and forms clusters automatically using representative points (centroids), without any prior knowledge about the dataset and without a pre-determined value of k. K-means, PSO and FA-Kmeans, in contrast, need to be supplied with the number of clusters, k. The points (centroids) in WFA II are actual documents, whereas in K-means the initial centroids are chosen randomly and clusters are then produced by minimizing the distance between each document and its center. K-means later re-calculates each center as the mean of the documents in the cluster. Such an approach may propose a point that lies outside the dataset, and this can lead to local optima. In PSO, the initial centroids are chosen randomly from vectors in the dataset. The clustering solution of each particle is then optimized using the ADDC metric, and the final solution is the one with the smallest ADDC value. Similarly, the FA-Kmeans model also selects its initial centroids randomly.
Furthermore, in both WFA approaches, upon the identification of a centroid, two clusters are created based on a density threshold (as used in density-based clustering) and cosine similarity. This similarity measure was found to be more suitable than the Euclidean distance employed in K-means, PSO and FA-Kmeans. The use of a threshold leads to compact clusters, which affects the performance metrics positively.

Conclusion
In this study, we have addressed the problem of finding optimal initial cluster centers (centroids), a poor choice of which can cause a search to be trapped in local optima. A newer metaheuristic approach, the Firefly Algorithm, is utilized for document clustering. We propose that each document is represented by a single firefly and that the total weight of a document is the initial brightness of the firefly. The firefly (document) that has the brightest light intensity and is similar to other fireflies (documents) is identified as the centroid. Such an operation is a new approach to utilizing the Firefly Algorithm in document clustering. This study also proposes a second algorithm, WFA II, which employs a more restrictive condition in identifying the members of a cluster.
The performance of the proposed Firefly Algorithms was tested on a standard text classification dataset, 20Newsgroups, and evaluated using four performance measures: F-measure, Entropy, Purity and ADDC. The results indicate that WFA II outperformed WFA, PSO, K-means and FA-Kmeans. This shows that better clustering can be obtained once the exploitation of a search solution is improved.

Funding Information
The authors would like to thank the Ministry of Education for providing financial support under the Fundamental Research Grant Scheme (s/o: 12894).

Author's Contributions
Athraa Jasim Mohammed: Undertook the required experiments and analysed the obtained results.
Yuhanis Yusof: Designed the research and prepared the workflow.
Husniza Husni: Organized the writing and structure of the manuscript.

Ethics
This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and that no ethical issues are involved.