An Efficient Approach for Computing Silhouette Coefficients

Abstract: One popular approach for finding the best number of clusters (K) in a data set is to compute the silhouette coefficients. The silhouette coefficients are first computed for different values of K, and the value of K that yields the maximum coefficient is then chosen. However, computing the silhouette coefficient for different values of K is very time consuming because of the CPU time spent on distance calculations. This article presents an approach that computes the silhouette coefficient quickly by decreasing the number of addition operations needed when computing distances. When applied to different data sets, the approach saved more than 50% of the CPU time on several of them.


INTRODUCTION
Clustering, also called unsupervised learning, is defined as the process of grouping a set of objects into classes (groups) of similar objects, such that the objects in a group are similar (or related) to each other and different from (or unrelated to) the objects in other groups [1].
However, in most clustering algorithms (e.g., K-means and PAM), the number of clusters (or groups) in a data set is usually not known [2][3][4]. These algorithms require a fixed value of K to be provided in advance.
Finding the right number of clusters is a challenging issue in the cluster analysis literature, for which no unique solution exists [4,5]. Therefore, different approaches have been proposed [4][5][6][7]. One of the most popular methods for selecting the right value of K is by means of the silhouette coefficients [4][8][9][10][11]. For a given point i in a cluster A, the silhouette of i, s(i), is defined as follows [4]:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average dissimilarity between point i and all other points in A (the cluster to which i belongs) and b(i) is the average dissimilarity between point i and the points in the cluster closest to A, denoted B.
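The definition of s(i) can be illustrated with a minimal sketch. It assumes the standard form s(i) = (b(i) - a(i)) / max{a(i), b(i)} from [4], Euclidean distance as the dissimilarity, and at least two clusters. The function names, the toy points and the convention of returning 0 for a point that is alone in its cluster are assumptions made for illustration, not part of the article.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))

def silhouette(i, points, labels):
    """Silhouette s(i) of point i, given one cluster label per point (K >= 2)."""
    own = labels[i]
    # Accumulate distances from point i to every other point, per cluster.
    sums, counts = {}, {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j == i:
            continue
        sums[lab] = sums.get(lab, 0.0) + euclidean(points[i], p)
        counts[lab] = counts.get(lab, 0) + 1
    if own not in counts:
        return 0.0  # assumption: a point alone in its cluster gets s(i) = 0
    a = sums[own] / counts[own]  # average dissimilarity within own cluster
    # b(i): smallest average dissimilarity to any other cluster
    b = min(sums[lab] / counts[lab] for lab in sums if lab != own)
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
values = [silhouette(i, points, labels) for i in range(len(points))]
```

Each call computes a distance from point i to every other point, so evaluating all n silhouettes for one clustering takes on the order of n² distance computations.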
The average of the silhouettes of all points in the data set, S', is called the average silhouette width. For a given number of clusters it is denoted by S'(K), and it is used to select the right value of K by choosing the K for which S'(K) is as high as possible. The Silhouette Coefficient (SC) is then defined as follows:

$$SC = \max_{K} S'(K)$$

where the maximum is taken over all K for which the silhouettes can be constructed, which means K = 2, 3, …, n-1 [4].
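The selection rule can be sketched as follows, assuming the per-point silhouettes have already been computed for each candidate K. The function names and the silhouette values are hypothetical and serve only to show the averaging and the maximum over K.

```python
def average_silhouette_width(silhouettes):
    """S'(K): mean of the silhouettes s(i) of all points for one clustering."""
    return sum(silhouettes) / len(silhouettes)

def silhouette_coefficient(silhouettes_by_k):
    """Return (best K, SC), where SC is the largest S'(K) over the given K."""
    widths = {k: average_silhouette_width(s) for k, s in silhouettes_by_k.items()}
    best_k = max(widths, key=widths.get)
    return best_k, widths[best_k]

# Made-up silhouette values for three candidate values of K.
silhouettes_by_k = {
    2: [0.8, 0.6, 0.7, 0.7],
    3: [0.9, 0.7, 0.8, 0.8],
    4: [0.1, 0.2, 0.3, 0.4],
}
best_k, sc = silhouette_coefficient(silhouettes_by_k)
```

Every entry of `silhouettes_by_k` corresponds to one run of the clustering algorithm and one full pass of distance calculations, which is why the cost grows with the number of K values tried.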
As illustrated above, computing the silhouette coefficients for one value (one run) of K requires distance calculations (usually Euclidean distances) from each point i to the points in its own cluster and to the points in the neighboring clusters. This process is repeated for many values (many runs) of K, so a long CPU time is spent on distance calculations.
In this article, a new approach is proposed to compute the silhouette coefficients quickly. The approach is based on omitting some of the addition operations.

PROPOSED APPROACH
The Euclidean distance between a query point q and a data point x in a D-dimensional space R^D is given as follows:

$$d(q, x) = \sqrt{\sum_{j=1}^{D} (q_j - x_j)^2} \tag{1}$$

Applying Eq. 1, the calculation of one distance involves D multiplications, D additions and D subtractions, where D is the number of variables (dimensions). Computing N distances therefore involves ND multiplications, ND additions and ND subtractions. For large data sets with a large number of dimensions, the computation of distances requires a long CPU time, which hinders the development of effective processes that involve distance computations, as is the case when computing the silhouette coefficients in this study.

In the proposed approach, Eq. 1 can be written as follows [12]:

$$d^2(q, x) = \sum_{j=1}^{D} \left( q_j^2 - 2 q_j x_j + x_j^2 \right) \tag{2}$$

or as:

$$d^2(q, x) = \sum_{j=1}^{D} q_j^2 \;-\; 2 \sum_{j=1}^{D} q_j x_j \;+\; \sum_{j=1}^{D} x_j^2 \tag{3}$$

where [the definitions that followed Eq. 3, including that of W, are missing here].

The first and third terms of Eq. 3 can be calculated only once (for all runs) for the whole data set in a preprocessing step and stored as dimensionless quantities. This is also valid for computing W. Therefore, only the second term of Eq. 3 is calculated at run time; hence, ND additions can be saved and the performance of the distance computations is expected to increase.
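The idea of moving part of the distance computation into a preprocessing step can be sketched as below. The sketch assumes that the rewritten distance is the usual expansion of the squared Euclidean distance: the sum of squares of q, minus twice the sum of the products q_j x_j, plus the sum of squares of x. The function names and the sample points are hypothetical, and this is not the authors' implementation.

```python
import math

def squared_norms(points):
    """Preprocessing step: sum of squared coordinates of each point, computed once."""
    return [sum(c * c for c in p) for p in points]

def distance_conventional(q, x):
    """Euclidean distance computed directly from the coordinate differences."""
    return math.sqrt(sum((qj - xj) ** 2 for qj, xj in zip(q, x)))

def distance_precomputed(q, x, q_norm, x_norm):
    """Euclidean distance from the expanded form.

    q_norm and x_norm come from squared_norms(); only the cross term
    (the sum of q_j * x_j) is computed at run time.
    """
    cross = sum(qj * xj for qj, xj in zip(q, x))
    d2 = q_norm - 2.0 * cross + x_norm
    return math.sqrt(max(d2, 0.0))  # clamp small negative values from round-off

# Hypothetical sample points in three dimensions.
points = [(1.0, 2.0, 3.0), (4.0, 6.0, 3.0), (0.0, 0.0, 0.0)]
norms = squared_norms(points)  # reused across all runs (all values of K)
d01 = distance_precomputed(points[0], points[1], norms[0], norms[1])
```

The stored norms account for the extra memory that the preprocessing step requires. The expanded form can also lose some precision through cancellation when two points are very close, which is why the sketch clamps the squared distance at zero before taking the square root.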

RESULTS AND DISCUSSION
We investigated the efficiency of the proposed approach, compared to the conventional (exhaustive) approach, when applied to different data sets to compute the silhouette coefficients. The proposed approach generated outputs identical to those of the conventional approach. Its performance is reported in terms of the CPU time and the percentage of savings compared to the conventional approach.
In our tests, six data sets were used to compute the silhouette coefficients. The first two sets were randomly generated, while the other four were obtained from the UCI Repository of Machine Learning Databases [13]. These are the Breast, Letter, Pima and Segmentation data sets. The data sets are described in Table 1, where N is the number of points and D is the dimensionality (number of dimensions) of the data.
Table 2 shows the CPU run time (in seconds) for the proposed approach and the conventional (exhaustive) approach when applied to the Rnd1 data set. The proposed approach shows a good speed improvement over the conventional approach in all cases, with CPU time savings of up to 35%.
Table 3 shows the CPU run time (in seconds) for the proposed and conventional approaches when applied to the Rnd2 data set. The proposed approach shows a very good speed improvement over the conventional approach in all cases, with CPU time savings of up to 43%.
Table 4 shows the CPU run time (in seconds) for the proposed and conventional approaches when applied to the Pima data set. The proposed approach shows a significant speed improvement over the conventional approach in all cases, with CPU time savings of up to 57%.
Table 5 shows the CPU run time (in seconds) for the proposed and conventional approaches when applied to the Breast data set. The proposed approach shows a significant speed improvement over the conventional approach in all cases, with CPU time savings of up to 56%.
Table 6 shows the CPU run time (in seconds) for the proposed and conventional approaches when applied to the Letter data set. The proposed approach shows a significant speed improvement over the conventional approach in all cases, with CPU time savings of up to 52%.
Table 7 shows the CPU run time (in seconds) for the proposed and conventional approaches when applied to the Segmentation data set. The proposed approach shows a significant speed improvement over the conventional approach in all cases, with CPU time savings of up to 56%.
The results presented in the tables above show that the CPU time savings of the proposed approach increase as the dimensionality increases. This is expected, since the savings come from decreasing the number of addition operations, and that number is proportional to the number of dimensions.

CONCLUSION
An approach to compute the silhouette coefficient quickly has been presented. The approach is based on decreasing the number of addition operations when computing distances. It proved efficient, saving more than 50% of the CPU time on several of the data sets tested. However, some extra memory is needed to store the data from the preprocessing step discussed earlier. This will be handled in future work.

Table 5: The CPU run time (in seconds) of the proposed and conventional approaches when applied to the Breast data set for different numbers of runs