An Efficient Unified K-Means Clustering Technique for Microarray Gene Expression Data

Problem statement: Microarray techniques make it possible to monitor the expression levels of thousands of genes simultaneously. One challenge is how to derive meaningful insights from the expression data. This can be done with clustering techniques such as hierarchical and k-means clustering, but most clustering techniques are largely heuristic in nature and are associated with unresolved issues, such as how to fix the precise number of clusters and how to visualize the results in pictorial form. Approach: Determine the accurate number of clusters from gene expression data and validate the results using the correctness ratio and sum of squares criteria. A new approach is suggested to address the primary issue of the k-means clustering algorithm, namely that the number of clusters must be predefined. This approach provides an accurate number of clusters by minimizing the squared error function and maximizing the correctness ratio value. Results: The experimental results showed the efficiency of our method by calculating and comparing the sum of squares for different k values. It was concluded that the number of clusters was accurate at the minimum sum of squares value and the maximum correctness ratio value. Conclusion: The results showed that the quality of the clusters and the performance of the new approach are improved.


INTRODUCTION
The advent of microarray technology made it possible to monitor the expression levels of thousands of genes concurrently, whereas traditional approaches focus on local examination and collection of data for a single gene (Wilkin and Huang, 2007; Chen et al., 2005). Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare the expression of a set of genes from a cell maintained in a particular 'condition A' to the same set of genes from a reference cell maintained under normal 'condition B'. The processed data, after normalization, can be represented in the form of a matrix. Each row in the matrix corresponds to a particular gene and each column corresponds either to an experimental condition or to a specific time point at which the expression of the genes has been measured. The huge volume of data generated by microarray techniques is collected and stored in massive databases. Traditional techniques and tools are not adequate to deal with this data and obtain the desired results (Jiang et al., 2004; Eisen et al., 1998; Ali et al., 2009). The challenge is to effectively analyze and interpret such a huge volume of information. Two statistical operations commonly applied to microarray data are classification and clustering (Suresh et al., 2009; Kumar, 2009). Classification is a supervised technique in which objects are classified using known class labels, whereas clustering is an unsupervised technique requiring no predefined class labels. As we have little knowledge of the complete data set, we have favored unsupervised methods (Eisen et al., 1998). Clustering groups patterns so that the patterns within a group are similar to one another and dissimilar to the patterns in different groups. Many tools that cluster microarray data employ methods such as hierarchical clustering, k-means clustering and self-organizing maps to analyze and interpret the data.
As each technique has its own disadvantages, a new approach is required to overcome them. K-means clustering adopts a non-hierarchical approach to cluster N objects into K partitions, where 0<K<N. It randomly selects K of the objects, each of which initially represents a cluster mean, then assigns each of the remaining objects to the cluster to which it is the most similar, based on the distance between the object and the cluster mean (Chen et al., 2005; Jaradat et al., 2009). Very common measures include the sum of distances or the sum of squared Euclidean distances from the mean of each cluster. It then recomputes the mean value for each cluster, and this process is repeated until no more reassignments occur (Han and Kamber, 2001). The objective of k-means is to minimize the total intra-cluster variance, or the squared error function. The squared error function has the standard form $E = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^2$, where $C_i$ is the $i$-th cluster and $m_i$ is its mean. This algorithm is sensitive to the initial value of k; hence it may produce different results for different k values and it may find only a local optimum rather than the global one (Al-Zoubi et al., 2010). It is also sensitive to noise and outlier objects, since a small number of such objects can substantially influence the mean value.
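To make the squared error function concrete, the following minimal Python sketch evaluates it for two hand-made partitions of four toy profiles. The data values and the names `squared_error`, `good_partition` and `poor_partition` are illustrative assumptions and are not taken from the study.

```python
def squared_error(clusters):
    """Sum of squared Euclidean distances of every point to its cluster mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Mean of the cluster, computed coordinate by coordinate.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in points for d in range(dim))
    return total

# Hypothetical toy profiles (two conditions per gene), grouped in two ways.
good_partition = [[(1.0, 1.0), (2.0, 1.0)], [(9.0, 1.0), (10.0, 1.0)]]
poor_partition = [[(1.0, 1.0)], [(2.0, 1.0), (9.0, 1.0), (10.0, 1.0)]]

error_good = squared_error(good_partition)
error_poor = squared_error(poor_partition)
print(error_good, error_poor)
```

Under these assumptions the first partition gives a squared error of 1.0 and the second 38.0, which illustrates why minimizing this function favors compact clusters.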
In the past, hierarchical and k-means methods have been the primary tools employed to cluster microarray data. The major limitation of these methods is their inability to determine the number of clusters (Mar and McLachlan, 2003). Model-based clustering has become essential for microarray gene expression data because it helps determine the number of clusters and provides a statistical framework to model the cluster structure of gene expression data. In this approach, the data are assumed to be generated by a finite mixture of underlying probability distributions in which each component represents a different cluster (Yeung et al., 2001). For a fixed number of components G, the model parameters can be estimated using the EM algorithm, a general approach to maximum likelihood estimation in the presence of incomplete data. Let the complete data be $y_i = (x_i, z_i)$, where $z_i = (z_{i1}, \ldots, z_{iG})$. The EM algorithm iterates between an E-step, in which the values of $z_{ik}$ are computed from the data with the current parameter estimates, and an M-step, in which the model parameters are estimated so as to maximize the likelihood of the complete data given the estimated $z_{ik}$ values. When the algorithm converges, each data object is assigned to the component with the maximum conditional probability (Suresh et al., 2009; Fraley and Raftery, 1998). In order to ascertain the number of clusters represented by the model-based method, we calculated the correctness ratio (Arima and Hanai, 2003). In this study, we suggest a new approach to solve the problems not addressed by the conventional methods.
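The E-step and M-step described above can be sketched for the simplest case, a one-dimensional Gaussian mixture with a fixed number of components. This is a minimal illustration under stated assumptions, not the model-based procedure used in the study: the function names, the quantile-based initialization, the variance floor `min_var`, the fixed iteration count and the toy data are all hypothetical choices.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_mixture_1d(xs, g, iters=200, min_var=1e-6):
    """Fit a g-component 1-D Gaussian mixture; returns (weights, means, variances, z)."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: starting means spread over the sorted data.
    means = [ordered[int((k + 0.5) * n / g)] for k in range(g)]
    centre = sum(xs) / n
    spread = max(sum((x - centre) ** 2 for x in xs) / n, min_var)
    variances = [spread] * g
    weights = [1.0 / g] * g
    z = []
    for _ in range(iters):
        # E-step: z[i][k] is the conditional probability that x_i belongs to
        # component k under the current parameter estimates.
        z = []
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(g)]
            total = sum(dens)  # assumes the densities do not all underflow to zero
            z.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted data.
        for k in range(g):
            nk = max(sum(row[k] for row in z), 1e-12)
            weights[k] = nk / n
            means[k] = sum(row[k] * x for row, x in zip(z, xs)) / nk
            var = sum(row[k] * (x - means[k]) ** 2 for row, x in zip(z, xs)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsed component
    return weights, means, variances, z

# Hypothetical toy data with two visible groups.
xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
weights, means, variances, z = em_mixture_1d(xs, 2)
# Assign each object to the component with the maximum conditional probability.
labels = [max(range(2), key=lambda k: row[k]) for row in z]
print(means, labels)
```

A fixed number of iterations is used here only for brevity; a practical implementation would normally stop when the log-likelihood no longer improves.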

MATERIALS AND METHODS
Most of the clustering algorithms employed in the literature are heuristic and have the disadvantage of requiring the precise number of clusters beforehand. The present work focuses on the k-means clustering algorithm, where the number of clusters k has to be defined arbitrarily by the user in advance. An arbitrary choice may not help researchers achieve the desired aim and makes it difficult to draw inferences of biological significance. The present method avoids this arbitrariness by automatically suggesting the number of clusters, which is obtained by applying the results of the model-based algorithm to k-means clustering. The sample dataset was downloaded from the machine learning database in order to examine the performance of the proposed method.
Output: K clusters with minimum sum of squares.
Step 1: Estimate the number of components K using the EM algorithm.
Step 2: Select K objects as the initial cluster centers, using the K from Step 1.

Repeat:
Step 3: Assign each object to the cluster to which it is the most similar, based on the centroid value.
Step 4: Update the cluster centroids using any similarity metric.
Until: the centroid values remain unchanged; otherwise go to Step 3.
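Steps 2 to 4 above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: Step 1 is not reproduced, so the number of clusters is simply assumed to have been supplied by the model-based estimate. The function name `kmeans_with_given_k`, the optional `initial_indices` argument, the use of squared Euclidean distance and the toy data are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_with_given_k(points, k, initial_indices=None, seed=0, max_iter=100):
    """Steps 2-4, with k assumed to come from Step 1 (model-based estimate)."""
    # Step 2: select k objects as the initial cluster centers.
    if initial_indices is None:
        initial_indices = random.Random(seed).sample(range(len(points)), k)
    centroids = [tuple(points[i]) for i in initial_indices]
    labels = []
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Step 4: update each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        # Until: stop once the centroid values remain unchanged.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sum_of_squares = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sum_of_squares

# Hypothetical toy profiles; k_from_step1 stands in for the output of Step 1.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
k_from_step1 = 2
labels, centroids, sum_of_squares = kmeans_with_given_k(
    points, k_from_step1, initial_indices=[0, 3]
)
print(labels, centroids, sum_of_squares)
```

The initial centers are passed explicitly in the usage example so that the result is reproducible; with random selection, k-means may settle on a different local optimum.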

RESULTS AND DISCUSSION
The result of model-based clustering is shown in Fig. 1. The figure shows that the best model is EEV; the highest point in the plot corresponds to four components (clusters). In order to ascertain the number of clusters represented by the model, we calculated and compared the correctness ratios for different k values. The number of components provided by the EEV model is confirmed by the ratio value of 0.135, which corresponds to k = 4, as given in Table 1. To minimize the squared error function in the k-means clustering algorithm, we calculated the sum of squares for different numbers of clusters (Table 2). From the obtained values, one can conclude that four clusters is optimal, with the minimum sum of squares value.

CONCLUSION
In this study, we have described a novel approach for clustering microarray gene expression data. We used the results of model-based clustering to obtain the precise number of clusters k and applied it to k-means clustering. The results of clustering yeast data show the efficiency of the new method. Future work is to enhance the performance of this new algorithm by reducing the dimensionality of the dataset so that outliers are removed, thereby increasing its efficiency and accuracy.