A Rough Set based Gene Expression Clustering Algorithm

Problem statement: Microarray technology helps in monitoring the expression levels of thousands of genes across collections of related samples. Approach: The main goal in the analysis of large and heterogeneous gene expression datasets was to identify groups of genes that are expressed under a set of experimental conditions. Results: Several clustering techniques have been proposed for identifying gene signatures and understanding their role, and many of them have been applied to gene expression data, but with only partial success. The main aim of this work was to develop a clustering algorithm that would successfully identify gene patterns. The proposed clustering technique (RCGED) provides an efficient way of finding hidden and unique gene expression patterns. It overcomes the restriction of an object being placed in only one cluster. Conclusion/Recommendations: The proposed algorithm is termed intelligent because it automatically determines the optimum number of clusters. The proposed algorithm was tested on a colon cancer dataset and the results were compared with those of the Rough Fuzzy K Means algorithm.


INTRODUCTION
Biological data are being produced at a phenomenal rate, and the repositories that hold them are growing just as quickly. On average, these databases double in size every 10 months. The quantity and variety of information being produced cannot be handled efficiently by unaided human analysis. The data become easier to comprehend if the genes are subdivided into smaller categories that can then be analyzed. This is where clustering comes in.
Cluster analysis plays a major role in Knowledge Discovery and Data Mining (KDDM). Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. A good clustering increases intra-class similarity and decreases inter-class similarity. Clustering of gene expression data helps to understand gene functions and gene regulation and assists in pattern recognition in gene expression profiles. Genes with similar expression patterns can be grouped together, which helps in further understanding the functions of unknown and abnormal patterns.
Related work: The hybrid fuzzy c-means clustering technique proposed by Valarmathie et al. (2009) combines Fuzzy C-Means with the Expectation Maximization algorithm to determine the precise number of clusters and to interpret them efficiently. Noureen and Qadir (2009) proposed a simple and efficient biclustering algorithm (BiSim) that reduces the complexity and extra computation of the Bimax algorithm. Thilagamani and Shanthi (2010) conducted a survey concluding that clustering algorithms based on rough sets are neither as restrictive as crisp clustering nor as descriptive as fuzzy clustering. Pavan et al. (2010) proposed a Single Pass Seed Selection (SPSS) algorithm, an extension of K-means++ that works well with high dimensional data sets. K-Biclusters Clustering (the KBC algorithm), proposed by Tsai and Chiu (2010), minimizes the dissimilarities between genes and bicluster centers. Additionally, it tries to minimize the residue within the clusters and to involve as many conditions as possible. Venkatesh and Thangaraj (2008) proposed a SOM based clustering and artificial intelligence technique to analyse patterns of soil distributed across a geographical area. Maji (2011) proposed a new clustering algorithm, termed Fuzzy-Rough Supervised Attribute Clustering (FRSAC), to find groups of coregulated genes whose collective expression is strongly associated with sample categories. It introduces a new quantitative measure based on fuzzy-rough sets that incorporates the information of sample categories to measure the similarity among genes, whereby redundancy among the genes is removed.

Research background:
Rough set-definition: Rough set theory, introduced by Pawlak (1982), deals with uncertainty and vagueness. It is a mathematical approach to imperfect knowledge. Rough sets can be considered as sets with fuzzy boundaries, i.e., sets that cannot be precisely characterized using the available set of attributes. Rough set theory has become popular among scientists around the world due to its fundamental importance in the fields of artificial intelligence and cognitive sciences. Like fuzzy set theory, it is not an alternative to classical set theory but is embedded in it.
Suppose we are given a set of objects U called the universe and an indiscernibility relation R ⊆ U × U, representing our lack of knowledge about elements of U. For the sake of simplicity we assume that R is an equivalence relation. Let X be a subset of U. We want to characterize the set X with respect to R:
• The lower approximation of a set X with respect to R is the set of all objects which can be for certain classified as X with respect to R (are certainly X with respect to R)
• The upper approximation of a set X with respect to R is the set of all objects which can be possibly classified as X with respect to R (are possibly X in view of R)
• The boundary region of a set X with respect to R is the set of all objects which can be classified neither as X nor as not-X with respect to R
Now we are ready to give the definition of rough sets:
• Set X is crisp (exact with respect to R), if the boundary region of X is empty
• Set X is rough (inexact with respect to R), if the boundary region of X is nonempty
Formal definitions of approximations are as follows: R-lower approximation of X:
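In Pawlak's standard formulation, writing R(x) for the equivalence class of R that contains x (a notation assumed here), the lower approximation, the upper approximation and the boundary region can be stated as follows.

```latex
\begin{align*}
R_{*}(X) &= \{\, x \in U : R(x) \subseteq X \,\} \\
R^{*}(X) &= \{\, x \in U : R(x) \cap X \neq \emptyset \,\} \\
RN_{R}(X) &= R^{*}(X) \setminus R_{*}(X)
\end{align*}
```

As a small illustrative example, let U = {1, 2, 3, 4, 5, 6} with equivalence classes {1, 2}, {3, 4} and {5, 6}, and let X = {1, 2, 3}. Then the lower approximation is {1, 2}, the upper approximation is {1, 2, 3, 4} and the boundary region is {3, 4}, so X is rough with respect to R.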

Clustering gene expression data:
Clustering is one of the first steps in gene expression analysis. One of the important characteristics of gene expression data is that it is meaningful to cluster both genes and samples. During cluster analysis, genes are clustered based on similarity. A proximity measure quantifies the similarity (or distance) between two data objects and is computed by a proximity function of their corresponding vectors. Euclidean distance is one of the most commonly used measures of the distance between two data objects. Its main drawback is that it does not score well for scaled patterns or profiles of genes. The Manhattan distance is closely related to Euclidean distance. It sums the absolute differences along each dimension, whereas Euclidean distance measures the length of the shortest path between two points. Another measure is Pearson's correlation coefficient, which measures the similarity between the shapes of two expression patterns (profiles). Pearson's correlation coefficient is widely used and has proved to be effective in many clustering algorithms for gene expression data (Jiang et al., 2004). The main drawback of this measure is that it is not robust to outliers. To address this problem, another measure, Spearman's rank correlation coefficient, was introduced. It is more robust against outliers than Pearson's correlation coefficient. As a preliminary step, a survey of rough set based clustering and its advantages over conventional methods was carried out and analyzed.
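The behaviour of the four proximity measures discussed above can be illustrated with a short sketch. The profiles gene_a, gene_b and gene_c are made-up toy values, not data from this study, and the helper names are chosen only for illustration.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two expression profiles.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    # Pearson's correlation coefficient; undefined for constant profiles.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    # 1-based ranks, with tied values sharing their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    # Spearman's coefficient is Pearson's coefficient computed on ranks.
    return pearson(ranks(x), ranks(y))

# Toy profiles (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # gene_a scaled by 10
gene_c = [1.0, 2.0, 3.0, 4.0, 50.0]      # gene_a with one outlier

print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b))
print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c), spearman(gene_a, gene_c))
```

The scaled profile gene_b is far from gene_a under both distances, yet its Pearson correlation with gene_a is 1, because the two profiles have the same shape. The single outlier in gene_c lowers the Pearson correlation noticeably, while the Spearman correlation stays at 1 because the rank order is unchanged.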
Rough fuzzy K means algorithm: K means is one of the traditional clustering algorithms. However, this algorithm is crisp, as it places each object in exactly one cluster. To overcome the disadvantages of crisp clustering, fuzzy clustering was introduced. The distribution of membership in fuzzy methods can in turn be improved by rough clustering. Based on the lower and upper approximations of rough sets, the rough fuzzy k-means clustering algorithm makes the distribution of the membership function more reasonable (Shi et al., 2009).
The framework of the RFKM algorithm: The specific steps of the RFKM clustering algorithm are as follows:
Step 1: Determine the class number k (2 <= k <= n), the parameter m, the initial matrix of the membership function, the upper approximation limit A_i of each class, an appropriate number ε > 0 and s = 0.
Step 2: Calculate the centroids with the formula given below:
Step 3: If X_j ∉ the upper approximation, then U_ij = 0. Otherwise, update U_ij as shown below
Experimental results: The RFKM algorithm was tested on a yeast expression data set. The data set is an 834 × 7 matrix. A total of 834 genes were clustered based on 7 experimental conditions. Since RFKM requires the number of clusters to be given as input, 8 clusters were generated. Fig. 1 shows the membership matrix of the genes belonging to the different clusters, with each graph representing the membership values for one particular cluster. The algorithm was implemented in MATLAB and was also tested on a variety of other data sets.

The Proposed algorithm (RCGED):
Our proposed new algorithm, Rough Clustering of Gene Expression Data (RCGED), clusters genes based on rough set theory. The main advantage of our method is that it does not restrict a gene to one cluster. A gene can be expressed in two or more clusters, i.e., clusters may overlap. The algorithm also finds the lower and upper approximations of the clusters. It is designed to be intelligent in the sense that it detects the optimum number of clusters by itself. It uses a similarity measure based on the correlation coefficient.

The Frame work of the proposed algorithm:
Algorithm: RCGED
Input: Gene expression matrix
Output: Number of clusters, membership matrix, similarity matrix.
Step 1: For each gene g_i, compute the membership subset
Step 2: Compute the similarity or distance matrix f_sim
  k = 1;
  For each gene g_i
    k++;
    i-th gene is placed in cluster k;
    For each gene j <> i
      Compute the similarity of the i-th gene with the j-th gene, f_sim(i,j), using the correlation coefficient metric;
      If f_sim(i,j) > threshold α, place j in cluster k
    End;
  End;
Step 3: Calculate the mean m_i for the k clusters;
Step 4: Assign each data object P to the lower approximation or the upper approximation by finding the difference in its distance from the cluster centroid pairs m_i and m_j:
Step 5: If the distance is less than some threshold ∞, X_P is in the upper approximation and X_P is not in the lower approximation; else X_P is in the lower approximation.
Step 6: Compute the new mean for each cluster k and iterate until there are no more assignments.
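The grouping loop of Step 2 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation. It makes one assumption that is not stated in the pseudocode: a gene that has already been placed in a cluster is not used to seed a new one, so the number of clusters can be smaller than the number of genes. The function name seed_clusters and the toy matrix expr are hypothetical.

```python
import math

def pearson(x, y):
    # Pearson's correlation coefficient; undefined for constant profiles.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def seed_clusters(expr, alpha):
    """Group genes whose correlation with a seed gene exceeds alpha.

    expr is a list of expression profiles, one list of values per gene.
    Returns the clusters (lists of gene indices) and the similarity matrix.
    """
    n = len(expr)
    sim = [[pearson(expr[i], expr[j]) for j in range(n)] for i in range(n)]
    clusters = []
    assigned = set()
    for i in range(n):
        # ASSUMPTION (not in the pseudocode): skip genes that already
        # belong to some cluster instead of seeding a cluster per gene.
        if i in assigned:
            continue
        members = [i] + [j for j in range(n) if j != i and sim[i][j] > alpha]
        clusters.append(members)
        assigned.update(members)
    return clusters, sim

# Toy expression matrix: 4 genes x 4 conditions (illustrative values only).
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.5, 2.5, 4.0],
    [1.0, 3.0, 2.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
]
clusters, sim = seed_clusters(expr, alpha=0.85)
print(clusters)
```

In this toy run gene 1 correlates strongly with both gene 0 and gene 2, so it ends up in two clusters, which illustrates the overlap that the method allows. The number of clusters is not supplied by the caller but follows from the threshold α.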
The algorithm generates the membership matrix based on rough set theory. Based on the similarity between the genes, the algorithm finds the possible number of clusters and the distance matrix, for which it uses the correlation coefficient as the metric. Genes that are more similar are put in the same cluster. Each object is assigned to either the upper or the lower approximation of each cluster. The membership matrices for both the upper and lower approximations are also calculated dynamically, as shown in the algorithm. The means of the lower and upper approximations of each cluster are then taken as the centroid pair of that cluster. The process iterates and dynamically updates the membership matrices and the similarity matrix until there is no more change in the cluster centroids.
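The assignment in Steps 4 and 5 is described only briefly, so the following Python sketch fills in the details with a rule in the style of Lingras' rough k-means, which is an assumption rather than the authors' stated method. An object whose distances to its nearest centroid and to some other centroid differ by less than a threshold goes into the upper approximations of both clusters. Otherwise it goes into the lower approximation of its nearest cluster, and therefore also into that cluster's upper approximation. The names rough_assign and boundary_threshold, and the use of Euclidean distance, are illustrative choices.

```python
import math

def distance(x, y):
    # Euclidean distance (an illustrative choice of metric).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rough_assign(points, centroids, boundary_threshold):
    """Assign each point to lower and/or upper cluster approximations.

    Returns two lists of index lists, one entry per cluster.
    """
    k = len(centroids)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for p, x in enumerate(points):
        d = [distance(x, c) for c in centroids]
        nearest = min(range(k), key=lambda i: d[i])
        # Other clusters whose centroid is almost as close as the nearest.
        close = [j for j in range(k)
                 if j != nearest and d[j] - d[nearest] < boundary_threshold]
        if close:
            # Ambiguous object: boundary region of every candidate cluster.
            for j in [nearest] + close:
                upper[j].append(p)
        else:
            # Unambiguous object: lower approximation, which is always
            # contained in the upper approximation.
            lower[nearest].append(p)
            upper[nearest].append(p)
    return lower, upper

# Toy data: two centroids and three points (illustrative values only).
centroids = [[0.0, 0.0], [10.0, 0.0]]
points = [[1.0, 0.0], [9.0, 0.0], [5.5, 0.0]]
lower, upper = rough_assign(points, centroids, boundary_threshold=2.0)
print(lower, upper)
```

The third point lies almost midway between the two centroids, so it appears in both upper approximations and in neither lower approximation. Raising boundary_threshold moves more objects into the boundary region.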

Experimental results:
The RFKM algorithm requires the user to specify the number of clusters prior to clustering. This does not suit all problems, as the number of clusters specified by the user might be too small or too large. The result of RFKM on the colon cancer data set is shown in Fig. 2. The colon cancer data set contains expression levels of 2000 genes taken in 62 different samples, out of which 50 genes were chosen across all 62 samples. The proposed RCGED algorithm is designed to be intelligent. Unlike RFKM, it finds the optimum number of clusters on its own and then proceeds with the clustering. The algorithm tunes the threshold, and the relative importance of the upper and lower approximations of the rough sets is used in modeling the clusters. The RCGED algorithm was also tested on the colon cancer data set. The result is shown in Fig. 3.

Comparison of RCGED with RFKM:
The effectiveness of the algorithm is shown through a comparative study of the performance of Rough Fuzzy K-Means and RCGED, based on validation of the clusters generated by the two algorithms. Cluster validation is the procedure of evaluating the quality of the clusters. A validation index is a real value that quantifies that quality. Our algorithm is evaluated using the Davies-Bouldin (DB) measure as the validation index. This index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation, so lower values indicate more compact and better separated clusters. Table 1 gives the sample results and the comparative study between RFKM and RCGED. The uncertainty that prevails in the overlapping clusters is eliminated in our proposed algorithm. We can observe that the RCGED algorithm has a lower DB index value than RFKM. In all rough clustering algorithms, the number of objects in the boundary region depends on the value of the threshold α. It has been noted for our algorithm that the number of genes in the boundary region decreases as the value of α falls below 0.1. When the threshold value becomes larger, the number of genes in the boundary region also increases.
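For reference, the standard Davies-Bouldin index can be computed as in the sketch below. It treats each cluster as a plain set of points and uses Euclidean distance. How the index was applied to the lower and upper approximations in this study is not specified, so this is only the textbook definition, and the two toy partitions are made-up values.

```python
import math

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    # Component-wise mean of the points in one cluster.
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (lists of points)."""
    centers = [centroid(c) for c in clusters]
    # Within-cluster scatter: mean distance of members to their centroid.
    scatter = [sum(distance(x, m) for x in c) / len(c)
               for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case ratio of combined scatter to centroid separation.
        total += max((scatter[i] + scatter[j]) / distance(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Two toy partitions with the same centroid separation (illustrative only).
tight = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
loose = [[[0.0, 0.0], [0.0, 6.0]], [[10.0, 0.0], [10.0, 6.0]]]
db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
print(db_tight, db_loose)
```

The tighter partition yields the smaller index, which is why a lower DB value is read as a better clustering.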

DISCUSSION
Dozens of clustering algorithms have been applied to gene expression data. However, there is no single best or one-size-fits-all solution to clustering, because there is no clear criterion for, or definition of, what a cluster should be (Jain and Dubes, 1988). Clusters can be of any shape and size in the multidimensional pattern space. In the words of Jain and Dubes, "Each clustering criterion imposes certain structure on the data and if the data happen to conform to the requirements of a particular criterion, the true clusters are recovered".

CONCLUSION
Dozens of clustering algorithms have been applied to gene expression data, but there is no single best or one-size-fits-all solution to clustering. In this study, we have proposed an intelligent clustering algorithm that is based on the framework of rough sets. A more general rough fuzzy k-means algorithm was implemented and tested on different gene expression data sets. The proposed algorithm, RCGED, was also implemented and tested on the colon cancer gene expression data set. The two algorithms and their results were compared. The relative importance of the upper and lower approximations of the rough clusters was optimized using the DB index value. On this evidence, the proposed algorithm appears to perform better than the other rough set based clustering algorithm considered. As an extension of the current research work, a toolkit that integrates and visualizes the results of a few rough clustering algorithms for clustering gene expression data is being developed.