A Gene Selection Algorithm using Bayesian Classification Approach

In this study, we propose a new feature (or gene) selection algorithm using a Bayes classification approach. The algorithm can find a gene subset that is crucial for the cancer classification problem. Problem statement: Gene identification plays an important role in the human cancer classification problem. Several feature selection algorithms have been proposed for analyzing and understanding influential genes using gene expression profiles. Approach: Feature selection algorithms aim to find genes that are crucial for accurate cancer classification and that also carry biological significance. However, the performance of the existing algorithms is still limited. In this study, we propose a feature selection algorithm using a Bayesian classification approach. Results: This approach gives promising results on gene expression datasets and compares favorably with several existing techniques. Conclusion: The proposed gene selection algorithm using a Bayes classification approach is shown to find important genes that provide high classification accuracy on DNA microarray gene expression datasets.


INTRODUCTION
The classification of tissue samples into one of several classes or subclasses using their gene expression profiles is an important task and has attracted widespread attention (Sharma and Paliwal, 2008). The gene expression profiles measured through DNA microarray technology provide accurate, reliable and objective cancer classification. It is also possible to uncover cancer subclasses that are related to the efficacy of anti-cancer drugs and are hard to predict with pathological tests. Feature selection algorithms are considered to be an important way of identifying crucial genes. Various feature selection algorithms have been proposed in the literature, each with advantages and disadvantages (Sharma et al., 2011b; Tan and Gilbert, 2003; Cong et al., 2005; Golub et al., 1999; Wang et al., 2005; Li and Wong, 2003; Thi et al., 2008; Yan and Zheng, 2007; Sharma et al., 2011a). These methods select important genes using some objective function. The selected genes are expected to have biological significance and should provide high classification accuracy. However, on many microarray datasets the performance is still limited, and improvements are therefore needed.
In this study, we propose a feature selection algorithm using a Bayesian classification approach. The proposed scheme begins with an empty feature subset and, at each iteration, includes the feature that provides the maximum information to the current subset. The process of including features is terminated when no feature can add information to the current subset. The Bayes classifier is used to judge the merit of features. It is considered to be the optimum classifier. However, the Bayes classifier using the normal distribution could suffer from an ill-posed inversion of the sample covariance matrix when training samples are scarce. This problem can be resolved by regularization techniques or by pseudo-inverting the covariance matrix. The proposed algorithm is evaluated on several publicly available microarray datasets, and promising results have been obtained when compared with other feature selection algorithms.

Proposed strategy:
The purpose of the algorithm is to select a subset of features $s = \{s_1, s_2, \ldots, s_m\}$ from the original feature set $f = \{f_1, f_2, \ldots, f_d\}$, where $d$ is the dimension of the feature vectors and $m < d$ is the number of selected features. A feature $f_k$ is included in the subset $s$ if, with $f_k$ added, the subset gives the highest classification accuracy (or the lowest misclassification error). Let $\chi = \{x_1, x_2, \ldots, x_n\}$ be the training sample set, where each $x_i$ is a $d$-dimensional vector. Let $\hat{x}_i \in \Re^m$ be the corresponding vector having its features defined by the subset $s$. Let $\Omega = \{\omega_j;\ j = 1, 2, \ldots, c\}$ be the finite set of $c$ classes and $\hat{\chi}_j$ be the set of $m$-dimensional training vectors $\hat{x}_i$ of class $\omega_j$. The Bayesian classification procedure is described as follows. According to the Bayes rule, the a posteriori probability $P(\omega_j \mid \hat{x})$ is determined by the class-conditional density $p(\hat{x} \mid \omega_j)$ and the a priori probability $P(\omega_j)$. If we assume that the parametric distribution is normal, then the a posteriori probability can be defined as in Eq. 1, where $\hat{\mu}_j$ is the centroid and $\hat{\Sigma}_j$ the covariance matrix computed from $\hat{\chi}_j$, and $\hat{\Sigma}_j^{+}$ is the pseudo-inverse of $\hat{\Sigma}_j$ (which is applied when $\hat{\Sigma}_j$ is a singular matrix). If $m < n$, then $\hat{\Sigma}_j$ will be a non-singular matrix and therefore the conventional inverse $\hat{\Sigma}_j^{-1}$ can be used in Eq. 1 instead of $\hat{\Sigma}_j^{+}$.
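For reference, Bayes' rule with a multivariate normal class-conditional density has the following textbook form. It is a sketch consistent with the definitions above (the hats and the symbol $p(\cdot \mid \omega_j)$ are notational assumptions) and may differ in normalization from Eq. 1 as used by the algorithm.

```latex
\[
\begin{aligned}
P(\omega_j \mid \hat{x}) &= \frac{p(\hat{x} \mid \omega_j)\, P(\omega_j)}
                                 {\sum_{k=1}^{c} p(\hat{x} \mid \omega_k)\, P(\omega_k)}, \\[4pt]
p(\hat{x} \mid \omega_j) &= \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_j|^{1/2}}
    \exp\!\left[-\tfrac{1}{2}\,(\hat{x} - \hat{\mu}_j)^{\top}\,
    \hat{\Sigma}_j^{-1}\,(\hat{x} - \hat{\mu}_j)\right].
\end{aligned}
\]
```

When $\hat{\Sigma}_j$ is singular, $\hat{\Sigma}_j^{-1}$ in the exponent is replaced by the pseudo-inverse $\hat{\Sigma}_j^{+}$. The determinant factor is then zero and needs separate handling; one possible choice is to compare classes on the prior and the quadratic term only.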
The training set $\chi$ can be partitioned into a smaller training set $\chi_{tr}$ and a validation set $\chi_{val}$. The set $\chi_{tr}$ can be used to estimate the parameters of Eq. 1 (i.e., $\hat{\mu}_j$ and $\hat{\Sigma}_j$), and the set $\chi_{val}$ can be used to compute the classification accuracy (or misclassification error) for the feature vectors defined by the subset $s$. The procedure for finding the feature subset is described in the following algorithm:
Step 1: Given the training feature vectors $\chi$, partition them randomly into two segments, $\chi_{tr}$ and $\chi_{val}$, using partitioning ratio $r$ (we allocate approximately 60% of the samples to $\chi_{tr}$ and the remainder to $\chi_{val}$).
Step 2: Take the feature subset $s \cup f_k$ (for $k = 1, 2, \ldots, d$) one at a time and compute for this feature subset the training parameters $\hat{\mu}_j$ and $\hat{\Sigma}_j$ on the $\chi_{tr}$ segment.
Step 3: By using Eq. 1, compute the classification accuracy $\alpha_k$ using the feature subset $s \cup f_k$ on the $\chi_{val}$ segment.
Step …: … (where $\bar{\alpha}_k^{(q)}$ is the average classification accuracy at the $q$th iteration).
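The steps above can be sketched as a greedy loop. This is a minimal illustration, not the authors' implementation: the function name `forward_select`, the callback `accuracy_fn`, the number of repeated random splits `n_repeats`, and the stopping rule (stop when the best averaged accuracy no longer improves) are assumptions. The last two are inferred from the averaging mentioned in the final step and from the termination condition described in the Introduction.

```python
import random
from statistics import mean


def forward_select(n_features, n_samples, accuracy_fn,
                   n_repeats=10, train_ratio=0.6, seed=0):
    """Greedy forward selection driven by averaged validation accuracy.

    accuracy_fn(features, train_idx, val_idx) -> float is a hypothetical
    callback: it trains a classifier on the rows in train_idx, restricted
    to the given feature indices, and returns its accuracy on val_idx.
    """
    rng = random.Random(seed)
    selected, best_so_far = [], float("-inf")
    while len(selected) < n_features:
        # Step 1: random train/validation partitions (about 60% / 40%),
        # shared by all candidate features in this pass.
        splits = []
        for _ in range(n_repeats):
            idx = list(range(n_samples))
            rng.shuffle(idx)
            cut = round(train_ratio * n_samples)
            splits.append((idx[:cut], idx[cut:]))
        # Steps 2-3: score the current subset extended by each candidate.
        candidates = [k for k in range(n_features) if k not in selected]
        avg_acc = {
            k: mean(accuracy_fn(selected + [k], tr, val) for tr, val in splits)
            for k in candidates
        }
        best_k = max(avg_acc, key=avg_acc.get)
        # Assumed stopping rule: stop once no candidate improves accuracy.
        if avg_acc[best_k] <= best_so_far:
            break
        selected.append(best_k)
        best_so_far = avg_acc[best_k]
    return selected, best_so_far
```

In this sketch `accuracy_fn` stands in for the Bayes classifier of Eq. 1 evaluated on the validation segment; any scorer with the same signature can be plugged in.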
The above algorithm gives one subset of features. If more than one subset of features is required, the procedure should be repeated on the remaining features. Next, we describe the materials and methods.

MATERIALS AND METHODS
Publicly available DNA microarray gene expression datasets from the Kent Ridge Biomedical repository (http://datam.i2r.astar.edu.sg/datasets/krbd/) are used. The program code is written in Matlab and run on an i7 dual-core Pentium processor in a Linux environment.

RESULTS
In the experiments, three DNA microarray gene expression datasets have been used. The description of the datasets is given as follows.

DISCUSSION
A feature (or gene) selection algorithm using a Bayes classification approach has been presented. The pseudo-inverse of the covariance matrix is used in place of the inverse covariance matrix in the class-conditional probability density function (Eq. 1) to cater for any singularity of the matrix (i.e., when the number of selected genes > number of training samples). The gene subset is obtained in a forward selection manner. On the three DNA microarray gene expression datasets, the proposed algorithm exhibits very promising classification performance when compared with several other feature selection techniques.
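The pseudo-inverse substitution can be illustrated with a small NumPy sketch. It is a hedged illustration rather than the authors' Matlab code: the function names are hypothetical, classes are compared on the log prior and the quadratic term only (the determinant factor is dropped because it is zero for a singular covariance, which is an assumption of this sketch), and each class is assumed to have at least two training samples.

```python
import numpy as np


def fit_gaussian_classes(X_tr, y_tr):
    """Estimate prior, centroid and covariance pseudo-inverse per class."""
    params = {}
    for label in np.unique(y_tr).tolist():
        Xj = X_tr[y_tr == label]
        mu = Xj.mean(axis=0)
        # rowvar=False: rows are samples, columns are selected genes.
        # atleast_2d covers the single-feature case, where cov is a scalar.
        cov = np.atleast_2d(np.cov(Xj, rowvar=False))
        # pinv equals the ordinary inverse when cov is non-singular.
        params[label] = (len(Xj) / len(X_tr), mu, np.linalg.pinv(cov))
    return params


def predict(params, X):
    """Assign each row of X to the class with the highest score."""
    labels = list(params)
    scores = np.empty((len(X), len(labels)))
    for col, label in enumerate(labels):
        prior, mu, cov_pinv = params[label]
        diff = X - mu
        # Quadratic form (x - mu)^T Sigma^+ (x - mu) for every row.
        quad = np.einsum("ij,jk,ik->i", diff, cov_pinv, diff)
        scores[:, col] = np.log(prior) - 0.5 * quad
    return np.array(labels)[scores.argmax(axis=1)]


def validation_accuracy(X_tr, y_tr, X_val, y_val):
    """Fraction of validation rows classified correctly."""
    params = fit_gaussian_classes(X_tr, y_tr)
    return float(np.mean(predict(params, X_val) == y_val))
```

With `params = fit_gaussian_classes(X_tr, y_tr)`, `predict(params, X_val)` returns one label per validation row, and the same code runs unchanged when the number of selected genes exceeds the number of training samples. Note that the pseudo-inverse ignores directions in which a class has no observed variance, so separation along those directions does not contribute to the score.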

CONCLUSION
A gene selection algorithm using a Bayesian classification approach has been presented. The algorithm has been evaluated on several DNA microarray gene expression datasets and compared with several other existing methods. It is observed that the obtained genes exhibit high classification accuracy and also show biological significance.
The identified genes from all three datasets are described in Table 1. Their corresponding classification accuracies on TRAIN data are also given. The biological significance of the identified genes is given in the last column of the table as p-value statistics obtained using the Ingenuity Pathways Analysis (IPA, http://www.ingenuity.com) tool. For the acute leukemia dataset, the highest classification accuracy on the training set, 100%, is obtained at the 2nd iteration; for the lung cancer dataset, 100% classification accuracy is obtained at the 1st iteration; and for the breast cancer dataset, the highest classification accuracy, 95%, is obtained at the 7th iteration. The proposed algorithm is compared with several other existing techniques on DNA microarray gene expression datasets. The performance (in terms of classification accuracy) of the various techniques is given in Table 2. It can be observed that the proposed method gives high classification accuracy with a very small number of selected features.