DISTINGUISHABILITY BASED WEIGHTED FEATURE SELECTION USING COLUMN WISE K NEIGHBORHOOD FOR THE CLASSIFICATION OF GENE MICROARRAY DATASET

In data mining, much research is being carried out to discover previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they can be used for decision making in a variety of healthcare applications. Feature subset selection is an important area in which many approaches have been proposed. In this study, the authors chose three existing feature selection algorithms and analyzed their performance using the publicly available standard colon tumor dataset. The performance of the three existing methods was evaluated and each method was compared with the proposed DWFS-CKN.


INTRODUCTION
Microarrays provide a large amount of information that is significant in various medical domains. In recent years there has been an explosion in the rate of acquisition of biomedical data. Different types of microarray use different technologies for measuring mRNA expression levels.
Machine learning and statistical techniques applied to gene expression data have been used to address the question of distinguishing tumor morphology. The analysis of microarray data presents a number of unique challenges for data mining. The main types of data analysis needed for biomedical applications include gene selection, classification and clustering. One of the major goals of microarray data analysis is the discovery of biological knowledge. Here, the importance of feature selection in machine learning comes from its ability to improve learning performance. Several feature selection techniques have been developed and discussed over many years. However, finding the optimal feature selection remains a necessary but difficult problem. To address this problem, the authors selected three feature selection algorithms, which are compared and discussed with the proposed DWFS-CKN in this study.
Feature selection concerns selecting, from the full set of features, the subset that shows the best performance in classification accuracy. The process of feature selection consists of four steps: starting point, search strategy, subset evaluation and stopping criteria. For the starting point, the search for feature subsets starts with either no features or all features. For the search strategy, the best subset of features could theoretically be found by evaluating all possible subsets. For subset evaluation, the generated subsets of features need to be evaluated, and two methods are used for this, namely the filter approach and the wrapper approach (Kira and Rendell, 1992). For the stopping criteria, the researchers finally decide the criteria for halting the search. In this study

AJAS
the authors proposed a simple and efficient feature selection algorithm called "Distinguishability Based Weighted Feature Selection Using Column Wise K Neighborhood (DWFS-CKN)". The performance of the proposed algorithm was compared with three algorithms, Gini Index, MRMR and Relief-F, since these three algorithms performed well in the authors' previous evaluation. The accuracy was tested with two popular classification algorithms, Bayes and C4.5, and validated by k-fold cross validation and leave-one-out cross validation, with accuracy as the metric. The obtained results showed that the proposed DWFS-CKN algorithm achieved better accuracy as well as speed.

Objectives and Scope
Microarray experiments are expected to contribute considerably to progress in cancer treatment by enabling a precise early diagnosis, even though this is difficult. The objectives of the research were:
• To eliminate redundant, irrelevant or noisy data
• To improve the data quality and reduce the feature space
• To develop a new algorithm for feature selection that maximizes classification accuracy
The aim of the present study was to verify whether the data selection depended on the algorithm. The scope of the present study was restricted to the adoption of three algorithms for analyzing already available data.

Previous Works
Many successful feature selection algorithms have been devised. Gheyas and Smith (2010) studied the goodness of a feature subset. Huang et al. (2005) suggested an efficient selection of discriminative genes from microarray gene expression data for cancer diagnosis. Dai et al. (2006) demonstrated dimension reduction for classification with gene expression microarray data. Wang and Palade (2007) presented a comprehensive fuzzy based framework for gene expression analysis of cancer microarray data. This method used three microarray cancer datasets, namely leukemia, colon cancer and lymphoma. A novel fuzzy based system was used for both gene selection and classification on the microarray gene expression data, and the performance achieved by that method was viable. Yeh et al. (2007) applied data mining techniques to cancer classification using gene data. Feature selection from the microarray dataset was carried out using a t-statistics based algorithm (t-GA), and a decision based classifier was used on the top datasets. Another study proposed an approach for cancer classification using the expression of very few genes. That method involved two steps. The first step was the selection of important genes by means of a gene ranking scheme. The second step was the evaluation of the classification accuracy of gene combinations using a fine classifier. Hang and Wu (2009) described a new approach called "Sparse Representation" that uses microarray gene expression profiles for cancer diagnosis. Nine human tumor types were used as the dataset in their research. Rejani and Selvi (2009) proposed tumor detection from mammograms by extracting features that categorize tumors. Osareh and Shadgar (2010) conducted microarray data analysis for cancer classification. An automated system was developed for consistent cancer analysis based on gene microarray expression data. The researchers used microarray datasets that included both binary and multi-class cancer problems.

The Proposed Distinguishability Based Weighted Feature Selection Using Column Wise K-Neighborhood
In this section the authors present an algorithm called "Distinguishability based Weighted Feature Selection using Column wise k Neighborhood (DWFS-CKN)".
In the proposed algorithm, feature weights are calculated from the classifiable or distinguishable nature of the member points of each feature, using a column wise k-neighborhood method. That is, if for a particular feature column most of the points clearly belong to one class and are distinguishable from the other classes based on the k-neighborhood of each value, then the feature weight of that column is high. The feature with the highest weight is therefore the most important attribute of the data, and the feature with the lowest weight is the least important. For classification tasks, the authors select a small set of the top-ranked features, that is, those with high feature weights. The following algorithm describes the proposed Distinguishability based Weighted Feature Selection using Column wise k Neighborhood (DWFS-CKN).

Let D be the dataset of m records and let T be the corresponding class ids of the m records of D. The dataset D can be grouped into c subgroups based on class membership as $D = \{g_1, g_2, \ldots, g_c\}$, where $g_1, g_2, \ldots, g_c$ are the c subsets of data belonging to the c classes.
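The weighting idea described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact procedure: it assumes that the weight of a feature is the average fraction of each value's k nearest neighbors, measured by absolute difference within the same column, that share its class id. The function names `column_weight` and `rank_features` and the default `k=3` are hypothetical.

```python
def column_weight(column, labels, k=3):
    """Average share of same-class points among each value's k nearest
    neighbors within one feature column (assumed weighting rule)."""
    n = len(column)
    total = 0.0
    for i in range(n):
        # Rank all other points by their distance to column[i] in this column only.
        others = sorted((abs(column[i] - column[j]), j) for j in range(n) if j != i)
        neighbors = [j for _, j in others[:k]]
        same = sum(1 for j in neighbors if labels[j] == labels[i])
        total += same / len(neighbors)
    return total / n


def rank_features(data, labels, k=3):
    """Return (feature indices sorted from highest to lowest weight, weights)."""
    n_features = len(data[0])
    weights = [column_weight([row[f] for row in data], labels, k)
               for f in range(n_features)]
    order = sorted(range(n_features), key=lambda f: weights[f], reverse=True)
    return order, weights


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
data = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 1.5], [1.0, 4.5], [1.1, 0.5]]
labels = [0, 0, 0, 1, 1, 1]
order, weights = rank_features(data, labels, k=2)
```

Under this assumed rule, a column whose values cluster by class receives a weight close to 1, while a column in which the classes are interleaved receives a low weight, which matches the ranking behavior described in the text.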

Gini Index
The Gini coefficient, or Gini index, is a measure of inequality developed by the Italian statistician Corrado Gini and published in his 1912 paper "Variabilità e mutabilità". The Gini coefficient is often calculated by:
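For ranking features, the Gini index is commonly taken as the Gini impurity of the class labels, $1 - \sum_i p_i^2$, evaluated on the two groups produced by thresholding a feature. This may differ from the exact form the authors used. The sketch below is a minimal illustration under that assumption; the function names and the `threshold` argument are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split_score(column, labels, threshold):
    """Size-weighted impurity of the two groups obtained by splitting
    one feature column at the given threshold (lower is better)."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
```

A feature whose best threshold gives a score near zero separates the classes well, so features can be ranked by their lowest achievable split score.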

Maximum Relevance-Minimum Redundancy (MRMR)
The MRMR scheme selects the features that correlate most strongly with the classification variable while keeping the redundancy among the selected features low (Peng et al., 2005).
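A greedy form of this selection can be sketched as below. This is an illustration only: it uses the absolute Pearson correlation as a simple stand-in for the mutual information used by Peng et al. (2005), and the function names `pearson` and `mrmr` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def mrmr(columns, labels, n_select):
    """Greedily pick features maximizing relevance minus mean redundancy."""
    relevance = [abs(pearson(col, labels)) for col in columns]
    selected, remaining = [], list(range(len(columns)))

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(abs(pearson(columns[f], columns[s]))
                         for s in selected) / len(selected)
        return relevance[f] - redundancy

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy columns (hypothetical): column 1 nearly duplicates column 0.
labels = [0, 0, 0, 1, 1, 1]
columns = [[1, 2, 3, 7, 8, 9],
           [0.9, 2.2, 3.1, 6.8, 8.1, 9.0],
           [3, 2, 1, 6, 5, 4]]
chosen = mrmr(columns, labels, 2)  # column 2 is preferred over the near-duplicate
```

Because every candidate is re-scored against all features already selected, the cost of each step grows with the number of features, which is consistent with MRMR being comparatively slow on high-dimensional data.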

Relief-F
Relief-F is a feature selection strategy that chooses instances randomly and updates the feature relevance weights based on their nearest neighbors.
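The weight update can be illustrated with the basic two-class Relief rule, which uses one nearest hit (same class) and one nearest miss (other class) per sampled instance. Relief-F extends this by averaging over several neighbors and handling more than two classes, so the sketch below is a simplified illustration and not the full Relief-F. The function name `relief`, the parameters `n_iter` and `seed`, and the range-scaled Manhattan distance are assumptions.

```python
import random


def relief(data, labels, n_iter=20, seed=0):
    """Basic Relief weights; each class must contain at least two instances."""
    rng = random.Random(seed)
    n, n_features = len(data), len(data[0])
    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for f in range(n_features):
        col = [row[f] for row in data]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(f, a, b):
        return abs(data[a][f] - data[b][f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(n)  # randomly chosen instance
        hits = [j for j in range(n) if j != i and labels[j] == labels[i]]
        misses = [j for j in range(n) if labels[j] != labels[i]]
        hit = min(hits, key=lambda j: distance(i, j))
        miss = min(misses, key=lambda j: distance(i, j))
        for f in range(n_features):
            # Reward features that differ on the miss and agree on the hit.
            weights[f] += (diff(f, i, miss) - diff(f, i, hit)) / n_iter
    return weights
```

Features with large positive weights separate an instance from its nearest miss more than from its nearest hit, so they are ranked as more relevant.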

Metrics Used for Performance Evaluation-Classifiers, Accuracy and Validation Methods
Two popular classifiers were used: the Bayes classifier and the C4.5 classifier proposed by Quinlan (1993). C4.5 is among the most popular and efficient algorithms in the decision tree based approach, and both classification algorithms were frequently used by previous researchers. The accuracy metric was calculated using the following formula:

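Accuracy = (TP + TN) / (TP + FP + TN + FN)
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.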
In this study the authors used k-fold cross validation as well as leave-one-out cross validation for evaluating performance. The authors decided to use the colon tumor dataset for this study because several previous researchers used it and highlighted its complexity. This publicly available standard dataset contains 62 samples collected from colon tumor patients. Among them, 40 biopsies are from tumors (labeled as "negative") and 22 are from normal tissue (labeled as "positive"). Each sample is represented by 2000 genes, so the dataset contains 62×2000 continuous values and 62 class ids.
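The accuracy formula and the two validation schemes can be sketched as follows. This is a minimal illustration: a 1-nearest-neighbor rule is used as a stand-in classifier to keep the example short, whereas the study itself used Bayes and C4.5. All function names and the confusion counts in the usage line are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def k_fold_indices(n, k):
    """Split range(n) into k folds; k == n gives leave-one-out."""
    return [list(range(i, n, k)) for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training sample."""
    best = min(range(len(train_x)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)))
    return train_y[best]


def cross_validate(data, labels, k):
    """Fraction of samples classified correctly when each fold is held out once."""
    n = len(data)
    correct = 0
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        train_x = [data[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        for i in fold:
            if nearest_neighbor_predict(train_x, train_y, data[i]) == labels[i]:
                correct += 1
    return correct / n


# Hypothetical confusion counts for 62 samples, for illustration only.
example_accuracy = accuracy(tp=35, tn=18, fp=4, fn=5)
```

Calling `cross_validate(data, labels, k=len(data))` performs leave-one-out cross validation, while a smaller `k` such as 10 gives ordinary k-fold cross validation.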

RESULTS AND DISCUSSION
Table 1 shows the accuracy and error rate of classification by Bayes and J48 (C4.5) with respect to the first 50 features selected by the different feature selection algorithms. The metrics were calculated using Leave-One-Out (LOO) cross validation (Jeyachidra and Punithavalli, 2013).
Fig. 1 shows the accuracy of classification by Bayes and J48 (C4.5) when using the first 50 features selected by the four different feature selection algorithms. The performance of the proposed DWFS-CKN was better than that of the other three algorithms.
In Fig. 1, the set of bars at the far right of the chart belongs to the proposed DWFS-CKN method.
Table 2 shows the average accuracy, average error, maximum accuracy and minimum error achieved by the Bayes classifier and the J48 classifier. These values were calculated by repeating the 10-fold cross validation 25 times, with the data placed in a random order each time.
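The repeated validation procedure can be sketched as below. This is a hedged outline rather than the authors' code: the function name `repeated_k_fold`, the `count_correct` callback (which is assumed to train on the `train` indices and return the number of correctly classified `test` indices), and the `seed` parameter are all hypothetical.

```python
import random


def repeated_k_fold(count_correct, n, k=10, repeats=25, seed=0):
    """Repeat k-fold cross validation on freshly shuffled data and summarize."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)  # new random order of the samples for each repeat
        folds = [order[i::k] for i in range(k)]
        correct = 0
        for test in folds:
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            correct += count_correct(train, test)
        accuracies.append(correct / n)
    avg = sum(accuracies) / repeats
    return {
        "avg_accuracy": avg,
        "avg_error": 1.0 - avg,
        "max_accuracy": max(accuracies),
        "min_error": 1.0 - max(accuracies),
    }
```

Each repeat yields one accuracy value over all n samples, and the four summary values correspond to the columns reported for each classifier.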
Fig. 2 shows the average error over the 25 iterations of 10-fold cross validation. With respect to this average error, the proposed DWFS-CKN performed better than the three other algorithms.
Fig. 3 shows the average accuracy over the 25 iterations of 10-fold cross validation. With respect to this average accuracy, the proposed DWFS-CKN performed better than the three other algorithms.
Table 3 shows the time taken by the three different algorithms. In the case of MRMR, the time taken for selecting the primary features increases with the number of features. MRMR consumed more time, and its performance was poorer than that of the other compared algorithms (Jeyachidra and Punithavalli, 2012).

CONCLUSION
In this study the authors presented a simple, fast and effective feature selection algorithm called DWFS-CKN and compared its performance with three other classical feature selection algorithms using a complex microarray dataset. The proposed algorithm showed improved accuracy for the selected feature size and consumed less time, and its classification accuracy was better than that of the three existing algorithms.

Future Work
Based on the study of the performance, characteristics and accuracy of the feature selection algorithms, there is still scope to improve the performance of the proposed DWFS-CKN algorithm by using an appropriate distance calculation procedure to find more distinguishable features. The results of this study are left to future researchers to build upon.