Using Feature Selection as Accuracy Benchmarking in Clinical Data Mining

Automated prediction of a new patient's diagnosis, based on data mining of historical data, has proven to be an extremely useful tool in medical innovation, and several studies have focused on this aspect. The objective of this study is two-fold. First, we examine three classifiers, namely Naïve Bayes, Multilayer Perceptron (MLP) and Decision Tree J48, to predict diagnosis results. Second, we investigate the effect of feature selection in such experiments. We also compare the experimental results with the Comparative Disease Profile (CDP) study, which used the same dataset. The results show that Naïve Bayes provides the best accuracy in our experiments and in comparison with CDP. However, we suggest using the Multilayer Perceptron, since the variables used in our experiments are inter-dependent. In addition, MLP showed better accuracy than CDP.


INTRODUCTION
Data mining is the process of finding previously unknown patterns and trends in databases and using that information to build predictive models. Data mining in medical science is critical and more sensitive than in other domains because of the complexity of the field. At the same time, data mining can play a vital role in medical science when it is used for prediction and decision making. The healthcare industry today generates large amounts of complex data about patients, hospital resources, disease diagnosis, electronic patient records and medical devices. This data is a key resource to be processed and analyzed for knowledge extraction that supports cost savings and decision making (Bhatla and Jyoti, 2012). Data mining provides a set of tools and techniques that can be applied to the processed data to discover hidden patterns, giving healthcare professionals an additional source of knowledge for making decisions.
As highlighted by Wei and Altman (2004), historical clinical data is a critical source of information to support the diagnosis of a patient's disease. They propose a Comparative Disease Profile (CDP), which is a set of distinguishing features derived from a historical medical dataset. Once established, CDP is claimed to help manual decision making by providing useful diagnosis guidelines. Motivated by their work, this study focuses on a classification approach for the diagnosis of cardiac patients. The datasets were sourced from the Cleveland Heart Disease Datasets of the UCI Repository of Machine Learning databases and domain theory, available for download at: http://archive.ics.uci.edu/ml/datasets/Heart+Disease.
The remainder of this study proceeds as follows. The second part describes the methods and techniques used, and the following sections discuss the experiments and the results respectively.

MATERIALS AND METHODS
In this study, three classification algorithms are chosen for accuracy benchmarking on clinical data: Naïve Bayes, Multilayer Perceptron (MLP) and Decision Tree J48.

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable (Han et al., 2011).
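The independence assumption described above can be illustrated with a minimal sketch. This is not Weka's implementation: the function names, the Laplace smoothing and the six toy records (a chest pain category and an exercise-induced angina flag) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values

def predict(model, row):
    """Pick the class with the highest log-posterior, treating features as independent."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Laplace smoothing (an assumption here) avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy records: (chest pain type, exercise-induced angina).
rows = [("typical", "yes"), ("typical", "yes"), ("atypical", "no"),
        ("none", "no"), ("none", "no"), ("typical", "no")]
labels = ["sick", "sick", "not-sick", "not-sick", "not-sick", "sick"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("typical", "yes")))
print(predict(model, ("none", "no")))
```

Each feature contributes its own factor to the posterior, so any dependence between features is ignored by construction.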
A Multilayer Perceptron (MLP) is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. MLP uses a supervised learning technique called backpropagation to train the network. MLP is a modification of the standard linear perceptron and can distinguish data that is not linearly separable.
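As a small illustration of why a hidden layer matters, the sketch below runs the forward pass of a 2-2-1 network on the XOR problem, which is not linearly separable. The weights are hand-picked for the example, not learned by backpropagation, and all names are hypothetical.

```python
import math

def sigmoid(z):
    """Nonlinear activation applied by every non-input neuron."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Fully connected layer: each output neuron sees every input."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked weights (an assumption for illustration, not trained values).
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]   # neuron 1 ~ OR, neuron 2 ~ NAND
HIDDEN_B = [-10.0, 30.0]
OUTPUT_W = [[20.0, 20.0]]                   # output ~ AND of the hidden neurons
OUTPUT_B = [-30.0]

def forward(x):
    """Propagate the input through the hidden layer and then the output layer."""
    hidden = layer(x, HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(x)))
```

A single linear perceptron cannot produce this mapping; the hidden layer with its nonlinear activations is what makes it possible.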
J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool. C4.5, developed by Quinlan (1993), is an algorithm for generating a decision tree and is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.
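C4.5 is commonly described as choosing splits by gain ratio, that is, information gain normalized by the entropy of the split itself. The sketch below computes that quantity for categorical attributes only; it is a simplification (C4.5 also handles numeric thresholds, missing values and pruning), and the function names and toy columns are hypothetical.

```python
from collections import Counter
import math

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split entropy."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy after the split, weighted by branch size.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy columns.
labels = ["sick", "sick", "not-sick", "not-sick"]
exang = ["yes", "yes", "no", "no"]   # separates the classes perfectly
sex = ["m", "f", "m", "f"]           # carries no class information here

print(gain_ratio(exang, labels))
print(gain_ratio(sex, labels))
```

An attribute that separates the classes well scores close to 1, while an uninformative one scores close to 0, so the tree would split on the former first.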
To investigate classifier performance further, this study also applies feature selection through the Weka filtering method called the attribute select classifier, which reduces the dimensionality of the data. This limits the number of attributes by choosing those that are more likely to affect the target class label. In principle, however, there is no guarantee that feature selection will yield a better result than the full attribute set.
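The general idea of keeping only the attributes most related to the class can be sketched with a simple ranking filter. The example ranks numeric attributes by the absolute Pearson correlation with the class and keeps the top k. This is a stand-in chosen for brevity and is not claimed to be the evaluator or search method used by Weka's attribute select classifier in this study; the function names and the toy values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0

def select_top_k(columns, target, k):
    """Rank attributes by |correlation| with the class and keep the k strongest."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy values for three attributes and a binary class (1 = sick).
columns = {
    "oldpeak": [0.0, 0.2, 1.5, 2.3, 3.1, 0.1],
    "chol": [230, 250, 240, 235, 245, 238],
    "exang": [0, 0, 1, 1, 1, 0],
}
target = [0, 0, 1, 1, 1, 0]

print(select_top_k(columns, target, 2))
```

A univariate filter like this scores each attribute in isolation, so it can discard attributes that are useful only in combination, which is one reason a reduced set is not guaranteed to outperform the full one.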
To measure performance, this study uses the area under the Receiver Operating Characteristics (ROC) curve to compare the accuracy of the different classifiers. A ROC graph organizes classifiers and helps visualize their performance. ROC graphs are commonly used in medical decision making and in recent years have been used increasingly in machine learning and data mining research (Robin et al., 2011). A ROC graph is a two-dimensional graph in which the true positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis. The classifier nearest to the perfect point (0, 1), the top left corner of the graph, shows the best accuracy.
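The ROC area can be computed by sweeping a decision threshold over the classifier scores and integrating the resulting curve with the trapezoidal rule. The sketch below does this for a hypothetical set of six scores; the function names are illustrative, and the code assumes both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, threshold high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances sharing a score are processed together so ties yield one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores for the positive ('sick') class and the true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

print(roc_area(roc_points(scores, labels)))
```

In this toy case eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9; a classifier whose curve passes through (0, 1) would reach an area of 1.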

Experiments
In this study, we set up a series of classification experiments focusing on three algorithms in the Weka 3.7.4 data mining tool (Hall et al., 2009): Naïve Bayes, Multilayer Perceptron (MLP) and Decision Tree J48. The task is to predict the diagnosis of cardiac patients based on the symptoms and information given in the Cleveland Heart Disease Dataset.
This dataset contains 13 attributes and one class variable, "Label", which distinguishes between 'sick' and 'not-sick'. The 13 attributes are all numeric: age, sex, chest pain type (Cp), resting blood pressure (Trestbps), serum cholesterol (Chol), fasting blood sugar (Fbs), resting electrocardiographic results (Restecg), maximum heart rate achieved (Thalach), occurrence of exercise induced angina (Exang), ST depression induced by exercise relative to rest (Oldpeak), slope of the peak exercise ST segment (Slope), number of major vessels colored by fluoroscopy (Ca) and Thal. Table 1 shows the attributes and descriptions of the Cleveland data.
The experiments were carried out in two stages. The first stage measured the benchmark performance of the Naïve Bayes, Multilayer Perceptron and J48 classifiers, and the ROC areas were observed and recorded. In the second stage, feature selection was added before the classification task, and the ROC areas were again observed and compared. Finally, the results from the second stage were compared with the findings of the Comparative Disease Profile (CDP) (Wei and Altman, 2004).

RESULTS
The experimental results are reported in two parts, before and after feature selection is applied.

Benchmark Results
The average ROC area was 88.8% for Naïve Bayes, 82.4% for Multilayer Perceptron and 78.7% for J48. Figures 1-3 show the benchmark results before feature selection.

After Feature Selection
The attribute select classifier was then applied to find the best attributes, with the expectation of a better ROC area. The feature selection stage returned seven best attributes: Cp, Restecg, Thalach, Exang, Oldpeak, Ca and Thal.
The same classifiers were applied to these seven selected attributes and the ROC areas were observed again. Interestingly, Naïve Bayes shows a slightly lower ROC area than with the 13 attributes. On the other hand, the average ROC area increased for Multilayer Perceptron (from 82.4% to 86%) and for J48 (from 78.7% to 79.1%). Figures 4-6 show the results.
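The size of each change can be read off the reported figures with simple arithmetic. The sketch below uses only the ROC areas stated in this study for Multilayer Perceptron and J48; Naïve Bayes is left out because its value after feature selection is not stated numerically in the text. The variable and function names are hypothetical.

```python
# ROC areas (%) reported in this study: (before, after) feature selection.
reported = {
    "Multilayer Perceptron": (82.4, 86.0),
    "J48": (78.7, 79.1),
}

def change_in_points(before, after):
    """Difference in percentage points, rounded to one decimal place."""
    return round(after - before, 1)

changes = {name: change_in_points(before, after)
           for name, (before, after) in reported.items()}

for name, delta in changes.items():
    print(f"{name}: +{delta} percentage points")
```

The gain for Multilayer Perceptron (3.6 points) is considerably larger than the gain for J48 (0.4 points).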

ROC Area
The ROC areas below present the comparative performance of the three classifiers. Without the Weka attribute select classifier, the results clearly favor Naïve Bayes. With it, the comparison shifts toward the other two classifiers, which move closer to Naïve Bayes, while Naïve Bayes itself deteriorates slightly. Figures 7 and 8 show the comparison of ROC areas between the two experiments.
Even with the improvement in accuracy of the Multilayer Perceptron and J48 algorithms and the deterioration in Naïve Bayes accuracy, Naïve Bayes remains the best classifier in terms of accuracy.

DISCUSSION
Although Naïve Bayes shows better results in our experiment, we suggest using the Multilayer Perceptron, because the Naïve Bayes algorithm assumes independence among variables, whereas in real-life situations the variables are inter-dependent. We also suggest using the Multilayer Perceptron classification algorithm together with the attribute select classifier filtering method in Weka, which raised accuracy (measured as ROC area) from 82.4% to 86%.
Next, this study compares its findings with the CDP accuracy (Wei and Altman, 2004). The CDP study reports an accuracy of 82.2%, which is better than the performance of our J48 classifier. However, our proposed MLP outperforms the CDP after the attribute select classifier is applied.

CONCLUSION
In this study, we used three different classification algorithms in a data mining tool, Weka (Hall et al., 2009), on the standard Cleveland heart data sets and compared the accuracy of each method. We also compared the results of our experiments with the CDP system developed by Wei and Altman (2004). Naïve Bayes shows the best accuracy in our experiments and in comparison with CDP. However, we suggest using the Multilayer Perceptron, since the variables used in our experiments are inter-dependent. In addition, MLP showed better accuracy than CDP. In future work, we hope to investigate attributes from other medical datasets.

ACKNOWLEDGMENT
The author gratefully acknowledges the financial support (Prototype Research Grant Scheme, PRGS) received from the Ministry of Higher Education (MoHE),