A New Approach Using Data Envelopment Analysis for Ranking Classification Algorithms

Problem statement: A variety of methods and algorithms for classification problems have been developed recently, but the main question is how to select an appropriate and effective classification algorithm; this has always been an important and difficult issue. Approach: Since the classification algorithm selection task needs to examine more than one criterion, such as accuracy and computational time, it can be modeled, and the algorithms ranked, by the Data Envelopment Analysis (DEA) technique. Results: In this study, 44 standard databases were classified by 7 well-known classification algorithms and the algorithms were examined by the proposed evaluation method. Conclusion/Recommendation: The results indicate that Data Envelopment Analysis (DEA) is an appropriate tool for evaluating classification algorithms.


INTRODUCTION
Classification is an extensive and important issue in various fields, including statistics, artificial intelligence, operations research, data mining and knowledge discovery. Depending on the number of classes, classification is divided into two groups: binary and multiclass. Due to the increasing use of classification in real systems, such as network intrusion detection, credit analysis, website classification and disease diagnosis, many algorithms and methods have been presented to classify databases with binary and especially multiple classes (Peng et al., 2008). Allwein et al. (2001) used margin-based binary learning algorithms and presented a framework for multiclass classification. Crammer and Singer (2001) explained the algorithmic implementation of multiclass kernel-based vector machines and compared it with prior work. A mathematical programming approach was presented by Loucopoulos (2001) for minimizing misclassification costs. Rennie and Rifkin (2001) compared Naïve Bayes and the support vector machine for text classification. Har-Peled et al. (2002) introduced a constraint classification method and a meta-algorithm for multiclass classification. Kou et al. (2009) presented a multiple-criteria mathematical programming approach for large-scale data classification. A least squares support vector machine classifier was presented by Yu et al. (2009) for risk analysis.
To date a variety of classification algorithms, especially for multiclass data classification, have been presented, and choosing an effective classifier is an important and difficult issue. The algorithm selection problem is in fact a central issue in many fields, including artificial intelligence, operations research and machine learning, and remains an active research area (Smith-Miles, 2008). The evaluation of classification algorithms usually involves more than one criterion, such as accuracy, misclassification rate and computational time (Dietterich, 1998); therefore, algorithm selection can be modeled by Data Envelopment Analysis (DEA).
The DEA decision-making technique was proposed by Charnes et al. (1978). It is founded on the relative efficiency of each decision making unit in comparison with its peers and has many applications in evaluating and ranking homogeneous units (independent units with the same inputs and outputs); see, for example, (Cook and Bala, 2007; Avkiran, 2006; Lin et al., 2009).
One advantage of using DEA is that, when inputs and outputs do not share the same scales (in this study, accuracy and time), DEA can still compute efficiencies and rank units easily. Another advantage is that the desirability of newly presented algorithms can be assessed.
The goal of this study is to use DEA to rank classification algorithms and thereby choose the best one.
The results of the experiments in this study indicate that the presented approach is capable of ranking classification algorithms in various fields.
Background: Here we give a brief review of the selected classification algorithms and of Data Envelopment Analysis (DEA).
Logistic linear regression (Le Cessie and Van Houwelingen, 1992) models the probability of occurrence of an event by a linear function of a set of predictor variables. SMO (Premachandra et al., 2011) is an algorithm for solving the optimization problem that arises in training support vector machines. Naïve Bayes (Domingos and Pazzani, 1997) models the probabilistic relation between the predictor variables and the class variable. The classification and regression tree (Breiman et al., 1984) is a greedy algorithm for learning multivariate decision trees, which can model both discrete and continuous variables. Random forest (Breiman, 2001) is an ensemble classifier composed of several decision trees whose output is the mode of the classes output by the individual trees. Bagging (Breiman, 1996) is an ensemble meta-algorithm for improving classification accuracy. C4.5 (Quinlan, 1993) is a decision tree algorithm that builds decision trees by a divide-and-conquer method in a recursive, top-down way.
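As a rough illustration, the seven classifiers compared in the study (which used their Weka implementations) have approximate counterparts in scikit-learn; the classes chosen and all hyperparameters below are our own illustrative assumptions, not the study's settings:

```python
# Approximate scikit-learn stand-ins for the seven classifiers; every
# hyperparameter here is an illustrative assumption, not a study setting.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

classifiers = {
    "Logistic":     LogisticRegression(max_iter=1000),
    "SMO":          SVC(kernel="linear"),     # SMO trains an SVM of this kind
    "NaiveBayes":   GaussianNB(),
    "Cart":         DecisionTreeClassifier(criterion="gini"),
    "C4.5":         DecisionTreeClassifier(criterion="entropy"),  # entropy splits, as in C4.5
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "Bagging":      BaggingClassifier(),
}
```

Each entry can be fit and scored on a database with the usual `fit`/`score` interface, which is how the per-database accuracies of Table 1 would be reproduced in this setting.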
DEA method: Data Envelopment Analysis in its CCR form is a methodology based on an interesting application of linear programming, which measures the relative performance of Decision Making Units (DMUs) with different inputs and outputs. The basic efficiency measure used in DEA is the ratio of total outputs to total inputs (Charnes et al., 1978).
The ratio form of DEA for the m-th DMU can be expressed as:

E_m = max ( Σ_{j=1..s} u_j y_jm ) / ( Σ_{i=1..r} v_i x_im )

subject to:

( Σ_{j=1..s} u_j y_jk ) / ( Σ_{i=1..r} v_i x_ik ) ≤ 1,  k = 1, …, n
u_j ≥ 0, v_i ≥ 0

where x and y represent inputs and outputs, respectively: y_jm is the amount of the j-th output and x_im the amount of the i-th input of the m-th DMU, u_j and v_i are the corresponding output and input weights, r and s are the numbers of inputs and outputs and n is the number of DMUs.
The point is to maximize the amount of Efficiency (E), and the problem can be approached in two ways:

• Output-oriented: produce more output with the same amount of inputs
• Input-oriented: produce the same amount of output with fewer inputs

By setting the denominator of the ratio form equal to 1, we arrive at the following CCR multiplier model:

max Z = Σ_{j=1..s} u_j y_jm

subject to:

Σ_{i=1..r} v_i x_im = 1
Σ_{j=1..s} u_j y_jk − Σ_{i=1..r} v_i x_ik ≤ 0,  k = 1, …, n
u_j ≥ 0, v_i ≥ 0

where Z, the efficiency score, lies between 0 and 1; if the efficiency of the examined unit equals 1, the unit is efficient, otherwise it is considered inefficient. When more than one DMU attains an efficiency of 1, the Andersen-Petersen model is used to rank the DMUs: by omitting the constraint corresponding to the DMU under evaluation, its efficiency is allowed to exceed 1 (Andersen and Petersen, 1993). In this model, a decision making unit that reaches a greater efficiency score has a higher level of performance among the efficient units.
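The CCR multiplier model is an ordinary linear program, so it can be sketched directly with SciPy; the function name `ccr_efficiency` and the toy data are our own illustrative assumptions, not values from the paper. Passing `exclude_self=True` drops the constraint of the evaluated DMU, giving the Andersen-Petersen super-efficiency score.

```python
# Sketch: input-oriented CCR multiplier model solved as a linear program.
# X is an (n, r) matrix of inputs, Y an (n, s) matrix of outputs; the toy
# numbers at the bottom are illustrative, not data from the study.
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, m, exclude_self=False):
    """CCR efficiency of DMU m; with exclude_self=True the constraint of
    DMU m itself is dropped (Andersen-Petersen super-efficiency, may exceed 1)."""
    n, r = X.shape
    s = Y.shape[1]
    # decision vector z = [u_1..u_s, v_1..v_r]; linprog minimizes, so negate
    c = np.concatenate([-Y[m], np.zeros(r)])
    A_eq = np.concatenate([np.zeros(s), X[m]]).reshape(1, -1)  # sum_i v_i x_im = 1
    b_eq = [1.0]
    keep = [k for k in range(n) if not (exclude_self and k == m)]
    A_ub = np.hstack([Y[keep], -X[keep]])  # sum_j u_j y_jk - sum_i v_i x_ik <= 0
    b_ub = np.zeros(len(keep))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (s + r))
    return -res.fun

# three DMUs with one input and one output each
X = np.array([[2.0], [4.0], [5.0]])
Y = np.array([[2.0], [3.0], [6.0]])
effs = [ccr_efficiency(X, Y, m) for m in range(3)]        # only DMU 2 is efficient
super_eff = ccr_efficiency(X, Y, 2, exclude_self=True)    # AP score, above 1
```

In the single-input, single-output case above, the CCR efficiency reduces to the ratio y_m/x_m normalized by the best such ratio, which makes the LP results easy to verify by hand.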

Experiments:
The experiments carried out on the selected databases are explained here.

Performance measures:
There are an extensive number of performance measures for classification. Commonly used performance measures in software defect classification are accuracy, precision, recall, F-measure, the Area Under the receiver operating characteristic Curve (AUC) and mean absolute error (Challagulla et al., 2005; Elish and Elish, 2008; Lessmann et al., 2008); the databases examined here are drawn from the UC Irvine Machine Learning Repository. Besides these popular measures, this study includes seven other classification measures. The following paragraphs briefly describe these measures.
Overall accuracy: Accuracy is the percentage of correctly classified modules (Mair et al., 2000). It is one of the most widely used classification performance metrics (Eq. 3):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F-measure: This is the harmonic mean of precision and recall. The F-measure has been widely used in information retrieval (Han and Kamber, 2000) (Eq. 10):

F-measure = (2 × Precision × Recall) / (Precision + Recall)

AUC: ROC stands for receiver operating characteristic, a curve that shows the tradeoff between the TP rate and the FP rate. AUC, the area under this curve, represents the accuracy of a classifier: the larger the area, the better the classifier.
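A minimal sketch of these measures computed from binary confusion-matrix counts; the counts passed in below are invented for illustration and the function name is ours:

```python
# Compute the measures above from binary confusion-matrix counts
# (TP, FP, TN, FN).  The counts used below are invented, not study results.
def binary_measures(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)                # Eq. 3
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)                                 # the TP rate
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. 10
    return accuracy, precision, recall, f_measure

acc, prec, rec, f1 = binary_measures(tp=40, fp=10, tn=45, fn=5)
```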

Kappa Statistic (KapS):
This is a classifier performance measure that estimates the similarity between the members of an ensemble in multiclassifier systems (Baeza-Yates and Ribeiro-Neto, 2011) (Eq. 11):

KapS = (P(A) − P(E)) / (1 − P(E))

where P(A) is the accuracy of the classifier and P(E) is the probability that the agreement among classifiers is due to chance (Eq. 12). Here m is the number of modules and c is the number of classes, f(i, j) is the actual probability of module i belonging to class j, and Σ_{i=1..m} f(i, j) is the number of modules of class j.
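As a small illustration, Eq. 11 can be computed from a confusion matrix of actual versus predicted classes, with P(E) estimated from the row and column totals; the counts below are invented:

```python
# Illustrative Kappa statistic from a confusion matrix (rows = actual
# class, columns = predicted class); the counts are made up, not study data.
def kappa(confusion):
    m = sum(sum(row) for row in confusion)  # total number of modules
    # P(A): observed agreement, i.e. the accuracy (diagonal mass)
    p_a = sum(confusion[i][i] for i in range(len(confusion))) / m
    # P(E): chance agreement from actual (row) and predicted (column) totals
    p_e = sum(
        (sum(confusion[i]) / m) * (sum(row[i] for row in confusion) / m)
        for i in range(len(confusion))
    )
    return (p_a - p_e) / (1 - p_e)  # Eq. 11

k = kappa([[45, 5], [10, 40]])
```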
Given a threshold θ, C_θ(i, j) is 1 if j is the predicted class for module i obtained from P(i, j); otherwise it is 0.

Proposed approach: Experiments are done based on the following procedure. Input: the 44 chosen databases. Output: a ranking of the classifiers.
Step 1: After preparing the databases using Weka, we analyze the data and compute, for each of the 7 well-known classification algorithms, the accuracy and the model construction time on every database (Table 1).

Step 2: Since we want to obtain the efficiency of every classification algorithm using the output-oriented DEA CCR model, each algorithm is considered a Decision Making Unit (DMU), and the accuracy achieved by the algorithm on a database is taken as its output. In some analyses the classification accuracy is the only factor that matters in selecting the best algorithm, while in others the algorithm's running time is important in addition to accuracy; therefore, two methods are presented to determine the efficiency, each of which yields a ranking of the algorithms.
Step 3: Ranking algorithms by considering the accuracy parameter: to compute the efficiency of every DMU, we take 1 as the input of each DMU and the accuracy of the algorithm as its output. This analysis is done with the DEAP software; if the efficiency of several algorithms equals 1, their super-efficiency is found with the LINGO software using the Andersen-Petersen DEA model.
Step 4: Ranking algorithms by considering both the accuracy and running time parameters: to compute the efficiency of every DMU in this case, it is enough to take the learning time as the input (time is a cost attribute) and the classification accuracy as the output (accuracy is a benefit attribute) of the algorithm, and proceed as in the previous step.
Step 5: Considering the obtained efficiencies, we rank the algorithms.
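For the single-input, single-output settings of Steps 3 and 4, the CCR efficiency reduces to a normalized output/input ratio, so the procedure can be sketched without an LP solver. The accuracies and times below are invented for illustration; the study obtained such values from Weka runs on the 44 databases:

```python
# Sketch of Steps 3-5 in the single-input, single-output case, where CCR
# efficiency is the output/input ratio normalized by the best ratio.
# All accuracy and time values here are invented, not the study's data.
def ccr_ratio_efficiencies(inputs, outputs):
    ratios = [o / i for i, o in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]

algorithms = ["Logistic", "SMO", "NaiveBayes", "C4.5"]
accuracy   = [0.94, 0.92, 0.88, 0.90]   # output (benefit attribute)
time       = [0.80, 0.50, 0.10, 0.30]   # input  (cost attribute), seconds

# Step 3: accuracy only, with the input fixed to 1 for every DMU
rank_acc = sorted(zip(ccr_ratio_efficiencies([1] * 4, accuracy), algorithms),
                  reverse=True)

# Step 4: learning time as input, accuracy as output
rank_time = sorted(zip(ccr_ratio_efficiencies(time, accuracy), algorithms),
                   reverse=True)
```

With these made-up numbers the two rankings disagree in the same way the paper's tables do: the most accurate algorithm tops the accuracy-only ranking, while a fast algorithm with slightly lower accuracy tops the time-aware one.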

RESULTS AND DISCUSSION
According to the analysis, two tables corresponding to Steps 3 and 4 are obtained. Table 2 represents the efficiency of every algorithm when only the accuracy parameter is considered.
Since the CCR-model efficiency of all algorithms equals 1, all algorithms are desirable and the Andersen-Petersen model is used to rank them. According to the efficiency scores obtained by the Andersen-Petersen model, the final ranking of the algorithms is:

Logistic > Random Forest > SMO > C4.5 > Naïve Bayes > Cart > Bagging

Table 3 represents the efficiency of every algorithm when both the accuracy and running time parameters are considered. Again the CCR-model efficiency of all algorithms equals 1, so all algorithms are desirable and the Andersen-Petersen model is used to rank them. According to the efficiency scores obtained by the Andersen-Petersen model, the final ranking of the algorithms is:

Naïve Bayes > C4.5 > SMO > Random Forest > Logistic > Cart > Bagging

CONCLUSION
A variety of algorithms for solving classification problems have been suggested recently, and selecting the best algorithm is considered an important issue. This study proposed using the DEA technique, taking the two criteria of time and accuracy into account, to choose the best algorithm; two rankings were suggested, which help considerably in choosing the best algorithm.
In future work, it is suggested that the experiments be repeated on more databases. Other decision-making techniques, such as TOPSIS and VIKOR, can also be used for ranking and their results compared; other parameters (for instance, the number of samples or attributes) can be considered; and, depending on the input and output indices, their relative importance can be incorporated.