A New Smooth Support Vector Machine and Its Applications in Diabetes Disease Diagnosis

Problem statement: Research on the Smooth Support Vector Machine (SSVM) is an active field in data mining, and many researchers have developed the method to improve the accuracy of its results. This study proposed a new SSVM for classification problems, called the Multiple Knot Spline SSVM (MKS-SSVM). To evaluate the effectiveness of the method, we carried out an experiment on the Pima Indian diabetes dataset. The accuracy previously reported for this dataset has so far remained under 80%. Approach: First, the theory of MKS-SSVM was presented. Then, the application of MKS-SSVM to diabetes disease diagnosis and a comparison with SSVM were given. Results: Compared to the SSVM, the proposed MKS-SSVM showed better performance in diabetes disease diagnosis, with an accuracy of 93.2%. Conclusion: The results of this study showed that the MKS-SSVM was effective in diagnosing diabetes disease, which is very promising compared to the previously reported results.


INTRODUCTION
The Support Vector Machine (SVM) is a data mining algorithm that has recently gained popularity in the machine learning and statistics communities. SVM was introduced by Vapnik [1] for solving pattern recognition and nonlinear function estimation problems and has become the tool of choice for the fundamental classification problem of machine learning and data mining. Unlike traditional methods, which minimize the empirical training error, SVM aims at minimizing an upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data. This can be regarded as an approximate implementation of the structural risk minimization principle [1,2].
Although many variants of SVM have been proposed, improving them for more effective classification is still an active research issue. SSVM is a development of SVM that uses a smoothing technique. This method was first introduced by Lee [3] in 2001. The basic idea of SSVM is to convert the SVM primal formulation to a nonsmooth unconstrained minimization problem. Since the objective function of this unconstrained optimization problem is not twice differentiable, a smoothing function is applied to smooth it. Lee [3] proposed the integral of the sigmoid function to approximate the plus function. Later, Yuan proposed a polynomial function [4] and a spline function [5].
In this study, we propose a new smooth function to approximate the plus function. It is called the Multiple Knot Spline function and is a modification of the spline function [5]. We then apply the new SSVM, based on the multiple knot spline function, to diagnose diabetes disease.
The data source is the Pima Indian diabetes dataset taken from the UCI machine learning repository [6]. This dataset is commonly used by researchers who apply machine learning methods to diabetes disease classification. The results were also compared with the results of the previously reported studies [8][9][10][11].

MATERIALS AND METHODS
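In this study, we used SSVM and MKS-SSVM as the methods. They are explained as follows.

SSVM: In this section, we outline the reformulation of the standard SVM [1,2] into the smooth SVM [3]. We begin with the linear case, which can be converted to an unconstrained optimization problem. We consider the problem of classifying m points in the n-dimensional real space $R^n$, represented by the m×n matrix A, according to the membership of each point $A_i$ in the class 1 or -1, as specified by a given m×m diagonal matrix D with ones or minus ones along its diagonal. For this problem the standard SVM is given by the following quadratic program:

$$\min_{(w,\gamma,y)} \; \nu e^{\top} y + \tfrac{1}{2} w^{\top} w \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 \tag{1}$$

Here $\nu > 0$ weights the slack vector y, e is a vector of ones and w is the normal to the bounding planes:

$$x^{\top} w - \gamma = +1, \qquad x^{\top} w - \gamma = -1 \tag{2}$$

and γ determines their location relative to the origin. The linear separating surface is the plane:

$$x^{\top} w = \gamma \tag{3}$$

If the classes are linearly inseparable, the bounding planes are as follows:

$$\begin{aligned} x^{\top} w - \gamma + y_i &\ge +1, \quad \text{for } x^{\top} = A_i \text{ and } D_{ii} = +1, \\ x^{\top} w - \gamma - y_i &\le -1, \quad \text{for } x^{\top} = A_i \text{ and } D_{ii} = -1 \end{aligned} \tag{4}$$

These constraints (4) can be written as a single matrix inequality as follows:

$$D(Aw - e\gamma) + y \ge e \tag{5}$$

In the SSVM approach [3], the modified SVM problem is obtained as follows:

$$\min_{(w,\gamma,y)} \; \frac{\nu}{2} y^{\top} y + \frac{1}{2}\left(w^{\top} w + \gamma^2\right) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 \tag{6}$$

The constraint in Eq. 6 can be written as:

$$y = \left(e - D(Aw - e\gamma)\right)_+ \tag{7}$$

Thus, we can replace y in (6) by (7) and convert the SVM problem (6) into an equivalent SVM, which is an unconstrained optimization problem:

$$\min_{(w,\gamma)} \; \frac{\nu}{2}\left\| \left(e - D(Aw - e\gamma)\right)_+ \right\|_2^2 + \frac{1}{2}\left(w^{\top} w + \gamma^2\right) \tag{8}$$

The plus function $(x)_+$ is defined componentwise as:

$$(x)_+ = \max\{0, x\} \tag{9}$$

The objective function in (8) is not twice differentiable. Therefore, it cannot be solved with conventional Newton-type optimization methods, which require both the gradient and the Hessian matrix of the objective function.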
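The unconstrained objective described above can be evaluated directly once the slack is expressed through the plus function. The following is a minimal sketch, assuming the formulation of [3] with a weight `nu` on the squared slack; the function names and the toy data are hypothetical and are not taken from the study.

```python
def plus(x):
    """Plus function: (x)_+ = max(0, x)."""
    return max(0.0, x)


def ssvm_objective(A, d, w, gamma, nu):
    """(nu/2) * ||(e - D(Aw - e*gamma))_+||^2 + (1/2) * (w'w + gamma^2).

    A is a list of rows, d holds the diagonal of D (labels +1 or -1).
    The weight name `nu` follows [3] and is an assumption here.
    """
    slack_sq = 0.0
    for row, label in zip(A, d):
        margin = label * (sum(a * wi for a, wi in zip(row, w)) - gamma)
        slack_sq += plus(1.0 - margin) ** 2
    return 0.5 * nu * slack_sq + 0.5 * (sum(wi * wi for wi in w) + gamma ** 2)


# Toy data, made up for illustration: two points per class in R^2.
A = [[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]]
d = [1, 1, -1, -1]

print(ssvm_objective(A, d, w=[0.5, 0.5], gamma=0.0, nu=1.0))
print(ssvm_objective(A, d, w=[0.0, 0.0], gamma=0.0, nu=1.0))
```

With `w = [0.5, 0.5]` every toy point lies outside its bounding plane, so only the regularization term contributes; with `w = [0, 0]` every point contributes a unit slack.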
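Lee et al. [3] apply a smoothing technique and replace $x_+$ by the integral of the sigmoid function:

$$p(x,\alpha) = x + \frac{1}{\alpha}\log\left(1 + \exp(-\alpha x)\right), \quad \alpha > 0 \tag{10}$$

This p function with smoothing parameter α is used here to replace the plus function of (8), which yields the SSVM:

$$\min_{(w,\gamma)} \; \frac{\nu}{2}\left\| p\left(e - D(Aw - e\gamma), \alpha\right) \right\|_2^2 + \frac{1}{2}\left(w^{\top} w + \gamma^2\right) \tag{11}$$

where ν is the weight parameter of problem (6). The solution of problem (6) is obtained by solving problem (11) with α approaching infinity. Problem (11) can be solved using a Newton-Armijo algorithm [3].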
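To make the Newton-Armijo idea concrete, the sketch below minimizes the smoothed linear objective with a Newton direction and a step-halving Armijo line search. It is an illustration under stated assumptions, not the implementation used in this study: the names `smooth_plus` and `ssvm_newton_armijo`, the default values of `nu`, `alpha` and `delta`, and the toy data are all hypothetical.

```python
import numpy as np


def smooth_plus(x, alpha):
    """p(x, alpha) = x + log(1 + exp(-alpha*x)) / alpha, evaluated stably."""
    return np.logaddexp(0.0, alpha * x) / alpha


def ssvm_newton_armijo(A, d, nu=1.0, alpha=5.0, tol=1e-8, max_iter=50, delta=0.25):
    """Minimize (nu/2)*||p(e - D(Aw - e*gamma), alpha)||^2 + (1/2)*(w'w + gamma^2)."""
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])   # E @ z = A w - e*gamma for z = (w, gamma)
    DE = d[:, None] * E                    # rows scaled by the labels on diag(D)
    z = np.zeros(n + 1)

    def objective(v):
        p = smooth_plus(1.0 - DE @ v, alpha)
        return 0.5 * nu * (p @ p) + 0.5 * (v @ v)

    for _ in range(max_iter):
        r = 1.0 - DE @ z
        p = smooth_plus(r, alpha)
        s = 0.5 * (1.0 + np.tanh(0.5 * alpha * r))   # sigmoid(alpha*r) = dp/dr
        grad = -nu * DE.T @ (p * s) + z
        if np.linalg.norm(grad) < tol:
            break
        h = s * s + alpha * p * s * (1.0 - s)        # derivative of p*s w.r.t. r
        hess = nu * (DE.T * h) @ DE + np.eye(n + 1)
        step = np.linalg.solve(hess, -grad)          # Newton direction
        # Armijo rule: halve the step until sufficient decrease is reached.
        lam, f0, slope = 1.0, objective(z), grad @ step
        while objective(z + lam * step) > f0 + delta * lam * slope and lam > 1e-10:
            lam *= 0.5
        z = z + lam * step
    return z[:n], z[n]


# Toy data, made up for illustration.
A = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w, gamma = ssvm_newton_armijo(A, d)
print(w, gamma, np.sign(A @ w - gamma))
```

Because the smoothed objective is strongly convex and twice differentiable, the Hessian is positive definite and the Newton direction is always a descent direction, which is what the line search relies on.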
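For a nonlinearly inseparable problem, a kernel function K must be chosen to map the input space into another space. This model was derived from generalized support vector machines [7]. Problem (6) can then be approximated as follows:

$$\min_{(u,\gamma,y)} \; \frac{\nu}{2} y^{\top} y + \frac{1}{2}\left(u^{\top} u + \gamma^2\right) \quad \text{s.t.} \quad D\left(K(A, A^{\top}) D u - e\gamma\right) + y \ge e, \quad y \ge 0 \tag{12}$$

As before, the SSVM for the inseparable problem is obtained:

$$\min_{(u,\gamma)} \; \frac{\nu}{2}\left\| p\left(e - D\left(K(A, A^{\top}) D u - e\gamma\right), \alpha\right) \right\|_2^2 + \frac{1}{2}\left(u^{\top} u + \gamma^2\right) \tag{13}$$

where $K(A, A^{\top})$ is a kernel map from $R^{m \times n} \times R^{n \times m}$ to $R^{m \times m}$. The Newton-Armijo algorithm can also be applied directly to solve (13).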
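As an illustration of the kernel map, the sketch below builds an RBF kernel matrix and evaluates the smoothed nonlinear objective. It assumes the common Gaussian form exp(-mu * ||x - y||^2); the kernel parameter is named `mu` here only to avoid a clash with the offset γ, and the function names, parameter values and toy data are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, mu):
    """K[i, j] = exp(-mu * ||A_i - B_j||^2) (assumed Gaussian RBF form)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))   # clip tiny negative round-off


def nonlinear_objective(K, d, u, gamma, nu, alpha):
    """(nu/2)*||p(e - D(K D u - e*gamma), alpha)||^2 + (1/2)*(u'u + gamma^2)."""
    r = 1.0 - d * (K @ (d * u) - gamma)
    p = np.logaddexp(0.0, alpha * r) / alpha   # integral of the sigmoid function
    return 0.5 * nu * (p @ p) + 0.5 * (u @ u + gamma ** 2)


def predict(X_new, A, d, u, gamma, mu):
    """Classify new rows by the sign of K(x', A') D u - gamma."""
    return np.sign(rbf_kernel(X_new, A, mu) @ (d * u) - gamma)


# Toy data, made up for illustration.
A = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
K = rbf_kernel(A, A, mu=0.5)
print(K.shape, nonlinear_objective(K, d, np.full(4, 0.1), 0.0, nu=1.0, alpha=5.0))
```

The kernel matrix is m×m regardless of the input dimension n, so the number of unknowns in the nonlinear problem grows with the number of training points rather than with the number of features.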

Multiple Knot Spline-SSVM (MKS-SSVM):
The Smooth Support Vector Machine (SSVM) proposed by Lee et al. [3] is an important and significant result for SVM because many algorithms can be used to solve it. In SSVM, the smooth function in the objective function (13) is the integral of the sigmoid function (10). In this study, we propose a new smooth function called the Multiple Knot Spline (MKS) function. The formulation and performance analysis of the new smooth function, and how it is used to construct a new SSVM, are described as follows.

Multiple knot spline function: The new smooth function proposed in this study is the multiple knot spline function m(x,k), defined in Eq. 14. It is a modification of the third-order spline function introduced by Yuan [5].
Performance analysis of smooth function: Before discussing the performance analysis, we need to introduce the following lemma.

Lemma 1: Let p(x,k) be the integral of the sigmoid function (10) and let $x_+$ be the plus function. The proof can be seen in [3].
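As a worked illustration of how closely the integral of the sigmoid function tracks the plus function, the pointwise gap can be bounded directly from the closed form $p(x,k) = x + \frac{1}{k}\log(1 + \exp(-kx))$ used in [3]. This is an elementary bound offered as a sketch; it is not the statement of Lemma 1 or of the theorems of this study.

$$p(x,k) - x_+ = \begin{cases} \dfrac{1}{k}\log\left(1 + \exp(-kx)\right), & x \ge 0 \\[2mm] \dfrac{1}{k}\log\left(1 + \exp(kx)\right), & x < 0 \end{cases} \quad\Longrightarrow\quad 0 < p(x,k) - x_+ \le \frac{\log 2}{k}$$

The case x < 0 uses $x = \frac{1}{k}\log\exp(kx)$, so that $p(x,k) = \frac{1}{k}\log(\exp(kx) + 1)$. In both cases the argument of the logarithm lies in (1, 2], with the value 2 attained only at x = 0. For k = 10 the bound is about 0.0693, and it shrinks in proportion to 1/k.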
The proof is omitted.
According to the results of Lemma 1 and Theorem 1, the following performance results for the smooth functions (Theorem 2) are obtained. From Theorem 2, it is clear that m(x,k) approximates the plus function better than p(x,k). In order to show the difference more clearly, we present a comparison of the smooth functions in Fig. 1. The smoothing parameter is set at k = 10.
As can be seen from Fig. 1, the proposed multiple knot spline function is closer to the plus function than the integral of the sigmoid function, which indicates the superiority of the proposed smooth function. The classification procedure on a dataset can be described as follows:
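A comparison in the spirit of Fig. 1 can be reproduced numerically for the sigmoid-integral function at k = 10. The sketch below measures its largest gap to the plus function on a grid; the multiple knot spline function is not included because its definition is not reproduced here, and the function names and the grid are illustrative assumptions.

```python
import math


def plus(x):
    """Plus function: (x)_+ = max(0, x)."""
    return max(0.0, x)


def p(x, k):
    """Integral of the sigmoid function, log(1 + exp(k*x)) / k, without overflow."""
    t = k * x
    return (max(t, 0.0) + math.log1p(math.exp(-abs(t)))) / k


k = 10
grid = [i / 100.0 for i in range(-100, 101)]      # x in [-1, 1], step 0.01
gaps = [p(x, k) - plus(x) for x in grid]
worst = max(gaps)

print("largest gap on the grid:", worst)
print("log(2) / k:             ", math.log(2.0) / k)
```

The largest gap occurs at x = 0, where the plus function has its kink; any candidate smooth function can be compared against the same grid by swapping it in for `p`.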

MKS-SSVM model
• Select the optimal parameters using uniform design [12] and 10-fold cross validation
• Solve MKS-SSVM using the Newton-Armijo algorithm
• Get the separating plane
• Predict a new input
• Calculate the accuracy of the result

Application in diabetes disease diagnosis: We used the Pima Indian diabetes dataset taken from the UCI machine learning repository [6] in our application. This dataset is commonly used by researchers who apply machine learning methods to diabetes disease classification, so it allows us to compare the performance of our method with that of others. The dataset contains 768 samples and two classes. The class distribution is: 500 samples without diabetes and 268 samples with diabetes.

There has been a lot of research on the medical diagnosis of diabetes disease in the literature, and most studies reported classification accuracies that are not very high. Polat et al. [8] used a cascade learning system based on Generalized Discriminant Analysis (GDA) and the Least Square Support Vector Machine (LS-SVM). They reported 78.21% classification accuracy using LS-SVM with 10-fold cross validation (10×CV) and 79.16% using GDA-LS-SVM. Polat and Gunes [9] reported 89.47% using Principal Component Analysis (PCA) and the Adaptive Neuro-Fuzzy Inference System (ANFIS). The accuracy obtained by Kayaer and Yildirim [10] was 80.21% using the General Regression Neural Network (GRNN) and 77.08% using the Multilayer Neural Network (MLNN) with the LM algorithm. Temurtas et al. [11] applied MLNN with LM and the Probabilistic Neural Network (PNN) to diagnose Pima Indian diabetes. They reported 79.62% classification accuracy using MLNN with 10-fold CV and 82.37% with the conventional (one training and one test) validation method. They also reported 78.05% classification accuracy using PNN with 10-fold CV and 78.13% using the conventional validation method.
Several other studies have reported accuracies between 59.5 and 84.2%. The detailed accuracies of these studies can be found in Kahramanli and Allahverdi [14].
Parameter selection is one of the important steps in SSVM for improving classification accuracy. The performance of SSVM depends on the combination of several parameters: the capacity parameter C, the kernel type K and its corresponding parameters. We used the RBF kernel function because of its good general performance and small number of parameters [13]. The parameters that should be optimized for the RBF kernel are the capacity parameter C and the kernel function parameter γ.
The 5-fold Cross Validation (CV) was used to select the best parameters. The data set is divided into 5 subsets and the holdout method is repeated 5 times. Each time, one of the 5 subsets is used as the test set and the other 4 subsets are put together to form the training set. The pair (C,γ) with the best CV accuracy is picked. After the pair (C,γ) is found, the whole training set is trained again to generate the final classifier. We used the nested Uniform Design (UD) [12] to choose good parameters.
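The selection loop described above can be sketched as follows. This is a minimal illustration, assuming a plain list of candidate (C, γ) pairs in place of the nested uniform design of [12], which is not implemented here; `k_fold_indices`, `select_parameters` and the scoring callback are hypothetical names, and `dummy_score` is a stand-in for training and testing a classifier on one fold.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_parameters(n_samples, candidates, train_and_score, k=5):
    """Return the (C, gamma) pair with the best mean k-fold CV accuracy."""
    folds = k_fold_indices(n_samples, k)
    best_pair, best_acc = None, -1.0
    for C, gamma in candidates:
        accs = []
        for i, test_idx in enumerate(folds):
            # The other k-1 folds are put together to form the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            accs.append(train_and_score(train_idx, test_idx, C, gamma))
        mean_acc = sum(accs) / k
        if mean_acc > best_acc:
            best_pair, best_acc = (C, gamma), mean_acc
    return best_pair, best_acc


def dummy_score(train_idx, test_idx, C, gamma):
    """Placeholder scorer (not a classifier): peaks at C = 10, gamma = 0.1."""
    return 1.0 / (1.0 + abs(C - 10.0) + abs(gamma - 0.1))


candidates = [(C, g) for C in (1.0, 10.0, 100.0) for g in (0.01, 0.1, 1.0)]
best_pair, best_acc = select_parameters(768, candidates, dummy_score, k=5)
print(best_pair, best_acc)
```

In a real run the callback would train the classifier on `train_idx` with the given (C, γ) and return its accuracy on `test_idx`; the final classifier is then retrained on the whole training set with the selected pair.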

RESULTS
To evaluate the effectiveness of our method, we conducted experiments on the Pima Indian diabetes dataset. All experiments were performed on a personal computer with a 2.00 GHz Intel(R) Core(TM) 2 Duo T7250 processor and 2550 megabytes of RAM, running the Windows Vista operating system with MATLAB 7 installed. The classification accuracies obtained by the original SSVM and the new SSVM are presented in Table 1.
As can be seen from Table 1, the proposed MKS-SSVM significantly increases both the training accuracy and the testing accuracy. However, the computational time of MKS-SSVM was not better than that of the original SSVM.
For comparison purposes, Table 2 gives the classification accuracies of our method and previous methods.

Table 2: Classification accuracies of our method and previous methods
Author (reference)        Method        Accuracy (%)
Polat and Gunes [9]       PCA-ANFIS     89.47 (not reproducible)
Polat et al. [8]          LS-SVM        78.21
                          GDA-LS-SVM    79.16
Temurtas et al. [11]
Polat and Gunes [9] reported 89.47% classification accuracy using PCA and ANFIS, as seen in Table 2. Nevertheless, Temurtas et al. [11], who used the same methods on the Pima Indian diabetes dataset, obtained 66.78% classification accuracy, which is very far from 89.47%. In other words, the result of Polat and Gunes [9] using the PCA-ANFIS method on the Pima Indian diabetes dataset was not reproducible.
As we can see from Table 2, the present method using MKS-SSVM obtained the highest classification accuracy so far.

DISCUSSION
An MKS-SSVM and its application to diabetes disease diagnosis have been presented. From the theoretical point of view, MKS-SSVM performs better than the original SSVM. Likewise, when applied to the diagnosis of diabetes disease, the accuracy of the proposed method is better than that of the original SSVM and of the previous studies.
Further exploration of the MKS-SSVM can yield more interesting results. We will apply this method to other classification problems.

CONCLUSION
This study has proposed a new SSVM called the Multiple Knot Spline SSVM (MKS-SSVM). To evaluate the effectiveness of our method, it was applied to Pima Indian diabetes disease diagnosis. In order to achieve high classification accuracy, the Uniform Design approach was used to search for the optimal MKS-SSVM parameters. The result was compared with the results of the original SSVM and of the previously reported studies [8][9][10][11] that focus on Pima Indian diabetes diagnosis and use the same dataset.
In conclusion, the following results can be summarized:
• The performance of MKS-SSVM was better than that of the original SSVM
• MKS-SSVM obtained very promising results for helping the diagnosis of Pima Indian diabetes disease
• The classification accuracy of MKS-SSVM obtained in this study was better than that of the original SSVM and of the previously reported studies