Unbalanced Quantitative Structure Activity Relationship Problem Reduction in Drug Design

Problem statement: Activities of drug molecules can be predicted by Quantitative Structure Activity Relationship (QSAR) models, which avoid the high cost and long cycle of traditional experimental methods. Because the number of drug molecules with positive activity is far smaller than the number with negative activity, it is important to predict molecular activities with this unbalanced situation in mind. Approach: Asymmetric bagging and feature selection were introduced into the problem, and Asymmetric Bagging of Support Vector Machines (AB-SVM) was proposed for predicting drug activities under class unbalance. At the same time, the features extracted from the structures of drug molecules affect the prediction accuracy of QSAR models, so a hybrid algorithm named SPRAG was proposed, which applies an embedded feature selection method to remove redundant and irrelevant features for AB-SVM. Results: Numerical experiments on a data set of molecular activities showed that AB-SVM improved the AUC and sensitivity of activity prediction and that SPRAG, with feature selection, further improved the prediction ability. Conclusion: Asymmetric bagging helps to improve the prediction accuracy of drug molecular activities, which can be improved further by performing feature selection to select relevant molecular features.


INTRODUCTION
Machine learning techniques have been used in drug discovery for a number of years. Nevertheless, pharmaceutical manufacturers are constantly seeking to increase predictive accuracy, either through development of existing techniques or through the introduction of new ones. Support Vector Machines (SVMs), genetic algorithms and particle swarm optimization are recent and powerful additions to the family of supervised machine learning techniques, and their application to the drug discovery process may be of considerable benefit. Modeling the Quantitative Structure Activity Relationship (QSAR) of drug molecules helps to predict molecular activities, which reduces the cost of traditional experiments and simultaneously improves the efficiency of drug molecular design [1]. Molecular activity is determined by molecular structure, so structure parameters are extracted by different methods to build QSAR models. Many machine learning methods have been applied to the modeling of QSAR problems, such as multiple linear regression, k-nearest neighbors [2], partial least squares [3], Kriging [4], artificial neural networks [5] and Support Vector Machines (SVM), of which SVM is a state-of-the-art method that achieved satisfactory results in previous studies [6][7][8]. Ensemble learning has become a hot topic in the machine learning and bioinformatics communities [9] and has been widely used to improve the generalization performance of single learning machines. For ensemble learning, a good ensemble is one whose individual members are accurate and make their errors on different parts of the input space [9]. The most popular methods for ensemble creation are bagging and boosting [10][11][12]; the effectiveness of such methods comes primarily from the diversity caused by resampling the training set. Agrafiotis et al. [13] compared bagging with single learning machines on QSAR problems and found that bagging is not always the best one.
Signal was proposed in [14]; it creates an ensemble of meaningful descriptors chosen from a much larger property space and showed better performance than other methods. Random forest has also been used in QSAR problems [15]. Dutta et al. [16] used different learning machines to build an ensemble QSAR model, with feature selection producing different subsets for the different learning machines. Although the above methods obtained satisfactory results, most previous works ignored a critical issue in QSAR modeling: the number of positive examples is often far smaller than the number of negatives. To handle this problem, Eitrich et al. [17] implemented their own SVM algorithm, which assigns different costs to the two classes and improved the prediction results.
Here, combining ensemble methods, we propose to use asymmetric bagging of SVMs to address the unbalanced problem. Asymmetric bagging of SVMs has been used to improve relevance feedback in image retrieval [18]. Instead of re-sampling from the whole data set, asymmetric bagging keeps the positive examples fixed and re-samples only from the negatives, so that the data subset of each individual classifier is balanced. Furthermore, we employ AUC (area under the ROC curve) [19] as the measure of predictive results, because prediction accuracy alone cannot show the overall performance; we analyze both the AUC and the prediction accuracy of the experimental results. In QSAR problems, many parameters are extracted from molecular structures as features, but some features are redundant or even irrelevant, and such features hurt the generalization performance of learning machines [20]. Feature selection methods can be categorized into the filter model, the wrapper model and the embedded model [20][21][22]: the filter model is independent of the learning machine, while both the embedded model and the wrapper model depend on the learning machine, the embedded model having lower computational complexity than the wrapper model. Different methods have been applied to QSAR problems [17,23,24] and have shown that proper selection of molecular descriptors helps improve the prediction accuracy. In order to improve the accuracy of asymmetric bagging, we use feature selection to improve the accuracy of the individual classifiers; this is motivated by the work of Valentini and Dietterich [16], who concluded that improving the accuracy of Support Vector Machines (SVMs) improves the accuracy of their bagging. Li et al. [25] found the embedded feature selection method effective in improving the accuracy of SVMs.
They further combined feature selection for SVMs with bagging and proposed a modified algorithm, which improved the generalization performance of ordinary bagging. Here we propose to combine that modified algorithm with asymmetric bagging to treat unbalanced QSAR problems.

Support vector machines:
Kernel-based techniques (such as support vector machines, Bayes point machines, kernel principal component analysis and Gaussian processes) represent a major development in machine learning algorithms. Support Vector Machines (SVM) are a group of supervised learning methods that can be applied to classification or regression. Support vector machines extend to nonlinear models the generalized portrait algorithm developed by Vladimir Vapnik. The SVM algorithm is based on statistical learning theory and the Vapnik-Chervonenkis (VC) dimension introduced by Vladimir Vapnik and Alexey Chervonenkis. Since their introduction, SVMs have been applied to biological data mining [28] and drug discovery [6,8].
In SVM, the Optimum Separation Hyperplane (OSH) is the linear classifier with the maximum margin for a given finite set of learning patterns. Consider the classification of two classes of patterns that are linearly separable, i.e., a linear classifier can perfectly separate them. The linear classifier is the hyperplane H (w•x+b = 0) with the maximum width (distance between hyperplanes H1 and H2). Consider a linear classifier characterized by the set of pairs (w, b) that satisfies the following inequalities for any pattern xi in the training set:

w•xi+b ≥ +1 if yi = +1
w•xi+b ≤ -1 if yi = -1

These inequalities can be expressed in the compact form:

yi(w•xi+b) ≥ 1

Because we have considered the case of linearly separable classes, each such hyperplane (w, b) is a classifier that correctly separates all patterns from the training set. For the hyperplane H (w•x+b = 0), the distance between the origin and H is |b|/||w||. The patterns from the class -1 that satisfy the equality w•x+b = -1 determine the hyperplane H1; the distance between the origin and H1 is |-1-b|/||w||. Similarly, the patterns from the class +1 that satisfy the equality w•x+b = +1 determine the hyperplane H2; the distance between the origin and H2 is |+1-b|/||w||. Of course, hyperplanes H, H1 and H2 are parallel and no training patterns are located between H1 and H2. Based on the above considerations, the margin (the distance between hyperplanes H1 and H2) is 2/||w||. It follows that the identification of the optimum separation hyperplane is performed by maximizing 2/||w||, which is equivalent to minimizing ||w||²/2.
The problem of finding the optimum separation hyperplane is thus the identification of the pair (w, b) that satisfies yi(w•xi+b) ≥ 1 for all training patterns and for which ||w|| is minimal. Denoting the training sample as {(xi, yi), i = 1,…,n}, the SVM discriminant hyperplane can be written as f(x) = sign(w•x+b). According to the generalization bound in statistical learning theory [29], we need to minimize the following objective function for a 2-norm soft margin version of SVM:

min ½||w||² + C Σi ξi²  subject to  yi(w•xi+b) ≥ 1-ξi   (1)

in which the slack variable ξi is introduced when the problem is infeasible. The constant C>0 is a penalty parameter; a larger C corresponds to assigning a larger penalty to errors. By building a Lagrangian and using the Karush-Kuhn-Tucker (KKT) complementarity conditions [30,31], we can solve the optimization problem (1). Because of the KKT conditions, only the Lagrange multipliers αi that make a constraint active are non-zero; the points corresponding to the non-zero αi are called support vectors (sv). Therefore we can describe the classification hyperplane in terms of α and b:

f(x) = sign(Σi∈sv αi yi K(xi, x) + b)

To address the unbalanced problem, C in Eq. 1 is split into C+ and C- to adjust the penalties on false positives vs. false negatives, and the objective becomes:

min ½||w||² + C+ Σ{i: yi=+1} ξi² + C- Σ{i: yi=-1} ξi²

The SVM obtained from this formulation is called the balanced SVM.
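The per-class penalty C+/C- above can be sketched with scikit-learn, whose `class_weight` argument scales C separately for each class. This is a minimal illustration, not the paper's implementation: the 10:1 weight ratio and the toy data are assumptions chosen only to show the mechanism.

```python
# Sketch of a "balanced SVM": class_weight scales the penalty C per class,
# so errors on the rare positive class cost more (C+ = 10*C, C- = C here).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy unbalanced data: 10 positives, 200 negatives in 5 dimensions.
X_pos = rng.normal(loc=1.0, size=(10, 5))
X_neg = rng.normal(loc=-1.0, size=(200, 5))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 10 + [-1] * 200)

# The 10:1 weight ratio is an illustrative assumption.
clf = SVC(C=100, kernel="rbf", gamma=0.1, class_weight={1: 10.0, -1: 1.0})
clf.fit(X, y)
print(clf.score(X, y))
```

In practice the C+/C- ratio is a tuning parameter; the experiments below refer to it as `balanced_bridge`.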
Bagging is a procedure capable of reducing the variance of predictors by mimicking averaging over several training sets. For well-behaved loss functions, bagging can provide generalization bounds with a rate of convergence of the same order as Tikhonov regularization. The key observation is that, using bagging, an α-stable algorithm can become strongly α-stable with appropriate sampling schemes. Strongly α-stable algorithms provide fast rates of convergence from the empirical error to the true expected prediction error. The key fact in this analysis is that certain sampling plans allow some points to affect only a subset of the learners in the ensemble. The importance of this effect is also remarked on in [9,10], where empirical evidence is presented to show that bagging equalizes the influence of training points in the estimation procedure, in such a way that highly influential points (the so-called leverage points) are down-weighted. Since in most situations leverage points are badly influential, bagging can improve generalization by making an unstable base learner robust. From this point of view, resampling has an effect similar to robust M-estimators, where the influence of sample points is (globally) bounded using appropriate loss functions, for example Huber's loss or Tukey's bisquare loss.
Since in uniform resampling all the points in the sample have the same probability of being selected, it seems counterintuitive that bagging can selectively reduce the influence of leverage points. The explanation is that leverage points are usually isolated in the feature space. To remove the influence of a leverage point it is enough to eliminate that single point from the sample, but to remove the influence of a non-leverage point we must in general remove a group of observations. The probability that a group of size K is completely ignored by a bootstrap sample of size m is (1 - K/m)^m, which decays exponentially with K: for K = 2, for example, (1 - K/m)^m ≈ 0.14, while (1 - 1/m)^m ≈ 0.368. This means that bootstrapping allows the ensemble predictions to depend mainly on "common" examples, which in turn gives better generalization.
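The exclusion probabilities above are easy to check numerically; as m grows, (1 - K/m)^m tends to e^(-K). The sample size m = 1000 is an arbitrary illustrative choice.

```python
# Probability that a group of K points is entirely missed by one bootstrap
# sample of size m: (1 - K/m)^m, which approaches exp(-K) for large m.
m = 1000
for K in (1, 2):
    p = (1 - K / m) ** m
    print(K, round(p, 3))   # K=1 -> ~0.368, K=2 -> ~0.135
```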
Thus bagging helps to improve the stability of single learning machines, but class unbalance also reduces their generalization performance. We therefore propose to employ asymmetric bagging to handle the unbalanced problem: the bootstrapping is executed only on the negative examples, since there are far more negative examples than positive ones. Tao et al. [18] applied asymmetric bagging to another unbalanced problem, relevance feedback in image retrieval, and obtained satisfactory results. This way, each individual classifier of the bagging ensemble is trained on a balanced number of positive and negative examples, which is why asymmetric bagging is used to solve the unbalanced problem.

Asymmetric bagging: In AB-SVM, the aggregation is implemented by the Majority Voting Rule (MVR). The asymmetric bagging strategy solves the instability problem of SVM classifiers and the unbalance problem in the training set. However, it cannot solve the problem of irrelevant and weakly redundant features in the data sets; that problem can be solved by feature selection embedded in the bagging method.
Asymmetric bagging SVM approach:
Input: training data set Sr(x1, x2,…, xd, C) with positive subset Sr+ and negative subset Sr-; number of individuals T.
Procedure: For k = 1 to T:
1. Generate a training subset Srk- from the negative training set Sr- by using the bootstrap sampling algorithm; the size of Srk- is the same as that of Sr+.
2. Train the individual model Nk on the training subset Srk- ∪ Sr+ by using the support vector machine algorithm.

PRIFEB: Feature selection for the individual classifiers can help to improve the accuracy of bagging; this is based on the conclusion of [19] that reducing the error of Support Vector Machines (SVMs) reduces the error of bagging of SVMs. At the same time, embedded feature selection reduces the error of SVMs effectively. Prediction Risk based Feature selection for Bagging (PRIFEB) uses the embedded feature selection method with the prediction risk criterion for bagging of SVMs, to test whether feature selection can effectively improve the accuracy of bagging methods and, furthermore, the prediction of drug activity. In PRIFEB, the prediction risk criterion is used to rank the features by evaluating each feature through the estimated prediction error of the data set. The feature corresponding to the smallest prediction risk value is deleted, because this feature causes the smallest error and is the least important one.
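The asymmetric bagging procedure above can be sketched in Python with scikit-learn. The toy data, the helper names (`ab_svm_fit`, `ab_svm_predict`) and the kernel parameters are illustrative assumptions; only the structure (fixed positives, bootstrapped negatives, majority vote) follows the text.

```python
# Hedged sketch of asymmetric bagging of SVMs (AB-SVM): positives are kept
# fixed in every subset, only the negatives are bootstrapped, and the T
# individual classifiers are aggregated by the majority voting rule.
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

def ab_svm_fit(X_pos, X_neg, T=5, seed=0):
    models = []
    for k in range(T):
        # Bootstrap a negative subset the same size as the positive set.
        neg_k = resample(X_neg, n_samples=len(X_pos), random_state=seed + k)
        X_k = np.vstack([X_pos, neg_k])
        y_k = np.array([1] * len(X_pos) + [-1] * len(neg_k))
        models.append(SVC(C=100, gamma=0.1).fit(X_k, y_k))
    return models

def ab_svm_predict(models, X):
    # Majority voting rule (MVR) over the individual predictions (T odd).
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)

rng = np.random.default_rng(1)
X_pos = rng.normal(1.0, size=(15, 4))
X_neg = rng.normal(-1.0, size=(300, 4))
models = ab_svm_fit(X_pos, X_neg)
print(ab_svm_predict(models, X_pos[:3]))
```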
The basic steps of PRIFEB are described as follows.
Suppose Tr(x1, x2,…, xD, C) is the training set and p is the number of individuals of the ensemble. Tr and p are input into the procedure and the ensemble model L is the output.

Step 1: Generate a training subset Trk from Tr by using the bootstrap sampling algorithm; the size of Trk is three quarters of the size of Tr.

Step 2: Train an individual model Lk on the training subset Trk by using the support vector machine algorithm and calculate the training error ERR.

Step 3: Compute the prediction risk value Si for the ith feature. If Si is greater than 0, the ith feature is selected as one of the optimal features.

Step 4: Repeat Step 3 until all the features in Trk have been evaluated.

Step 5: Generate the optimal training subset Trk-optimal from Trk according to the optimal features obtained in Steps 3-4.

Step 6: Re-train the individual model Lk on the optimal training subset Trk-optimal by using support vector machines.

Step 7: Repeat Steps 2-6 until p models are set up.

Step 8: Ensemble the obtained models L by the majority voting method for classification problems.
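Steps 3-4 hinge on the prediction risk estimate. The sketch below assumes the mean-substitution reading of the criterion (after Moody and Utans): the risk Si of feature i is the increase in error when feature i is fixed at its mean value. The helper name, data and parameters are illustrative assumptions.

```python
# Hedged sketch of the prediction risk criterion used in Steps 3-4:
# S_i = error(feature i replaced by its mean) - error(all features).
# Features with S_i > 0 are kept as "optimal" features.
import numpy as np
from sklearn.svm import SVC

def prediction_risk(model, X, y):
    base_err = 1.0 - model.score(X, y)
    risks = []
    for i in range(X.shape[1]):
        X_i = X.copy()
        X_i[:, i] = X_i[:, i].mean()   # knock feature i out by mean substitution
        risks.append((1.0 - model.score(X_i, y)) - base_err)
    return np.array(risks)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # only features 0 and 1 matter
model = SVC(C=100, gamma=0.1).fit(X, y)
S = prediction_risk(model, X, y)
selected = np.where(S > 0)[0]                # Step 3: keep features with S_i > 0
print(selected)
```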
SPRAG algorithm: Feature selection has been used in ensemble learning with interesting results. Li and Liu [32] proposed to use the embedded feature selection method with the prediction risk criterion for bagging of SVMs, where feature selection effectively improves the accuracy of bagging methods. As a feature selection method, the prediction risk criterion was proposed by Moody and Utans [33]. The feature corresponding to the smallest prediction risk value is deleted, because this feature causes the smallest error and is the least important one. The embedded feature selection model with the prediction risk criterion is employed to select relevant features for the individuals of bagging of SVMs, which is named Prediction Risk based Feature selection for Bagging (PRIFEB). PRIFEB has been compared with MIFEB (Mutual Information based Feature selection for Bagging) and ordinary bagging, and improved on bagging across different data sets [33]. Since the asymmetric bagging method can overcome both the instability and the unbalance problems, and PRIFEB can overcome the problem of irrelevant features, we propose a hybrid algorithm that combines the two.
The basic idea of the SPRAG algorithm is as follows: first, bootstrap sampling generates a negative sample, which is combined with the whole positive sample to obtain an individual training subset. Then, prediction risk based feature selection selects the optimal features, and an individual model is obtained by training an SVM on the optimal training subset. Finally, the individual SVM classifiers are ensembled using the majority voting rule to obtain the final model.
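The three steps just described can be combined into one sketch. This is a minimal reading of SPRAG under assumptions already flagged above: the mean-substitution estimate of prediction risk, illustrative toy data, and illustrative helper names; the zero-feature fallback is our addition to keep the sketch robust.

```python
# Hedged sketch of SPRAG: bootstrap the negatives, keep all positives,
# select features with prediction risk S_i > 0, retrain on the optimal
# feature subset, and aggregate by majority vote.
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

def prediction_risk(model, X, y):
    # Risk of feature i: error increase when feature i is fixed at its mean.
    base = 1.0 - model.score(X, y)
    return np.array([
        (1.0 - model.score(
            np.where(np.arange(X.shape[1]) == i, X[:, i].mean(), X), y)) - base
        for i in range(X.shape[1])
    ])

def sprag_fit(X_pos, X_neg, T=5, seed=0):
    models = []
    for k in range(T):
        neg = resample(X_neg, n_samples=len(X_pos), random_state=seed + k)
        Xk = np.vstack([X_pos, neg])
        yk = np.array([1] * len(X_pos) + [-1] * len(neg))
        m = SVC(C=100, gamma=0.1).fit(Xk, yk)            # first training pass
        feats = np.where(prediction_risk(m, Xk, yk) > 0)[0]
        if len(feats) == 0:                              # fallback: keep all
            feats = np.arange(Xk.shape[1])
        m = SVC(C=100, gamma=0.1).fit(Xk[:, feats], yk)  # retrain on optimal set
        models.append((m, feats))
    return models

def sprag_predict(models, X):
    votes = np.sum([m.predict(X[:, f]) for m, f in models], axis=0)
    return np.where(votes >= 0, 1, -1)

rng = np.random.default_rng(4)
X_pos = rng.normal(1.0, size=(20, 6))
X_neg = rng.normal(-1.0, size=(400, 6))
models = sprag_fit(X_pos, X_neg)
print((sprag_predict(models, X_pos) == 1).mean())
```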

Learning and performance measurement:
1. Begin
2. For k = 1 to T do
3.   Generate a training subset Srk- from the negative training set Sr- by using the bootstrap sampling technique; the size of Srk- is the same as that of Sr+
4.   Train the individual model Lk on the training subset Srk- ∪ Sr+ by using the support vector machine algorithm and calculate the AUC value on the training subset
5.   For i = 1 to D do
6.     Compute the prediction risk value Ri
7.     If Ri is greater than 0, the ith feature is selected as one of the optimal features
8.   End for
9.   Generate the optimal training subset Srk-optimal from Srk according to the above optimal features
10.  Train the individual model Nk on the optimal training subset Srk-optimal by using support vector machines
11. End for
12. Ensemble the obtained models N by the majority voting method for classification problems
13. End

Since the class distribution of the data set is unbalanced, classification accuracy may be misleading. Therefore, AUC (Area Under the Receiver Operating Characteristic (ROC) Curve) [19] is used to measure performance. At the same time, we present detailed results for the prediction accuracy (ACC), together with the True Positive Ratio (TPR) and True Negative Ratio (TNR). ACC, TPR and TNR are defined as:

ACC = (TP+TN)/(TP+TN+FP+FN), TPR = TP/(TP+FN), TNR = TN/(TN+FP)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The data set [34] consists of 64 BCUT descriptors.
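The metric definitions above can be computed directly from a confusion matrix; AUC is computed from the ranking scores. The tiny label vectors below are made up purely to exercise the formulas.

```python
# ACC, TPR, TNR from confusion-matrix counts, plus AUC from scores.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, -1, -1, -1, -1])   # illustrative labels
y_pred = np.array([1, -1, -1, -1, 1, -1])   # illustrative predictions
scores = np.array([0.9, 0.4, 0.3, 0.2, 0.6, 0.1])  # illustrative scores

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == -1) & (y_pred == -1))
fp = np.sum((y_true == -1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == -1))

acc = (tp + tn) / len(y_true)   # ACC = (TP+TN)/(TP+TN+FP+FN)
tpr = tp / (tp + fn)            # TPR = TP/(TP+FN)
tnr = tn / (tn + fp)            # TNR = TN/(TN+FP)
auc = roc_auc_score(y_true, scores)
print(acc, tpr, tnr, auc)
```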

RESULTS
Experiments were performed to investigate whether asymmetric bagging and feature selection help to improve the performance of bagging. Support vector machines with C = 100 and σ = 0.1 are used as individual classifiers and the number of individuals is 5. For the balanced SVM, balanced_bridge denotes the ratio of C+ to C-. For ordinary bagging, each individual has one third of the training data set, while for AB-SVM the size of each individual data subset is twice the size of the positive sample in the whole data set. The 3-fold cross validation scheme is used to validate the results, and experiments on each algorithm are repeated 10 times. Tables 1-6 list the results of ordinary SVM, balanced SVM, bagging of balanced SVM, ordinary bagging, AB-SVM and SPRAG (prediction risk based feature selection for asymmetric bagging), from which we can see the following.
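The evaluation protocol above (3-fold cross validation repeated 10 times, reporting mean AUC) can be sketched as follows. The classifier and data are illustrative stand-ins, not the paper's data set, and stratified folds are assumed so that each test fold contains both classes.

```python
# Hedged sketch of the evaluation protocol: 10 repetitions of stratified
# 3-fold cross validation, scoring each fold by AUC on decision values.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))
y = np.where(X[:, 0] > 1.0, 1, -1)   # unbalanced toy labels

aucs = []
for repeat in range(10):
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=repeat)
    for train, test in skf.split(X, y):
        clf = SVC(C=100, gamma=0.1).fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], clf.decision_function(X[test])))
print(round(float(np.mean(aucs)), 3))
```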

DISCUSSION
• Balanced SVM obtains a slight improvement over ordinary SVM
• Bagging methods improve the stability of single classifiers and obtain better results than single ones do; on balanced SVM in particular, bagging improves the result by 0.1322 over the single classifier
• Accuracy alone is misleading: if we simply predict all labels as negative, we obtain an accuracy as high as 98.15%, which is just the ratio of negative examples in the whole sample
• Since this is a drug discovery problem, we pay more attention to the positives, and AUC is more valuable than ACC for measuring a classifier. Asymmetric bagging improves the AUC value of ordinary bagging, and our modified algorithm further improves it significantly, to 79.58% on average; simultaneously, TPR is improved from 9.23% to 90.95%, which shows that our modified algorithm is well suited to the unbalanced drug discovery problem.
• Asymmetric bagging wins in two respects: it makes each individual data subset balanced, and it pays more attention to the positives by always keeping all positives in the data subset, which makes TPR higher than ordinary bagging and improves AUC
• Feature selection using prediction risk as the criterion also wins in two respects: embedded feature selection depends on the learning machine used, so it selects features that benefit the generalization performance of the individual classifiers, and different features are selected for different individual data subsets, which increases the diversity of the bagging ensemble and improves its overall performance.

CONCLUSION
To address the unbalanced problem in drug discovery, we propose to apply asymmetric bagging and feature selection to the modeling of QSAR of drug molecules. AB-SVM and our modified algorithm are compared with ordinary bagging of support vector machines on a large data set of drug molecular activities; the experiments show that asymmetric bagging and feature selection can improve the prediction ability in terms of AUC and TPR. Since this is a drug discovery problem, positive examples are few but important, so AUC and TPR are more appropriate than ACC for measuring the generalization performance of classifiers. This work introduces asymmetric bagging into the prediction of drug activities and furthermore extends feature selection to asymmetric bagging. Extensions of this paper include testing the proposed algorithms with a larger number of individuals. This work only considers an embedded feature selection model with the prediction risk criterion; future work will try to employ more efficient feature selection methods for this task.