Classification of Non-Small Cell Lung Cancer Based on Gene Expression in Cases of Smokers and Non-Smokers using Ensemble Methods with Statistical Based Feature Selection

Lung cancer is one of the leading causes of death globally. One of the main risk factors for lung cancer is smoking, which causes more than 90% of lung cancer cases. There are two types of lung cancer, i.e., Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC), of which the latter is the most common. One method that can be used to detect cancer is machine learning applied to gene expression data, an approach that promises good classification performance. This study aimed to predict the presence of NSCLC from gene expression, i.e., to classify each sample as NSCLC or normal. We used three data sets, i.e., GSE10072, GSE19804, and GSE19188, which relate to cases of NSCLC in smokers and non-smokers. The prediction was carried out using six ensemble methods, i.e., Random Forest, Adaptive Boosting, Extra Tree, Gradient Boosting, Extreme Gradient Boosting, and Categorical Boosting. Feature selection was carried out by calculating the correlation between feature and target according to statistical parameters, i.e., ANOVA, Mutual Information (MI), and a combination of ANOVA and MI. The resulting prediction models outperformed related studies on two of the three data sets, with accuracies of 100%, 97.22%, and 100% for GSE10072, GSE19804, and GSE19188, respectively.


Introduction
According to the 2020 Global Cancer Statistics, cancer ranks as a leading cause of death and is a barrier to increasing life expectancy worldwide (Li et al., 2018; Sung et al., 2021). Lung cancer was the second most frequently diagnosed cancer and the leading cause of cancer death in 2020 (Pilleron et al., 2021; Ferlay et al., 2021). About sixty-seven percent of lung cancer deaths worldwide are attributed to smoking (Sung et al., 2021). Smoking is a significant risk factor for developing lung cancer, accounting for more than 90% of lung cancer cases (Landi et al., 2008; Li et al., 2018). Concerning gender, lung and colon cancers are most common in men, especially older men (Pilleron et al., 2021; Sung et al., 2021).
There are two types of lung cancer, i.e., Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC), of which NSCLC is the most common, accounting for about 85% of lung cancer cases (Ren et al., 2020; Lai et al., 2020; Moitra and Mandal, 2020; Le et al., 2021). The diagnosis process for NSCLC patients is very complex. Generally, NSCLC patients are diagnosed using Positron Emission Tomography (PET) or CT images to detect the location and severity of the disease. However, not all images can be analyzed efficiently because medical tools and resources are limited. Late diagnosis of NSCLC leads to more aggressive treatment, such as chemotherapy and radiotherapy, with a 5-year survival rate of 20%. In contrast, early diagnosis of NSCLC can raise the 5-year survival rate to 80%.
To accelerate the early detection of NSCLC, one alternative is to apply machine learning to gene expression data (Karthik and Sudha, 2018). The data are obtained from microarray technology, which can capture genetic information or gene expression patterns that signal the presence of a disease such as lung cancer (Almugren and Alshamlan, 2019). Machine learning is widely used and is known to give promising results in analyzing gene expression. Many researchers have applied machine learning and feature selection approaches to gene expression data for disease detection, including NSCLC. Yang et al. (2018a) used the Fisher exact test and a Support Vector Machine (SVM) to predict NSCLC using the GSE43458, GSE10072, and GSE12667 data sets. Their model produced a satisfactory result, with an accuracy of 94.83% (Yang et al., 2018a). Zhao et al. (2018) used SVM to predict NSCLC using the GSE43458 and GSE10072 data sets, with an accuracy of 90.7% (Zhao et al., 2018). Yang et al. (2018b) used several feature selection methods, i.e., the t-test, entropy, the Chernoff bound, and the Wilcoxon test. They proposed a Single-Gene Ensemble Classifier (SGEC) method to predict NSCLC using the GSE10072, GSE19804, and GSE19188 data sets. Overall, they found that SGEC performs better than other machine learning methods, such as SVM, KNN, and Random Forest, with accuracies of 97.08%, 97.87%, and 96.88% for the three data sets, respectively (Yang et al., 2018b). Ren et al. (2020) used decision trees, SVM, and logistic regression to predict NSCLC using the GSE10072, GSE19804, and GSE19188 data sets. They found that logistic regression gives the best performance, with accuracies of 97.75%, 97.22%, and 98.72% for the three data sets, respectively (Ren et al., 2020). Abdu-Aljabar and Awad (2021) implemented the Extreme Gradient Boosting (XGBoost) algorithm to predict NSCLC using GSE19188 and obtained a satisfactory result compared to SVM and gcForest, with an accuracy of 95.7%.
One challenge in processing gene expression data is handling its high dimensionality. Hence, appropriate feature selection methods are needed to improve the result (Almugren and Alshamlan, 2019; Bommert et al., 2020). In the studies reviewed above, the reported accuracies for similar cases are still below 100%. Moreover, those studies selected features with individual methods, whereas in this study we also combine the individual methods into what we call overlap features. Hence, there is room for improvement. Among machine learning methods, ensemble methods are known to be suitable for high-dimensional data such as gene expression data. An ensemble combines several weak classifiers to improve overall model performance compared to a single classifier. Ensemble methods aim to reduce variance (through bagging, i.e., bootstrap aggregating) and bias (through boosting). Hence, ensemble methods are promising for predicting NSCLC with better accuracy.
In this study, we performed a comprehensive study of ensemble methods for predicting NSCLC using three data sets, i.e., GSE10072, GSE19804, and GSE19188. Six ensemble methods are used, i.e., Random Forest (RF), Adaptive Boosting (AdaBoost), Extra Tree (ET), Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), and Categorical Boosting (CatBoost). To the best of our knowledge, there is no comprehensive study of ensemble methods for predicting NSCLC on these data sets. Furthermore, we performed feature selection to extract the most important features by calculating the correlation between feature and target according to statistical parameters, i.e., ANOVA, Mutual Information (MI), and a combination of ANOVA and MI. The performance of our proposed method is also compared with other studies that used similar data sets.

Data Set
We used three data sets, i.e., GSE10072, GSE19804, and GSE19188, retrieved from GEO (Gene Expression Omnibus). Each data set has two classes, i.e., normal and NSCLC, as presented in Table 1. Each data set is divided into train and test sets with a ratio of 70:30. Fig. 1 presents the class frequencies in the train and test sets. According to Fig. 1, the number of records per class is almost balanced in both the train and test sets. Hence, we can neglect the possibility of class imbalance problems.
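The 70:30 split can be illustrated with a minimal sketch. It assumes a stratified split (the same 30% taken from each class), which the text does not state explicitly; the function name `stratified_split`, the seed, and the toy class counts are illustrative and are not the counts of Table 1.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.30, seed=0):
    """Return (train_idx, test_idx) with about test_ratio of each class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # random assignment within the class
        n_test = round(len(idxs) * test_ratio)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels only (not the real GEO class counts): 60 NSCLC and 40 normal samples.
labels = ["NSCLC"] * 60 + ["normal"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 70 30
```

Stratifying keeps the class proportions of the train and test sets close to those of the full data set, which matches the near-balanced frequencies described for Fig. 1.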

Data Sets Distribution
The Principal Component Analysis (PCA) algorithm is used to inspect the data distribution of each class. PCA projects high-dimensional data into a low-dimensional space by transforming mutually correlated input features into new uncorrelated features called principal components (Karthik and Sudha, 2018). The distribution of the three data sets is presented in Fig. 2.
As for GSE10072, the classes do not overlap in either the train or the test set, which indicates that this data set is easier to classify. As for GSE19804, the classes are not well separated in either the train or the test set: some samples of the normal class lie close to the NSCLC class and vice versa. This implies that classification of GSE19804 is relatively complex, so producing a good classification performance is a challenge. As for GSE19188, the train set resembles that of GSE19804, but the classes in the test set look well separated, as in GSE10072. This indicates that a good result is possible on the test set in particular.

Feature Selection
Feature selection was carried out by calculating the correlation between feature and target according to statistical parameters, i.e., Analysis of Variance (ANOVA), Mutual Information (MI), and a combination of ANOVA and MI. These methods are classified as filter methods because they are applied before the classification process (Logotheti et al., 2016).
Mutual Information (MI) measures the mutual dependence between two variables. The higher the MI value, the stronger the dependence between the feature and the target. An MI value of zero indicates no relationship between feature and target (Vergara and Estévez, 2014). The MI value is calculated using Eq. (1) (Bommert et al., 2020):
$$\mathrm{MI}(C;F) = E(C) - E(C \mid F) \qquad (1)$$
where MI(C;F) is the mutual information of C and F, C is the class/target, F is the feature, E(C) is the entropy of the class, and E(C|F) is the conditional entropy of C given F.
Analysis of Variance (ANOVA) aims to identify potentially significant differences between the means of two or more groups (classes) by measuring between-group and within-group variation (Almugren and Alshamlan, 2019). This method is very useful for finding the features that best separate samples of two classes. The calculation for feature a (X_a) is defined in Eq. (2) (Bommert et al., 2020; Purba et al., 2022):
$$F_a = \frac{\sum_{i=1}^{c} n_i \left(\bar{x}_i - \bar{x}\right)^2 / (c-1)}{\sum_{i=1}^{c} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 / (n-c)} \qquad (2)$$
where c is the number of classes, n_i is the number of samples in class i, n is the total number of samples, \bar{x}_i is the mean value of X_a in class i, x_{ij} are the observed values of feature X_a for the samples of class i, and \bar{x} is the mean value of X_a over all samples in the data set.
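As an illustration of the two filter scores, the following plain-Python sketch computes the one-way ANOVA F statistic of a single feature and the MI between a feature and the class as E(C) - E(C|F). It is not the authors' implementation: the MI part assumes the feature has already been discretised into bins (the text does not say how MI was estimated for continuous expression values), and the function names and toy values are hypothetical.

```python
import math
from collections import Counter

def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the classes."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class variation: class-size-weighted squared distance of class means.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Within-class variation: squared distance of each value to its class mean.
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))

def entropy(symbols):
    """Shannon entropy (in bits) of a list of discrete symbols."""
    n = len(symbols)
    return -sum((m / n) * math.log2(m / n) for m in Counter(symbols).values())

def mutual_information(feature_bins, labels):
    """MI(C;F) = E(C) - E(C|F) for a discretised feature."""
    n = len(labels)
    conditional = 0.0
    for b in set(feature_bins):
        subset = [c for f, c in zip(feature_bins, labels) if f == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy expression values of one gene (made up): class 0 = normal, class 1 = NSCLC.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
bins = ["low" if v < 5.0 else "high" for v in values]  # assumed discretisation

print(anova_f(values, labels))           # 54.0
print(mutual_information(bins, labels))  # 1.0
```

In this toy case the gene separates the classes perfectly, so the MI equals the full class entropy of 1 bit and the F statistic is large; a gene unrelated to the class would score close to zero on both.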

Prediction Model
The prediction models in this study are built with ensemble methods. An ensemble method is a meta-algorithm that aims to improve machine learning performance by combining several models into one predictive model. Ensemble methods aim to reduce variance (through bagging, i.e., bootstrap aggregating) and bias (through boosting). They are more accurate than a single classifier because they combine several classifiers with the bagging or boosting approach. Ensemble methods have also been used successfully in various real-world cases (Zhou, 2012). Bagging and boosting are illustrated in Fig. 3.
Bagging is accomplished by bootstrapping, i.e., random sampling with replacement. In this way, several random sub-datasets are formed, each used to build a separate predictive model. Bagging performs classification in parallel, with each model built independently. Algorithms that use bagging techniques include Random Forest and Extra Tree.
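A minimal sketch of the two ingredients of bagging, bootstrap sampling and vote aggregation, is given below. The base models themselves are omitted; the three predictions are hypothetical placeholders, and the function names are illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices uniformly at random with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most frequent label among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
# One bootstrap sub-dataset (as row indices) per base model; each model is trained independently.
samples = [bootstrap_sample(10, rng) for _ in range(3)]

# Hypothetical predictions of the three base models for one test sample.
votes = ["NSCLC", "normal", "NSCLC"]
print(majority_vote(votes))  # NSCLC
```

Because sampling is done with replacement, each sub-dataset typically repeats some rows and leaves others out, which is what makes the base models differ from one another.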
Boosting performs classification sequentially, with each new model developed to handle the shortcomings of the previous one. The boosting technique aims to strengthen the model by repeating the training on data that are still misclassified. Boosting methods include AdaBoost, Gradient Boosting, XGBoost, and CatBoost.
For classification problems, the ensemble makes its decision by majority vote. Ensemble methods such as bagging can also reduce overfitting and underfitting and thus provide a better classification model (Altman and Krzywinski, 2017). The six ensemble methods used in this study are Random Forest (RF), Adaptive Boosting (AdaBoost), Extra Tree (ET), Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), and Categorical Boosting (CatBoost).
Random Forest (RF) is an ensemble method built from several decision trees. RF is a popular machine learning method that can handle high-dimensional data (Nembrini et al., 2018). RF builds randomized decision trees using the bootstrap/bagging concept (SLRF, 2022; Ram et al., 2017; Kurniawan et al., 2020). RF performance is usually not sensitive to hyperparameter tuning (Logotheti et al., 2016). For a new observation Mnew, the RF output RF(Mnew) is predicted by Eq. (3) (Liu et al., 2021), which combines the prediction of each m-th decision tree with Mnew as input.
Adaptive Boosting (AdaBoost) is one of the popular boosting methods (Lu et al., 2019). AdaBoost combines several weak classifiers to produce a robust classifier. It works by adjusting the weights in each cycle of the weak classifier group. AdaBoost can give better results because the weak classifiers are diversified and weighted according to the performance of each classifier (Kurniawan et al., 2020).
The final output of the AdaBoost classifier can be represented as shown in Eq. (4) (Wang and Tang, 2020), where M is the train set, A_m stands for the m-th weak classifier, and α_m is the corresponding weight coefficient.
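For intuition, the AdaBoost decision is usually written as the sign of the weighted sum of the weak-classifier outputs, sign(Σ α_m A_m(x)). The sketch below assumes this standard form with outputs in {-1, +1}; the decision stumps, thresholds, and weights are made up for illustration and are not fitted to any data.

```python
def adaboost_predict(x, weak_classifiers, alphas):
    """Sign of the alpha-weighted sum of weak-classifier outputs in {-1, +1}."""
    score = sum(alpha * clf(x) for clf, alpha in zip(weak_classifiers, alphas))
    return 1 if score >= 0 else -1

# Hypothetical decision stumps on a two-gene expression vector x = (g0, g1).
stumps = [
    lambda x: 1 if x[0] > 5.0 else -1,
    lambda x: 1 if x[1] > 2.0 else -1,
    lambda x: 1 if x[0] + x[1] > 9.0 else -1,
]
# Made-up weights; AdaBoost derives them from each stump's weighted training error.
alphas = [0.9, 0.4, 0.3]

print(adaboost_predict((6.0, 1.0), stumps, alphas))  # 1
print(adaboost_predict((4.0, 3.0), stumps, alphas))  # -1
```

In the first call only the first stump votes +1, but its weight (0.9) outweighs the other two (0.4 + 0.3), so the ensemble still predicts +1. This is the sense in which accurate weak classifiers dominate the final decision.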
Extra Tree (ET) is a decision tree-based algorithm with a high degree of randomization. It differs from RF in how the trees are built: Extra Tree chooses at random the threshold that separates samples into two tree branches (Logotheti et al., 2016). In the testing process, the final ET probability of a sample belonging to each class is computed as the average of the probabilities over all trees, as defined in Eq. (5) (Soltaninejad et al., 2017):
$$p\bigl(c \mid h(x, M_{train})\bigr) = \frac{1}{T} \sum_{t=1}^{T} p_t\bigl(c \mid h(x, M_{train})\bigr) \qquad (5)$$
where T is the number of randomized trees, x represents the data point, M_train is the data set, h(x, M_train) represents a feature vector, and p_t(c | h(x, M_train)) represents the weak predictor learned by tree t.
Gradient Boosting (GB) is a powerful technique for problems involving noisy data, with applications such as recommendation systems and weather forecasting. The main concept of GB is to build a predictive model by performing gradient descent (Prokhorenkova et al., 2018). The gradient boosting procedure using the least squares approximation is shown in Eq. (6) (Prokhorenkova et al., 2018; Liu et al., 2017), where t represents the number of trees, h represents a function in the functional space H, and H represents the set of all possible regression trees.
Extreme Gradient Boosting (XGBoost) is an end-to-end tree boosting system widely used by data scientists to achieve better results (Chen and Guestrin, 2016). In addition, XGBoost can automatically use CPU multithreading for parallel computing, which speeds up calculations (Li et al., 2019) and makes model exploration faster. XGBoost is an advanced version of GB that provides better performance and faster computing time (Abdu-Aljabar and Awad, 2021). The objective function of XGBoost is shown in Eq. (7) (Li et al., 2019), where l is the loss function and Ω represents the function used for regularization to prevent overfitting.
Categorical Boosting (CatBoost) is a GB algorithm that trains weak decision trees iteratively, using binary decision trees as base predictors. The advantage of CatBoost is that it can handle various types of data, including categorical data, hence the name Categorical Boosting. CatBoost modifies the gradient calculation to avoid a shift in the predictions and thereby improve model accuracy (Bentéjac et al., 2021). In some cases, CatBoost gives better results than XGBoost (Prokhorenkova et al., 2018). The decision tree f of CatBoost can be defined as shown in Eq. (8) (Prokhorenkova et al., 2018), where R_t are the disjoint regions corresponding to the leaves of the tree.
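For reference, the three expressions referred to as Eqs. (6) to (8) are commonly written as below. This is a sketch based on the usual forms in the cited CatBoost and XGBoost papers, not a verbatim copy of the equations of this article. The symbols g^t (the gradient of the loss with respect to the current model at step t), the leaf values b_j, and the leaf index j are assumptions; the regions written R_j here correspond to the disjoint leaf regions described in the text.

```latex
% Eq. (6): least-squares approximation of the negative gradient at step t
h^{t} = \operatorname*{arg\,min}_{h \in H} \; \mathbb{E}\bigl(-g^{t}(x, y) - h(x)\bigr)^{2}

% Eq. (7): XGBoost objective, training loss plus a regularization term per tree f_k
\mathcal{L} = \sum_{i} l\bigl(\hat{y}_i, y_i\bigr) + \sum_{k} \Omega(f_k)

% Eq. (8): a decision tree as a piecewise-constant function over disjoint leaf regions
f(x) = \sum_{j=1}^{J} b_j \, \mathbb{1}\{x \in R_j\}
```

Read together, Eq. (6) chooses the next tree so that it best fits the negative gradient of the loss, Eq. (7) adds a complexity penalty to that loss, and Eq. (8) states that each tree assigns one constant value to every sample falling in the same leaf.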

Model Development
In this study, we defined 18 models by combining three feature selection methods with six prediction models. The three feature selection methods are ANOVA, Mutual Information (MI), and a combination of ANOVA and MI. The model variations used in this study are presented in Table 2, while the values of the model parameters are presented in Table 3.
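The 18 model variations can be enumerated as in the sketch below, using the abbreviations that appear in the results (for example RF-OL or XG-MI). The `overlap` helper assumes that the combined ANOVA-MI feature set is the intersection of the two individual selections, which is one plausible reading of "overlap features" and is not confirmed by the text; the gene names are placeholders.

```python
from itertools import product

classifiers = ["RF", "AB", "ET", "GB", "XG", "CB"]
feature_sets = ["ANOVA", "MI", "OL"]  # OL = overlap of ANOVA and MI

# Every pairing of a feature set with a classifier is one model variation.
models = [f"{clf}-{fs}" for fs, clf in product(feature_sets, classifiers)]
print(len(models))  # 18

def overlap(anova_genes, mi_genes):
    """Genes selected by both filters, kept in the ANOVA ranking order (assumed definition)."""
    mi_set = set(mi_genes)
    return [g for g in anova_genes if g in mi_set]

print(overlap(["g1", "g2", "g3"], ["g3", "g1", "g4"]))  # ['g1', 'g3']
```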

Feature Selection
We reduced the number of features by evaluating the effect of the number of features on model performance using 5-fold cross-validation. The best number of features was searched within the range of 2 to 20. Model performance is represented by the log loss score, for which a lower value indicates better performance. The plots of the number of features against the log loss score for GSE10072, GSE19804, and GSE19188 are presented in Figs. 4, 5, and 6, respectively. Since the AdaBoost score for GSE10072 is significantly larger than the scores of the other methods, we provide two plots per feature selection method to make the fluctuation of the other scores visible.
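The selection criterion can be sketched as follows: compute the log loss on each validation fold, average over the folds for every candidate number of features, and keep the number with the lowest mean. The per-fold scores below are invented placeholders covering only three candidate values rather than the full range of 2 to 20, and the function names are illustrative.

```python
import math

def log_loss(y_true, p_pos, eps=1e-15):
    """Mean binary cross-entropy; y_true in {0, 1}, p_pos = predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def best_feature_count(cv_scores):
    """cv_scores maps a feature count k to its list of per-fold log loss values."""
    mean = {k: sum(v) / len(v) for k, v in cv_scores.items()}
    return min(mean, key=mean.get)

# Log loss of two predictions on one fold: a confident hit and a fairly confident hit.
print(round(log_loss([1, 0], [0.9, 0.2]), 4))  # 0.1643

# Made-up 5-fold scores for three candidate feature counts.
cv_scores = {
    2: [0.20, 0.25, 0.22, 0.18, 0.21],
    3: [0.12, 0.15, 0.11, 0.13, 0.14],
    4: [0.16, 0.19, 0.15, 0.17, 0.18],
}
print(best_feature_count(cv_scores))  # 3
```

Log loss penalizes confident wrong probabilities heavily, so it separates models that have the same accuracy but differ in how well calibrated their predicted probabilities are.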
As for GSE10072, Fig. 4(a) and 4(c) show that the AdaBoost score fluctuates far more than the scores of the other methods, as mentioned above, under both ANOVA and MI. The score fluctuation of the other methods can be observed in Fig. 4(b) and 4(d). Interestingly, the XGBoost score does not fluctuate under either ANOVA or MI, which indicates that the number of features does not significantly affect the XGBoost method. The Gradient Boosting score also fluctuates more than the scores of the remaining methods, as confirmed by its high standard deviation. This indicates that the performance of this method depends strongly on the number of features.
Meanwhile, the scores of the other methods, i.e., Random Forest, Extra Trees, and CatBoost, show a similar tendency, even though their absolute values differ. This might imply that those methods have similar characteristics. Overall, we can confirm that increasing the number of features does not guarantee an increase in model performance.
As for GSE19804, Fig. 5(a) and 5(b) show that the scores of AdaBoost and Gradient Boosting fluctuate far more than the scores of the other methods, under both ANOVA and MI. This shows that the performance of AdaBoost and Gradient Boosting is highly dependent on the number of features. Fluctuations in the XGBoost score are more clearly visible in this data set for both feature selection methods, which indicates that here the number of features does affect the XGBoost method. Interestingly, Fig. 5(b) shows a very sharp decrease in the score at one feature number for the Extra Tree method. The scores of the other methods, i.e., Random Forest and CatBoost, show the same trend even though their values differ.
The optimal number of features, minimum log loss score, and standard deviation obtained with ANOVA and MI are summarized in Tables 4, 5, and 6. As for GSE10072, Table 4 shows that the lowest log loss was reached by Extra Tree-ANOVA, AdaBoost-MI, and Gradient Boosting-MI, with optimal feature numbers of 2, 2, and 4 and minimum log loss values of 0.030, 0.000, and 0.000, respectively. This result indicates that fewer features can give a better-performing model. Meanwhile, AdaBoost has a high standard deviation under both ANOVA and MI, yet it reached the minimum log loss under MI. This indicates that the number of features significantly affects the AdaBoost method.
As for GSE19804, Table 5 shows that the lowest log loss was reached by CatBoost under both ANOVA and MI, with optimal feature numbers of 2 and 3 and minimum log loss values of 0.129 and 0.106, respectively. This confirms that increasing the number of features does not guarantee an increase in model performance. The standard deviation is highest for AdaBoost under both ANOVA and MI, which indicates that the performance of this method depends strongly on the number of features.
As for GSE19188, Table 6 shows that the lowest log loss was reached by Extra Tree-ANOVA and Random Forest-MI, with optimal feature numbers of 13 and 2 and minimum log loss values of 0.141 and 0.118, respectively. The standard deviation is highest for Extra Trees-ANOVA and Gradient Boosting-MI.

Features Selection Evaluation
We evaluated the effect of the feature selection process by comparing the performance of models built on different feature sets, i.e., all features, ANOVA features, MI features, and ANOVA-MI features (overlap features). Model performance was determined by the F1-score. The comparison for GSE10072, GSE19804, and GSE19188 is presented in Fig. 7a, 7b, and 7c, respectively.
As for GSE10072, the overlap features gave the best result (100%) with the Random Forest and CatBoost methods compared to the other feature sets. This might indicate that the overlap features increase feature quality for both methods. Several methods, i.e., AdaBoost and Extra Trees, perform better with all features. However, the higher F1-score obtained with all features is not worth the cost, as it comes with high dimensionality and model complexity. We also found that the MI features give the best score with the Gradient Boosting method.
As for GSE19804, the overlap features gave the highest F1-score when used with RF and AdaBoost. While all features and MI achieved the highest F1-score on XGBoost, ANOVA did not provide the highest F1-score for any prediction method in this data set. Feature selection gives the best results on RF, AdaBoost, Extra Trees, and Gradient Boosting. In comparison, all features give the same F1-score on XGBoost and CatBoost. These results indicate that using more features does not always give better predictive results.
As for GSE19188, RF obtained its smallest F1-score with the overlap features but reached 100% with the other feature sets. MI gives the highest F1-score on AdaBoost, Extra Trees, and XGBoost. Interestingly, MI also gives the best F1-score on GB, which the other feature sets do not achieve. In CatBoost and RF, a 100% F1-score was obtained with all features, ANOVA, and MI.
Generally, the overlap features give the best performance in GSE10072, reaching a 100% F1-score. As for GSE19804, the highest F1-score is 97.3%, obtained with the overlap features and MI. As for GSE19188, feature selection using ANOVA and MI gives the best results, with an F1-score of 100%. We can conclude that feature selection is effective for analyzing NSCLC gene expression data.

Validation Results
The models generated in the training process are then validated using the test set. Model performance was measured using accuracy, precision, recall, and F1-score. We consider the accuracy on the test set as the overall measure for determining the best model. The validation results for GSE10072, GSE19804, and GSE19188 are summarized in Tables 7, 8, and 9, respectively.
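The four measures follow directly from the confusion-matrix counts, as the sketch below shows. The counts are hypothetical and are not taken from the tables: they describe a 36-sample test set with a single false negative, chosen because they reproduce the 97.22% accuracy, 100% precision, 94.74% recall, and 97.30% F1-score reported for the best GSE19804 models.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of positive predictions that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 18 true positives, 0 false positives, 1 false negative, 17 true negatives.
acc, prec, rec, f1 = classification_metrics(tp=18, fp=0, fn=1, tn=17)
print(f"{acc:.2%} {prec:.2%} {rec:.2%} {f1:.2%}")  # 97.22% 100.00% 94.74% 97.30%
```

The example also shows why a model can reach 100% precision without 100% recall: it never labels a normal sample as NSCLC, yet it misses one NSCLC sample.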
As for GSE10072, the recall of all models is 100%, which indicates that every model identifies all true positives and produces no false negatives. The best models are RF-OL and CB-OL, with accuracy and F1-score both equal to 100%. This indicates that both models predict the entire test set correctly and confirms the suitability of the overlap features for this data set, as discussed before. Several models, i.e., AB-ANOVA, ET-ANOVA, GB-ANOVA, AB-MI, ET-MI, CB-MI, AB-OL, and GB-OL, give the worst performance, with accuracy and F1-score of 93.94% and 92.31%, respectively.
As for GSE19804 in Table 8, the precision reached the maximum score (100%) in several models, which indicates that every sample these models predicted as positive was truly positive. The best recall, 94.74%, was reached by several models, which indicates that they detect most true positives and produce few false negatives. The best models are XG-MI, RF-OL, and AB-OL, with accuracy and F1-score of 97.22% and 97.30%, respectively. This indicates that those models predict nearly the entire test set correctly. It also confirms the suitability of RF and the overlap features for this data set, similar to GSE10072. The GB-MI model has the worst performance, with accuracy and F1-score of 88.89% and 88.24%, respectively.
As for GSE19188 in Table 9, the best models are RF-ANOVA, CB-ANOVA, and all classification methods with MI, with accuracy and F1-score both equal to 100%. This confirms the suitability of MI for this data set. The AB-OL model gives the worst performance, with accuracy and F1-score of 91.49% and 93.55%, respectively. The best recall value is 100%, reached by several models, i.e., RF-ANOVA, AB-ANOVA, GB-ANOVA, CB-ANOVA, and all classification methods with MI, which indicates that these models identify all true positives and produce no false negatives.
Interestingly, with the overlap features none of the classification methods reaches a recall of 100%. However, the overlap features reached the maximum precision (100%) in several models, i.e., RF-OL, ET-OL, and CB-OL, which indicates that every sample these models predicted as positive was truly positive.
Table 10 shows the average performance of each model over all data sets. The best model is RF-OL, with accuracy and F1-score of 98.36% and 98.54%, respectively. This confirms the suitability of RF-OL for the majority of the data sets, as discussed before for GSE10072 and GSE19804. The AB-ANOVA model has the worst performance, with accuracy and F1-score of 93.78% and 93.66%, respectively. We can conclude that the overlap features contribute most in two data sets (GSE10072 and GSE19804), while MI contributes most in one data set (GSE19188).
The best average recall is 98.25%, reached by several models, i.e., RF-ANOVA, ET-MI, and XG-MI, which indicates that these models identify nearly all true positives and produce almost no false negatives. In addition, the precision reached the maximum score (100%) in the RF-OL model, which indicates that every sample this model predicted as positive was truly positive.

Comparison of Competitive Methods
We also compared our results with other studies that used similar data sets (Ren et al., 2020; Yang et al., 2018b), as shown in Table 11. For the comparison, we took the two methods with the highest accuracy for each data set from each study. The reference studies used Logistic with L1 and SG-w (Ren et al., 2020) and SVM and Logistic with L1 (Yang et al., 2018b) to predict NSCLC. As for GSE10072, our proposed method gives better accuracy (100%) than Ren et al. (2020) and similar accuracy to Yang et al. (2018b). This indicates that our proposed methods are suitable for this data set.
As for GSE19804, the highest accuracy obtained by another study is 97.87% (Yang et al., 2018b), which is 0.65 percentage points above the accuracy of our proposed method. Nevertheless, none of the studies achieved 100% accuracy for GSE19804. This makes GSE19804 a challenging data set to explore in subsequent analyses, for example using data augmentation (Kaur et al., 2022) or different feature selection and classification methods. As for GSE19188, we achieved 100% accuracy, which none of the competing methods reached. This points to the novelty of our proposed methods, which reached the optimal accuracy using MI-XGBoost and MI-Random Forest.
Generally, we obtained better results for GSE19188 and quite similar results for GSE10072 and GSE19804. This might be related to the feature selection and ensemble methods that we applied to these data sets. Our feature selection differs from that of the competing studies, none of which uses the overlap features. We conclude that the overlap features are suitable for these data sets and contribute to a better classification process.

Conclusion
In this study, we applied six ensemble methods, i.e., Random Forest, Adaptive Boosting, Extra Tree, Gradient Boosting, Extreme Gradient Boosting, and Categorical Boosting, to classify gene expression data for NSCLC. The three data sets discussed in this study, i.e., GSE10072, GSE19804, and GSE19188, contain gene expression data on NSCLC influenced by smoking. Feature selection was carried out by calculating the correlation between feature and target according to statistical parameters, i.e., ANOVA, Mutual Information (MI), and a combination of ANOVA and MI. In terms of average performance over all data sets, the overlap features, i.e., the combination of ANOVA and MI, give the best results with Random Forest as the classifier. For GSE10072 and GSE19188, our proposed method provided the highest accuracy of 100%, while the accuracy for GSE19804 has not yet reached 100%, which remains a challenge for the future. For future work, we suggest further improving the performance on GSE19804 using data augmentation or other feature selection and classification methods, e.g., deep learning models.