Gene Selection and Classification in Microarray Datasets using a Hybrid Approach of PCC-BPSO/GA with Multi Classifiers

In this study, a three-phase hybrid approach is proposed for the selection and classification of high-dimensional microarray data. The method uses Pearson’s Correlation Coefficient (PCC) in combination with Binary Particle Swarm Optimization (BPSO) or Genetic Algorithm (GA) along with various classifiers, thereby forming a PCC-BPSO/GA-multi classifiers approach. Five different classifiers are employed in the final classification stage. The PCC filter showed a remarkable improvement in classification accuracy when it was combined with BPSO or GA. The size of this improvement varied across datasets depending on the final classifier applied. The performance of the various combinations of the hybrid technique was compared in terms of accuracy and number of selected genes. In addition to BPSO running faster than GA, BPSO performed better than GA when combined with PCC feature selection.


Introduction
Advances in microarray technology and the need to analyze gene expression have stimulated a broad line of research in bioinformatics, biotechnology, cancer informatics and similar fields (Bolón-Canedo et al., 2014). Microarray data hold information about how genes are expressed. By analyzing these data, one can find the altered genes, thereby facilitating the diagnosis and classification of genetic-related diseases. Consequently, biologists can perform cost-effective and efficient studies of the altered genes when only a small number of selected genes are targeted (Cosma et al., 2017). Prediction and classification of cancer types is a great challenge in the medical sector.
Gene expression profiles play a vital role in this regard. However, because the number of samples is small compared with the large number of genes, many computational methods fail to identify a small subset of important genes in microarray data, which makes microarray analysis more challenging (Singh and Sivabalakrishnan, 2015). Furthermore, microarray data usually contain redundant and irrelevant features (genes). These features can significantly increase the computational burden (Wang, 2012). Redundant features do not contribute to building a better predictor because the information they provide is already carried by other feature(s) (Song et al., 2013).
Redundant features negatively affect the performance of a model; hence, to achieve better performance, it is desirable to perform feature selection. Feature selection, whose purpose is to find a subset of discriminative/altered features, is therefore essential and is widely recognized as one of the centrally important areas in biomedicine, bioinformatics and data mining (Conilione and Wang, 2005). Three main techniques are used in feature selection: filter-based, wrapper-based and hybrid methods (Bolón-Canedo et al., 2013; Hira and Gillies, 2015; Singh and Sivabalakrishnan, 2015). These methods are categorized according to how they use the learning algorithm. A filter chooses variables regardless of the model used and works by suppressing the least interesting variables. The non-suppressed variables become part of a regression or classification model which is used for the classification or prediction of data (Hira and Gillies, 2015). As filter techniques are not applied to build predictors (Lazar et al., 2012), the classifier accuracy becomes lower if the results of these filters are given directly to the learning algorithm (Hira and Gillies, 2015). Depending on the assumptions made about the data distribution, filters are divided into parametric and non-parametric methods (Hameed et al., 2018).
Parametric filters, such as ANOVA, chi-squared and Bayesian filters, assume equal distribution of samples in different classes (Saeys et al., 2007). However, this assumption cannot be guaranteed in most datasets. Therefore, non-parametric methods might yield a better result when there is uncertainty regarding the dataset distribution. Examples of non-parametric filters are Relief-F, information gain, correlation coefficient (Pearson) and gain ratio. The Pearson Correlation Coefficient (PCC) is utilized to determine the interrelation between the features and to investigate the correlation between classes (Hall, 1999). In wrapper-based feature selection, the evaluation is performed on subsets of the variables, through which possible interactions between the variables can be observed. This is achieved by using the classifier accuracy (Saeys et al., 2007). Wrappers choose the subset of features that gives the highest accuracy to the model. The result of this selection usually consists of a smaller number of features with robust discriminative power (Xiong et al., 2001). In addition, wrappers are classifier dependent and hence the same result is not guaranteed when another classifier is applied (Lazar et al., 2012; Santana and de Paula Canuto, 2014). Furthermore, the overall performance of wrappers decreases and overfitting may occur if they are applied directly to the data without any pre-processing step (Bolón-Canedo et al., 2014). Hybrid approaches are built on the useful combination of filter and wrapper algorithms (Alba et al., 2007; Hameed et al., 2017; Lu et al., 2017). Hence, the disadvantages of filters and wrappers can be overcome by using a hybrid technique. Conventional optimization algorithms do not work efficiently for feature selection in large-scale problems (Chen et al., 2012).
Problems in high-dimensional data analysis have motivated researchers to search for possible solutions and propose viable algorithms. A novel Markov Blanket-Embedded Genetic Algorithm (MBEGA) was proposed for the gene selection problem (Zhu et al., 2007). The embedded Markov blanket-based memetic operators add or delete features (genes) from a Genetic Algorithm (GA) solution so as to quickly improve the solution and fine-tune the search. A modified Support Vector Machine (SVM) was also suggested to select the minimum possible number of genes (Ghaddar and Naoum-Sawaya, 2018). A multi-objective version of the bat algorithm for binary feature selection (Dashtban et al., 2018) and the Genetic Bee Colony (GBC) algorithm (Alshamlan et al., 2015) were successfully utilized on high-dimensional datasets. Moreover, a hybrid feature selection algorithm was proposed that combines Mutual Information Maximization (MIM) and the Adaptive Genetic Algorithm (AGA) (Lu et al., 2017). The reduced gene expression dataset presented higher classification accuracy compared with conventional feature selection algorithms. In order to improve classification accuracy, a further study utilized a hybrid form of filter and wrapper, consisting of information gain and the standard genetic algorithm (Maldonado et al., 2014). Besides, a binary version of the Black Hole Algorithm, called BBHA, was proposed for solving the feature selection problem in biological data. However, the tested classifiers were all from the tree family and other kinds of classifiers were not assessed (Pashaei and Aydin, 2017). Along this line, different classifiers such as the Artificial Neural Network (ANN) (Aziz et al., 2017) and a fuzzy decision tree algorithm (Ludwig et al., 2018) have been assessed on microarray data.
The two evolutionary algorithms PSO and GA are usually used in wrapper form (Alba et al., 2007; Chen et al., 2012). Compared with other algorithms, PSO is known to be memory enabled and requires few parameters to be adjusted, so it is simple and efficient (Chandra Sekhara Rao Annavarapu and Banka, 2016; Hameed et al., 2017). Kar et al. (2015) proposed a PSO-adaptive K-nearest neighbor (KNN) based gene selection method and used a heuristic for selecting the optimal value of K, while the classification accuracies were tested using the SVM algorithm. We have previously reported a hybrid method which combines three filters with geometric binary PSO and SVM for effective gene selection and classification in the high-dimensional data of autism (Hameed et al., 2017). Very recently, Jain et al. (2018) reported a two-phase hybrid model for cancer classification, integrating Correlation-based Feature Selection (CFS) with improved-Binary Particle Swarm Optimization (iBPSO) using Naive-Bayes as the only classifier. In the current research work, a three-phase hybrid form of filter-wrappers-multi classifiers is proposed, aiming at effective selection and classification in high-dimensional microarray data. Pearson Correlation Coefficient (PCC) in combination with the binary form of PSO (BPSO) or Genetic Algorithm (GA) is utilized in the feature selection process, while five different classifiers are employed in the final stage of classification. The proposed PCC-BPSO-multi classifier and PCC-GA-multi classifier are applied to eleven microarray datasets and their results are compared with each other.
... (PNETs), 10 non-embryonal brain tumors and 4 normal human cerebella. The initial oligonucleotide microarrays contain 6817 genes. They were pre-processed with thresholding (Dettling and Bühlmann, 2002). Hence, 5597 genes remain for the complete dataset, which has five different sample classes.
The Leukemia cancer dataset was generated from a gene expression study of two types of acute leukemia: Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL). The levels of gene expression were measured using Affymetrix high-density oligonucleotide arrays which consist of 6817 genes, although this was reduced to 3051 genes and further analyzed by Golub et al. (1999). The dataset consists of 25 cases of AML and 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL). The dataset was further pre-processed by Dudoit et al. (2002).
The Lymphoma microarray dataset was obtained from Dettling and Bühlmann (2002). It has 4026 genes and 62 samples. The samples come from 3 different adult lymphoid malignancies: 42 samples represent Diffuse Large B-Cell Lymphoma (DLBCL), 9 Follicular Lymphoma (FL) and 11 Chronic Lymphocytic Leukemia (CLL).
The colon cancer microarray dataset was originally analyzed by Alon et al. (1999). The original authors of the dataset pre-processed the raw data from the Affymetrix oligonucleotide arrays. The dataset consists of normal and tumor tissue samples. The total number of samples is 62 and the total number of genes after the pre-processing performed by the previous authors is 2000.
The prostate cancer dataset consists of 102 patterns of gene expression, where 50 of the samples are normal prostate specimens and the other 52 are tumors. This dataset is based on oligonucleotide microarrays and consists of approximately 12600 genes. After pre-processing, the remaining number of genes in the dataset is 6033 (Díaz-Uriarte and De Andres, 2006).
The Small Round Blue-Cell Tumor (SRBCT) microarray dataset has four different classes and originally had 6567 genes and 63 samples, of which 23 samples are from EWS, 20 from RMS, 12 from NB and 8 from NHL. After pre-processing, the genes are reduced to 2308. This dataset was obtained from Díaz-Uriarte and De Andres (2006).
The rest of the datasets (Breast, CNS, Lung and MLL) were obtained from Zhu et al. (2007). The main characteristics of the datasets are given in Table 1.

Pearson Correlation Coefficient (PCC)
The Pearson correlation coefficient, also known as r, R, or Pearson's r, measures the strength and direction of the linear dependency (correlation) between two features. It is defined as the covariance of the two variables divided by the product of their standard deviations, $r_{XY} = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$ (Benesty et al., 2009). PCC requires all features to be of the same type, hence a discretization pre-processing step is required (Hall, 1999; Huertas and Juárez-Ramírez, 2014). It was developed by Karl Pearson, building on an idea introduced by Francis Galton in 1888 (Stigler, 1989).
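As an illustration of how a PCC filter can rank genes, the following minimal sketch computes Pearson's r between each gene column and a numerically encoded class label and keeps the k genes with the largest absolute correlation. The function names (`pearson_r`, `rank_genes_by_pcc`), the numeric label encoding, the use of |r| as the ranking score and the toy data are assumptions made for illustration; the study does not state how the ranking was implemented.

```python
import math


def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Constant vector: r is undefined; treated as 0 here (an assumption).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_genes_by_pcc(samples, labels, k):
    """Return the indices of the k genes with the largest |r| against the class label."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        column = [row[g] for row in samples]
        scores.append((abs(pearson_r(column, labels)), g))
    # Sort by descending score; ties are broken by gene index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:k]]


# Toy data (hypothetical): 4 samples x 3 genes, two classes encoded as 0 and 1.
samples = [
    [1.0, 5.0, 0.3],
    [2.0, 5.1, 0.9],
    [3.0, 4.9, 0.1],
    [4.0, 5.0, 0.8],
]
labels = [0, 0, 1, 1]
top = rank_genes_by_pcc(samples, labels, k=1)
```

In the toy data the first gene increases steadily with the class label, so it receives the highest score. In the setting described in this paper, k would correspond to the number of genes retained by the filter (for example 100).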

Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) is a stochastic, population-based optimization technique. It was first suggested by Kennedy and Eberhart (1997). The PSO algorithm took its inspiration from the social behavior of bird flocks and fish schools. It is implemented through three simple steps: generating the positions and velocities of the particles, updating their velocities and then updating their positions.
In PSO, individual particles move in the search space and communicate with each other over the iterations in order to search for optimal solutions (Tran et al., 2014). If a search space of D dimensions is assumed, then the ith swarm particle has a D-dimensional position vector $X_i = [x_{i1}, x_{i2}, \ldots, x_{iD}]$ and its velocity is denoted by $V_i = [v_{i1}, v_{i2}, \ldots, v_{iD}]$. The best position visited by the particle, i.e. the one that produced its best fitness value, is $PB_i = [pb_{i1}, pb_{i2}, \ldots, pb_{iD}]$, while the best position explored so far by the swarm is $GB = [gb_1, gb_2, \ldots, gb_D]$. The velocity of each particle is updated by the following equation:

$$v_{id}^{t+1} = w\, v_{id}^{t} + c_1\, \mathrm{rand}_1\, (pb_{id} - x_{id}^{t}) + c_2\, \mathrm{rand}_2\, (gb_{d} - x_{id}^{t}) \qquad (1)$$

where $d = 1, 2, \ldots, D$; $t$ is the iteration index; $c_1$ is the cognitive learning factor; $c_2$ is the social learning factor; and $c_1$ and $c_2$ are positive constants with values ranging from 0 to 4. The inertia weight (w) in Equation 1 gradually reduces the particle velocities and hence keeps the swarm under control. The value of w is usually between 0.4 and 0.9, whereas the random variables $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are uniformly distributed between 0 and 1 (Tran et al., 2014). The velocities of the particles are bounded within $[v_{min}, v_{max}]$; these bounds prevent very sharp movements of the particles in the search space. The position of each particle is updated by:

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1} \qquad (2)$$

where $d = 1, 2, \ldots, D$; $i = 1, 2, \ldots, N$; and N is the size of the swarm.

A modified version of the standard PSO, known as binary PSO (BPSO), was also introduced by Kennedy and Eberhart (1997) in order to handle discrete variables. When BPSO is applied for feature selection, a feature subset is represented by a string vector of n binary bits $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ consisting of '0' and '1'. If $x_{id}$ is '0', the dth feature is not selected in this subset, while if $x_{id}$ is '1' the feature is selected.
In this regard, each binary string vector $X_i$ defines the particle position in BPSO. When BPSO is utilized for feature selection, the genes are represented by a binary vector: a selected gene is denoted by 1, while a non-selected gene is encoded by 0. For instance, a particle with seven features encoded as '0100010' implies that the second and sixth features are selected. Therefore, the length of each particle is initially the same as the number of genes in the dataset. Moreover, in the traditional BPSO each dimension of a particle is updated using the sigmoid rule:

$$x_{id} = \begin{cases} 1, & \text{if } \mathrm{rand} < S(v_{id}) \\ 0, & \text{otherwise} \end{cases}, \qquad S(v_{id}) = \frac{1}{1 + e^{-v_{id}}} \qquad (3)$$

where rand is a random number uniformly distributed between 0 and 1.

The fitness function in BPSO is employed as an evaluator to choose the best feature subsets. The particles (gene subsets) that give the best fitness values are recorded so that the best solution found in the population is retained. Consequently, the subset of genes which provides the best accuracy can be recalled. This process is applied in 10-fold cross validation, such that the whole training set is used in the determination of the best genes. The inclusion of each gene in the best set is based on how many of the folds selected that gene. Here, the maximum repeatability number is set to 10, so that a small number of genes associated with high accuracy are most likely to enter the selected set of genes.
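The update rules described above can be sketched for a single particle as follows. This is a minimal illustration, not the implementation used in the study: the parameter values (`w=0.7`, `c1=c2=2.0`, `v_max=4.0`), the symmetric velocity bound [-v_max, v_max] and the function names are assumptions chosen within the ranges quoted in the text.

```python
import math
import random


def sigmoid(v):
    """Map a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(position, velocity, personal_best, global_best,
                    w=0.7, c1=2.0, c2=2.0, v_max=4.0, rng=random):
    """One BPSO step for one particle; returns (new_position, new_velocity)."""
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        # Velocity update: inertia + cognitive term + social term.
        v_new = (w * v
                 + c1 * rng.random() * (pb - x)
                 + c2 * rng.random() * (gb - x))
        # Clamp the velocity to avoid very sharp moves (assumed symmetric bounds).
        v_new = max(-v_max, min(v_max, v_new))
        # Binary position update: the bit is set to 1 with probability sigmoid(v_new).
        x_new = 1 if rng.random() < sigmoid(v_new) else 0
        new_velocity.append(v_new)
        new_position.append(x_new)
    return new_position, new_velocity


def selected_genes(position):
    """Decode a binary particle into the 1-based indices of the selected features."""
    return [d + 1 for d, bit in enumerate(position) if bit == 1]


rng = random.Random(0)
position = [0, 1, 0, 0, 0, 1, 0]  # the '0100010' example from the text
velocity = [0.0] * 7
new_position, new_velocity = update_particle(
    position, velocity, position, position, rng=rng)
```

Decoding the example particle with `selected_genes(position)` gives the second and sixth features, matching the '0100010' example. In a full wrapper, the personal and global bests would be refreshed after every step by evaluating the fitness of each new position.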

Genetic Algorithm
Genetic Algorithm (GA) is a metaheuristic inspired by the process of natural selection that belongs to the larger class of Evolutionary Algorithms (EA). It first generates a random initial population. The individual chromosomes are then evaluated by a fitness function. A detailed description of GA can be found in Goldberg and Holland (1988). In this technique, the GA operators, which include selection, crossover and mutation, are used to search for the best solutions. The selection operator chooses the chromosomes with high fitness from the current population. The crossover operator combines two chromosomes, thereby generating two new chromosomes known as offspring. The mutation operator modifies the value of one or more genes in a chromosome from its initial state. This process is repeated until a satisfactory fitness is obtained or the last generation is reached. During the evaluation step, a fitness function is utilized to estimate the quality of each chromosome. A binary coding system is used to represent the chromosome, in which each chromosome bit denotes a gene mask. A bit value of '1' implies that the gene is chosen, while '0' indicates that the gene is discarded. In this way, the genes with value '1' are selected and combined as a subset of candidate genes. In this work, the fitness of each chromosome (gene subset) is evaluated by the classification accuracy of SVM. The 10-fold cross validation (10-CV) classification accuracy is computed with the gene subset on the training samples. The higher the 10-CV classification accuracy, the better the gene subset. Ultimately, the gene subset with the highest 10-CV classification accuracy is considered the optimal gene subset.
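The GA loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: tournament selection, one-point crossover and bit-flip mutation are common operator choices but the study does not say which variants were used, and the population size, rates and the `toy_fitness` function are hypothetical. In the study the fitness is the 10-CV accuracy of SVM on the selected genes; here a toy function stands in for it so that the sketch stays self-contained.

```python
import random


def tournament_select(population, scores, rng, size=2):
    """Pick the fittest of `size` randomly drawn chromosomes."""
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: scores[i])
    return population[best]


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a random cut point, giving two offspring."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def bit_flip_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Return the best chromosome (gene mask) found among the evaluated generations."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        for chrom, score in zip(population, scores):
            if score > best_score:
                best, best_score = chrom[:], score
        next_population = []
        while len(next_population) < pop_size:
            a = tournament_select(population, scores, rng)
            b = tournament_select(population, scores, rng)
            if rng.random() < crossover_rate:
                a, b = one_point_crossover(a, b, rng)
            next_population.append(bit_flip_mutation(a, mutation_rate, rng))
            next_population.append(bit_flip_mutation(b, mutation_rate, rng))
        population = next_population[:pop_size]
    return best, best_score


def toy_fitness(mask):
    """Hypothetical stand-in for classifier accuracy: reward genes 1 and 5, penalize subset size."""
    informative = {1, 5}
    hits = sum(1 for g in informative if mask[g] == 1)
    return hits - 0.01 * sum(mask)


best_mask, best_score = run_ga(toy_fitness, n_genes=7)
```

Because the fitness is passed in as a function, the same loop could be driven by a cross-validated classifier accuracy, which is how the wrapper is described in this work.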

Classifiers
In this study, a group of well-known classifiers is applied. Several classifiers are used because no single algorithm works perfectly for all datasets and not all algorithms behave in the same way on a given dataset. The applied classifiers are Bayes Net (BN), K-Nearest Neighbor (KNN), Naïve Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The accuracy of all classifiers is measured using 10-fold cross validation. This is to make sure that every sample of each dataset participates in both the training and testing processes. Figure 1 shows the complete methodology that was carried out to implement the current work, while the detailed description of the experimental procedures is given below:

Experimental Design
• In the first step of analysis, the datasets were filtered using the Pearson Correlation Coefficient (PCC) method in 10 runs. This is to ensure that the whole dataset is passed through this phase and the reduction result is accurate enough at this stage. Different thresholds were tested for the number of filtered genes. This was done both manually, by setting the number of genes to 100 or 200, and automatically by the method itself based on the most attributed gene. The set of 100 genes was retained because the accuracy and performance of the classifiers were better than with the 200 selected genes
• After the data filtration, the reduced datasets were tested against the applied classifiers. This was done in order to compare the performance on the filtered datasets with that before filtration
• The reduced/filtered datasets were further refined by another step of feature selection. BPSO and GA were each used as a hybrid method with the different classifiers, in which the fitness function was derived from the classification algorithms. This was performed in 10-fold cross validation in order to confirm that the whole dataset is used in the training and testing phases. After the application of this step, the datasets were further reduced and tested by the same classifiers
• The classification results were compared with each other as well as with the results of the previous step. It is fair to mention that the same fitness function was used for both the BPSO and GA algorithms
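The wrapper fitness used in the third step, a cross-validated classification accuracy computed on the genes selected by a binary mask, can be sketched as follows. This is an illustrative assumption rather than the study's implementation: a 1-nearest-neighbor classifier stands in for the five classifiers, and the function names, fold assignment and toy data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, query):
    """1-NN with squared Euclidean distance (a stand-in classifier)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]


def cross_val_accuracy(samples, labels, gene_mask, k=10, seed=0):
    """Fraction of samples classified correctly when held out, using only masked-in genes."""
    keep = [g for g, bit in enumerate(gene_mask) if bit == 1]
    if not keep:
        return 0.0  # an empty gene subset cannot classify anything
    reduced = [[row[g] for g in keep] for row in samples]
    correct = 0
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        train_x = [reduced[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for i in test_idx:
            if nearest_neighbor_predict(train_x, train_y, reduced[i]) == labels[i]:
                correct += 1
    return correct / len(samples)


# Toy data (hypothetical): gene 0 separates the two classes, gene 1 is constant.
samples = ([[i * 0.1, 1.0] for i in range(10)]
           + [[10 + i * 0.1, 1.0] for i in range(10)])
labels = [0] * 10 + [1] * 10
accuracy_informative = cross_val_accuracy(samples, labels, gene_mask=[1, 0])
accuracy_empty = cross_val_accuracy(samples, labels, gene_mask=[0, 0])
```

Every sample is held out exactly once, which is the sense in which the whole dataset takes part in both training and testing. A function of this shape can serve as the shared fitness for both BPSO and GA, with the mask being the particle position or the chromosome.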

Results and Discussion
In the first stage of analysis, the accuracy of the classifiers applied to the original datasets was evaluated. For each classifier, 10-fold cross validation was used to form the training and testing partitions. Table 2 presents the results that were obtained in this stage. It can be seen that Support Vector Machine achieved the highest classification accuracy among all the classifiers.
The features of the datasets were ranked using the PCC filter feature selection method with 10-fold cross validation, and the features were ordered based on the ranking results. To avoid overfitting in the subsequent (wrapper) feature selection step, possibly due to the low number of samples, the first 100 important attributes were selected. Subsequently, these features were used by the five classifiers. The performance of the classifiers is tabulated in Table 3. Results in bold indicate the best performing classifier for each specific dataset. The methods highlighted in grey show the best approach for each dataset. The dashed cells indicate that the method is not appropriate for application. The results show that the classifiers generally achieved better accuracy on the filtered datasets than when applied directly to the original datasets. However, there are some cases with a few classifiers in which the accuracy on the original dataset is better. It was noticed that the Bayes Net classifier did not work on some of the original datasets, while it showed no problem on the filtered datasets. This is because those datasets have some properties that Bayes Net is unable to handle. This shows one of the differences between our proposed method and others. In other works, only one classifier is applied, while in the current work multiple classifiers are utilized to show the quality of each of them and to follow the rule that no classifier is best for all datasets and not all classifiers are best for the same dataset. Moreover, in our work, 11 different high-dimensional datasets are tested with the method, which demonstrates the applicability and viability of the proposed method.
In the next step, wrapper feature selection was applied to all eleven datasets. This was applied to the reduced datasets with the 100 attributes selected by the PCC filter method, so the overall procedure is considered a hybrid method (BPSO-Classifier and GA-Classifier). In this step, all the datasets are further reduced by BPSO-Classifier and GA-Classifier, and this is repeated for all classifiers because a wrapper is classifier dependent. It is not ideal to apply a classifier to a reduced dataset whose features were selected using another classifier. Taking this into account, feature selection was done using each classifier separately.
After the feature selection by BPSO-Classifier in this phase, the datasets are further reduced based on the selected genes. Table 4 illustrates the better performance of the hybrid feature selection method (BPSO-Classifier) on the reduced high-dimensional datasets. It is noticeable that the accuracy of all classifiers is improved compared with their accuracy on the filtered datasets, as shown in Table 3. This indicates that feature selection by BPSO not only improved the efficiency of the classification process but also enhanced its accuracy.
To see how GA works as a feature selection method, it was also applied to the same filtered datasets using the same fitness function as used for BPSO. The datasets are reduced based on the features (genes) selected by each GA-Classifier. Again, each classifier is used separately with GA as the feature selection method. Then, the classifiers are applied to the reduced datasets to see the effect of this phase. It is clear that the accuracy of the classifiers is improved compared with that on the filtered datasets, as shown in Table 5.
Here, BPSO was generally better than GA in terms of the accuracy of the classifiers after the selection process, as illustrated in Table 6. This is also in agreement with previously reported results showing that PSO can outperform GA when it comes to feature selection (Hameed et al., 2017; Hassan et al., 2005). Bold classification accuracies indicate better performance for the same classifier and the same dataset but a different selection method. The grey highlighted method shows the winner, i.e. the best selection approach.
Furthermore, the number of genes selected by each method is compared. It is worth mentioning that in this study more attention is given to achieving high accuracy than to achieving the fewest genes. The number of selected genes is tabulated in Table 7. From the table, it can be seen that BPSO has generally selected fewer genes than GA. Moreover, BPSO was seen to run faster than GA. The final datasets generated by BPSO and GA are illustrated as scatter plots of two representative random genes for the Leukemia dataset in Fig. 2 and 3, respectively. For further demonstration, Andrews plots are produced for all genes selected by BPSO and GA, as shown in Fig. 4 and 5. This analysis is performed for the Breast dataset, the worst-performing dataset among them, to show the quality of the applied methods even in the worst case. The scatter plots for two representative genes of the final Breast dataset, selected by BPSO and GA, are illustrated in Fig. 6 and 7, respectively. It was concluded that the performance of the proposed method, in terms of accuracy and efficiency, is better than that of other methods reported in the literature (Dash, 2018; Gonzalez-Navarro and Belanche-Muñoz, 2014).

Conclusion
High-dimensional datasets such as gene expression datasets are characterized by a large number of genes (features) and a small number of samples, which means that they need special and careful analysis. Bio-inspired and evolutionary algorithms such as BPSO and GA are widely used in different forms in the field of machine learning and data mining. In this study, these two methods were successfully applied in a hybrid wrapper form after the application of filter feature selection. The proposed method was composed of a three-phase hybrid form of filter-wrappers-multi classifiers, in which the Pearson Correlation Coefficient (PCC) in combination with the binary form of PSO (BPSO) or Genetic Algorithm (GA) was utilized in the feature selection process, while five different classifiers were employed in the final stage of classification. It was noticed that filter feature selection has a remarkable impact on the classification accuracy. This positive impact increased further when the filtered datasets were reduced by BPSO or GA with the different classifiers. Their performances were then compared in terms of accuracy and number of selected genes. In addition to BPSO running faster than GA, BPSO showed better performance than GA.

Ethics
There are no ethical issues that may arise after the publication of this manuscript.