A COMPARATIVE STUDY OF COMBINED FEATURE SELECTION METHODS FOR ARABIC TEXT CLASSIFICATION

Text classification is an important task because of the huge volume of electronic documents. One of its main problems is the high dimensionality of the feature space. Researchers have proposed many algorithms to select relevant features from text. These algorithms have been studied extensively for English text, while studies for Arabic are still limited. This study investigates the performance of five widely used feature selection methods, namely Chi-square, Correlation, GSS Coefficient, Information Gain and Relief F. In addition, it introduces an approach for combining feature selection methods based on the average weight of the features. The experiments are conducted using Naïve Bayes and Support Vector Machine classifiers to classify a published Arabic corpus. The results show that, among the individual methods, the best results were obtained with Information Gain. The results also show that the combination of multiple feature selection methods outperforms the best results obtained by the individual methods.


INTRODUCTION
With the rapid growth of the Internet, the volume of news and information available on the web is growing exponentially. This explosion of information makes analyzing and processing documents manually a very difficult task. As a consequence, text classification has gained importance in the hierarchical organization of these documents. The fundamental goal of text classification is to assign texts to appropriate classes.
One of the problems of text classification is the huge number of features, which reduces classification performance and increases processing time. Feature selection is used to reduce the feature space by selecting the most relevant features (Maldonado and L'Huillier, 2013). Many feature selection methods have been proposed and investigated to improve the performance of English text classification. However, work on feature selection for Arabic is limited, and most studies of Arabic text classification investigate the efficiency of classification algorithms without paying enough attention to how feature selection can improve classification accuracy (Al-Salemi and Ab Aziz, 2010; Hawashin et al., 2013; Saad, 2011).
Our motivation is to enhance the robustness of the finally selected feature subsets of each class and to get rid of noisy and redundant features, that is, features for which another subset supplies the same information about the class. We therefore combine two or more methods to remove the redundant and noisy features that degrade the performance of most classifiers.
This study investigates the performance of five widely used feature selection methods, namely Chi-square, Correlation, GSS Coefficient, Information Gain and Relief F, and of an approach that combines them. The main concern is to investigate the effect of combining the individual feature selection methods on the performance of Arabic text classification.
The rest of the paper is organized as follows: section two reviews related work, section three describes the methodology, section four describes the experimental work, section five gives the results and discussion of the experiments and section six concludes the paper.

RELATED WORK
Many feature selection and other preprocessing techniques have been applied to text classification. The bulk of feature selection work has been devoted to English and other Latin languages. Soares (2010) proposed an algorithm based on a wrapper method to build an ensemble of models with a specific base classifier. The Class-Specific Ensemble Feature Selection (CEFS) algorithm was applied in that work with a Naïve Bayes classifier, and the results showed an improvement in prediction accuracy. Ren and Sohrab (2013) introduced class-indexing-based term-weighting approaches, in which the proposed class-based indexing is incorporated with term, document and class indexes. They compared the proposed class-indexing-based approaches with other term-weighting approaches on the automatic text classification task. Their experiments revealed that the proposed term-weighting approaches improved classification. Chen et al. (2009)
Unlike for English, only a limited amount of research has been done for Arabic (Al-Salemi and Ab Aziz, 2010; Chantar and Corne, 2011; Hawashin et al., 2013).
Harrag et al. (2010) investigated three representation methods, namely Document Frequency (DF), Latent Semantic Analysis (LSA) and Term Frequency-Inverse Document Frequency (TFIDF). Experiments on an Arabic dataset showed that TFIDF was the most effective of the three feature reduction techniques. Duwairi et al. (2007) compared two feature selection techniques applied to an Arabic corpus. The dataset consisted of manually prepared Arabic text documents collected from internet sites. They employed stemming and light stemming as feature selection methods, and their experiments showed that light stemming obtained better results than stemming.
Generally speaking, the work on feature selection for Arabic has used individual methods only, whereas a combination of feature selection methods may achieve better results.

METHODOLOGY
Feature selection is an important preprocessing stage of text classification that can increase the performance of a predictive model. Its main purpose is to choose a subset of highly discriminative features and eliminate the non-discriminative ones.
In this study, we investigate the performance of five common feature selection methods and their combinations for Arabic text classification. We combine every pair of feature selection methods, and we also combine all five. In both cases, two classifiers, Naïve Bayes and Support Vector Machine, are used to conduct the experiments. In the literature, studies that combine feature selection methods using different strategies combine either two or five methods (Wang et al., 2010; Vege, 2012).
The key idea behind combining feature selection methods is that every individual method produces different types of errors, so methods are combined to exploit their strengths. Combining feature selection methods is becoming more popular because it allows one to overcome the weaknesses of single methods. Omar et al. (2013) report that combined feature selection methods outperform the best of their individual members in the text classification task.
The following subsections briefly describe the classifiers and the feature selection methods used in this study, as well as the approach used to combine the feature selection methods.

Naïve Bayes Classifier
Naïve Bayes has long been one of the most popular machine learning methods. Its simplicity makes the framework attractive for different tasks, and it achieves reasonable performance even though it is based on an unrealistic independence assumption (Khalifa and Omar, 2014). The Naïve Bayes (NB) classifier generally uses Bayes' rule:

$$p(c_i \mid d) = \frac{p(c_i)\, p(d \mid c_i)}{p(d)}, \qquad p(c_i) = \frac{N_i}{N}$$

where $N_i$ is the number of documents assigned to class $c_i$, $N$ is the total number of documents, $p(d \mid c_i)$ is the probability of a document $d$ given a class $c_i$ and $p(d)$ is the probability of document $d$.
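As an illustration of how Bayes' rule is typically turned into a classifier, the following is a minimal sketch rather than the implementation used in the experiments. It assumes a multinomial word model with Laplace smoothing, which the text does not specify; the function names `train_nb` and `predict_nb` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log word likelihoods."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)            # N_i: documents per class
    word_counts = {c: Counter() for c in class_docs}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    n = len(docs)                           # N: total number of documents
    log_prior = {c: math.log(class_docs[c] / n) for c in class_docs}
    log_like = {}
    for c in class_docs:
        # Assumption: add-one smoothing over the training vocabulary.
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total)
                       for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class that maximizes log p(c) + sum of log p(w | c)."""
    scores = {}
    for c in log_prior:
        # p(d) is identical for every class, so it is dropped.
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

# Hypothetical toy corpus of tokenized documents.
docs = [["goal", "match"], ["goal", "team"], ["bank", "market"]]
labels = ["sport", "sport", "business"]
model = train_nb(docs, labels)
prediction = predict_nb(["goal", "team", "market"], *model)
```

Working in log space avoids numerical underflow when documents contain many words.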

SVM Classifier
Support Vector Machine (SVM) is considered one of the most robust and accurate machine learning algorithms (Ahmed, 2010). In simple terms, given a set of training examples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that predicts which category a new example falls into. The method of SVM in its dual form is as follows:
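A standard statement of the soft-margin dual problem is given here as an assumption, since several equivalent notations exist and the symbols below are not taken from the text:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n}\alpha_i y_i = 0$$

Here $x_i$ are the training vectors, $y_i \in \{-1, +1\}$ their class labels, $\alpha_i$ the Lagrange multipliers, $K$ the kernel function and $C$ the regularization parameter. A new example $x$ is then assigned to a class by the sign of $\sum_{i} \alpha_i y_i K(x_i, x) + b$.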

Chi-Square
Chi-square is a commonly used statistical test that measures the divergence from the distribution expected if one assumes that feature occurrence is independent of the class value. As a statistical test, it is known to behave erratically for very small expected counts, which are common in text classification both because word features occur rarely and sometimes because there are few positive training examples for a concept (Forman, 2008). The chi-square statistic is calculated by the following equation.
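One widely used form of the statistic for a term t and a class c, given here as a sketch and not necessarily in the notation of the original equation, is χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)). A is the number of documents in c that contain t, B the number outside c that contain t, C the number in c without t, D the number outside c without t and N = A + B + C + D. The function name and the counts below are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A zero margin means the term or the class never varies.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator

# Hypothetical counts: the term occurs in 40 of 50 in-class documents
# and in 10 of 150 out-of-class documents.
score = chi_square(a=40, b=10, c=10, d=140)

# A term spread evenly over both classes carries no signal.
independent = chi_square(a=25, b=25, c=25, d=25)
```

Features are then ranked by this score, and the small-count instability mentioned above shows up when any of the four margins is close to zero.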

Correlation
Correlation-based Feature Selection (CFS) is a commonly used technique that evaluates and ranks the relevance of features by measuring the correlation between features and classes and between the features themselves (Suganya and Rajaram, 2012).
Given k features and classes C, CFS defines the relevance of a feature subset using Pearson's correlation:

$$\mathrm{Merit}_S = \frac{k\,\bar{r}_{kc}}{\sqrt{k + k(k-1)\,\bar{r}_{kk}}}$$

where $\mathrm{Merit}_S$ is the relevance of the feature subset $S$, $\bar{r}_{kc}$ is the average linear correlation coefficient between the features and the classes and $\bar{r}_{kk}$ is the average linear correlation coefficient between different features.
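To make the trade-off in the merit concrete, the sketch below evaluates the usual CFS merit, k times the average feature-class correlation divided by the square root of k + k(k − 1) times the average feature-feature correlation. The function name and the correlation values are hypothetical, and non-negative average correlations are assumed.

```python
import math

def cfs_merit(k, r_kc, r_kk):
    """Merit of a subset of k features.

    r_kc: average feature-class correlation
    r_kk: average feature-feature correlation (assumed non-negative)
    """
    return k * r_kc / math.sqrt(k + k * (k - 1) * r_kk)

# Two hypothetical subsets with the same relevance to the class
# but different amounts of redundancy among their features.
merit_redundant = cfs_merit(k=10, r_kc=0.5, r_kk=0.8)
merit_diverse = cfs_merit(k=10, r_kc=0.5, r_kk=0.1)
```

With equal feature-class correlation, the subset whose features are less correlated with one another receives the higher merit, which is how the measure penalizes redundancy.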

Galavotti-Sebastiani-Simi (GSS) Coefficient
The GSS coefficient was proposed as a simplified Chi-square statistic. The √N factor and the denominator are removed completely; the denominator is removed because it gives a high correlation coefficient score to rare words and rare categories (Uchyigit and Ma, 2008). The GSS coefficient can be computed as follows:
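As a sketch, the coefficient is usually stated as GSS(t, c) = P(t, c)P(t̄, c̄) − P(t, c̄)P(t̄, c), where t̄ and c̄ denote the absence of the term and documents outside the class. The code below estimates the joint probabilities from document counts; the function name and the counts are hypothetical.

```python
def gss_coefficient(a, b, c, d):
    """GSS coefficient of a term for a class from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    p_t_c = a / n          # P(t, c)
    p_t_notc = b / n       # P(t, not c)
    p_nott_c = c / n       # P(not t, c)
    p_nott_notc = d / n    # P(not t, not c)
    return p_t_c * p_nott_notc - p_t_notc * p_nott_c

# Hypothetical counts for one term and one class.
score = gss_coefficient(a=40, b=10, c=10, d=140)
```

Unlike Chi-square, the score keeps its sign, so a term that indicates the absence of a class receives a negative value.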

Information Gain
Information gain represents the entropy reduction given a certain feature, that is, the number of bits of information gained about the category by knowing the presence or absence of a term in a document (Ramalingam and Zheng, 2013):

$$IG(t) = -\sum_{i} p(c_i)\log p(c_i) + p(t)\sum_{i} p(c_i \mid t)\log p(c_i \mid t) + p(\bar{t})\sum_{i} p(c_i \mid \bar{t})\log p(c_i \mid \bar{t})$$

where $p(c_i)$ represents the likelihood of the occurrence of class $c_i$, $p(t)$ represents the likelihood of the occurrence of term $t$ and $p(\bar{t})$ represents the likelihood of the non-occurrence of $t$.
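The entropy reduction can be computed directly from per-class document counts. The sketch below assumes the usual form of information gain, class entropy minus the entropy that remains once the presence or absence of the term is known, with base-2 logarithms so that the result is in bits. The function name and the counts are hypothetical.

```python
import math

def information_gain(docs_with_term, docs_without_term):
    """Information gain (in bits) of a term from per-class document counts.

    docs_with_term[i]: documents of class i that contain the term
    docs_without_term[i]: documents of class i that do not contain it
    """
    n_t = sum(docs_with_term)
    n_not_t = sum(docs_without_term)
    n = n_t + n_not_t
    if n == 0:
        return 0.0

    def plogp_sum(counts, total):
        # Sum of p * log2(p) over classes, treating 0 * log(0) as 0.
        return sum((c / total) * math.log2(c / total)
                   for c in counts if c > 0 and total > 0)

    class_counts = [w + wo
                    for w, wo in zip(docs_with_term, docs_without_term)]
    gain = -plogp_sum(class_counts, n)                       # class entropy
    gain += (n_t / n) * plogp_sum(docs_with_term, n_t)       # term present
    gain += (n_not_t / n) * plogp_sum(docs_without_term, n_not_t)  # absent
    return gain

# A term that perfectly separates two equally sized classes.
perfect = information_gain([10, 0], [0, 10])
# A term that is equally likely in both classes.
useless = information_gain([5, 5], [5, 5])
```

With two balanced classes the class entropy is one bit, so a perfectly separating term gains the full bit and an evenly spread term gains nothing.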

Relief
Relief-F (Bolón-Canedo et al., 2013; Zhang and Sawchuk, 2011) is a commonly used metric for feature ranking that estimates the relevance of features according to how well their values distinguish a sampled instance from its nearest hit (the nearest instance of the same class) and its nearest miss (the nearest instance of a different class). The Relief feature selection algorithm samples instances randomly from the training data. For each sampled instance, the nearest hit and the nearest miss are found. A high weight is assigned to a feature if it differentiates between instances from different classes and has the same value for instances of the same class. Specifically, it tries to find a good estimate of the following probabilities to assign as the weight for each term feature f (Sharma and Dey, 2012):
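In the form in which Relief is usually presented, stated here as an assumption about the intended expression, the weight of a feature f approximates a difference of two probabilities:

$$W_f = P(\text{different value of } f \mid \text{nearest instance from a different class}) - P(\text{different value of } f \mid \text{nearest instance from the same class})$$

A feature that separates a sampled instance from its nearest miss but agrees with its nearest hit therefore receives a weight close to 1, while a feature that behaves the same way in both cases receives a weight close to 0.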

COMBINATION OF FEATURE SELECTION METHODS
The main reason for combining feature selection methods is to compensate for the shortcomings of individual methods. Combining methods is a common technique in machine learning. The method used in this study combines the feature selection techniques Chi-square, Information Gain, Relief, Correlation and GSS Coefficient. First, it combines the top-ranked n features produced by the k feature selection methods. Second, it calculates the average weight of every term over the feature selection methods using the following formula:

$$\bar{w}(t) = \frac{1}{k}\sum_{j=1}^{k} w_j(t)$$

where k is the number of ranking lists and $w_j(t)$ is the weight assigned to term $t$ by the j-th feature selection method. Then, the method sorts the features according to the new weights and selects the top m ranked features to form the final list of features. Figure 1 shows the steps of the combination method.
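The combination steps can be sketched as follows. This is an illustration under two assumptions that the text does not state: the weights of the different methods are already on a comparable scale, and a term missing from a method's top-n list contributes a weight of 0 for that method. The function name `combine_rankings` and the toy weights are hypothetical.

```python
from collections import defaultdict

def combine_rankings(rankings, n, m):
    """Average per-term weights over k ranking lists and keep the top m.

    rankings: list of dicts mapping term -> weight, one dict per method
    n: number of top-ranked terms taken from each method
    m: number of terms kept in the final list
    """
    k = len(rankings)
    totals = defaultdict(float)
    for weights in rankings:
        # Step 1: keep only the top-n terms of this method.
        top_n = sorted(weights.items(),
                       key=lambda kv: kv[1], reverse=True)[:n]
        for term, w in top_n:
            totals[term] += w
    # Step 2: average over all k methods (absent terms count as 0).
    averaged = {term: total / k for term, total in totals.items()}
    # Step 3: sort by the new weight (ties broken alphabetically).
    ranked = sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:m]

# Hypothetical weights from two methods, assumed to lie in [0, 1].
chi2 = {"goal": 0.9, "bank": 0.4, "match": 0.7}
ig = {"goal": 0.8, "bank": 0.6, "vote": 0.5}
final = combine_rankings([chi2, ig], n=2, m=2)
```

Because raw scores such as Chi-square and Information Gain have different ranges, some rescaling before averaging, for example min-max normalization per method, would likely be needed in practice.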

EXPERIMENTAL WORK
Three types of experiments were performed to investigate the performance of the five widely used feature selection methods (Chi-square, Correlation, GSS Coefficient, Information Gain and Relief F) and their combinations. Each type of experiment is conducted using two classifiers, Naïve Bayes and Support Vector Machine. Two types of combination are performed: the first combines the ranking lists resulting from two feature selection methods, and the second combines the ranking lists resulting from all five methods used in this study. The evaluation is performed on the published CNN Arabic dataset. The dataset and the performance measures used in this study are described briefly below.

CNN Arabic Corpus
This study uses the CNN Arabic corpus collected by Saad (2011). The dataset contains 5,070 text documents, each of which belongs to one of six classes, as shown in Table 1.

Performance Measures
To evaluate the feature selection methods, the F1-measure, which combines precision and recall, is used. For ease of comparison, the macro-averaged F1 (Macro-F1) is reported. Precision, recall, the F1 measure and Macro-F1 are calculated using the following formulas, respectively:
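In their standard form, precision is TP / (TP + FP), recall is TP / (TP + FN), F1 is 2 · precision · recall / (precision + recall), and Macro-F1 is the unweighted mean of the per-class F1 values. TP, FP and FN are the true positives, false positives and false negatives of a class. The sketch below computes them from label lists; the function name and the example labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Return (per-class F1 dict, macro-averaged F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_per_class = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Undefined ratios (empty denominators) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        f1_per_class[c] = f1
    # Macro average: every class counts equally, whatever its size.
    return f1_per_class, sum(f1_per_class.values()) / len(classes)

# Hypothetical gold and predicted labels for four documents.
y_true = ["sport", "sport", "business", "business"]
y_pred = ["sport", "business", "business", "business"]
per_class, macro = macro_f1(y_true, y_pred)
```

Macro averaging gives small classes the same influence as large ones, which matters when the class sizes of a corpus are unbalanced.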

RESULTS
To compare the performance of the feature selection methods described above and to investigate the performance of the combination method, Naïve Bayes and SVM classifiers are used. After applying each feature selection method, classification is performed with a varying number of selected features. The experiments were carried out on a published Arabic corpus, the CNN dataset. Table 2 shows the results obtained when each feature selection method is used individually with the Naïve Bayes classifier. The best result is 87.3% Macro-F1, obtained by Information Gain with 3000 features. The lowest result is 66.9% Macro-F1, obtained by the Correlation method with 500 selected features. Table 3 shows the results obtained when each feature selection method is used individually with the SVM classifier. The best result is 93.1% Macro-F1, obtained by Information Gain with 4000 features. The lowest result is 71.4% Macro-F1, obtained by the Correlation method with 500 selected features.
The results in Tables 2 and 3 show that the best performance among the feature selection methods is obtained by Information Gain. The lowest performance is achieved by the Correlation method in most cases, because it depends on the correlation between features and may not take into account the correlation with the class. Table 4 shows the results obtained when using all possible combinations of two feature selection methods with the Naïve Bayes classifier. Table 5 shows the results obtained when using all possible combinations of two feature selection methods with the SVM classifier.
The results in Tables 4 and 5 show that the binary combinations of feature selection methods outperform the results in Tables 2 and 3, respectively, which were obtained using individual methods. Table 6 shows the results obtained when combining the ranking lists from the five feature selection methods with the Naïve Bayes classifier, and Table 7 shows the corresponding results with the SVM classifier. Tables 6 and 7 show that the Macro-F1 results of the combination of multiple feature selection methods, namely Chi-square, Information Gain, Relief, Correlation and GSS Coefficient, outperform the results obtained by using every method individually.

DISCUSSION
The performance of the combined feature selection methods in both combination scenarios, binary and multiple, indicates that combining feature selection methods is indeed informative for text classification tasks, especially when the number of features is large (1000 and above). Tables 6 and 7 show that performance became more stable, especially as the number of selected features increased, when the combination of feature selection methods was applied. The average improvement after applying binary and multiple combination ranges between 1% and 2.5% in macro-averaged F1.
The results in Tables 4 to 7 show that the combination of feature selection methods performs better than individual methods, since the success of each method depends on various variables. A combination of different feature selection methods is more likely to perform effectively in text classification because it allows one to overcome the weaknesses of single approaches. Omar et al. (2013) report that combined methods outperform the best of their individual members in the feature selection task. Previous studies (Harrag et al., 2010; Omar et al., 2013; Soares, 2010; Vege, 2012) pointed out that using combined methods improves results, despite the different combination techniques and the different purposes for which combined methods were used, which agrees with our findings.
In this study, the focus was on combining feature selection methods to select a subset of features that is most representative of the class. An interesting opportunity for future research is to investigate combining features from different representation techniques (e.g., bigrams, trigrams or n-grams) in order to propose a new method that selects the most discriminative features of the class, which could be more useful for text classification.

CONCLUSION
This study introduced a combination of feature selection methods to improve the performance of text classification. First, we evaluated the performance of five common feature selection methods on a published Arabic dataset. Then, we evaluated all possible binary combinations of these five methods. Finally, we evaluated the combination of all five methods in order to determine the most appropriate features for classification. Comparing the individual methods with the combination methods shows that combining two feature selection methods outperforms the individual methods, while combining the five methods significantly improves the classification performance. Although many feature selection methods exist in text categorization, it is hard to state that one is generally superior to the others, since the success of each method depends on various variables. Combining different feature selection methods is more likely to yield effective performance in text categorization.