Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System

This paper implements a Support Vector Machines (SVMs) based text classification system for Arabic language articles. The classifier uses the CHI square method for feature selection in the pre-processing step of the text classification system design procedure. Compared to other classification methods, our system shows high classification effectiveness on an Arabic data set in terms of F-measure (F = 88.11).


INTRODUCTION
Text Classification (TC) is the task of assigning texts to one of a set of predefined categories based on their contents [1]. It is also referred to as text categorization, document categorization, document classification or topic spotting. It is one of the important research problems in information retrieval (IR), data mining, and natural language processing.
TC has many applications that are becoming increasingly important, such as document indexing, document organization, text filtering, word sense disambiguation and hierarchical categorization of web pages.
TC research has received much attention [2]. It can be studied as a binary classification problem, in which a binary classifier is designed for each category of interest. Many TC training algorithms have been reported for binary classification, e.g. the Naïve Bayesian method [3], k-nearest neighbours (kNN) [3], and support vector machines (SVM) [4,5]. On the other hand, it has also been studied as a multi-class classification problem, e.g. boosting [6] and multiclass SVM [7].
In this paper, we restrict our study of TC to binary classification methods, and in particular to the Support Vector Machines (SVM) classification method for Arabic language text.
TC Procedure:
The TC system design usually comprises three phases: data pre-processing, text classification and performance measures. The data pre-processing phase makes the text documents compact and suitable for training the text classifier. The text classifier, the core TC learning algorithm, is then constructed, trained and tuned using the compact form of the Arabic dataset. Next, the text classifier is evaluated by some performance measures. The TC system can then perform document classification. The following sections are devoted to these three phases.

Data Pre-processing:
Arabic Data set:
Since there is no publicly available Arabic TC corpus to test the proposed classifier, we have used an in-house corpus collected from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-hayat, Al-Ahram, and Al-Dostor, as well as a few other specialized websites. The collected corpus contains 1445 documents that vary in length. These documents fall into nine classification categories (Table 1) that vary in the number of documents.
In this Arabic dataset, each document was saved in a separate file within the corresponding category's directory, i.e. the documents in this dataset are single-labelled.
Representing Arabic dataset Documents:
As mentioned before, this representation step aims to transform the Arabic text documents into a form that is suitable for the classification algorithm. In this phase, we have followed [8,9] and [10] and processed the Arabic documents according to the following steps:
1. Each article in the Arabic data set is processed to remove the digits and punctuation marks.
2. We have followed [11] in the normalization of some Arabic letters, such as the normalization of (hamza) in all its forms to (alef).
3. All non-Arabic text was filtered out.
4. Arabic function words were removed. The Arabic function words (stop words) are words that are not useful in IR systems, e.g. the Arabic prefixes, pronouns and prepositions.
5. Infrequent terms removal: we have ignored those terms that occur fewer than 4 times in the training data.
The vector space representation [12] is used to represent the Arabic documents.
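The five pre-processing steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of hamza forms mapped to alef, the Unicode range used to recognise Arabic letters, the three-word stop list and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed hamza-carrying alef forms (U+0623, U+0625, U+0622) mapped to bare alef (U+0627).
HAMZA_FORMS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Runs of basic Arabic letters; digits, punctuation and non-Arabic text never match.
ARABIC_TOKEN = re.compile(r"[\u0621-\u064A]+")
# Hypothetical three-word stop list (fi, min, ala), stored in normalized form.
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649"}


def tokenize(text, stop_words=STOP_WORDS):
    """Steps 1-4: normalize hamza forms, keep Arabic-letter tokens, drop stop words."""
    normalized = text.translate(HAMZA_FORMS)
    return [tok for tok in ARABIC_TOKEN.findall(normalized) if tok not in stop_words]


def build_vocabulary(training_docs, min_count=4):
    """Step 5: keep terms that occur at least min_count times in the training data."""
    counts = Counter(tok for doc in training_docs for tok in tokenize(doc))
    return {term for term, n in counts.items() if n >= min_count}
```

A real system would use a full stop-word list and the exact normalization rules of [11]; the sketch only shows the order of the operations.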
Table 1: Arabic Data set

We have not applied stemming, because it is not always beneficial for text categorization: many terms may be conflated to the same root form [13]. Based on the vector space model (VSM), each term corresponds to a text feature whose value is the term frequency $TF_{ij}$, the number of times term $i$ occurs in document $j$. This TF gives the frequent words of a document more importance.
We have used the inverse document frequency IDF [4] to improve system performance. $DF_i$, the number of documents in which term $i$ occurs, is used to calculate

$$IDF_i = \log\left(\frac{N}{DF_i}\right)$$

where $N$ is the total number of training documents.
$TF_{ij} \cdot IDF_i$ is calculated as a weight for each term (text feature). Then the vectors are normalized to unit length.
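The weighting scheme can be sketched in a few lines of Python. The sketch assumes the common form IDF = log(N / DF) with a natural logarithm; the function name and the sparse dictionary representation are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # DF: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF: occurrences of each term in this document
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Normalize to unit length; an all-zero vector is left unchanged.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors
```

Note that a term occurring in every training document receives weight zero under this form of IDF, so it contributes nothing to the document vector.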

Feature selection:
In text categorization, we are dealing with huge feature spaces. This is why we need a feature selection mechanism. The most popular feature selection methods are document frequency thresholding (DF) [14], the $\chi^2$ statistic (CHI) [15], term strength (TS) [16], information gain (IG) [14], and mutual information (MI) [14]. The $\chi^2$ statistic [14] measures the lack of independence between the text feature term $t$ and the text category $c$, and can be compared to the $\chi^2$ distribution with one degree of freedom to judge extremeness. Using the two-way contingency table (Table 2) of a term $t$ and a category $c$, $A$ is the number of times $t$ and $c$ co-occur, $B$ is the number of times $t$ occurs without $c$, $C$ is the number of times $c$ occurs without $t$, $D$ is the number of times neither $c$ nor $t$ occurs, and $N$ is the total number of documents.
The term-goodness measure is defined as follows:

$$\chi^2(t, c) = \frac{N \, (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

This $\chi^2$ statistic has a natural value of zero if $t$ and $c$ are independent. Among the above feature selection methods, [14] found CHI and IG most effective. Unlike [4], which used IG in its experiments, we have used CHI as the feature selection method for our Arabic TC.
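As a worked illustration, the sketch below computes the statistic from the four contingency counts, assuming the usual form N(AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D)) with N = A + B + C + D. The helper `top_terms`, which ranks the terms of one category by score, and the zero returned for a degenerate table are assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square term-goodness from the two-way contingency counts A, B, C, D."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table (assumed convention)
    return n * (a * d - c * b) ** 2 / denominator


def top_terms(term_counts, k):
    """term_counts: {term: (A, B, C, D)} for one category. Returns the k best terms."""
    ranked = sorted(term_counts, key=lambda t: chi_square(*term_counts[t]), reverse=True)
    return ranked[:k]
```

For example, a term with counts A = 40, B = 10, C = 10, D = 40 scores 36.0, while a term with A = B = C = D = 10 is independent of the category and scores 0.0.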

SVMs TC Classifier:
Like any classification algorithm, TC algorithms have to be robust and accurate. Many machine learning based methods can be applied to TC tasks; Support Vector Machines (SVM) [4] and other kernel based methods, e.g. [17] and [18], have shown empirical success in the field of TC.
Empirical TC results have shown that SVM classifiers perform well, because of the following text property [4]: most text categorization problems are linearly separable. However, other kernel methods have outperformed the linear-kernel SVM, e.g. [18]. Support Vector Machines (SVMs) are binary classifiers, which were originally proposed by [19].
SVMs have achieved high accuracy in various tasks, such as object recognition [20].
Suppose a set of ordered pairs, each consisting of a feature vector and its label, is given. Slack variables are introduced to enable non-separable problems to be solved [21]; in this case, we allow a few examples to penetrate into the margin or even onto the other side of the hyperplane.
Skipping the details of the Lagrangian theory, equations (2) and (3) are converted to the dual problem shown in equations (4) and (5), where $\alpha_i$ is a Lagrange multiplier and $C$ is a user-given constant.
Because dual problems have quadratic forms, they can be solved more easily than the primal optimization problem in equation (2). Since SVMs are linear classifiers, their separating ability is limited. To compensate for this limitation, the kernel method is usually combined with SVMs [19].
In the kernel method, the dot products in (5) and (6) are replaced with more general inner products $K(x_i, x)$, called the kernel function. The polynomial kernel and the Radial Basis Function (Gaussian) kernel are often used. This means that the feature vectors are mapped into a higher dimensional space and linearly separated there. The significant advantage of this process is that only the general inner products of two vectors are needed, which leads to a relatively small computational overhead. On the other hand, the crucial issues for SVMs are choosing the right kernel function and tuning the parameters.
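To make the kernel substitution concrete, the following sketch shows the plain dot product, a polynomial kernel and an RBF kernel, together with a decision function of the standard form sign(sum_i alpha_i y_i K(x_i, x) + b). The parameter names (`degree`, `coef0`, `gamma`) and their default values are assumptions; the multipliers `alphas` and the `bias` would come from solving the dual problem, which is not shown here.

```python
import math


def dot(x, z):
    """Plain dot product, i.e. the linear kernel."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision(x, support_vectors, labels, alphas, bias, kernel=dot):
    """Return +1 or -1 from the sign of sum_i alpha_i * y_i * K(x_i, x) + bias."""
    score = sum(a * y * kernel(sv, x) for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score + bias >= 0 else -1
```

Swapping `kernel=dot` for `kernel=rbf_kernel` changes the decision boundary without changing the rest of the classifier, which is the practical appeal of the kernel method.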

$$\text{F-measure} = \frac{2 \, (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}$$
Many other TC classifiers [22] have been investigated in the literature.
k-NN Classifier:
The k-NN classifier [1], a generalization of the nearest neighbor rule, uses the k nearest neighbors of a document as the basis for deciding which category to assign to it. k-nearest neighbor classifiers show very good performance on text categorization tasks for the English language [23]. It is worth pointing out that k-NN uses cosine as a similarity metric.
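A minimal k-NN sketch with cosine similarity over sparse term-weight vectors is given below. The unweighted majority vote among the k neighbours is an assumption made for simplicity (similarity-weighted voting is also common), and the function names are illustrative.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3):
    """training: list of (vector, category). Majority vote among the k most similar."""
    neighbours = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]
```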

Naïve Bayes classifier:
The main idea of the naïve Bayes classifier [23] is to use a probabilistic model of text. The probabilities of positive and negative examples are computed.

Performance measures:
TC performance is always considered in terms of computational efficiency and categorization effectiveness.
When categorizing a large number of documents into many categories, the computational efficiency of the TC system must be considered. This includes the feature selection method and the classifier learning algorithm.
TC effectiveness is measured in terms of precision and recall [24]. Precision and recall are defined as follows [23]:

$$\text{Precision} = \frac{a}{a + b}, \qquad \text{Recall} = \frac{a}{a + c}$$

where $a$ counts the assigned and correct cases, $b$ counts the assigned and incorrect cases, $c$ counts the not assigned but incorrect cases and $d$ counts the not assigned and correct cases.
A two-way contingency table (Table 3) contains $a$, $b$, $c$ and $d$. The values of precision and recall often depend on parameter tuning; there is a trade-off between them. This is why we use another measure that combines both precision and recall: the F-measure, the harmonic mean of precision and recall. To evaluate the performance across categories, the F-measure is averaged. There are two kinds of averaged values, namely the micro average and the macro average [23].
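The effectiveness measures can be sketched as follows, assuming the usual definitions Precision = a / (a + b), Recall = a / (a + c) and F = 2 * Precision * Recall / (Precision + Recall). The function names and the convention of returning 0.0 for an empty denominator are assumptions. The macro average is taken over per-category F-measures, while the micro average pools the counts of all categories first.

```python
def precision_recall_f(a, b, c):
    """a: assigned and correct, b: assigned and incorrect, c: wrongly not assigned."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def macro_f(tables):
    """Average the per-category F-measures; tables is a list of (a, b, c)."""
    return sum(precision_recall_f(*t)[2] for t in tables) / len(tables)


def micro_f(tables):
    """Pool the counts of all categories, then compute a single F-measure."""
    a, b, c = (sum(col) for col in zip(*tables))
    return precision_recall_f(a, b, c)[2]
```

Because the macro average weights every category equally, a small category with low recall pulls it down more than it pulls down the micro average.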

RESULTS
In our experiment, we have used the Arabic data set described above for training and testing the TC classifier. Following the majority of text classification publications, we have removed the Arabic stop words, filtered out the non-Arabic letters and symbols, and removed the digits. As mentioned before, we have not applied a stemming process. We have used one third of the Arabic data set for testing the classifier and two thirds for training it, as shown in Table 4. (... shown no significant changes in results). The results of our classifier in terms of Precision, Recall and F-measure for the nine categories are shown in Table 5.
While conducting some other experiments using the $\chi^2$ scores, we tried to tune the number of selected CHI square terms (in this case, an unequal number of terms is selected for each classification category), but we could not achieve better results than those achieved using the 162 mentioned terms for each classification category. Following [11] in the use of light stemming to improve the performance of Arabic TC, we have used the stemmer of [25] to remove the suffixes and prefixes from the Arabic index terms. Unfortunately, light stemming does not improve the performance of our CHI square feature extraction based SVMs classifier: the F-measure drops to 87.1. As mentioned before, stemming is not always beneficial for text categorization problems [13], which may explain the slight drop in the averaged F-measure.

CONCLUSION
We have investigated the performance of the CHI statistic as a feature extraction method, and the use of an SVMs classifier for TC tasks on Arabic language articles. We have achieved practically acceptable results that are comparable to other research results. With regard to $\chi^2$, we would like to investigate more deeply the relation between the $A$, $B$, $C$ and $D$ values in the CHI algorithm when dealing with small categories such as Computer. For this particular category, we have experimented with the $\chi^2$ and classifier parameters, but we could not improve the Recall or the Precision values. The investigation of other feature selection algorithms remains for future work. Building a bigger Arabic language TC corpus shall also be considered in our future research.