ARABIC PART OF SPEECH TAGGING USING K-NEAREST NEIGHBOUR AND NAIVE BAYES CLASSIFIERS COMBINATION

Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval. It is the process of assigning every word in a given context to its appropriate part of speech. Many POS tagging techniques have been developed and tested in the literature. However, some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text. This complexity presents several challenges for POS tagging, such as high ambiguity, data sparseness and a large number of unknown words. With this in mind, the main problem addressed here is to determine how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to produce an efficient framework for Arabic POS tagging. We propose an experimental classifier-combination framework for Arabic POS tagging that selects two diverse probabilistic classifiers used in numerous works on non-Arabic languages, namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the advantages of both classifiers. In addition, an in-depth study was conducted on a large list of features to identify effective features and investigate their role in enhancing the performance of POS taggers for Quranic Arabic. Hence, this study aims to integrate different feature sets and tagging algorithms efficiently to produce a more accurate POS tagging procedure. The data used in this study is the Quranic Arabic Corpus, an annotated linguistic resource consisting of 77,430 words, with Arabic grammar, syntax and morphology for each word in the Holy Quran. The highest accuracy achieved is 98.32%, which may be a significant improvement over the state of the art for Quranic Arabic text.
The most effective features that yield this accuracy are a combination of w_0 (the current word), p_0 (POS of the current word), p_{-3} (POS of three words before), p_{-2} (POS of two words before) and p_{-1} (POS of the word before).


INTRODUCTION
Part Of Speech (POS) disambiguation is the ability to determine computationally which POS of a word is activated by its use in a particular context. It can also be described as the procedure of determining a suitable POS tag for every word in a sentence.
Fine-grained POS (morpho-syntactic or morphological) tagging is the procedure of determining POS, tense, number, gender and other morphological information for every word in a sentence (Feldman, 2006; Schmid and Laws, 2008). POS tagging is an essential language analysis task in almost all NLP systems, including information extraction, corpus annotation projects and word-sense disambiguation. The output of POS taggers is generally passed on to a higher-level language analysis task, such as syntactic parsing (Mohamed, 2010) or Named Entity Recognition (Benajiba, 2009).
Part of speech tagging is a crucial NLP problem. It entails a large number of challenging problems, including different kinds of unknown words and POS ambiguities.
Words that can be found neither in the dictionary nor in the training corpus are known as "unknown words". Understanding unknown words matters more to understanding the meaning of a sentence than understanding known words, because they carry more semantic information than known words (Vadas and Curran, 2005). Unknown words usually belong to open POS classes such as verbs and nouns and are unlikely to belong to closed classes such as particles. In fact, open-ended text sources, including web corpora, confront NLP systems with unknown words as a major challenge (Weischedel et al., 1993).
Natural languages are inherently ambiguous (Tomita, 1985). Ambiguity can occur at various levels of the Natural Language Processing (NLP) task (Dandapat, 2009; Jurafsky et al., 2009). Ambiguity that arises in a single word is referred to as lexical ambiguity, of which POS ambiguity is an example (Manning and Schutze, 1999).

RELATED WORK
POS tagging provides essential information about the word forms used in natural language sentences. How this information is used depends on the specific NLP application (e.g., information retrieval, machine translation).
As depicted in Fig. 1, there are two groups of POS tagging techniques: linguistic taggers and machine learning approaches. Machine learning approaches are divided into two main groups: supervised and unsupervised.

Linguistic Taggers
Linguistic-based taggers specify the relevant knowledge as a set of rules or constraints written by linguists. These models generally require years of work, as they range from a few hundred to several thousand rules. Research in automated POS tagging began in the 1960s and 1970s (Klein and Simmons, 1963; Harris, 1962; Greene and Rubin, 1971). Researchers manually established rules for tagging.

Machine Learning Approaches
POS disambiguation may be seen as a classification problem: the tag set forms the classes, and an automatic classification method assigns each occurrence of a word to one class based on evidence from the context. Choosing the classification method is the most critical phase in POS disambiguation. Most recent approaches originate in the machine learning field (Navigli, 2009). Machine learning methods range from fully unsupervised to fully supervised.
Unsupervised and supervised approaches differ greatly. Some of the most important differences are shown in Table 1.
Arabic is a Semitic language spoken by more than 450 million people. It is a highly derivational and strictly structured language, and it is grammatically ambiguous. Moreover, Arabic is one of the six official languages of the United Nations.
Unfortunately, no open-source POS tagger designed specifically for Arabic is available to meet the community's dependence on fundamental NLP tools. Moreover, given the difficulty of Arabic POS disambiguation and the limitations of existing work in the literature, the problem needs further investigation. To date, little research has been done on statistical NLP for Arabic, which is constrained by the scarcity of openly accessible, manually annotated corpora. Progress in POS taggers is therefore of substantial value for reducing the huge cost of developing annotated corpora manually.

Arabic Part of Speech
According to Haywood and Nahmad (1962), Arabic words can be classified into three main POS, which are in turn subdivided into more detailed POS. The three main parts of speech are:

Noun
A noun in Arabic is a name or a word that describes a person, thing or idea. The noun group is usually divided into the sub-groups of derivatives (i.e., nouns derived from verbs, nouns derived from other nouns and nouns derived from particles) and primitives (nouns not so derived). These nouns can be further sub-categorized by number, gender and case. This category also contains what would otherwise be categorized as participles, pronouns, relatives, demonstratives and interrogatives.

Verb
The verb classification in Arabic is similar to that of English, although the tenses and aspects are different. Verbs can be sub-categorized by 'type' (perfect, imperfect, imperative), person, number and gender, and the tag name reflects this sub-category. As an example, the word ksrtm "you [plural, masculine] broke" is a perfect verb in the second person masculine plural form. Similarly, tktbyn "you [singular, feminine] are writing" is an indicative imperfect verb in the second person feminine singular form.

Particle
The particle group contains prepositions, adverbs, conjunctions, interrogative particles, exceptions (the Arabic words equivalent to the word "except" and to the prefixes non-, un- and im-), interjections and negative particles.
Adverbs, conjunctions and prepositions can occur in Arabic either as individual words or as clitics attached to the following word.

POS Tagging Approaches used for Arabic
A number of POS tagging studies have addressed Arabic and its different dialects. Each of these dialects has its own small vocabulary. Mohamed (2010) stated that "Arabic POS tagging is still in the stage of research since Arabic poses different problems than those posed by English." The problems facing Arabic POS tagging studies are as follows (El-Hadj, 2009; Al Gahtani et al., 2009):
• Arabic POS tagging suffers from the knowledge acquisition bottleneck
• Arabic has a complicated morphology, which raises the number of unknown words
• Resources such as lexicons are scarce and are rarely, if ever, freely available for research
• Arabic dialects are seldom written, which makes annotated corpora and lexicons hard to develop
• For several reasons, including the omission of short vowels in writing, Arabic is among the languages with a high degree of ambiguity
Many approaches to Arabic POS tagging have been proposed in the literature. These approaches are based on different assumptions and rules and have achieved different accuracy results. Some of the most closely related research on POS tagging approaches for Arabic is summarized in Table 2.

ARABIC POS TAGGING FRAMEWORK
In this section, we propose a framework for Arabic POS tagging based on a combination of two of the best supervised machine-learning taggers, K-Nearest Neighbour (KNN) and Naïve Bayes (NB), which are combined using a majority voting algorithm. Combinations of machine-learning classifiers have been used effectively on several languages and typically outperform their individual members. In view of the need for Arabic analysis tools (Diab and Habash, 2007), we attempt to discover how these techniques can be applied to Arabic and what results they achieve. We combine the best classifiers in order to benefit from the strengths of each method.

Corpus and Pre-processing
In this study, we used the Quranic Arabic corpus. The corpus is preprocessed prior to the experiments, starting with tokenization. Tokenization can be defined as the process of splitting words (morphemes) out of running text (Jurafsky et al., 2009). It is an essential initial step in NLP whose purpose is to split sentences into tokens so that they can be passed to a POS tagger or a morphological analyzer for further processing (Attia, 2007).
The Quran is the Islamic religious book and is written in classical Quranic Arabic (600 CE). According to Dukes and Habash (2010), the Quranic Arabic corpus is an annotated linguistic resource that shows the Arabic syntax, grammar and morphology for every word in the Quran. The research project is organized at the University of Leeds by a research group within the School of Computing (http://corpus.quran.com). The corpus is composed of 77,430 words. It is a reference with several levels of analysis, including POS tagging and morphological segmentation. Every word of the Quran is tagged with its POS along with several morphological features.
In this phase, Quranic Arabic verses are acquired for preliminary tokenization. The automatically tokenized text is then examined and corrected manually. Manual correction includes manual normalization of the tokenized text.

Features Selection
Three different kinds of features are extracted from the sliding window:

Word Features
These include word-form n-grams; typically, unigrams, bigrams and trigrams suffice. The last word of the sentence, which is a punctuation mark ('.', '?', '!'), is also important. Different word features are used in this experiment.

POS Features
These include n-grams of annotated parts of speech (POS) and of ambiguity classes. As with words, considering unigrams, bigrams and trigrams is enough. The ambiguity class of a specific word indicates which POS tags are possible for it.

Affix and Orthographic Features
These consist of prefixes and suffixes, capitalization, hyphenation and similar information related to a word form. They are used simply to characterize unknown words. Table 3 lists the rich feature set used in the experiment.
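The three feature groups can be illustrated with a minimal sketch of sliding-window feature extraction. The function name `extract_features`, the padding symbol `PAD`, the feature key names and the transliterated example words and tags are all hypothetical and are not taken from the study. The sketch also assumes left-to-right tagging, so only POS tags to the left of the current word are used.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical symbol for window positions outside the sentence


def extract_features(words: List[str], tags: List[str], i: int,
                     window: int = 3) -> Dict[str, object]:
    """Build a feature dictionary for the word at position i."""
    feats: Dict[str, object] = {}
    for off in range(-window, window + 1):
        j = i + off
        inside = 0 <= j < len(words)
        # Word features: w-3 ... w+0 ... w+3
        feats[f"w{off:+d}"] = words[j] if inside else PAD
        # POS features: only tags to the left are used (assumption:
        # tags to the right are unknown when tagging left to right).
        if off < 0:
            feats[f"p{off:+d}"] = tags[j] if inside else PAD
    word = words[i]
    # Affix features: prefixes and suffixes of length 1 to 4.
    for n in range(1, 5):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    # Orthographic features.
    feats["has_digit"] = any(ch.isdigit() for ch in word)
    feats["length"] = len(word)
    return feats


if __name__ == "__main__":
    # Hypothetical transliterated words and tags, for illustration only.
    sentence = ["bsm", "Allh", "AlrHmn"]
    left_tags = ["P", "PN", "ADJ"]
    print(extract_features(sentence, left_tags, 1))
```

Each dictionary produced this way corresponds to one training or test instance; which keys are actually passed to a classifier would depend on the feature subset selected for a given run.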

The Combined Classifiers
The following phase of the workflow is the combined classifiers. Two classifiers have been used in the combination, namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). On one hand, the K-Nearest Neighbour algorithm will assist when the test pair has similar characteristics to one of the training examples. On the other hand, NB is selected because it is known to obtain high performance.

K-Nearest Neighbour classifier
The K-Nearest Neighbour (KNN) is a well-known instance-based classifier. KNN is regarded as a powerful method for various text classification problems (Duda et al., 2001; Yang, 1994). KNN is also known as a lazy learner, because it defers the decision on how to generalize beyond the training data until a new query instance is encountered. In the KNN algorithm, a new input instance is assumed to belong to the same class as its K nearest neighbours in the training dataset. The instance is therefore assigned the class that is most common among its K nearest neighbours across all training instances. "Closeness" is defined by a distance metric, such as the Euclidean distance.
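As an illustration of the KNN decision rule, the following minimal sketch classifies an instance by a vote among its K nearest training instances. Because the tagging features are symbolic, the sketch assumes an overlap (mismatch-count) distance in place of the Euclidean distance mentioned above; the function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Instance = Tuple[Sequence[str], str]  # (feature values, POS tag)


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of feature positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_predict(train: List[Instance], query: Sequence[str], k: int = 3) -> str:
    """Return the most common tag among the k nearest training instances."""
    # sorted() is stable, so equally distant instances keep training order.
    neighbours = sorted(train, key=lambda item: overlap_distance(item[0], query))[:k]
    votes = Counter(tag for _, tag in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy instances: (current word, previous tag) -> tag of the current word.
    train = [
        (("ktb", "N"), "V"),
        (("ktb", "P"), "N"),
        (("qlm", "P"), "N"),
        (("qlm", "V"), "N"),
    ]
    print(knn_predict(train, ("ktb", "P"), k=3))
```

No model is built in advance: all work happens at query time, which is what makes KNN a lazy learner.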

Naive Bayes
The Naive Bayes (NB) classifier is a well-known machine learning technique. It is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature (attribute) of a class is unrelated to the presence (or absence) of any other feature.
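The independence assumption means that the score of a tag is the prior probability of the tag multiplied by one conditional probability per feature. A minimal sketch for categorical features follows; the class name `NaiveBayesTagger`, the add-alpha (Laplace) smoothing and the toy data are assumptions made for illustration and are not described in the study.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


class NaiveBayesTagger:
    """Categorical Naive Bayes with add-alpha smoothing (an assumption)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.feature_counts = defaultdict(Counter)  # (position, tag) -> value counts
        self.values = defaultdict(set)              # position -> values seen in training

    def fit(self, data: List[Tuple[Sequence[str], str]]) -> "NaiveBayesTagger":
        for feats, tag in data:
            self.class_counts[tag] += 1
            for pos, value in enumerate(feats):
                self.feature_counts[(pos, tag)][value] += 1
                self.values[pos].add(value)
        return self

    def predict(self, feats: Sequence[str]) -> str:
        total = sum(self.class_counts.values())
        best_tag, best_score = None, -math.inf
        for tag, count in self.class_counts.items():
            # log P(tag) + sum over features of log P(value | tag)
            score = math.log(count / total)
            for pos, value in enumerate(feats):
                num = self.feature_counts[(pos, tag)][value] + self.alpha
                # One extra slot is reserved for values unseen in training.
                den = count + self.alpha * (len(self.values[pos]) + 1)
                score += math.log(num / den)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


if __name__ == "__main__":
    data = [
        (("ktb", "N"), "V"),
        (("ktb", "P"), "N"),
        (("qlm", "P"), "N"),
        (("qlm", "V"), "N"),
    ]
    print(NaiveBayesTagger().fit(data).predict(("qlm", "P")))
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied together.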

Voting Algorithms (Combination Strategies)
The selection algorithm, which is at the centre of this methodology, determines the accuracy of the combined classifiers. It does so by finding the right answer given a set of three answers. Selection algorithms include majority (simple) voting, plural (total) voting, tag precision and stacking (cascade classifiers).
Majority voting is the most straightforward voting technique. It looks only at the most probable class given by each classifier and then finds the most frequent class label among this crisp output set. Weighted majority voting is a trainable variant of majority voting that multiplies every vote by a weight before the actual voting. The weight for each classifier can be obtained, for instance, by calculating the classifiers' accuracies on a validation set. Another voting technique is the Borda count, which considers the whole n-best list of a classifier, not only the crisp 1-best candidate class.
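The three voting schemes described above can be sketched as follows. The function names, the weights and the tie-breaking rule (ties go to the label that appears first) are assumptions made for illustration; the study does not state how ties between the two classifiers are resolved.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Most frequent 1-best label; ties go to the first label encountered."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions: Sequence[str], weights: Sequence[float]) -> str:
    """Each classifier's vote counts with its weight, e.g. validation accuracy."""
    scores: Dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


def borda_count(rankings: Sequence[List[str]]) -> str:
    """Each classifier submits an n-best list; rank r in a list of n earns n-1-r points."""
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for rank, label in enumerate(ranking):
            scores[label] += n - 1 - rank
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(majority_vote(["N", "V", "N"]))
    print(weighted_vote(["N", "V"], [0.90, 0.95]))  # hypothetical weights
    print(borda_count([["N", "V", "P"], ["V", "N", "P"], ["N", "P", "V"]]))
```

With only two classifiers, simple majority voting ties whenever they disagree, so the tie-breaking rule or the weights effectively decide the outcome in those cases.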

Evaluation
In general, the evaluation measures in classification problems are defined from a matrix containing the numbers of examples correctly and incorrectly classified for each class, called the confusion matrix. The confusion matrix for a binary classification problem (which has only two classes, positive and negative) is shown in Table 4.
The FP, FN, TP and TN concepts may be described as:
• False Positives (FP): Instances predicted as positive, which are from the negative class
• False Negatives (FN): Instances predicted as negative, whose true class is positive
• True Positives (TP): Instances correctly predicted as pertaining to the positive class
• True Negatives (TN): Instances correctly predicted as belonging to the negative class

Table 3. Rich feature pattern set used in the experiment and its symbol
Feature type            Symbols
Word features           w_{-3}, w_{-2}, w_{-1}, w_0, w_{+1}, w_{+2}, w_{+3}
POS features            p_{-3}, p_{-2}, p_{-1}, p_0, p_{+1}, p_{+2}, p_{+3}
Prefixes                s_1, s_1 s_2, s_1 s_2 s_3, s_1 s_2 s_3 s_4
Suffixes                s_n, s_{n-1} s_n, s_{n-2} s_{n-1} s_n, s_{n-3} s_{n-2} s_{n-1} s_n
Binary word features    All upper case, all lower case, contains a number
Word length             Integer
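The confusion-matrix counts (TP, FP, FN, TN) can be computed per tag by treating that tag as the positive class and all other tags as negative. The sketch below does this and derives accuracy from the counts; precision and recall are included as standard companions, although the study reports accuracy only. The function names and the toy tag sequences are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(gold: Sequence[str], pred: Sequence[str],
                     positive: str) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of instances on the diagonal of the confusion matrix."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


if __name__ == "__main__":
    gold = ["N", "V", "N", "P", "N"]  # toy reference tags
    pred = ["N", "N", "N", "P", "V"]  # toy predicted tags
    tp, fp, fn, tn = confusion_counts(gold, pred, positive="N")
    print(tp, fp, fn, tn, accuracy(tp, fp, fn, tn))
```

For a multi-class tagger, overall tagging accuracy is simply the fraction of words whose predicted tag equals the reference tag, which is a different quantity from the per-tag binary accuracy computed here.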

RESULTS AND EVALUATION
This section presents the results of the experiments performed on the Quranic Arabic corpus by applying the individual classifiers as well as selected combinations. A sample of the experimental results is discussed in detail, and the classifiers and the list of features that lead to the best result are stated.

Experiment Test Set
The Quranic Arabic corpus includes syntactic and morphological annotation of the Quran and builds on the verified Arabic text distributed by the Tanzil project (Tanzil.net). It consists of 77,430 words. The experiments in the present study were performed on the whole Quran corpus. For each experiment, all words and a random set of features of those words are taken from the corpus.

Experimental Results
The experiment applied the 28 features explained in Table 5 to the Arabic Quran dataset: the word and its part of speech, word features (7), POS features (7), prefixes (4), suffixes (4), binary word features (3) and word length (1). For each individual experimental run, a random set of features was chosen together with a single classifier or a combination. In total, 138 runs were conducted within the experiment. The percentage score of each classifier with its accompanying feature set was calculated, and the highest accuracy obtained is 98.32%. The classifier that gives this accuracy is the combination of NB and KNN. The corresponding feature set is a combination of w_0 (the current word), p_0 (POS of the current word), p_{-3} (POS of three words before), p_{-2} (POS of two words before) and p_{-1} (POS of the word before).
Finally, the results of the study indicate that the proposed model is a significant enhancement over the state of the art for Arabic POS tagging. The results were compared with the latest research on Arabic POS tagging and showed higher accuracy.

JCS
By combining classifiers and evaluating the results obtained each time a classifier was applied with a set of features, the highest accuracy of 98.32% was achieved by the KNN and NB combination. The most effective feature set for reaching this accuracy is a combination of w_0 (the current word), p_0 (POS of the current word), p_{-3} (POS of three words before), p_{-2} (POS of two words before) and p_{-1} (POS of the word before).

CONCLUSION
Arabic is a widely spoken language with approximately 450 million speakers, which makes it the fourth most widely spoken language. However, in the computer world, and especially on the Internet, Arabic represents only 3% of overall content. Moreover, using Arabic in computerized systems remains a challenge because of the complex morphology and structure of the language.
As stated earlier, this research mainly contributes to the field of POS tagging and is specific to the Arabic language. The contributions of the research are as follows:
• The research has studied, examined and presented a set of rich feature patterns that help to enhance POS tagging, especially in morphologically rich languages such as Arabic
• The research has presented a model that significantly enhances the performance of POS tagging in Arabic, based on the combination of classifiers and the integration of a set of rich feature patterns
• The model contributes to improving the disambiguation of word category and grammatical tagging in the Arabic language
As future work, we believe that improving the features and patterns for tags is a possible strategy for raising the accuracy of POS tagging systems. We also intend to investigate this POS tagging approach further in order to reduce the error rate and to apply it as a basis for a parsing and analysis framework.