AN IMPROVED ARABIC WORD’S ROOTS EXTRACTION METHOD USING N-GRAM TECHNIQUE

Arabic language is distinguished by its morphological richness, which forces the workers in the field of Arabic language Processing (i


INTRODUCTION
Arabic is one of the major languages of the world. It is spoken by over 400 million people and, being the language of the Holy Quran, is also used by more than 1.5 billion Muslims all over the world, making it the largest of the Semitic languages (Ghazzawi, 1992). The Arabic alphabet consists of 28 characters, is written from right to left and uses cursive letters. Each Arabic word is formed from a root combined with a prefix, a suffix or an infix.
Many computerized applications for the Arabic language rely on the roots of words, such as information retrieval systems, text classification, text summarization, automatic translation, data mining and OCR (Ghwanmeh et al., 2009; Yousef et al., 2010). Arabic roots can be classified according to their vowel letters into two types (Wightwick and Gaafar, 2007): strong roots, which contain no vowel, and vocalic roots, which contain at least one vowel. Arabic roots can be further classified according to the number of their characters into four types: triliteral (which form most words in the Arabic language; Al-Kamar, 2006), quadriliteral, quinqueliteral and hexaliteral.
There are many methods for extracting the roots of Arabic words, but there is no agreement on a single method because of the morphological richness of the Arabic language and the large number of conflations for each word. Researchers have produced many root extraction methods during the last decade (Hajjar et al., 2010; Al-Nashashibi et al., 2010), especially for strong triliteral roots. Many of these methods rely on morphological rules, which makes root extraction difficult and complex: there are many morphological patterns, and a single root yields many word forms in which the positions of the original characters change. In this study, the researchers introduce an improved method that extracts a word's root without morphological rules, using the n-gram technique instead, which simplifies the extraction process.
The study consists of four sections. The first section is a literature review in which the researchers review some papers that deal with extracting the roots of Arabic words. In the second section the researchers introduce the proposed algorithm. The third section presents the experiment that the researchers conducted to test the proposed algorithm and reports the results obtained from these tests. In the last section the researchers conclude the research.

LITERATURE REVIEW
During the last two decades, many researchers have proposed new approaches to extracting the roots of Arabic words. Some of these approaches use morphological analysis, whereas others rely on statistical methods.
Hawas (2013) presents a root extraction approach for Arabic words that tries to assign a unique root to each word without requiring a list of Arabic roots, a list of word patterns, or a list of prefixes and suffixes. The algorithm predicts, one by one, the positions of the letters that may form the root, using rules based on the relations between the letters of the word and their placement in it. The algorithm consists of two parts. The first part deals with the rules that distinguish the Arabic definite article "ال" (AL, La) from letters that belong to the original word. The second part segments the word into three parts and classifies its letters into groups according to their positions. The approach is composed of several modules. The author tested it on the words of the Holy Quran and reports a total success ratio of about 93.7%, but a root was considered correct if it had one correct letter.
Boudlal et al. (2011) propose a system that assigns a unique root to every non-vowelized word, depending on the context of the word in the sentence. The system is composed of two modules. The first module analyzes the context by segmenting the words of the sentence into their elementary morphological units (prefix, stem and suffix) in order to identify the possible roots. The second module uses the context to identify the correct root among all possible roots of the word. For this purpose, the researchers use a Hidden Markov Model (HMM) approach in which the observations are the words and the possible roots are the hidden states. The researchers validated their approach on the NEMLAR Arabic written corpus, which consists of 500,000 words. The algorithm gives the correct root for more than 98% of the words in the training set and for almost 94% of the words in the testing set.
Hmeidi et al. (2010) propose a way to find the roots of Arabic words using bigrams. The researchers use two measures: a dissimilarity measure, the Manhattan distance, and a similarity measure, Dice's measure. They tested their algorithm on the Holy Quran and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. They conclude that combining n-grams with Dice's measure gives better results than using the Manhattan distance.
Al-Nashashibi et al. (2010) address a linguistic approach to root extraction as a preprocessing step for Arabic text mining. The approach is composed of a rule-based light stemmer and a pattern-based infix remover. The researchers propose an algorithm to handle weak, eliminated-long-vowel, hamzated and geminated words. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The approach was tested on an in-house text collection of eight categories and achieved a success ratio of about 73.74%.
Momani and Faraj (2007) propose an algorithm to extract triliteral Arabic roots. In the first step, stop words are eliminated and then the prefixes and suffixes of each word are removed. In the next step, repeated letters of the word are removed until only three letters remain. Finally, the remaining letters are arranged according to their order in the original word to form the root. The researchers tested their algorithm on two types of Arabic text documents and claim that the results of both runs were very promising and satisfactory, with an accuracy of over 73%.

PROPOSED ALGORITHM
In the following section, the researchers introduce the steps implemented to arrive at the new root extraction algorithm. Because approaches based on morphological analysis face difficulties, especially with words that contain vowel letters, the new algorithm uses the n-gram method. The n-gram is a basic text analysis tool used in natural language processing. In this technique, both the word and its assumed root are divided into pairs of adjacent letters (called bi-grams or digrams), and the similarity between the word and the root is then calculated using Equation (1) (Frakes, 1992). This process is repeated for each root in the roots list:
$$S = \frac{2C}{A + B} \qquad (1)$$
where A is the number of unique bi-grams in the word, B is the number of unique bi-grams in the candidate root and C is the number of unique bi-grams shared by the two. To utilize Equation (1) for extracting a word's root, we need the word (A) and the potential roots (B) to compare it with. The similarity is then measured by computing the value of S between the word (A) and each potential root (B). A corpus of 4500 triliteral Arabic roots was used for the similarity calculation step. Only triliteral roots were chosen because they form about 85% of the roots of the Arabic language.
To extract the root of a word with the proposed algorithm, both the tested word (A) and the candidate root (B) must be divided into pairs of consecutive letters, and only the unique pairs are counted when calculating the similarity (S). The root with the highest (S) value in the roots list is considered the root of the word.
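As an illustration of the similarity calculation, the following hand-worked sketch assumes that Equation (1) is Dice's coefficient, S = 2C/(A + B), as described by Frakes (1992). The example is not taken from the original study: it uses the transliterated word kātib (letters k, ā, t, b) and the candidate root k-t-b.

```latex
\begin{align*}
\text{bigrams(word)} &= \{k\bar{a},\ \bar{a}t,\ tb\} & A &= 3\\
\text{bigrams(root)} &= \{kt,\ tb\} & B &= 2\\
\text{shared bigrams} &= \{tb\} & C &= 1\\
S &= \frac{2C}{A + B} = \frac{2 \cdot 1}{3 + 2} = 0.4
\end{align*}
```

A candidate root that shares no bi-gram with the word has C = 0 and therefore S = 0, so k-t-b would be preferred over any such root.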
For example, if we had the word ‫ض"‬ ", whose root is ‫ض"‬ ", the values of A, B, C and S would be as shown in the following:
In the previous example the similarity (S) was calculated easily because the word was compared directly with its actual root. In real situations, however, the root is extracted without knowing the actual root, which means that the similarity (S) must be calculated between the word and all candidate roots. For the purpose of this research, we used a list of 4500 Arabic roots. The process of extracting the actual root is shown in the following algorithm:
1. Normalize the word by deleting its diacritics (hamza: ء) and converting the letter (ة) to the letter ( ).
2. Divide the word into bi-gram pairs.
3. Find the number of unique bi-grams in the word (A).
4. Choose a candidate root (B) from the roots list and apply steps 2 and 3 to find the number of unique bi-grams in the root (B).
5. Calculate the similarity (S) between the word (A) and the candidate root (B).
6. Repeat steps 4 and 5 for the rest of the roots in the roots list.
7. Choose the root with the highest similarity (S) among the roots in the list as the root of the word (A).
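The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Equation (1) is taken to be Dice's coefficient S = 2C/(A + B); the function names, the range of diacritic code points and the three-root toy list are hypothetical; and the conversion of ة is left as an optional parameter because its target letter is not given in the text.

```python
import re

# Assumption: "diacritics" covers the short-vowel marks U+064B..U+0652,
# and the standalone hamza (U+0621) is deleted as well.
DIACRITICS = re.compile("[\u064B-\u0652]")
HAMZA = "\u0621"
TA_MARBUTA = "\u0629"


def normalize(word, ta_marbuta_to=None):
    """Step 1: strip diacritics and hamza; optionally convert ta marbuta."""
    word = DIACRITICS.sub("", word).replace(HAMZA, "")
    if ta_marbuta_to is not None:  # target letter is not specified in the text
        word = word.replace(TA_MARBUTA, ta_marbuta_to)
    return word


def unique_bigrams(text):
    """Steps 2-3: the set of unique pairs of consecutive letters."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(word, root):
    """Step 5: S = 2C / (A + B), assumed form of Equation (1)."""
    a = unique_bigrams(word)
    b = unique_bigrams(root)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def extract_root(word, roots):
    """Steps 4-7: return the candidate root with the highest similarity."""
    word = normalize(word)
    return max(roots, key=lambda root: similarity(word, root))
```

For instance, with the toy list كتب, درس and علم, calling `extract_root` on the word كاتب selects كتب, whose similarity is 2·1/(3 + 2) = 0.4, while the other two candidates share no bi-gram with the word. When several roots tie, `max` returns the first one in the list; the text does not say how ties are broken.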

EXPERIMENTAL RESULTS
To examine the proposed algorithm, the researchers designed a corpus consisting of 141 roots chosen from the list of 4500 roots. The corpus contains 6308 morphological forms derived from these 141 roots. Of these morphological forms, 1318 belong to 21 vowel roots. Figure 1 shows an example of the morphological forms used in the experiments for the root "كتب" (i.e., write).
After running the proposed algorithm on the designed corpus, the results were as follows.

Triliteral Strong Roots
When examining the morphological forms of strong triliteral roots, which contain no vowel, the results were as shown in Table 1.

Triliteral Roots with Vowels
When examining the morphological forms of the triliteral roots with vowels, the results were as shown in Table 2. The results are similar to those for the strong triliteral roots because the method is statistical and does not take the vowels into account.

All Roots
When examining all morphological forms from the designed corpus the results were as shown in Table 3.
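A per-category breakdown of this kind could be computed along the following lines. This is a minimal sketch rather than the authors' evaluation code: the function names are hypothetical, the extractor is passed in as a callable, and a root is assumed to be vocalic when it contains one of the letters ا, و or ي.

```python
from collections import defaultdict

# Assumption: the vowel letters are alif (U+0627), waw (U+0648), ya (U+064A).
VOWEL_LETTERS = set("\u0627\u0648\u064A")


def root_type(root):
    """Classify a root as 'vocalic' (contains a vowel letter) or 'strong'."""
    return "vocalic" if VOWEL_LETTERS & set(root) else "strong"


def accuracy_by_root_type(samples, extract):
    """Return the fraction of correctly extracted roots per root type.

    samples: iterable of (word, true_root) pairs.
    extract: callable mapping a word to its predicted root.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for word, true_root in samples:
        kind = root_type(true_root)
        correct = extract(word) == true_root
        for key in (kind, "all"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}
```

The returned dictionary has one entry per root type that occurs in the samples plus an "all" entry, mirroring the three groups of results reported in this section.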

CONCLUSION
In this research, an improved algorithm for extracting Arabic roots was proposed using the bi-gram technique. The results showed that the proposed algorithm is capable of extracting the most likely root for nearly 80% of the strong roots by choosing the candidate root that has the highest similarity value with the given word. The proposed approach extracted the vocalic roots at a ratio similar to that of the strong roots. Our future plan is to improve the proposed algorithm by using morphological rules and artificial intelligence techniques to enhance the preliminary results obtained from the similarity values.