Research Article Open Access

Statistical Part-of-Speech Tagger for Traditional Arabic Texts

Yahya O. Mohamed Elhadj1
  • 1 Afghanistan
Journal of Computer Science
Volume 5 No. 11, 2009, 794-800

DOI: https://doi.org/10.3844/jcssp.2009.794.800

Submitted On: 3 August 2009 Published On: 30 November 2009

How to Cite: Elhadj, Y. O. M. (2009). Statistical Part-of-Speech Tagger for Traditional Arabic Texts. Journal of Computer Science, 5(11), 794-800. https://doi.org/10.3844/jcssp.2009.794.800

Abstract

Problem statement: This study presents the development of an Arabic part-of-speech tagger that can be used for analyzing and annotating traditional Arabic texts, especially the text of the Quran. Approach: The work is part of a project on the computerization of the Holy Quran, one of whose main objectives is to build a textual corpus of the Holy Quran. Results: Since an appropriate textual version of the Holy Quran was prepared and morphologically analyzed in other stages of the project, this work focused on its annotation by developing and using an appropriate tagger. The developed tagger combines morphological analysis with Hidden Markov Models (HMMs) based on the Arabic sentence structure. Because Arabic is a derivational language, the morphological analysis is used to reduce the size of the tag lexicon by segmenting Arabic words into their prefixes, stems and suffixes. The HMM, in turn, is used to represent the Arabic sentence structure so that linguistic combinations are taken into account. For these purposes, a tagging system was proposed that represents the main Arabic parts of speech hierarchically, allowing easy expansion whenever needed. Each tag in this system represents a possible state of the HMM, and the transitions between tags (states) are governed by the syntax of the sentence. A corpus of traditional texts, extracted from books of the third century (Hijri), was manually morphologically analyzed and tagged using the developed tagset. Conclusion/Recommendations: This corpus was then used for training and testing the model. Experiments conducted on this dataset gave a recognition rate of about 96%, which is very promising given the amount of data tagged so far and used in training. Since the Holy Quran corpus is still under revision, no significant experiments were conducted on it. However, preliminary tests on the seven verses of Al-Fatiha showed an encouraging accuracy rate.
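
As an illustration of the HMM formulation described above, where tags are states and tag-to-tag transitions model sentence structure, the following is a minimal Python sketch of Viterbi decoding. It is not the tagger developed in this work: the tag names (NOUN, VERB, PART), the transliterated tokens and all probabilities are hypothetical placeholders, the function name viterbi and its floor parameter are assumptions, and the tokens are whole words rather than the prefix, stem and suffix segments that the morphological analysis would supply.

```python
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for tokens under a bigram HMM.

    Probabilities missing from a table fall back to `floor` (an assumed
    smoothing constant, not taken from the article).
    """
    if not tokens:
        return []

    def lp(table, key):
        # Log-probability lookup with a small floor for unseen events.
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at position i,
    #               tag chosen at position i - 1 on that path)
    best = [{t: (lp(start_p, t) + lp(emit_p[t], tokens[0]), None) for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p], t), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t], tokens[i]), prev)
        best.append(column)

    # Backtrace from the best final state.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Hypothetical toy model; every number below is invented for illustration.
TAGS = ["NOUN", "VERB", "PART"]
START = {"NOUN": 0.3, "VERB": 0.5, "PART": 0.2}
TRANS = {
    "NOUN": {"NOUN": 0.4, "VERB": 0.2, "PART": 0.4},
    "VERB": {"NOUN": 0.7, "VERB": 0.05, "PART": 0.25},
    "PART": {"NOUN": 0.8, "VERB": 0.15, "PART": 0.05},
}
EMIT = {
    "NOUN": {"kitab": 0.5, "walad": 0.4, "kataba": 0.1},
    "VERB": {"kataba": 0.9, "kitab": 0.1},
    "PART": {"fi": 1.0},
}

print(viterbi(["kataba", "walad", "fi", "kitab"], TAGS, START, TRANS, EMIT))
# prints ['VERB', 'NOUN', 'PART', 'NOUN']
```

In a trained tagger, the start, transition and emission tables would be estimated from the manually tagged corpus rather than written by hand, and working in log space avoids numerical underflow on long sentences.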

Keywords

  • Hidden Markov models
  • Arabic morphological analysis
  • Holy Quran
  • text corpus
  • classical Arabic
  • modern standard Arabic