Research Article Open Access

Statistical Bayesian Learning for Automatic Arabic Text Categorization

Bassam Al-Salemi1 and Mohd. Juzaiddin Ab Aziz2
  • 1 ,
  • 2 , Afghanistan
Journal of Computer Science
Volume 7 No. 1, 2011, 39-45

DOI: https://doi.org/10.3844/jcssp.2011.39.45

Submitted On: 19 October 2010 Published On: 27 December 2010

How to Cite: Al-Salemi, B. & Ab Aziz, M. J. (2011). Statistical Bayesian Learning for Automatic Arabic Text Categorization. Journal of Computer Science, 7(1), 39-45. https://doi.org/10.3844/jcssp.2011.39.45

Abstract

Problem statement: The rapid growth of online Arabic documents has made it necessary to categorize them automatically using Text Categorization techniques that are commonly applied to English. However, the complex morphology of Arabic and its large vocabulary make applying these techniques directly difficult and costly in time and effort.

Approach: We investigated Bayesian learning models to enhance Arabic Automatic Text Categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Naïve Bayes (NB), Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. Texts were represented using Bag-of-Words (BOW) and character-level n-grams of length 3, 4 and 5. To reduce the dimensionality of the feature space, four feature selection methods were used: Mutual Information, Chi-Square statistic, Odds Ratio and GSS-coefficient.

Conclusion: The MBNB classifier outperformed both the NB and MNB classifiers. The BOW representation led to the best classification performance; nevertheless, character-level n-grams gave satisfactory results for Arabic ATC based on Bayesian learning. Moreover, the use of feature selection methods dramatically increased categorization performance.
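To make the approach concrete, the following is a minimal Python sketch of two ingredients named in the abstract: character-level n-gram features and a multivariate Bernoulli Naïve Bayes classifier. It is an illustration only, not the authors' implementation. The names `char_ngrams` and `BernoulliNB`, the toy documents, the choice to extract n-grams within each whitespace-separated token, and the add-one (Laplace) smoothing are all assumptions; stemming and feature selection are omitted.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the set of character n-grams found inside each token.

    Assumption: n-grams do not cross token boundaries; tokens shorter
    than n contribute nothing.
    """
    grams = set()
    for token in text.split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


class BernoulliNB:
    """Multivariate Bernoulli Naive Bayes over binary (present/absent) features."""

    def fit(self, docs, labels):
        # docs is a list of feature sets, one per training document.
        self.vocab = set().union(*docs)
        self.class_count = Counter(labels)
        self.doc_freq = defaultdict(Counter)  # label -> feature -> document count
        for feats, label in zip(docs, labels):
            self.doc_freq[label].update(feats)
        self.n_docs = len(docs)
        return self

    def predict(self, feats):
        best_label, best_score = None, -math.inf
        for label, n_c in self.class_count.items():
            score = math.log(n_c / self.n_docs)  # log prior
            for term in self.vocab:
                # Add-one smoothed estimate of P(term present | class).
                p = (self.doc_freq[label][term] + 1) / (n_c + 2)
                # The Bernoulli model also scores the absence of a term.
                score += math.log(p) if term in feats else math.log(1 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus, used only to exercise the code.
train_texts = [
    ("كرة القدم مباراة", "sports"),
    ("فريق مباراة هدف", "sports"),
    ("بنك سوق اقتصاد", "economy"),
    ("سوق اسهم بنك", "economy"),
]
train_feats = [char_ngrams(text, 3) for text, _ in train_texts]
train_labels = [label for _, label in train_texts]
model = BernoulliNB().fit(train_feats, train_labels)
prediction = model.predict(char_ngrams("مباراة فريق", 3))
```

Unlike the multinomial model, which scores only the terms that occur in a document (weighted by frequency), the Bernoulli model iterates over the whole vocabulary and also penalizes absent terms, which is why `predict` loops over `self.vocab` rather than over `feats`.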

Keywords

  • Arabic text categorization
  • Bayesian learning
  • Feature Selection
  • Automatic Text Categorization
  • Multinomial Naïve Bayes
  • Multivariate Bernoulli Naïve Bayes
  • Odds Ratio (OR)
  • Information Gain (IG)
  • Mutual Information (MI)