Statistical Bayesian Learning for Automatic Arabic Text Categorization

Bassam Al-Salemi; Mohd. Juzaiddin Ab Aziz

doi:10.3844/jcssp.2011.39.45

Research Article Open Access

Abstract

Problem statement: The rapid growth of online Arabic documents calls for categorizing them automatically with the Text Categorization techniques commonly used for English. However, the complex morphology of Arabic and its large vocabulary make applying these techniques directly difficult and costly in time and effort. Approach: We investigated Bayesian learning models to enhance Arabic Automatic Text Categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Naïve Bayes (NB), Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, Bag-of-Words (BOW) and character-level n-grams of length 3, 4 and 5 were used. To reduce the dimensionality of the feature space, four feature selection methods were used: Mutual Information, Chi-Square statistic, Odds Ratio and GSS-coefficient. Conclusion: The MBNB classifier outperformed both the NB and MNB classifiers. The BOW representation led to the best classification performance; nevertheless, character-level n-grams gave satisfactory results for Arabic ATC based on Bayesian learning. Moreover, the use of feature selection methods dramatically increased categorization performance.
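To make the approach concrete, the following is a minimal sketch of two ingredients named in the abstract: character-level n-gram features and a multivariate Bernoulli Naïve Bayes classifier. It is an illustration under stated assumptions, not the authors' implementation. The names `char_ngrams` and `BernoulliNB`, the add-one smoothing, and the four toy documents with their `sport` and `economy` labels are all hypothetical. Stemming and feature selection are omitted.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of character-level n-grams in `text` (presence only)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class BernoulliNB:
    """Multivariate Bernoulli Naive Bayes with add-one smoothing (assumed)."""

    def fit(self, docs, labels):
        # docs: list of feature sets; labels: parallel list of class names
        self.vocab = set().union(*docs)
        self.classes = sorted(set(labels))
        n_docs = defaultdict(int)                  # documents per class
        df = defaultdict(lambda: defaultdict(int)) # per-class document frequency
        for feats, c in zip(docs, labels):
            n_docs[c] += 1
            for f in feats:
                df[c][f] += 1
        total = len(docs)
        self.log_prior = {c: math.log(n_docs[c] / total) for c in self.classes}
        # Smoothed estimate of P(feature present | class)
        self.p = {
            c: {f: (df[c][f] + 1) / (n_docs[c] + 2) for f in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, feats):
        best, best_score = None, -math.inf
        for c in self.classes:
            score = self.log_prior[c]
            # The Bernoulli model scores every vocabulary feature,
            # present or absent; out-of-vocabulary features are ignored.
            for f in self.vocab:
                p = self.p[c][f]
                score += math.log(p if f in feats else 1.0 - p)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy training data (hypothetical, for illustration only)
train_texts = ["كرة القدم", "مباراة كرة", "سوق المال", "البنك والمال"]
train_labels = ["sport", "sport", "economy", "economy"]

model = BernoulliNB().fit([char_ngrams(t, 3) for t in train_texts], train_labels)
prediction = model.predict(char_ngrams("كرة السلة", 3))
print(prediction)
```

Unlike the multinomial model, which scores only the terms that occur and weights them by frequency, the Bernoulli model records presence or absence and also penalizes a class for each of its typical features missing from the document.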

Journal of Computer Science

Volume 7 No. 1, 2011, 39-45

DOI: https://doi.org/10.3844/jcssp.2011.39.45

Submitted On: 19 October 2010

Published On: 27 December 2010

How to Cite: Al-Salemi, B. & Ab Aziz, M. J. (2011). Statistical Bayesian Learning for Automatic Arabic Text Categorization. Journal of Computer Science, 7(1), 39-45. https://doi.org/10.3844/jcssp.2011.39.45

Copyright: © 2011 Bassam Al-Salemi and Mohd. Juzaiddin Ab Aziz. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Keywords

Arabic text categorization
Bayesian learning
Feature Selection (FS)
Automatic Text Categorization
Multinomial Naïve Bayes
Multivariate Bernoulli Naïve Bayes
Odds Ratio (OR)
Information Gain (IG)
Mutual Information (MI)