Statistical Bayesian Learning for Automatic Arabic Text Categorization
Problem statement: The rapid growth of online Arabic documents has made it necessary to apply Text Categorization techniques, commonly used for English, to categorize them automatically. The complex morphology of Arabic and its large vocabulary make applying these techniques directly difficult and costly in time and effort. Approach: We investigated Bayesian learning models to enhance Arabic Automatic Text Categorization (ATC). Three classifiers based on Bayes' theorem were implemented: Simple Naïve Bayes (NB), Multivariate Bernoulli Naïve Bayes (MBNB) and Multinomial Naïve Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, Bag-Of-Words (BOW) and character-level n-grams of length 3, 4 and 5 were used. To reduce the dimensionality of the feature space, four feature selection methods were used: Mutual Information, Chi-Square statistic, Odds Ratio and GSS-coefficient. Conclusion: The MBNB classifier outperformed both the NB and MNB classifiers. The BOW representation led to the best classification performance; nevertheless, character-level n-grams gave satisfactory results for Arabic ATC based on Bayesian learning. Moreover, the use of feature selection methods dramatically increased categorization performance.
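To make the approach concrete, the following is a minimal sketch of a multivariate Bernoulli Naïve Bayes classifier over character-level 3-grams. It is an illustration under stated assumptions, not the authors' implementation: the function names (`char_ngrams`, `train_bernoulli_nb`, `predict_bernoulli_nb`), the toy documents and the category labels are hypothetical; n-grams are taken within each whitespace-separated token; Laplace (add-one) smoothing is assumed; and stemming and feature selection are omitted.

```python
import math
from collections import defaultdict


def char_ngrams(text, n):
    """Return the set of character n-grams found inside each token of text."""
    grams = set()
    for token in text.split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def train_bernoulli_nb(docs, labels):
    """Fit a Bernoulli NB model. docs is a list of feature sets, one per document."""
    vocab = set().union(*docs)
    by_class = defaultdict(list)
    for feats, label in zip(docs, labels):
        by_class[label].append(feats)
    model = {"vocab": vocab, "prior": {}, "cond": {}}
    for label, members in by_class.items():
        model["prior"][label] = len(members) / len(docs)
        # Assumed Laplace smoothing of P(feature present | class).
        model["cond"][label] = {
            f: (sum(f in d for d in members) + 1) / (len(members) + 2)
            for f in vocab
        }
    return model


def predict_bernoulli_nb(model, feats):
    """Return the class with the highest log-posterior for one feature set."""
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for f in model["vocab"]:
            p = model["cond"][label][f]
            # Bernoulli model: absent features contribute evidence too.
            score += math.log(p if f in feats else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus: two sports and two economy snippets.
train_texts = ["مباراة كرة القدم", "فريق كرة السلة", "سوق المال والبنك", "اقتصاد السوق"]
train_labels = ["sports", "sports", "economy", "economy"]

train_docs = [char_ngrams(t, 3) for t in train_texts]
model = train_bernoulli_nb(train_docs, train_labels)
predicted = predict_bernoulli_nb(model, char_ngrams("كرة اليد", 3))
print(predicted)
```

A multinomial variant would replace the presence/absence features with n-gram counts and drop the contribution of absent features; the sketch keeps the Bernoulli form because the abstract reports it as the best performer.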
Copyright: © 2011 Bassam Al-Salemi and Mohd. Juzaiddin Ab Aziz. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- Arabic text categorization
- Bayesian learning
- Feature Selection (FS)
- Automatic Text Categorization
- Multinomial Naïve Bayes
- Multivariate Bernoulli Naïve Bayes
- Odds Ratio (OR)
- Information Gain (IG)
- Mutual Information (MI)