Journal of Computer Science

Heuristic Lemmatization for Arabic Texts Indexation and Classification

Faten Khalfallah Hammouda and Abdelsalam Abdelhamid Almarimi

DOI: 10.3844/jcssp.2010.660.665

Volume 6, Issue 6

Pages 660-665

Abstract

Problem statement: This study proposed a system based on heuristic lemmatization for indexing and classifying Arabic texts. Such a system is needed by many NLP applications, such as information retrieval and automatic summarization. The system did not rely on any linguistic rules. The proposed method was limited to five domains: sports, medicine, politics, economics and agriculture. The main idea was to collect texts from the chosen domains and study them by extracting their pertinent terms. Approach: Every input text went through a formatting stage that removed words and letters carrying no importance for the meaning. After that, the average of the term frequencies was calculated to classify the text and its related domain. Results: The main purpose of the System of Indexation and Classification of Arabic Texts (SICAT) is to assign an unknown text to its suitable domain, that is, to detect the theme of the text. To do this, we applied a method based on pertinent-term correspondence: all pertinent terms of the text to be classified are matched against the keywords of every domain in the corpus. The domain with the largest number of terms matching those of the text is taken as the theme of the unknown text. Conclusion: The system has two main parts: indexation and classification. The indexation stage is composed of three main parts: pre-learning, lemmatization and frequency calculation. The classification stage is composed of two main components: keyword extraction and classification of new text. We ran many verification tests to validate the system. System performance, evaluated on the chosen domains, reached 90% precision and 85% recall.
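The term-correspondence step described in the abstract can be illustrated with a minimal sketch. It is not the authors' SICAT implementation. The keyword sets, the stop-word list and the function names `pertinent_terms` and `classify` are hypothetical. English placeholder words stand in for Arabic terms, and the heuristic lemmatization stage is omitted. The sketch only shows the idea of counting how many terms of a text match the keywords of each domain and choosing the domain with the most matches.

```python
from collections import Counter

# Hypothetical domain keyword sets (English placeholders, not the authors' corpus).
DOMAIN_KEYWORDS = {
    "sports": {"match", "team", "player", "goal"},
    "medicine": {"patient", "doctor", "treatment", "disease"},
    "economics": {"market", "bank", "price", "trade"},
}

# Hypothetical list of words dropped during the formatting stage.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}


def pertinent_terms(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def classify(text):
    """Return (best_domain, match_counts); best_domain is None if nothing matches."""
    term_counts = Counter(pertinent_terms(text))
    # For each domain, sum the occurrences of its keywords in the text.
    scores = {
        domain: sum(term_counts[kw] for kw in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


domain, scores = classify("The team scored a goal in the match.")
print(domain, scores)
```

In this sketch a text with no matching keyword is left unclassified (`None`) rather than forced into a domain. That is a design choice of the example, not something stated in the abstract.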

Copyright

© 2010 Faten Khalfallah Hammouda and Abdelsalam Abdelhamid Almarimi. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.