EthioLSocMDMTLM: Exploring Application of Topic Modeling for Building Ethiopian Language Social Media Data-Based Multilingual Transformer Language Models for Multilingual Hateful Content Detection
- 1 Software Engineering, College of Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
- 2 School of Information Science, College of Natural and Computational Science, Addis Ababa University, Addis Ababa, Ethiopia
Abstract
This study proposes using topic modeling techniques to develop Ethiopian Language Social Media Data-Based Multilingual Transformer Language Models for multilingual hateful content detection. We modified various multilingual pre-trained models, investigated the challenges of using pre-trained transformer language models, and built multilingual hateful content detection models. Topic-word sets of 1561, 70, and 1044 rows were extracted from the Afaan Oromo, Amharic, and Tigrigna data and used to train the transformers. The proposed models were also tested by developing a multilingual hateful content detection model for low-resource Ethiopian languages using deep learning techniques. A total of 45522, 59529, and 48882 text documents of Amharic, Afaan Oromo, and Tigrigna were collected, and three annotators labeled the data into binary classes; the inter-annotator agreement was 87% for Amharic, 82% for Tigrigna, and 84% for Afaan Oromo. LSTM, CNN, and BiLSTM deep learning algorithms were applied, integrating EthioLan_mBERT, EthioLan_BERT, and EthioLan_XLM-Roberta contextual embeddings. Among the applied techniques, LSTM + EthioLan_mBERT performed best, with an F1 score of 81%. We publicly release the modified pre-trained models, dataset, and related code.
DOI: https://doi.org/10.3844/jcssp.2025.250.262
Copyright: © 2025 Naol Bakala Defersha, Kula Kekeba Tune and Solomon Teferra Abate. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Afaan Oromo
- Low Resource Languages
- Amharic
- Hateful Content
- EthioLSocMDMTLM
- Transformer Language Model
- Tigrigna