Exploring Data Augmentation for Gender-Based Hate Speech Detection

Muhammad Amien Ibrahim; Samsul Arifin; Eko Setyo Purwanto

doi:10.3844/jcssp.2023.1222.1230

Research Article Open Access

Muhammad Amien Ibrahim¹, Samsul Arifin² and Eko Setyo Purwanto¹

¹ Department of Computer Science, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
² Department of Statistics, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Abstract

Social media moderation is crucial for establishing healthy online communities and ensuring online safety from hate speech and offensive language. In many cases, hate speech targets a specific gender and may be expressed in many different languages on social media platforms, such as Indonesian-language Twitter. However, data scarcity and class imbalance in gender-based hate speech datasets of Indonesian tweets have slowed the development and deployment of automatic social media moderation. Obtaining more samples can be costly in terms of the resources required to gather and annotate the data. This study examines the use of data augmentation methods to increase the size of a textual dataset while maintaining the quality of the augmented data. Three augmentation strategies are explored: random insertion, back translation, and a sequential combination of back translation and random insertion. Additionally, the study examines whether the augmented data preserve their original labels. The results demonstrate that classification models trained on augmented data generated by the random insertion strategy outperform those trained with the other approaches. In terms of label preservation, all three augmentation approaches are shown to preserve labels sufficiently without compromising the meaning of the augmented data. The findings imply that, by increasing the size of the dataset while preserving the original labels, data augmentation could be used to address issues such as data scarcity and dataset imbalance.
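To make the three strategies concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that random insertion adds synonyms of randomly chosen words at random positions (the common "easy data augmentation" variant; the abstract does not specify the exact procedure) and that back translation is supplied as two caller-provided translation functions. The synonym table, the function names, and the English toy vocabulary are all hypothetical and used for illustration only.

```python
import random

# Hypothetical toy synonym table (assumption); a real system would use a
# lexical resource for the target language.
SYNONYMS = {
    "bad": ["awful", "terrible"],
    "women": ["ladies"],
}


def random_insertion(tokens, n_insertions, synonyms, rng):
    """Insert synonyms of randomly chosen tokens at random positions.

    Returns a new list; the original tokens keep their relative order.
    If no token has a known synonym, the tokens are returned unchanged.
    """
    augmented = list(tokens)
    candidates = [t for t in tokens if t in synonyms]
    if not candidates:
        return augmented
    for _ in range(n_insertions):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source])
        position = rng.randint(0, len(augmented))  # inclusive upper bound
        augmented.insert(position, new_word)
    return augmented


def back_translate(text, to_pivot, from_pivot):
    """Translate to a pivot language and back.

    `to_pivot` and `from_pivot` are caller-supplied callables (assumption);
    no particular translation service is implied.
    """
    return from_pivot(to_pivot(text))


def sequential_augment(text, to_pivot, from_pivot, n_insertions, synonyms, rng):
    """Back translation followed by random insertion on the result."""
    paraphrase = back_translate(text, to_pivot, from_pivot)
    tokens = random_insertion(paraphrase.split(), n_insertions, synonyms, rng)
    return " ".join(tokens)


def augment_with_label(example, augment_fn):
    """Apply an augmentation to the text and carry the label over unchanged."""
    text, label = example
    return (augment_fn(text), label)
```

In this sketch the augmented sample simply inherits the label of its source text, which is exactly the assumption that a label preservation check is meant to verify.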

Journal of Computer Science

Volume 19 No. 10, 2023, 1222-1230

DOI: https://doi.org/10.3844/jcssp.2023.1222.1230

Submitted On: 14 April 2023
Published On: 29 September 2023

How to Cite: Ibrahim, M. A., Arifin, S. & Purwanto, E. S. (2023). Exploring Data Augmentation for Gender-Based Hate Speech Detection. Journal of Computer Science, 19(10), 1222-1230. https://doi.org/10.3844/jcssp.2023.1222.1230

Copyright: © 2023 Muhammad Amien Ibrahim, Samsul Arifin and Eko Setyo Purwanto. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Dataset
Data Augmentation
Hate Speech Detection