Research Article Open Access

Exploring Data Augmentation for Gender-Based Hate Speech Detection

Muhammad Amien Ibrahim (1), Samsul Arifin (2) and Eko Setyo Purwanto (1)
  • 1 Department of Computer Science, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
  • 2 Department of Statistics, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Abstract

Social media moderation is crucial for establishing healthy online communities and ensuring online safety from hate speech and offensive language. In many cases, hate speech targets a specific gender and may be expressed in many different languages on social media platforms, including Indonesian-language Twitter. However, data scarcity and the class imbalance of gender-based hate speech datasets of Indonesian tweets have slowed the development and deployment of automatic social media moderation. Obtaining more samples can be costly because of the resources required to gather and annotate the data. This study examines the use of data augmentation methods to increase the size of a textual dataset while maintaining the quality of the augmented data. Three augmentation strategies are explored: random insertion, back translation, and a sequential combination of back translation followed by random insertion. Additionally, the study examines whether the labels of the augmented data are preserved. The results show that classification models trained on data generated by the random insertion strategy outperform those trained with the other approaches. In terms of label preservation, all three augmentation approaches are shown to preserve labels sufficiently without compromising the meaning of the augmented data. The findings imply that, by increasing the size of the dataset while preserving the original labels, data augmentation can help address issues such as data scarcity and dataset imbalance.
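To make the random insertion strategy concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a common formulation in which a synonym of a randomly chosen word is inserted at a random position, and it assumes the augmented sample simply inherits the label of its source text. The synonym table `SYNONYMS` and the helper names `random_insertion` and `augment_dataset` are hypothetical and serve only as illustration.

```python
import random

# Hypothetical toy synonym table, for illustration only.
# A real setup would need an Indonesian lexical resource.
SYNONYMS = {
    "bagus": ["baik", "hebat"],
    "jelek": ["buruk"],
}


def random_insertion(tokens, n_insertions, synonyms, rng):
    """Return a copy of tokens with n_insertions synonyms added.

    Each inserted word is a synonym of a randomly chosen original
    token and is placed at a random position. If no token has a
    known synonym, the tokens are returned unchanged.
    """
    augmented = list(tokens)
    candidates = [t for t in tokens if t in synonyms]
    if not candidates:
        return augmented
    for _ in range(n_insertions):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source])
        # randint is inclusive, so the word may also be appended.
        position = rng.randint(0, len(augmented))
        augmented.insert(position, new_word)
    return augmented


def augment_dataset(samples, n_insertions, synonyms, seed=0):
    """Augment (text, label) pairs; each new text keeps its source label."""
    rng = random.Random(seed)
    augmented = []
    for text, label in samples:
        tokens = random_insertion(text.split(), n_insertions, synonyms, rng)
        augmented.append((" ".join(tokens), label))
    return augmented
```

Because the original tokens and their order are left intact, the augmented text stays close to the source, which is one plausible reason label preservation holds for this kind of strategy. Under the same assumptions, the sequential combination would apply `random_insertion` to the output of a back-translation step.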

Journal of Computer Science
Volume 19 No. 10, 2023, 1222-1230

DOI: https://doi.org/10.3844/jcssp.2023.1222.1230

Submitted On: 14 April 2023
Published On: 29 September 2023

How to Cite: Ibrahim, M. A., Arifin, S. & Purwanto, E. S. (2023). Exploring Data Augmentation for Gender-Based Hate Speech Detection. Journal of Computer Science, 19(10), 1222-1230. https://doi.org/10.3844/jcssp.2023.1222.1230

Keywords

  • Dataset
  • Data Augmentation
  • Hate Speech Detection