Long Short-Term Memory Model for Classification of English-PtBR Cross-Lingual Hate Speech
- 1 Instituto Federal de Sergipe, Brazil
- 2 Universidade Federal de Sergipe, Brazil
- 3 Universidade Federal de Pernambuco, Brazil
- 4 Universidade Tiradentes, Brazil
Automatic and accurate recognition of hate speech is a difficult task. In addition to the inherent ambiguity of natural language, it requires a deep understanding of linguistic structure. Discriminatory discourse usually does not rely on typical expressions and often makes use of sarcasm. Good world knowledge and assessment of context are therefore needed. Several approaches have been proposed for automating the hate speech recognition task. Many of them combine strategies in order to achieve better results: character-based or word-based N-grams, lexical features such as the presence or absence of negative words, classes or expressions indicative of insult, punctuation marks, repetition of letters, the presence of emoji, etc. Used alone, linguistic features such as POS tagging have proven ineffective. The recent use of neural networks to create a distributed representation of the sentences within a hate speech corpus is a promising path. Unfortunately, providing such a corpus is hard. Except for the English language, hate speech corpora are rarely found. This work proposes a cross-lingual approach to automatically recognize hate speech in Portuguese, leveraging the knowledge of English corpora. A deep Long Short-Term Memory (LSTM) model was trained and several experimental scenarios were set up, involving embeddings, TF-IDF, N-grams and the GloVe vocabulary, among others. Finally, a Gradient Boosting Decision Tree (GBDT) was used to improve classification results. We achieved accuracy of up to 70% in the best scenarios. Two important contributions of this work are: (i) an effective approach to deal with the lack of hate speech corpora in the desired language and (ii) a hate speech database in Portuguese as a contribution to the research community.
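As a rough illustration of two of the feature types mentioned in the abstract, character-based N-grams and TF-IDF weighting, the following minimal Python sketch builds TF-IDF vectors over character trigrams. It is not the authors' pipeline: the function names `char_ngrams` and `tfidf_vectors`, the trigram size, the smoothed idf formula and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Map each document to a {ngram: tf-idf weight} dictionary.

    Assumed weighting: raw term frequency times a smoothed idf,
    log((1 + N) / (1 + df)) + 1, where N is the number of documents
    and df is the number of documents containing the n-gram.
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    vectors = []
    for c in counts:
        vectors.append({
            gram: tf * (math.log((1 + n_docs) / (1 + df[gram])) + 1.0)
            for gram, tf in c.items()
        })
    return vectors


# Hypothetical toy sentences, not taken from the paper's corpus.
docs = ["you are great", "you are awful", "great weather today"]
vectors = tfidf_vectors(docs)
```

In this sketch, a trigram that occurs in a single document (such as "awf") receives a higher weight than one shared by several documents (such as "you"). Sparse vectors of this kind could then be passed to a classifier; the abstract mentions an LSTM and a GBDT, neither of which is reproduced here.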
Copyright: © 2019 Thiago D. Bispo, Hendrik T. Macedo, Flávio de O. Santos, Rafael P. da Silva, Leonardo N. Matos, Bruno O.P. Prado, Gilton J.F. da Silva and Adolfo Guimarães. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- Hate Speech
- Portuguese Language
- Deep Learning
- (Bi) LSTM