Investigating the Effect of Preprocessing Arabic Text on Offensive Language and Hate Speech Detection

Authors: Fatemah Husain, Ozlem Uzuner

Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4

Article No.: 73, Pages 1 - 20

https://doi.org/10.1145/3501398

Published: 19 January 2022

Abstract

Preprocessing of input text can play a key role in text classification by reducing dimensionality and removing unnecessary content. This study investigates the impact of preprocessing on Arabic offensive language classification. We explore six preprocessing techniques: conversion of emojis to Arabic textual labels, normalization of different forms of Arabic letters, normalization of selected nouns from dialectal Arabic to Modern Standard Arabic, conversion of selected hyponyms to hypernyms, hashtag segmentation, and basic cleaning such as removing numbers, kashidas, diacritics, and HTML tags. We also experiment with raw text and with a combination of all six preprocessing techniques. To analyze the impact of preprocessing, we apply several types of classifiers, including traditional machine learning, ensemble machine learning, Artificial Neural Networks, and Bidirectional Encoder Representations from Transformers (BERT)-based models. Our results demonstrate significant variations in the effects of preprocessing on each classifier type and on each dataset. Classifiers that are based on BERT do not benefit from preprocessing, while traditional machine learning classifiers do. However, these results would benefit from validation on larger datasets that cover broader domains and dialects.
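
Two of the techniques named in the abstract, basic cleaning and letter normalization, can be illustrated with a minimal Python sketch. The abstract does not list the exact rules the authors applied, so everything below is an assumption: the function names `basic_clean` and `normalize_letters` are hypothetical, and the character mappings (alef variants to bare alef, alef maksura to yeh, teh marbuta to heh) follow a common convention in Arabic text processing rather than the paper's confirmed method.

```python
import re

# Assumed patterns; the abstract does not specify the exact rules used.
HTML_TAG = re.compile(r"<[^>]+>")            # simple tags such as <b> or </p>
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun
KASHIDA = "\u0640"                           # tatweel (elongation character)
DIGITS = re.compile(r"[0-9\u0660-\u0669]")   # ASCII and Arabic-Indic digits

# Assumed letter mappings (a common convention, not confirmed by the paper).
LETTER_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})


def basic_clean(text: str) -> str:
    """Remove HTML tags, diacritics, kashidas, and digits, then tidy spaces."""
    text = HTML_TAG.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(KASHIDA, "")
    text = DIGITS.sub("", text)
    return " ".join(text.split())  # collapse runs of whitespace


def normalize_letters(text: str) -> str:
    """Map variant letter forms to a single canonical form."""
    return text.translate(LETTER_MAP)
```

A caller might chain the two steps, for example `normalize_letters(basic_clean(tweet))`, and compare classifier scores against the raw text, which mirrors the raw-versus-preprocessed comparison the abstract describes.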

Index Terms

Computing methodologies > Artificial intelligence > Natural language processing > Information extraction

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4

July 2022

464 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3511099

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 January 2022

Accepted: 01 November 2021

Revised: 01 October 2021

Received: 01 December 2020

Published in TALLIP Volume 21, Issue 4

Qualifiers

Research-article
Refereed
