research-article

Investigating the Effect of Preprocessing Arabic Text on Offensive Language and Hate Speech Detection

Published: 19 January 2022

Abstract

Preprocessing input text can play a key role in text classification by reducing dimensionality and removing unnecessary content. This study investigates the impact of preprocessing on Arabic offensive language classification. We explore six preprocessing techniques: conversion of emojis to Arabic textual labels, normalization of different forms of Arabic letters, normalization of selected nouns from dialectal Arabic to Modern Standard Arabic, conversion of selected hyponyms to hypernyms, hashtag segmentation, and basic cleaning such as removing numbers, kashidas, diacritics, and HTML tags. We also experiment with raw text and with a combination of all six preprocessing techniques. To analyze the impact of preprocessing, we apply several types of classifiers: traditional machine learning, ensemble machine learning, Artificial Neural Networks, and Bidirectional Encoder Representations from Transformers (BERT)-based models. Our results demonstrate significant variations in the effects of preprocessing across classifier types and across datasets. BERT-based classifiers do not benefit from preprocessing, while traditional machine learning classifiers do. However, these results would benefit from validation on larger datasets that cover broader domains and dialects.
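
As a rough illustration of three of the techniques named in the abstract (basic cleaning, letter normalization, and hashtag segmentation), a minimal Python sketch follows. The abstract does not give the study's actual rules, so every pattern, the letter mapping, and the function names `basic_clean`, `normalize_letters`, and `segment_hashtags` are assumptions chosen for illustration. They reflect common practice in Arabic text processing and should not be read as the authors' implementation.

```python
import re

# All rules below are illustrative assumptions, not the article's rule set.
HTML_TAG = re.compile(r"<[^>]+>")              # simple HTML tags
DIACRITICS = re.compile(r"[\u064B-\u0652]")    # fathatan through sukun
KASHIDA = "\u0640"                             # tatweel (elongation mark)
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")    # ASCII and Arabic-Indic digits
WHITESPACE = re.compile(r"\s+")

# Hypothetical mapping of letter variants to a single form.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def basic_clean(text: str) -> str:
    """Remove HTML tags, diacritics, kashidas, and numbers."""
    text = HTML_TAG.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(KASHIDA, "")
    text = DIGITS.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def normalize_letters(text: str) -> str:
    """Collapse variant letter forms using the assumed mapping above."""
    return text.translate(LETTER_MAP)


def segment_hashtags(text: str) -> str:
    """Drop the '#' sign and split underscore-joined hashtag words."""
    return re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), text)
```

The functions are kept separate so that each technique can be switched on or off on its own, which mirrors the abstract's setup of testing techniques individually and in combination.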


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 21, Issue 4
    July 2022
    464 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3511099

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 January 2022
    Accepted: 01 November 2021
    Revised: 01 October 2021
    Received: 01 December 2020
    Published in TALLIP Volume 21, Issue 4

    Author Tags

    1. Artificial neural networks
    2. offensive language detection
    3. natural language processing
    4. Arabic language
    5. machine learning

    Qualifiers

    • Research-article
    • Refereed
