research-article

Investigating the Effect of Preprocessing Arabic Text on Offensive Language and Hate Speech Detection

Published: 19 January 2022

    Abstract

    Preprocessing of input text can play a key role in text classification by reducing dimensionality and removing unnecessary content. This study investigates the impact of preprocessing on Arabic offensive language classification. We explore six preprocessing techniques: conversion of emojis to Arabic textual labels, normalization of different forms of Arabic letters, normalization of selected nouns from dialectal Arabic to Modern Standard Arabic, conversion of selected hyponyms to hypernyms, hashtag segmentation, and basic cleaning such as removing numbers, kashidas, diacritics, and HTML tags. We also experiment with raw text and with a combination of all six preprocessing techniques. To analyze the impact of preprocessing, we apply different types of classifiers, including traditional machine learning, ensemble machine learning, Artificial Neural Networks, and Bidirectional Encoder Representations from Transformers (BERT)-based models. Our results demonstrate significant variations in the effects of preprocessing on each classifier type and on each dataset. BERT-based classifiers do not benefit from preprocessing, while traditional machine learning classifiers do. However, these results would benefit from validation on larger datasets that cover broader domains and dialects.
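
    As a rough illustration of three of the techniques named in the abstract (basic cleaning, letter normalization, and hashtag segmentation), the following Python sketch may help. It is not the authors' implementation: the function names, the letter mapping, and the underscore-based hashtag handling are assumptions chosen for illustration, and the paper's actual rules may differ.

```python
import html
import re

# Arabic diacritics (fathatan through sukun) plus the superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation character
HTML_TAG = re.compile(r"<[^>]+>")
# ASCII digits and Arabic-Indic digits.
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")

# Assumed mapping (a common convention, not taken from the paper).
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def basic_clean(text):
    """Remove HTML tags, numbers, kashidas, and diacritics."""
    text = HTML_TAG.sub(" ", text)
    text = html.unescape(text)          # decode entities such as &amp;
    text = DIGITS.sub(" ", text)
    text = text.replace(KASHIDA, "")
    text = DIACRITICS.sub("", text)
    return " ".join(text.split())       # collapse runs of whitespace


def normalize_letters(text):
    """Map variant letter forms to a single form using LETTER_MAP."""
    return text.translate(LETTER_MAP)


def segment_hashtag(text):
    """Naive segmentation: drop '#' and split the tag on underscores."""
    return re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), text)


def preprocess(text):
    """Apply the three illustrative steps in sequence."""
    return normalize_letters(basic_clean(segment_hashtag(text)))
```

    Whether to apply such steps at all depends on the classifier: according to the abstract, BERT-based models did not benefit from preprocessing, while traditional machine learning classifiers did.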


      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4
      July 2022
      464 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3511099

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 January 2022
      Accepted: 01 November 2021
      Revised: 01 October 2021
      Received: 01 December 2020
      Published in TALLIP Volume 21, Issue 4

      Author Tags

      1. Artificial neural networks
      2. offensive language detection
      3. natural language processing
      4. Arabic language
      5. machine learning

      Qualifiers

      • Research-article
      • Refereed

      Article Metrics

      • Downloads (Last 12 months): 150
      • Downloads (Last 6 weeks): 14
      Reflects downloads up to 26 Jul 2024

      Cited By

      • (2024) AraFast: Developing and Evaluating a Comprehensive Modern Standard Arabic Corpus for Enhanced Natural Language Processing. Applied Sciences 14:12 (5294). Online publication date: 19-Jun-2024. https://doi.org/10.3390/app14125294
      • (2024) Development of Pidgin English Hate Speech Classification System for Social Media. American Journal of Information Science and Technology 8:2 (34-44). Online publication date: 14-Jun-2024. https://doi.org/10.11648/j.ajist.20240802.12
      • (2024) Bridging the Kuwaiti Dialect Gap in Natural Language Processing. IEEE Access 12 (27709-27722). Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3364367
      • (2024) A comprehensive review on Arabic offensive language and hate speech detection on social media: methods, challenges and solutions. Social Network Analysis and Mining 14:1. Online publication date: 30-May-2024. https://doi.org/10.1007/s13278-024-01258-1
      • (2023) Review of Recent Trends in the Detection of Hate Speech and Offensive Language on Social Media. Acta Electrotechnica et Informatica 22:4 (18-24). Online publication date: 24-Jan-2023. https://doi.org/10.2478/aei-2022-0018
      • (2023) Advancing Arabic Hate Speech Detection via Neural Transfer Learning with BERT. 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON) (1-8). Online publication date: 29-Dec-2023. https://doi.org/10.1109/SMARTGENCON60755.2023.10441885
      • (2023) A Survey on Multi-Modal Hate Speech Detection. 2023 IEEE 11th Region 10 Humanitarian Technology Conference (R10-HTC) (225-230). Online publication date: 16-Oct-2023. https://doi.org/10.1109/R10-HTC57504.2023.10461793
      • (2023) The Effect of Document Length on Machine Learning Success in Text-Based Data. 2023 Innovations in Intelligent Systems and Applications Conference (ASYU) (1-6). Online publication date: 11-Oct-2023. https://doi.org/10.1109/ASYU58738.2023.10296594
      • (2023) arHateDetector: detection of hate speech from standard and dialectal Arabic Tweets. Discover Internet of Things 3:1. Online publication date: 20-Mar-2023. https://doi.org/10.1007/s43926-023-00030-9
      • (2023) A literature survey on multimodal and multilingual automatic hate speech identification. Multimedia Systems 29:3 (1203-1230). Online publication date: 20-Jan-2023. https://doi.org/10.1007/s00530-023-01051-8
