research-article

A Novel Knowledge-augmented Model Customization Approach for Arabic Offensive Language Detection

Published: 19 December 2023

Abstract

Previous studies have made multiple attempts to develop systems for detecting offensive language in online Arabic text. However, most of these attempts do not account for variation in Arabic dialects, cultures, and offensive phrases. In contrast, this study extracts knowledge from multiple offensive language datasets to build a cross-dialect, cross-cultural knowledge repository. This repository is used to develop a novel system architecture that customizes the AraBERT model so that dialectal knowledge and culture-specific offensive knowledge are preserved within the contextual word embeddings of the BERT architecture. Performance is evaluated with statistical evaluation metrics and a behavioral checklist. The results show that the customized model makes more effective predictions than the uncustomized one, particularly for offensive text. The customization process allows the model to gain more knowledge of informal text in general and a better understanding of dialectal Arabic.



        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 12
        December 2023
        194 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3638035

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 19 December 2023
        Online AM: 28 November 2023
        Accepted: 22 November 2023
        Revised: 06 October 2023
        Received: 09 June 2023
        Published in TALLIP Volume 22, Issue 12

        Author Tags

        1. Natural Language Processing
        2. neural networks
        3. offensive language detection
        4. knowledge-based
        5. collocations

        Qualifiers

        • Research-article
