research-article

Just Accepted

Syntax-aware Offensive Content Detection in Low-resourced Code-mixed Languages with Continual Pre-training

Authors:

Necva Bölücü and

Pelin Canbay

ACM Transactions on Asian and Low-Resource Language Information Processing

Accepted on 17 March 2024

https://doi.org/10.1145/3653450

Online AM: 26 March 2024

Abstract

Social media platforms host a vast amount of user-generated content, from which information about users’ thoughts can be extracted. Individuals express themselves freely on these platforms, often without constraints, even when the content is offensive or contains hate speech. Identifying and removing offensive content from social media is imperative to prevent individuals or groups from becoming targets of harmful language. Despite extensive research on offensive content detection, the problem remains unsolved for code-mixed languages, where it is compounded by imbalanced datasets and limited data sources. Most previous studies on detecting offensive content in these languages focus on creating datasets and applying deep neural networks, such as Recurrent Neural Networks (RNNs), or pre-trained language models (PLMs), such as BERT and its variants. Given the low-resource nature of these languages and their imbalanced datasets, this study investigates the efficacy of a syntax-aware BERT model with continual pre-training for identifying offensive content, and proposes a framework called Cont-Syntax-BERT that combines continual learning with continual pre-training. Comprehensive experiments show that the proposed Cont-Syntax-BERT framework outperforms state-of-the-art approaches on the DravidianCodeMix [10,19] and HASOC 2019 [37] datasets, demonstrating its adaptability to the challenges of code-mixed languages.
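The abstract does not state which objective is used for continual pre-training. As a rough illustration only, the sketch below assumes the masked-language-modelling objective of the original BERT [17], in which about 15% of tokens are selected and, of those, 80% are replaced by a mask symbol, 10% by a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy sentence, and the toy vocabulary are hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # BERT's mask symbol


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modelling (illustrative sketch).

    Returns (corrupted, labels). labels[i] is the original token at every
    selected position and None elsewhere, so a loss would only be computed
    on the selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with the mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)                      # position is not predicted
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    # Toy code-mixed style sentence and vocabulary (assumptions for illustration).
    sentence = ["this", "movie", "is", "semma", "mass"]
    toy_vocab = ["padam", "super", "film", "good"]
    print(mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=1))
```

Under this assumption, continual pre-training would repeat this corruption over unlabelled in-domain text and train the model to recover the labelled positions before fine-tuning on the offensive-content labels.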

References

[1]

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop. 72–78.

[2]

Vimala Balakrishnan, Vithyatheri Govindan, and Kumanan N Govaichelvan. 2023. Tamil Offensive Language Detection: Supervised versus Unsupervised Learning Approaches. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 4 (2023), 1–14.

[3]

Somnath Banerjee, Maulindu Sarkar, Nancy Agrawal, Punyajoy Saha, and Mithun Das. 2021. Exploring Transformer Based Models to Identify Hate Speech and Offensive Content in English and Indo-Aryan Languages. In Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org.

[4]

Md Abul Bashar and Richi Nayak. 2020. QutNocturnal@HASOC’19: CNN for hate speech and offensive content identification in Hindi language. arXiv preprint arXiv:2008.12448 (2020).

[5]

Jasmijn Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1957–1967.

[6]

Tuhin Chakrabarty, Christopher Hidey, and Kathleen Mckeown. 2019. IMHO Fine-Tuning Improves Claim Detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 558–563.

[7]

BR Chakravarthi, PK Kumaresan, R Sakuntharaj, AK Madasamy, S Thavareesan, S Chinnaudayar Navaneethakrishnan, JP McCrae, T Mandl, et al. 2021. Overview of the HASOC-DravidianCodeMix shared task on offensive language detection in Tamil and Malayalam. In Working Notes of FIRE 2021-Forum for Information Retrieval Evaluation. CEUR.

[8]

Bharathi Raja Chakravarthi, Prasanna Kumar Kumaresan, Ratnasingam Sakuntharaj, Anand Kumar Madasamy, Sajeetha Thavareesan, Premjith B, Subalalitha Chinnaudayar Navaneethakrishnan, John P. McCrae, and Thomas Mandl. 2021. Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam. In Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation (Online). CEUR.

[9]

Bharathi Raja Chakravarthi, Ruba Priyadharshini, Shubanker Banerjee, Manoj Balaji Jagadeeshan, Prasanna Kumar Kumaresan, Rahul Ponnusamy, Sean Benhur, and John Philip McCrae. 2023. Detecting abusive comments at a fine-grained level in a low-resource language. Natural Language Processing Journal 3 (2023), 100006.

[10]

Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John P McCrae. 2022. DravidianCodeMix: Sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Language Resources and Evaluation (2022), 1–42.

[11]

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020. 2898–2904.

[12]

Polychronis Charitidis, Stavros Doropoulos, Stavros Vologiannidis, Ioannis Papastergiou, and Sophia Karakeva. 2020. Towards countering hate speech against journalists on social media. Online Social Networks and Media 17 (2020), 100071.

[13]

Nancy Chinchor. 1992. The statistical significance of the MUC-4 results. In Proceedings of the 4th Conference on Message Understanding, MUC 1992. 30–50.

[14]

Çağrı Çöltekin. 2020. A corpus of Turkish offensive language on social media. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 6174–6184.

[15]

Tom De Smedt, Sylvia Jaki, Eduan Kotzé, Leïla Saoud, Maja Gwóźdź, Guy De Pauw, and Walter Daelemans. 2018. Multilingual Cross-domain Perspectives on Online Hate Speech. (2018). https://doi.org/10.48550/ARXIV.1809.03944

[16]

V Sharmila Devi, S Kannimuthu, and Anand Kumar Madasamy. 2024. The Effect of Phrase Vector Embedding in Explainable Hierarchical Attention-based Tamil Code-Mixed Hate Speech and Intent Detection. IEEE Access (2024).

[17]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.

[18]

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 8342–8360.

[19]

Adeep Hande, Karthik Puranik, Konthala Yasaswini, Ruba Priyadharshini, Sajeetha Thavareesan, Anbukkarasi Sampath, Kogilavani Shanmugavadivel, Durairaj Thenmozhi, and Bharathi Raja Chakravarthi. 2021. Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling. https://doi.org/10.48550/ARXIV.2108.12177

[20]

Asha Hegde, Mudoor Devadas Anusha, and Hosahalli Lakshmaiah Shashirekha. 2021. Ensemble based machine learning models for hate speech and offensive content identification. In Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org.

[21]

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear 7, 1 (2017), 411–420.

[22]

Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 328–339.

[23]

Vijayasaradhi Indurthi, Bakhtiyar Syed, Manish Shrivastava, Nikhil Chakravartula, Manish Gupta, and Vasudeva Varma. 2019. FERMI at SemEval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in Twitter. In Proceedings of the 13th international workshop on semantic evaluation. 70–74.

[24]

Younghoon Jeong, Juhyun Oh, Jaimeen Ahn, Jongwon Lee, Jihyung Moon, Sungjoon Park, and Alice Oh. 2022. KOLD: Korean Offensive Language Dataset. https://doi.org/10.48550/ARXIV.2205.11315

[25]

Navya Jose, Bharathi Raja Chakravarthi, Shardul Suryawanshi, Elizabeth Sherly, and John P McCrae. 2020. A survey of current datasets for code-switching research. In 2020 6th international conference on advanced computing and communication systems (ICACCS). IEEE, 136–141.

[26]

Sumit Kawate and Kailas Patil. 2017. Analysis of foul language usage in social media text conversation. International Journal of Social Media and Interactive Learning Environments 5, 3 (2017), 227–251.

[27]

Kushal Kedia and Abhilash Nandy. 2021. indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language Identification in Dravidian Languages. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. 330–335.

[28]

Anas Ali Khan, M Hammad Iqbal, Shibli Nisar, Awais Ahmad, and Waseem Iqbal. 2023. Offensive Language Detection for Low Resource Language Using Deep Sequence Model. IEEE Transactions on Computational Social Systems (2023).

[29]

Eunhui Kim, Yuna Jeong, and Myung-seok Choi. 2023. MediBioDeBERTa: Biomedical Language Model with Continuous Learning and Intermediate Fine-Tuning. IEEE Access (2023).

[30]

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. https://doi.org/10.48550/ARXIV.1412.6980

[31]

Yongyi Kui. 2021. Detect Hate and Offensive Content in English and Indo-Aryan Languages based on Transformer. In Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org.

[32]

Ritesh Kumar, Atul Kr Ojha, Shervin Malmasi, and Marcos Zampieri. 2020. Evaluating aggression identification in social media. In Proceedings of the second workshop on trolling, aggression and cyberbullying. 1–5.

[33]

R Prasanna Kumar, G Bharathi Mohan, Sangeeth Ajith, R Sudarshan, Manojna Karuparthi, VVV Bhagya Sree, B Vamsi Krushna, et al. 2023. Empowering Multilingual Insensitive Language Detection: Leveraging Transformers for Code-Mixed Text Analysis. In 2023 International Conference on Network, Multimedia and Information Technology (NMITCON). IEEE, 1–6.

[34]

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.

[35]

Zhongli Li, Qingyu Zhou, Chao Li, Ke Xu, and Yunbo Cao. 2020. Improving BERT with syntax-aware local attention. arXiv preprint arXiv:2012.15150 (2020).

[36]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/ARXIV.1907.11692

[37]

Thomas Mandl, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. 2019. Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation. 14–17.

[38]

Thomas Mandl, Sandip Modha, Gautam Kishore Shahi, Hiren Madhu, Shrey Satapara, Prasenjit Majumder, Johannes Schäfer, Tharindu Ranasinghe, Marcos Zampieri, Durgesh Nandini, et al. 2021. Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages. arXiv preprint arXiv:2112.09301 (2021).

[39]

Diego Marcheggiani and Ivan Titov. 2020. Graph Convolutions over Constituent Trees for Syntax-Aware Semantic Role Labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3915–3928.

[40]

Sarah Masud, Mohammad Aflah Khan, Md Shad Akhtar, and Tanmoy Chakraborty. 2023. Overview of the HASOC Subtrack at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span Detection. arXiv preprint arXiv:2311.09834 (2023).

[41]

Shubhanshu Mishra and Sudhanshu Mishra. 2019. 3Idiots at HASOC 2019: Fine-tuning Transformer Neural Networks for Hate Speech Identification in Indo-European Languages. In FIRE (Working Notes). 208–213.

[42]

Thanh-Tung Nguyen, Xuan-Phi Nguyen, Shafiq Joty, and Xiaoli Li. 2020. Differentiable Window for Dynamic Local Attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6589–6599.

[43]

Ratnavel Rajalakshmi, Srivarshan Selvaraj, Pavitra Vasudevan, et al. 2023. Hottest: Hate and offensive content identification in Tamil using transformers and enhanced stemming. Computer Speech & Language 78 (2023), 101464.

[44]

Tharindu Ranasinghe, Marcos Zampieri, and Hansi Hettiarachchi. 2019. BRUMS at HASOC 2019: Deep Learning Models for Multilingual Hate Speech and Offensive Language Identification. In FIRE (working notes). 199–207.

[45]

Pradeep Kumar Roy, Snehaan Bhawal, and Chinnaudayar Navaneethakrishnan Subalalitha. 2022. Hate speech and offensive language detection in Dravidian languages using deep ensemble framework. Computer Speech & Language 75 (2022), 101386. https://doi.org/10.1016/j.csl.2022.101386

[46]

Debjoy Saha, Naman Paharia, Debajit Chakraborty, Punyajoy Saha, and Animesh Mukherjee. 2021. Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. 270–276.

[47]

Cesa Salaam, Franck Dernoncourt, Trung Bui, Danda Rawat, and Seunghyun Yoon. 2022. Offensive Content Detection Via Synthetic Code-Switched Text. In Proceedings of the 29th International Conference on Computational Linguistics. 6617–6624.

[48]

Shrey Satapara, Prasenjit Majumder, Thomas Mandl, Sandip Modha, Hiren Madhu, Tharindu Ranasinghe, Marcos Zampieri, Kai North, and Damith Premasiri. 2022. Overview of the HASOC subtrack at FIRE 2022: Hate speech and offensive content identification in English and Indo-Aryan languages. In Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation. 4–7.

[49]

Michael Sejr Schlichtkrull, Nicola De Cao, and Ivan Titov. 2020. Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking. CoRR abs/2010.00577 (2020). arXiv:2010.00577 https://arxiv.org/abs/2010.00577

[50]

Kogilavani Shanmugavadivel, VE Sathishkumar, Sandhiya Raja, T Bheema Lingaiah, S Neelakandan, and Malliga Subramanian. 2022. Deep learning based sentiment analysis and offensive language identification on multilingual code-mixed data. Scientific Reports 12, 1 (2022), 21557.

[51]

Gudbjartur Ingi Sigurbergsson and Leon Derczynski. 2020. Offensive Language and Hate Speech Detection for Danish. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 3498–3508.

[52]

Malliga Subramanian, Rahul Ponnusamy, Sean Benhur, Kogilavani Shanmugavadivel, Adhithiya Ganesan, Deepti Ravi, Gowtham Krishnan Shanmugasundaram, Ruba Priyadharshini, and Bharathi Raja Chakravarthi. 2022. Offensive language detection in Tamil YouTube comments by adapters and cross-domain knowledge transfer. Computer Speech & Language 76 (2022), 101404.

[53]

Malliga Subramanian, Rahul Ponnusamy, Sean Benhur, Kogilavani Shanmugavadivel, Adhithiya Ganesan, Deepti Ravi, Gowtham Krishnan Shanmugasundaram, Ruba Priyadharshini, and Bharathi Raja Chakravarthi. 2022. Offensive language detection in Tamil YouTube comments by adapters and cross-domain knowledge transfer. Computer Speech & Language 76 (2022), 101404. https://doi.org/10.1016/j.csl.2022.101404

[54]

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China national conference on Chinese computational linguistics. Springer, 194–206.

[55]

Chul Sung, Tejas Dhamecha, Swarnadeep Saha, Tengfei Ma, Vinay Reddy, and Rishi Arora. 2019. Pre-training BERT on domain resources for short answer grading. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6071–6075.

[56]

Yufei Wang, Mark Johnson, Stephen Wan, Yifang Sun, and Wei Wang. 2019. How to best use syntax in semantic role labelling. arXiv preprint arXiv:1906.00266 (2019).

[57]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations. 38–45.

[58]

Sargam Yadav, Abhishek Kaushik, and Kevin McDaid. 2023. Hate Speech is not Free Speech: Explainable Machine Learning for Hate Speech Detection in Code-Mixed Languages. In 2023 IEEE International Symposium on Technology and Society (ISTAS). IEEE, 1–8.

[59]

Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering. https://doi.org/10.48550/ARXIV.1904.06652

[60]

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Advances in neural information processing systems 32 (2019).

[61]

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). arXiv preprint arXiv:1903.08983 (2019).

[62]

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of the Fourteenth Workshop on Semantic Evaluation. 1425–1447.

[63]

Shaomin Zheng and Meng Yang. 2019. A new method of improving BERT for text classification. In International Conference on Intelligent Science and Big Data Engineering. Springer, 442–452.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Just Accepted

ISSN:2375-4699

EISSN:2375-4702

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 26 March 2024

Accepted: 17 March 2024

Revised: 27 January 2024

Received: 31 July 2023
