
“Breaking the Silence” Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces

Advaitha Vetagiri1, Gyandeep Kalita1, Eisha Halder1, Chetna Taparia1, Partha Pakray1, Riyanka Manna2
1National Institute of Technology Silchar, Silchar, Assam, India.
2Computer Science & Engineering, School of Computing, Amrita Vishwa Vidyapeetham
Amaravati Campus, Andhra Pradesh, India.
1advaitha21_rs@cse.nits.ac.in, 1gyandeepkalita1@gmail.com, 1eishashalder@gmail.com,
1chetna.taparia@gmail.com, 1partha@cse.nits.ac.in,
2m_riyanka@av.amrita.edu
Abstract

Online gender-based harassment is a widespread issue that limits the free expression and participation of women and marginalized genders in digital spaces. Detecting such abusive content can enable platforms to curb it. We participated in the “Gendered Abuse Detection in Indic Languages” shared task at ICON2023, which provided datasets of annotated Twitter posts in English, Hindi and Tamil for building classifiers to identify gendered abuse. Our team, CNLP-NITS-PP, developed an ensemble approach combining CNN and BiLSTM networks that can effectively model semantic and sequential patterns in textual data. The CNN captures localized features indicative of abusive language through convolution filters applied to the embedded input text. To determine context-based offensiveness, the BiLSTM analyzes the resulting sequence for dependencies among words and phrases. Multiple variations were trained using FastText and GloVe word embeddings for each language dataset, comprising over 7,600 crowdsourced annotations across labels for explicit abuse, targeted minority attacks and general offences. The validation scores showed strong F1 performance, reaching 0.84 for English. Our experiments reveal how customizing embeddings and model hyperparameters can improve detection capability. The proposed architecture ranked 1st in the competition, demonstrating its ability to handle real-world noisy text with code-switching. This technique has promising scope as platforms aim to combat the cyber harassment facing Indic language internet users. Our code is available at https://github.com/advaithavetagiri/CNLP-NITS-PP



1 Introduction

Sexism exists in online communication just as it does offline. It persists because online communication is now as vital a part of society as anything else, with benefits as well as disadvantages. Sexist language in online conversation leads to harassment, cyberbullying, and gendered discourse, as indicated by Whiley et al. (2023). As argued by Hoskin and Whiley (2023) and Felmlee et al. (2023), both men and women feel the impact of toxic masculinity. Sexism is an acknowledged problem in internet-based discussions, yet it remains difficult to identify.

As observed by Kural and Kovács (2022), this has serious ramifications for the targets, affecting their self-esteem and causing anxiety and feelings of insecurity. As stated by Feigt et al. (2022), these adverse effects can have a prolonged impact on a person’s life by degrading mental health and relationships. Finally, sexist language contributes to gender inequality because it continues to propagate biased messages, as explained by Barreto and Doyle (2023).

One crucial area of research is the automatic detection of sexist language in text. Automated algorithms can help reduce the occurrence of sexism and make the online environment more inclusive (Vetagiri et al., ), for example by acting when somebody publishes sexist words. However, building autonomous systems that consistently identify sexist language is challenging, because it requires interpreting the context of language use and its cultural nuances (Van Dijk, 2015). These problems become harder in Internet communication, where people often use abbreviations and shorthand that are not easily understood.

Online gender-based violence is a growing challenge that compounds existing social and economic vulnerabilities, especially for speakers of Indic languages. Such abuse drives people away from online spaces, which in turn harms their political and economic prospects. Its consequences can be severe and even tragic, in some cases leading to death. It is therefore essential to develop automatic mechanisms that identify gendered abuse as it occurs in Indic-language interactions. However, the development of such approaches is hampered by a significant shortage of Indic-language datasets.

In a shared task at ICON 2023 (http://icon2023.unigoa.ac.in/), led by Tattle Civic Tech (https://tattle.co.in/), we explore how to counter online gender-based violence in Indic languages. This problem compounds existing social and economic insecurities: it can drive people off social media platforms, disrupt their political or economic participation, and sometimes result in deaths. Fast techniques for identifying gender-based violence depend on Indic-language datasets, which are still lacking. Our work focuses on Hindi, Tamil, and Indian English. The shared task gives us access to new data annotated by 18 activists and researchers on the basis of their observations and experience of gender-based violence. Our participation is based on this dataset (Arora et al., 2023), which contains 7638 posts in English, 7714 in Hindi, and 7914 in Tamil.

2 Related Work

Research on gender-based online harassment has gained prominence due to its pervasive and damaging effects on individuals. Early attempts applied classical machine learning techniques: models such as Support Vector Machines (SVMs) (Ghosal et al., 2023) and Random Forests (Das et al., 2023) have been used to classify content as abusive or non-abusive. These methods show potential, but it is difficult for them to reach high accuracy across different language structures and cultural circumstances.

Recent developments in deep learning have enabled better approaches for detecting offensive language on online platforms. CNNs (Quoc Tran et al., 2023) and RNNs (Jahan and Oussalah, 2023) efficiently pick up complex features and contextual details. Trained on large datasets, these models can identify subtle differences in gender-discriminatory abuse.

Ensemble models that combine Convolutional Neural Networks (CNNs) and Bi-directional Long Short-Term Memory networks (BiLSTMs) (Vetagiri et al., 2023a) are among the most effective techniques for dealing with the complex nature of gender-based online harassment. The benefits of this ensemble architecture come from the blend itself: a CNN-BiLSTM model (Vetagiri et al., 2023b) combines the advantages of CNNs and BiLSTMs to capture the spatial and temporal dependencies inherent in textual data. CNNs are good at extracting local features and patterns, whereas BiLSTMs capture long-range dependencies, so the two provide complementary components for unravelling the complexities of abusive language.

The transferability of models trained in one language to another language has become an area of interest. Preliminary studies (Priyadarshini et al., 2023) have trained models on English abusive-language data before fine-tuning them on Indic languages. This approach exploits abundant English-language resources and mitigates the lack of labelled Indian-language data.

Researchers have also studied multimodal approaches that consider abusive content in different modes, such as text, pictures, and videos. Deep learning-based fusion of text and image features (Chhabra and Vishwakarma, 2023) appears promising for capturing the diverse ways gender-based violence (GBV) is expressed across different genres.

An important issue in deploying machine learning models to monitor gender-based harassment is the handling of cultural and linguistic features. Some studies (Ghosal et al., 2023; Das et al., 2023) stress that models designed for indigenous languages like Hindi must account for local linguistic peculiarities and cultural specificity.

3 Dataset

The dataset creation process focused on three languages: Indian English, Hindi, and Tamil. The name “Indian English” is used because the language has become distinguishable through transliteration and code-switching. A list of slurs and offensive phrases was crowdsourced from activists and researchers to create a widely varied dataset. Lists of accounts commonly associated with online hate and hate speech were also compiled. Twitter data from 2018 to 2021 was collected using three criteria: posts containing the crowdsourced slurs, tweets by the listed perpetrators, and responses to influential women.

The user handles were anonymized, and Python’s Twint library was used to collect the data. A stratified pooling technique then selected 8000 instances from the pool of 1.3 million unlabelled posts. The unlabelled data was labelled using democratic co-training of several models whose parameters were pre-trained on public datasets for similar tasks. Posts were selected for the final dataset by calculating average confidence scores with the mixture-of-experts (MoE) model, based on ten classes with different toxicity score levels.

The resulting dataset (Arora et al., 2023) was carefully collected for the shared task on detecting gender-based cyber violence in Hindi, Tamil, and Indian English. It consists of 7638 posts in English, 7714 in Hindi, and 7914 in Tamil. The posts were annotated by eighteen activists and researchers with first-hand or expert knowledge of gender studies.

All languages and labels are compiled into a single dataset for convenience. There are three labels, and each tweet appears three times, once for the annotation of each label. The labels are:

  • Label 1 (question_1): Indicates whether the post amounts to gendered abuse when it targets a person who is not from a gender or sexually marginalised group.

  • Label 2 (question_2): Indicates whether the post would qualify as gendered abuse when it targets a person of minority gender or sexual orientation.

  • Label 3 (question_3): Indicates whether the post is overtly hostile.

Language    Count
English     6531
Hindi       6197
Tamil       6778
Table 1: The training dataset size per language

Language    Count
English     7638
Hindi       7714
Tamil       7914
Table 2: The total dataset size per language

Each post is annotated by the assigned annotators with values such as “1” to indicate agreement with the label, “0” to denote disagreement, “NL” for posts that were assigned but not annotated, and “NaN” for posts that were not assigned to annotators.

Tables 1 and 2 give the dataset sizes per language. Because all languages and their labels are stored in the same file, each tweet appears three times in the dataset, once for each of the three labels. The dataset columns include:

  • id: A serial number for each row.

  • text: The content of the tweet.

  • language: The tweet’s language (English, Hindi, or Tamil).

  • key: Label identifier (question 1, question 2, or question 3).

  • en_a1 … en_a6: Values assigned by the English annotators.

  • hi_a1 … hi_a5: Values assigned by the Hindi annotators.

  • ta_a1 … ta_a6: Values assigned by the Tamil annotators.
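The column layout and annotation values described above can be illustrated with a small sketch that collects the usable votes for one row. It is only an illustration under stated assumptions: the row is modelled as a plain dictionary, the helper names `is_missing` and `valid_votes` are hypothetical, the example row is invented, and the use of `None` or a NaN float for unassigned cells is a guess at how the file would look once loaded.

```python
import math

# Annotator column prefixes per language, following the column list above.
ANNOTATOR_PREFIX = {"English": "en_a", "Hindi": "hi_a", "Tamil": "ta_a"}


def is_missing(value):
    """True for cells without a vote: NaN/None (not assigned) or "NL" (not annotated)."""
    if value is None or value == "NL":
        return True
    return isinstance(value, float) and math.isnan(value)


def valid_votes(row):
    """Return the 0/1 votes cast by the annotators of the row's language."""
    prefix = ANNOTATOR_PREFIX[row["language"]]
    votes = []
    for column, value in row.items():
        if column.startswith(prefix) and not is_missing(value):
            votes.append(int(float(value)))
    return votes


# Hypothetical row, invented for illustration only.
example_row = {
    "id": 1,
    "text": "example tweet",
    "language": "Hindi",
    "key": "question_1",
    "hi_a1": "1",
    "hi_a2": "0",
    "hi_a3": "NL",
    "hi_a4": float("nan"),
    "hi_a5": 1.0,
    "en_a1": float("nan"),
}
print(valid_votes(example_row))
```

Keeping the missing-value check in one helper means that "NL" and NaN cells are excluded in the same way wherever votes are counted.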

The dataset (https://github.com/tattle-made/uli_dataset) is comprehensive in its structure. It gives researchers and participants of the shared task an opportunity to explore, analyze, and build effective models for detecting gender-based abuse in Indic languages.

4 Tasks Description

We participated in the shared task “Gendered Abuse Detection in Indic Languages” at ICON 2023. The task aims to address the escalating problem of cyber sexual harassment, whose impact runs deep in both society and the economy. Our participation covers three subtasks:

  • Build a Classifier for Gendered Abuse (Label 1): Develop a classifier for label 1, gender-based abuse, from the provided dataset. Relying on the annotations made by eighteen activists and researchers, we aim to build a robust system for detecting gender-based cyber violence in online communication.

  • Transfer Learning for Gendered Abuse Detection (Label 1): Use transfer learning from other open datasets for hate speech and toxic text recognition in Indic languages. By incorporating external knowledge and patterns, we aim to strengthen the overall effectiveness of our gendered abuse detection model.

  • Multi-Task Classifier for Gendered Abuse and Explicit Language (Labels 1 and 3): Build a multi-task classifier that simultaneously estimates gendered abuse (label 1) and explicit language (label 3). This subtask reflects our aim of considering both gender-based cyberbullying and harsh or explicit vocabulary.

We take part in this shared task to contribute to methods for the automated detection of gender discrimination on the internet. We look forward to working together, learning, and taking tangible steps towards a safer online environment.

5 Methodology

5.1 System Overview

After thorough experimentation with various neural network architectures, pre-trained large language models, and classical machine learning models, we settled on an ensemble approach: a model built on a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) architecture, used for all three tasks.

Figure 1: CNN-BiLSTM Architecture for Sexism Classification

The model combines CNN layers, which capture localized textual features indicative of abusive language in the input text, with Bidirectional LSTM layers, which model the long-range sequential dependencies within the text. This combination lets the model discern nuanced patterns and supports robust classification of abusive and sexist language across diverse linguistic contexts.

For the input layer, the model uses pre-trained GloVe and FastText embeddings (Kumar et al., 2020) for the respective languages. Each word is represented as a 300-dimensional dense vector, the sequence length is capped at 100 words, and the embedding weights are held as non-trainable parameters. These word embeddings map each token to a vector of real numbers that quantifies the semantic similarity between terms based on their distributional properties in a large corpus, learned using machine learning or related dimensionality reduction techniques. The embedding layer is followed by a SpatialDropout1D layer, a dropout variant that drops entire 1D feature maps during training to combat overfitting.
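The fixed-length input described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: the helper names `build_vocab` and `encode`, the reserved ids for padding and unknown words, and the choice to pad at the end of the sequence are all assumptions, since the text only states the cap of 100 words. Each resulting id would select one row of a frozen 300-dimensional embedding matrix.

```python
MAX_LEN = 100  # sequence length cap stated in the text
PAD_ID = 0     # assumed padding index
OOV_ID = 1     # assumed index for out-of-vocabulary words


def build_vocab(tokenized_texts):
    """Assign an integer id to every distinct token, reserving 0 and 1."""
    vocab = {}
    for tokens in tokenized_texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(token, OOV_ID) for token in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


texts = [["this", "is", "a", "post"], ["another", "post"]]
vocab = build_vocab(texts)
encoded = encode(["this", "post", "unseen"], vocab)
print(len(encoded), encoded[:4])
```

Because every sequence has the same length after encoding, a batch can be stacked into a single integer matrix before the embedding lookup.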

For the CNN-BiLSTM layers, shown in Figure 1, a one-dimensional convolution layer with 64 filters and a kernel size of 2 captures localized textual patterns at fine granularity. Its output is passed through a Bidirectional LSTM layer with 128 units that returns the full sequence, with a dropout of 0.1 and a recurrent dropout of 0.1, so that the text is processed in both the forward and reverse directions. A dense layer with 128 neurons and Global Average Pooling are then applied for dimensionality reduction and for aggregating information over the whole sequence. Finally, a dropout layer is applied, and its output is passed through a dense layer with a softmax activation function to produce the class probabilities. Figure 2 summarizes the model, including all layers and the trainable and non-trainable parameters.
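To make the layer sizes above concrete, the following sketch works out output shapes and trainable parameter counts by hand. It assumes the standard parameterization of convolution, LSTM, and dense layers, a convolution with stride 1 and no padding, and a two-class softmax output; none of these details are stated in the text, so the totals may differ from the summary in Figure 2. The frozen embedding and the dropout layers are left out because they contribute no trainable parameters.

```python
SEQ_LEN, EMB_DIM = 100, 300          # from the text
FILTERS, KERNEL = 64, 2              # Conv1D settings from the text
LSTM_UNITS, DENSE_UNITS = 128, 128   # from the text
NUM_CLASSES = 2                      # assumption: binary softmax output


def conv1d_params(in_channels, filters, kernel):
    # One (kernel x in_channels) weight block plus one bias per filter.
    return (kernel * in_channels + 1) * filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


conv_len = SEQ_LEN - KERNEL + 1  # assumes no padding and stride 1
layers = [
    ("Conv1D", (conv_len, FILTERS), conv1d_params(EMB_DIM, FILTERS, KERNEL)),
    ("BiLSTM", (conv_len, 2 * LSTM_UNITS), 2 * lstm_params(FILTERS, LSTM_UNITS)),
    ("Dense", (conv_len, DENSE_UNITS), dense_params(2 * LSTM_UNITS, DENSE_UNITS)),
    ("GlobalAveragePooling", (DENSE_UNITS,), 0),
    ("Dense (softmax)", (NUM_CLASSES,), dense_params(DENSE_UNITS, NUM_CLASSES)),
]
for name, shape, params in layers:
    print(f"{name:22s} {str(shape):12s} {params}")

trainable = sum(params for _, _, params in layers)
print("trainable parameters:", trainable)
```

Under these assumptions the bidirectional LSTM holds most of the trainable weights, since it has two directions, each with four gates.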

Figure 2: CNN-BiLSTM Model Summary

5.2 Experimental Setup

We trained different models for each of the three languages, viz. English, Hindi, and Tamil, for each of the three subtasks, using the CNN-BiLSTM architecture mentioned in the previous subsection, the details of which are discussed in the following sub-subsections.

5.2.1 Tasks 1 & 3

As mentioned in the previous sections, Task 1 required us to develop a classifier for label 1, gender-based abuse, from the provided dataset. Task 3 required us to construct a multi-task classifier that estimates gendered abuse (label 1) and explicit language (label 3) simultaneously. We used the provided labeled data for label 1 in Task 1 and for labels 1 and 3 in Task 3.

We first calculated the final label for each sentence in the datasets of both labels for all languages, taking the majority value (0 or 1) among all annotators. In the case of a tie between 0s and 1s, the final label was set to 1. The datasets were then preprocessed to remove stopwords, symbols, tags, and emojis, which we expected to improve the generalization of the models and to give better performance than the raw datasets. The datasets were then divided into an 80/20 train-test split for model development and validation.
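The label aggregation and cleaning steps described above might look like the following sketch. It is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the stopword list is a tiny placeholder, and "tags" is read here as @mentions and #hashtags. Characters are filtered by Unicode category instead of an ASCII pattern so that the combining vowel signs of Devanagari and Tamil words are preserved.

```python
import re
import unicodedata

# Tiny illustrative stopword list; a real run would use a per-language list.
STOPWORDS = {"a", "an", "the", "is", "are", "you", "so"}


def final_label(votes):
    """Majority vote over 0/1 annotations; a tie resolves to 1, as stated above."""
    if not votes:
        raise ValueError("no annotations available for this post")
    ones = sum(votes)
    zeros = len(votes) - ones
    return 1 if ones >= zeros else 0


def clean_text(text):
    """Drop @mentions, #hashtags, punctuation, symbols (including emoji), and stopwords."""
    text = re.sub(r"[@#]\S+", " ", text)
    # Categories starting with P (punctuation) or S (symbols, most emoji)
    # become spaces; letters, digits, and combining marks are kept.
    kept = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    tokens = [tok for tok in kept.lower().split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(final_label([1, 0, 1]), final_label([1, 0]))
print(clean_text("@user You are SO rude 😡!!! #fail"))
```

Resolving ties to 1 favours recall on the abusive class, at the cost of some additional false positives.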

The models for each of the three languages were trained with a 5-fold cross-validation strategy and a batch size of 32. They were trained for five epochs in each fold using the Adam optimizer, chosen for its efficiency in handling non-stationary objectives and for its adaptive learning rates, with categorical cross-entropy as the loss function.
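The cross-validation loop described above can be outlined without any framework. The sketch below only produces the index splits and mini-batches; the helper names are hypothetical, and the shuffling seed and the absence of stratification are assumptions, since the text does not say how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def batches(indices, batch_size=32):
    """Split a list of indices into consecutive mini-batches."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(1000), start=1):
    # A model would be trained for five epochs on train_idx here
    # and evaluated on val_idx.
    print(fold, len(train_idx), len(val_idx), len(batches(train_idx)))
```

Each sample appears in exactly one validation fold, so the averaged fold scores cover the whole training portion once.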

5.2.2 Task 2

Task 2 required us to use transfer learning from other open datasets for hate speech and toxic text recognition in Indic languages to create a classifier for detecting gender-based abusive content (label 1). In addition to the provided dataset, we used the Multilingual Abusive Comment Detection (MACD) dataset (Gupta et al., 2022) for Hindi and Tamil, and the MULTILATE dataset (https://github.com/advaithavetagiri/MULTILATE) for English as external open datasets.

The MULTILATE Dataset is a large labeled dataset containing over 2.6 million sentences for detecting Hate and Abusive Content in social media. The dataset has been annotated into ‘Hate’ and ‘Not-Hate’ for a Binary Classification Task and into ‘sexist’, ‘racist’ and ‘neither’ for a Multiclass Classification Task. The MACD dataset comprises around 150K textual sentences with 74K abusive and 77K non-abusive comments from five Indic languages - Hindi (Hi), Tamil (Ta), Telugu (Te), Malayalam (Ml) and Kannada (Kn), annotated as 0 (For abusive) and 1 (For non-abusive). We used the Hindi and the Tamil subsets of the MACD dataset, consisting of 33k sentences and 30k sentences, respectively.

In Task 2, for each language, after assigning the final labels to each sentence in the provided dataset, we concatenated the provided dataset with the respective external dataset to create a final combined dataset. The combined datasets were then divided into an 80/20 train-test split for model development and validation.
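The concatenation and split described above can be sketched as follows. The text does not say whether the external labels were remapped, so this sketch makes an explicit assumption: in the combined data 1 always means abusive, and a corpus that marks abusive text with 0 (as the MACD description above states) is flipped first. The function names and the toy rows are hypothetical.

```python
import random


def flip_binary(label):
    """Map 0 to 1 and 1 to 0."""
    return 1 - label


def combine(shared_task_rows, external_rows, external_abusive_value=0):
    """Concatenate (text, label) pairs so that 1 always means abusive.

    external_abusive_value is the value the external corpus uses for
    abusive text; the default of 0 follows the MACD description above.
    """
    combined = list(shared_task_rows)
    for text, label in external_rows:
        if external_abusive_value == 0:
            label = flip_binary(label)
        combined.append((text, label))
    return combined


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out test_fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows, invented for illustration only.
shared = [("post a", 1), ("post b", 0)]
external = [("comment c", 0), ("comment d", 1)]
combined = combine(shared, external)
train_rows, test_rows = train_test_split(combined * 5)
print(combined)
print(len(train_rows), len(test_rows))
```

Aligning label polarity before concatenation matters because a silent mismatch would teach the model the opposite class for every external example.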

Training processes similar to Tasks 1 & 3 were employed with 5-fold cross-validation but with a batch size of 64 training patterns and seven epochs for each fold. The models were trained using the Adam Optimiser and Categorical Crossentropy as the loss function.

6 Results

6.1 Evaluation

To evaluate our Models’ overall performance and efficiency, we adopted several metrics, particularly precision, recall, and the F1 score. The F1 score is an essential evaluation measure because it consolidates the performance of a classifying model in terms of all categories into one statistic by providing a balanced measure between the model’s precision and recall. Precision measures how accurately the model predicts positive outcomes. At the same time, recall tells us how well the model captures all the relevant instances, giving us insights into how broadly the models can predict. These metrics are immensely useful in tasks such as classification, where finding a middle ground between accuracy and completeness is crucial.

The precision and recall for the positive class can be calculated as follows:

\[ Precision_{1} = \frac{TP}{TP + FP} \tag{1} \]
\[ Recall_{1} = \frac{TP}{TP + FN} \tag{2} \]

Similarly, the precision and recall for the negative class can be calculated as:

\[ Precision_{2} = \frac{TN}{TN + FN} \tag{3} \]
\[ Recall_{2} = \frac{TN}{TN + FP} \tag{4} \]

Macro-Average Precision and Recall for Multiclass Classification:

\[ Precision_{c} = \frac{TP_{c}}{TP_{c} + FP_{c}} \tag{5} \]
\[ Recall_{c} = \frac{TP_{c}}{TP_{c} + FN_{c}} \tag{6} \]

The Macro-Average Precision (MAP) and Recall (MAR) can then be calculated as:

\[ \text{MAP-Binary} = \frac{Precision_{1} + Precision_{2}}{2} \tag{7} \]
\[ \text{MAR-Binary} = \frac{Recall_{1} + Recall_{2}}{2} \tag{8} \]
\[ \text{MAP-Multiclass} = \frac{1}{C} \sum_{c=1}^{C} Precision_{c} \tag{9} \]
\[ \text{MAR-Multiclass} = \frac{1}{C} \sum_{c=1}^{C} Recall_{c} \tag{10} \]

Finally, the Macro F1 score can be computed using the formula:

\[ F1_{\text{macro}} = \frac{2 \times (MAP \times MAR)}{MAP + MAR} \tag{11} \]
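As a worked illustration of the formulas above, the sketch below computes per-class precision and recall, their macro averages, and the harmonic mean of the two. The function name and the toy label lists are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention. Note that this harmonic mean of the macro-averaged precision and recall is not in general equal to the mean of the per-class F1 scores, which is another common definition of macro F1.

```python
def macro_scores(y_true, y_pred):
    """Return (MAP, MAR, F1_macro) from per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    map_ = sum(precisions) / len(classes)
    mar = sum(recalls) / len(classes)
    f1 = 2 * map_ * mar / (map_ + mar) if map_ + mar else 0.0
    return map_, mar, f1


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
map_, mar, f1 = macro_scores(y_true, y_pred)
print(round(map_, 3), round(mar, 3), round(f1, 3))
```

In this toy example the class-wise precisions are 0.75 and 0.5 and the recalls are 0.6 and 2/3, so MAP is 0.625, MAR is about 0.633, and the resulting macro F1 is about 0.629.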

To measure the efficiency of our models, we have used the above metrics, the results of which have been mentioned in the next section.

6.2 Training Results

To analyze the training performance, we generated classification reports for the models of each language in each task, recording the precision, recall, and F1 scores for each fold and calculating the average macro scores over the folds.

Task      Language    P       R       F1
Task 1    English     0.79    0.80    0.79
          Hindi       0.73    0.75    0.70
          Tamil       0.75    0.74    0.74
Task 2    English     0.84    0.84    0.84
          Hindi       0.79    0.79    0.78
          Tamil       0.83    0.83    0.83
Task 3    English     0.79    0.80    0.79
          Hindi       0.73    0.75    0.70
          Tamil       0.75    0.75    0.74
Table 3: The training results as Precision (P), Recall (R), and F1 Scores (F1) on the languages English, Hindi, and Tamil by the models CNN-BiLSTM using GloVe Embeddings for English and FastText Embeddings for Hindi and Tamil.
Figure 3: Task 2 Accuracy and Loss of English (Top), Hindi (Middle) and Tamil(Bottom).

The training performance scores for the models are given in Table 3. Performance is sufficiently high for all models across all tasks. In Tasks 1, 2, and 3, the precision, recall, and F1 score are highest for English, followed by Tamil and Hindi. The training scores in Task 2 are considerably higher than in the other two tasks for all models, primarily because larger datasets were available for training. The highest F1 score across all tasks, 0.84, was achieved by the English model in Task 2. For all the metrics to be measured in accordance with the given task, some instances took the form of aliases, and their names therefore needed to be altered.

We also plotted the model accuracy and the model loss against the number of epochs for both the training and the validation sets, to check for overfitting and to track model improvement over the entire training run. The plots are shown in Figure 3.

In all the plots, the validation accuracy rises and the validation loss falls across the epochs, alongside the corresponding rise in training accuracy and fall in training loss, indicating a robust model with improved generalization and no overfitting.

6.3 Testing Results

We predicted the labels for test datasets provided for each task using the models for all three languages, compiled them into .csv files, and submitted them in the respective Kaggle competitions to obtain the test set evaluation scores for each task. The F1 score was used as the testing metric for the shared tasks, as it provides a balanced evaluation basis between precision and recall and is also known to give good results on imbalanced classification problems.

Team            Task              F1 Score
CNLP-NITS-PP    Task 1            0.616
                Task 2            0.572
                Task 3 (Multi)    0.616 & 0.582
Table 4: Testing Results of the CNN-BiLSTM models on each task

The scores for the Models across each of the three tasks have been summarized in Table 4. In Task 1, we achieved the highest F1 score of 0.616. For Task 2, the F1-score for our Models came out to be 0.572, and for Task 3, our Models scored F1-scores of 0.616 for Label 1 and 0.582 for Label 3.

6.4 Results Analysis

Our study reveals promising performance outcomes for the various classification tasks under the shared tasks of ICON 2023. The model performance analysis across all three tasks and datasets reveals intriguing patterns in training and testing scenarios.

From Table 3, we observe that the training set performance of the Models, across all the languages, for Task 2 is better than that of Task 1 by a considerable margin. These improvements in the performance scores for Task 2 can be attributed to the utilization of significantly larger volumes of training data from external sources, hinting at improved generalization capabilities of the models, as they benefit from a more diverse and extensive set of examples. In addition, the models seem to perform considerably well in the training set evaluation for Task 3.

Contrary to the patterns observed in the training set, when assessed on the test set, the models exhibited higher scores for Task 1 than Task 2, as evident from Table 4. The reason for this change in trend cannot be explained directly based on our experimental findings. This is due to the collective evaluation of the test set performance across all languages in each task, making it difficult for us to verify the contributions of the models pertaining to each language. In contrast, the test set results for Task 3 conform to those of the training set results as expected.

7 Conclusion

This paper presented our approach and results for the ICON2023 shared task on identifying gendered abuse in online content. Our ensemble models, using CNN-BiLSTMs and pre-trained word embeddings such as FastText and GloVe, proved effective, achieving top ranks on the leaderboard across multiple languages. The models could capture nuanced abusive language through localized feature learning and sequence modelling. Our analysis showed the impact of factors like embedding techniques and input preprocessing. Handling heavily code-switched language remains difficult and is an area for future work. Through this shared task, we developed performant models for a crucial problem limiting online freedom of expression. The datasets and model code have been open-sourced to enable further research towards mitigating such gendered cyber harassment.

Acknowledgements

We thank the Department of Computer Science & Engineering, National Institute of Technology Silchar, for allowing us to pursue our research and experimentation, and for the resources and research atmosphere provided by the Center for Natural Language Processing (CNLP) and Artificial Intelligence (AI) laboratories.

References

  • Arora et al. (2023) Arnav Arora, Maha Jinadoss, Cheshta Arora, Denny George, Brindaalakshmi, Haseena Dawood Khan, Kirti Rawat, Div, Ritash, Seema Mathur, Shivani Yadav, Shehla Rashid Shora, Rie Raut, Sumit Pawar, Apurva Paithane, Sonia, Vivek, Dharini Priscilla, Khairunnisha, Grace Banu, Ambika Tandon, Rishav Thakker, Rahul Dev Korra, Aatman Vaidya, and Tarunima Prabhakar. 2023. The uli dataset: An exercise in experience led annotation of ogbv.
  • Barreto and Doyle (2023) Manuela Barreto and David Matthew Doyle. 2023. Benevolent and hostile sexism in a shifting global context. Nature reviews psychology, 2(2):98–111.
  • Chhabra and Vishwakarma (2023) Anusha Chhabra and Dinesh Kumar Vishwakarma. 2023. A literature survey on multimodal and multilingual automatic hate speech identification. Multimedia Systems, pages 1–28.
  • Das et al. (2023) Subhajeet Das, Koushikk Bhattacharyya, and Sonali Sarkar. 2023. Performance analysis of logistic regression, naive bayes, knn, decision tree, random forest and svm on hate speech detection from twitter. International Research Journal of Innovations in Engineering and Technology, 7(3):24.
  • Feigt et al. (2022) Nicole D Feigt, Melanie M Domenech Rodríguez, and Alejandro L Vázquez. 2022. The impact of gender-based microaggressions and internalized sexism on mental health outcomes: A mother–daughter study. Family Relations, 71(1):201–219.
  • Felmlee et al. (2023) Diane H Felmlee, Chris Julien, and Sara C Francisco. 2023. Debating stereotypes: Online reactions to the vice-presidential debate of 2020. PloS one, 18(1):e0280828.
  • Ghosal et al. (2023) Sayani Ghosal, Amita Jain, Devendra Kumar Tayal, Varun G Menon, and Akshi Kumar. 2023. Inculcating context for emoji powered bengali hate speech detection using extended fuzzy svm and text embedding models. ACM Transactions on Asian and Low-Resource Language Information Processing.
  • Gupta et al. (2022) Vikram Gupta, Sumegh Roychowdhury, Mithun Das, Somnath Banerjee, Punyajoy Saha, Binny Mathew, Hastagiri Prakash Vanchinathan, and Animesh Mukherjee. 2022. Multilingual abusive comment detection at scale for indic languages. In Advances in Neural Information Processing Systems, volume 35, pages 26176–26191. Curran Associates, Inc.
  • Hoskin and Whiley (2023) Rhea Ashley Hoskin and Lilith A Whiley. 2023. Femme-toring: Leveraging critical femininities and femme theory to cultivate alternative approaches to mentoring. Gender, Work & Organization.
  • Jahan and Oussalah (2023) Md Saroar Jahan and Mourad Oussalah. 2023. A systematic review of hate speech automatic detection using natural language processing. Neurocomputing, page 126232.
  • Kumar et al. (2020) Saurav Kumar, Saunack Kumar, Diptesh Kanojia, and Pushpak Bhattacharyya. 2020. “a passage to India”: Pre-trained word embeddings for Indian languages. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 352–357, Marseille, France. European Language Resources association.
  • Kural and Kovács (2022) Ayşe I Kural and Monika Kovács. 2022. Attachment security schemas to attenuate the appeal of benevolent sexism: The effect of the need to belong and relationship security. Acta Psychologica, 229:103671.
  • Priyadarshini et al. (2023) Ishaani Priyadarshini, Sandipan Sahu, and Raghvendra Kumar. 2023. A transfer learning approach for detecting offensive and hate speech on social media platforms. Multimedia Tools and Applications, pages 1–27.
  • Quoc Tran et al. (2023) Khanh Quoc Tran, An Trong Nguyen, Phu Gia Hoang, Canh Duc Luu, Trong-Hop Do, and Kiet Van Nguyen. 2023. Vietnamese hate and offensive detection using phobert-cnn and social media streaming data. Neural Computing and Applications, 35(1):573–594.
  • Van Dijk (2015) Teun A Van Dijk. 2015. Critical discourse analysis. The handbook of discourse analysis, pages 466–485.
  • Vetagiri et al. (2023a) Advaitha Vetagiri, Prottay Adhikary, Partha Pakray, and Amitava Das. 2023a. CNLP-NITS at SemEval-2023 task 10: Online sexism prediction, PREDHATE! In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 815–822, Toronto, Canada. Association for Computational Linguistics.
  • Vetagiri et al. (2023b) Advaitha Vetagiri, Prottay Kumar Adhikary, Partha Pakray, and Amitava Das. 2023b. Leveraging gpt-2 for automated classification of online sexist content. Working Notes of CLEF.
  • (18) Advaitha Vetagiri, Eisha Halder, Ayanangshu Das Majumder, Partha Pakray, and Amitava Das. “multilate”: A synthetic dataset for multimodal hate speech detection. Available at SSRN 4733628.
  • Whiley et al. (2023) Lilith A Whiley, Lukasz Walasek, and Marie Juanchich. 2023. Contributions to reducing online gender harassment: Social re-norming and appealing to empathy as tried-and-failed techniques. Feminism & Psychology, 33(1):83–104.