\copyrightclause

Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[email=d.hartmann@tu-berlin.de]

[email=amin.m.oueslati@gmail.com ]

[email=staufer@tu-berlin.de]

Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services

David Hartmann Faculty of Electrical Engineering and Computer Science, TU Berlin Weizenbaum Institute for the Networked Society Amin Oueslati Hertie School Berlin Dimitri Staufer

(2022)

Abstract

Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC remain.

keywords:

Content moderation as a service \sephate speech detection \septhird-party audit \sepNLP fairness

1 Introduction

The digital age has brought a significant increase in online content. Worryingly, this also includes pernicious, unwanted content, such as hate speech [1]. Online platforms responded by adopting extensive content moderation regimes [2]. Absent a legal taxonomy of what constitutes hate speech, private companies are given substantial autonomy in their moderation practices, effectively making them the judges of public speech [3, 4].

Typically, large online platforms rely on so-called community guidelines against which speech is assessed. The assessment is done by human moderators, who are assisted by algorithms [5]. The largest technology firms, such as Google, Microsoft, and Amazon, additionally offer content moderation as a service via cloud-based API access. While most organisations do not report the extent to which algorithms shape content moderation, the sheer amount of online speech makes reliance on algorithmic moderation inevitable [6].

To date, there exists no systematic evaluation of cloud-based content moderation services. The absence of public scrutiny is alarming, as open-source content moderation algorithms have continuously displayed biases against minorities and vulnerable groups [7, 8, 9, 10, 11, 12]. This paper’s contribution is twofold. Firstly, it offers the first comprehensive fairness assessment of four major cloud-based content moderation algorithms. Not only are these algorithms likely in use at the companies themselves, but they are also deployed by a vast number of smaller organisations through the SaaS model. Secondly, our auditing strategy may inform future bias audits of (cloud-based) content moderation algorithms. Importantly, our proposed approach solely assumes limited black-box access [13] and offers guidance on reinforced sampling strategies to achieve maximal scrutiny with limited resources, noting the realities of unsolicited audits from civil society organisations and academia [14, 15, 16].

2 Data and Method

We gained researcher access to the Google Moderate Text API, Amazon Comprehend, Microsoft Azure Content Moderation, and the Open AI Content Moderation API. These services generate a hate speech score per text sequence, often split across several sub-categories, as well as a binary flag. Our study uses the MegaSpeech, Jigsaw, HateXplain, and ToxiGen datasets [17, 18, 19, 11]. The selected datasets capture various forms of hate speech, with ToxiGen containing implicit and adversarial hate speech constructed around indirect messages [20], while MegaSpeech and ToxiGen use generative AI to diversify speech corpora [17, 11]. Jigsaw and HateXplain contain human-written examples labeled by annotators, with MegaSpeech containing more hate speech corpora but no target group labels. MegaSpeech, HateXplain, and ToxiGen provide shorter text sequences, with on average 17.7, 23.3, and 18.1 words respectively, while Jigsaw is made up by longer sequences, 48.3 words on average.

We evaluate all cloud-based moderation algorithms across all datasets on a set of threshold-variant and threshold-invariant performance metrics [21, 22] at an aggregate level and also specifically for vulnerable groups. We ensure consistency across datasets by mapping these onto seven vulnerable groups (Women, LGBTQ+, PoC, Muslim, Asian, Jewish, Latinx). Since MegaSpeech comes without labels, we train a Bi-LSTM model with the collected data set by Yoder et al. [23] (preliminary evaluation accuracy 78 %) for target identity classification.

Dataset	Moderation Service	ROC AUC	F1	FPR	FNR	Dataset	Moderation Service	ROC AUC	F1	FPR	FNR
ToxiGen	Amazon	70.4%	68.9%	7.2%	52.0%	MegaSpeech	Amazon	72.8 %	72.0 %	10.4 %	43.9 %
	Google	62.7%	62.7%	39.1%	35.5%		Google	73.3 %	72.3 %	41.3 %	12.0 %
	OpenAI	70.3%	68.1%	33.2%	56.0%		OpenAI	77.1 %	76.7 %	8.4 %	37.3 %
	Microsoft	59.8%	57.4%	16.4%	64.0%		Microsoft	70.6 %	70.1 %	16.9 %	41.9 %
Jigsaw	Amazon	92.2%	92.2%	7.5%	8.1%	HateXplain	Amazon	66.8%	66.25%	46.3 %	20 %
	Google	69.9%	67.2%	58.4%	1.8%		Google	52.2 %	58.9 %	78.2 %	4 %
	OpenAI	78.6%	78.6%	17.1%	25.6%		OpenAI	72.9 %	76.7 %	45.4 %	8.86 %
	Microsoft	75.8%	75.7%	20.4%	28.1%		Microsoft	63.1 %	60.2 %	63.6 %	10.3 %

Table 1: Performance metrics by moderation service and dataset. Blue shading signals the best performance, while red shading indicates the worst performance. ToxiGen includes 7,800 observations and HateXplain 14,000, while Jigsaw and MegaSpeech each contain 50,000. All datasets are balanced on toxic and non-toxic phrases.

At the group-level, we compute the pinned ROC AUC, a metric proposed by Dixon et al. [24], designed to provide a more robust measure for scale-invariant performance comparison across sub-groups. While this approach comes with its pitfalls, as the authors themselves note in a subsequent paper, it is the best scale-invariant metric to date when presented with group-level variation in biases [22].

Perturbation Sensitivity Analysis (PSA) offers an additional, arguably more robust evaluation of group-level biases by using counterfactual fairness evaluation[25]. We follow prior research in defining an anchor group against which other groups are compared [25]. Using the dominant majority group as baseline, Counterfactual Token Fairness (CFT) scores are computed as the difference in toxicity between the baseline and the corresponding minority group. PSA makes two assumptions: (1) counterfactual pairs should convey the same or neutral meaning, avoiding any implicit biases or derogatory connotations. While constructing toxic counterfactuals is theoretically possible, it is methodologically demanding and exceeds the scope of this project. Instead, we construct 34 neutral counterfactual pairs. Importantly, each minority group is represented by multiple tokens, reflecting its different semantic representations. For instance, the minority group female also manifests as woman and women. Furthermore, (2) there should be no unique interactions between a particular minority token and the context of the sentence that would skew the analysis. This is challenging in real-world applications, as certain combinations might evoke stereotypes or specific cultural connotations. Thus, the project uses data consisting largely of short and explicit statements. Furthermore, CFT scores are calculated separately for toxic and non-toxic statements, with the latter generally supporting the assumption of counterfactual symmetry more consistently.

PSA experiments are conducted using two distinct data sets. First, the synthetic Identity Phrase Templates from Dixon et al. [24] are used. The set contains 77,000 synthetic examples of which 50% are toxic. These avoid stereotypes and complex sentence structures by design, which ensures that the symmetric counterfactual assumption is met. Mapping the dataset, which contains a broader set of identities, to the 34 minority token relevant to this study, results in 25,738 sentence pairs. Second, by applying the same logic, 9,190 sentence pairs are derived from the MegaSpeech dataset.

3 Preliminary Results

Table 1 shows aggregated performance results for chosen benchmark data sets. Our results indicate notable disparities between moderation APIs. OpenAI’s content moderation algorithm performs best for Megaspeech and Amazon Text Moderation on Jigsaw and ToxiGen, generalising well across data sets. On Jigsaw, Amazon Comprehend performs best. However, its near-optimal performance (92.2 % ROC AUC) suggests that the Jigsaw data was likely included in training process of the Amazon Comprehend API . Overall, Google’s API shows the worst performance across data sets. Its poor performance seems driven by a comparably high FPR, which suggests that the algorithm tends to overmoderate. In contrast, Microsoft Azure Content Moderation is associated with a high FNR, suggesting it often misses hate speech.

Furthermore, all services struggle to detect implicit hate speech, reflected in their high False Positive Rates on ToxiGen. To this end, commercial moderation services do not fare much better than their open-source counterparts [11]. One likely cause is the limited availability of implicit hate speech datasets for training purposes.

The comparative fairness evaluation of the identity group is presented via group-level pinned ROC AUC scores in Figure 3.¹¹1Due to space constraints, we only present one metric (ROC AUC). Future work includes a comprehensive analysis. We find that all services tend to overmoderate speech concerning groups PoC and LGBTQ+. This is somewhat surprising as extensive prior research uncovered biases in open-source content moderation algorithms in relation to these groups [26]. Commonly, such overmoderation occurs as toxic speech concerning these groups is overrepresented in the training data, and subsequently learned by the model. Most services fail to reliably detect hate speech aimed at groups Disability, Asian, and Latinx. Lastly, the tendency of Google Text Moderation to overmoderate is puzzling but also alarming. While we cannot entirely rule out an error on our end, this observation is robust to different configurations of API sub-categories.

Figure 3 (right) displays the PSA results. We find (1) differences in toxicity scores by and large are more pronounced on non-toxic than toxic data²²2 Intuitively this makes sense, as scores are generated non-linearly with a definite upper bound. Thus, when other elements in a sentence induce a high toxicity score, the marginal effect from identity tokens is comparably lower. and (2) greater variation in the mean CFT scores in non-synthetic than in synthetic data³³3This was to be expected, as the sentences from MegaSpeech contain more contextual information that interacts with the tokens.. Overall, the results suggest that most minorities are associated with higher levels of toxicity than dominant majorities, although these effects appear relatively small, and vary across groups and services. Group LGBTQ+ seems associated with the strongest negative bias, occurring for all samples and services. We observe limited negative bias against groups Latinx and Asian.

Summarizing, we uncovered both aggregate-level performance issues and group-level biases in major commercial cloud-based content moderation services. Importantly, while some shortcomings extend to all services, such as difficulty in detecting implicit hate speech or biases against group LGBTQ+, others are confined to a particular service.

References

Bakalis [2016] C. Bakalis, Regulating hate crime in the digital age, Oxford University Press, 2016.
De Gregorio [2020] G. De Gregorio, Democratising online content moderation: A constitutional framework, Computer Law & Security Review 36 (2020) 105376.
Seering [2020] J. Seering, Reconsidering self-moderation: the role of research in supporting community-based models for online content moderation, Proceedings of the ACM on Human-Computer Interaction 4 (2020) 1–23.
Einwiller and Kim [2020] S. A. Einwiller, S. Kim, How online content providers moderate user-generated content to prevent harmful online communication: An analysis of policies and their implementation, Policy & Internet 12 (2020) 184–206. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.239. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/poi3.239.
Gorwa et al. [2020] R. Gorwa, R. Binns, C. Katzenbach, Algorithmic content moderation: Technical and political challenges in the automation of platform governance, Big Data & Society 7 (2020) 205395171989794. URL: http://journals.sagepub.com/doi/10.1177/2053951719897945.
Schluger et al. [2022] C. Schluger, J. P. Chang, C. Danescu-Niculescu-Mizil, K. E. C. Levy, Proactive moderation of online discussions: Existing practices and the potential for algorithmic support, Proceedings of the ACM on Human-Computer Interaction 6 (2022) 1 – 27. URL: https://api.semanticscholar.org/CorpusID:253460203.
Garg et al. [2022] T. Garg, S. Masud, T. Suresh, T. Chakraborty, Handling Bias in Toxic Speech Detection: A Survey, CoRR abs/2202.00126 (2022). URL: https://arxiv.org/abs/2202.00126, arXiv: 2202.00126.
Sap et al. [2020] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, Y. Choi, Social Bias Frames: Reasoning about Social and Power Implications of Language, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5477–5490. URL: https://aclanthology.org/2020.acl-main.486.
Fortuna et al. [2020] P. Fortuna, J. Soler, L. Wanner, Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6786–6794. URL: https://aclanthology.org/2020.lrec-1.838.
Wiegand et al. [2019] M. Wiegand, J. Ruppenhofer, T. Kleinbauer, Detection of Abusive Language: the Problem of Biased Datasets, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:174799974.
Hartvigsen et al. [2022] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, E. Kamar, ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, in: Annual Meeting of the Association for Computational Linguistics, 2022. URL: https://api.semanticscholar.org/CorpusID:247519233.
Sheng et al. [2019] E. Sheng, K.-W. Chang, P. Natarajan, N. Peng, The Woman Worked as a Babysitter: On Biases in Language Generation, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3407–3412. URL: https://aclanthology.org/D19-1339. doi:10.18653/v1/D19-1339.
Casper et al. [2024] S. Casper, C. Ezell, C. Siegmann, N. Kolt, T. L. Curtis, B. Bucknall, A. Haupt, K. Wei, J. Scheurer, M. Hobbhahn, L. Sharkey, S. Krishna, M. V. Hagen, S. Alberti, A. Chan, Q. Sun, M. Gerovitch, D. Bau, M. Tegmark, D. Krueger, D. Hadfield-Menell, Black-box access is insufficient for rigorous ai audits, 2024. arXiv:2401.14446.
Birhane et al. [2024] A. Birhane, R. Steed, V. Ojewale, B. Vecchione, I. D. Raji, Ai auditing: The broken bus on the road to ai accountability, ArXiv abs/2401.14462 (2024). URL: https://api.semanticscholar.org/CorpusID:267301287.
Kak and West [2023] A. Kak, S. M. West, Algorithmic Accountability: Moving Beyond Audits, AI Now Institute (2023). URL: https://ainowinstitute.org/publication/algorithmic-accountability.
Raji et al. [2022] I. D. Raji, P. Xu, C. Honigsberg, D. E. Ho, Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance, 2022. URL: http://arxiv.org/abs/2206.04737, arXiv:2206.04737 [cs].
Pendzel et al. [2023] S. Pendzel, T. Wullach, A. Adler, E. Minkov, Generative AI for Hate Speech Detection: Evaluation and Findings, 2023. URL: http://arxiv.org/abs/2311.09993, arXiv:2311.09993 [cs].
Jigsaw [2019] Jigsaw, Jigsaw toxic comment classifi- cation challenge., 2019. URL: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
Mathew et al. [2021] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, A. Mukherjee, HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 14867–14875. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17745, number: 17.
ElSherief et al. [2021] M. ElSherief, C. Ziems, D. Muchlinski, V. Anupindi, J. Seybolt, M. D. Choudhury, D. Yang, Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, CoRR abs/2109.05322 (2021). URL: https://arxiv.org/abs/2109.05322, arXiv: 2109.05322.
Elsafoury et al. [2023] F. Elsafoury, S. Katsigiannis, N. Ramzan, On Bias and Fairness in NLP: How to have a fairer text classification?, 2023. URL: http://arxiv.org/abs/2305.12829, arXiv:2305.12829 [cs].
Borkan et al. [2019] D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced metrics for measuring unintended bias with real data for text classification, in: Companion Proceedings of The 2019 World Wide Web Conference, WWW ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 491–500. URL: https://doi.org/10.1145/3308560.3317593.
Yoder et al. [2022] M. M. Yoder, L. H. X. Ng, D. W. Brown, K. M. Carley, How hate speech varies by target identity: A computational analysis, arXiv preprint arXiv:2210.10839 (2022).
Dixon et al. [2018] L. Dixon, J. Li, J. Sorensen, N. Thain, L. Vasserman, Measuring and Mitigating Unintended Bias in Text Classification, in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 67–73.
Prabhakaran et al. [2019] V. Prabhakaran, B. Hutchinson, M. Mitchell, Perturbation sensitivity analysis to detect unintended model biases, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5740–5745. URL: https://aclanthology.org/D19-1578.
Garg et al. [2019] S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, A. Beutel, Counterfactual Fairness in Text Classification through Robustness, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, ACM, Honolulu HI USA, 2019, pp. 219–226. URL: https://dl.acm.org/doi/10.1145/3306618.3317950.