General Phrase Debiaser: Debiasing Masked Language Models at a Multi-Token Level
Abstract
The social biases and unwelcome stereotypes revealed by pretrained language models are becoming obstacles to their application. Compared with the numerous debiasing methods that target the word level, biases at the phrase level have received relatively little attention, which limits debiasing performance in discipline domains. In this paper, we propose an automatic multi-token debiasing pipeline called General Phrase Debiaser, which is capable of mitigating phrase-level biases in masked language models. Specifically, our method consists of a phrase filter stage that generates stereotypical phrases from Wikipedia pages and a model debias stage that debiases models at the multi-token level to tackle bias challenges on phrases. The latter searches for prompts that trigger the model's bias and then uses them for debiasing. State-of-the-art results on standard datasets and metrics show that our approach can significantly reduce gender biases in both career and multiple disciplines, across models with varying parameter sizes.
Index Terms— Social Bias, Stereotype, Pretrained Language Model, Masked Language Model, NLP
1 Introduction
Recently, masked language models (MLMs) [1, 2, 3, 4, 5, 6] have been employed both in traditional tasks such as text classification [7, 8, 9] and in diverse multimodal tasks [10, 11] when combined with models such as image generators [12, 13]. We aim to develop MLMs with minimal human biases, even when the pretraining data unavoidably contains these biases. However, correcting implicit biases in pretrained MLMs can be very challenging, especially considering the high cost of retraining models from scratch.
Existing studies [14, 15, 16, 17, 18, 19] have introduced intuitive approaches that use an additional corpus to retrieve contextualized embeddings, or to locate the biases and fine-tune accordingly. However, they rely on external human-written corpora. Auto-Debias [20] uses the prompt [21] template "[attribute word] [T]…[T] [MASK]" to guide MLMs to automatically search for prompts that make the model show its bias, and then fine-tunes the MLMs with them. Nevertheless, real-world language is not so ideal: both attribute words and stereotypes should be treated as multi-token. Because these methods only correct biases at the word level, they struggle at the phrase level.
Motivated by this, we propose an automatic multi-token debias pipeline called General Phrase Debiaser to address the limitations of automatic debiasing mentioned above. The major contributions of our work are:
- Unlike existing methods, we debias MLMs at the phrase granularity. In order to reduce the cost of manually constructing the phrase list, we get the stereotypical phrases filtered from hyperlinks of Wikipedia pages in Phrase Filter Stage.
- With the multi-token debias head we proposed, “discriminative” prompts can be searched in Model Debias Stage. These cloze-style prompts have the highest disagreement in generating stereotypical phrases (e.g., mathematical theory/dance art) with respect to demographic words (e.g., man/woman). Then we fine-tune the model using searched prompts.
- Different from the Auto-Debias’ fine-tuning stage, our approach derives loss from stereotypical phrases, rather than from the entire vocabulary belonging to the model itself. This allows our method to adjust the model parameters more specifically without affecting any other gender-independent word or knowledge.

Our code and debiased model files are available at https://github.com/BingkangShi/general-phrase-debiaser.
2 General Phrase Debiaser
2.1 Phrase Filter Stage
To minimize the cost of manually constructing stereotypes in many specific domains, we use the MLM that is to be debiased to filter the hyperlinks of Wikipedia pages: hyperlinked phrases that are semantically similar to stereotypical seeds are retained. The stereotype seeds, comprising $n$ topics, need to be manually specified, with each topic having a set of hyponyms. We choose career, math, art, and science as topics to construct stereotypes, so $n$ is 4 in this paper. It should be noted that the filtered phrases under the math, art, and science topics are generated by our Phrase Filter Stage, while the phrases under the career topic were provided by previous work [15].
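Let $\mathcal{M}$ be an MLM and $E(\cdot)$ be the process of computing the classification embedding (CLS) that represents a sentence. The embedding of a phrase $p$ can be computed as follows:

$$\mathbf{e}_p = \frac{1}{|T|} \sum_{t \in T} E\big(t(p); \mathcal{M}\big) \tag{1}$$

where $t$ represents a sentence template, $t(p)$ is the sentence obtained by filling $p$ into the blank of $t$, and $T$ is a set that includes all templates. We refer to the 14 blank-filling templates used in the SEAT test [22], such as "this is a __." or "__ is here.".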
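Then the cosine similarity between a phrase $p$ and a seed hyponym $s_{i,j}$ can be computed through:

$$\mathrm{sim}(p, s_{i,j}) = \frac{\mathbf{e}_p \cdot \mathbf{e}_{s_{i,j}}}{\lVert \mathbf{e}_p \rVert \, \lVert \mathbf{e}_{s_{i,j}} \rVert} \tag{2}$$

where $\mathbf{e}_p$ and $\mathbf{e}_{s_{i,j}}$ are the embeddings given by Eq. (1), $s_{i,j}$ is the $j$-th hyponym of the $i$-th topic, and $i \in \{1, \dots, n\}$ with $n$ the number of topics.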
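We define the number of phrases $p$ kept in each phrase set $S_{i,j}$ as $k$, selected according to the similarity of Eq. (2) in sorted order. So we can collect $S_{i,j}$ with:

$$S_{i,j} = \operatorname{top}_k \big\{ \mathrm{sim}(p, s_{i,j}) \big\} \tag{3}$$

where $\operatorname{top}_k$ keeps the $k$ candidate phrases most similar to the hyponym $s_{i,j}$, and the phrase set of the $i$-th topic, $S_i$, collects the sets $S_{i,j}$ of all its hyponyms. The phrase sets of all topics together form $S$. After removing duplicate phrases, $S$ is transformed into $S'$.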
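The filtering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `toy_embed` is a hypothetical stand-in for the MLM's CLS embedding, the phrase embedding is assumed to be the average over templates, only the two templates quoted above are used, and `top_k_phrases` with its parameter `k` are names introduced here.

```python
import math

# Two of the SEAT-style blank-filling templates quoted in the text.
TEMPLATES = ["this is a {}.", "{} is here."]

def toy_embed(sentence):
    """Hypothetical stand-in for the MLM's CLS embedding: letter counts."""
    vec = [0.0] * 26
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def phrase_embedding(phrase, embed=toy_embed, templates=TEMPLATES):
    """Average the sentence embeddings of a phrase over all templates."""
    vectors = [embed(t.format(phrase)) for t in templates]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_phrases(candidates, seed, k, embed=toy_embed):
    """Return the k candidate phrases most similar to one seed hyponym."""
    seed_vec = phrase_embedding(seed, embed)
    scored = [(cosine(phrase_embedding(p, embed), seed_vec), p) for p in candidates]
    scored.sort(reverse=True)
    return [p for _, p in scored[:k]]
```

In practice `toy_embed` would be replaced by a call into the model being debiased, and `candidates` by the hyperlinked phrases harvested from Wikipedia pages.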
2.2 Finding Biased Prompts
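Auto-Debias [20] previously used cloze-style prompts to detect biases in attribute words within stereotypes. Let $\mathcal{V}$ be the vocabulary of an MLM $\mathcal{M}$, and let a prompt $x_{prompt} \in \mathcal{V}^{*}$ be a sequence of words with one masked token [MASK] and one attribute token. An MLM can be probed by a cloze-style prompt, such as $x_{prompt}$ = "[attribute] majors in [MASK].". The "[attribute]" slot is filled from a set $\mathcal{A}$ composed of $m$-tuples, derived from the gender word list in [15], and the "[MASK]" position is where $\mathcal{M}$ predicts a stereotypical word. So we can obtain the stereotypical word probability as:

$$p(\text{[MASK]} = v \mid \mathcal{M}, x_{prompt}(a)) = \frac{\exp\big(\mathcal{M}_{\text{[MASK]}}(v \mid x_{prompt}(a))\big)}{\sum_{v' \in \mathcal{V}} \exp\big(\mathcal{M}_{\text{[MASK]}}(v' \mid x_{prompt}(a))\big)} \tag{4}$$

where $v \in \mathcal{V}$ and $\mathcal{M}_{\text{[MASK]}}(v \mid \cdot)$ is the logit of $v$ at the [MASK] position. $x_{prompt}(a)$ is a string composed of $a$ and $x_{prompt}$. For example, $x_{prompt}(a)$ = "$a$ majors in [MASK].".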
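As a small illustration of this single-token probe, the following sketch turns the logits at the [MASK] position into a probability distribution for two attribute words. The logit values, the three-word vocabulary, and the helper names `fill_prompt` and `mask_distribution` are hypothetical; a real run would read the logits from the MLM.

```python
import math

def mask_distribution(mask_logits):
    """Softmax over the vocabulary logits at the [MASK] position."""
    peak = max(mask_logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in mask_logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def fill_prompt(prompt, attribute):
    """Build the probed string by substituting an attribute word into the template."""
    return prompt.replace("[attribute]", attribute)

TEMPLATE = "[attribute] majors in [MASK]."

# Hypothetical logits a model might emit at [MASK] for the two filled prompts.
toy_logits = {
    fill_prompt(TEMPLATE, "he"): {"math": 2.0, "art": 0.5, "law": 1.0},
    fill_prompt(TEMPLATE, "she"): {"math": 0.5, "art": 2.0, "law": 1.0},
}
distributions = {text: mask_distribution(logits) for text, logits in toy_logits.items()}
```

A prompt is informative for debiasing when the two resulting distributions disagree strongly, as they do in this toy case.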
However, the above method is effective only when stereotypes are single-token. We therefore introduce a probability calculation method for stereotypes at multi-token granularity (as shown in Fig. 2):
(5) | ||||
while the logit corresponding to the [MASK] token position should be:
(6) |
where is multiple [MASK] token sequence of length , and is the maximum length of stereotypical phrase . containing [MASK] tokens is the -th phrase of the -th tuple in set . Here should evolve into:
(7) | ||||
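To make the multi-token setting concrete, the sketch below scores whole phrases against a prompt that contains several [MASK] slots. It is an illustration under stated assumptions rather than a transcription of Eqs. (5)-(7): summing the per-position logits to obtain a phrase logit, normalizing only over stereotype phrases of the same token length, and all names and values (`expand_prompt`, `phrase_logit`, `phrase_distribution`, the toy logits) are assumptions made here.

```python
import math

def expand_prompt(prompt, attribute, length):
    """Fill the attribute slot and widen the single [MASK] to `length` masks."""
    masks = " ".join(["[MASK]"] * length)
    return prompt.replace("[attribute]", attribute).replace("[MASK]", masks)

def phrase_logit(position_logits, phrase_tokens):
    """Assumed aggregation: sum the logit of each phrase token at its own mask."""
    return sum(position_logits[i][tok] for i, tok in enumerate(phrase_tokens))

def phrase_distribution(position_logits, phrases):
    """Softmax over a set of equal-length stereotype phrases only."""
    scores = {p: phrase_logit(position_logits, p) for p in phrases}
    peak = max(scores.values())
    exps = {p: math.exp(s - peak) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}

# Hypothetical per-position logits for a prompt with two [MASK] slots.
toy_position_logits = [
    {"number": 1.5, "modern": 0.2},
    {"theory": 1.0, "dance": 0.3},
]
two_token_phrases = [("number", "theory"), ("modern", "dance")]
dist = phrase_distribution(toy_position_logits, two_token_phrases)
```

Repeating this for every phrase length up to the longest stereotype phrase yields one distribution per length, which is the quantity compared across attribute words in the next step.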
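By repeatedly applying Eq. (5), we can obtain the distributions for stereotypical phrases of each token length. In steps 2 and 3 of Fig. 1, we use the Jensen-Shannon Divergence (JSD), a symmetric and smoothed variant of the Kullback–Leibler divergence (KLD), to measure the difference between multiple distributions $P_1, \dots, P_m$ as follows:

$$\mathrm{JSD}(P_1, \dots, P_m) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{KLD}\left(P_i \,\middle\|\, \frac{P_1 + \dots + P_m}{m}\right) \tag{8}$$

In this paper, JSD measures the difference between the two-gender distributions, so $m = 2$. The KLD between two distributions $P_i$ and $P_j$ can be computed as $\mathrm{KLD}(P_i \,\|\, P_j) = \sum_{u} P_i(u) \log \frac{P_i(u)}{P_j(u)}$, where $u$ ranges over the support of the distributions.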
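The divergence itself is short enough to write out directly. The sketch below implements the standard definitions of KLD and JSD for distributions given as aligned lists of probabilities; the function names `kld` and `jsd` are chosen here for illustration.

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) for aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(distributions):
    """Jensen-Shannon divergence: mean KLD of each distribution to the mixture."""
    m = len(distributions)
    mixture = [sum(col) / m for col in zip(*distributions)]
    return sum(kld(p, mixture) for p in distributions) / m
```

With two distributions the value lies between 0 (identical distributions) and $\log 2$ (disjoint supports), so a prompt whose two gender-conditioned distributions approach $\log 2$ is maximally discriminative.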
The loss of an input can be seen as the sum of the overall probability distribution differences of :
(9) | ||||
Step 2 in Fig. 1 shows that we employ beam search [23] to find biased prompts that maximize this loss. The biased prompts found are collected for fine-tuning the MLM in step 3.
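The search loop can be sketched as a plain beam search over candidate words. This is a simplified illustration: `toy_score` is a hypothetical stand-in for the JSD-based loss, the three-word vocabulary is invented, and keeping the beam from every prompt length in the output is an assumption about how prompts are collected.

```python
def beam_search_prompts(vocab, score, max_len, beam_width):
    """Grow prompts one word at a time, keeping the `beam_width` best per step."""
    beam = [()]
    collected = []
    for _ in range(max_len):
        candidates = [prompt + (word,) for prompt in beam for word in vocab]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
        collected.extend(beam)
    return collected

def toy_score(prompt):
    """Hypothetical stand-in for the JSD-based loss of a prompt."""
    weights = {"majors": 2.0, "in": 1.0}
    return sum(weights.get(word, 0.0) for word in prompt)

found = beam_search_prompts(["majors", "in", "the"], toy_score, max_len=2, beam_width=2)
```

With the settings reported in the implementation details, `max_len` would be 5, `beam_width` 100, and `vocab` the 5,000 highest-frequency Wikipedia words.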
2.3 Fine-tuning MLM with Prompts
Table 1: SEAT effect sizes of the original and debiased models (closer to 0 indicates less bias).

Model | SEAT-6 | SEAT-6b | SEAT-7 | SEAT-7b | SEAT-8 | SEAT-8b | avg. |
---|---|---|---|---|---|---|---|
BERT | 0.48 | 0.11 | 0.25 | 0.25 | 0.40 | 0.61 | 0.35 |
+Context-Debias[15] | 1.13 | - | 0.34 | - | 0.12 | - | 0.53 |
+FairFil[19] | 0.18 | 0.08 | 0.12 | 0.08 | 0.20 | 0.24 | 0.15 |
+Auto-Debias[20] | 0.08 | 0.02 | 0.36 | 0.40 | 0.12 | 0.20 | 0.20 |
+General Phrase Debiaser | 0.00 | 0.13 | 0.19 | 0.10 | 0.02 | 0.27 | 0.12 |
ALBERT | 0.51 | 0.02 | 0.58 | 1.02 | 0.99 | 1.20 | 0.72 |
+Context-Debias[15] | 0.18 | - | 0.05 | - | 0.77 | - | 0.33 |
+General Phrase Debiaser | 0.04 | 0.30 | 0.01 | 0.02 | 0.33 | 0.29 | 0.16 |
DistilBERT | 1.26 | 0.25 | 0.31 | 1.22 | 0.74 | 0.98 | 0.79 |
+Context-Debias[15] | 1.34 | - | 1.01 | - | 0.97 | - | 1.11 |
+General Phrase Debiaser | 0.60 | 0.32 | 0.21 | 0.99 | 0.23 | 0.79 | 0.52 |
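
Table 2: GLUE scores of the original models and of their versions debiased with General Phrase Debiaser.

Model | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | WNLI |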
---|---|---|---|---|---|---|---|---|---|
BERT | 0.59 | 0.93 | 0.89/0.85 | 0.89/0.88 | 0.91/0.88 | 0.85/0.85 | 0.92 | 0.65 | 0.56 |
+General Phrase Debiaser | 0.56 | 0.93 | 0.89/0.84 | 0.89/0.89 | 0.90/0.88 | 0.85/0.85 | 0.92 | 0.65 | 0.56 |
ALBERT | 0.55 | 0.92 | 0.92/0.89 | 0.91/0.91 | 0.91/0.88 | 0.85/0.85 | 0.92 | 0.73 | 0.39 |
+General Phrase Debiaser | 0.54 | 0.93 | 0.90/0.86 | 0.91/0.91 | 0.91/0.88 | 0.85/0.85 | 0.92 | 0.73 | 0.42 |
DistilBERT | 0.47 | 0.91 | 0.89/0.84 | 0.86/0.86 | 0.90/0.87 | 0.82/0.82 | 0.88 | 0.58 | 0.56 |
+General Phrase Debiaser | 0.46 | 0.91 | 0.89/0.84 | 0.86/0.86 | 0.90/0.87 | 0.82/0.82 | 0.89 | 0.62 | 0.56 |
Given that existing work [15] demonstrated the presence of biases in all parameters of the models, we choose to fine-tune the entire model to mitigate its biases, using the biased prompts found in step 2. This corresponds to step 3 as illustrated in Fig. 1.
In contrast to the prompt search phase, during the debiasing fine-tuning we aim to minimize the loss of Eq. (9), reducing the discrepancy between the model's distributions over stereotypical phrases that is induced by different attribute words. This discrepancy is specific to the stereotypical phrase set, which means that our method propagates gradients through each phrase in the stereotype set, rather than through the entire vocabulary of the model as done in Auto-Debias. As a result, our debiasing approach pays more attention to stereotypical phrases and has less impact on unrelated words.
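The effect of restricting the loss to stereotypical phrases can be seen in a toy computation. The sketch below is only an illustration: the scores for "he" and "she" prompts are invented, the names `debias_loss` and `stereotypes` are hypothetical, and the loss is reduced to a single two-way JSD instead of a sum over prompts and attribute tuples.

```python
import math

def softmax(scores):
    """Normalize a list of scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) for aligned probability lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def debias_loss(scores_a, scores_b, support):
    """Two-way JSD between attribute-conditioned distributions over `support` only."""
    p = softmax([scores_a[item] for item in support])
    q = softmax([scores_b[item] for item in support])
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kld(p, mixture) + 0.5 * kld(q, mixture)

# Hypothetical scores for prompts filled with "he" and "she".
he = {"number theory": 2.0, "modern dance": 0.5, "the": 3.0}
she = {"number theory": 0.5, "modern dance": 2.0, "the": 3.0}
stereotypes = ["number theory", "modern dance"]

loss_before = debias_loss(he, she, stereotypes)
she_shifted = dict(she, the=9.0)  # change only a word outside the stereotype set
loss_after = debias_loss(he, she_shifted, stereotypes)
```

Because `loss_after` equals `loss_before`, the score of the unrelated word "the" receives no training signal from this loss, whereas a loss computed over the full vocabulary would change when that score changes.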
3 Results and Evaluation
3.1 Evaluation Data And Details
Debias Data & Language Capability Data: We evaluate the proposed General Phrase Debiaser on two datasets: (1) the Sentence Embedding Association Test (SEAT) [22], which provides a commonly used metric for assessing biases in PLM embeddings, and (2) the General Language Understanding Evaluation (GLUE) benchmark [24], which measures general language modeling capability. We evaluate our method on 3 MLMs with different parameter sizes: BERT [1], ALBERT [2], and DistilBERT [5], and compare the proposed method with 3 other algorithms: Context-Debias [15], FairFil [19], and Auto-Debias [20].
Implementation Details: Hyperparameters play a critical role in final performance [25, 26, 27]. For completeness, we report the hyperparameters used in our study. In step 2 of Fig. 1, the phrase set we use has 624 phrases. The maximum biased prompt length is 5 and the beam search width is 100. We use the 5,000 highest-frequency words in Wikipedia as the search space, to avoid noise in the vocabulary and speed up the prompt search process. In step 3, we instead use a phrase set with more than 500 phrases, because it takes into account the varying weights of different stereotypical phrases, resulting in better debiasing effects (as shown in Table 1). We choose an extended attribute set (an extension of the attribute set of Section 2.2, derived from the gender word list in [15]) as attribute phrases to construct more fine-tuning data. All models are trained with the AdamW [28] optimizer and an early stopping strategy. Our experiments run on a single NVIDIA 3090Ti.
3.2 Evaluation Result And Analysis
We run General Phrase Debiaser on the career field and the discipline fields at the same time. The effect size of the SEAT [22] benchmark that we report in Table 1 measures the association between two sets of target concepts and two sets of attributes. It is obtained by calculating the normalized distance between a set of attribute sentence vectors and a set of concept sentence vectors output by the model; the closer this value is to 0, the less biased the model is. The results demonstrate that our method is capable of reducing model biases, lowering the original average scores of BERT, ALBERT, and DistilBERT over the six SEAT tests from 0.35, 0.72, and 0.79 to 0.12, 0.16, and 0.52, respectively. Furthermore, compared with the other approaches in the benchmark, including those relying on manual datasets or generating data automatically, General Phrase Debiaser achieves the lowest average effect size for every model, i.e., state-of-the-art debiasing performance. Our analysis attributes this advantage to three aspects:
Simultaneous Debiasing across Multiple Domains. Our method can effectively reduce the gender bias in career, math, art, and science simultaneously, without requiring multiple debiasing processes on the same model.
Knowledge Debiasing in Phrase Granularity. Our method operates at the phrase granularity rather than the word granularity, making it easier to probe and mitigate biases in disciplines. For example, in Table 1, General Phrase Debiaser achieves the best average score on the four tests (SEAT-7, SEAT-7b, SEAT-8, and SEAT-8b) concerning math, art, and science.
Keep Language Capability after Debias. We test gender-debiased versions of BERT, ALBERT, and DistilBERT on the General Language Understanding Evaluation (GLUE) benchmark [24]. The test results are presented in Table 2. The gender-debiased versions of BERT, ALBERT, and DistilBERT lose at most 0.03 on any GLUE task relative to the original models, demonstrating that General Phrase Debiaser alleviates the bias concerns while maintaining language modeling capability.
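For readers unfamiliar with the metric in Table 1, the effect size can be sketched as follows. The code follows the standard WEAT-style definition that SEAT [22] applies to sentence embeddings; the two-dimensional vectors are invented for illustration, and the use of the sample standard deviation and the function names are assumptions of this sketch.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, attrs_a, attrs_b):
    """How much closer vector w is to attribute set A than to attribute set B."""
    return mean([cosine(w, a) for a in attrs_a]) - mean([cosine(w, b) for b in attrs_b])

def effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, normalized by their pooled spread."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Hypothetical 2-d "sentence embeddings" for two concept sets and two attribute sets.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
biased = effect_size(X, Y, A, B)
```

Here the concept set `X` aligns with attribute set `A` and `Y` with `B`, so the effect size is strongly positive; a debiased encoder would push it toward 0.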
4 Conclusion
Our proposed method debiases MLMs at the phrase granularity while maintaining language modeling capability, and achieves state-of-the-art results on the SEAT test. Although decoder-only LLMs are gaining popularity, we still consider bias mitigation in encoder-only models crucial. Moreover, the concepts presented here can also be applied to cross-modal models that involve encoder-only models.
Acknowledgements
This work was supported by Grant 2020YFB1005400 from the National Key R&D Program of China.
References
- [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [2] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2020.
- [3] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [4] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. 2019, vol. 32, Curran Associates, Inc.
- [5] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
- [6] Shaokun Zhang, Xiawu Zheng, Chenyi Yang, Yuchao Li, Yan Wang, Fei Chao, Mengdi Wang, Shen Li, Jun Yang, and Rongrong Ji, “You only compress once: Towards effective and elastic bert compression via exploit-explore stochastic nature gradient,” arXiv preprint arXiv:2106.02435, 2021.
- [7] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown, “Text classification algorithms: A survey,” Information, vol. 10, no. 4, pp. 150, 2019.
- [8] Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, and Tongliang Liu, “Ideal: Influence-driven selective annotations empower in-context learners in large language models,” arXiv preprint arXiv:2310.10873, 2023.
- [9] Shaokun Zhang, Yiran Wu, Zhonghua Zheng, Qingyun Wu, and Chi Wang, “Hypertime: Hyperparameter optimization for combating temporal distribution shifts,” arXiv preprint arXiv:2305.18421, 2023.
- [10] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang, “Autogen: Enabling next-gen llm applications via multi-agent conversation framework,” arXiv preprint arXiv:2308.08155, 2023.
- [11] Yiran Wu, Feiran Jia, Shaokun Zhang, Qingyun Wu, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, and Chi Wang, “An empirical study on challenging math problem solving with gpt-4,” arXiv preprint arXiv:2306.01337, 2023.
- [12] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- [13] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
- [14] Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency, “Towards debiasing sentence representations,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5502–5515.
- [15] Masahiro Kaneko and Danushka Bollegala, “Debiasing pre-trained contextualised embeddings,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1256–1266.
- [16] Aparna Garimella, Akhash Amarnath, Kiran Kumar, Akash Pramod Yalla, N Anandhavelu, Niyati Chhaya, and Balaji Vasan Srinivasan, “He is very intelligent, she is very beautiful? on mitigating social biases in language modelling and generation,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4534–4545.
- [17] James W. Cooley and John W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
- [18] Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, Ed Chi, and Slav Petrov, “Measuring and reducing gendered correlations in pre-trained models,” arXiv preprint arXiv:2010.06032, 2020.
- [19] Pengyu Cheng, Weituo Hao, Siyang Yuan, Shijing Si, and Lawrence Carin, “Fairfil: Contrastive neural debiasing method for pretrained text encoders,” in International Conference on Learning Representations.
- [20] Yue Guo, Yi Yang, and Ahmed Abbasi, “Auto-debias: Debiasing masked language models with automated biased prompts,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 1012–1023.
- [21] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig, “How can we know what language models know?,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
- [22] Chandler May, Alex Wang, Shikha Bordia, Samuel R Bowman, and Rachel Rudinger, “On measuring social biases in sentence encoders,” in Proceedings of NAACL-HLT, 2019, pp. 622–628.
- [23] Markus Freitag and Yaser Al-Onaizan, “Beam search strategies for neural machine translation,” ACL 2017, p. 56, 2017.
- [24] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” in International Conference on Learning Representations.
- [25] Shaokun Zhang, Feiran Jia, Chi Wang, and Qingyun Wu, “Targeted hyperparameter optimization with lexicographic preferences over multiple objectives,” in The Eleventh International Conference on Learning Representations, 2022.
- [26] Xiawu Zheng, Chenyi Yang, Shaokun Zhang, Yan Wang, Baochang Zhang, Yongjian Wu, Yunsheng Wu, Ling Shao, and Rongrong Ji, “Ddpnas: Efficient neural architecture search via dynamic distribution pruning,” International Journal of Computer Vision, vol. 131, no. 5, pp. 1234–1249, 2023.
- [27] Xiaobo Xia, Jiale Liu, Shaokun Zhang, Qingyun Wu, and Tongliang Liu, “Coreset selection with prioritized multiple objectives,” arXiv preprint arXiv:2311.08675, 2023.
- [28] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.