In this section, we first present several measures for quantifying bias in NLP.
Task-agnostic metrics, which target intermediate representations non-specific to a task, are discussed in Section
3.1.1 and
task-specific metrics, i.e., those that measure bias for a specific downstream task, are discussed in Section
3.1.2. Section
3.1.3 presents research on the handling of gender beyond the male-female binary within gender bias research, which is increasingly acknowledged as a scientific gap within the NLP community. Finally, Section
3.1.4 discusses some overall limitations of current bias measures in NLP.
3.1.1 Task-agnostic Metrics.
Task-agnostic bias metrics target models that are pre-trained to serve as input representations for later tasks; they therefore measure semantic bias (cf. Table
1). Because these metrics are applied to the trained model itself, they are independent of the application domain and thus of any task-specific dataset. The models in question are either LLMs or their predecessors, word embeddings. Some of the methods first developed to show how pre-trained word embeddings capture social biases were later adapted for LLMs [
53]; other methods, however, were tailored to the LLMs’ context-dependent structure and their training objective as language models. An overview of the methods discussed in this section can be found in Table
2.
Embedding Association Tests. One of the most commonly used frameworks for gender bias detection in NLP applications is the
Word Embedding Association Test (WEAT). The test was adapted by Caliskan et al. [
26] from the Implicit Association Test used in psychological research [
52]. The
WEAT measures associations between identity terms that express gender, such as
he,
she, and so on, and positive or negative terms, or terms relating to fields with a stereotypical gender connotation, such as family life or the natural sciences. The metric used here is the distance (in practice, cosine similarity) between the terms’ vector representations. While the WEAT has been praised for drawing on literature outside of NLP and thereby presenting an interdisciplinary approach that grounds word embedding associations in human cognition [
17], it has also been criticised for over-estimating bias [
45].
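To make the mechanics concrete, the following is a minimal sketch of a WEAT-style effect size computed over static word embeddings. The word lists and the embedding lookup are illustrative placeholders, not the original stimuli of Caliskan et al. [26].

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """WEAT effect size: standardised difference of mean associations
    between target sets X and Y with respect to attribute sets A and B."""
    x_assoc = [association(x, A, B, emb) for x in X]
    y_assoc = [association(y, A, B, emb) for y in Y]
    pooled_std = np.std(x_assoc + y_assoc, ddof=1)
    return (np.mean(x_assoc) - np.mean(y_assoc)) / pooled_std

# Illustrative stimuli (placeholders, not the original WEAT word lists).
X = ["he", "man", "father"]          # target set 1
Y = ["she", "woman", "mother"]       # target set 2
A = ["science", "physics", "math"]   # attribute set 1
B = ["family", "home", "children"]   # attribute set 2
# `embeddings` is assumed to map each word to a NumPy vector,
# e.g., loaded from pre-trained GloVe or word2vec vectors.
# effect = weat_effect_size(X, Y, A, B, embeddings)
```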
With the emergence of LLMs, the WEAT was further adapted, as word representations in LLMs are not singular, like in traditional word embedding models, but are dependent on sentence context. May et al. [
85] developed the
Sentence Embedding Association Test (SEAT) and created “semantically bleached,” i.e., very simple, sentence templates into which the target and attribute terms were embedded to extract the respective vector representations of the words. The authors tested their methodology on a variety of LLMs, but found discrepancies in the results, leading them to question whether the concepts tested (e.g., gender or pleasantness) can be represented within simple sentences and their association measured using cosine similarity.
Avoiding the problem of sentence templates, Guo and Caliskan [
53] also adapted the WEAT for use in LLMs. For their so-called
Contextualized Embedding Association Test (CEAT), they extracted 10,000 sentences containing stimuli (target/attribute words) from a Reddit corpus, computed WEATs to obtain effect sizes for all pairings, and then used a random-effects model, a technique borrowed from meta-analysis, to analyze the distribution of effect sizes. They found evidence of all tested bias categories in all of the tested LLMs (GPT, GPT-2, BERT, ELMo) but also observed some negative results, indicating that “some WEAT stimuli tend to occur in stereotype-incongruent contexts more frequently” [
53].
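The following sketch illustrates only the aggregation step of CEAT under simple assumptions: given per-sample WEAT effect sizes and their estimated variances, a DerSimonian–Laird random-effects model yields a combined effect size. The input numbers are toy values; Guo and Caliskan [53] compute the effect sizes over contextualised embeddings extracted from thousands of Reddit sentences.

```python
import numpy as np

def random_effects_combined(effect_sizes, variances):
    """Combine per-sample effect sizes with a DerSimonian-Laird
    random-effects model, as used in meta-analysis."""
    es = np.asarray(effect_sizes, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    fixed_mean = np.sum(w * es) / np.sum(w)
    q = np.sum(w * (es - fixed_mean) ** 2)        # heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(es) - 1)) / c)      # between-sample variance
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    combined = np.sum(w_star * es) / np.sum(w_star)
    return combined, tau2

# Toy example: effect sizes from repeated WEATs over different
# contextualised embeddings of the same stimuli (illustrative numbers).
ces, tau2 = random_effects_combined([0.8, 0.4, 1.1, 0.2], [0.05, 0.04, 0.06, 0.05])
```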
Increased Log Probability Bias Score (ILPBS). Kurita et al. [
71] presented a method of measuring associations between identity terms and stereotypical terms within masked language models, such as BERT. The
log probability bias score measures word likelihood in varying contexts. Similar to May et al. [
85], they also created templates. However, instead of using semantically bleached contexts, their templates contain both an identity term and a stereotyped attribute term. The score is derived from the difference in the LLM’s predicted association for counterfactual identity terms in sentences that share the same attribute. Kurita et al. [
71] found statistically significant bias scores where their adaptation of WEAT did not provide significant results. The authors interpreted this as evidence that gender bias assessment methods used on standard word embeddings cannot simply be transferred to LLMs.
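A minimal sketch of a log probability bias score for a masked language model is shown below, using the Hugging Face transformers library. The template, target pronouns, and attribute word are illustrative; the normalisation follows the idea of Kurita et al. [71] of dividing the target probability in the attribute sentence by its prior probability when the attribute is also masked.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def target_probability(sentence, target_word):
    """Probability the model assigns to `target_word` at the first [MASK]."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_positions[0]], dim=-1)
    target_id = tokenizer.convert_tokens_to_ids(target_word)
    return probs[target_id].item()

def log_probability_bias_score(target, attribute_template, prior_template):
    """log(p_target / p_prior): the attribute sentence keeps the attribute
    word, while the prior sentence masks it out as well."""
    p_tgt = target_probability(attribute_template, target)
    p_prior = target_probability(prior_template, target)
    return torch.log(torch.tensor(p_tgt / p_prior)).item()

# Illustrative templates (not the original stimuli).
attr = "[MASK] is a programmer."
prior = "[MASK] is a [MASK]."
score_he = log_probability_bias_score("he", attr, prior)
score_she = log_probability_bias_score("she", attr, prior)
bias = score_he - score_she   # positive values suggest a male association
```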
Discovery of Correlations (DisCo). Webster et al. [
137] took a slightly different template-based approach. Instead of measuring the associations between pre-defined terms, their method aimed to discover terms correlated with gender (
DisCo). They created two template variants for
DisCo, one that uses first names, e.g., “[NAME] studied [BLANK] at college,” and another that uses nouns that contain gender information, e.g., “The [NOUN] likes to [BLANK]” [
137]. The model was then asked to fill in the blank slot. The researchers used the
\(\chi ^2\) measure to see whether the three most likely proposals were significantly correlated with the associated gender of [NAME] or [NOUN], i.e., whether the proposed fill was gender-dependent, indicating model bias. Using DisCo, the researchers found that in both BERT and ALBERT, first names are more likely than gendered nouns to generate fills with what they termed gendered correlations. Furthermore, they showed that LLMs with similar accuracy do not necessarily exhibit the same gendered correlations.
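The sketch below illustrates the statistical test behind a DisCo-style evaluation under simplifying assumptions: collect the model's top-3 fills for the male and female template variants and test whether the fill distribution depends on gender with a \(\chi ^2\) test. The fill lists are hypothetical toy data, and collecting the fills from a masked language model is left to the caller.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def disco_style_test(fills_male, fills_female, alpha=0.05):
    """Chi-square test of independence between gender and proposed fills.
    `fills_male` / `fills_female` are lists of fill words collected from
    the model's top-3 predictions over many templates."""
    vocab = sorted(set(fills_male) | set(fills_female))
    cm, cf = Counter(fills_male), Counter(fills_female)
    table = [[cm[w] for w in vocab], [cf[w] for w in vocab]]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value < alpha, p_value

# Hypothetical fills gathered from templates such as
# "[NAME] studied [BLANK] at college." for male vs. female names.
gender_dependent, p = disco_style_test(
    ["engineering", "physics", "math", "engineering"],
    ["nursing", "art", "nursing", "literature"],
)
```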
ABC Stereotype Model/Sensitivity Test (SeT). Another common criticism of research on bias and fairness in NLP is that techniques are not sufficiently grounded in theory from outside the field, such as psychology, sociology, or feminist theory [
17]. Cao et al. [
28] conducted a study asking participants to report stereotypes held by the general population and based their research on Koch et al.’s [
68]
Agency Beliefs Communion (ABC) stereotype model from social psychology theory. Cao et al. [
28] also presented their own methodology for measuring stereotyping in LLMs: the SeT. The SeT measures how much model weights would need to change to arrive at predicting an anti-stereotypical trait for a given group, e.g., “Men are
kind.” Compared to both CEAT [
53] and the Log Probability Bias Score [
71], the SeT showed better alignment with the ways in which humans tend to stereotype. Overall, however, human and model judgements showed only moderate correlation.
Open-ended Language Generation. Bias can also be measured through open-ended language generation. Sheng et al. [
114] created sentence templates with placeholders for identity terms in contexts related to respect for a person and to occupations. They used two measures to assess bias in the LLM-generated sentence completions: sentiment and regard. Nozza et al. [
92] also created templates that contain identity terms in different contexts and additionally presented the HONEST score, which uses the
HurtLex lexicon of harmful language [
9], to measure how often a language model’s top candidates for completing a sentence contain toxic language. They applied the HONEST score to BERT and GPT-2 models in six languages and found, for example, that sentence templates containing a female subject were completed with a reference to promiscuity 9% of the time. Nozza et al. [
93] extended this research to LGBTQ+-related identity terms and measured the HONEST score as well as toxicity on the sentence level using the Perspective API.
They found that the sentences completed by the queried LLMs are classified as harmful 13% of the time. Furthermore, Dhamala et al. [
41] created the BOLD measures and dataset, which use sentences from Wikipedia that contain mentions of protected groups to measure bias in open-ended language generation. Akyürek et al. [
3] took a critical perspective on using open-ended language generation as a measure for bias. They demonstrated that bias measures are highly dependent on experimental design, including factors like model parameters, which can influence whether or not bias reaches a harmfulness threshold.
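As an illustration of how a lexicon-based score in the spirit of HONEST operates, the sketch below counts how often a model's top-k completions for a set of identity templates contain a term from a harmful-language lexicon. The lexicon, templates, and `complete` function are placeholders; the actual HONEST implementation relies on the multilingual HurtLex lexicon [9].

```python
def honest_style_score(templates, complete, lexicon, k=5):
    """Fraction of top-k completions that contain a lexicon term.
    `complete(template, k)` is assumed to return the model's k most
    likely single-word completions for the template's blank."""
    hits, total = 0, 0
    for template in templates:
        for candidate in complete(template, k):
            total += 1
            if candidate.lower() in lexicon:
                hits += 1
    return hits / total if total else 0.0

# Placeholder inputs for illustration only.
templates = ["The woman is known as a [BLANK].", "The man is known as a [BLANK]."]
harmful_lexicon = {"slur_1", "slur_2"}   # stand-in for HurtLex entries
# score = honest_style_score(templates, my_mlm_top_k, harmful_lexicon)
```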
Challenge Datasets. Measuring gender bias in NLP generally depends on datasets that contain identity terms for which differences in model behaviour are measured. These challenge datasets are designed specifically to assess shortcomings with respect to a certain (social) variable. Challenge datasets appear under several different names throughout the literature. Blodgett et al. [
18] referred to “benchmark datasets,” thereby drawing the connection to performance-measuring benchmarks. Stanczak and Augenstein [
119] called them “probing datasets” while Sun et al. [
124] referred to “Gender Bias Evaluation Test sets.” Bowman [
20] specifically mentioned adversarial datasets, for which annotators were asked to create cases that make a model fail.
Two challenge datasets to assess social biases in LLMs are
CrowS-Pairs [
91] and
StereoSet [
90].
CrowS-Pairs consists of minimal sentence pairs, one of which contains a stereotype and one of which does not. By comparing the likelihood that a given language model assigns to each sentence, social biases can be assessed.
StereoSet is slightly more comprehensive; it features both intra- and inter-sentence settings. In the intra-sentence setting, sentences contain a gap and multiple candidate fillers: a stereotypical, an anti-stereotypical, and an unrelated one. Measuring the likelihood of each of these fillers can provide an indication of model bias. The inter-sentence setting is structured similarly, except that the likelihood of three possible sentence continuations is measured instead. Although StereoSet is more comprehensive, the authors of CrowS-Pairs noted that it has a lower annotator validation rate than CrowS-Pairs [91].
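A simplified sketch of the likelihood comparison underlying such minimal-pair datasets is given below: each sentence is scored with a pseudo-log-likelihood obtained by masking one token at a time, and the model is said to prefer whichever sentence scores higher. The original CrowS-Pairs metric conditions only on the tokens shared by both sentences [91]; for brevity, this version scores all tokens, and the example pair is illustrative.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    """Sum of log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[ids[i]].item()
    return total

stereo = "Women are bad at math."          # illustrative minimal pair
anti = "Men are bad at math."
prefers_stereotype = pseudo_log_likelihood(stereo) > pseudo_log_likelihood(anti)
```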
3.1.2 Task-specific Metrics.
In this section, we present research on detecting gender bias in specific NLP tasks. While task-agnostic metrics are mostly architecture-dependent (e.g., suitable for masked and/or causal language models), task-specific metrics are less tied to a specific architecture, because different model architectures can be used for the same task. Instead, methodologies are often dependent on challenge datasets, which allow researchers to test model performance with regard to a protected variable, gender in our case. Therefore, in addition to the discussion of several works on task-specific gender bias, we provide a non-exhaustive overview of these datasets in Table
3.
Coreference Resolution. Challenge datasets for detecting gender bias exist for a broad variety of downstream NLP applications. For example, the
WinoBias dataset [
141] targets pronoun resolution in pro- and anti-stereotypical sentences, such as “The
physician hired the
secretary because
he was highly recommended” [
141]. Similar datasets, which also target binary gender bias in coreference resolution, are
WinoGender [
100] and
GAP [
136]. Cao and Daumé [
27] then developed the
GICoref dataset, which contains challenging coreference cases with non-binary pronouns and neopronouns, as well as gender-fluid cases, in which pronoun use changes while still referring to the same person.
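Evaluation on datasets such as WinoBias boils down to comparing a system's accuracy on pro-stereotypical and anti-stereotypical instances; a sketch of that comparison follows, with the prediction format being an assumption for illustration.

```python
def winobias_style_gap(examples):
    """Accuracy on pro-stereotypical minus accuracy on anti-stereotypical
    coreference examples. Each example is a dict with keys
    'condition' ('pro' or 'anti') and 'correct' (bool) -- an assumed format."""
    def accuracy(condition):
        subset = [e for e in examples if e["condition"] == condition]
        return sum(e["correct"] for e in subset) / len(subset)
    return accuracy("pro") - accuracy("anti")

# Toy predictions: a large positive gap suggests the system relies on
# occupational gender stereotypes rather than syntactic cues.
examples = [
    {"condition": "pro", "correct": True},
    {"condition": "pro", "correct": True},
    {"condition": "anti", "correct": False},
    {"condition": "anti", "correct": True},
]
gap = winobias_style_gap(examples)   # 1.0 - 0.5 = 0.5
```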
Occupation Classification. De-Arteaga et al. [
36] filtered almost 400,000 professional biographies from the Common Crawl Corpus and used this dataset to assess gender bias in occupation classification. They measured gender bias using the difference in true positive rate for male and female biographies per occupation in two settings: (1) with the gender markers contained in the biographies left as is, or (2) with these markers “scrubbed.” De-Arteaga et al. [
36] found a significant gender gap that was correlated with statistics of gender participation in the workforce. When “scrubbing” gender information, the gender gap was reduced, but the accuracy of the classifier remained stable.
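The core measure here is a per-occupation true positive rate gap between genders; a minimal sketch of that computation is shown below, with the column names of the predictions table being assumptions.

```python
import pandas as pd

def tpr_gap_per_occupation(df):
    """True positive rate gap (male - female) per occupation.
    Assumes columns: 'occupation' (gold label), 'predicted' (model label),
    and 'gender' ('M' or 'F')."""
    gaps = {}
    for occ, group in df.groupby("occupation"):
        tpr = {}
        for g in ("M", "F"):
            sub = group[group["gender"] == g]
            tpr[g] = (sub["predicted"] == occ).mean() if len(sub) else float("nan")
        gaps[occ] = tpr["M"] - tpr["F"]
    return pd.Series(gaps)

# Usage: gaps = tpr_gap_per_occupation(bios_predictions_df)
```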
Sentiment Analysis. Among the first to study biases in
sentiment analysis (SA) systems were Kiritchenko and Mohammad [
67]. They created the
Equity Evaluation Corpus (EEC) that consists of sentences designed to target race and gender biases within an SA system. Using this corpus to evaluate 219 openly available SA systems, the authors found that around three quarters of the systems consistently attribute higher sentiment intensity to identity terms related to historically disadvantaged protected groups, such as women and Black people [
67]. Based on this approach, Bhaskaran and Bhallamudi [
14] created another EEC that is designed to expose gendered occupational stereotypes in SA systems. They found differing sentiment scores for male and female identity terms as well as more negative sentiment toward lower-earning versus higher-earning jobs. Addressing limitations posed by handcrafted templates, Asyrofi et al. [
5] created
BiasFinder, a system that automatically generates templates that differ only in identity terms of the same protected group, and for which different transformer-based SA systems predict differing sentiment. They called these sentence templates
“bias-uncovering test sets” (BTC). On average, around 8,000 of these were found per SA system in an IMDB movie review dataset [
79] and 24,000 in the Twitter
Sentiment140 dataset [
1].
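An EEC-style evaluation compares the scores an SA system assigns to sentence pairs that differ only in the identity term; the sketch below computes the average paired difference, with `sentiment_score` as a stand-in for any SA system and the template pair as illustrative data.

```python
import statistics

def eec_paired_difference(pairs, sentiment_score):
    """Mean difference in predicted sentiment intensity between the
    female-term and male-term variant of otherwise identical sentences.
    `pairs` is a list of (female_sentence, male_sentence) tuples and
    `sentiment_score` is any function mapping a sentence to a float."""
    diffs = [sentiment_score(f) - sentiment_score(m) for f, m in pairs]
    return statistics.mean(diffs)

# Illustrative template pair in the spirit of the EEC.
pairs = [("She feels angry.", "He feels angry."),
         ("My sister made me feel happy.", "My brother made me feel happy.")]
# delta = eec_paired_difference(pairs, my_sa_system)
```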
Machine Translation. Another application of NLP for which bias is measured is
machine translation (MT). Gender bias in MT becomes evident, for example, when translating from a non-gender-marking language into a gender-marking language, where the choice of grammatical gender for an originally gender-neutral word is based on stereotypes or societal gender roles. For example, the phrase “The doctor and the nurse” would be translated into German as “Der Arzt (masc.) und die Krankenschwester (fem.).” While it could be argued that this translation simply reflects the relative numbers of men and women in the respective professions, Prates et al. [
96] established that Google Translate, for instance, has a tendency to create male-default translations and to over-amplify men’s participation in STEM fields. In a similar study, Cho et al. [
31] showed a comparable male skew for gender-neutral pronoun translation from Korean to English. Besides using occupation words, they also demonstrated this skew for gender-neutral pronoun translation in formal/informal contexts and in contexts containing words that carry positive or negative sentiment. In addition, Stanovsky et al. [
120] illustrated MT systems’ tendency to ignore morpho-syntactic contextual cues in coreference resolution settings in favour of stereotype information.
However, it should also be noted that, with growing public pressure and academic research pointing to biases in MT systems, there have recently been some positive developments, such as Google providing several possible translations for words with ambiguous gender [
70,
104]. Besides problems in the translations of gender-neutral (pro)nouns, another form of gender bias recorded for MT is stylistic bias. Hovy et al. [
60] found that, due to the demographic skew in training data, automatic translations made users sound older and “more male.” Moreover, making gender information for first-person narration salient in non-gender-marking source languages improved the translation of women’s voices into gender-marking target languages [
130]. In addition, there exist several challenge or benchmark datasets to assess gender bias for different MT systems, such as
WinoMT [
101,
120], the occupations test set [
44], the Arabic Parallel Gender Corpus [
55] and
MuST-SHE [
12].
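A rough sketch of the occupation-based probing used in studies such as Prates et al. [96] is given below: gender-neutral source sentences are translated and the grammatical gender of the translated occupation noun is inspected. The `translate` function and the article-based gender heuristic for German are assumptions for illustration only.

```python
def male_default_rate(occupations, translate):
    """Share of gender-neutral English sentences whose German translation
    uses a masculine article for the occupation noun. `translate` is a
    placeholder for any MT system; detecting gender via the article
    'der'/'die' is a crude heuristic that only covers nominative
    singular noun phrases."""
    male = 0
    for occ in occupations:
        target = translate(f"The {occ} finished the work.")  # neutral source
        first_word = target.strip().split()[0].lower()
        if first_word == "der":          # masculine article
            male += 1
    return male / len(occupations)

# Usage: rate = male_default_rate(["doctor", "nurse", "engineer"], my_mt_system)
```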
3.1.3 Bias beyond Binary Gender.
Gender bias detection and mitigation efforts in NLP have, until this point, mostly employed a binary conceptualisation of gender, meaning that these works concentrated on equality in representation and quality of service for the male and female genders only [
38,
39]. The inclusion of genders beyond binary male and female was either not mentioned at all [
71,
85], mentioned only when discussing future work [
142], limitations [
8,
41], or mentioned as an issue that the authors were aware of but could not address, because it would complicate the experimental setup [
29] or because the work built on prior work with a binary conceptualisation of gender [
126].
Devinney et al. [
39] presented a two-round survey of conceptualisations of gender in NLP research in 2020 and 2021 and found that, while awareness of and inclusion of non-binary gender models are increasing, more than half of all research surveyed still subscribed to the binary “folk” model of gender, according to which there are only two immutable categories of gender,
male and
female. They found that most works lacked explicit definitions of the conceptualisation of gender being applied and of how gender was implemented in experiments. In line with this, they also found that social gender and linguistic gender are often conflated. Moreover, few works address intersectional aspects in connection with gender, such as race or socioeconomic status. Devinney et al. [
39] recommended that future publications explicitly define gender using appropriate and respectful language, subsequently select a method in line with the chosen definition of gender, and finally ground the work in feminist research.
Dev et al. [
38] moreover explored harms and challenges, in the context of NLP, related to the exclusion or misrepresentation of non-binary gender identities. Because non-binary gender identities (
non-binary,
agender,
genderqueer, etc.) are not always recognised and not well-understood in large swaths of public discourse, training datasets for language technology reflect this lack of (accurate) representation. This data deficiency leads to language models, which currently function as the basis for most state-of-the-art language technology, creating “meaningless, unstable representations” for words used to express non-binary gender [
38]. For example, the neopronouns
xe and
ze are treated as out-of-vocabulary tokens by BERT [
40]. As a result, downstream applications such as machine translation or coreference resolution systems are likely to fail at resolving neopronouns and other language for expressing non-binary gender identities, which leads to the misgendering and/or erasure of non-binary genders [
38]. For future work, the authors mentioned two primary challenges: the need for more real-world data on neopronoun use and a move away from a tripartite view of social gender as
male/female/gender-neutral toward a more open conceptualisation of gender that accounts for its fluidity [
38].
3.1.4 Limitations.
While measures for the assessment of bias have led to an awareness of how models integrate and emphasise existing biases in the data, a definitive bias measure that works reliably, especially in the context of large-scale language models, does not yet exist.
Aribandi et al. [
4] tested the
SEAT [
85],
CrowS-Pairs [
91], and
StereoSet [
90] on three BERT models with different random seeds. They found that, while the performance of the models remained stable, predictions of the stereotypical categories in
StereoSet and
CrowS-Pairs, as well as statistical significance of the
SEATs, appeared to be erratic (i.e., heavily influenced by the configuration of an individual model). In addition to inconsistencies in their application, Blodgett et al. [
18] moreover found a variety of pitfalls in four bias-measuring datasets themselves. They analysed
StereoSet and
CrowS-Pairs, as well as
WinoGender and
WinoBias, all of which contain contrastive pairs meant to measure a model’s performance on stereotyped versus non- or anti-stereotyped examples. Within these examples, the researchers found numerous inconsistencies in the operationalisation of stereotyping, with some examples being non-meaningful, misaligned, or irrelevant, or containing offensive language in place of a stereotype. Blodgett et al. [
18] also criticised that all of the analysed datasets lacked a clear conceptualisation of stereotyping, despite stereotyping being their main focus.
Furthermore, as mentioned earlier, it should not be assumed that, when task-agnostic gender bias can be measured for an intermediate representation, such as a word embedding model or language model, this will necessarily translate to task-specific bias in the downstream application [
48]. Similarly, different bias measures might not necessarily correlate for the same model [
37].
Overall, these works show that, while being able to measure problematic behaviour in models is important, it is equally important to carefully construct measures [
18] that remain robust to different configurations of the same model, take model uncertainty into account [
4], and illustrate the influences on downstream applications [
37].