Measuring and Mitigating Biases
in Motor Insurance Pricing

Mulah Moriah^1,2, Franck Vermet^1,2 & Arthur Charpentier³

(¹Euro-Institut d’Actuariat Jean Dieudonné (EURIA)
²Univ Brest, CNRS, UMR 6205, Laboratoire de Mathématiques de Bretagne Atlantique, France
³Université du Québec à Montréal )

Abstract

The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, considering various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender. Or mutualist groups in accordance with respective corporate strategies can implement age-based premium fairness. In certain insurance domains, the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any fairness biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance. Results show that fairness bias can be found in historical data and models, and that fairer outcomes can be obtained by more fairness-aware approaches.

Keywords: Machine learning, Fairness, Pricing, Non-life insurance, Discrimination

1 Introduction

1.1 Motivation

The insurance industry is characterized by an inherent reversal in its production cycle, where insurers request a fixed premium at the time of policy subscription in exchange for coverage against uncertain risks in terms of both occurrence and magnitude. This inversion underscores the statistical nature that envelops the pricing of insurance, necessitating adherence to statistical theory for the estimation and coverage of random events. Beyond these statistical considerations, insurance premiums represent the equitable price for insurance services, encompassing a multitude of strategic challenges. The insurance market is increasingly competitive, comprising established incumbents and more agile newcomers striving to gain market share. This competitive landscape is further catalyzed by continuously evolving regulations aimed at fostering competition among industry players.

Hence, participants in the insurance market must offer competitive pricing strategies that align with their corporate strategies and communication, while also employing tools tailored to diverse distribution processes.

Recent years have witnessed the widespread adoption of machine learning algorithms, neural networks, and the utilization of vast datasets. This adoption is attributable to scientific advancements, increased computational power, enhanced accessibility to technology, and the proliferation of data. These emerging technologies have made significant inroads into various business sectors, including the insurance industry.

These data and algorithms serve as decision support tools, aiding in policyholder segmentation, risk comprehension, and the consideration of various factors associated with it. Therefore, industry stakeholders must incorporate these new elements to maintain their competitiveness. Nevertheless, the extensive use of massive data and intricate algorithms has brought the issue of transparency to the forefront. It is imperative that premiums and decisions are both explainable and fair, given the high stakes involved. This is not simply about algorithms assisting in trivial choices like movie selection; rather, it involves algorithms determining the cost of access to insurance services for various population segments. Concerns regarding fairness and ethics have been integral to our societies and philosophies for centuries. Although subject to interpretation, fairness can be defined as the ability to place individuals on an equal footing while acknowledging the differences that exist among them.

In response to these considerations, regulatory measures such as the gender directive have been implemented to promote fairness. Reforms related to access to borrower insurance also represent a form of fairness enforcement. Consequently, insurance industry participants may need to construct fairer premiums with regard to so-called sensitive or protected variables. These fairness constraints can emanate from regulatory requirements, as seen in the case of gender, or be driven by commercial and strategic objectives, such as those related to age in specific companies.

In the context of this study, bias is understood as a form of discrimination, signifying the undesirable impact of a sensitive variable on a variable of interest. For instance, this could include the effect of gender on insurance premiums within the framework established by the gender directive (of the European Union (2008)). Since 2016, numerous research studies (Angwin et al. (2016); Chouldechova (2017))have identified instances of discrimination in decision-making tools across various domains, such as the risk assessment tool for recidivism in the United States, Google Images algorithms, Amazon’s application processing algorithms, and lending and financing algorithms, among others. Historically, the solution has often been to circumvent the issue rather than address it from an ethical perspective.

One simplified scenario that can be examined involves the use of a gender-correlated variable to determine premiums. In this case, gender is not explicitly factored into the estimates, resulting in fairness by omission. However, the presence of a correlation between gender and the non-sensitive explanatory variable leads to an indirect effect of gender on the estimated premiums. Even though gender is seemingly reprocessed or removed from the model, its association with other variables allows its influence on the target variable to persist. This is because adjustments only account for the direct impact of sensitive variables and not their indirect effects, despite these variables exerting a significant influence on the distributions of other explanatory variables. Factors such as age, gender, and disability influence choices related to activities, risk tolerance, product preferences, and so on. Consequently, it is imperative to first establish the means to define and measure the fairness of the constructed models and subsequently mitigate this bias through fairness-aware approaches.

1.2 Agenda

The purpose of this article is to provide actuaries with tools to understand, measure and mitigate the unwanted effect of a sensitive variable in a pricing problem.In Section 2, we will discuss fairness notions and statistical measures of fairness. The bigger picture will be shown while focusing on important elements for pricing. Then, in Section 3, we will present methods that can be used to reduce fairness bias in pricing models. Finally, in Sections 4 and 5, we present the results of fairness implementation on a real car insurance pricing, respectively with measures of biases (in Section 4) and a description of the impact of mitigation (in Section 5). Note that all implementations are made in Python 3.8.

1.3 Notations

Throughout this document, the variable denoted as $Y$ is the target variable we aim to predict. This variable can take on either categorical or quantitative values (discrete or continuous, but positive). Given the primary focus on pricing, $Y$ predominantly assumes a quantitative nature. However, for the sake of illustration, it might be temporarily treated as a binary variable. In the context of addressing this supervised machine learning problem, we introduce an algorithm denoted as $m$ , a set of non-sensitive features represented by the vector $\boldsymbol{X}=(X_{1},X_{2},\dots,X_{p})$ , and sensitive attributes denoted as $\boldsymbol{S}$ . The vector $\boldsymbol{S}$ comprises variables that are intended to have no influence on the models, either intentionally or inadvertently. These variables are typically discrete, particularly in the context of examining fairness in pricing, and may explicitly pertain to attributes such as gender. If a single sensitive attribute is considered, as it will mostly be the case in this document, notation $S$ will be used (see Hu et al. (2023) for a detailed discussion about multiple sensitive attributes). Furthermore, we assume that $S$ is a binary sensitive variable, with possible values of $0$ and $1$ , which may correspond to the gender of the policyholder.

Each data point in our dataset, which can represent individuals, contracts, or claims, is identified as $(y_{i},\boldsymbol{x}_{i},s_{i})$ , with $i$ ranging from $1$ to $n$ , and $n_{0}$ and $n_{1}$ signifying the numbers of observations where $s_{i}=0$ and $s_{i}=1,$ respectively. Note that lowercase letter variables denote observations, whereas uppercase letter variables denote random variables.

We introduce $\widehat{y}=\widehat{m}(\boldsymbol{x})$ as the predictions generated by an "unfair model" (or "unaware model", as defined by Dwork et al. (2011, 2012)), and $\widetilde{y}=\widetilde{m}(\boldsymbol{x})$ as the predictions generated by a fair model. $\widetilde{m}$ is a model that takes into account defined fairness constraints to minimize the effects of $s$ on $y$ . As studied in section 3, these constraints can take the form of criteria for variable selection (preprocessing constraints), penalized learning (in-processing approach) and output correction (post-processing approach). We can then define individual fairness bias, denoted as $\varepsilon$ , as the difference between these two predictions, i.e., $\varepsilon=\widehat{y}-\widetilde{y}$ .

We also introduce the function $\mathcal{V}^{s}(\boldsymbol{x}_{i})$ , which represents the $k$ -nearest neighbors in the group $S=s$ associated with the $i$ -th individual.

2 Measuring fairness and biases

While economists have engaged in discussions on discrimination for several decades, as evidenced in works such as Edgeworth (1922), Becker (1957), and Phelps (1972), recent contributions from the field of computer science have sought to formalize key fairness concepts and the notion of non-discrimination with respect to specific sensitive attributes denoted as "S". These discussions primarily pertain to various classifiers denoted as "m", and are exemplified in works such as Dwork et al. (2012), Hardt et al. (2016), Berk et al. (2017), and Corbett-Davies et al. (2017). For an overview of state-of-the-art developments in insurance models, please refer to Charpentier (2024).

As the literature in this area is relatively recent, references and consensus have not yet fully coalesced, as noted by Angwin et al. (2016): "The rapid growth of this emerging field has led to highly inconsistent motivations, terminologies, and notations, posing a significant challenge in cataloging and comparing definitions." Furthermore, Castelnovo et al. (2022) describe the multiplicity of fairness definitions as a "zoo of definitions," remarking, "The researcher or practitioner approaching this facet of machine learning for the first time can easily feel confused and somewhat lost in this maze of definitions. These various definitions capture different facets of the fairness concept, but to the best of our knowledge, a comprehensive understanding of the broader landscape where these measures reside remains elusive."

Fairness is intuitively understood as the absence of any association between the sensitive variable and the variable of interest. Fair predictions $\widetilde{y}$ will then be independent of $S$ . Indeed, independence between these variables implies the absence of any direct or indirect relationships, thereby precluding the existence of fairness bias. However, this is just one facet of the fairness concept.

In this section, we will introduce the two primary types of fairness, namely group fairness and individual fairness, with group fairness being the more prevalent of the two.

2.1 Group or Statistical Fairness

Group fairness necessitates equality of treatment among groups based on the sensitive variables. For that, statistical properties are specified using the model $m$ and $\widehat{Y}$ to compare each group. The overarching objective is to ensure that individuals from both privileged and unprivileged groups are treated equitably in terms of the specified statistical properties.

Fairness was initially introduced using conditional probabilities in the context of binary classification (see EEOC (1979)).This concept can be extended to scenarios where the variable $Y$ is non-binary, incorporating moment properties, a weak version, or distribution properties, a strong version. These can be linked to correlation properties and independence conditions, respectively.

Independence (demographic parity):

States that predictions ( $\widehat{Y}$ ) should not rely on the sensitive variable $S$ . It emphasizes that predictions must be independent of $S$ :

\widehat{Y}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2% .0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2% .0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2% .0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss% }\mkern 2.0mu{\scriptscriptstyle\perp}}}S.

This principle highlights the need for predictions to be unrelated to $S$ , without mentioning the target variable ( $Y$ ). In practical scenarios like insurance pricing, if $S$ is age and $Y$ is premium, it suggests premiums should be consistent across age groups. However, enforcing this fairness principle may seem counterintuitive if certain age groups pose higher risks, implying less restrictive models to ensure consistent premiums regardless of age. This approach may contradict traditional fairness perceptions, which prioritize equal treatment regardless of sensitive attributes. This fairness principle becomes crucial when there’s concern about unfair information in the target variable ( $Y$ ), especially due to historical bias in the data, reflecting past unfair behaviors or decisions. In such cases, emphasizing fairness regarding $S$ may be necessary for corrective measures.

Separation (equalized odds):

states that predictions ( $\widehat{Y}$ ) should be independent of the sensitive variable ( $S$ ) when the true value of the target variable ( $Y$ ) is known. This means that once the actual outcome ( $Y$ ) is revealed, the predictions should not be influenced by the sensitive attribute ( $S$ ):

\widehat{Y}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2% .0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2% .0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2% .0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss% }\mkern 2.0mu{\scriptscriptstyle\perp}}}S\leavevmode\nobreak\ |\leavevmode% \nobreak\ Y.

In the context of separation, any disparities in treatment between groups based on the sensitive attribute ( $S$ ) must be justifiable by the actual value of the target variable ( $Y$ ). For example, in a scenario involving premium and age, premiums may vary for each age group, based on risk factors independent of age. This approach reduces the influence of the sensitive attribute ( $S$ ) while preserving valuable information in the target variable ( $Y$ ). However, it’s only feasible when $Y$ isn’t affected by historical bias. If unfair information is present in $Y$ , it will easily affect the predictions $\widehat{Y}$ .

Sufficiency (predictive parity):

examines fairness regarding the target variable ( $Y$ ). It aims for independence between $Y$ and the sensitive attribute ( $S$ ), given the predictions ( $\widehat{Y}$ ):

Y\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{% \displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0% mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.% 0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}% \mkern 2.0mu{\scriptscriptstyle\perp}}}S\leavevmode\nobreak\ |\leavevmode% \nobreak\ \widehat{Y}.

This approach doesn’t require knowledge of the true $Y$ for unseen individuals, addressing fairness directly. In feature selection and modeling, $Y$ often faces selection issues due to observed data limitations. By focusing on $\widehat{Y}$ as a starting point, this approach helps mitigate such issues associated with $Y$ .

Different scopes of group fairness are inherently incompatible. Studies show simultaneous fulfillment of multiple fairness criteria is challenging, except in special cases. References like Chouldechova (2017), Kleinberg et al. (2016), Berk et al. (2018), and Charpentier (2024) explore this. This understanding leads to the introduction of the following definition of statistical fairness.

Definition 1 (Statistical (Group) Fairness).

Let $\widehat{y}={m}(\boldsymbol{x})$ . A model $m$ is classified as strongly fair if it satisfies the following conditions:

\begin{cases}\text{demographic parity}:&\leavevmode\nobreak\ (\widehat{Y}|S=s)% \overset{\mathcal{L}}{=}\widehat{Y},\leavevmode\nobreak\ \forall s\\ \text{equalized odds}:&\leavevmode\nobreak\ (\widehat{Y}|S=s,Y)\overset{% \mathcal{L}}{=}(\widehat{Y}|Y),\leavevmode\nobreak\ \forall s\\ \text{predictive parity}:&\leavevmode\nobreak\ (Y|S=s,\widehat{Y})\overset{% \mathcal{L}}{=}(Y|\widehat{Y}),\leavevmode\nobreak\ \forall s\end{cases}

while $m$ is considered weakly fair if it fulfills the following conditions:

\begin{cases}\text{demographic parity}:&\leavevmode\nobreak\ \mathbb{E}[% \widehat{Y}|S=s]=\mathbb{E}[\widehat{Y}],\leavevmode\nobreak\ \forall s\\ \text{equalized odds}:&\leavevmode\nobreak\ \mathbb{E}[\widehat{Y}|S=s,Y]=% \mathbb{E}[\widehat{Y}|Y],\leavevmode\nobreak\ \forall s\\ \text{predictive parity}:&\leavevmode\nobreak\ \mathbb{E}[Y|S=s,\widehat{Y}]=% \mathbb{E}[Y|\widehat{Y}],\leavevmode\nobreak\ \forall s\end{cases}

These definitions provide a framework for categorizing models as either strongly fair or weakly fair based on their compliance with different fairness criteria, namely demographic parity, equalized odds, and predictive parity.

2.2 Individual fairness

Individual fairness, also known as similarity-based fairness, asserts that similar individuals should receive similar predictions. Unlike group fairness, which aggregates outcomes at the group level, individual fairness compares individuals directly. This concept is formalized as "disparate treatment":

Definition 2 (Disparate Treatment).

A model $m$ exhibits disparate treatment if, given the explanatory variables $\boldsymbol{X}$ , the predictions $\widehat{Y}$ and the sensitive attribute $S$ are dependent. To achieve fairness:

\widehat{Y}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2% .0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2% .0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2% .0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss% }\mkern 2.0mu{\scriptscriptstyle\perp}}}S|\boldsymbol{X}.

Individual fairness and disparate treatment embody an intuitive conception of fairness, seen in scenarios like car insurance pricing, where identical individuals with different sensitive attribute values receive the same predictions. However, this approach overlooks relationships between explanatory variables ( $\boldsymbol{X}$ ) and $S$ , impacting fairness enforcement.

The challenge of individual fairness arises from sensitive variables significantly affecting other variables’ distribution. Attributes like gender, age, disabilities, ethnicity, and illnesses shape habits, preferences, and behaviors, complicating defining "similar individuals."

To address these questions, Dwork et al. (2011) introduced the Lipschitz constraint concept:

d_{Y}(\widehat{y}_{i},\widehat{y}_{j})<\lambda d_{\boldsymbol{X}}(\boldsymbol{% x}_{i},\boldsymbol{x}_{j}),

Here, $d_{Y}$ and $d_{\boldsymbol{X}}$ measure distances in target and explanatory variable spaces, respectively. Similarity in the $Y$ space, like between premiums, can be defined using metrics such as $|\widehat{y}_{i}-\widehat{y}_{j}|$ or $(\widehat{y}_{i}-\widehat{y}_{j})^{2}$ . The focus is on defining individuals as similar when they have different values of the sensitive attribute ( $S$ ). And, disparate treatment will mean that regardless of the value of $S$ , if two individuals are similar on $\boldsymbol{X}$ , they must be similar on $Y$ and thus have the same premium.

Despite its intuitive appeal, individual fairness is less frequently used than group fairness, largely due to addressing various interactions between variables’ complexity. Causality-based criteria may offer a promising avenue, but practical implementation remains challenging. These causal models are not discussed here (for more information, refer to Kusner et al. (2017), Alycia and Wu (2022), and Galles and Pearl (1998)).

2.3 Quantifying unfairness

In light of the definitions provided earlier, it is imperative that fairness metrics possess the ability to quantify full/conditioned independence in the context of group fairness and also discern discrepancies in individual predictions, contingent upon the proximity of individuals in the case of individual fairness.

2.3.1 From the binary to the general case

In the case where both the target variable ( $y$ ) and the sensitive attribute ( $s$ ) are binary variables, metrics can be defined using the confusion matrix of $\widehat{y}$ and $s$ . For instance, the disparate impact metric can be derived as:

\frac{\mathbb{P}(\widehat{Y}=1|S=1)}{\mathbb{P}(\widehat{Y}=1|S=0)}.

In the context of demographic parity, when this metric is closer to one, it indicates a fairer model. Similarly, another fairness metric, denoted as $M$ , is defined as:

	$\displaystyle M_{1}$	$\displaystyle=$	$\displaystyle\|{\mathbb{P}(\widehat{Y}=1\|S=1)}-{\mathbb{P}(\widehat{Y}=1\|S=0)}\|$
		$\displaystyle=$	$\displaystyle\|{\mathbb{P}(\widehat{Y}=0\|S=1)}-{\mathbb{P}(\widehat{Y}=0\|S=0)}\|% \leavevmode\nobreak\ =\leavevmode\nobreak\ M_{0}.$

In the case of demographic parity, the closer the value of $M_{1}$ (or $M_{0}$ ) is to zero, the fairer the model.

To address separation and sufficiency, a confusion matrix involving $\widehat{Y}$ , $Y$ , and $S$ can be constructed, enabling the definition of metrics such as true positive and false negative rates. To measure fairness in accordance with the equalized odds definition, the disparate mistreatment can be calculated as follows:

\begin{cases}M_{1|0}=|\mathbb{P}(\widehat{Y}=1|Y=0,S=1)-\mathbb{P}(\widehat{Y}% =1|Y=0,S=0)|,\\ M_{0|1}=|\mathbb{P}(\widehat{Y}=0|Y=1,S=1)-\mathbb{P}(\widehat{Y}=0|Y=1,S=0)|.% \end{cases}

In the context of equalized odds, the closer these values are to zero, the fairer the model is considered to be.

In a more general context involving non-binary variables, various correlation-based metrics, such as Kendall’s tau, Pearson’s correlation, and Spearman’s rho, can be employed to assess dependencies between variables. However, in the pursuit of capturing all forms of dependence in the relationship between $\widehat{Y}$ and $S$ , the Hirschfeld-Gebelein-Renyi ( $\mathrm{HGR}$ ) maximal correlation measure appears to outperform other methods.

The $\mathrm{HGR}$ maximal correlation, originally introduced by Hirschfeld (1935) and further developed by Gebelein (1941) and Rényi (1959), offers the ability to quantify both linear and nonlinear relationships while adhering to essential properties for a reliable measure of dependency, as established by Rényi (1959). It enjoys broad acceptance within the field of statistics. For two continuous or discrete variables, the $\mathrm{HGR}$ maximal correlation is precisely equal to zero if and only if the variables are independent. This makes it a valuable tool for detecting and quantifying dependencies between variables.

Definition 3 ( $\mathrm{HGR}$ ).

For two random variables $U$ and $V$ respectively with values in $\mathcal{U}$ and $\mathcal{V}$ ,

\mathrm{HGR}(U,V)=\max_{f\in\mathcal{F}_{\mathcal{U}},\ g\in\mathcal{G}_{% \mathcal{V}}}\mathbb{E}[f(U)g(V)],

where

\begin{cases}\mathcal{F}_{\mathcal{U}}=\{f:\mathcal{U}\to\mathbb{R}:\mathbb{E}% [f(U)]=0\text{ and }\mathbb{E}[f^{2}(U)]=1\}\\ \mathcal{G}_{\mathcal{V}}=\{g:\mathcal{V}\to\mathbb{R}:\mathbb{E}[g(V)]=0\text% { and }\mathbb{E}[g^{2}(V)]=1\}\\ \end{cases}

In our case, $U$ will be the sensitive attribute, and $V$ the premium (or one of its components).

In the realm of fairness quantification, Mary et al. (2019) proposed the use of the $\mathrm{HGR}$ maximal correlation measure, and Grary et al. (2022) explored its application in the insurance domain. However, obtaining the exact value of this metric is not straightforward, as it is defined in infinite-dimensional function spaces. Several authors have used Witsenhausen’s linear algebra characterization (see Witsenhausen (1975)) to compute the HGR. This measure can also be related to Kernel Based Nonlinear Canonical Analysis Darolles et al. (2004). Therefore, Mary et al. (2019) combine kernel density estimation (KDE) (specifically a gaussian kernel) to these characterizations to obtain an approximation of the HGR, called HGR_KDE. Importantly, this estimator has been demonstrated to preserve the fundamental properties of the original $\mathrm{HGR}$ measure and delivers strong performance. Therefore, in this paper, one of the metrics utilized is the $\mathrm{HGR}$ maximal correlation, and we employ the implementation through HGR_KDE, as recommended by its authors.

2.3.2 Focus on the case of pricing

In pricing, where the target variable $Y$ is continuous, confusion matrix metrics are irrelevant. Various correlation and distribution-based metrics related to the sensitive attribute $S$ become relevant. Here, we define $S$ as a binary variable representing gender.

One commonly used metric is Kendall’s tau. Additionally, $\mathrm{HGR}$ estimation is valuable and applicable to all variables.

Probabilistic distances and divergences can be computed on conditional distributions $Y|S$ . For instance, Kullback–Leibler’s divergence (KL) assesses disparities between premiums for men and women.

The $p$ -value from the Kolmogorov-Smirnov (KS) test is employed for fairness evaluation. This test complements divergences, being particularly sensitive to extreme values. These metrics are implemented using the Python package Scipy. The KL divergence is calculated using relative entropy, following the approach proposed in Boyd and Vandenberghe (2004).

The concept of individual fairness with continuous variables has received less attention. Noteworthy contributions include Dwork et al. (2011), who proposed using $k$ -nearest-neighbors for proximity assessment.

Adaptation flip-test

Inspired by Black et al. (2020) and Dwork et al. (2011), we adapt a metric for continuous variables to define individual fairness. We assume a distance metric or algorithm effectively measures proximity between individuals.

•

Select individuals within group $s_{i}=1$ .
•

Find their $k$ nearest neighbors of the opposite gender.

•

Calculate differences between their predictions $\widehat{y}_{i}$ and the average predictions of their neighbors $\mathcal{V}^{0}(\boldsymbol{x}_{i})$ :

\Delta_{\mathcal{V}}^{i}=\widehat{y}_{i}-\frac{1}{k}\sum_{x_{j}\in\mathcal{V}^% {0}(\boldsymbol{x}_{i})}\widehat{y}_{j}.

•

Average these distances :

\tilde{FT}_{1}=\frac{1}{n_{1}}\sum^{n_{1}}_{i=1}\Delta_{\mathcal{V}}^{i}.

This process is repeated starting with $s_{i}=0$ to obtain $\tilde{FT}_{0}$ . A value closer to zero indicates fairer model.

However, it’s crucial to note that the quality of the proximity metric provided by the $k$ -nearest-neighbors model is the primary limitation. To address this, we conduct hyperparameter tuning and variable selection using a grid search approach on our model. This optimization involves selecting an appropriate distance metric and its parameter settings, along with determining the optimal number of neighbors. Additionally, we optimize the number and specific variables used in model construction. The objective is to identify the $k$ -nearest-neighbors model that minimizes individual distance while minimizing differences in premiums. We perform a grid search on these parameters and validate the results using a test dataset. This approach offers a satisfactory solution with significantly lower operational costs compared to a causal study. Minimizing bias and distance helps ensure that the presented bias level isn’t artificially inflated by the model, aligning with the concept of counterfactual fairness discussed in Kusner et al. (2017), De Lara et al. (2021), and more recently Charpentier et al. (2023a).

Once bias is detected, it’s crucial to attempt mitigation while preserving the performance and consistency of the models constructed.

3 Bias mitigation

Fairness bias mitigation involves reducing the unwanted influence of the sensitive variable and the estimated variable of interest in models, data, or results. There is no consensus on bias mitigation approaches, as they are closely tied to specific use cases. Binary $y$ is often preferred in the literature, but in this study, we present methods applicable to regression and adapted for insurance contexts. Mitigation can occur before, during, or after modeling.

Incorporating fairness into the modeling process typically negatively impacts performance. Many papers have observed performance drops, with attempts to quantify these reductions (del Barrio et al., 2020). Therefore, maintaining acceptable performance while enforcing fairness is crucial.

We evaluate model performance using two mean metrics: Root Mean Square Error (RMSE) and losses over premiums ratio (LR):

\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}w_{i}(\widehat{y}_{i}-y_{i})^{2}}% ,\ \mathrm{LR}=\frac{\sum_{i=1}^{n}{y}_{i}}{\sum_{i=1}^{n}\widehat{y}_{i}},

(1)

where $w_{i}$ may represent a weight based on the variable $y$ considered. To balance performance and fairness metrics, we use them to guide optimization and comparison. A scenario is non-dominated when no other model achieves better performance and fairness simultaneously. Dominated scenarios can be surpassed by others in terms of fairness and performance.

Non-dominated scenarios warrant further investigation, and choosing the best dominant scenario depends on decision-maker constraints. Some prioritize fairness attainment, while others focus on maintaining performance levels.

We will present various methods in the following sections and summarize their advantages and disadvantages in Table 1.

3.1 Pre-processing mitigation

These mitigation methods involve data transformations aimed at reducing bias while retaining relevant information. The methods we have implemented include: total removal of variables correlated with the sensitive variable (Section 3.1.1), removal of linear correlations (Section 3.1.2) and an adaptation of the fair-SMOTE method (Section 3.1.3) proposed by Chakraborty et al. (2021). Originally designed for binary cases, we have modified it to suit the needs of bias mitigation in non-life insurance pricing.

3.1.1 Total deletion

This straightforward approach aims to mitigate bias by removing not only the sensitive variable but also variables that are correlated with it to reduce its indirect effects.Several studies have highlighted the potential for non-sensitive variables to perpetuate the effects of sensitive variables in models, as discussed in Lindholm et al. (2022a) and Lindholm et al. (2022b). Starting with explanatory variables $(X_{1},X_{2},\dots,X_{p})$ , the process involves the following steps:

•

Measure the dependency between the sensitive variable, denoted as $S$ , and each of the explanatory variables $X_{j}$ .
•

Based on the identified dependencies, create deletion scenarios, where each scenario represents a modeling instance with specific variables removed. Domain knowledge can also guide the selection of variables to delete.
•

Build models for each scenario and assess their fairness and performance on predictions to identify the best models.

Additionally, scenarios can be automatically generated by setting a maximal dependency threshold and removing all variables with dependencies exceeding that threshold.

3.1.2 Correlation remover

Rather than deleting variables correlated with the sensitive variable, an alternative approach is to transform these variables to reduce bias while retaining some information. This method, suggested in studies like Komiyama and Shimao (2017) and Frees and Huang (2023), involves removing information contained in $\boldsymbol{S}$ from $\boldsymbol{X}$ using regression models. The resulting residuals, denoted as $\boldsymbol{x}^{\perp}$ , are used in place of $\boldsymbol{x}$ , ensuring that $\boldsymbol{x}^{\perp}$ is orthogonal to the sensitive attributes $\boldsymbol{s}$ . Formally, for a matrix of vectors $\boldsymbol{s}$ , each variable $x_{j}$ is transformed as: $x^{\perp}_{j}=x_{j}-\alpha\boldsymbol{s}(\boldsymbol{s}^{\top}\boldsymbol{s})^% {-1}\boldsymbol{s}^{\top}x_{j}$ . $\alpha\in[0,1]$ is an hyperparameter that controls the level of correction applied on $x_{j}$ .

It is important to note that the absence of correlation does not guarantee the absence of statistical dependence, especially in cases involving non-linear transformations of legitimate features. The bias mitigation achieved through this method is limited. However, for linear models like Generalized Linear Models (GLMs), this approach is consistent in breaking all the connections that models can establish.

3.1.3 Fair-SMOTE adaptation

The fair-SMOTE approach differs from previous techniques by introducing synthetic individuals into the dataset instead of altering the existing data. Its primary goal is to ensure equal gender representation, regardless of premium levels, potentially addressing under-representation issues of specific classes. It operates exclusively on the training dataset, leaving the test dataset untouched.

In a study by Chakraborty et al. (2021), sampling methods directly on $Y$ were found to exacerbate bias as they do not consider the sensitive variable. They opted to sample based on $S|Y$ and $Y$ . However, sampling on $Y$ in insurance pricing contexts may compromise unique target variable characteristics and decrease performance. Therefore, we avoid sampling on $Y$ and assess the ramifications in our use case.

To apply this method to continuous target variables, we discretize them, allowing delineation of resampling bins. The number of bins influences proximity to the continuous distribution and reduction of statistical bias from discretization. However, selecting too many bins may result in small sub-populations unfit for consistent simulations, requiring tuning based on target variable distributions. Once bins are designated, distributions of each $S$ modality are harmonized within each bin.

Instead of randomly selecting from the initial set, we introduce subtleties enabling distinct individuals’ generation in specific scenarios. Two hyperparameters, the threshold $st$ and transformation factor $ft$ , are defined and selected within the interval $[0,1]$ . Using a $k$ -nearest neighbor model, two closest individuals, $v_{1}$ and $v_{2}$ , are identified from a randomly chosen individual $p$ . New individuals’ attributes are reconstructed column by column, preserving observed subgroup distributions. The pseudo code for this process is outlined in Algorithm 1 (Appendix A). Continuous $Y$ values for newly generated individuals are calculated based on the algorithm’s quantitative value rule.

Generative adversarial networks (GANs) could have been employed to simulate new individuals in a SMOTE approach or reconstruct $\boldsymbol{X}$ variables by generating similar individuals while minimizing dependence on $S$ .

3.2 In-processing

These approaches involve the inclusion of fairness constraints during the model calibration phase. The exponentiated gradient method, named after the game theory technique upon which it is founded, and its grid search version are both utilized.

3.2.1 Exponentiated gradient

Agarwal et al. (2018) present a bias mitigation approach that focuses on reducing bias within machine learning models. They begin by demonstrating that fairness definitions can be expressed as linear inequality sets of the following form:

M\mu(m)\leq c,

In this representation, $M$ is a matrix, $c$ is a vector, and $\mu$ is a conditional moment vector. The vector $c$ provides the means to control the level at which each constraint is enforced by adjusting the values of $c_{k}$ .

This formulation leads to an optimization problem within the context of statistical learning, defined as follows:

\min_{m\in\mathcal{M}}L(m)\text{ under the constraint that }M\mu(m)\leq c,

where $L$ represents a loss function employed for the evaluation of the models. We can observe that applying constraints to deterministic prediction functions can have a detrimental impact on performance. To mitigate this, they introduce the concept of "random functions," involving the use of a randomized predictor denoted as Q, which is drawn from $\Delta$ , the set encompassing all distributions over $\mathcal{M}$ , for the purpose of making predictions. In this approach, a predictor $m$ , selected from $\mathcal{M}$ , is sampled from $Q$ and subsequently used for prediction. Consequently, the prediction error is defined as $L(Q)=\sum_{m\in\mathcal{M}}Q(m)L(m)$ and the conditional moments are expressed as $\mu(Q)=\sum_{m\in\mathcal{M}}Q(m)\mu(m).$ This adaptation transforms the optimization problem into the following for

\min_{Q\in\Delta}L(Q)\text{ under the constraint that }M\mu(Q)\leq c.

To address this problem, we employ the "exponentiated gradient" algorithm, as recommended in Freund and Schapire (1996). This game-theory-based approach pits the prediction function against the level of compliance with the constraint. The optimum is reached when any alteration of these elements results in a minimal loss of performance and an increase in constraint enforcement.

In theory, this approach has the potential to accommodate various fairness constraints in both binary and continuous cases. However, in practice, it has primarily been applied to the binary case. The implementation of equalized odds or demographic parity in the continuous case remains an unresolved challenge, and it is inconclusive whether this method is suitable for such cases.

A less stringent constraint known as "equality of the expected error" has been incorporated, specifically when $S$ is a discrete variable:

\text{ equality of }\mathbb{E}[\ell(Y,\widehat{Y})|S=s]\text{ across all }s.

This constraint aims to ensure that the model makes errors of similar magnitude, on average, regardless of the value of $S$ , ultimately resulting in predictions of equal quality for different $s$ values. This concept can be reformulated as an inequality:

\mathbb{E}[\ell(Y,\widehat{Y})|S=s]<\zeta,\forall s.

In this formulation, the hyperparameter $\zeta$ controls the acceptable error margin beyond which the constraint may be violated. Additionally, the error from a prediction can be further constrained by introducing the hyperparameter $M$ , leading to the inequality:

\mathbb{E}[\min(\{\ell(Y,\widehat{Y})|S=s\},M)]<\zeta,\forall s.

This approach allows the incorporation of a threshold to limit the extent of deviations considered, particularly when the algorithm encounters convergence challenges.

3.2.2 Grid search approach

In their exploration of bias mitigation, Agarwal et al. (2018) highlight an intriguing possibility when the sensitive variable is binary. In this context, selecting the most suitable hyperparameters becomes crucial, transforming the pursuit of the optimal solution into a challenge of identifying two interconnected parameters through an equation.

However, in the continuous case, a grid search methodology becomes vital for uncovering suboptimal solutions to the inequality mentioned earlier. When the exponentiated gradient method fails to converge for a given value of $\zeta$ or computational time becomes prohibitive, grid search offers an alternative. It systematically explores specific or random regions within the solution space, identifying a solution within a predefined time frame.

3.3 Post-processing

This approach involves transforming model predictions to enhance fairness. For instance, logistic regression allows influencing the model’s behavior with respect to different classes through computed probabilities and decision thresholds without recalibrating predictions.

Historically, post-modeling techniques mainly adjust decision boundaries while considering fairness definitions, but they’re less applicable to the continuous case due to the absence of distinct decision boundaries. Recent suggestions, like using Wasserstein barycenter on scores and the associated transport procedure by Charpentier et al. (2023b), aim to address this gap. However, defining advantageous and disadvantageous premiums is challenging, as it depends on policyholder characteristics and inherent risk. The "individual fair redistribution" approach was introduced within this context.

3.3.1 Fair redistribution

The methodology based on optimal transport, particularly the Wasserstein barycenter concept as utilized by Charpentier et al. (2023b), involves implementing monotone transformations on premiums within distinct sensitive groups while maintaining the relative orderings of premiums within each group. Here, we’ll focus on a uniform premium adjustment within each sensitive group. This approach adapts the flip-test to define bias, dividing premiums into fair shares and biases. The redistributed bias is then allocated to individual data points iteratively with the aim of minimizing bias given the adjusted premiums. In this subsection, we will refer to $Y$ as premium to grasp the intuition behind our approach. The same reasoning can be made with any other continuous pricing outcome.

We introduce $\varepsilon$ to represent a measure of "individual fairness bias", which is defined as the difference between the estimated outcome $\widehat{y}$ and the fair premium $\widetilde{y}$ . This estimation is initially determined by means of the quantity $\Delta_{\mathcal{V}}$ , presented earlier. More specifically, we define $\varepsilon_{i}$ as $\Delta_{\mathcal{V}}^{i}$ , as per the adaptation of the flip-test, which is elaborated upon in Section 2.3.2. But here, instead of completely rectifying the disparity in the premium due to the initially quantified bias with the equation:

\widetilde{y}=\widehat{y}-\varepsilon,

we adopt a strategy of partial fairness correction. This is accomplished by employing the following expression:

\widetilde{y}=\widehat{y}-\frac{\varepsilon}{\eta},

where $\eta$ is an hyperparameter constrained within the interval $[1,\infty)$ , identical within a given group. When $\eta$ takes on a very large value, it corresponds to minimal correction, while as $\eta$ approaches 1, the level of correction is greater. $\eta$ was introduced because we noticed that correcting directly $\widehat{y}$ with $\varepsilon$ didn’t lead to an increase of the fairness level. Individuals that had greater premiums than their neighborhood (as defined by the adapted flip-test) now had smaller premiums and vice versa. By bringing an individual’s premium closer to the average premium of his opposite-gender neighbors, it potentially moves away from the premiums of other individuals whose neighborhood it made up. The conclusions of our first works were that a too brutal correction wasn’t efficient and that it will be necessary to correct both subgroups smoothly and simultaneously. Thus, we decided to introduce an alternative algorithm where for a given level of $\eta$ , we correct slowly and iteratively the two subgroup’s premiums while recalculating at each iteration the new bias level $\varepsilon$ . After the final iteration, we obtain

\widetilde{y}_{i}=\widehat{y}_{i}-\varepsilon_{i}^{final\leavevmode\nobreak\ % bias},

where $\varepsilon_{i}^{final\leavevmode\nobreak\ bias}$ can be seen as the summation of all $\varepsilon_{i}$ applied to an individual during the iterative process. To control the level of correction and stop the algorithm when needed, we introduce a second hyperparameter $\zeta\in[0,\infty)$ that will be a threshold that defines the maximal level of acceptable correction. Given a gender s (s binary in our case), we define :

\Sigma_{s}=\sum_{i=1}^{n_{s}}\varepsilon_{i}=\sum_{i=1}^{n_{s}}\Delta_{% \mathcal{V}}^{i},

the sum of the fairness biases of all $n_{s}$ individuals in a gender subgroup. When $\Sigma_{s}<\zeta$ , the algorithm will be stopped. This way, $\zeta$ can be used to control the tradeoff between fairness and performance, larger corrections leading to better fairness but a degradation of the quality of the premiums. The quality of the premiums after redistribution is measured using two metrics :

•

Redistribution integrity, measures to which extent the redistribution has altered the scope of the premium’s distribution :

\frac{(\max\widetilde{y}-\min\widetilde{y})}{(\max\widehat{y}-\min\widehat{y})}.

•

Global variation. Within the framework of the optimal transport approach, the fair premium is characterized by a balance property, corresponding to a null global variation. Measures the lost in premium induced by the redistribution :

\sum_{i=1}^{n}\widetilde{y}_{i}-\sum_{i=1}^{n}\widehat{y}_{i}.

With these elements, we detail the alternative approach in the following lines. Initialize by choosing $\eta,\zeta$ and setting a subgroup of $s$ to start with $(s=0)$ . At each iteration, the treated subgroup will be switched and the bias will be recalculated. While $\Sigma_{s}\geq\zeta$ repeat the following steps :

Correct the premiums for the $s$ gender subgroup (update $\widehat{y}_{i}$ ):

\widehat{y}_{i}=\widehat{y}_{i}-\frac{\varepsilon_{i}}{\eta},\leavevmode% \nobreak\ \leavevmode\nobreak\ \forall i\text{\leavevmode\nobreak\ with% \leavevmode\nobreak\ }s_{i}=s.

Switch gender subgroup, take the opposite gender :

\begin{cases}\text{if }s=0,\text{ then }s=1,\\ \text{if }s=1,\text{ then }s=0.\end{cases}

Measure the bias $\varepsilon_{i}$ between individuals of gender $s$ and their neighbors of opposite gender using the flip-test method (update $\varepsilon_{i}$ ):

\varepsilon_{i}=\widehat{y}_{i}-\frac{1}{k}\sum_{x_{j}\in\mathcal{V}^{s}(% \boldsymbol{x}_{i})}\widehat{y}_{j}=\Delta_{\mathcal{V}}^{i},\leavevmode% \nobreak\ \leavevmode\nobreak\ \forall i\text{\leavevmode\nobreak\ with% \leavevmode\nobreak\ }s_{i}=s.

Calculate the sum of the differences:

\Sigma_{s}=\sum_{i=1}^{n_{s}}\varepsilon_{i},

if this sum is greater than $\zeta$ then repeat these five steps else stop the iterative process.

The genders are switched before steps 3 and 4 because, after correction on a given gender subgroup, we want to recalculate the biases on the opposite gender subgroup and perform the check on the new $\Sigma_{s}$ . If $\Sigma_{s}$ is still above our threshold $\zeta$ , then we will repeat step 1 before switching for the next verification. This is because each correction on $\widehat{y}$ leads to a new fairness state which must be reevaluated with $\varepsilon_{i}$ to ensure a proper correction in the next iteration. Performing the check on the opposite gender ensures that the algorithm doesn’t stop immediately after one gender subgroup has obtained sufficient corrections. As hyperparameters, $\zeta$ and $\eta$ have to be optimized to ensure the best possible results with regards to the redistribution integrity and global variation metrics.

	pros	cons
Pre-processing	preserves the overall modeling process. Straightforward to implement and consume less computational time.	can lead to substantial loss of information. Variables may be closely interrelated and might not allow for both bias reduction and information retention. Additionally, it’s crucial to prevent the modeling process from introducing new biases related to fairness.
Total deletion	allows simple variable selection. HGR coefficients facilitates the identification of dependencies that classical measures might overlook.	depends on the ability of the remaining variables to compensate for the loss of information resulting from variable deletion.
Correlation remover	achieves an intriguing balance between fairness and performance. It is relatively simple to implement, relying on linear regression.	requires quantitative explanatory variables and ideally a quantitative sensitive variable. After correction, discretization of the transformed variable’s distribution is necessary, along with the construction of a correspondence function between the original and transformed variables for predictions.
Fair-SMOTE adaptation	needs various scenarios and hyperparameters allow customization to address specific problem requirements while sampling.	seems to have a limited effect on historical fairness bias and interdependencies.
In-processing	leverages sensitive variable information to identify the optimal trade-off between performance and fairness. In theory, they are more likely to yield the best possible compromise.	can be challenging to implement and generalize. Even after successful implementation, there is no assurance of convergence, and computation times can become exponential.
Exponentiated gradient	offers the advantage of providing a comprehensive framework for the direct integration of fairness constraints into machine learning models.	may require exponential computation times, even when dealing with relatively simple constraints. Certain cost functions and the imposition of various constraints may necessitate fundamental restructuring of the system to accommodate the application of the exponentiated gradient method.
Post-processing	avoids model recalibration, leading to shorter computation times, and produce outcomes less susceptible to bias contamination.	relies on the quality of built models, premium, and bias models, to be effective. Applying mitigation using inaccurately estimated components can result in inconsistencies.
Fair redistribution	addresses individual fairness within regression scenarios and can be tailored to suit the specific problem under consideration.	needs monitoring of the quality of the premium and KNN models. And, considering the current distribution constraints, an extra step will may be necessary building a grid that encompasses all premiums corrections or distributable model that predicts the corrected premiums.

Table 1: Summary of pros and cons of the different mitigation approaches

4 Measuring biases in car insurance pricing

For this glass breakage guarantee pricing application, various business and operational constraints alongside statistical considerations are essential throughout the pricing process, such as distribution constraints. Table 2 displays the variables extracted from a car insurance database, augmented with vehicle information. It includes variable names, descriptions, values, and statistics. Quantitative variables show mean and median values, while qualitative ones indicate the shares of the two most common categories.

Variable name	Description	Value taken	Statistics
claim_amount	individual claims expenses (€)	$[0,1825]$	$\bar{x}:580\text{\texteuro}\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:596% \text{\texteuro}$
claim_nb	number of claims	$[0,5]$	$0:97.3\%\leavevmode\nobreak\ \|\leavevmode\nobreak\ 1:2.7\%$
expo	exposure by contract	$[0{.}34\%,100\%]$	$\bar{x}:79\%\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:95\%$
year_pol	year the policy was purchased	$[2015,2020]$	$2018:18\%\leavevmode\nobreak\ \|\leavevmode\nobreak\ 2020:17\%$
driv_age	age of primary driver	$[18,77]$	$\bar{x}:47\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:45$
driv_yp	number of years in portfolio	$[0,12]$	$\bar{x}:1\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:1.98$
area	area code	17 zones	$D:29\%\leavevmode\nobreak\ \|\leavevmode\nobreak\ F:23\%$
driv_gender	gender of primary driver	F, M	$M:58{.}4\%$
driv_ly	driver licence seniority	$[0,44]$	$\bar{x}:15\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:17$
driv_2	presence of secondary driver	0, 1	$0:69\%$
veh_age	age in year of the vehicle	$[0,44]$	$\bar{x}:5\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:4$
energy	type of energy	5 types	$D:57\%\leavevmode\nobreak\ \|\leavevmode\nobreak\ E:43\%$
weight	weight in kilograms	$[830,3200]$	$\bar{x}:1240\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:1280$
veh_power	vehicle power in KW	$[13,220]$	$\bar{x}:94\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:91$
veh_price	vehicule price	$[6.8k,65k]$	$\bar{x}:21400\text{\texteuro}\leavevmode\nobreak\ \|\leavevmode\nobreak\ Me:154% 25\text{\texteuro}$
box_type	type of gearbox	2 types	$A:91.1\%$
claim_hist	claim occurrence in previous observed years	0, 1	$0:91.7\%$

Table 2: Description of the dataset’s variables.

The data underwent processing and analysis, including variable reprocessing and discretization, to construct a modeling database. Output variables frequency, average cost, and pure premium were constructed using claim_amount, claim_nb, and expo. Two additional variables were created: zoning, describing risk zones with ten categories, and weight_kw, representing the weight of the vehicle divided by its power. Models were built using Generalized Linear Models (GLM), Random Forest, and Gradient Boosting, evaluated using Root Mean Square Error (RMSE) and Loss Ratios (LR), and optimized and validated using interpretability methods.

Following the pricing phase, GLM models were selected for retention due to their interpretability and ease of integration into production pricing tools. Despite the potential for black-box models to achieve better performance, the marginal gain did not justify their operational costs. However, calculations were performed on all models at each stage, and there were generally no significant deviations.

Furthermore, the pure premium model was favored over a combined cost and frequency model for efficiency according to defined metrics. Reference models were established by excluding gender. The study then delves into measuring and mitigating gender bias to refine the pricing process using developed tools.

To assess fairness bias, six different dependence measures are employed:

1-2

Kendall’s tau and mean ratio : These provide an initial understanding of dependence, offering a comprehensive first impression,
3-4

Kolmogorov-Smirnov’s test $p$ -value and the JS divergence : These quantify dependence between distributions,
5

HGR (or HGR_KDE) : A potent metric offering a nuanced perspective,
6

Flip-test adaptation : An individual-based fairness metric.

The first five measures focus on group fairness, specifically independence, disregarding variable $Y$ . The sixth measure, the Flip-Test Adaptation, examines bias at the individual level, providing an alternative perspective.

4.1 Bias on historical data

Table 3 displays the results of the dependence measures between each of the variables of interest and the sensitive variable.

$Y$	Kendall	HGR_KDE	KS ( $p$ -value)	Div_JS	mean_ratio	Flip-test
Average cost	-0.0796	0.0903	1.9426e-07	0.3217	1.1024	-10.23€
Frequency	-0.0197	0.3031	3.3333e-02	0.8413	1.2489	-2.57%
Premium	-0.0212	0.3106	3.8555e-02	0.8401	1.3450	-7.88€

Table 3: Dependence between

Y

and

S

before modeling

Kendall’s tau indicates a weak dependence between the variables of interest ( $Y$ and $S$ ), with a negative sign suggesting slightly weaker values for women. HGR also detects slightly stronger relationships, consistent with Kendall’s tau. The KS test highlights significant differences between male and female distributions.

According to the flip-test, compared to similar male policyholders, women have an average cost €10 lower. This observation aligns with the negative sign of Kendall’s tau and the mean ratio values, indicating lower averages for women compared to men.

Figures 1 and 2 analyze the distributions of these variables with respect to gender. For the frequency variable, distributions between $[0,1)$ for males and females are similar, representing 97.3% of the population. At $1$ , there’s a 15% higher representation of women compared to men, accounting for 1.2% of the population. Above $1$ , men are more strongly represented, with an 18% higher presence compared to women, representing 1.5% of the population.

Refer to caption — Figure 1: Historical average cost and premium distribution by gender.

The observed distributions suggest that, on average, men generally exhibit higher values for claim cost, frequency, and premium compared to women in historical data. However, interpreting these figures in absolute terms poses challenges due to the difficulty in establishing critical thresholds, given the lack of precedents in this type of problem. Nonetheless, these initial findings imply that gender has an impact on the variables of interest. Besides the class imbalance (more men than women), bias appears evident in the historical data. Such differences are well-documented in automobile pricing and are typically addressed by either excluding gender as a factor or rebalancing the model outputs. It’s worth noting that risk exposure may differ between gender classes due to various factors (see Ayuso et al. (2016)). To mitigate this effect, our dataset comprises individuals who subscribed to the same mileage package, providing the most accurate information available on vehicle usage.

4.2 Bias after modeling using gender

By definition, the best-performing model would include gender as a feature because it leverages all the available information for risk modeling. However, it is also the most unfair model concerning gender since a clear distinction between men and women is directly visible in the resulting premiums. To create this model, gender is reintroduced into the modeling process. Once constructed, predictions $\widehat{Y}$ are obtained, and the dependence between $S$ and $\widehat{Y}$ is measured. Table 4 presents the results of these different measures.

$\widehat{Y}$	Kendall	HGR_KDE	KS ( $p$ -value)	Div_JS	mean_ratio	Flip-test
Average cost	-0.1824	0.2241	0.0000e+00	0.3825	1.0960	-3.81€
Frequency	-0.1949	0.3144	0.0000e+00	0.7333	1.2219	-1.09%
Premium	-0.2101	0.3268	0.0000e+00	0.7116	1.3524	-1.44€

Table 4: Dependence between

\widehat{Y}

and

S

after modeling containing the gender

The dependencies between the interest variables and the sensitive variable were amplified according to almost all measures. Measures leveraging distributions, like div_JS and the flip-test, suggest weaker dependencies because $\widehat{Y}$ has less dispersion than $Y$ . Kendall’s tau detects triple the dependence for the cost variable, implying a somewhat simpler detection of the dependency structure. Thus, the constructed models not only replicated the historical bias present in the data but also exacerbated it. Figures 3, 4, and 5 analyze the distributions of these predicted interest variables concerning gender.

The mismatch between the distributions is more pronounced post-modeling, with larger portions of the distributions not overlapping. The differences are more evident compared to the historical data case.

How can we explain this bias amplification? The interpretation of the constructed models reveals that gender plays a significant role in the prediction processes. By examining coefficients and importance measures, it becomes evident that gender is among the most important variables in these models.

The greater measured bias in the constructed models compared to the historical data used for modeling can be explained by the interdependence of the variables in the dataset. There is a link between the explanatory variables selected and the variables of interest because they help understand the associated risk. However, these variables, besides their predictive abilities, may also be related to each other. For example, a vehicle’s power may be linked to its price. These interdependencies pose challenges for fairness implementation because the so-called sensitive variables often have a notable influence on the distributions of the other observed variables. Thus, even if the observable interdependencies seem weak at first glance, their accumulation can magnify the role of the sensitive variable in the models.

In the data used, gender is significantly linked with variables such as horsepower, weight, gearbox type, and vehicle price, in addition to its minor relationships with other explanatory variables. Figure 6 visually represents how gender directly influences the variables of interest and has a significant indirect effect through its relationships with other variables.

For instance, in the cost model, gender is ranked as the fourth most influential variable. Nevertheless, it’s noteworthy that the most substantial variables within the model are those closely related to gender. Generally, sensitive variables exert a significant impact on individuals’ behavior. For instance, age can influence risk-taking propensity, the level of responsibility, personal interests, and more. Similarly, ethnicity may correlate with factors such as geographical location, purchased services, and cultural preferences.

These analyses inevitably lead to a central question: What are the implications when gender is omitted or excluded from the modeling process? This raises the possibility that the observed interdependencies might have been overemphasized, and that the models have unduly relied on gender as a driving factor.

4.3 Bias after modeling without gender

By definition, a model that excludes gender as a variable aims to eliminate direct discrimination, ensuring that final outcomes are not discerned based on gender. However, it’s acknowledged that this strategy doesn’t address indirect discrimination as it overlooks the interrelationships between other explanatory variables and gender. Table 5 presents findings on the association between $\widehat{Y}$ and gender in models where gender isn’t included as an explanatory variable.

$\widehat{Y}$	Kendall	HGR_KDE	KS ( $p$ -value)	Div_JS	mean_ratio	Flip-test
Average cost	-0.1681	0.2111	9.7460e-39	0.3288	1.0118	-2.71€
Frequency	-0.1309	0.2741	0.0000e+00	0.6493	1.1573	-1.07%
Premium	-0.1733	0.2897	0.0000e+00	0.7226	1.2694	-0.88€

Table 5: Dependence between

\widehat{Y}

and

S

after modeling without gender

The results show a minimal reduction in bias compared to models where gender is included, indicating that gender’s influence remains pronounced even when it’s absent from the models. While it may seem counterintuitive that premiums no longer differentiate by gender, there are still distinctions observed based on gender, primarily due to indirect interdependencies within the data. For example, varying premiums based on the presence of secondary drivers indirectly leads to varied premiums for each gender because secondary drivers have different distributions by gender. In complex multivariate models, such distinctions based on gender can arise from the combination of multiple variables. For instance, a combination of factors such as vehicle power, gearbox type, and zoning may result in significant gender imbalances within the dataset. When examining the intersection of zoning and vehicle prices, notable gender imbalances become evident in the data. For instance, 82% of individuals residing in the 10th zone and owning a vehicle costing over €30,000 are male. Consequently, any substantial premium difference within this segment compared to others significantly impacts men, leading to gender-based premium disparities, despite the absence of explicit gender classification.

4.4 A different modeling approach

These findings highlight that implementing fairness through omission does not offer a comprehensive solution to discrimination. An alternative modeling approach is employed to assess dependency levels. Instead of excluding the gender variable, it is included in the model, and then the outputs are post-processed to ensure gender-insensitive pricing, aligning with strategies used by insurers to comply with regulations like the gender directive. The specific approach calculates a weighted average of male and female model outputs for all segments, but the results show similar dependencies between the new values of $\widehat{Y}$ and $S$ as previous modeling strategies. This method fails to resolve interdependencies, as reweighting the outputs perpetuates existing imbalances.

Consequently, gender’s influence persists within the modeling process, even after its omission or conventional adjustment. Some scholars refer to this as the "reconstruction" of the sensitive variable, where bias continues to manifest due to interdependencies between explanatory variables and the sensitive variable. This study exemplifies such a scenario.

In summary, bias exists in historical data, and regardless of the modeling approach applied, it is perpetuated, primarily due to interdependencies. The subsequent section explores bias mitigation while maintaining acceptable performance, using the model constructed without gender as reference.

5 Bias mitigation in car insurance pricing

The models have been constructed (using standard machine learning approaches), performance and biases have been assessed, and the next step involves mitigating bias while upholding the performance level. This section provides the implementation specifics and outcomes of the various bias mitigation methods introduced in Section 3.

5.1 Total deletion

To establish the deletion scenarios, we analyze the relationship between sensitive and non-sensitive variables. The results are depicted in Figure 7.

By examining these dependency levels, $\alpha$ can assume values within the range of $(8\%,41\%].$ Scenarios are developed by iteratively adjusting the threshold, considering the relationships between the variables. The chosen scenarios are as follows:

•

Scenario 1: driv_2;
•

Scenario 2: driv_2 and box_type;
•

Scenario 3: driv_2, box_type and energy;
•

Scenario 4: box_type, driv_2, energy and weight;
•

Scenario 5: box_type, driv_2, energy and anc_cp;
•

Scenario 6: box_type, driv_2, energy, and zoning;
•

Scenario 7: box_type, driv_2, energy and veh_price;
•

Scenario 8: box_type, driv_2, energy, veh_price and weight_kw;
•

Scenario 9: box_type, driv_2, energy, veh_price and zoning;
•

Scenario 10: box_type, driv_2, energy, weight and zoning;
•

Scenario 11: box_type, driv_2, energy, veh_price, weight_kw and zoning.

The gender variable is eliminated from all models. Figure 8 illustrates the fairness and performance levels of each scenario. The non-dominated scenarios are highlighted in red. The abbreviation dbe refers to the variables driv_2, box_type and energy. For example, scenario 6 is represented on the graph as dbe+zoning.

Among the non-dominated scenarios, a choice must be made based on the desired trade-off as determined by decision-makers. For instance, the objective might be to attain the fairest model while tolerating a maximum performance loss of 5%. Scenarios 8, 9, 10, and 11, although among the fairest, would no longer meet this criterion. Upon analyzing the graph, the 7th scenario, labeled as dbe+veh_price, emerges as an interesting trade-off option, as it significantly reduces bias, almost halving it, with an acceptable performance loss. Compared to the reference model, a bias reduction of 37% is achieved in exchange for a 3.9% drop in performance. The 7th scenario model is further examined.

It appears that the model has learned to rely more on other variables to compensate for the absence of the veh_price variable. For instance, some variables like zoning, driv_ly, and veh_age now have larger coefficients and play a more significant role in the prediction process. However, the performance loss still indicates that these omitted variables cannot be entirely replaced optimally. Furthermore, this new model leads to a slightly degraded equilibrium in pricing compared to the reference equilibrium. The LR metric obtained is 99.3% compared to the reference LR of 99.7%. It is essential to note that once the input data are modified, a reevaluation and validation of the model are necessary.

Hence, this initial approach can yield satisfactory results depending on the quality of the explanatory variables and the acceptable level of performance loss. It is a straightforward method to implement and underscores the importance of addressing fairness concerns from the initial stages of data processing and variable selection.

5.2 Correlation remover

To apply the methodology introduced in Section 3.1.2, we have chosen to retain qualitative variables while focusing solely on transforming quantitative variables. The quantitative variables subject to transformation include: veh_price, weight_kw, veh_age, driv_yp, and driv_ly. We uniformly select $100$ values of $\alpha$ within the interval $[0,1]$ , including both endpoints. For each $\alpha$ value, we correct our quantitative variables, build our model, and measure performance and fairness.

From these results, two observations arise. First, there is no clear trend between the fairness level and the hyperparameter $\alpha$ . Typically, one might expect fairness and performance to decrease as $\alpha$ increases, but this trend is not evident. The relationship between fairness, performance levels, and $\alpha$ appears random, lacking a discernible pattern. Second, using this method, some models outperform the reference model. For example, the model with $\alpha=85\%$ shows lower bias and slightly better performance. While the performance gain is only around 0.5%, it comes with an 11% improvement in fairness. However, the efficacy of this method relies on the linear model’s ability to detect relationships, ensuring that residuals are unbiased. Nonetheless, examining linear correlations reveals that the highest dependence between $S$ and other explanatory variables is 8%, significantly limiting suppression capacity in this specific case.

While this method seems promising, it presents several limitations within the insurance pricing context. First, in pricing models, it is more common to use discrete or discretized variables as inputs. Ideally, continuous variables should be discretized after suppression to ensure interpretability. However, as the values represent residuals, they may lose interpretability. Additionally, in practical production scenarios, transforming customer attributes into predicted premiums may be complicated by this method. Lastly, the non-intuitive model behavior in response to changes in $\alpha$ lacks a clear explanation. Considering these factors, while a model outperforming the reference model was achieved, adopting it is challenging due to associated limitations. This method may find greater success when applied to a continuous sensitive variable.

5.3 Fair-SMOTE adaptation

Balancing the number of bins and the size of each subpopulation, we have opted for a segmentation into seven bins, with the following breakdown:

•

bin 1 if $y=0$ ;
•

bin 2 if $y\in(0,250]$ ;
•

bin 3 if $y\in(250,500]$ ;
•

bin 4 if $y\in(500,750]$ ;
•

bin 5 if $y\in(750,1000]$ ;
•

bin 6 if $y\in(1000,1500]$ ;
•

bin 7 if $y\in(1500,+\infty)$ .

The number of bins serves as a hyperparameter, resulting in various binning scenarios. The selection of these bins depends on their alignment with the observed trends in the premium distribution.

Following the bin selection, the next step is to determine the scope of resampling. As previously discussed, resampling will only be performed on $S|Y$ since resampling on $Y$ would alter the specific distribution needed for the GLM. Although for algorithms like random forest, which make no distribution assumptions, resampling on $Y$ was attempted, it led to a notably lower LR due to the high premium levels predicted by the models.

It’s worth noting that the creators of the traditional fair-SMOTE method recommend setting $st$ and $ft$ to 0.8 each. This configuration is retained for consistency. The distribution after resampling is illustrated in Figure 9.

In total, 9588 individuals were generated, constituting a substantial expansion of the learning dataset by 16.1%. The resulting distribution of this enlarged dataset remains in alignment with the initial observations, and the distinctions between genders are notably less prominent. The box plots demonstrate a substantial overlap, with the noticeable shift observed in the historical data significantly mitigated. Therefore, this resampling procedure appears to be effective. Following the preparation of the augmented training data, the models are reconstructed, and the outcomes are presented in Table 6.

Models	RMSE	HGR_KDE	LR
Reference model	164.21	28.97%	99.71%
Model after fair-SMOTE	164.88	28.05%	99.69%

Table 6: Comparative table of results after fair-SMOTE.

The initial distribution shows reduced skewness, but the outcomes on the test dataset still exhibit skewed results. The improvement in fairness is minimal, with less than a 4% enhancement. Performance metrics and the LR remain similar to the baseline. Therefore, rebalancing the distributions solely based on $S|Y$ seems inadequate for achieving fairness. This could be because the bias needing mitigation in car pricing does not stem from a representation issue concerning $Y$ , but rather from historical differences in male and female premiums. Thus, while resampling ensures both genders receive sufficient data for equitable treatment during training, it doesn’t address the interdependencies between variables, crucial for ensuring fairness.

Sensitivity analyses on various modeling choices were conducted to explore whether these choices contributed to the ineffectiveness of our approach. Changes to hyperparameters $ft$ and $st$ didn’t result in improved outcomes. Adjusting the resampling bins also didn’t lead to significant enhancements. Increasing the number of bins introduced more irregularities in the distribution of $Y$ and reduced model performance.

A significant departure from conventional practices in the literature was limiting the resampling process solely to $S|Y$ rather than conducting full resampling on $Y$ . Full resampling of $Y$ altered its distribution drastically, generating an excessive number of artificial instances and failing to produce satisfactory results. To explore the potential benefits of resampling on $Y$ while preserving the overall distribution shape, partial rebalancing of $Y$ values was considered. For instance, in scenarios where $Y$ has only two distinct values, such as $\{25,35\}$ , Tables 10 to 10 demonstrate the various forms of resampling undertaken.

Y/S	F	H	Total
25 €	250	150	400
35 €	120	170	290
Total	370	320	690

Table 7: initial distribution

Y/S	F	H	Total
25 €	250	250	500
35 €	170	170	340
Total	420	420	840

Table 8: Balance on

S|Y

Y/S	F	H	Total
25 €	250	250	500
35 €	250	250	500
Total	500	500	1000

Table 9: Balance on

S|Y

and

Y

Y/S	F	H	Total
25 €	250	250	500
35 €	200	200	400
Total	450	450	900

Table 10: Balance on

S|Y

and partially on

Y

Multiple scenarios involving partial resampling of $Y$ were explored. The results indicate that as the number of artificially generated individuals increases, model performance worsens without simultaneous improvements in fairness. For example, expanding the population by 20% through resampling led to an increased RMSE of 177, while the HGR fairness metric remained at 28.05. Ultimately, these approaches failed to produce better results.

An unexplored area for potential improvement involves considering the explanatory variables during the resampling process. Instead of resampling solely to achieve equity based on $S|Y$ , an alternative approach is resampling based on $S|X_{1},\dots,X_{p},Y$ . The selection of variables for conditioning is crucial here, as they can potentially amplify existing biases. Implementing this approach is complex, as it requires overseeing the quality of resampling on individual variables and their interactions while maintaining the overall consistency of premium distributions. Although this approach hasn’t been extensively explored in this study, integrating more efficient methods for addressing interdependencies between variables may be considered in future research.

In summary, the various pre-modeling bias mitigation strategies have provided insights into the data’s structure and identified elements with significant influence on bias levels. They offer valuable tools for addressing ethical concerns during the data preprocessing phase.

5.4 Exponentiated gradient

To achieve the model with the lowest permissible error level, bounded by $\zeta$ , a specific approach is adopted. The method starts with the lowest feasible $\zeta$ value, which is initially set at $10^{-5}$ , and progressively increases it in multiples of 10 until the algorithm converges. The first convergence occurred when $\zeta=10^{-1}$ . Once the method converges to the optimal solution, the grid search and the $M$ parameter become unnecessary.

However, for $\zeta<10^{-1}$ , these tools can help identify suboptimal solutions that may outperform the solution found with larger tolerances. During testing, values of $M$ ranging from 0.5 to 500 were employed with grid sizes of 3000. The results obtained were not as favorable as those achieved with convergence. It is essential to note that the grids explored were relatively small compared to the dimensionality of the dataset. Given more computational power and time, improved results might potentially be attainable.

The outcomes for the best model are presented in Table 11.

Models	RMSE	HGR_KDE	LR
Reference model	164.21	28.97%	99.71%
Model after mitigation	164.30	31.74%	99.70%

Table 11: Evaluation of the results obtained after application of the exponentiated gradient.

The model obtained after mitigation exhibits reduced efficiency and increased unfairness compared to the reference model. While the loss of performance is negligible, the model experiences a 9% higher HGR after the application of the mitigation method. These outcomes can be attributed to the fact that the error constraint implemented does not enforce any form of independence between $\widehat{Y}$ and $S$ . In other words, the error rate per gender can be the same without ensuring fair treatment of the genders.

In addition to the inability to accommodate alternative fairness definitions, this method presents exponential computation times. The grid search approach and the parameter $M$ offer an alternative but lead to suboptimal results. In the rapidly evolving domain of bias mitigation, the aspect of mitigation during modeling stands as one of the most challenging and least developed. The limited availability of methods primarily focused on binary classification, combined with the high customization required for their adaptation, compounds the challenge. Nonetheless, the method explored here, implemented under a relatively straightforward fairness constraint, serves to illustrate the limitations associated with implementing fairness as an optimization constraint. Furthermore, the quest for accessible, generalizable, and stable methods remains a significant mathematical challenge that must consider the constraints inherent to the field of pricing.

5.5 Fair redistribution

This method employs the flip test’s adaptation to measure individual fairness. We optimized the $k$ -nearest neighbors used in the flip test through hyperparameter tuning. The objective was to select the most suitable variables that would minimize the differences between individuals in a test database. As discussed earlier, the goal is to develop a model that reduces the distance and bias between individuals.

The selected variables include: driv_yp, driv_2, veh_age, veh_price, box_type, and zoning, with a number of neighbors set to 5 and a Manhattan distance metric ( $\ell_{1}$ ). In Python, there is an attribute in the $k$ -nearest neighbor algorithm that can automatically select the optimal neighbor search method based on factors like dimensionality, the number of individuals, and the data structure (e.g., whether the matrix is sparse or not). This automated method yielded the most favorable results. Figure 10 illustrates the distribution of disparities between women and their corresponding male neighbors.

Subsequently, for each individual in the database, the average disparity with their closest neighbors of the opposite gender is computed. These average discrepancies are incorporated into the original modeling dataset. An experiment is then conducted on a test dataset, encompassing a total cost of 359584€ and predicted premiums of 360884€ for 14000 observations. The Loss Ratio (LR) of 99.64% aligns well with the overall LR of 99.7%. On average, women’s premiums are 0.8€ lower than those of the nearest men. Table 12 provides a summary of the interesting aggregates calculated on the test dataset.

Aggregates	Female	Male	Sum
Total expenses	154880€	204704€	359584€
Predicted Premiums	154953€	206931€	360884€
Exposure	4776.6	7114.6	11891.2
Number of individuals	6069	7931	14000
Average $\varepsilon$	-0.8€	1.2€	0.4€
Sum of bias ( $\Sigma_{s}$ )	-17448€	22029€	4581€

Table 12: Some relevant aggregates on the test perimeter.

The sum of individual fairness bias for women is 17448€, while men have a sum of 22029€. These disparities can be interpreted as follows: on average, women pay 17448€ less than men with similar characteristics. The signs of the differences align with expectations, reflecting that women pose lower risks compared to men. However, the differences do not fully offset each other, resulting in a residual difference of 4581€ (-17448€ + 22029€). This remaining variance can be attributed, in part, to the larger representation of men in the database and the imperfections within the employed models.

To assess the influence of different values for $\zeta$ and $\eta$ , and to identify the optimal combination, the redistribution process is repeated across a grid of values. This grid covers all possible combinations of $(\eta,\zeta)$ , where $\eta$ takes on values from the set $\{2,3,4,5,6,7,8,9,10\}$ and $\zeta$ from the set $\{2500,2000,1500,1000,500,100,10,1,0.1\}$ . Three factors are examined: computational time, global variation, and redistribution integrity.

When $\zeta\leq 100$ , the redistribution process results in a substantial reduction in the range of the $\widehat{Y}$ distribution. This occurs because the correction needed to achieve such low levels of differences between subgroups is extensive, causing premiums to become increasingly clustered until they eventually converge towards the sample’s mean premium. Consequently, pushing the redistribution method towards maximum convergence leads to an equilibrium where all individuals are assigned the same premium. While this equilibrium represents trivial fairness, it severely deteriorates the model’s performance. Therefore, it becomes imperative to guide the method towards optima where global variation is minimized while preserving the distribution of $\widehat{Y}$ . Among the 81 tested combinations, only 10 of them emerged as non-dominated solutions based on criteria of redistribution integrity and global variation. Out of these 10, 5 had a redistribution integrity of less than 25%, making them unacceptable despite having the smallest global variations. The two most promising scenarios are as follows:

1.

$\eta=6$ and $\zeta=2000$ for a global variation of $873$ € and an integrity of $88\%$ ;
2.

$\eta=5$ and $\zeta=2500$ for a global variation of $771$ € and an integrity of $86\%$ .

The scenario with 88% fidelity is preferred, resulting in a sum of gaps of -603€ for women and +1682€ for men. Consequently, the total gap is reduced to 1079€, representing a 76% decrease compared to the initial gap of 4581€. The average bias are now 0.071€ for women and 0.14€ for men, all while preserving a distribution of $\widehat{Y}$ that remains faithful to the one prior to redistribution. Figure 11 illustrates the overlap of histograms before and after the redistribution.

Regarding performance, the premiums post-redistribution exhibit a Root Mean Square Error (RMSE) of 164.72, which closely aligns with the baseline value of 164.21. Despite the increase of 873€ in total premiums (global variation), the Loss Ratio (LR) only experiences a minor reduction from 99.64% to 99.39%. Thus, premiums maintain their consistency and calibration while effectively reducing the gaps between men and women. For values of $\zeta$ less than or equal to 100, most computation times are under five minutes, rendering this criterion less relevant for defining the best redistribution strategies.

The introduction of bias mitigation methods has prompted the examination of fairness considerations throughout various phases of the pricing process. The successful integration of these mitigations necessitates alignment with the diverse requirements and limitations inherent to pricing. Consequently, they should be operationally feasible while striving to maintain the performance of the models. Among the implemented methods, variable suppression, due to its simplicity, and redistribution, owing to its personalized approach, appear to yield the most promising outcomes. The other methods contribute to a more comprehensive understanding of the analyzed bias and may yield superior results in alternative datasets. An overview of the results stemming from the various mitigation strategies is presented in Table 13.

metrics	reference model	total deletion	correlation deletion	fair-SMOTE Adaptation	exponentiated gradient	fair redistribution
HGR KDE	28.97%	18.25%	27.66%	28.05%	31.74%	29.05%
RMSE	164.21	170.61	163.03	164.88	164.30	164.72
Loss ratio	99.71%	99.30%	99.79%	99.69%	99.70%	99.39%

Table 13: Summary of bias mitigation results : the HGR KDE metric evaluates the dependance between

\hat{Y}

and

S

. RMSE ans Loss ratio, defined in (1), evaluate the performance of the model after various mitigation strategies.

6 Conclusion

Fairness serves as a crucial constraint in the realm of pricing. Whether it arises from regulatory requirements or a company’s strategic objectives, it is imperative to establish, assess, and alleviate biases to attain equitable models. Consequently, there was an initial examination of fairness from a mathematical standpoint, wherein various metrics were introduced to identify biases both before and after the modeling process, irrespective of the conventional gender-related treatment. Subsequently, a study of mitigation methods was conducted, which encompassed pre-, mid-, and post-modeling interventions. These interventions aimed to address bias comprehensively by reprocessing the data and imposing fairness constraints, as well as to address bias individually by reprocessing distinct premiums. While some methods proved more effective than others, each approach offered valuable insights.

In this application, taking into account the constraint of pricing model performance, unfairness as defined by our metrics was not fully mitigated because it would have led to a significant degradation of our models. For example, in total deletion more variables could be deleted or more redistribution iterations could be made in the fair redistribution method. Apart from this constraint, full fairness seems difficult to achieve, at least because corrections are made on a training dataset and with statistical randomness, changes in the portfolio structure etc. fairness would be partial.

In continuation of this research, further datasets and scenarios can be explored to obtain more robust results and establish a reference benchmark. It will also be essential to extend the analysis to encompass other pricing granularities such as portfolio segments or groups of coverages.

This study, nevertheless, lays the groundwork for quantifying and mitigating biases within the insurance domain, providing a framework for ongoing exploration. For instance, there is potential to enhance mitigation techniques during the modeling process and investigate the simultaneous treatment of multiple sensitive variables.

Acknowledgements

Data: Datasets used in this manuscript are based on confidential data obtained by a consulting company, from a (real) insurance company.

Conflicts of interest: The authors declare no conflicts of interest.

Declaration of funding: AC acknowledges Canada’s National Sciences and Engineering Research Council (NSERC) for funding (RGPIN-2019-07077) and the SCOR Foundation.

Acknowledgements: We are grateful to Laurence Barry for stimulating discussions and André Grondin for sharing this thoughts on earlier versions of this work.
FV thanks the France 2030 framework programme Centre Henri Lebesgue ANR-11-LABX-0020-01 for its stimulating mathematical research programs.
We are very grateful to the anonymous referees for valuable comments.

Appendix A Fair-SMOTE adaptation pseudo code

Algorithm 1 Fair-SMOTE adaptation

1:for each subgroup do

2: for each simulated individual of the given subgroup do

3: randomly select an individual

p

in the sub group

4: find via

k

-nearest neighbor the two

p

’s closest neighbors

v_{1}\text{ and }v_{2}

5: sample

u

from an uniform distribution

6: for each column of our

N

columns do

7: if column with binary values then

8: if

st>u

then

\tilde{x}=\text{{random\_choice}}(x_{v_{1}},x_{v_{2}},x_{p})

10: else

11:

\tilde{x}=x_{p}

12: end if

13: else if column with qualitative values then

14:

\tilde{x}=\text{{random\_choice}}(x_{v_{1}},x_{v_{2}},x_{p})

15: else if column with quantitative values then

16: if

st>u

then

17:

\tilde{x}=x_{p}+ft\times(x_{v_{1}}-x_{v_{2}})

18: else

19:

\tilde{x}=x_{p}

20: end if

21: else

22: return an alert to trigger the transformation of the column

23: end if

24: end for

25: end for

26:end for

With $u$ the realization of an uniform law on $[0,1]$ , $\tilde{x}$ the value of the coordinate of the new individual on the corresponding $j$ column $X_{j}$ , $(x_{j})_{j\in\{v_{1},v_{2},p\}}$ the value of the coordinate for the individuals $v_{1},v_{2}\text{ and }p.$ Random_choice, a random choice between parameters with each parameter having the same chances of being choice.

References

Agarwal et al. (2018) Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. ICML’18, pages 60––69, 2018.
Alycia and Wu (2022) Carey Alycia and Xintao Wu. The causal fairness field guide: Perspectives from social and formal sciences. Frontiers in Big Data, 5, 2022.
Angwin et al. (2016) Julia Angwin, Jeff Larson, Lauren Kirchner, and Surya Mattu. Machine bias. ProPublica, 2016.
Ayuso et al. (2016) Mercedes Ayuso, Montserrat Guillen, and Ana Maria Pérez-Marín. Telematics and gender discrimination: Some usage-based evidence on whether men’s risk of accidents differs from women’s. Risks, 4(2), 10, 2016.
Becker (1957) Gary S Becker. The economics of discrimination. University of Chicago press, 1957.
Berk et al. (2017) Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. A convex framework for fair regression. arXiv, 1706.02409, 2017.
Berk et al. (2018) Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. sociological methods & research. Sociological Methods & Research, 50(1), 2018.
Black et al. (2020) Emily Black, Samuel Yeom, and Matt Fredrikson. Fliptest: fairness testing via optimal transport. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 111–121, 2020.
Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. doi: 10.1017/CBO9780511804441.
Castelnovo et al. (2022) Alessandro Castelnovo, Riccardo Crupi, Greta Greco, Daniele Regoli, Ilaria Giuseppina Penco, and Andrea Claudio Cosentini. A clarification of the nuances in the fairness metrics landscape. 12, 2022. doi: 10.1038/s41598-022-07939-1.
Chakraborty et al. (2021) Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. Bias in machine learning software : Why ? how ? what to do ? arXiv, 2105.12195v3, 2021.
Charpentier (2024) Arthur Charpentier. Insurance, biases, discrimination and fairness. Springer, 2024. doi: 10.1007/978-3-031-49783-4.
Charpentier et al. (2023a) Arthur Charpentier, Emmanuel Flachaire, and Ewen Gallic. Causal inference with optimal transport. In Nguyen Ngoc Thach, Vladik Kreinovich, Doan Thanh Ha, and Nguyen Duc Trung, editors, Optimal Transport Statistics for Economics and Related Topics. Springer Verlag, 2023a.
Charpentier et al. (2023b) Arthur Charpentier, François Hu, and Philipp Ratz. Mitigating discrimination in insurance with wasserstein barycenters. In Proceedings of BIAS 2023, 3rd Workshop on Bias and Fairness in AI, International Workshop of ECML PKDD, 2023b.
Chouldechova (2017) Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. arXiv, 1701.08230, 2017.
Darolles et al. (2004) Serge Darolles, Jean-Pierre Florens, and Christian Gourieroux. Kernel-based nonlinear canonical analysis and time reversibility. Journal of Econometrics, 119(2), 323-353, 2004.
De Lara et al. (2021) Lucas De Lara, Alberto González-Sanz, Nicholas Asher, and Jean-Michel Loubes. Transport-based counterfactual models. arXiv, 2108.13025, 2021.
del Barrio et al. (2020) Eustasio del Barrio, Paula Gordaliza, and Jean-Michel Loubes. Review of mathematical frameworks for fairness in machine learning. arXiv, 2005.13755, 2020.
Dwork et al. (2011) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Rich Zemel. Fairness through awareness. arXiv, 1104.3913, 2011.
Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
Edgeworth (1922) Francis Y Edgeworth. Equal pay to men and women for equal work. The Economic Journal, 32(128):431–457, 1922.
EEOC (1979) The U.S. EEOC. Uniform guidelines on employee selection procedures. Equal Employment Opportunity Commission EEOC Technical Report, 1979.
Frees and Huang (2023) Edward W Frees and Fei Huang. The discriminating (pricing) actuary. North American Actuarial Journal, 27(1):2–24, 2023.
Freund and Schapire (1996) Yoav Freund and Robert Schapire. Game theory, on-line prediction and boosting. Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996.
Galles and Pearl (1998) David Galles and Judea Pearl. An axiomatic characterization of causal counterfactuals. Foundations of Science, 3:151–182, 1998.
Gebelein (1941) Hans Gebelein. Das statistische problem der korrelation als variations- und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.
Grary et al. (2022) Vincent Grary, Arthur Charpentier, and Marcin Detyniecki. A fair pricing model via adversarial learning. ArXiv, 2202.12008, 2022.
Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29:3315–3323, 2016.
Hirschfeld (1935) Hermann Otto Hirschfeld. A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4):520–524, 1935. doi: 10.1017/S0305004100013517.
Hu et al. (2023) François Hu, Philipp Ratz, and Arthur Charpentier. A sequentially fair mechanism for multiple sensitive attributes. arXiv, 2309.06627, 2023.
Kleinberg et al. (2016) Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv, 1609.05807, 2016.
Komiyama and Shimao (2017) Junpei Komiyama and Hajime Shimao. Two-stage algorithm for fairness-aware machine learning. arXiv, 1710.04924, 2017.
Kusner et al. (2017) Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in neural information processing systems, 30, 2017.
Lindholm et al. (2022a) Mathias Lindholm, Ronald Richman, Andreas Tsanakas, and Mario V Wüthrich. A discussion of discrimination and fairness in insurance pricing. arXiv preprint arXiv:2209.00858, 2022a.
Lindholm et al. (2022b) Mathias Lindholm, Ronald Richman, Andreas Tsanakas, and Mario V Wüthrich. Discrimination-free insurance pricing. ASTIN Bulletin: The Journal of the IAA, 52(1):55–89, 2022b.
Mary et al. (2019) Jérémie Mary, Clément Calauzenes, and Noureddine El Karoui. Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pages 4382–4391. PMLR, 2019.
of the European Union (2008) Council of the European Union. Proposition de directive du conseil relative à la mise en œuvre du principe de l’égalité de traitement entre les personnes sans distinction de religion ou de convictions, de handicap, d’âge ou d’orientation sexuelle. Presses Universitaires de France, 2008.
Phelps (1972) Edmund S Phelps. The statistical theory of racism and sexism. The american economic review, 62(4):659–661, 1972.
Rényi (1959) Alfréd Rényi. On measures of dependence. Acta Mathematica Hungarica, 10 (3-4):441––451, 1959.
Witsenhausen (1975) Hans Witsenhausen. On sequences of pairs of dependent random variables. SIAM J. Appl. Math., 28, 1975.

Measuring and Mitigating Biases in Motor Insurance Pricing

Abstract

1 Introduction

1.1 Motivation

1.2 Agenda

1.3 Notations

2 Measuring fairness and biases

2.1 Group or Statistical Fairness

Independence (demographic parity):

Separation (equalized odds):

Sufficiency (predictive parity):

Definition 1 (Statistical (Group) Fairness).

2.2 Individual fairness

Definition 2 (Disparate Treatment).

2.3 Quantifying unfairness

2.3.1 From the binary to the general case

Definition 3 (HGRHGR\mathrm{HGR}roman_HGR).

2.3.2 Focus on the case of pricing

Adaptation flip-test

3 Bias mitigation

3.1 Pre-processing mitigation

3.1.1 Total deletion

3.1.2 Correlation remover

3.1.3 Fair-SMOTE adaptation

3.2 In-processing

3.2.1 Exponentiated gradient

3.2.2 Grid search approach

3.3 Post-processing

3.3.1 Fair redistribution

4 Measuring biases in car insurance pricing

4.1 Bias on historical data

4.2 Bias after modeling using gender

4.3 Bias after modeling without gender

4.4 A different modeling approach

5 Bias mitigation in car insurance pricing

5.1 Total deletion

5.2 Correlation remover

5.3 Fair-SMOTE adaptation

5.4 Exponentiated gradient

5.5 Fair redistribution

6 Conclusion

Acknowledgements

Appendix A Fair-SMOTE adaptation pseudo code

References

Measuring and Mitigating Biases
in Motor Insurance Pricing

Definition 3 ( $\mathrm{HGR}$ ).