1 Introduction

Supervised machine learning has been studied and explored extensively during the last decades, and both theoretical and experimental solutions exist to accomplish this task (Hastie et al., 2009). Weakly supervised machine learning (WSL) has not reached this state yet. WSL is the machine learning field where algorithms learn a model from data with weak supervision instead of strong supervision. Multiple kinds of weak supervision have been identified in (Zhou, 2017), such as inaccurate supervision, when samples are mislabeled; inexact supervision, when labels are not adapted to the classification task; and incomplete supervision, when labels are missing. All of them reflect the inadequacy of the labels available in the real world. For every kind of weak supervision, assumptions are needed to design sound algorithms, especially on the corruption model, that is, the generative process behind the weakness of supervision.

Weaknesses in supervision are one facet of weakly supervised learning, and dataset shifts are another. Dataset shifts happen when the data distribution observed at training time differs from the data distribution expected at testing time (Moreno-Torres et al., 2012). This distribution change can take multiple forms, such as a change in the distribution of a single feature, of a combination of features, or of the concept to be learned. Thus, the common assumption that the training and testing data follow the same distribution is often violated in real-world applications. Again, designing algorithms to handle dataset shifts usually requires assumptions on the nature of the shift (David et al., 2010).

We believe that the biquality data setup proposed in (Nodet et al., 2021b) is a suitable framework to design algorithms capable of handling both dataset shifts and weaknesses of supervision simultaneously.

Biquality Learning assumes that two datasets are available at training time: a trusted dataset \(D_T\) and an untrusted dataset \(D_U\), both composed of labeled samples \((x_i, y_i) \in \mathcal {X}\times \mathcal {Y}\). These datasets share the same features \(\mathcal {X}_T=\mathcal {X}_U\) and the same set of labels \(\mathcal {Y}_T=\mathcal {Y}_U\), forming a closed set of features \(\mathcal {X}\) and labels \(\mathcal {Y}\) with \(K=\vert \mathcal {Y}\vert\) classes. However, the two datasets differ in terms of the joint distribution, \({\mathbb {P}}_T(X,Y)\ne {\mathbb {P}}_U(X,Y)\), where the trusted distribution is the distribution of interest.

In practice, the trusted dataset is often not large enough to learn an efficient model of \({\mathbb {P}}_T(X,Y)\). By contrast, it is generally easy to get an untrusted dataset large enough to estimate \({\mathbb {P}}_U(X,Y)\) correctly. However, this data distribution can be completely different from the distribution of interest, \({\mathbb {P}}_T(X,Y)\). In the biquality data setup, there is no assumption on the difference in joint distribution between the two datasets, and it can cover a wide range of known problems. From the Bayes Formula:

$$\begin{aligned} {\mathbb {P}}(X,Y) = {\mathbb {P}}(X \mid Y){\mathbb {P}}(Y) = {\mathbb {P}}(Y \mid X){\mathbb {P}}(X) \end{aligned}$$
(1)

distribution shift covers covariate shift \({\mathbb {P}}_T(X)\ne {\mathbb {P}}_U(X)\), concept drift \({\mathbb {P}}_T(Y \mid X)\ne {\mathbb {P}}_U(Y \mid X)\), class-conditional shift \({\mathbb {P}}_T(X \mid Y)\ne {\mathbb {P}}_U(X \mid Y)\) and prior shift \({\mathbb {P}}_T(Y)\ne {\mathbb {P}}_U(Y)\).

The Biquality Data setup typically occurs in three scenarios in practice.

  1. The first scenario corresponds to the case where annotating samples is so expensive that labeling an entire dataset is prohibitive, but labeling a small part of it is doable. This is typically the case in fraud detection and in cyber security, where labeling samples requires complex forensics from domain experts. There, the rest of the dataset is usually labeled by hand-engineered rules which might not perfectly fit the classification task, so these labels cannot be fully trusted (Ratner et al., 2020).

  2. The second scenario happens when dataset shifts occur during the labeling process over the course of time. For example, in MLOps (Kreuzberger et al., 2022), when a model is first learned on clean data and then deployed in production, its predictions can be used to learn an updated model (Veeramachaneni et al., 2016); these predictions may be faulty and need to be dealt with. A second example in MLOps occurs when new clean data is acquired to retrain a deployed model: the most recent clean data is considered trusted and the older clean data is considered untrusted when dataset shifts have occurred (Gama et al., 2014).

  3. The third scenario occurs when multiple annotators are responsible for dataset labeling. In natural language processing (NLP) products, for example, multiple annotators follow labeling guidelines to annotate verbatims. How well annotators follow these guidelines may vary, and their labels might not be trusted. However, if one annotator can be trusted, all the other annotators can be marked as untrusted and the biquality data setup applies. In particular, considering each untrusted annotator against the trusted annotator can be viewed as a biquality learning task (Yuen et al., 2011).

Having the trusted and untrusted datasets available at training time makes it possible to design algorithms dealing with closed-set distribution shifts. While algorithms designed for biquality data with concept drift only have recently been explored (Nodet et al., 2021a), algorithms that deal with general distribution shift are still lagging behind. In this paper, we propose two biquality approaches that adapt methods from either the covariate shift literature or the concept drift literature in order to deal with both corruptions simultaneously.

Multiple biquality learning algorithm designs have been identified in (Nodet et al., 2021b). They have been divided into three main families based on how they modify instances to correct the global learning procedure. Untrusted instances can be (i) relabeled correctly, (ii) modified in the feature space, or (iii) reweighted such that the untrusted dataset seems sampled from the trusted distribution \({\mathbb {P}}_T(X,Y)\). We propose here two corresponding algorithms for the third case of importance reweighting.

In Sect. 2, a brief state of the art reviews what has already been achieved in biquality learning. Then, Sect. 3 focuses on the state of the art of importance reweighting for biquality learning. Sections 4 and 5 introduce our proposals to use classifiers to reweight untrusted instances for biquality learning with distribution shifts. Then, Sect. 6 describes the experiments that evaluate the efficiency of our proposed approaches on real datasets and corruptions. Section 7 presents the results of these experiments. Finally, Sect. 8 opens some discussions about the presented results before concluding.

2 Related work

Machine learning algorithms on biquality data have been developed in many different sub-domains of weakly supervised learning, including robust learning to label noise, learning under covariate shift, and transfer learning. Because these sub-domains expect different corruptions, not all algorithms designed for some subcases of biquality learning will work in a more general setting with distribution shifts.

For example, Gold Loss Correction (GLC) (Hendrycks et al., 2018) and Importance Reweighting for Biquality Learning (IRBL) (Nodet et al., 2021b) are algorithms designed to specifically deal with a concept drift between the trusted and the untrusted datasets. GLC, on the one hand, corrects the learning procedure on the untrusted dataset using a noise transition matrix between the trusted and untrusted concepts. IRBL, on the other hand, reweights untrusted instances using an estimate of the ratio of the two concepts. These algorithms are not theoretically designed to handle covariate shift; nevertheless, they could be empirically efficient on this task and serve as references.

Another group of algorithms, such as Kernel Mean Matching (KMM) (Gretton et al., 2009) and Probabilistic Density Ratio Estimation (PDR) (Bickel et al., 2007), deals only with covariate shift between the two datasets. These algorithms seek to reweight untrusted instances such that the distributions of features of the two datasets become equivalent. KMM minimizes the difference of the feature means in a reproducing kernel Hilbert space. PDR learns the classification task of predicting whether an instance is trusted or untrusted and derives instance weights from the predicted probabilities. These algorithms are not designed to handle concept drift and will serve as references too.

However, a recent proposal aims to adapt these algorithms to the biquality framework with distribution shift (Fang et al., 2020). They proposed to use one density ratio estimation algorithm per class, which, when combined, corrects the distribution shift. They also proposed to transform the joint density ratio estimation problem by combining the features and labels of the data into a new feature space that allows for a single density ratio estimation algorithm. These adapted algorithms will also serve as competitors.

Finally, recent approaches such as Learning to Reweight (L2RW) (Ren et al., 2018), or Meta-Weight-Net (MWNet) (Shu et al., 2019) based on deep learning and meta-learning have not been tested in this paper. Indeed, they require a class of algorithms with a differentiable and incremental learning procedure that does not fit most popular families of classifiers, such as gradient boosting trees. They are left to be tested in future works.

Biquality data is not a new setup per se, as previous work on it goes back at least to (Jiang et al., 2018) to the best of our knowledge. Each previous work was carried out in a different sub-domain of weakly supervised learning and thus pursued different goals based on different setups (Hendrycks et al., 2018; Shu et al., 2019; Ren et al., 2018; Zheng et al., 2021; Jiang et al., 2018). These setups used different terms, definitions, hypotheses, and requirements but still sought to solve the same fundamental problem of biquality learning. Only recently have some efforts been made to provide clear and concise definitions of the biquality learning framework (Nodet et al., 2021b). We propose in this paper to extend it to include dataset shifts. This extension is, to the best of our knowledge, a new problem that few have tackled so far (Fang et al., 2020), limiting the existing prior literature.

3 Reweighting for distribution shift

The previous Section introduced the most common algorithms used in machine learning for biquality data. Most of them are based on instance reweighting, and specifically, on estimating the Radon-Nikodym Derivative (RND) (Nikodym, 1930) of \({\mathbb {P}}_T(X,Y)\) with respect to \({\mathbb {P}}_U(X,Y)\).

Theorem 1

(Radon-Nikodym-Lebesgue theorem (Rudin, 1975)) Let \(\mu\) and \(\nu\) be two positive \(\sigma\)-finite measures defined on a measurable space \((X, \mathcal {A})\), with \(\nu\) being absolutely continuous with respect to \(\mu\). Then there exists a unique positive measurable function f defined on X such that:

$$\begin{aligned} \forall A \in \mathcal {A}, \nu (A) = \int _A f d\mu \end{aligned}$$
(2)

Typically, machine learning datasets form measurable spaces, and probability densities are positive finite measures on these measurable spaces. Assuming that \({\mathbb {P}}_T(X,Y)\) is absolutely continuous with respect to \({\mathbb {P}}_U(X,Y)\), the RND exists, is unique, and equals \(\frac{\text {d}{\mathbb {P}}_T(X,Y)}{\text {d}{\mathbb {P}}_U(X,Y)}\).

Equation 3 shows that minimizing the reweighted empirical risk by the RND on the untrusted data is equivalent to minimizing the empirical risk on trusted data.

$$\begin{aligned} R_{(X,Y)\sim T,\,L}(f) &= {\mathbb {E}}_{(X,Y)\sim T}[L(f(X),Y)] \\ &= \int L(f(X),Y)\,\text {d}{\mathbb {P}}_T(X,Y) \\ &= \int \frac{\text {d}{\mathbb {P}}_T(X,Y)}{\text {d}{\mathbb {P}}_U(X,Y)} L(f(X),Y)\,\text {d}{\mathbb {P}}_U(X,Y) \\ &= {\mathbb {E}}_{(X,Y)\sim U}\left[ \frac{{\mathbb {P}}_T(X,Y)}{{\mathbb {P}}_U(X,Y)} L(f(X),Y)\right] \\ &= {\mathbb {E}}_{(X,Y)\sim U}[\beta L(f(X),Y)] \\ &= R_{(X,Y)\sim U,\,\beta L}(f) \end{aligned}$$
(3)
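In practice, once the weights \(\beta\) have been estimated, minimizing the reweighted empirical risk of Eq. 3 reduces to passing them as sample weights to any learner that supports them. The following is a minimal sketch, assuming a Scikit-Learn HGBT classifier (the classifier family used later in the experiments) and precomputed weights; names are illustrative, not the exact implementation used in the paper.

```python
from sklearn.ensemble import HistGradientBoostingClassifier

def fit_reweighted(X_untrusted, y_untrusted, beta, random_state=0):
    """Minimize the beta-reweighted empirical risk of Eq. 3 on the untrusted data."""
    clf = HistGradientBoostingClassifier(random_state=random_state)
    # Each untrusted sample counts proportionally to its estimated density ratio.
    return clf.fit(X_untrusted, y_untrusted, sample_weight=beta)
```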

However, estimating the RND can be a difficult task, especially in the case of distribution shift where the joint distribution ratio \(\beta\) needs to be estimated. Proposals have been made to ease this estimation.

A first proposal was made with IRBL (Nodet et al., 2021b), which focuses on the concept drift between datasets using the Bayes Formula:

$$\begin{aligned} \beta (X,Y) = \frac{{\mathbb {P}}_T(X,Y)}{{\mathbb {P}}_U(X,Y)} = \frac{{\mathbb {P}}_T(Y \mid X){\mathbb {P}}_T(X)}{{\mathbb {P}}_U(Y \mid X){\mathbb {P}}_U(X)} \end{aligned}$$
(4)

Their proposed algorithm is based on the decomposition of the joint density ratio estimation task into three sub-tasks. The first one is to estimate the trusted concept \({\mathbb {P}}_T(Y \mid X)\), which is done by learning a classifier on the trusted dataset. The second is to estimate the untrusted concept \({\mathbb {P}}_U(Y \mid X)\), which is done by learning a classifier on the untrusted dataset. The third task, estimating the density ratio \(\frac{{\mathbb {P}}_T(X)}{{\mathbb {P}}_U(X)}\), was skipped since no covariate shift was introduced in their benchmark; it is nevertheless a well-known machine learning task with established solutions (Sugiyama et al., 2012).

A second proposal has been made in Fang et al. (2020) which focused on the covariate shift between datasets using the Bayes Formula differently:

$$\begin{aligned} \beta (X,Y) = \frac{{\mathbb {P}}_T(X,Y)}{{\mathbb {P}}_U(X,Y)} = \frac{{\mathbb {P}}_T(X \mid Y){\mathbb {P}}_T(Y)}{{\mathbb {P}}_U(X \mid Y){\mathbb {P}}_U(Y)} \end{aligned}$$
(5)

In their proposed algorithm, the joint density ratio estimation task is decomposed into K tasks, where K is the number of classes to predict. For each class, only samples of the given class are selected in both datasets, such that the samples are drawn from the \({\mathbb {P}}(X \mid Y)\) distribution. Then, a density ratio estimation procedure usually employed to estimate \(\frac{{\mathbb {P}}_T(X)}{{\mathbb {P}}_U(X)}\) is learned on these sub-datasets to estimate \(\frac{{\mathbb {P}}_T(X \mid Y)}{{\mathbb {P}}_U(X \mid Y)}\), effectively handling the distribution shift of Eq. 5. As it uses K density ratio algorithms, this generic approach will be named K-DensityRatio (K-DR) in the rest of the paper.

Finally, a last approach is to focus on the density ratio estimation task by finding a deterministic and invertible transformation f as proposed in Fang et al. (2020):

$$\begin{aligned} \beta (X,Y) = \frac{{\mathbb {P}}_T(X,Y)}{{\mathbb {P}}_U(X,Y)} = \frac{{\mathbb {P}}_T(Z)}{{\mathbb {P}}_U(Z)},\quad Z = f(X,Y) \end{aligned}$$
(6)

An example of such transformation (Fang et al., 2020) is the classification loss of a model learned on the biquality data. One density ratio estimation procedure is done on these new features \(\mathcal {Z}\) to directly estimate \(\frac{{\mathbb {P}}_T(Z)}{{\mathbb {P}}_U(Z)}\).

IRBL has experimentally proved to efficiently solve the biquality learning task on tabular data (Nodet et al., 2021b). However, the experiments were conducted with corruptions affecting only the untrusted concept \({\mathbb {P}}(Y \mid X)\) and not the joint distribution \({\mathbb {P}}(X,Y)\). We propose here to adapt IRBL to handle distribution shifts by solving the third task of density ratio estimation with a probabilistic classifier. Moreover, we propose a new version of K-DR using probabilistic classifiers to solve the K density ratio estimation tasks. This proposition is driven by the desire to reuse efficient tricks from IRBL and to rely on a non-parametric approach, in contrast to the original proposal (Fang et al., 2020).

4 First proposed approach: IRBL2

Importance Reweighting for Biquality Learning (IRBL) (Nodet et al., 2021b) is a biquality learning algorithm designed to handle closed-set concept drift. The algorithm uses two probabilistic classifiers to estimate the concepts \({\mathbb {P}}_T(Y \mid X)\) and \({\mathbb {P}}_U(Y \mid X)\), and then uses these classifiers' outputs to estimate the RND between the two data distributions. In the particular case of label noise, especially instance-dependent label noise, it has experimentally been shown to be the best approach on a wide variety of datasets.

We propose to adapt it to handle covariate shift by estimating the ratio \(\frac{{\mathbb {P}}_T(X)}{{\mathbb {P}}_U(X)}\) with a third probabilistic classifier based on discriminative learning (Bickel et al., 2007). This approach defines a new supervised classification task: predicting whether a sample is trusted or untrusted using only its features. If there is covariate shift between the datasets, the classifier should be able to discriminate between the two datasets.

Let’s introduce S as the new target:

$$\begin{aligned} s_i(x_i)= {\left\{ \begin{array}{ll} 0, &{} \text {if}\ x_i \in D_U\\ 1, &{} \text {if}\ x_i \in D_T \end{array}\right. } \end{aligned}$$
(7)

Estimating \({\mathbb {P}}(S \mid X)\) allows us to estimate \(\frac{{\mathbb {P}}_T(X)}{{\mathbb {P}}_U(X)}\) directly without estimating both distributions:

$$\begin{aligned} \begin{aligned} \frac{{\mathbb {P}}_T(X)}{{\mathbb {P}}_U(X)} = \frac{{\mathbb {P}}(X \mid S=1)}{{\mathbb {P}}(X \mid S=0)}&=\frac{{\mathbb {P}}(S=1 \mid X){\mathbb {P}}(X)}{{\mathbb {P}}(S=1)} \times \frac{{\mathbb {P}}(S=0)}{{\mathbb {P}}(S=0 \mid X){\mathbb {P}}(X)} \\&=\frac{{\mathbb {P}}(S=1 \mid X)}{1 - {\mathbb {P}}(S=1 \mid X)} \times \frac{1 - {\mathbb {P}}(S=1)}{{\mathbb {P}}(S=1)} \end{aligned} \end{aligned}$$
(8)
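A minimal sketch of Eq. 8 follows, assuming a Scikit-Learn probabilistic classifier is used as the trusted-versus-untrusted discriminator of Eq. 7; the function name, the choice of HGBT as default classifier, and the clipping constant are illustrative.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

def covariate_shift_ratio(X_trusted, X_untrusted, clf=None):
    """Estimate P_T(X)/P_U(X) on the untrusted samples following Eqs. 7 and 8."""
    if clf is None:
        clf = HistGradientBoostingClassifier()
    X = np.vstack([X_untrusted, X_trusted])
    s = np.concatenate([np.zeros(len(X_untrusted)), np.ones(len(X_trusted))])  # Eq. 7
    clf.fit(X, s)
    # P(S=1 | x) for each untrusted sample, clipped to avoid division by zero.
    p_s1_x = np.clip(clf.predict_proba(X_untrusted)[:, 1], 1e-6, 1 - 1e-6)
    p_s1 = len(X_trusted) / len(X)
    # Odds ratio of Eq. 8.
    return (p_s1_x / (1 - p_s1_x)) * ((1 - p_s1) / p_s1)
```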

Combining Eqs. 4 and 8:

$$\begin{aligned} \frac{{\mathbb {P}}_T(Y \mid X){\mathbb {P}}_T(X)}{{\mathbb {P}}_U(Y \mid X){\mathbb {P}}_U(X)} =\frac{{\mathbb {P}}_T(Y \mid X)}{{\mathbb {P}}_U(Y \mid X)}\times \frac{{\mathbb {P}}(S=1 \mid X)}{1 - {\mathbb {P}}(S=1 \mid X)}\times \frac{1 - {\mathbb {P}}(S=1)}{{\mathbb {P}}(S=1)} \end{aligned}$$
(9)

We propose to estimate Eq. 9 by learning probabilistic classifiers \(f \in \mathcal {F}\) to estimate each of its terms. A probabilistic classifier \(f_T\) is learned on \(D_T\) to estimate \({\mathbb {P}}_T(Y \mid X)\), \(f_U\) is learned on \(D_U\) to estimate \({\mathbb {P}}_U(Y \mid X)\), and \(f_S\) is learned on \(\{(x,s(x)) : x \in D_T \cup D_U\}\) to estimate \({\mathbb {P}}(S \mid X)\), leading to Algorithm 1.

Algorithm 1 (IRBL2)
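Since the original pseudo-code is not reproduced here, the sketch below illustrates one possible implementation of Eq. 9, reusing the covariate_shift_ratio helper sketched above. The calibrated HGBT classifiers mirror the experimental section; all names, the calibration folds, and the clipping constants are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier

def irbl2_weights(X_trusted, y_trusted, X_untrusted, y_untrusted):
    """Sketch of Algorithm 1: estimate the RND of Eq. 9 for each untrusted sample."""
    def make_clf():
        # Isotonic calibration of an HGBT base classifier (see the experimental section).
        return CalibratedClassifierCV(HistGradientBoostingClassifier(),
                                      method="isotonic", cv=3)
    f_T = make_clf().fit(X_trusted, y_trusted)      # estimates P_T(Y | X)
    f_U = make_clf().fit(X_untrusted, y_untrusted)  # estimates P_U(Y | X)
    # Concept drift term: ratio of the two concepts evaluated at the untrusted labels.
    # Assumes every untrusted label also appears in the trusted dataset.
    idx = np.arange(len(y_untrusted))
    p_T = f_T.predict_proba(X_untrusted)[idx, np.searchsorted(f_T.classes_, y_untrusted)]
    p_U = f_U.predict_proba(X_untrusted)[idx, np.searchsorted(f_U.classes_, y_untrusted)]
    concept_ratio = p_T / np.clip(p_U, 1e-6, None)
    # Covariate shift term: Eq. 8, via the discriminative classifier f_S (see sketch above).
    covariate_ratio = covariate_shift_ratio(X_trusted, X_untrusted)
    return concept_ratio * covariate_ratio
```

The resulting weights can then be used to fit the final model on the trusted and untrusted data together, e.g. with a weight of 1 for trusted samples and the estimated RND for untrusted ones.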

5 Second proposed approach: K-PDR

K-DensityRatio (K-DR) (Fang et al., 2020) is an alternative approach to designing a biquality learning algorithm able to handle distribution shift. The focus is on the covariate shift between the two datasets: covariate shift correction is applied once per class, in a class-conditional fashion, in order to deal with distribution shifts.

From Eq. 5, K-DR evaluates the ratio \(\frac{{\mathbb {P}}_T(X \mid Y)}{{\mathbb {P}}_U(X \mid Y)}\) with density ratio estimation algorithms. To do so, it first samples data from the \(X \mid Y\) distribution by selecting only samples from a given class \(k \in [\![1,K]\!]\) in both datasets \(D_T\) and \(D_U\). Then, it uses density ratio estimation algorithms \(e \in \mathcal {E}\) on these sub-datasets to estimate \(\frac{{\mathbb {P}}_T(X \mid Y=k)}{{\mathbb {P}}_U(X \mid Y=k)}\) independently, K times. The class priors \({\mathbb {P}}_T(Y)\) and \({\mathbb {P}}_U(Y)\) are estimated empirically from both training sets. See Algorithm 2.

Algorithm 2 (K-DR)

In Fang et al. (2020), Kernel Mean Matching (KMM) (Huang et al., 2007; Gretton et al., 2009) has been used as the density ratio algorithm e to handle covariate shift. In practice, KMM matches, with quadratic programming (Wright, 1999), the means of both datasets in a feature space induced by a kernel k on the domain \(\mathcal {X}\times \mathcal {X}\):

$$\begin{aligned} \begin{aligned} \min _{\beta _i} \quad&\left\| \frac{1}{ \vert D_U \vert }\sum _{i=0}^{ \vert D_U \vert }\beta _i\Phi (x_i) - \frac{1}{ \vert D_T \vert }\sum _{i=0}^{ \vert D_T \vert }\Phi (x_i) \right\| _\mathcal {H}\\ \text {s.t.} \quad&0\le \beta _i \le B \\&\left| \frac{1}{ \vert D_U \vert }\sum _{i=0}^{ \vert D_U \vert }\beta _i -1 \right| < \epsilon \end{aligned} \end{aligned}$$
(10)

where \(\Phi :\mathcal {X}\rightarrow \mathcal {H}\) denotes the canonical feature map, \(\mathcal {H}\) is the reproducing kernel Hilbert space induced by the kernel k, \(\Vert \cdot \Vert _\mathcal {H}\) is the norm on \(\mathcal {H}\), and B and \(\epsilon\) are regularization and normalization constraints.

As such, KMM is a parametric algorithm based on kernels. We propose instead to use a probabilistic classifier to handle covariate shift, in the same fashion as in Eqs. 7 and 8, to make a non-parametric version of K-DR as shown in Eq. (8.b).

$$\begin{aligned} \begin{aligned} \frac{{\mathbb {P}}_T(X \mid Y){\mathbb {P}}_T(Y)}{{\mathbb {P}}_U(X \mid Y){\mathbb {P}}_U(Y)}&= \frac{{\mathbb {P}}(X \mid Y,S=1)}{{\mathbb {P}}(X \mid Y,S=0)}\times \frac{{\mathbb {P}}(Y \mid S=1)}{{\mathbb {P}}(Y \mid S=0)}\\&= \frac{{\mathbb {P}}(S=1 \mid X,Y){\mathbb {P}}(X,Y)}{{\mathbb {P}}(Y \mid S=1){\mathbb {P}}(S=1)}\times \frac{{\mathbb {P}}(Y \mid S=0){\mathbb {P}}(S=0)}{{\mathbb {P}}(S=0 \mid X,Y){\mathbb {P}}(X,Y)}\\&\quad \times \frac{{\mathbb {P}}(Y \mid S=1)}{{\mathbb {P}}(Y \mid S=0)}\\&= \frac{{\mathbb {P}}(S=1 \mid X,Y)}{1 - {\mathbb {P}}(S=1 \mid X,Y)}\times \frac{1 -{\mathbb {P}}(S=1)}{{\mathbb {P}}(S=1)} \end{aligned}\quad \mathrm{(8.b)} \end{aligned}$$

The main advantage of the non-parametric approach is that it does not require assumptions about the data distribution, which may not be satisfied in many real-world datasets and could lead to poor performance. Moreover, the scalability of K-PDR is better than that of K-KMM in both space and time complexity. K-PDR has K times the complexity of learning the chosen probabilistic classifier, which is \(\mathcal {O}(K\times \vert \mathcal {X} \vert \times \vert D_u^k \vert \log ( \vert D_u^k \vert ))\) for decision trees (Cormen et al., 2022). Meanwhile, K-KMM has a memory complexity of \(\mathcal {O}(\vert D_u^k \vert ^2 + \vert D_u^k \vert \times \vert D_t^k \vert )\) to sequentially build the matrices necessary for the quadratic program, and a worst-case time complexity of \(\mathcal {O}(K\times \vert D_u^k \vert ^3)\) to solve the quadratic program (Ye & Tse, 1989). Finally, the proposed approach is even more flexible than the previous one, as any family of machine learning classifiers can be used instead of kernels. This leads to Algorithm 3.

Algorithm 3 (K-PDR)
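As with Algorithm 1, the pseudo-code is not reproduced here; the sketch below illustrates the non-parametric per-class reweighting of Eq. (8.b), assuming a probabilistic classifier discriminates trusted from untrusted samples within each class. Names, the default classifier, and constants are illustrative.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

def kpdr_weights(X_trusted, y_trusted, X_untrusted, y_untrusted):
    """Sketch of Algorithm 3: per-class discriminative density ratio estimation (Eq. 8.b)."""
    weights = np.zeros(len(y_untrusted))
    for k in np.unique(y_untrusted):
        # Restrict both datasets to class k, i.e. sample from P(X | Y=k).
        T_k = X_trusted[y_trusted == k]
        U_mask = (y_untrusted == k)
        U_k = X_untrusted[U_mask]
        X = np.vstack([U_k, T_k])
        s = np.concatenate([np.zeros(len(U_k)), np.ones(len(T_k))])
        f_S = HistGradientBoostingClassifier().fit(X, s)
        p = np.clip(f_S.predict_proba(U_k)[:, 1], 1e-6, 1 - 1e-6)
        p_s1 = len(T_k) / len(X)
        # Odds ratio of Eq. (8.b) restricted to class k.
        weights[U_mask] = (p / (1 - p)) * ((1 - p_s1) / p_s1)
    return weights
```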

6 Experiments

Benchmarking biquality learning algorithms means evaluating their efficiency and resilience on both dataset shifts and weaknesses of supervision in a joint manner. Introducing these corruptions synthetically in usual public multi-class classification datasets allows a fine-grained and controlled evaluation of these algorithms.

From Eq. 1, distribution shift can be introduced in four ways: through covariate shift, concept drift, class-conditional shift, or prior shift. In particular, modifying concept drift and covariate shift at the same time, or class-conditional shift and prior shift at the same time, leads to particularly complex distribution shifts. Table 1 sums up the hierarchy of distribution shift sources.

Table 1 Hierarchy of Distribution Shift sources

We chose two methods to synthetically generate distribution shifts, one in each sub-tree of distribution shift sources: concept drift and class-conditional shift. We propose two novel ways to generate such shifts in real-world datasets that are detailed in the following Sects. 6.1 and 6.2.

6.1 Concept drift

Concept drift corresponds to changes in the decision boundary \({\mathbb {P}}(Y\mid X)\) of a classifier in some parts of the feature space \(\mathcal {X}\).

To synthetically generate concept drift in real-world datasets, we propose to model the feature space with a decision tree classifier (Breiman, 1984) learned on the original dataset D, such that each leaf of the decision tree corresponds to a patch of the dataset grid. By restricting the minimum number of samples per leaf in the decision tree, here set to 10% of the dataset per class, we can control the granularity of the grid.

Then in each leaf of the decision tree, we are going to change the class distribution, essentially by generating noisy labels \(\tilde{Y}\) from clean labels Y thanks to a transition matrix \(\textbf{T}\), such that \(\forall (i,j) \in [\![1,K]\!]^2, \textbf{T}_{i,j} = {\mathbb {P}}(\tilde{Y}=j\mid Y=i)\). We chose to restrict the transition matrix to a permutation matrix \(\textbf{P}\), meaning that each clean class will be associated in a bijective way with a noisy class, different from the clean class. These permutation matrices \(\textbf{P}\) will be generated randomly once per dataset. For example, for three-class classification, one of these permutation matrices could be the following:

$$\begin{aligned} \textbf{P} = \begin{pmatrix} 0 &{} 1 &{} 0\\ 0 &{} 0 &{} 1\\ 1 &{} 0 &{} 0 \end{pmatrix} \end{aligned}$$

To decide how much concept drift will occur, we rank the leaves of the decision tree by their purity, starting from the purest leaves, and choose enough leaves such that \(r\%\) of the untrusted dataset falls in these leaves. These samples are given a noisy label, and the rest of the samples are left untouched. A code sketch of this corruption is given after the list below.
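The following is a minimal sketch of this corruption procedure, assuming labels are encoded as integers in 0..K-1. The min_samples_leaf setting is a simplification of the 10%-per-class constraint, and all names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def concept_drift_labels(X, y, r, random_state=0):
    """Sketch: permutation label noise applied in the purest leaves of a tree, up to r% of samples."""
    rng = np.random.default_rng(random_state)
    n, K = len(y), len(np.unique(y))
    # Model the feature space; min_samples_leaf approximates the 10%-per-class constraint.
    tree = DecisionTreeClassifier(min_samples_leaf=max(1, int(0.1 * n)),
                                  random_state=random_state).fit(X, y)
    leaves = tree.apply(X)
    # Random permutation of the K classes with no fixed point (each class maps to another one).
    perm = rng.permutation(K)
    while np.any(perm == np.arange(K)):
        perm = rng.permutation(K)
    # Rank leaves by purity (proportion of the majority class), purest first.
    purity = {}
    for leaf in np.unique(leaves):
        counts = np.bincount(y[leaves == leaf], minlength=K)
        purity[leaf] = counts.max() / counts.sum()
    y_noisy, corrupted = y.copy(), 0
    for leaf in sorted(purity, key=purity.get, reverse=True):
        idx = np.where(leaves == leaf)[0]
        y_noisy[idx] = perm[y[idx]]
        corrupted += len(idx)
        if corrupted >= r * n:
            break
    return y_noisy
```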

The benefits of this methodology are twofold:

  • Having all samples assigned the same noisy label given their class in a whole subspace of the feature space will avoid the class overlap usually created with uniform (completely at random) label noise.

  • Choosing to add label noise in the purest leaves of the decision tree first creates new patterns in otherwise completely clean and easy parts of the dataset.

6.2 Class-conditional shift

Class-conditional shift corresponds to changes in the feature distribution \({\mathbb {P}}(X)\), with different changes per class \({\mathbb {P}}(X\mid Y)\).

To synthetically generate class-conditional shift in a real-world dataset, we propose to split the original dataset D into sub-datasets \(D^k\), such that \(\forall k \in [\![1,K]\!], D^k=\{(x,y)\in D\mid y=k\}\), where \(D^k\) corresponds to samples from the same class k out of all the classes K.

For each sub-dataset, we train a K-Means (Lloyd, 1982) clustering to learn a division of the feature space \(\mathcal {X}\) given the class k. The number of clusters is chosen by cross-validation to maximize the average silhouette score (Rousseeuw, 1987).

Then we sub-sample some of these clusters to modify the class-conditional distribution \({\mathbb {P}}(X \mid Y=k)\). We sort the clusters by their size and separate them into two groups of equal size. The group formed by the smallest clusters is subsampled according to a subsampling ratio \(\rho\); the group formed by the biggest clusters is left untouched. For example, if the number of clusters for a class is 4 and \(\rho =10\), the two smallest clusters are subsampled by a ratio of 10, and the two biggest are not subsampled.

A higher \(\rho\) will lead to more data being missing in the untrusted dataset, especially samples in sparse regions of the feature space, represented by small clusters.
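The sketch below illustrates this per-class cluster subsampling, using a simple grid over the number of clusters in place of the cross-validated silhouette selection; names, the grid range, and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def class_conditional_subsample(X, y, rho, n_clusters_grid=range(2, 9), random_state=0):
    """Sketch: per-class cluster subsampling; the smallest clusters are shrunk by a factor rho."""
    rng = np.random.default_rng(random_state)
    keep = []
    for k in np.unique(y):
        idx_k = np.where(y == k)[0]
        # Pick the number of clusters maximising the silhouette score.
        best_labels, best_score = None, -1.0
        for c in n_clusters_grid:
            labels = KMeans(n_clusters=c, n_init=10,
                            random_state=random_state).fit_predict(X[idx_k])
            score = silhouette_score(X[idx_k], labels)
            if score > best_score:
                best_labels, best_score = labels, score
        # The smallest half of the clusters is subsampled by 1/rho, the rest is kept.
        sizes = np.bincount(best_labels)
        small = np.argsort(sizes)[: len(sizes) // 2]
        for c in range(len(sizes)):
            members = idx_k[best_labels == c]
            if c in small:
                members = rng.choice(members, size=max(1, len(members) // rho), replace=False)
            keep.extend(members.tolist())
    return np.sort(np.array(keep))  # indices of the samples kept in the untrusted set
```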

The final size of the untrusted dataset relative to the non-sub-sampled untrusted dataset is reported for each scenario in Table 5 in the Appendices.

6.3 Illustration on a toy dataset

We first illustrate the experimental protocol on a toy dataset with two features to better visualize how the corruption mechanisms act. We chose the two moons dataset, composed of two croissant-shaped classes that are almost linearly separable.

Fig. 1 Two moons dataset with the decision boundary of a Support Vector Machine classifier using a polynomial kernel

Figure 1 shows that a Support Vector classifier (Boser et al., 1992) with a Polynomial kernel of degree 3 can adequately separate the two classes.

Now we can apply our synthetic corruptions to this dataset and see the effect on the decision boundary of the Support Vector classifier.

Fig. 2 Corrupted two moons dataset with the decision boundary of a Support Vector Machine classifier using a polynomial kernel

The proposed synthetic corruptions affect the learned decision boundary of the Support Vector classifier, but each in its own way. The concept drift dramatically changes the decision boundary, and the classifier cannot recognize the two original croissants, as illustrated in Fig. 2. Instead, it emphasizes the new class patterns created by the noisy patches. The class-conditional shift, however, does not inherently change the shape of the decision boundary. We can still see that the classifier somewhat separates the two croissants. However, the margin is much tighter on the yellow croissant. This classifier would fail to classify some data points that would be considered out-of-distribution for the corrupted dataset but completely normal in the original dataset.

Now that we have verified that the synthetic corruptions produce the expected effects on classifiers, we can look at how the previously described state-of-the-art methods correct these corrupted datasets, especially IRBL and PDR, which are expected to work only on concept drift or covariate shift, and our two proposed algorithms, IRBL2 and K-PDR, which should work on distribution shifts. We propose to keep 5% of the original two moons dataset as trusted data that is left untouched and to corrupt the rest of the dataset with both corruptions.

Fig. 3 Trusted and untrusted two moons dataset. Trusted data points are represented with a \(\square\) marker and untrusted data points are represented with a \(\bigcirc\) marker and with a more transparent color. The markers' sizes are proportional to the weights given by the corresponding algorithm

In Fig. 3, we can see the limitations of IRBL and PDR. IRBL can detect samples belonging to the wrong part of the feature space given their label but cannot take into account the uncertainty of patches of the feature space that are not present in the trusted dataset. For example, in the top part of the dataset, there is a zone where no trusted points exist for the purple moon. IRBL does not take that into account and re-weights all these points equally, as long as they have the correct color, purple. Meanwhile, PDR re-weights the samples differently depending on whether they belong to the trusted distribution, but regardless of the sample color.

In the second part of Fig. 3, we can observe that IRBL2 and K-PDR can handle both cases and, thus, distribution shift. Their re-weighting schemes are reasonably similar, and to assess which one is better, we need to benchmark them on a broader range of real-world datasets.

6.4 Datasets

We randomly picked supervised classification datasets, see Table 2, from different sources: UCI (Dua and Graff, 2017), libsvm (Chang & Lin, 2011), the active learning challenge (Guyon, 2010) and openML (Vanschoren et al., 2014). Part of these datasets comes from past challenges in active learning, where high performance with a low number of labeled samples has proved challenging to obtain, which makes leveraging the untrusted dataset necessary. With this choice of datasets, an extensive range of class ratios, numbers of classes, numbers of features, and dataset sizes is covered.

Table 2 Multi-class classification datasets used for the evaluation

Each dataset is first shuffled, then split in a stratified fashion with 80% of the samples used for the training set and 20% for the test set.

Each training set is again split in a stratified fashion into a trusted and an untrusted set, with a ratio of trusted data governed by a parameter p. To find a ratio of trusted data that is coherent across datasets, we compute a learning curve on the training set and choose the ratio of trusted data such that the classifier's performance equals \(p\%\) of that of the classifier learned on the entire training set. The actual ratio of trusted data for each dataset is provided in Table 4 in the Appendices.
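One possible reading of this procedure is sketched below: the trusted ratio is taken as the smallest training fraction whose classifier reaches p% of the full-training performance on a held-out validation split. The validation split, the performance measure, the grid of fractions, and the absence of stratification in the subsets are assumptions, not the exact protocol of the paper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import cohen_kappa_score

def trusted_ratio(clf, X_train, y_train, X_val, y_val, p, grid=np.linspace(0.01, 1.0, 50)):
    """Sketch: smallest training fraction reaching p (a fraction in [0, 1]) of full performance."""
    target = p * cohen_kappa_score(y_val, clone(clf).fit(X_train, y_train).predict(X_val))
    for frac in grid:
        n = max(1, int(frac * len(y_train)))
        score = cohen_kappa_score(y_val, clone(clf).fit(X_train[:n], y_train[:n]).predict(X_val))
        if score >= target:
            return frac
    return 1.0
```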

Finally, we synthetically corrupt the untrusted set with the methods described earlier in this Section.

6.5 Competitors

We compare IRBL2 and K-PDR against multiple state-of-the-art competitors and baselines:

  • K-KMM, the original version of K-DR using KMM as proposed in (Fang et al., 2020);

  • IRBL (Nodet et al., 2021b), which is IRBL2 without covariate shift correction;

  • PDR (Bickel et al., 2007) the covariate-shift only baseline;

  • Trusted-Only baseline, when the model is learned using only the trusted dataset;

  • No-Correction baseline, when the model is learned on both datasets without correction applied.

For every competitor, we use the same probabilistic classifier family, histogram-based gradient boosting (HGBT) trees (Ke et al., 2017) from Scikit-Learn (Pedregosa et al., 2011) with their default hyperparameters, each time an algorithm requires a base classifier. When a well-calibrated classifier is required, the HGBT trees are calibrated with an Isotonic Regression (Zadrozny and Elkan, 2001) from Scikit-Learn using Zadrozny's heuristic (Zadrozny and Elkan, 2002) for multiclass calibration.

For KMM and K-KMM we use the Radial Basis Function (RBF) kernel (Vert et al., 2004) with the default \(\gamma = \frac{1}{\vert \mathcal {X} \vert }\) value from Scikit-Learn. In order to scale KMM to the bigger datasets, we used an ensembling version (Miao et al., 2015) of the original KMM algorithm with a batch size of 100.
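As a concrete illustration, the calibrated base classifier described above might be built as follows with Scikit-Learn; CalibratedClassifierCV performs per-class (one-vs-rest) isotonic calibration for multiclass problems, which follows Zadrozny's heuristic, though the exact configuration (e.g. the number of calibration folds) used in the paper may differ.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier

def calibrated_hgbt(random_state=0):
    """Sketch: HGBT with default hyperparameters wrapped with isotonic calibration."""
    base = HistGradientBoostingClassifier(random_state=random_state)
    # For multiclass problems, calibration is done per class (one-vs-rest) and renormalized.
    return CalibratedClassifierCV(base, method="isotonic", cv=3)
```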

7 Results

We conducted the previously defined experiments on the proposed datasets, with the label noise strength r varying from \(0\%\) to \(50\%\) of the dataset and the maximum cluster subsampling \(\rho\) from 1 to 100. We also tested three different values for the ratio of trusted data p: \(25\%\), \(50\%\), and \(75\%\). The predictive performance measure used in these experiments is Cohen's kappa coefficient \(\kappa\) (Cohen, 1960). It measures the agreement between assignments relative to the agreement between two random assignments and is well suited to evaluating classifier performance in unbalanced settings. Before looking at the whole grid of experiments over r and \(\rho\), we look at the two axes separately to analyze the impact of concept drift and class-conditional shift independently.

The primary metric to analyze to assess the efficiency of biquality learning algorithms is the evolution of the predictive performance \(\kappa\) given the corruption strength (\(\rho\) or r). For example, having a constant predictive performance given the corruption strength means being robust to this corruption.

We propose to illustrate this function by plotting the predictive performance averaged over all datasets for all tested corruption strengths and ratios of trusted data.

Fig. 4 Average Cohen's kappa \(\kappa\) with different corruption strengths (noise ratio r, or cluster imbalance \(\rho\)) and ratio of trusted data p over all datasets. The first row corresponds to experiments without noise (\(r=0\)), and the second row corresponds to experiments without cluster imbalance (\(\rho =0\)). Each column corresponds to a different case of ratio of trusted data (\(p=0.25\), \(p=0.5\), \(p=0.75\))

Figure 4 shows that with very few trusted data, \(p=0.25\), biquality-learning algorithms have trouble differentiating themselves from one another and from the baselines, showing their limitations in low trusted data regimes. However, \(p=0.25\) leads to only a few thousandths of the data being trusted, which we consider to be a particularly pessimistic scenario in practice. With more trusted data, some competitors start to pull apart, especially K-PDR on class-conditional shift and IRBL and IRBL2 on concept drift.

In order to summarize these plots, we compute the area under the curve (AUC) of the previous performance curves, which amounts to an average that accounts for unevenly spaced data points. This AUC is then normalized by the length of the domain over which the curve is integrated.

Table 3 Averaged area under the Cohen’s kappa \(\kappa\) curve for all datasets

Table 3 mostly confirms what could be observed in Fig. 4, but it particularly highlights the marginal improvements brought by IRBL2 over IRBL under both corruptions.

As the previous comparisons are made using averaged predictive measures that might not be comparable across datasets, we then compute average rankings instead, using the critical diagrams presented in Fig. 5 as a more robust way to compare competitors.

The Nemenyi test (Nemenyi, 1962) is used to rank the approaches in terms of AUC of the predictive metric over the corruption strength. The Nemenyi test consists of two successive steps. First, the Friedman test is applied to the AUC of competing approaches to determine whether their overall performance is similar. Second, if not, the post-hoc test is applied to determine groups of approaches whose overall performance is significantly different from that of the other groups.

Fig. 5 Critical diagrams with different ratios of trusted data p over all datasets. The first row corresponds to experiments without noise (\(r=0\)), and the second row corresponds to experiments without cluster imbalance (\(\rho =0\)). Each column corresponds to a different case of ratio of trusted data (\(p=0.25\), \(p=0.5\), \(p=0.75\))

Figure 5 confirms all previous observations. K-PDR seems to be the best approach on class-conditional shift and is still somewhat robust to label noise, especially at medium and high levels of trusted data ratio, as it is able to detach from PDR in terms of average rank in these situations. IRBL and IRBL2 seem to be the best approaches for learning with label noise, even though they struggle with a low ratio of trusted data. However, IRBL2 struggles to improve on IRBL performances on class-conditional shifts.

Finally, to study the combined effect of both axes, we extend the experiments on a grid with varying strength of label noise r and sub-sampling \(\rho\).

Figure 6 presents six graphics, each reporting the Wilcoxon test that evaluates one competitor against another based on the accuracy over all datasets. These graphics form a grid with the horizontal axis representing the label noise strength r and the vertical axis representing the sub-sampling ratio \(\rho\). At each point of the grid \((r, \rho )\), a Wilcoxon signed-rank test (Wilcoxon, 1992) is conducted on two competitors over all datasets to determine whether there is a significant difference in accuracy between them. If the first competitor wins, "\(\circ\)" is placed on the grid at this location; "\(\cdot\)" and "\(\bullet\)" indicate a tie or a loss, respectively.
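A minimal sketch of one cell of this grid is given below, assuming the paired scores of the two competitors over all datasets are available; the significance level and the tie-breaking rule on the sign of the mean difference are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def grid_cell_outcome(scores_a, scores_b, alpha=0.05):
    """Sketch: one (r, rho) cell, comparing two competitors over all datasets."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    # Paired Wilcoxon signed-rank test (assumes at least one non-zero difference).
    _, pvalue = wilcoxon(scores_a, scores_b)
    if pvalue >= alpha:
        return "tie"
    return "win" if (scores_a - scores_b).mean() > 0 else "loss"
```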

Fig. 6 Results of the Wilcoxon signed rank test computed on all datasets. Each figure compares one competitor versus another for a given trusted data ratio. Figures in the same row are the same competitors against different cases of trusted data ratio: \(p=0.25\), \(p=0.5\), \(p=0.75\). In each figure "\(\circ\)", "\(\cdot\)" and "\(\bullet\)" indicate respectively a win, a tie, or a loss of the first competitor compared to the second competitor, the vertical axis is r, and the horizontal axis is \(\rho\)

Figure 6 allows a point-wise analysis between different competitors, but most notably, it uses a robust statistical test to draw statistically significant conclusions. IRBL2 is not able to improve on IRBL in the tested cases, as seen by the large number of ties. K-PDR is a better approach than K-KMM, as the only times K-PDR loses to K-KMM are when K-KMM is itself beaten by the Trusted-Only baseline (see Fig. 7 in the Appendices). Finally, IRBL2 (or IRBL) is a better approach than K-PDR when untrusted data are noisy, as soon as there is enough trusted data for the trusted concept to be learned accurately enough to reweight untrusted samples.

8 Discussions

The previously presented results may seem underwhelming for both proposed methods, so we discuss them in this section.

First, the results showed that K-PDR substantially improves on K-KMM, proving the advantage of using classifiers instead of kernels: classifiers are more powerful tools, as they require fewer assumptions about the data distribution and are better suited for learning from heterogeneous data such as tabular data. It also illustrates a typical result in real-life machine learning applications, where non-parametric methods work better than parametric methods.

Secondly, the results showed no improvement brought by IRBL2 over IRBL on class-conditional shift. Indeed, thanks to well-calibrated classifiers, IRBL might already capture, in the predicted probabilities \({\mathbb {P}}(Y \mid X)\), the uncertainty about \({\mathbb {P}}(X)\) that PDR provides, nullifying the effect of IRBL2. For example, in regions of the feature space that are unusual in the trusted dataset but observed in the untrusted dataset, the value of \({\mathbb {P}}_T(Y \mid X)\) tends towards \(\frac{1}{K}\) to account for the classifier's uncertainty, and IRBL already assigns low weights to such samples without the aid of a dedicated classifier. Meanwhile, K-PDR was clearly able to improve on PDR on label noise corruptions.

Furthermore, the proposed experiments for class-conditional shifts might reproduce only some cases of dataset shifts, especially shifts with out-of-distribution data (Yang et al., 2021). With the proposed design of synthetic class-conditional shift based on per-class cluster sub-sampling, we did not introduce data points in unfamiliar regions of the feature space. In our experiments, all points from the untrusted dataset are in-distribution; we only under-represented some sub-populations. This corruption could lead to models biased against individuals from minority groups. However, it is a more manageable situation than having different groups in the trusted and untrusted datasets. Moreover, the chosen evaluation metric, Cohen's kappa \(\kappa\), might not be able to detect such biases, as it only evaluates predictive performance.

Finally, the calibration efficiency of the HGBT trees in these experiments greatly impacts the efficiency of every tested biquality-learning algorithm. This benchmark has not explored different calibration techniques other than Isotonic Regression.

Obviously, this benchmark is not a definitive answer to the problem as we could have extended the experiments to other corruptions in real-world untrusted datasets, such as data poisoning (Steinhardt et al., 2017) or class imbalance (Japkowicz & Stephen, 2002).

9 Conclusion

In this paper, we have shown the capabilities of the biquality learning framework to design algorithms able to handle closed-set distribution shifts by having access to a trusted and an untrusted dataset at training time. We proposed two biquality learning algorithms, IRBL2 and K-PDR, inspired respectively by the label noise and the covariate shift literature. We reviewed distribution shift sources and their hierarchy and proposed two novel methods to synthetically create concept drift and class-conditional shift in real-world datasets. Through extensive experiments, we benchmarked many competitors from the state of the art. We opened some discussions on the results and concluded that the development of biquality learning algorithms robust to closed-set distribution shifts, despite the presented results, remains an interesting problem for future research.