
Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Qian Shao¹*, Jiangrui Kang²*, Qiyuan Chen¹*, Zepeng Li¹, Hongxia Xu¹,
Yiwen Cao², Jiajuan Liang²†, and Jian Wu¹†
¹Zhejiang University
²BNU-HKBU United International College
*These authors contributed equally to this work.
†Corresponding authors. Emails: jiajuanliang@uic.edu.cn, wujian2000@zju.edu.cn.

Abstract

Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has been under-explored for a long time. To fill in this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion $\alpha$ -Maximum Mean Discrepancy ( $\alpha$ -MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing $\alpha$ -MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.

1 Introduction

Semi-Supervised Learning (SSL) is a popular paradigm which reduces reliance on large amounts of labeled data in many deep learning tasks [37, 34, 55]. Previous SSL research mainly focuses on effectively utilising labelled and unlabeled data. Specifically, labelled data directly supervise model learning, while unlabeled data help learn a desirable model that makes consistent and unambiguous predictions [50]. Besides, we also find that how to select samples for annotation will greatly affect model performance, particularly under extremely low-budget settings (see Section 7.2).

The prevailing sample selection methods in SSL have many shortcomings. For example, random sampling may introduce imbalanced class distributions and inadequate coverage of the overall data distribution, resulting in poor performance. Stratified sampling randomly selects samples within each class, which is impractical in real-world scenarios where the label for each sample is unknown. Existing studies also employ representativeness and diversity strategies to select appropriate samples for annotation. Representativeness [12] ensures that the selected subset is distributed similarly to the entire dataset, and diversity [51] is designed to select informative samples by pushing them apart in feature space. However, focusing on only one aspect presents significant limitations (Figure 1a and b). To address these issues, Xie et al. [54] and Wang et al. [47] employ a combination of the two strategies for sample selection. These methods set a fixed ratio for representativeness and diversity, which restricts their ultimate performance according to our empirical evidence (see Section 7.4). More fundamentally, they lack a theoretical basis to substantiate their effectiveness.

Figure 1: Visualization of selected samples from a dog dataset. The red and grey circles respectively symbolize the selected and unselected samples. a) The selected samples often contain an excessive number of highly similar instances, leading to redundancy; b) The selected samples contain too many edge points, unable to cover the entire dataset; c) The selected samples represent the entire dataset comprehensively and accurately.

We observe that Active Learning (AL) primarily focuses on selecting the right samples for annotation, and numerous studies transfer the sample selection methods of AL into SSL, giving rise to Semi-Supervised Active Learning (SSAL) [48]. However, most of these approaches exhibit several limitations: (1) They require randomly selected samples to begin with, which expends a portion of the labelling budget, making it difficult to work effectively with a very limited budget (e.g., 1% or even lower) [5]; (2) They involve human annotators in iterative cycles of labelling and training, leading to substantial labelling overhead [54]; (3) They are coupled with the model training so that samples for annotation need to be re-selected every time a model is trained [47]. In summary, selecting the appropriate samples for annotation is challenging in SSL.

To address these challenges, we propose a Representative and Diverse Sample Selection approach (RDSS) that requests annotations only once and operates independently of the downstream tasks. Specifically, inspired by the concept of Maximum Mean Discrepancy (MMD) [13], we design a novel criterion named $\alpha$-MMD. It aims to strike a balance between representativeness and diversity via a trade-off parameter $\alpha$ (Figure 1c), for which we find an optimal interval that adapts to different budgets. By using a modified Frank-Wolfe algorithm called Generalized Kernel Herding without Replacement (GKHR), we can obtain an efficient approximate solution to this minimization problem.

We prove that under certain Reproducing Kernel Hilbert Space (RKHS) assumptions, $\alpha$ -MMD effectively bounds the difference between training with a constrained versus an unlimited labelling budget. This implies that our proposed method could significantly enhance the generalization ability of learning with limited labels. We also give a theoretical assessment of GKHR with some supplementary numerical experiments, showing that GKHR performs well in learning with limited labels.

Furthermore, we evaluate our proposed RDSS across several popular SSL frameworks on the datasets CIFAR-10/100 [18], SVHN [29], STL-10 [8] and ImageNet [9]. Extensive experiments show that RDSS outperforms other sample selection methods widely used in SSL, AL or SSAL, especially with a constrained annotation budget. Besides, ablation experimental results demonstrate that RDSS outperforms methods using a fixed ratio.

The main contributions of this article are as follows:

• We propose RDSS, which selects representative and diverse samples for annotation to enhance SSL by minimizing a novel criterion $\alpha$-MMD. Under low-budget settings, we develop a fast and efficient algorithm, GKHR, for optimization.
• We prove that our method benefits the generalizability of the trained model under certain assumptions and rigorously establish an optimal interval for the trade-off parameter $\alpha$ that adapts to different budgets.
• We compare RDSS with sample selection strategies widely used in SSL, AL or SSAL, and the results demonstrate superior sample efficiency compared to these strategies. In addition, we conduct ablation experiments to verify our method's superiority over the fixed-ratio approach.

2 Related Work

Semi-Supervised Learning. Semi-Supervised Learning (SSL) effectively utilizes sparse labeled data and abundant unlabeled data for model training. Consistency Regularization [33, 19, 42], Pseudo-Labeling [20, 53] and their hybrid strategies [37, 58, 34] are commonly used in SSL. Consistency Regularization ensures the model’s output stays stable even when there’s noise or small changes in the input, usually from the data augmentation [52]. Pseudo-labelling integrates high-confidence data pseudo-labels directly into training, adhering to entropy minimization [22]. Moreover, an integrative approach that combines the aforementioned strategies can also achieve substantial results [50, 55]. Even though these approaches have been proven effective, they usually assume that labelled samples are randomly selected from each class (i.e., stratified sampling), which is not practical in real-world scenarios where the label for each sample is unknown.

Active Learning. Active learning (AL) aims to optimize the learning process by selecting the appropriate samples for labelling, reducing reliance on large labelled datasets. There are two different criteria for sample selection: uncertainty and representativeness. Uncertainty sampling selects samples about which the current model is most uncertain. Earlier studies utilized posterior probability [21, 46], entropy [17, 25], and classification margin [44] to estimate uncertainty. Recent research regards uncertainty as training loss [16, 56], influence on model performance [10, 23] or the prediction discrepancies between multiple classifiers [7]. However, uncertainty sampling methods may exhibit performance disparities across different models, leading researchers to focus on representativeness sampling, which aims to align the distribution of the selected subset with that of the entire dataset [35, 36, 26]. Most AL approaches struggle to perform well under extremely low-label settings. This may be because they usually require randomly selected samples to begin with and involve human annotators in iterative cycles of labelling and training, leading to substantial labelling overhead.

Model-Free Subsampling. Subsampling is a statistical approach which selects a subset of size $m$ as a surrogate for the full dataset of size $n\gg m$. Model-free subsampling is preferred in data-driven modelling tasks, as it does not depend on model assumptions. There are mainly two kinds of popular model-free subsampling methods. The first kind minimizes a statistical discrepancy, such as the Wasserstein distance [12], energy distance [27], uniform design [59], maximum mean discrepancy [6] or generalized empirical $F$-discrepancy [60], which forces the distribution of the subset to be similar to that of the full data; in other words, it selects representative subsamples. The second kind tends to select a diverse subset containing as many informative samples as possible [51]. The above-mentioned methodologies focus exclusively on either representativeness or diversity, which makes them difficult to apply effectively to SSL.

3 Problem Setup

Let $\mathcal{X}$ be the unlabeled data space, $\mathcal{Y}$ be the label space, $\mathbf{X}_{n}=\{\mathbf{x}_{i}\}_{i\in[n]}\subset\mathcal{X}$ be the full unlabeled dataset and $\mathcal{I}_{m}=\{i_{1},i_{2},\cdots,i_{m}\}\subset[n]$ $(m<n)$ be an index set contained in $[n]$. Our goal is to find an index set $\mathcal{I}^{*}_{m}=\{i^{*}_{1},i^{*}_{2},\cdots,i^{*}_{m}\}\subset[n]$ $(m<n)$ such that the selected set of samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\mathbf{x}_{i^{*}_{2}},\cdots,\mathbf{x}_{i^{*}_{m}}\}$ is the most informative. After that, we can get access to the true labels of the selected samples and use the set of labelled data $S=\{(\mathbf{x}_{i},y_{i})\}_{i\in\mathcal{I}^{*}_{m}}$ and the rest of the unlabeled data to train a deep learning model.

Following the methodology of previous works, we use representativeness and diversity as criteria for evaluating the informativeness of selected samples. Representativeness ensures the selected samples distribute similarly to the full unlabeled dataset. Diversity is proposed to prevent an excessive concentration of selected samples in high-density areas of the full unlabeled dataset. Furthermore, the cluster assumption in SSL suggests that the data tend to form discrete clusters, in which boundary points are likely to be located in the low-density area. Therefore, under this assumption, selected samples with diversity contain more boundary points than the non-diversified ones, which is desired in training classifiers.

As a result, our goal can be formulated by solving the following problem:

$$\max_{\mathcal{I}_{m}\subset[n]}\text{Rep}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})+\lambda\,\text{Div}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n}), \tag{1}$$

where $\text{Rep}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ and $\text{Div}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ quantify the representativeness and diversity of the selected samples, respectively, and $\lambda$ is a hyperparameter that balances the trade-off between representativeness and diversity.

In addition, we adopt two fundamental settings that benefit the implementation of the framework: (1) Low-budget learning. The budget for many real-world tasks which require sample selection procedures is relatively low compared to the size of the unlabeled data. Therefore, we set $m/n\leq 0.2$ by default in what follows, including the analysis of the sampling algorithm and the experiments; (2) Sampling without replacement. Compared with sampling with replacement, sampling without replacement offers several benefits which better match our tasks, including bias and variance reduction, increased precision and enhanced representativeness [24, 43].

4 Representative and Diverse Sample Selection

The Representative and Diverse Sample Selection (RDSS) framework consists of two steps: (1) Quantification. We quantify the representativeness and diversity of selected samples by proposing a novel concept called $\alpha$ -MMD (6), where $\lambda$ is replaced by $\alpha$ as the trade-off hyperparameter; (2) Optimization. We optimize $\alpha$ -MMD by GKHR algorithm to obtain the optimally selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}$ .

4.1 Quantification of Diversity and Representativeness

In classical statistics and machine learning problems, the inner product of data points $\mathbf{x},\mathbf{y}\in\mathcal{X}$, denoted by $\langle\mathbf{x},\mathbf{y}\rangle$, is employed as a similarity measure between $\mathbf{x}$ and $\mathbf{y}$. However, the application of linear functions can be very restrictive in real-world problems. In contrast, kernel methods use kernel functions $k(\mathbf{x},\mathbf{y})$, including Gaussian kernels (RBF), Laplacian kernels and polynomial kernels, as non-linear similarity measures between $\mathbf{x}$ and $\mathbf{y}$, which are actually inner products of the projections of $\mathbf{x}$ and $\mathbf{y}$ into some high-dimensional feature space [28].

Let $k(\cdot,\cdot)$ be a kernel function on $\mathcal{X}\times\mathcal{X}$. We employ $k(\cdot,\cdot)$ to measure the similarity between any two points, and the average similarity, denoted by

$$S_{k}(\mathbf{X}_{\mathcal{I}_{m}})=\frac{1}{m^{2}}\sum_{i\in\mathcal{I}_{m}}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right), \tag{2}$$

to measure the similarity among the selected samples. Clearly, $S_{k}(\mathbf{X}_{\mathcal{I}_{m}})$ can evaluate the diversity of $\mathbf{X}_{\mathcal{I}_{m}}$ since larger similarity implies smaller diversity.

As a statistical discrepancy which measures the distance between distributions, the maximum mean discrepancy (MMD) is introduced here to quantify the representativeness of $\mathbf{X}_{\mathcal{I}_{m}}$ to $\mathbf{X}_{n}$ . Proposed by Gretton et al. [13], MMD is formally defined below:

Definition 4.1 (Maximum Mean Discrepancy).

Let $P,Q$ be two Borel probability measures on $\mathcal{X}$ . Suppose $f$ is sampled from the unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with its reproducing kernel $k(\cdot,\cdot)$ , i.e., $\|f\|_{\mathcal{H}}\leq 1$ , then the MMD between $P$ and $Q$ is defined by

$$\operatorname{MMD}_{k}^{2}(P,Q):=\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\int f\,dP-\int f\,dQ\right)^{2}=\mathbb{E}\left[k\left(X,X^{\prime}\right)+k\left(Y,Y^{\prime}\right)-2k(X,Y)\right], \tag{3}$$

where $X,X^{\prime}\sim P$ and $Y,Y^{\prime}\sim Q$ are independent copies.

We can next derive the empirical version for MMD that is able to measure the representativeness of $\mathbf{X}_{\mathcal{I}_{m}}=\{\mathbf{x}_{i}\}_{i\in\mathcal{I}_{m}}$ relative to $\mathbf{X}_{n}=\{\mathbf{x}_{i}\}_{i=1}^{n}$ by replacing $P,Q$ with the empirical distribution constructed by $\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n}$ in (3):

$$\operatorname{MMD}_{k}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n}):=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{m^{2}}\sum_{i\in\mathcal{I}_{m}}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\frac{2}{mn}\sum_{i=1}^{n}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right). \tag{4}$$
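As a sanity check on (4), the following is a minimal NumPy sketch, not the authors' implementation. The helper names `gaussian_gram` and `empirical_mmd2`, the synthetic 2-D data and the bandwidth $\sigma=1$ are assumptions made purely for illustration. It evaluates the three terms of (4) from a precomputed Gram matrix and contrasts a uniformly drawn subset with one clumped around the sample mean.

```python
import numpy as np


def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2) (assumed kernel form)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / sigma**2)


def empirical_mmd2(K, idx):
    """Squared empirical MMD between the subset `idx` and the full set, as in (4)."""
    idx = np.asarray(idx)
    n, m = K.shape[0], len(idx)
    full_term = K.sum() / n**2                       # (1/n^2) sum_i sum_j k
    subset_term = K[np.ix_(idx, idx)].sum() / m**2   # (1/m^2) sum over the subset
    cross_term = 2.0 * K[:, idx].sum() / (m * n)     # (2/mn) sum_i sum_{j in subset} k
    return full_term + subset_term - cross_term


# Hypothetical toy data: 200 points from a 2-D standard normal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
K = gaussian_gram(X, sigma=1.0)

spread = rng.choice(200, size=20, replace=False)  # uniformly drawn subset
clumped = np.argsort(np.linalg.norm(X - X.mean(axis=0), axis=1))[:20]  # 20 points nearest the mean

mmd_spread = empirical_mmd2(K, spread)
mmd_clumped = empirical_mmd2(K, clumped)
mmd_full = empirical_mmd2(K, np.arange(200))  # the full set compared with itself
```

In this toy setting the clumped subset should yield a visibly larger value than the uniformly drawn one, and comparing the full set with itself gives zero up to floating-point error, which matches the role of MMD as a representativeness measure.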

Optimization objective. Setting $\text{Rep}(\cdot,\cdot)=-\operatorname{MMD}_{k}^{2}(\cdot,\cdot)$ and $\text{Div}(\cdot)=-S_{k}(\cdot)$ in (1), where $k$ is a suitable kernel function, our optimization objective becomes

$$\min_{\mathcal{I}_{m}\subset[n]}\operatorname{MMD}_{k}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})+\lambda S_{k}(\mathbf{X}_{\mathcal{I}_{m}}). \tag{5}$$

Setting $\lambda=\frac{1-\alpha}{\alpha m}$ and noting that $\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)$ is a constant, the objective function in (5) can be rewritten as

$$\begin{aligned}
&\alpha\operatorname{MMD}_{k}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})+\frac{1-\alpha}{m}S_{k}(\mathbf{X}_{\mathcal{I}_{m}})+\frac{\alpha(\alpha-1)}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)\\
={}&\frac{\alpha^{2}}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{m^{2}}\sum_{i\in\mathcal{I}_{m}}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\frac{2\alpha}{mn}\sum_{i=1}^{n}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)\\
={}&\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i\in\mathcal{I}_{m}}f(\mathbf{x}_{i})-\frac{\alpha}{n}\sum_{j=1}^{n}f(\mathbf{x}_{j})\right)^{2},
\end{aligned} \tag{6}$$

which defines a new concept called $\alpha$ -MMD, denoted by $\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ . This new concept distinguishes our method from those existing methods, which is essential for developing the sampling algorithms and theoretical analysis. Note that $\alpha$ -MMD degenerates to classical MMD when $\alpha=1$ and degenerates to average similarity when $\alpha=0$ . As $\alpha$ decreases, $\lambda$ increases, thereby encouraging the diversity for sample selection.
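The two limiting cases noted above can be checked numerically. The sketch below is illustrative only: `gaussian_gram`, `alpha_mmd2`, the random data and the bandwidth are assumptions, not part of the original work. It evaluates the kernel-sum form of $\alpha$-MMD squared, compares it with (4) at $\alpha=1$ and with (2) at $\alpha=0$, and, using a linear kernel purely as an algebraic check (a linear kernel is not characteristic), compares it with the squared norm of the difference between the subset mean and $\alpha$ times the full mean, which is what the supremum form reduces to in that case.

```python
import numpy as np


def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2) (assumed kernel form)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / sigma**2)


def alpha_mmd2(K, idx, alpha):
    """Squared alpha-MMD written as kernel sums over the Gram matrix K."""
    idx = np.asarray(idx)
    n, m = K.shape[0], len(idx)
    full_term = alpha**2 * K.sum() / n**2
    subset_term = K[np.ix_(idx, idx)].sum() / m**2
    cross_term = 2.0 * alpha * K[:, idx].sum() / (m * n)
    return full_term + subset_term - cross_term


# Hypothetical toy data and subset.
rng = np.random.default_rng(0)
n, m = 100, 10
X = rng.normal(size=(n, 3))
K = gaussian_gram(X, sigma=1.5)
idx = rng.choice(n, size=m, replace=False)

# Reference values computed directly from (4) and (2).
classical_mmd2 = (K.sum() / n**2 + K[np.ix_(idx, idx)].sum() / m**2
                  - 2.0 * K[:, idx].sum() / (m * n))
avg_similarity = K[np.ix_(idx, idx)].sum() / m**2

# Algebraic check with a linear kernel k(x, y) = <x, y>.
alpha = 0.7
K_lin = X @ X.T
direct = np.sum((X[idx].mean(axis=0) - alpha * X.mean(axis=0)) ** 2)
via_kernel = alpha_mmd2(K_lin, idx, alpha)
```

Here `alpha_mmd2(K, idx, 1.0)` should agree with `classical_mmd2`, `alpha_mmd2(K, idx, 0.0)` with `avg_similarity`, and `via_kernel` with `direct`.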

Remark 1. In the following context, all the kernels are assumed to be characteristic and positive definite if not specified. The following illustrates the advantages of the two properties.

Characteristic kernels. The MMD is generally a pseudo-metric on the space of all Borel probability distributions, implying that the MMD between two different distributions can be zero. Nevertheless, MMD becomes a proper metric when $k$ is a characteristic kernel, i.e., when the mapping $P\mapsto\int_{\mathcal{X}}k(\cdot,\mathbf{x})dP$ is injective over Borel probability distributions $P$ on $\mathcal{X}$ [28]. Therefore, MMD induced by characteristic kernels can be more appropriate for measuring representativeness.

Positive definite kernels. Aronszajn [1] showed that every positive definite kernel $k(\cdot,\cdot)$, i.e., a kernel whose Gram matrix is always positive definite and symmetric, uniquely determines an RKHS $\mathcal{H}$ and vice versa. This property is not only important for evaluating the properties of MMD [40] but also required for optimizing MMD [31] with the Frank-Wolfe algorithm.

4.2 Sampling Algorithm

In previous research [35, 26, 47], sample selection is usually modelled as a non-convex combinatorial optimization problem. In contrast, following the idea of [3], we regard $\min_{\mathcal{I}_{m}\subset[n]}\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ as a convex optimization problem by exploiting the convexity of $\alpha$-MMD, and then solve it by a fast iterative minimization procedure derived from the Frank-Wolfe algorithm (see Appendix A for derivation details):

$$\mathbf{x}_{i^{*}_{p+1}}\in\mathop{\arg\min}_{i\in[n]}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i}),\quad\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p+1}\},\quad\mathcal{I}^{*}_{0}=\emptyset, \tag{7}$$

where $f_{\mathcal{I}_{p}}(\mathbf{x}_{i})=\sum_{j\in\mathcal{I}_{p}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\alpha p\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l})/n$. As an extension of kernel herding [6], its corresponding algorithm (see Algorithm 2) is called Generalized Kernel Herding (GKH). Note that $f_{\mathcal{I}_{p}}(\mathbf{x}_{i})$ is iteratively updated in Algorithm 2, which can save a lot of running time. However, GKH can select repeated samples, which contradicts the setting of sampling without replacement. To address this issue, we propose a modified iterating formula based on (7):

$$\mathbf{x}_{i^{*}_{p+1}}\in\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i}),\quad\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p+1}\},\quad\mathcal{I}^{*}_{0}=\emptyset, \tag{8}$$

which admits no repetition in the selected samples. Its corresponding algorithm (see Algorithm 1) is therefore named Generalized Kernel Herding without Replacement (GKHR) and is employed as the sampling algorithm for RDSS.

Algorithm 1 Generalized Kernel Herding without Replacement
Input: Data set $\mathbf{X}_{n}=\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\}\subset\mathcal{X}$; the number of selected samples $m<n$; a positive definite, characteristic and radial kernel $k(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$; trade-off parameter $\alpha\leq 1$.
Output: Selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\cdots,\mathbf{x}_{i^{*}_{m}}\}$.
1: For each $\mathbf{x}_{i}\in\mathbf{X}_{n}$ calculate $\mu(\mathbf{x}_{i}):=\sum_{j=1}^{n}k(\mathbf{x}_{j},\mathbf{x}_{i})/n$.
2: Set $\beta_{1}=1$, $S_{0}=0$, $\mathcal{I}=\emptyset$.
3: for $p\in\{1,\cdots,m\}$ do
4:   $i^{*}_{p}\in\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}S_{p-1}(\mathbf{x}_{i})-\alpha\mu(\mathbf{x}_{i})$
5:   For all $i\in[n]\backslash\mathcal{I}^{*}_{p}$, update $S_{p}(\mathbf{x}_{i})=(1-\beta_{p})S_{p-1}(\mathbf{x}_{i})+\beta_{p}k(\mathbf{x}_{i^{*}_{p}},\mathbf{x}_{i})$
6:   $\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p}\}$, $p\leftarrow p+1$, set $\beta_{p}=1/p$
7: end for

Computational complexity. Apart from the time cost of evaluating the kernel functions, the computational complexity of GKHR is $O(mn)$, since in each iteration the steps in lines 4 and 5 of Algorithm 1 each require $O(n)$ computations. Note that GKH has the same order of computational complexity as GKHR.
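The selection loop of GKHR can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' released code: the function names `gaussian_gram` and `gkhr`, the use of a fully precomputed Gram matrix (which costs $O(n^{2})$ memory and is only practical for small $n$), the toy data and the bandwidth are all hypothetical choices. The loop follows the running-average update of $S_{p}$ and the masked arg-min described in Algorithm 1.

```python
import numpy as np


def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2) (assumed kernel form)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / sigma**2)


def gkhr(K, m, alpha):
    """Greedy selection of m distinct indices from a precomputed Gram matrix K."""
    n = K.shape[0]
    mu = K.mean(axis=0)                 # mu(x_i) = sum_j k(x_j, x_i) / n
    S = np.zeros(n)                     # S_0 = 0
    available = np.ones(n, dtype=bool)  # enforces sampling without replacement
    selected = []
    for p in range(1, m + 1):
        # Arg-min of S_{p-1}(x_i) - alpha * mu(x_i) over the not-yet-selected indices.
        score = np.where(available, S - alpha * mu, np.inf)
        i_star = int(np.argmin(score))
        selected.append(i_star)
        available[i_star] = False
        # Running average with beta_p = 1/p: S_p = (1 - beta_p) S_{p-1} + beta_p k(x_{i*}, .)
        beta = 1.0 / p
        S = (1.0 - beta) * S + beta * K[i_star]
    return selected


# Hypothetical toy run: 300 points in 2-D, budget m = 30.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
K = gaussian_gram(X, sigma=1.0)
m = 30
alpha = 1.0 - 1.0 / np.sqrt(m)
selected = gkhr(K, m, alpha)
```

Each iteration touches every candidate once for the arg-min and once for the update, which is the $O(n)$ per-iteration cost mentioned above. Because $S_{0}=0$, the first selected index is simply the point with the largest mean kernel value $\mu$ whenever $\alpha>0$.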

5 Theoretical Analysis

5.1 Generalization Bounds

Recall the core-set approach in [35], i.e., for any $h\in\mathcal{H}$ ,

$$R(h)\leq\widehat{R}_{S}(h)+|R(h)-\widehat{R}_{T}(h)|+|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|,$$

where $T$ is the full labeled dataset, $S\subset T$ is the core set, $R(h)$ is the expected risk of $h$, and $\widehat{R}_{T}(h),\widehat{R}_{S}(h)$ are the empirical risks of $h$ on $T$ and $S$, respectively. The first term $\widehat{R}_{S}(h)$ is unknown before we label the selected samples, and the second term $|R(h)-\widehat{R}_{T}(h)|$ can be upper bounded by the so-called generalization bounds [2], which do not depend on the choice of core set. Therefore, to control the upper bound of $R(h)$, we only need to analyse the upper bound of the third term $|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|$, called the core-set loss, which requires several mild assumptions.

Let $\mathcal{H}_{1}=\{h|h:\mathcal{X}\rightarrow\mathcal{Y}\}$ be a hypothesis set from which we are going to select a predictor, and suppose that the labelled data $T=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples of a random vector $(X,Y)$ defined on $\mathcal{X}\times\mathcal{Y}$. We first assume that $\mathcal{H}_{1}$ is an RKHS, which is a mild assumption in machine learning theory [2, 4].

Assumption 5.1.

$\mathcal{H}_{1}$ is an RKHS associated with bounded positive definite kernel $k_{1}$ where the norm of any $h\in\mathcal{H}_{1}$ is bounded by $K_{h}$ .

We further make RKHS assumptions on the functional space of $\mathbb{E}(Y|X)$ and $\operatorname{Var}(Y|X)$ that are fundamental in the field of conditional distribution embedding [38, 40].

Assumption 5.2.

There is an RKHS $\mathcal{H}_{2}$ associated with bounded positive definite kernel $k_{2}$ such that $\mathbb{E}(Y|X)\in\mathcal{H}_{2}$ and the norm of any $\mathbb{E}(Y|X)$ is bounded by $K_{m}$ .

Assumption 5.3.

There is an RKHS $\mathcal{H}_{3}$ associated with bounded positive definite kernel $k_{3}$ such that $\operatorname{Var}(Y|X)\in\mathcal{H}_{3}$ and the norm of any $\operatorname{Var}(Y|X)$ is bounded by $K_{s}$ .

We next give an $\alpha$-MMD-type upper bound for the core-set loss in the following theorem:

Theorem 5.4.

Take $k=k_{1}^{2}+k_{1}k_{2}+k_{3}$. Then, under Assumptions 5.1-5.3, for any selected samples $S\subset T$, there exists a positive constant $K_{c}$ such that the following inequality holds:

$$|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|\leq K_{c}\left(\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{S},\mathbf{X}_{T})+(1-\alpha)\sqrt{K}\right)^{2},$$

where $0\leq\alpha\leq 1$ , $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})=K$ and $\mathbf{X}_{S},\mathbf{X}_{T}$ are projections of $S,T$ on $\mathcal{X}$ .

Therefore, minimizing $\alpha$ -MMD can optimize the generalization bound for $R(h)$ and benefit the generalizability of the trained model (predictor).

5.2 Finite-Sample-Error-Bound for GKHR

The concept of convergence does not apply to analyzing GKHR. With $n$ fixed, GKHR iterates at most $n$ times and then returns $\mathbf{X}_{\mathcal{I}^{*}_{n}}=\mathbf{X}_{n}$. Consequently, we analyze the performance of GKHR through its finite-sample error bound. Before that, we make an assumption on the mean of $f_{\mathcal{I}^{*}_{p}}$ over the full unlabeled dataset.

Assumption 5.5.

For any $\mathcal{I}^{*}_{p}$ returned by GKHR, $1\leq p\leq m-1$, there exist $p+1$ elements $\{\mathbf{x}_{j_{l}}\}_{l=1}^{p+1}$ in $\mathbf{X}_{n}$ such that

$$f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{1}})\leq\cdots\leq f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{p+1}})\leq\frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}.$$

When $m$ is not relatively small, this assumption is rather unrealistic. Nevertheless, under our low-budget setting, especially when $m\ll n$, the assumption becomes an extension of the principle that "the minimum is never larger than the mean", which remains plausible. We can then show that the optimization error of GKHR decays at a rate upper bounded by $O(\log m/m)$:

Theorem 5.6.

Let $\mathbf{X}_{\mathcal{I}^{*}_{m}}$ be the samples selected by GKHR. Under Assumption 5.5, it holds that

$$\operatorname{MMD}^{2}_{k,\alpha}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq C^{2}_{\alpha}+B\frac{2+\log m}{m+1}, \tag{9}$$

where $B=2K$ , $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})=K$ , $C^{2}_{\alpha}=(1-\alpha)^{2}\overline{K}$ where $\overline{K}$ is defined in Lemma B.6.

6 Choice of Kernel and Hyperparameter Tuning

In this section, we make some suggestions for choosing the kernel and tuning the hyperparameter $\alpha$ .

Choice of kernel. Recall from Remark 1 in Section 4.1 that we only consider characteristic and positive definite kernels in RDSS. Since Gaussian kernels are the most commonly used kernels in machine learning and statistics [2, 14], we adopt the Gaussian kernel, which is defined by $k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/\sigma^{2})$. The bandwidth parameter $\sigma$ is set to the median distance between samples in the aggregate dataset [14], i.e., $\sigma=\operatorname{Median}(\{\|\mathbf{x}-\mathbf{y}\|_{2}|\mathbf{x},\mathbf{y}\in\mathbf{X}_{n}\})$, since the median is robust and also compromises between extreme cases.
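The median heuristic is straightforward to prototype. The sketch below is an assumption-laden illustration: it takes the Gaussian kernel in the form $\exp(-\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/\sigma^{2})$, takes the median over distinct pairs only (excluding the zero self-distances, which is one possible reading of the definition), and builds the full pairwise distance matrix, which is only feasible for small $n$. The helper names are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix D[i, j] = ||x_i - x_j||_2 (O(n^2 d) memory)."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs**2).sum(axis=-1))


def median_bandwidth(X):
    """Median of the distances over distinct pairs i < j (assumed convention)."""
    D = pairwise_distances(X)
    upper = np.triu_indices(len(X), k=1)
    return float(np.median(D[upper]))


def gaussian_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
    D = pairwise_distances(X)
    return np.exp(-(D**2) / sigma**2)


# Three collinear points with pairwise distances 5, 10 and 5.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
sigma = median_bandwidth(X)   # median of {5, 10, 5}
K = gaussian_gram(X, sigma)
```

For large datasets the median is commonly estimated on a random subsample of pairs instead of the full distance matrix; that variation is not covered by the text above and would be a separate design choice.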

Tuning trade-off hyperparameter $\alpha$ . According to Theorem 5.6 and Lemma B.3, by straightforward deduction we have

$$\operatorname{MMD}_{k}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq C_{\alpha}+\mathcal{O}\left(\sqrt{\frac{\log m}{m}}\right)+(1-\alpha)\sqrt{K}$$

to upper bound the MMD between the selected samples and the full dataset under a low-budget setting. We can simply set $\alpha\in[1-\frac{1}{\sqrt{m}},1)$: then $1-\alpha\leq 1/\sqrt{m}$, so both $C_{\alpha}=(1-\alpha)\sqrt{\overline{K}}$ and $(1-\alpha)\sqrt{K}$ are $\mathcal{O}(1/\sqrt{m})$, and the upper bound of the MMD is of no larger order of magnitude than that of the $\alpha$-MMD.
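As a small worked example of this rule, the snippet below evaluates the lower endpoint $1-1/\sqrt{m}$ of the suggested interval for a few budgets. The function name `tradeoff_alpha` is hypothetical, and the budgets are taken from the CIFAR-10 settings reported in Section 7.1.

```python
import math


def tradeoff_alpha(m):
    """Lower endpoint of the suggested interval [1 - 1/sqrt(m), 1) for a budget m."""
    if m < 1:
        raise ValueError("the budget m must be a positive integer")
    return 1.0 - 1.0 / math.sqrt(m)


budgets = [40, 250, 4000]  # CIFAR-10 budgets from Section 7.1
alphas = {m: tradeoff_alpha(m) for m in budgets}

for m, a in alphas.items():
    # (1 - alpha) equals 1/sqrt(m), so the extra terms in the bound shrink with the budget.
    print(f"m={m:5d}  alpha={a:.4f}  1-alpha={1.0 - a:.4f}")
```

Larger budgets push $\alpha$ towards 1, so the criterion approaches the classical MMD, while very small budgets place relatively more weight on diversity.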

7 Experiments

In this section, we first explain the implementation details of our method RDSS in Section 7.1. Next, we compare RDSS with other sampling methods by integrating them into two state-of-the-art (SOTA) SSL approaches (FlexMatch [58] and FreeMatch [50]) on five datasets (CIFAR-10/100, SVHN, STL-10 and ImageNet-1k) in Section 7.2. The details of the datasets, the visualization results and the computational complexity of different sampling methods are shown in Appendix D.2, D.3, and D.4, respectively. We also compare against various AL/SSAL approaches in Section 7.3. Lastly, we make quantitative analyses of the trade-off parameter $\alpha$ in Section 7.4.

7.1 Implementation Details of Our Method

First, we leverage the pre-trained image feature extraction capabilities of CLIP [32], a vision transformer architecture, to extract features. Subsequently, the [CLS] token features produced by the model's final output are employed for sample selection. During the sample selection phase, the Gaussian kernel is chosen to compute the similarity of samples in an infinite-dimensional feature space. The value of $\sigma$ for the Gaussian kernel is set as explained in Section 6. To ensure diversity in the sampled data, we introduce a penalty factor given by $\alpha=1-\frac{1}{\sqrt{m}}$, where $m$ denotes the number of selected samples. Concretely, we set $m\in\{40,250,4000\}$ for CIFAR-10, $m\in\{400,2500,10000\}$ for CIFAR-100, $m\in\{250,1000\}$ for SVHN, $m\in\{40,250\}$ for STL-10 and $m=100000$ for ImageNet. Next, the selected samples are used for two SSL approaches, which are trained and evaluated on the datasets using the codebase Unified SSL Benchmark (USB) [49]. The optimizer for all experiments is standard stochastic gradient descent (SGD) with a momentum of $0.9$ [41]. The initial learning rate is $0.03$ with a learning rate decay of $0.0005$. We use ResNet-50 [15] for the ImageNet experiment and Wide ResNet-28-2 [57] for other datasets. Finally, we evaluate the performance with the Top-1 classification accuracy metric on the test set. Experiments are run on 8 NVIDIA Tesla A100 (40 GB) GPUs and 2 Intel 6248R 24-core processors. We average our results over five independent runs.

Table 1: Comparison with other sampling methods. Due to stratified sampling limitations, the results are marked in grey. Top and second-best performances are bolded and underlined, respectively, excluding stratified sampling. Metrics represent mean accuracy and standard deviation over five independent runs.

Dataset	CIFAR-10			CIFAR-100			SVHN		STL-10
Budget	40	250	4000	400	2500	10000	250	1000	40	250
Applied to FlexMatch [58]
Stratified	91.45 $\pm$ 3.41	95.10 $\pm$ 0.25	95.63 $\pm$ 0.24	50.23 $\pm$ 0.41	67.38 $\pm$ 0.45	73.61 $\pm$ 0.43	89.60 $\pm$ 1.86	93.66 $\pm$ 0.49	75.33 $\pm$ 3.74	92.29 $\pm$ 0.64
Random	87.30 $\pm$ 4.61	93.95 $\pm$ 0.91	95.17 $\pm$ 0.59	45.58 $\pm$ 0.97	66.48 $\pm$ 0.98	72.61 $\pm$ 0.83	87.67 $\pm$ 1.16	94.06 $\pm$ 1.14	65.81 $\pm$ 1.21	90.70 $\pm$ 0.79
$k$ -Means	81.23 $\pm$ 8.71	94.59 $\pm$ 0.51	95.09 $\pm$ 0.65	41.60 $\pm$ 1.24	65.99 $\pm$ 0.57	71.53 $\pm$ 0.42	90.28 $\pm$ 0.69	93.82 $\pm$ 1.04	55.43 $\pm$ 0.39	90.64 $\pm$ 1.05
USL [47]	91.73 $\pm$ 0.13	94.89 $\pm$ 0.20	95.43 $\pm$ 0.15	46.89 $\pm$ 0.46	66.75 $\pm$ 0.37	72.53 $\pm$ 0.32	90.03 $\pm$ 0.63	93.10 $\pm$ 0.78	75.65 $\pm$ 0.60	90.77 $\pm$ 0.36
ActiveFT [54]	70.87 $\pm$ 4.14	93.85 $\pm$ 1.37	95.31 $\pm$ 0.75	25.69 $\pm$ 0.64	57.19 $\pm$ 2.06	70.96 $\pm$ 0.75	89.32 $\pm$ 1.87	92.53 $\pm$ 0.43	55.57 $\pm$ 1.42	87.28 $\pm$ 1.19
RDSS (Ours)	94.69 $\pm$ 0.28	95.21 $\pm$ 0.47	95.71 $\pm$ 0.10	48.12 $\pm$ 0.36	67.27 $\pm$ 0.55	73.21 $\pm$ 0.29	91.70 $\pm$ 0.39	95.70 $\pm$ 0.35	77.96 $\pm$ 0.52	93.16 $\pm$ 0.41
Applied to FreeMatch [50]
Stratified	95.05 $\pm$ 0.15	95.40 $\pm$ 0.23	95.80 $\pm$ 0.29	51.29 $\pm$ 0.56	67.69 $\pm$ 0.58	73.90 $\pm$ 0.53	92.58 $\pm$ 1.05	94.22 $\pm$ 0.78	79.16 $\pm$ 5.01	91.36 $\pm$ 0.18
Random	93.41 $\pm$ 1.24	93.98 $\pm$ 0.91	95.56 $\pm$ 0.17	47.16 $\pm$ 1.25	66.09 $\pm$ 1.08	72.09 $\pm$ 0.99	91.62 $\pm$ 1.88	94.40 $\pm$ 1.28	76.66 $\pm$ 2.43	90.72 $\pm$ 0.97
$k$ -Means	88.05 $\pm$ 5.07	94.80 $\pm$ 0.48	95.51 $\pm$ 0.37	44.07 $\pm$ 1.94	66.09 $\pm$ 0.39	71.69 $\pm$ 0.72	93.30 $\pm$ 0.46	94.68 $\pm$ 0.72	63.22 $\pm$ 4.92	89.99 $\pm$ 0.87
USL [47]	93.81 $\pm$ 0.62	95.19 $\pm$ 0.18	95.78 $\pm$ 0.29	47.07 $\pm$ 0.78	66.92 $\pm$ 0.33	72.59 $\pm$ 0.36	93.36 $\pm$ 0.53	94.44 $\pm$ 0.44	76.95 $\pm$ 0.86	90.58 $\pm$ 0.58
ActiveFT [54]	78.13 $\pm$ 2.87	94.54 $\pm$ 0.81	95.33 $\pm$ 0.53	26.67 $\pm$ 0.46	56.23 $\pm$ 0.85	71.20 $\pm$ 0.68	92.60 $\pm$ 0.51	93.71 $\pm$ 0.54	63.31 $\pm$ 2.99	86.60 $\pm$ 0.30
RDSS (Ours)	95.05 $\pm$ 0.13	95.50 $\pm$ 0.20	95.98 $\pm$ 0.28	48.41 $\pm$ 0.59	67.40 $\pm$ 0.23	73.13 $\pm$ 0.19	94.54 $\pm$ 0.46	95.83 $\pm$ 0.37	81.90 $\pm$ 1.72	92.22 $\pm$ 0.40

7.2 Comparison with Other Sampling Methods

Main results. We apply RDSS to FlexMatch and FreeMatch and compare it with three baselines and two SOTA methods in SSL under different annotation budgets. The baselines are Stratified, Random and $k$-Means, while the two SOTA methods are USL [47] and ActiveFT [54]. The results are shown in Table 1, from which we make several observations: (1) RDSS achieves the highest accuracy among the feasible sampling methods (i.e., excluding stratified sampling), which underscores the effectiveness of our approach; (2) USL attains the second-best results under most budget settings yet exhibits a significant gap compared to RDSS, particularly under severely constrained ones. For instance, on STL-10 with a budget of $40$, FreeMatch with RDSS is $4.95$ percentage points more accurate than with USL; (3) In most experiments, RDSS either approaches or surpasses the performance of stratified sampling, especially on SVHN and STL-10. However, stratified sampling is practically infeasible because the category labels of the data are not known a priori.

Results on ImageNet. We also compare RDSS with the second-best method, USL, on ImageNet. Following the settings of FreeMatch [50], we select 100k samples for annotation. FreeMatch achieves $58.24\%$ accuracy with RDSS and $56.86\%$ with USL, an improvement of $1.38$ percentage points for our method.

7.3 Comparison with AL/SSAL Approaches

First, we compare RDSS against various traditional AL approaches on CIFAR-10/100, namely CoreSet [35], VAAL [36], LearnLoss [56] and MCDAL [7]. Because AL relies solely on labelled samples for supervised learning, for a fair comparison we train only on the samples selected by RDSS in a supervised manner. The implementation details are given in Appendix D.5. The experimental results are presented in Table 2, from which we observe that RDSS achieves the highest accuracy under almost all budget settings when relying solely on labelled data for supervised learning, with notable improvements on CIFAR-100.

Second, we compare RDSS with sampling methods used in SSAL when applied to the same SSL framework (i.e., FlexMatch or FreeMatch). These sampling methods include CoreSetSSL [35], MMA [39], CBSSAL [11], and TOD-Semi [16]. Specifically, we tune recent SSAL approaches using their public implementations and run experiments under an extremely low-budget setting, i.e., 40 samples in a 20-random-and-20-selected setting. Table 3 shows that most SSAL approaches perform worse than random sampling under extremely low-budget settings. This inefficiency stems from the dependence of sample selection on model performance within the SSAL framework, which struggles when the model is weak. Our model-free method, in contrast, selects samples before training, avoiding these pitfalls.

Table 2: Comparison with AL approaches under Supervised Learning (SL) paradigm. The best performance is bold and the second best performance is underlined.

Dataset	CIFAR-10		CIFAR-100
Budget	7500	10000	7500	10000
CoreSet	85.46	87.56	47.17	53.06
VAAL	86.82	88.97	47.02	53.99
LearnLoss	85.49	87.06	47.81	54.02
MCDAL	87.24	89.40	49.34	54.14
SL+RDSS (Ours)	87.18	89.77	50.13	56.04
Whole Dataset	95.62		78.83

Table 3: Comparison with SSAL approaches. The green (red) arrow represents the improvement (decrease) compared to the random sampling method.

Method	FlexMatch	FreeMatch
Stratified	91.45	95.05
Random	87.30	93.41
CoreSetSSL	87.66 $\uparrow 0.36$	91.24 $\downarrow 2.17$
MMA	74.61 $\downarrow 12.69$	87.37 $\downarrow 6.04$
CBSSAL	86.58 $\downarrow 0.72$	91.68 $\downarrow 1.73$
TOD-Semi	86.21 $\downarrow 1.09$	90.77 $\downarrow 2.64$
RDSS (Ours)	94.69 $\uparrow 7.39$	95.05 $\uparrow 1.64$
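The arrows in Table 3 are plain differences from the Random row. The short sketch below recomputes them from the accuracies listed in the table; the dictionary layout and the helper name `delta_vs_random` are illustrative choices of ours, not part of the original pipeline.

```python
# Accuracies copied from Table 3 (FlexMatch, FreeMatch).
random_acc = {"FlexMatch": 87.30, "FreeMatch": 93.41}
method_acc = {
    "CoreSetSSL": {"FlexMatch": 87.66, "FreeMatch": 91.24},
    "MMA": {"FlexMatch": 74.61, "FreeMatch": 87.37},
    "CBSSAL": {"FlexMatch": 86.58, "FreeMatch": 91.68},
    "TOD-Semi": {"FlexMatch": 86.21, "FreeMatch": 90.77},
    "RDSS": {"FlexMatch": 94.69, "FreeMatch": 95.05},
}


def delta_vs_random(acc: dict) -> dict:
    """Signed accuracy difference from random sampling, rounded to 2 decimals."""
    return {ssl: round(acc[ssl] - random_acc[ssl], 2) for ssl in acc}


deltas = {name: delta_vs_random(acc) for name, acc in method_acc.items()}
# Methods that improve over random sampling under both SSL frameworks.
improves_both = [name for name, d in deltas.items() if all(v > 0 for v in d.values())]
```

On these numbers only RDSS lands in `improves_both`, which matches the reading of the table given in the text.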

Third, we directly compare RDSS with the above AL/SSAL approaches when applied to SSL, which may better reflect the paradigm differences. The experimental results and analysis are given in Appendix D.6.

7.4 Trade-off Parameter $\alpha$

We analyze the effect of different values of $\alpha$ with FreeMatch on CIFAR-10/100. The results are presented in Table 4, from which we make several observations: (1) Our choice $\alpha=1-1/\sqrt{m}$ achieves the highest accuracy under all budget conditions, matching or surpassing every fixed value; (2) The values of $\alpha$ that achieve the best or second-best performance lie within the interval we set, which is in line with our theoretical derivation in Section 6; (3) Accuracy drops to varying degrees relative to our approach when the representativeness or diversity term is removed.

Table 4: Effect of different $\alpha$. The grey results indicate that $\alpha$ is outside the interval we set in Section 6, i.e., $\alpha<1-1/\sqrt{m}$, while the black results indicate that $\alpha$ is within the interval we set, i.e., $1-1/\sqrt{m}\leq\alpha\leq 1$. Among them, $\alpha=0$ and $\alpha=1$ indicate the removal of the representativeness and diversity terms, respectively. The best performance is bold, and the second-best performance is underlined.

Dataset	CIFAR-10			CIFAR-100
Budget ( $m$ )	40	250	4000	400	2500	10000
0	85.54 $\pm$ 0.48	93.55 $\pm$ 0.34	94.58 $\pm$ 0.27	39.26 $\pm$ 0.52	63.77 $\pm$ 0.26	71.90 $\pm$ 0.17
0.40	92.28 $\pm$ 0.24	93.68 $\pm$ 0.13	94.95 $\pm$ 0.12	42.56 $\pm$ 0.47	65.88 $\pm$ 0.24	71.71 $\pm$ 0.29
0.80	94.42 $\pm$ 0.49	94.94 $\pm$ 0.37	95.15 $\pm$ 0.35	45.62 $\pm$ 0.35	66.87 $\pm$ 0.20	72.45 $\pm$ 0.23
0.90	94.33 $\pm$ 0.28	95.03 $\pm$ 0.21	95.20 $\pm$ 0.42	48.12 $\pm$ 0.50	67.14 $\pm$ 0.16	72.15 $\pm$ 0.23
0.95	94.44 $\pm$ 0.64	95.07 $\pm$ 0.26	95.45 $\pm$ 0.38	48.41 $\pm$ 0.59	67.11 $\pm$ 0.29	72.80 $\pm$ 0.35
0.98	94.51 $\pm$ 0.39	95.02 $\pm$ 0.15	95.31 $\pm$ 0.44	48.33 $\pm$ 0.54	67.40 $\pm$ 0.23	72.68 $\pm$ 0.22
1	94.53 $\pm$ 0.42	95.01 $\pm$ 0.23	95.54 $\pm$ 0.25	48.18 $\pm$ 0.36	67.20 $\pm$ 0.29	73.05 $\pm$ 0.18
$1-1/\sqrt{m}$ (Ours)	95.05 $\pm$ 0.13	95.50 $\pm$ 0.20	95.98 $\pm$ 0.28	48.41 $\pm$ 0.59	67.40 $\pm$ 0.23	73.13 $\pm$ 0.19
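To make the interval rule from the caption concrete, the sketch below lists, for each budget in Table 4, which of the fixed $\alpha$ values satisfy $1-1/\sqrt{m}\leq\alpha\leq 1$. The helper names and the small numerical tolerance are our own additions for illustration.

```python
import math

FIXED_ALPHAS = (0.0, 0.40, 0.80, 0.90, 0.95, 0.98, 1.0)
BUDGETS = (40, 250, 4000, 400, 2500, 10000)  # CIFAR-10 then CIFAR-100


def lower_bound(m: int) -> float:
    """Left end of the interval, 1 - 1/sqrt(m)."""
    return 1.0 - 1.0 / math.sqrt(m)


def in_interval(alpha: float, m: int, tol: float = 1e-12) -> bool:
    """True if 1 - 1/sqrt(m) <= alpha <= 1, up to a round-off tolerance."""
    return lower_bound(m) - tol <= alpha <= 1.0


within = {m: [a for a in FIXED_ALPHAS if in_interval(a, m)] for m in BUDGETS}
```

Because the lower end moves towards $1$ as $m$ grows, fewer fixed values fall inside the interval for large budgets; for $m=400$ and $m=2500$ the lower end coincides with the fixed values $0.95$ and $0.98$, which is why those rows match the last row of the table at these budgets.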

8 Conclusion

In this work, we propose a model-free sampling method, RDSS, to select a subset from unlabeled data for annotation in SSL. The primary innovation of our approach lies in the introduction of $\alpha$ -MMD, designed to evaluate the representativeness and diversity of selected samples. Under a low-budget setting, we develop a fast and efficient algorithm GKHR for this problem using the Frank-Wolfe algorithm. Both theoretical analyses and empirical experiments demonstrate the effectiveness of RDSS. In future research, we would like to apply our methodology to scenarios where labelling is cost-prohibitive, such as in the medical domain.

References

Aronszajn [1950] N. Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
Bach [2021] F. Bach. Learning theory from first principles. Draft of a book, version of Sept, 6:2021, 2021.
Bach et al. [2012] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.
Bietti and Mairal [2019] A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. The Journal of Machine Learning Research, 20(1):876–924, 2019.
Chan et al. [2021] Y.-C. Chan, M. Li, and S. Oymak. On the marginal benefit of active learning: Does self-supervision eat its cake? In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3455–3459. IEEE, 2021.
Chen et al. [2012] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. arXiv preprint arXiv:1203.3472, 2012.
Cho et al. [2022] J. W. Cho, D.-J. Kim, Y. Jung, and I. S. Kweon. Mcdal: Maximum classifier discrepancy for active learning. IEEE transactions on neural networks and learning systems, 2022.
Coates et al. [2011] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
Freytag et al. [2014] A. Freytag, E. Rodner, and J. Denzler. Selecting influential examples: Active learning with expected model output changes. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 562–577. Springer, 2014.
Gao et al. [2020] M. Gao, Z. Zhang, G. Yu, S. Ö. Arık, L. S. Davis, and T. Pfister. Consistency-based semi-supervised active learning: Towards minimizing labeling cost. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 510–526. Springer, 2020.
Graf and Luschgy [2007] S. Graf and H. Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
Gretton et al. [2006] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in neural information processing systems, 19, 2006.
Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Huang et al. [2021] S. Huang, T. Wang, H. Xiong, J. Huan, and D. Dou. Semi-supervised active learning with temporal output discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3447–3456, 2021.
Joshi et al. [2009] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In 2009 ieee conference on computer vision and pattern recognition, pages 2372–2379. IEEE, 2009.
Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Laine and Aila [2016] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2016.
Lee et al. [2013] D.-H. Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
Lewis and Catlett [1994] D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994.
Li et al. [2023] M. Li, R. Wu, H. Liu, J. Yu, X. Yang, B. Han, and T. Liu. Instant: Semi-supervised learning with instance-dependent thresholds. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Liu et al. [2021] Z. Liu, H. Ding, H. Zhong, W. Li, J. Dai, and C. He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
Lohr [2021] S. L. Lohr. Sampling: design and analysis. Chapman and Hall/CRC, 2021.
Luo et al. [2013] W. Luo, A. Schwing, and R. Urtasun. Latent structured active learning. Advances in Neural Information Processing Systems, 26, 2013.
Mahmood et al. [2021] R. Mahmood, S. Fidler, and M. T. Law. Low budget active learning via wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.
Mak and Joseph [2018] S. Mak and V. R. Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018.
Muandet et al. [2017] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
Netzer et al. [2011] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
Paulsen and Raghupathi [2016] V. I. Paulsen and M. Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge university press, 2016.
Pronzato [2021] L. Pronzato. Performance analysis of greedy algorithms for minimising a maximum mean discrepancy. arXiv preprint arXiv:2101.07564, 2021.
Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Sajjadi et al. [2016] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 29, 2016.
Schmutz et al. [2022] H. Schmutz, O. Humbert, and P.-A. Mattei. Don’t fear the unlabelled: safe semi-supervised learning via debiasing. In The Eleventh International Conference on Learning Representations, 2022.
Sener and Savarese [2018] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
Sinha et al. [2019] S. Sinha, S. Ebrahimi, and T. Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–5981, 2019.
Sohn et al. [2020] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
Song et al. [2009] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
Song et al. [2019] S. Song, D. Berthelot, and A. Rostamizadeh. Combining mixmatch and active learning for better accuracy with fewer labels. arXiv preprint arXiv:1912.00594, 2019.
Sriperumbudur et al. [2012] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On the empirical estimation of integral probability metrics. 2012.
Sutskever et al. [2013] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
Tarvainen and Valpola [2017] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
Thompson [2012] S. K. Thompson. Sampling, volume 755. John Wiley & Sons, 2012.
Tong and Koller [2001] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
Wang et al. [2016] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.
Wang et al. [2022a] X. Wang, L. Lian, and S. X. Yu. Unsupervised selective labeling for more effective semi-supervised learning. In European Conference on Computer Vision, pages 427–445. Springer, 2022a.
Wang et al. [2022b] X. Wang, Z. Wu, L. Lian, and S. X. Yu. Debiased learning from naturally imbalanced pseudo-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14647–14657, 2022b.
Wang et al. [2022c] Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961, 2022c.
Wang et al. [2022d] Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, et al. Freematch: Self-adaptive thresholding for semi-supervised learning. arXiv preprint arXiv:2205.07246, 2022d.
Wu et al. [2023] X. Wu, Y. Huo, H. Ren, and C. Zou. Optimal subsampling via predictive inference. Journal of the American Statistical Association, (just-accepted):1–29, 2023.
Xie et al. [2020a] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le. Unsupervised data augmentation for consistency training. Advances in neural information processing systems, 33:6256–6268, 2020a.
Xie et al. [2020b] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020b.
Xie et al. [2023] Y. Xie, H. Lu, J. Yan, X. Yang, M. Tomizuka, and W. Zhan. Active finetuning: Exploiting annotation budget in the pretraining-finetuning paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23715–23724, 2023.
Yang et al. [2023] L. Yang, Z. Zhao, L. Qi, Y. Qiao, Y. Shi, and H. Zhao. Shrinking class space for enhanced certainty in semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16187–16196, 2023.
Yoo and Kweon [2019] D. Yoo and I. S. Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 93–102, 2019.
Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference 2016. British Machine Vision Association, 2016.
Zhang et al. [2021] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408–18419, 2021.
Zhang et al. [2023a] J. Zhang, C. Meng, J. Yu, M. Zhang, W. Zhong, and P. Ma. An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. Journal of Computational and Graphical Statistics, 32(1):329–339, 2023a.
Zhang et al. [2023b] M. Zhang, Y. Zhou, Z. Zhou, and A. Zhang. Model-free subsampling method based on uniform designs. IEEE Transactions on Knowledge and Data Engineering, 2023b.

Appendix A Algorithms

A.1 Derivation of Generalized Kernel Herding (GKH)

Proof.

The proof technique is borrowed from [31]. We first define a weighted modification of $\alpha$-MMD. For any $\mathbf{w}\in\mathbb{R}^{n}$ such that $\mathbf{w}^{\top}\mathbf{1}=1$, the weighted $\alpha$-MMD is defined by

\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}\mathbf{w}-2\alpha\mathbf{w}^{\top}\mathbf{p}+\alpha^{2}\overline{K},

where $\mathbf{K}=[k(\mathbf{x}_{i},\mathbf{x}_{j})]_{1\leq i,j\leq n}$, $\overline{K}=\mathbf{1}^{\top}\mathbf{K}\mathbf{1}/n^{2}$, $\mathbf{p}=(\mathbf{e}_{1}^{\top}\mathbf{K}\mathbf{1}/n,\cdots,\mathbf{e}_{n}^{\top}\mathbf{K}\mathbf{1}/n)$, and $\{\mathbf{e}_{i}\}_{i=1}^{n}$ is the standard basis of $\mathbb{R}^{n}$. It is clear that for any $\mathcal{I}_{p}\subset[n]$ with $|\mathcal{I}_{p}|=p$,

\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}_{p})=\text{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{p}},\mathbf{X}_{n}),

where $(\mathbf{w}_{p})_{i}=1/p$ if $i\in\mathcal{I}_{p}$, and $(\mathbf{w}_{p})_{i}=0$ otherwise. Therefore, the weighted $\alpha$-MMD is indeed a generalization of $\alpha$-MMD. Letting

\mathbf{K}_{*}=\mathbf{K}-2\alpha\mathbf{p}\mathbf{1}^{\top}+\alpha^{2}\overline{K}\mathbf{1}\mathbf{1}^{\top},

and using $\mathbf{w}^{\top}\mathbf{1}=1$, we obtain the quadratic-form expression of the weighted $\alpha$-MMD, $\text{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}_{*}\mathbf{w}$, where $\mathbf{K}_{*}$ is strictly positive definite if $\mathbf{w}\not=\mathbf{w}_{n}$ and $k$ is a characteristic kernel, according to [31]. Under our low-budget setting and choice of kernel, $\mathbf{K}_{*}$ is indeed a strictly positive definite matrix. Thus $\text{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}$ is a convex functional w.r.t. $\mathbf{w}$, so that $\min_{\mathbf{w}^{\top}\mathbf{1}=1}\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w})$ can be solved by the Frank-Wolfe algorithm. Then for $1\leq p<n$,

\mathbf{s}_{p}\in\mathop{\arg\min}_{\mathbf{s}^{\top}\mathbf{1}=1}\mathbf{s}^{\top}(\mathbf{Kw}_{p}-\alpha\mathbf{p})=\mathop{\arg\min}_{\mathbf{e}_{i},i\in[n]}\mathbf{e}_{i}^{\top}(\mathbf{Kw}_{p}-\alpha\mathbf{p}).

Letting $\mathbf{e}_{i_{p}}=\mathbf{s}_{p}$, under uniform step size, we have

\mathbf{w}_{p+1}=\left(\frac{p}{p+1}\right)\mathbf{w}_{p}+\frac{1}{p+1}\mathbf{e}_{i_{p}}

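as the update formula of the Frank-Wolfe algorithm. Since $(\mathbf{K}\mathbf{w}_{p})_{i}=\frac{1}{p}\sum_{j\in\mathcal{I}_{p}}k(\mathbf{x}_{i},\mathbf{x}_{j})$ and $(\mathbf{p})_{i}=\frac{1}{n}\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l})$, multiplying the linear objective by $p$ shows that the selection step is equivalent to

i^{*}_{p}\in\arg\min_{i\in[n]}\sum_{j\in\mathcal{I}_{p}}k(\mathbf{x}_{i},\mathbf{x}_{j})-\frac{\alpha p}{n}\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l}).

Setting $\mathbf{w}_{0}=\mathbf{0}$, we immediately derive the iterating formula in (7). ∎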

A.2 Pseudo Codes

Algorithm 2 Generalized Kernel Herding

Input: Data set $\mathbf{X}_{n}=\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\}\subset\mathcal{X}$; the number of selected samples $m<n$; a positive definite, characteristic and radial kernel $k(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$; trade-off parameter $\alpha\leq 1$.
Output: selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\cdots,\mathbf{x}_{i^{*}_{m}}\}$.
1: For each $\mathbf{x}_{i}\in\mathbf{X}_{n}$, calculate $\mu(\mathbf{x}_{i}):=\sum_{j=1}^{n}k(\mathbf{x}_{j},\mathbf{x}_{i})/n$.
2: Set $\beta_{1}=1$, $S_{0}=0$, $\mathcal{I}=\emptyset$.
3: for $p\in\{1,\cdots,m\}$ do
4:   ${i^{*}_{p}}\in\mathop{\arg\min}_{i\in[n]}S_{p-1}(\mathbf{x}_{i})-\alpha\mu(\mathbf{x}_{i})$
5:   For all $i\in[n]$, update $S_{p}(\mathbf{x}_{i})=(1-\beta_{p})S_{p-1}(\mathbf{x}_{i})+\beta_{p}k(\mathbf{x}_{i^{*}_{p}},\mathbf{x}_{i})$
6:   ${\mathcal{I}^{*}_{p+1}}\leftarrow{\mathcal{I}^{*}_{p}}\cup\{{i^{*}_{p}}\}$, $p\leftarrow p+1$, set $\beta_{p}=1/p$
7: end for
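Algorithm 2 can be sketched in a few lines of NumPy. The version below is a minimal illustration under stated assumptions rather than the authors' implementation: it operates on a precomputed kernel matrix, and the function name `generalized_kernel_herding`, the Gaussian kernel helper with `sigma=1.0`, and the one-dimensional toy data are all hypothetical choices of ours. As in the pseudocode, the arg-min runs over all of $[n]$, so nothing explicitly prevents an index from being selected more than once.

```python
import numpy as np


def gaussian_gram(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix (assumed form exp(-||x - y||^2 / (2 sigma^2)))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def generalized_kernel_herding(K: np.ndarray, m: int, alpha: float) -> list:
    """Greedy selection following Algorithm 2 on a precomputed kernel matrix K."""
    n = K.shape[0]
    mu = K.mean(axis=0)          # step 1: mu(x_i) = sum_j k(x_j, x_i) / n
    S = np.zeros(n)              # step 2: S_0 = 0
    selected = []
    for p in range(1, m + 1):    # step 3
        beta = 1.0 / p           # beta_p = 1/p, so beta_1 = 1
        i_star = int(np.argmin(S - alpha * mu))   # step 4
        S = (1.0 - beta) * S + beta * K[i_star]   # step 5: running kernel mean
        selected.append(i_star)                   # step 6
    return selected


# Toy data: three nearby points and one far-away point.
X = np.array([[0.0], [0.1], [0.2], [5.0]])
m = 2
alpha = 1.0 - 1.0 / np.sqrt(m)
selected = generalized_kernel_herding(gaussian_gram(X), m, alpha)
```

On this toy set the first pick is the most central point of the dense cluster (largest $\mu$), and the second pick is the distant point, because the term $S_{p-1}$ penalises similarity to what has already been chosen. This is the representativeness and diversity trade-off that $\alpha$ controls.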

Appendix B Technical Lemmas

Lemma B.1 (Lemma 2 [31]).

Let $\left(t_{k}\right)_{k}$ and $\left(\alpha_{k}\right)_{k}$ be two real positive sequences and $A$ be a strictly positive real. If $t_{k}$ satisfies

t_{1}\leq A\text{ and }t_{k+1}\leq\left(1-\alpha_{k+1}\right)t_{k}+A\alpha_{k+1}^{2},k\geq 1,

with $\alpha_{k}=1/k$ for all $k$ , then $t_{k}<A(2+\log k)/(k+1)$ for all $k>1$ .
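A quick numerical sanity check of Lemma B.1 is sketched below. It iterates the recursion with equality, which is the largest sequence the hypothesis allows because the coefficient $1-\alpha_{k+1}$ is nonnegative, and compares it with the stated bound. The value `A = 2.0`, the horizon of 1000 steps and the function names are arbitrary choices for illustration; this is a check on finitely many terms, not a proof.

```python
import math


def worst_case_sequence(A: float, n_steps: int) -> list:
    """t_1 = A and t_{k+1} = (1 - a_{k+1}) t_k + A a_{k+1}^2 with a_k = 1/k."""
    t = [A]
    for k in range(1, n_steps):
        a_next = 1.0 / (k + 1)
        t.append((1.0 - a_next) * t[-1] + A * a_next ** 2)
    return t  # t[k - 1] stores t_k


def lemma_bound(A: float, k: int) -> float:
    """Right-hand side of Lemma B.1: A (2 + log k) / (k + 1)."""
    return A * (2.0 + math.log(k)) / (k + 1)


A = 2.0
t = worst_case_sequence(A, 1000)
holds = all(t[k - 1] < lemma_bound(A, k) for k in range(2, 1001))
```

In the equality case the recursion solves to $t_{k}=A H_{k}/k$ with $H_{k}$ the $k$-th harmonic number, which is where the logarithmic factor in the bound comes from.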

Lemma B.2.

The selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}$ generated by GKH (Algorithm 2) satisfy

\operatorname{MMD}_{k,\alpha}^{2}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq M_{\alpha}^{2}+B\frac{2+\log m}{m+1} \tag{10}

where $B=2K$ , $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\leq K$ , $M_{\alpha}^{2}$ is defined by

M_{\alpha}^{2}:=\min_{\mathbf{w}^{\top}\mathbf{1}=1,\mathbf{w}\geq 0}\operatorname{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}\left({\mathbf{w}}\right)

Proof.

Following the notation in Appendix A and letting $\mathbf{p}_{\alpha}=\alpha\mathbf{p}$, we can directly follow the proof of the finite-sample-size error bound for kernel herding with predefined step sizes given in [31] to derive Lemma B.2, without any additional technique. The detailed proof is omitted. ∎

Lemma B.3.

Let $\mathcal{H}$ be an RKHS over $\mathcal{X}$ associated with positive definite kernel $k$, and $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\leq K$. Let $\mathbf{X}_{m}=\{\mathbf{x}_{i}\}_{i=1}^{m}$, $\mathbf{Y}_{n}=\{\mathbf{y}_{j}\}_{j=1}^{n}$, $\mathbf{x}_{i},\mathbf{y}_{j}\in\mathcal{X}$. Then for any $\alpha\leq 1$,

|\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{m},\mathbf{Y}_{n})-\operatorname{MMD}_{k}(\mathbf{X}_{m},\mathbf{Y}_{n})|\leq(1-\alpha)\sqrt{K}

Proof.

\begin{aligned}
&\left|\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{m},\mathbf{Y}_{n})-\operatorname{MMD}_{k}(\mathbf{X}_{m},\mathbf{Y}_{n})\right|\\
={}&\left|\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i=1}^{m}f\left(\mathbf{x}_{i}\right)-\frac{\alpha}{n}\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right)-\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i=1}^{m}f\left(\mathbf{x}_{i}\right)-\frac{1}{n}\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right)\right|\\
\leq{}&\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\frac{1-\alpha}{n}\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right|=\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right|\\
={}&\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\sum_{j=1}^{n}\left\langle f,k(\cdot,\mathbf{y}_{j})\right\rangle_{\mathcal{H}}\right|\leq\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\sum_{j=1}^{n}\left|\left\langle f,k(\cdot,\mathbf{y}_{j})\right\rangle_{\mathcal{H}}\right|\\
\leq{}&\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\sum_{j=1}^{n}\|f\|_{\mathcal{H}}\|k(\cdot,\mathbf{y}_{j})\|_{\mathcal{H}}\leq(1-\alpha)\sqrt{K}.
\end{aligned}

∎
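The bound in Lemma B.3 can be checked numerically on random data. The sketch below assumes the Gaussian kernel, for which $k(\mathbf{x},\mathbf{x})=1$ and hence $K=1$, and evaluates $\alpha$-MMD through the squared form $\overline{k}_{XX}-2\alpha\overline{k}_{XY}+\alpha^{2}\overline{k}_{YY}$ (means of the respective kernel blocks), which is the supremum in the proof written in closed form. Function names, sample sizes and the bandwidth are illustrative assumptions.

```python
import numpy as np


def gaussian_gram(A: np.ndarray, B: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Cross kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def alpha_mmd(X: np.ndarray, Y: np.ndarray, alpha: float) -> float:
    """alpha-MMD between samples X and Y; alpha = 1 recovers the usual MMD."""
    squared = (gaussian_gram(X, X).mean()
               - 2.0 * alpha * gaussian_gram(X, Y).mean()
               + alpha ** 2 * gaussian_gram(Y, Y).mean())
    return float(np.sqrt(max(squared, 0.0)))  # guard against round-off below zero


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # small "selected" set
Y = rng.normal(size=(50, 3))   # larger reference set
K_max = 1.0                    # sup_x k(x, x) for the Gaussian kernel

gaps = {a: abs(alpha_mmd(X, Y, a) - alpha_mmd(X, Y, 1.0)) for a in (0.0, 0.5, 0.9, 1.0)}
bound_holds = all(gap <= (1.0 - a) * np.sqrt(K_max) + 1e-12 for a, gap in gaps.items())
```

The gap shrinks to zero as $\alpha\to 1$, consistent with the factor $(1-\alpha)$ in the lemma.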

Lemma B.4 (Proposition 12.31 [45]).

Suppose that $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ are reproducing kernel Hilbert spaces of real-valued functions with domains $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$ , and equipped with kernels $k_{1}$ and $k_{2}$ , respectively. Then the tensor product space $\mathcal{H}=\mathcal{H}_{1}\otimes\mathcal{H}_{2}$ is an RKHS of real-valued functions with domain $\mathcal{X}_{1}\times\mathcal{X}_{2}$ , and with kernel function

k\left(\left(x_{1},x_{2}\right),\left(x_{1}^{\prime},x_{2}^{\prime}\right)\right)=k_{1}\left(x_{1},x_{1}^{\prime}\right)k_{2}\left(x_{2},x_{2}^{\prime}\right).

Lemma B.5 (Theorem 5.7 [30]).

Let $f\in\mathcal{H}_{1}$ and $g\in\mathcal{H}_{2}$, where $\mathcal{H}_{1},\mathcal{H}_{2}$ are two RKHSs of real-valued functions on $\mathcal{X}$, associated with positive definite kernels $k_{1},k_{2}$ and canonical feature maps $\phi_{1},\phi_{2}$. Then for any $x\in\mathcal{X}$,

f(x)+g(x)=\left\langle f,\phi_{1}(x)\right\rangle_{\mathcal{H}_{1}}+\left\langle g,\phi_{2}(x)\right\rangle_{\mathcal{H}_{2}}=\left\langle f+g,(\phi_{1}+\phi_{2})(x)\right\rangle_{\mathcal{H}_{1}+\mathcal{H}_{2}},

where

\mathcal{H}_{1}+\mathcal{H}_{2}=\{f_{1}+f_{2}|f_{i}\in\mathcal{H}_{i}\}

and $\phi_{1}+\phi_{2}$ is the canonical feature map of $\mathcal{H}_{1}+\mathcal{H}_{2}$. Furthermore,

\|f+g\|_{\mathcal{H}_{1}+\mathcal{H}_{2}}^{2}\leq\|f\|_{\mathcal{H}_{1}}^{2}+\|g\|_{\mathcal{H}_{2}}^{2}.

Lemma B.6.

For any unlabeled dataset $\mathbf{X}_{n}\subset\mathcal{X}$ and any subset $\mathbf{X}_{\mathcal{I}_{m}}$ ,

\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{n},\mathbf{X}_{n})=(1-\alpha)^{2}\overline{K},\quad\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})\leq(1+\alpha^{2})K,

where $\overline{K}=\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{j})/n^{2}$ , $K=\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})$ .

Lemma B.6 is directly derived from the definition of $\alpha$ -MMD.
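One way to see both statements is through the weighted form from Appendix A.1. The sketch below is our own filling-in of the omitted step, not the authors' written proof. For the first identity, take uniform weights $\mathbf{w}=\mathbf{1}/n$, so that $\mathbf{w}^{\top}\mathbf{K}\mathbf{w}=\overline{K}$ and $\mathbf{w}^{\top}\mathbf{p}=\mathbf{1}^{\top}\mathbf{K}\mathbf{1}/n^{2}=\overline{K}$. For the inequality we additionally assume $0\leq\alpha\leq 1$ and a nonnegative kernel (as with the Gaussian kernel), so that the cross term can be dropped, and we use $|k(\mathbf{x},\mathbf{y})|\leq\sqrt{k(\mathbf{x},\mathbf{x})k(\mathbf{y},\mathbf{y})}\leq K$.

```latex
\begin{aligned}
\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{n},\mathbf{X}_{n})
  &= \overline{K}-2\alpha\overline{K}+\alpha^{2}\overline{K}
   = (1-\alpha)^{2}\overline{K},\\
\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})
  &= \mathbf{w}_{m}^{\top}\mathbf{K}\mathbf{w}_{m}
     -2\alpha\mathbf{w}_{m}^{\top}\mathbf{p}
     +\alpha^{2}\overline{K}
   \leq K + 0 + \alpha^{2}K
   = (1+\alpha^{2})K.
\end{aligned}
```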

Appendix C Proof of Theorems

Proof for Theorem 5.4.

First, denote $\mathcal{H}_{4}=\mathcal{H}_{1}\otimes\mathcal{H}_{1}+\mathcal{H}_{1}\otimes\mathcal{H}_{2}+\mathcal{H}_{3}$, with kernel $k_{4}=k_{1}^{2}+k_{1}k_{2}+k_{3}$ and canonical feature map $\phi_{4}=\phi_{1}\otimes\phi_{1}+\phi_{1}\otimes\phi_{2}+\phi_{3}$.

Under the assumptions in Theorem 5.4, according to Theorem 4 in [38], we have for any $\mathbf{x}\in\mathcal{X}$,

h(\mathbf{x})=\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}},\quad\mathbb{E}[Y|\mathbf{x}]=\left\langle\mathbb{E}[Y|X],\phi_{2}(\mathbf{x})\right\rangle_{\mathcal{H}_{2}},

\operatorname{Var}(Y|\mathbf{x})=\left\langle\operatorname{Var}(Y|X),\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}

where $\phi_{1},\phi_{2},\phi_{3}$ are the canonical feature maps of $\mathcal{H}_{1},\mathcal{H}_{2},\mathcal{H}_{3}$. Denote $m=\mathbb{E}[Y|X]$ and $s=\operatorname{Var}(Y|X)$. Now by definition,

R(h)=\mathbb{E}\left[\ell(h(\mathbf{x}),y)\right]=\int_{\mathcal{X}}\int_{\mathcal{Y}}\ell(h(\mathbf{x}),y)p(y|\mathbf{x})p(\mathbf{x})d\mathbf{x}dy=\int_{\mathcal{X}}f(\mathbf{x})p(\mathbf{x})d\mathbf{x}

where

\begin{aligned}
f(\mathbf{x})&=\int_{\mathcal{Y}}(y-h(\mathbf{x}))^{2}p(y|\mathbf{x})dy\\
&=\operatorname{Var}(Y|\mathbf{x})-2h(\mathbf{x})\mathbb{E}[Y|\mathbf{x}]+h^{2}(\mathbf{x})\\
&=\left\langle s,\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}-2\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\left\langle m,\phi_{2}(\mathbf{x})\right\rangle_{\mathcal{H}_{2}}+\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\\
&=\left\langle s,\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}-\left\langle 2h\otimes m,(\phi_{1}\otimes\phi_{2})(\mathbf{x})\right\rangle_{\mathcal{H}_{1}\otimes\mathcal{H}_{2}}+\left\langle h\otimes h,(\phi_{1}\otimes\phi_{1})(\mathbf{x})\right\rangle_{\mathcal{H}_{1}\otimes\mathcal{H}_{1}}\\
&=\left\langle s-2h\otimes m+h\otimes h,\phi_{4}(\mathbf{x})\right\rangle_{\mathcal{H}_{4}}
\end{aligned}

where the fourth equality holds by Lemma B.4 and the last equality holds by Lemma B.5, then $f\in\mathcal{H}_{4}$ , and

\begin{aligned}
\|f\|_{\mathcal{H}_{4}}&=\|s-2h\otimes m+h\otimes h\|_{\mathcal{H}_{4}}\\
&\leq\|s\|_{\mathcal{H}_{4}}+\|2h\otimes m\|_{\mathcal{H}_{4}}+\|h\otimes h\|_{\mathcal{H}_{4}}\\
&\leq\|s\|_{\mathcal{H}_{3}}+2\|m\|_{\mathcal{H}_{2}}\|h\|_{\mathcal{H}_{1}}+\|h\otimes h\|_{\mathcal{H}_{1}\otimes\mathcal{H}_{1}}\\
&=\|s\|_{\mathcal{H}_{3}}+2\|m\|_{\mathcal{H}_{2}}\|h\|_{\mathcal{H}_{1}}+\|h\|_{\mathcal{H}_{1}}^{2}\\
&\leq K_{h}^{2}+2K_{h}K_{m}+K_{s}
\end{aligned}

where the second inequality holds by Lemma B.5. Therefore, let $\beta=1/(K_{h}^{2}+2K_{h}K_{m}+K_{s})$ we have $\|\beta f\|_{\mathcal{H}_{4}}=\beta\|f\|_{\mathcal{H}_{4}}\leq 1$ . Then

\begin{aligned}
&\left|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)\right|\\
={}&\left|\int_{\mathcal{X}}f(\mathbf{x})dP_{T}(\mathbf{x})-\int_{\mathcal{X}}f(\mathbf{x})dP_{S}(\mathbf{x})\right|\\
={}&(K_{h}^{2}+2K_{h}K_{m}+K_{s})\left|\int_{\mathcal{X}}\beta f(\mathbf{x})dP_{T}(\mathbf{x})-\int_{\mathcal{X}}\beta f(\mathbf{x})dP_{S}(\mathbf{x})\right|\\
\leq{}&(K_{h}^{2}+2K_{h}K_{m}+K_{s})\sup_{\|f\|_{\mathcal{H}_{4}}\leq 1}\left|\int_{\mathcal{X}}f(\mathbf{x})dP_{T}(\mathbf{x})-\int_{\mathcal{X}}f(\mathbf{x})dP_{S}(\mathbf{x})\right|\\
={}&(K_{h}^{2}+2K_{h}K_{m}+K_{s})\operatorname{MMD}_{k_{4}}(\mathbf{X}_{S},\mathbf{X}_{T})
\end{aligned}

where $P_{T}$ and $P_{S}$ denote the empirical distributions constructed from $\mathbf{X}_{T}$ and $\mathbf{X}_{S}$ , respectively. Combining this bound with Lemma B.3 yields Theorem 5.4. ∎
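The key step of this proof, bounding the gap between two empirical means of an RKHS function by its norm times the MMD, can be checked numerically. The sketch below is only an illustration under stated assumptions: a Gaussian kernel with bandwidth `sigma` stands in for $k_{4}$ , the point sets are synthetic, and the function $f$ is a hypothetical finite kernel expansion whose centres and coefficients are drawn at random.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd(x_s, x_t, sigma=1.0):
    """Plain empirical MMD between two point sets (no alpha weighting)."""
    k_ss = gaussian_kernel(x_s, x_s, sigma).mean()
    k_tt = gaussian_kernel(x_t, x_t, sigma).mean()
    k_st = gaussian_kernel(x_s, x_t, sigma).mean()
    return np.sqrt(max(k_ss + k_tt - 2.0 * k_st, 0.0))

rng = np.random.default_rng(0)
x_t = rng.normal(size=(500, 2))                        # stands in for X_T
x_s = x_t[rng.choice(500, size=50, replace=False)]     # stands in for X_S

# Hypothetical f = sum_i coef_i * k(., centres_i), so ||f||_H^2 = coef^T K coef.
centres = rng.normal(size=(20, 2))
coef = rng.normal(size=20)
f_norm = np.sqrt(coef @ gaussian_kernel(centres, centres) @ coef)

# Empirical means of f under the two point sets.
mean_t = (gaussian_kernel(x_t, centres) @ coef).mean()
mean_s = (gaussian_kernel(x_s, centres) @ coef).mean()

gap = abs(mean_t - mean_s)
bound = f_norm * mmd(x_s, x_t)
print(f"gap = {gap:.4f}, norm * MMD = {bound:.4f}")
```

In the proof, the role of `f_norm` is played by the constant $K_{h}^{2}+2K_{h}K_{m}+K_{s}$ , which upper bounds $\|f\|_{\mathcal{H}_{4}}$ .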

Proof for Theorem 5.6.

Following the notations in Appendix A, we further define

\mathbf{w}_{*}=\mathbf{1}/n,C^{2}_{\alpha}=\operatorname{MMD}^{2}_{k,\alpha,% \mathbf{X}_{n}}(\mathbf{w}_{*})=(1-\alpha)^{2}\overline{K}

(11)

\widehat{\mathbf{w}}=\mathop{\arg\min}_{\mathbf{1}^{\top}\mathbf{w}=1}% \operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\alpha\left(% \mathbf{K}^{-1}-\frac{\mathbf{K}^{-1}\mathbf{1}\mathbf{1}^{\top}\mathbf{K}^{-1% }}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}}\right)\mathbf{p}+\frac{\mathbf{% K}^{-1}\mathbf{1}}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}}

Letting $\mathbf{p}_{\alpha}=\alpha\mathbf{p}$ , we have $(\mathbf{p}_{\alpha}-\mathbf{K}\widehat{\mathbf{w}})\propto\mathbf{1}$ . Define

\Delta_{\alpha}(\mathbf{w}):=\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(% \mathbf{w})-C_{\alpha}^{2}=\widehat{g}(\mathbf{w})-\widehat{g}(\mathbf{w}_{*})

where $\widehat{g}(\mathbf{w})=\left(\mathbf{w}-\widehat{\mathbf{w}}\right)^{\top}\mathbf{K}\left(\mathbf{w}-\widehat{\mathbf{w}}\right)$ . We omit the proof of this equality, since it follows the derivation of the alternative expression of MMD in Pronzato [31]. By the convexity of $\widehat{g}(\cdot)$ , for $j=\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})$ ,

\widehat{g}\left(\mathbf{w}_{*}\right)\geq\widehat{g}\left(\mathbf{w}_{p}% \right)+2\left(\mathbf{w}_{*}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(% \mathbf{w}_{p}-\widehat{\mathbf{w}}\right)\geq\widehat{g}\left(\mathbf{w}_{p}% \right)+2\min_{j\in[n]\backslash\mathcal{I}^{*}_{p}}\left(\mathbf{e}_{j}-% \mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}% }\right)

where the second inequality holds under the assumption in Theorem 5.6, because

	$\displaystyle\left(\mathbf{w}_{*}-\mathbf{e}_{j}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)$	$\displaystyle=\left(\mathbf{w}_{*}-\mathbf{e}_{j}\right)^{\top}\left(\mathbf{K}\mathbf{w}_{p}-\mathbf{p}_{\alpha}\right)$
		$\displaystyle=\frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}-f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{p+1}})\geq\frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}-f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j})\geq 0$

Therefore, for $B=2K$ , we have

	$\displaystyle\Delta_{\alpha}(\mathbf{w}_{p+1})$	(12)
$\displaystyle=$	$\displaystyle\widehat{g}\left(\mathbf{w}_{p}\right)-\widehat{g}\left(\mathbf{w}_{*}\right)+\frac{2}{p+1}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)+\frac{1}{(p+1)^{2}}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)$
$\displaystyle\leq$	$\displaystyle\frac{p}{p+1}(\widehat{g}\left(\mathbf{w}_{p}\right)-\widehat{g}\left(\mathbf{w}_{*}\right))+\frac{1}{(p+1)^{2}}B=\frac{p}{p+1}\Delta_{\alpha}(\mathbf{w}_{p})+\frac{1}{(p+1)^{2}}B$

where $\mathbf{w}_{p+1}=p\mathbf{w}_{p}/(p+1)+\mathbf{e}_{j}/(p+1)$ , and obviously $B$ upper bounds $\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{e}_{j% }-\mathbf{w}_{p}\right)$ . Since $\alpha\leq 1$ , it holds from Lemma B.6 that

\Delta_{\alpha}(\mathbf{w}_{1})\leq\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}% _{n}}(\mathbf{w}_{1})\leq(1+\alpha^{2})K\leq B

therefore by Lemma B.1, we have

\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_% {n})=\operatorname{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}_{p})\leq C_{% \alpha}^{2}+B\frac{2+\log m}{m+1}

∎
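As a sanity check on the last step, the recursion in (12) can be iterated numerically and compared with the $B(2+\log m)/(m+1)$ term. This is a minimal sketch, not part of the proof: it assumes the recursion is taken at its worst case (with equality) and starts from $\Delta_{\alpha}(\mathbf{w}_{1})=B$ , the largest initial value allowed above. The function names are hypothetical.

```python
import math

def worst_case_gap(m, B=1.0):
    """Iterate Delta_{p+1} = p/(p+1) * Delta_p + B/(p+1)^2, starting at Delta_1 = B."""
    delta = B
    for p in range(1, m):
        delta = p / (p + 1) * delta + B / (p + 1) ** 2
    return delta

def log_rate_bound(m, B=1.0):
    """The term B * (2 + log m) / (m + 1) appearing in the final inequality."""
    return B * (2.0 + math.log(m)) / (m + 1)

for m in (1, 10, 100, 1000):
    print(m, worst_case_gap(m), log_rate_bound(m))
```

Multiplying the recursion by $p+1$ shows that, in this worst case, $m\Delta_{\alpha}(\mathbf{w}_{m})=B\sum_{k=1}^{m}1/k$ , which is why the excess over $C_{\alpha}^{2}$ decays at the rate $(\log m)/m$ .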

Appendix D Additional Experimental Details and Results

D.1 Supplementary Numerical Experiments on GKHR

Given that GKH is a convergent algorithm (Lemma B.2) and that the finite-sample-size error bound (10) holds without any assumption on the data, we conduct numerical experiments to empirically compare GKHR with GKH on datasets generated from four different distributions on $\mathbb{R}^{2}$ .

Firstly, we define four distributions on $\mathbb{R}^{2}$ :

1. Gaussian mixture model 1, which consists of four Gaussian distributions $G_{1},G_{2},G_{3},G_{4}$ with mixture weights $[0.95,0.01,0.02,0.02]$ ,

2. Gaussian mixture model 2, which consists of four Gaussian distributions $G_{1},G_{2},G_{3},G_{4}$ with mixture weights $[0.3,0.2,0.15,0.35]$ ,

3. Uniform distribution 1, which consists of a uniform distribution defined on a disc with radius $0.5$ and a uniform distribution defined on an annulus with inner radius $4$ and outer radius $6$ ,

4. Uniform distribution 2, defined on $[-10,10]^{2}$ ,

where

G_{1}=\mathcal{N}\left(\begin{bmatrix}1\\ 2\end{bmatrix},\begin{bmatrix}2&0\\ 0&5\end{bmatrix}\right),G_{2}=\mathcal{N}\left(\begin{bmatrix}-3\\ -5\end{bmatrix},\begin{bmatrix}1&0\\ 0&2\end{bmatrix}\right)

G_{3}=\mathcal{N}\left(\begin{bmatrix}-5\\ 4\end{bmatrix},\begin{bmatrix}8&0\\ 0&6\end{bmatrix}\right),G_{4}=\mathcal{N}\left(\begin{bmatrix}15\\ 10\end{bmatrix},\begin{bmatrix}4&0\\ 0&9\end{bmatrix}\right)
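For readers who want to reproduce the synthetic data, the four distributions can be sampled as in the following NumPy sketch. It is an illustration under assumptions rather than the code used for the experiments: in particular, the text does not state how the disc and the annulus are mixed in Uniform distribution 1, so the equal mixing weight `disc_weight=0.5` and the common centre at the origin are assumptions, and all function names are hypothetical.

```python
import numpy as np

# Means and diagonal covariance entries of G_1, ..., G_4 as listed above.
MEANS = np.array([[1.0, 2.0], [-3.0, -5.0], [-5.0, 4.0], [15.0, 10.0]])
VARS = np.array([[2.0, 5.0], [1.0, 2.0], [8.0, 6.0], [4.0, 9.0]])

def sample_gmm(n, weights, rng):
    """Draw n points from the mixture of G_1..G_4 with the given weights."""
    comp = rng.choice(4, size=n, p=weights)
    noise = rng.standard_normal((n, 2))
    return MEANS[comp] + noise * np.sqrt(VARS[comp])

def sample_ring(n, r_in, r_out, rng):
    """Uniform (by area) on an annulus centred at the origin; r_in = 0 gives a disc."""
    r = np.sqrt(rng.uniform(r_in**2, r_out**2, size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

def sample_uniform_1(n, rng, disc_weight=0.5):
    """Disc of radius 0.5 mixed with annulus [4, 6]; disc_weight is an ASSUMPTION."""
    in_disc = rng.random(n) < disc_weight
    pts = sample_ring(n, 4.0, 6.0, rng)
    pts[in_disc] = sample_ring(int(in_disc.sum()), 0.0, 0.5, rng)
    return pts

def sample_uniform_2(n, rng):
    """Uniform on the square [-10, 10]^2."""
    return rng.uniform(-10.0, 10.0, size=(n, 2))

rng = np.random.default_rng(0)
gmm1 = sample_gmm(1000, [0.95, 0.01, 0.02, 0.02], rng)
gmm2 = sample_gmm(1000, [0.3, 0.2, 0.15, 0.35], rng)
uni1 = sample_uniform_1(1000, rng)
uni2 = sample_uniform_2(1000, rng)
print(gmm1.shape, gmm2.shape, uni1.shape, uni2.shape)
```

Sampling the squared radius uniformly, rather than the radius itself, is what makes the points uniform with respect to area.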

To consistently evaluate the performance gap between GKHR and GKH at the same order of magnitude, we propose the following criterion

D=\frac{D_{1}-D_{2}}{D_{1}+D_{2}}

where $D_{1}=\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}^{(1)}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n})$ and $D_{2}=\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}^{(2)}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n})$ ; here $\mathbf{X}^{(1)}_{\mathcal{I}^{*}_{m}}$ denotes the samples selected by GKHR and $\mathbf{X}^{(2)}_{\mathcal{I}^{*}_{m}}$ the samples selected by GKH. A positive value of $D$ implies that GKH outperforms GKHR, and a negative value of $D$ implies that GKHR outperforms GKH. A large absolute value of $D$ indicates a large performance gap.
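The criterion can be computed directly from kernel matrices. The sketch below assumes the usual kernel expansion of the squared $\alpha$ -MMD (the selected-selected term, minus $2\alpha$ times the cross term, plus $\alpha^{2}$ times the full-data term), which agrees with (11) when all points are selected. The Gaussian kernel, its bandwidth `sigma`, and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def alpha_mmd2(x_sel, x_all, alpha, sigma=1.0):
    """Squared alpha-MMD between the selected samples and the full dataset."""
    k_ss = gaussian_kernel(x_sel, x_sel, sigma).mean()
    k_sa = gaussian_kernel(x_sel, x_all, sigma).mean()
    k_aa = gaussian_kernel(x_all, x_all, sigma).mean()
    return k_ss - 2.0 * alpha * k_sa + alpha**2 * k_aa

def relative_gap(d1, d2):
    """Criterion D = (D1 - D2) / (D1 + D2); negative values favour the first method."""
    return (d1 - d2) / (d1 + d2)

rng = np.random.default_rng(0)
x_all = rng.normal(size=(300, 2))
m = 30
alpha = m / x_all.shape[0]
# Two arbitrary subsets stand in for the GKHR and GKH selections.
sel_1 = x_all[rng.choice(300, size=m, replace=False)]
sel_2 = x_all[rng.choice(300, size=m, replace=False)]
print(relative_gap(alpha_mmd2(sel_1, x_all, alpha), alpha_mmd2(sel_2, x_all, alpha)))
```

Because $D$ is normalised by $D_{1}+D_{2}$ , it always lies in $[-1,1]$ when both quantities are positive, which makes gaps comparable across dataset sizes.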

The experiments are conducted as follows. We generate 1000, 3000, 10000 and 30000 random samples from each of the four distributions, then use GKHR and GKH for sample selection under the low-budget setting, i.e., $m/n\leq 0.2$ . We set $\alpha=m/n$ . We report the results over ten independent runs in Figure 2, which shows that although the performance gap tends to grow as $m$ grows, the performance of GKHR is similar to that of GKH when $m$ is relatively small. Therefore, under the low-budget setting, GKHR and GKH perform similarly in minimizing $\alpha$ -MMD over various types of distributions, which suggests that GKHR works well in the sample selection task.
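To make the selection step concrete, the following is a minimal sketch of a greedy, without-replacement selection in the spirit of GKHR. It is reconstructed from the quantities in the proof of Theorem 5.6 and is not the pseudo code of Appendix A: at each step it picks the unselected index minimising $(\mathbf{K}\mathbf{w}_{p}-\mathbf{p}_{\alpha})_{i}$ and then applies the update $\mathbf{w}_{p+1}=p\mathbf{w}_{p}/(p+1)+\mathbf{e}_{j}/(p+1)$ . The Gaussian kernel, the bandwidth `sigma`, and the name `greedy_select` are assumptions, and the full $n\times n$ kernel matrix is built only for clarity on small $n$ .

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Full n x n Gaussian kernel matrix of the dataset x."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def greedy_select(x, m, sigma=1.0):
    """Select m distinct indices greedily, with alpha = m / n as in the experiments."""
    n = x.shape[0]
    alpha = m / n
    K = gaussian_kernel(x, sigma)
    p_alpha = alpha * K.mean(axis=1)      # alpha * p, with p_i = (1/n) sum_j k(x_i, x_j)
    kernel_sum = np.zeros(n)              # sum of k(x_i, x_l) over selected l
    available = np.ones(n, dtype=bool)
    selected = []
    for step in range(m):
        # (K w_p)_i for w_p uniform on the selected indices (zero vector at step 0).
        kw = kernel_sum / max(step, 1)
        score = kw - p_alpha
        score[~available] = np.inf        # enforce selection without replacement
        j = int(np.argmin(score))
        selected.append(j)
        available[j] = False
        kernel_sum += K[:, j]
    return np.array(selected)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
idx = greedy_select(x, m=20)
print(idx)
```

Dropping the `available` mask would allow repeated indices, which corresponds to selection with replacement; the comparison in this subsection is between these two variants.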

D.2 Datasets

For our experiments, we choose five common datasets: CIFAR-10/100, SVHN, STL-10 and ImageNet. CIFAR-10 and CIFAR-100 contain 60,000 images with 10 and 100 categories, respectively, among which 50,000 images are for training and 10,000 images are for testing; SVHN contains 73,257 images for training and 26,032 images for testing; STL-10 contains 5,000 images for training, 8,000 images for testing and 100,000 unlabeled images as extra training data. ImageNet spans 1,000 object classes and contains 1,281,167 training and 100,000 test images. The training sets of the above datasets are treated as the unlabeled dataset for sample selection.

D.3 Visualization of Selected Samples

To offer a more intuitive comparison between various sampling methods, we visualize samples chosen by stratified, random, $k$ -Means, USL, ActiveFT and RDSS (ours). We generate 5000 samples from a Gaussian mixture model defined on $\mathbb{R}^{2}$ with 10 components and uniform mixture weights. One hundred samples are selected from the entire dataset using each sampling method. The visualisation results in Figure 3 indicate that the samples selected by our method follow the distribution of the entire dataset more closely than those selected by the other methods.

D.4 Computational Complexity and Running Time

We compute the time complexity of various sampling methods and record the time each method requires to select 400 samples on the CIFAR-100 dataset. The results are presented in Table 5, where $m$ represents the annotation budget, $n$ denotes the total number of samples, and $T$ indicates the number of iterations. The sampling time is obtained by averaging the duration of three independent runs of the sampling code on an idle server without any other workload. As the results show, our method is more efficient than all other methods except random and stratified sampling. This gap is likely because the execution time of the other algorithms also depends on the number of iterations $T$ .

Table 5: Efficiency comparison with other sampling methods.

Method	Time complexity	Time (s)
Random	$O(n)$	$\approx 0$
Stratified	$O(n)$	$\approx 0$
$k$ -means	$O(mnT)$	$579.97$
USL	$O(mnT)$	$257.68$
ActiveFT	$O(mnT)$	$224.35$
RDSS (Ours)	$O(mn)$	$132.77$

D.5 Implementation Details of Supervised Learning Experiments

We use ResNet-18 [15] as the classification model for all AL approaches and our method. Specifically, we train the models for $300$ epochs using the SGD optimizer (initial learning rate = $0.1$ , weight decay = $5\times 10^{-4}$ , momentum = $0.9$ ) with batch size $128$ . Finally, we evaluate the performance with the Top-1 classification accuracy on the test set.

D.6 Direct Comparison with AL/SSAL

The comparative results with AL and SSAL approaches are shown in Figure 4 and Figure 5, respectively. The specific values corresponding to these two figures are listed in Table 6. The AL/SSAL results are taken from [7], [11] and [16].

Table 6: Comparative results with AL/SSAL approaches.

Dataset	CIFAR-10									CIFAR-100
Budget	40	250	500	1000	2000	4000	5000	7500	10000	400	2500	5000	7500	10000
Active Learning (AL)
CoreSet [35]	-	-	-	-	-	-	80.56	85.46	87.56	-	-	37.36	47.17	53.06
VAAL [36]	-	-	-	-	-	-	81.02	86.82	88.97	-	-	38.46	47.02	53.99
LearnLoss [56]	-	-	-	-	-	-	81.74	85.49	87.06	-	-	36.12	47.81	54.02
MCDAL [7]	-	-	-	-	-	-	81.01	87.24	89.40	-	-	38.90	49.34	54.14
Semi-Supervised Active Learning (SSAL)
CoreSetSSL [35]	-	-	90.94	92.34	93.30	94.02	-	-	-	-	-	63.14	66.29	68.63
CBSSAL [11]	-	-	91.84	92.93	93.78	94.55	-	-	-	-	-	63.73	67.14	69.34
TOD-Semi [16]	-	-	-	-	-	-	79.54	87.82	90.3	-	-	36.97	52.87	58.64
Semi-Supervised Learning (SSL) with RDSS
FlexMatch+RDSS (Ours)	94.69	95.21	-	-	-	95.71	-	-	-	48.12	67.27	-	-	73.21
FreeMatch+RDSS (Ours)	95.05	95.50	-	-	-	95.98	-	-	-	48.41	67.40	-	-	73.13

From these results, we make the following observations: (1) AL approaches often require significantly larger labelling budgets, at least 125 times that of RDSS on CIFAR-10 (5,000 versus 40 labels in Table 6). This is primarily because AL paradigms depend solely on labelled samples, not only for classification but also for feature learning. (2) SSAL and our methods leverage unlabeled samples and surpass traditional AL approaches. However, this may not directly reflect the advantages of RDSS, as such performance gains could be inherently attributed to the SSL paradigm itself. Nonetheless, these experimental outcomes suggest that SSL may be a more promising paradigm when annotation budgets are limited.

Appendix E Limitation

The choice of $\alpha$ depends only on the number of unlabeled data points and does not use any information about the shape of the data distribution. This may reduce the effectiveness of RDSS on datasets with complicated distribution structures. However, it outperforms fixed-ratio approaches on the datasets under different budget settings.

Appendix F Potential Societal Impact

Positive societal impact. Our method ensures the representativeness and diversity of the selected samples and significantly improves the performance of SSL methods, especially under low-budget settings. This reduces the cost and time of data annotation and is particularly beneficial for resource-constrained research and development environments, such as medical image analysis.

Negative societal impact. When selecting representative data for analysis and annotation, the processing of sensitive data may be involved, increasing the risk of data leakage, especially in sensitive fields such as medical care and finance. It is worth noting that most algorithms applied in these sensitive areas are subject to this risk.


