Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Qian Shao1,*, Jiangrui Kang2,*, Qiyuan Chen1,*, Zepeng Li1, Hongxia Xu1,
Yiwen Cao2, Jiajuan Liang2,†, and Jian Wu1,†
1Zhejiang University
2BNU-HKBU United International College
*These authors contributed equally to this work. †Corresponding authors. Emails: jiajuanliang@uic.edu.cn, wujian2000@zju.edu.cn.
Abstract

Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, as it reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that the choice of samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has long been under-explored. To fill this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion, $\alpha$-Maximum Mean Discrepancy ($\alpha$-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing $\alpha$-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.

1 Introduction

Semi-Supervised Learning (SSL) is a popular paradigm which reduces reliance on large amounts of labeled data in many deep learning tasks [37, 34, 55]. Previous SSL research mainly focuses on effectively utilising labelled and unlabeled data. Specifically, labelled data directly supervise model learning, while unlabeled data help learn a desirable model that makes consistent and unambiguous predictions [50]. Besides, we also find that how to select samples for annotation will greatly affect model performance, particularly under extremely low-budget settings (see Section 7.2).

The prevailing sample selection methods in SSL have many shortcomings. For example, random sampling may introduce imbalanced class distributions and inadequate coverage of the overall data distribution, resulting in poor performance. Stratified sampling randomly selects samples within each class, which is impractical in real-world scenarios where the label of each sample is unknown. Existing work also employs representativeness and diversity strategies to select appropriate samples for annotation. Representativeness [12] ensures that the selected subset is distributed similarly to the entire dataset, while diversity [51] aims to select informative samples by pushing them apart in feature space. Focusing on only one of these aspects presents significant limitations (Figure 1a and b). To address these issues, Xie et al. [54] and Wang et al. [47] combine the two strategies for sample selection. These methods set a fixed ratio between representativeness and diversity, which our empirical evidence shows restricts the ultimate performance (see Section 7.4). Fundamentally, they lack a theoretical basis to substantiate their effectiveness.

Figure 1: Visualization of selected samples from a dog dataset. The red and grey circles respectively symbolize the selected and unselected samples. a) The selected samples often contain an excessive number of highly similar instances, leading to redundancy; b) The selected samples contain too many edge points, unable to cover the entire dataset; c) The selected samples represent the entire dataset comprehensively and accurately.

We observe that Active Learning (AL) primarily focuses on selecting the right samples for annotation, and numerous studies transfer the sample selection methods of AL into SSL, giving rise to Semi-Supervised Active Learning (SSAL) [48]. However, most of these approaches exhibit several limitations: (1) They require randomly selected samples to begin with, which expends a portion of the labelling budget, making it difficult to work effectively with a very limited budget (e.g., 1% or even lower) [5]; (2) They involve human annotators in iterative cycles of labelling and training, leading to substantial labelling overhead [54]; (3) They are coupled with the model training so that samples for annotation need to be re-selected every time a model is trained [47]. In summary, selecting the appropriate samples for annotation is challenging in SSL.

To address these challenges, we propose a Representative and Diverse Sample Selection approach (RDSS) that requests annotations only once and operates independently of the downstream tasks. Specifically, inspired by the concept of Maximum Mean Discrepancy (MMD) [13], we design a novel criterion named $\alpha$-MMD. It aims to strike a balance between representativeness and diversity via a trade-off parameter $\alpha$ (Figure 1c), for which we find an optimal interval that adapts to different budgets. By using a modified Frank-Wolfe algorithm called Generalized Kernel Herding without Replacement (GKHR), we can obtain an efficient approximate solution to this minimization problem.

We prove that under certain Reproducing Kernel Hilbert Space (RKHS) assumptions, $\alpha$-MMD effectively bounds the difference between training with a constrained versus an unlimited labelling budget. This implies that our proposed method could significantly enhance the generalization ability of learning with limited labels. We also give a theoretical assessment of GKHR with some supplementary numerical experiments, showing that GKHR performs well in learning with limited labels.

Furthermore, we evaluate our proposed RDSS across several popular SSL frameworks on the datasets CIFAR-10/100 [18], SVHN [29], STL-10 [8] and ImageNet [9]. Extensive experiments show that RDSS outperforms other sample selection methods widely used in SSL, AL or SSAL, especially with a constrained annotation budget. Besides, ablation experimental results demonstrate that RDSS outperforms methods using a fixed ratio.

The main contributions of this article are as follows:

  • We propose RDSS, which selects representative and diverse samples for annotation to enhance SSL by minimizing a novel criterion, $\alpha$-MMD. Under low-budget settings, we develop a fast and efficient algorithm, GKHR, for optimization.

  • We prove that our method benefits the generalizability of the trained model under certain assumptions and rigorously establish an optimal interval for the trade-off parameter $\alpha$ that adapts to different budgets.

  • We compare RDSS with sample selection strategies widely used in SSL, AL or SSAL, and the results demonstrate its superior sample efficiency over these strategies. In addition, we conduct ablation experiments to verify our method’s superiority over the fixed-ratio approach.

2 Related Work

Semi-Supervised Learning. Semi-Supervised Learning (SSL) effectively utilizes sparse labeled data and abundant unlabeled data for model training. Consistency Regularization [33, 19, 42], Pseudo-Labeling [20, 53] and their hybrid strategies [37, 58, 34] are commonly used in SSL. Consistency Regularization ensures the model’s output stays stable even when there’s noise or small changes in the input, usually from the data augmentation [52]. Pseudo-labelling integrates high-confidence data pseudo-labels directly into training, adhering to entropy minimization [22]. Moreover, an integrative approach that combines the aforementioned strategies can also achieve substantial results [50, 55]. Even though these approaches have been proven effective, they usually assume that labelled samples are randomly selected from each class (i.e., stratified sampling), which is not practical in real-world scenarios where the label for each sample is unknown.

Active Learning. Active learning (AL) aims to optimize the learning process by selecting the appropriate samples for labelling, reducing reliance on large labelled datasets. There are two different criteria for sample selection: uncertainty and representativeness. Uncertainty sampling selects samples about which the current model is most uncertain. Earlier studies utilized posterior probability [21, 46], entropy [17, 25], and classification margin [44] to estimate uncertainty. Recent research regards uncertainty as training loss [16, 56], influence on model performance [10, 23] or the prediction discrepancies between multiple classifiers [7]. However, uncertainty sampling methods may exhibit performance disparities across different models, leading researchers to focus on representativeness sampling, which aims to align the distribution of the selected subset with that of the entire dataset [35, 36, 26]. Most AL approaches struggle to perform well under extremely low-label settings. This may be because they usually require randomly selected samples to begin with and involve human annotators in iterative cycles of labelling and training, leading to substantial labelling overhead.

Model-Free Subsampling. Subsampling is a statistical approach which selects a subset of size $m$ as a surrogate for the full dataset of size $n\gg m$. Model-free subsampling is preferred in data-driven modelling tasks, as it does not depend on model assumptions. There are two main kinds of popular model-free subsampling methods. The first is induced by minimizing statistical discrepancies, which forces the distribution of the subset to be similar to that of the full data; in other words, it selects representative subsamples. Examples include Wasserstein distance [12], energy distance [27], uniform design [59], maximum mean discrepancy [6] and generalized empirical $F$-discrepancy [60]. The other tends to select a diverse subset containing as many informative samples as possible [51]. The above-mentioned methodologies focus exclusively on either representativeness or diversity, which makes them difficult to apply effectively to SSL.

3 Problem Setup

Let $\mathcal{X}$ be the unlabeled data space, $\mathcal{Y}$ be the label space, $\mathbf{X}_n=\{\mathbf{x}_i\}_{i\in[n]}\subset\mathcal{X}$ be the full unlabeled dataset, and $\mathcal{I}_m=\{i_1,i_2,\cdots,i_m\}\subset[n]$ $(m<n)$ be an index set contained in $[n]$. Our goal is to find an index set $\mathcal{I}^*_m=\{i^*_1,i^*_2,\cdots,i^*_m\}\subset[n]$ $(m<n)$ such that the selected set of samples $\mathbf{X}_{\mathcal{I}^*_m}=\{\mathbf{x}_{i^*_1},\mathbf{x}_{i^*_2},\cdots,\mathbf{x}_{i^*_m}\}$ is the most informative. After that, we can get access to the true labels of the selected samples and use the set of labelled data $S=\{(\mathbf{x}_i,y_i)\}_{i\in\mathcal{I}^*_m}$ and the rest of the unlabeled data to train a deep learning model.

Following the methodology of previous works, we use representativeness and diversity as criteria for evaluating the informativeness of selected samples. Representativeness ensures the selected samples distribute similarly to the full unlabeled dataset. Diversity is proposed to prevent an excessive concentration of selected samples in high-density areas of the full unlabeled dataset. Furthermore, the cluster assumption in SSL suggests that the data tend to form discrete clusters, in which boundary points are likely to be located in the low-density area. Therefore, under this assumption, selected samples with diversity contain more boundary points than the non-diversified ones, which is desired in training classifiers.

As a result, our goal can be formulated by solving the following problem:

$$\max_{\mathcal{I}_m\subset[n]}\text{Rep}(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)+\lambda\,\text{Div}(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n), \tag{1}$$

where $\text{Rep}(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)$ and $\text{Div}(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)$ quantify the representativeness and diversity of the selected samples, respectively, and $\lambda$ is a hyperparameter that balances the trade-off between representativeness and diversity.

Besides, we adopt two further fundamental settings which benefit the implementation of the framework: (1) Low-budget learning. The budget for many real-world tasks that require sample selection is relatively low compared to the size of the unlabeled data. Therefore, we set $m/n\leq 0.2$ by default in what follows, including the analysis of the sampling algorithm and the experiments; (2) Sampling without replacement. Compared with sampling with replacement, sampling without replacement offers several benefits which better match our tasks, including reduced bias and variance, increased precision and enhanced representativeness [24, 43].

4 Representative and Diverse Sample Selection

The Representative and Diverse Sample Selection (RDSS) framework consists of two steps: (1) Quantification. We quantify the representativeness and diversity of the selected samples by proposing a novel concept called $\alpha$-MMD (6), where $\lambda$ is replaced by $\alpha$ as the trade-off hyperparameter; (2) Optimization. We optimize $\alpha$-MMD with the GKHR algorithm to obtain the optimally selected samples $\mathbf{X}_{\mathcal{I}^*_m}$.

4.1 Quantification of Diversity and Representativeness

In classical statistics and machine learning problems, the inner product of data points $\mathbf{x},\mathbf{y}\in\mathcal{X}$, denoted by $\langle\mathbf{x},\mathbf{y}\rangle$, is employed as a similarity measure between $\mathbf{x}$ and $\mathbf{y}$. However, linear functions can be very restrictive in real-world problems. In contrast, kernel methods use kernel functions $k(\mathbf{x},\mathbf{y})$, including Gaussian (RBF) kernels, Laplacian kernels and polynomial kernels, as non-linear similarity measures between $\mathbf{x}$ and $\mathbf{y}$; these are in fact inner products of the projections of $\mathbf{x}$ and $\mathbf{y}$ in some high-dimensional feature space [28].

Let $k(\cdot,\cdot)$ be a kernel function on $\mathcal{X}\times\mathcal{X}$. We employ $k(\cdot,\cdot)$ to measure the similarity between any two points, and the average similarity, denoted by

$$S_k(\mathbf{X}_{\mathcal{I}_m})=\frac{1}{m^2}\sum_{i\in\mathcal{I}_m}\sum_{j\in\mathcal{I}_m}k(\mathbf{x}_i,\mathbf{x}_j), \tag{2}$$

to measure the similarity among the selected samples. $S_k(\mathbf{X}_{\mathcal{I}_m})$ can therefore evaluate the diversity of $\mathbf{X}_{\mathcal{I}_m}$, since larger similarity implies smaller diversity.
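As a rough illustration of the average similarity in (2), the sketch below evaluates it with a Gaussian kernel on synthetic data. The helper names (`rbf_kernel`, `average_similarity`), the bandwidth of 1 and the toy Gaussian data are assumptions made for this example only and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def average_similarity(X, idx, bandwidth=1.0):
    """S_k of Eq. (2): mean of k(x_i, x_j) over all ordered pairs in the subset."""
    X_sel = X[idx]
    return rbf_kernel(X_sel, X_sel, bandwidth).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # toy stand-in for the unlabeled data
radius = np.linalg.norm(X, axis=1)

clustered = np.argsort(radius)[:10]    # 10 points packed around the origin
spread = np.argsort(radius)[-10:]      # 10 points on the outskirts

s_clustered = average_similarity(X, clustered)
s_spread = average_similarity(X, spread)
print(f"S_k clustered: {s_clustered:.3f}, S_k spread: {s_spread:.3f}")
```

In this toy setting the tightly packed subset should receive the larger $S_k$, matching the reading that larger similarity means smaller diversity.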

As a statistical discrepancy which measures the distance between distributions, the maximum mean discrepancy (MMD) is introduced here to quantify the representativeness of $\mathbf{X}_{\mathcal{I}_m}$ relative to $\mathbf{X}_n$. Proposed by Gretton et al. [13], MMD is formally defined below:

Definition 4.1 (Maximum Mean Discrepancy).

Let $P,Q$ be two Borel probability measures on $\mathcal{X}$. Let $f$ range over the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with its reproducing kernel $k(\cdot,\cdot)$, i.e., $\|f\|_{\mathcal{H}}\leq 1$. Then the MMD between $P$ and $Q$ is defined by

$$\operatorname{MMD}_k^2(P,Q):=\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\int f\,dP-\int f\,dQ\right)^2=\mathbb{E}\left[k(X,X')+k(Y,Y')-2k(X,Y)\right], \tag{3}$$

where $X,X'\sim P$ and $Y,Y'\sim Q$ are independent copies.
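The second equality in (3) is a standard consequence of the reproducing property. A brief sketch, assuming $k$ is bounded so that the kernel mean embeddings $\mu_P=\int_{\mathcal{X}}k(\cdot,\mathbf{x})\,dP(\mathbf{x})$ and $\mu_Q$ exist in $\mathcal{H}$ (the symbols $\mu_P,\mu_Q$ are introduced here for illustration only):

```latex
\begin{aligned}
\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\int f\,dP-\int f\,dQ\right)^2
&= \sup_{\|f\|_{\mathcal{H}}\leq 1}\langle f,\mu_P-\mu_Q\rangle_{\mathcal{H}}^2
 = \|\mu_P-\mu_Q\|_{\mathcal{H}}^2 \\
&= \langle\mu_P,\mu_P\rangle_{\mathcal{H}}+\langle\mu_Q,\mu_Q\rangle_{\mathcal{H}}
   -2\langle\mu_P,\mu_Q\rangle_{\mathcal{H}} \\
&= \mathbb{E}\left[k(X,X')\right]+\mathbb{E}\left[k(Y,Y')\right]-2\,\mathbb{E}\left[k(X,Y)\right].
\end{aligned}
```

The first step uses $\int f\,dP=\langle f,\mu_P\rangle_{\mathcal{H}}$, and the supremum is attained at $f=(\mu_P-\mu_Q)/\|\mu_P-\mu_Q\|_{\mathcal{H}}$ by the Cauchy-Schwarz inequality.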

We can next derive the empirical version of MMD, which measures the representativeness of $\mathbf{X}_{\mathcal{I}_m}=\{\mathbf{x}_i\}_{i\in\mathcal{I}_m}$ relative to $\mathbf{X}_n=\{\mathbf{x}_i\}_{i=1}^n$, by replacing $P,Q$ in (3) with the empirical distributions constructed from $\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n$:

$$\operatorname{MMD}_k^2(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n):=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_i,\mathbf{x}_j)+\frac{1}{m^2}\sum_{i\in\mathcal{I}_m}\sum_{j\in\mathcal{I}_m}k(\mathbf{x}_i,\mathbf{x}_j)-\frac{2}{mn}\sum_{i=1}^{n}\sum_{j\in\mathcal{I}_m}k(\mathbf{x}_i,\mathbf{x}_j). \tag{4}$$
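The empirical MMD in (4) reduces to three block averages of the Gram matrix. The following is a minimal sketch under assumed choices (Gaussian kernel with bandwidth 1, synthetic Gaussian data, and the hypothetical helper names `rbf_kernel` and `empirical_mmd2`); it compares a uniformly drawn subset with a deliberately one-sided subset.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def empirical_mmd2(X, idx, bandwidth=1.0):
    """Squared empirical MMD of Eq. (4) between the subset X[idx] and all of X."""
    idx = np.asarray(idx)
    K_nn = rbf_kernel(X, X, bandwidth)   # n x n Gram matrix of the full data
    K_mm = K_nn[np.ix_(idx, idx)]        # m x m block for the selected samples
    K_nm = K_nn[:, idx]                  # n x m cross block
    return K_nn.mean() + K_mm.mean() - 2.0 * K_nm.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

random_idx = rng.choice(len(X), size=20, replace=False)  # uniform, no replacement
biased_idx = np.argsort(X[:, 0])[-20:]                   # largest first coordinate only

mmd2_random = empirical_mmd2(X, random_idx)
mmd2_biased = empirical_mmd2(X, biased_idx)
mmd2_full = empirical_mmd2(X, np.arange(len(X)))         # subset equals the full set
print(mmd2_random, mmd2_biased, mmd2_full)
```

Selecting the whole dataset drives the value to zero up to round-off, while the one-sided subset should yield a clearly larger discrepancy than the uniform one.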

Optimization objective. Setting $\text{Rep}(\cdot,\cdot)=-\operatorname{MMD}_k^2(\cdot,\cdot)$ and $\text{Div}(\cdot)=-S_k(\cdot)$ in (1), where $k$ is a proper kernel function, our optimization objective becomes

$$\min_{\mathcal{I}_m\subset[n]}\operatorname{MMD}_k^2(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)+\lambda S_k(\mathbf{X}_{\mathcal{I}_m}). \tag{5}$$

Set $\lambda=\frac{1-\alpha}{\alpha m}$. Since $\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_i,\mathbf{x}_j)$ is a constant, the objective function in (5) can be rewritten as

$$\begin{aligned}
&\alpha\operatorname{MMD}_k^2(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)+\frac{1-\alpha}{m}S_k(\mathbf{X}_{\mathcal{I}_m})+\frac{\alpha(\alpha-1)}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_i,\mathbf{x}_j) \\
={}&\frac{\alpha^2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_i,\mathbf{x}_j)+\frac{1}{m^2}\sum_{i\in\mathcal{I}_m}\sum_{j\in\mathcal{I}_m}k(\mathbf{x}_i,\mathbf{x}_j)-\frac{2\alpha}{mn}\sum_{i=1}^{n}\sum_{j\in\mathcal{I}_m}k(\mathbf{x}_i,\mathbf{x}_j) \\
={}&\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i\in\mathcal{I}_m}f(\mathbf{x}_i)-\frac{\alpha}{n}\sum_{j=1}^{n}f(\mathbf{x}_j)\right)^2,
\end{aligned} \tag{6}$$

which defines a new concept called $\alpha$-MMD, denoted by $\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)$. This new concept distinguishes our method from existing methods and is essential for developing the sampling algorithm and the theoretical analysis. Note that $\alpha$-MMD degenerates to the classical MMD when $\alpha=1$ and to the average similarity when $\alpha=0$. As $\alpha$ decreases, $\lambda$ increases, thereby encouraging diversity in sample selection.
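The two degenerate cases, and the agreement between the expanded form and the supremum form in (6), can be checked numerically. The sketch below is illustrative only: it assumes a Gaussian kernel with bandwidth 1 and synthetic data, and the function names are hypothetical. It uses the standard fact that the supremum over the unit ball equals the squared RKHS norm of $\sum_i w_i k(\cdot,\mathbf{x}_i)$, i.e. $w^\top K w$, with weights $w_i=\mathbb{1}[i\in\mathcal{I}_m]/m-\alpha/n$.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def alpha_mmd2(K, idx, alpha):
    """Expanded form (second line of Eq. (6)) from the full n x n Gram matrix K."""
    idx = np.asarray(idx)
    return (
        alpha**2 * K.mean()
        + K[np.ix_(idx, idx)].mean()
        - 2.0 * alpha * K[:, idx].mean()
    )

def alpha_mmd2_rkhs(K, idx, alpha):
    """Supremum form (third line of Eq. (6)) evaluated as the quadratic form w^T K w."""
    n = K.shape[0]
    w = np.full(n, -alpha / n)
    # Assumes idx has no repeated entries (sampling without replacement).
    w[np.asarray(idx)] += 1.0 / len(idx)
    return w @ K @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = rbf_kernel(X, X)
idx = rng.choice(len(X), size=10, replace=False)

at_one = alpha_mmd2(K, idx, 1.0)    # should match the classical MMD^2 of Eq. (4)
at_zero = alpha_mmd2(K, idx, 0.0)   # should match the average similarity of Eq. (2)
mmd2 = K.mean() + K[np.ix_(idx, idx)].mean() - 2.0 * K[:, idx].mean()
s_k = K[np.ix_(idx, idx)].mean()
print(np.isclose(at_one, mmd2), np.isclose(at_zero, s_k))
```

Writing the criterion as a quadratic form in the Gram matrix also makes clear that it is non-negative whenever $K$ is positive semi-definite.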

Remark 1. In what follows, all kernels are assumed to be characteristic and positive definite unless otherwise specified. The advantages of these two properties are as follows.

Characteristic kernels. The MMD is in general only a pseudo-metric on the space of all Borel probability distributions, meaning that the MMD between two different distributions can be zero. Nevertheless, MMD becomes a proper metric when $k$ is a characteristic kernel, i.e., when the mean embedding $P\mapsto\int_{\mathcal{X}}k(\cdot,\mathbf{x})\,dP$ is injective over the Borel probability distributions $P$ on $\mathcal{X}$ [28]. Therefore, MMD induced by a characteristic kernel is more appropriate for measuring representativeness.

Positive definite kernels. Aronszajn [1] showed that every positive definite kernel $k(\cdot,\cdot)$, i.e., a kernel whose Gram matrix is always symmetric and positive definite, uniquely determines an RKHS $\mathcal{H}$, and vice versa. This property is important not only for analysing the properties of MMD [40] but also for optimizing MMD [31] with the Frank-Wolfe algorithm.

4.2 Sampling Algorithm

In previous research [35, 26, 47], sample selection is usually modelled as a non-convex combinatorial optimization problem. In contrast, following the idea of [3], we regard $\min_{\mathcal{I}_m\subset[n]}\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}_{\mathcal{I}_m},\mathbf{X}_n)$ as a convex optimization problem by exploiting the convexity of $\alpha$-MMD, and then solve it with a fast iterative minimization procedure derived from the Frank-Wolfe algorithm (see Appendix A for derivation details):

$$\mathbf{x}_{i^*_{p+1}}\in\mathop{\arg\min}_{i\in[n]}f_{\mathcal{I}^*_p}(\mathbf{x}_i),\quad \mathcal{I}^*_{p+1}\leftarrow\mathcal{I}^*_p\cup\{i^*_{p+1}\},\quad \mathcal{I}^*_0=\emptyset, \tag{7}$$

where $f_{\mathcal{I}_p}(\mathbf{x}_i)=\sum_{j\in\mathcal{I}_p}k(\mathbf{x}_i,\mathbf{x}_j)-\alpha p\sum_{l=1}^{n}k(\mathbf{x}_i,\mathbf{x}_l)/n$. As an extension of kernel herding [6], the corresponding algorithm (see Algorithm 2) is called Generalized Kernel Herding (GKH). Note that $f_{\mathcal{I}_p}(\mathbf{x}_i)$ is updated iteratively in Algorithm 2, which saves considerable running time. However, GKH can select the same sample more than once, which contradicts the setting of sampling without replacement. To address this issue, we propose a modified iteration based on (7):

$$\mathbf{x}_{i^*_{p+1}}\in\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^*_p}f_{\mathcal{I}^*_p}(\mathbf{x}_i),\quad \mathcal{I}^*_{p+1}\leftarrow\mathcal{I}^*_p\cup\{i^*_{p+1}\},\quad \mathcal{I}^*_0=\emptyset, \tag{8}$$

which rules out repeated samples. The corresponding algorithm (see Algorithm 1) is therefore named Generalized Kernel Herding without Replacement (GKHR) and is employed as the sampling algorithm for RDSS.

Algorithm 1 Generalized Kernel Herding without Replacement
Input: Data set $\mathbf{X}_n=\{\mathbf{x}_1,\cdots,\mathbf{x}_n\}\subset\mathcal{X}$; the number of selected samples $m<n$; a positive definite, characteristic and radial kernel $k(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$; trade-off parameter $\alpha\leq 1$.
Output: Selected samples $\mathbf{X}_{\mathcal{I}^*_m}=\{\mathbf{x}_{i^*_1},\cdots,\mathbf{x}_{i^*_m}\}$.
1:  For each $\mathbf{x}_i\in\mathbf{X}_n$, calculate $\mu(\mathbf{x}_i):=\sum_{j=1}^{n}k(\mathbf{x}_j,\mathbf{x}_i)/n$.
2:  Set $\beta_1=1$, $S_0=0$, $\mathcal{I}^*_0=\emptyset$.
3:  for $p\in\{1,\cdots,m\}$ do
4:     $i^*_p\in\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^*_{p-1}}S_{p-1}(\mathbf{x}_i)-\alpha\mu(\mathbf{x}_i)$
5:     For all $i\in[n]\backslash(\mathcal{I}^*_{p-1}\cup\{i^*_p\})$, update $S_p(\mathbf{x}_i)=(1-\beta_p)S_{p-1}(\mathbf{x}_i)+\beta_p k(\mathbf{x}_{i^*_p},\mathbf{x}_i)$
6:     $\mathcal{I}^*_p\leftarrow\mathcal{I}^*_{p-1}\cup\{i^*_p\}$, set $\beta_{p+1}=1/(p+1)$.
7:  end for

Computational complexity. Setting aside the cost of evaluating the kernel function, the computational complexity of GKHR is $O(mn)$, since in each of the $m$ iterations, lines 4 and 5 of Algorithm 1 each require $O(n)$ computations. Note that GKH has the same order of computational complexity as GKHR.
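
The selection loop of GKHR can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `gkhr_select` is hypothetical, and it assumes the full $n\times n$ Gram matrix has been precomputed, which costs $O(n^2)$ memory and is a simplification made for readability.

```python
import numpy as np

def gkhr_select(K, m, alpha):
    """Greedily pick m distinct indices from an (n, n) symmetric Gram matrix K."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")
    mu = K.mean(axis=0)                  # mu(x_i) = (1/n) * sum_j k(x_j, x_i)
    S = np.zeros(n)                      # S_0 = 0: mean similarity to selected points
    available = np.ones(n, dtype=bool)   # mask enforcing sampling without replacement
    selected = []
    for p in range(1, m + 1):
        scores = S - alpha * mu
        scores[~available] = np.inf      # already-selected indices cannot win the argmin
        i_star = int(np.argmin(scores))  # O(n) search
        beta = 1.0 / p                   # step size beta_p = 1/p
        S = (1.0 - beta) * S + beta * K[i_star]   # O(n) running-average update
        available[i_star] = False
        selected.append(i_star)
    return selected
```

Each iteration performs one masked argmin and one vector update, both linear in $n$, which is where the $O(mn)$ cost (excluding kernel evaluations) comes from.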

5 Theoretical Analysis

5.1 Generalization Bounds

Recall the core-set approach in [35], i.e., for any $h\in\mathcal{H}$,

$$R(h)\leq\widehat{R}_S(h)+|R(h)-\widehat{R}_T(h)|+|\widehat{R}_T(h)-\widehat{R}_S(h)|,$$

where $T$ is the full labelled dataset, $S\subset T$ is the core set, $R(h)$ is the expected risk of $h$, and $\widehat{R}_T(h),\widehat{R}_S(h)$ are the empirical risks of $h$ on $T$ and $S$, respectively. The first term $\widehat{R}_S(h)$ is unknown before we label the selected samples, and the second term $|R(h)-\widehat{R}_T(h)|$ can be upper bounded by the so-called generalization bounds [2], which do not depend on the choice of core set. Therefore, to control the upper bound of $R(h)$, we only need to analyse the upper bound of the third term $|\widehat{R}_T(h)-\widehat{R}_S(h)|$, called the core-set loss, which requires several mild assumptions.
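
The decomposition can be checked in one step: add and subtract the two empirical risks, then bound each difference by its absolute value. A short worked version, using only the symbols already defined:

```latex
\begin{aligned}
R(h) &= \widehat{R}_S(h) + \bigl(R(h)-\widehat{R}_T(h)\bigr) + \bigl(\widehat{R}_T(h)-\widehat{R}_S(h)\bigr) \\
     &\leq \widehat{R}_S(h) + \bigl|R(h)-\widehat{R}_T(h)\bigr| + \bigl|\widehat{R}_T(h)-\widehat{R}_S(h)\bigr| .
\end{aligned}
```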

Let $\mathcal{H}_1=\{h\mid h:\mathcal{X}\rightarrow\mathcal{Y}\}$ be a hypothesis set from which we select a predictor, and suppose that the labelled data $T=\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}$ are i.i.d. samples of a random vector $(X,Y)$ defined on $\mathcal{X}\times\mathcal{Y}$. We first assume that $\mathcal{H}_1$ is an RKHS, which is a mild assumption in machine learning theory [2, 4].

Assumption 5.1.

$\mathcal{H}_1$ is an RKHS associated with a bounded positive definite kernel $k_1$, where the norm of any $h\in\mathcal{H}_1$ is bounded by $K_h$.

We further make RKHS assumptions on the function spaces containing $\mathbb{E}(Y|X)$ and $\operatorname{Var}(Y|X)$, which are fundamental in the field of conditional distribution embedding [38, 40].

Assumption 5.2.

There is an RKHS $\mathcal{H}_2$ associated with a bounded positive definite kernel $k_2$ such that $\mathbb{E}(Y|X)\in\mathcal{H}_2$ and the norm of $\mathbb{E}(Y|X)$ is bounded by $K_m$.

Assumption 5.3.

There is an RKHS $\mathcal{H}_3$ associated with a bounded positive definite kernel $k_3$ such that $\operatorname{Var}(Y|X)\in\mathcal{H}_3$ and the norm of $\operatorname{Var}(Y|X)$ is bounded by $K_s$.

We next give an $\alpha$-MMD-type upper bound for the core-set loss in the following theorem:

Theorem 5.4.

Take $k=k_1^{2}+k_1k_2+k_3$. Then, under Assumptions 5.1-5.3, for any selected samples $S\subset T$, there exists a positive constant $K_c$ such that the following inequality holds:

$$|\widehat{R}_T(h)-\widehat{R}_S(h)|\leq K_c\left(\operatorname{MMD}_{k,\alpha}(\mathbf{X}_S,\mathbf{X}_T)+(1-\alpha)\sqrt{K}\right)^{2},$$

where $0\leq\alpha\leq 1$, $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})=K$, and $\mathbf{X}_S,\mathbf{X}_T$ are the projections of $S,T$ onto $\mathcal{X}$.

Therefore, minimizing $\alpha$-MMD can tighten the generalization bound for $R(h)$ and benefit the generalizability of the trained model (predictor).

5.2 Finite-Sample-Error-Bound for GKHR

The concept of convergence does not apply to the analysis of GKHR. With $n$ fixed, GKHR iterates at most $n$ times and then returns $\mathbf{X}_{\mathcal{I}^*_n}=\mathbf{X}_n$. Consequently, we analyze the performance of GKHR through its finite-sample error bound. Before doing so, we make an assumption on the mean of $f_{\mathcal{I}^*_p}$ over the full unlabeled dataset.

Assumption 5.5.

For any $\mathcal{I}^*_p$ returned by GKHR, $1\leq p\leq m-1$, there exist $p+1$ elements $\{\mathbf{x}_{j_l}\}_{l=1}^{p+1}$ in $\mathbf{X}_n$ such that

$$f_{\mathcal{I}^*_p}(\mathbf{x}_{j_1})\leq\cdots\leq f_{\mathcal{I}^*_p}(\mathbf{x}_{j_{p+1}})\leq\frac{\sum_{i=1}^{n}f_{\mathcal{I}^*_p}(\mathbf{x}_i)}{n}.$$

When $m$ is not small relative to $n$, this assumption is rather unrealistic. Nevertheless, under our low-budget setting, especially when $m\ll n$, the assumption becomes an extension of the principle that "the minimum is never larger than the mean", and so remains plausible. We can then show that the optimization error of GKHR decays at a rate upper bounded by $O(\log m/m)$:

Theorem 5.6.

Let $\mathbf{X}_{\mathcal{I}^*_m}$ be the samples selected by GKHR. Under Assumption 5.5, it holds that

$$\operatorname{MMD}^{2}_{k,\alpha}\left(\mathbf{X}_{\mathcal{I}^*_m},\mathbf{X}_n\right)\leq C^{2}_{\alpha}+B\frac{2+\log m}{m+1} \tag{9}$$

where $B=2K$, $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})=K$, and $C^{2}_{\alpha}=(1-\alpha)^{2}\overline{K}$, with $\overline{K}$ defined in Lemma B.6.

6 Choice of Kernel and Hyperparameter Tuning

In this section, we make some suggestions for choosing the kernel and tuning the hyperparameter $\alpha$.

Choice of kernel. Recalling Remark 1 in Section 4.1, we only consider characteristic and positive definite kernels in RDSS. Since Gaussian kernels are the most commonly used kernels in machine learning and statistics [2, 14], we adopt the Gaussian kernel, defined by $k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|_2^{2}/\sigma^{2})$. The bandwidth parameter $\sigma$ is set to the median distance between samples in the aggregate dataset [14], i.e., $\sigma=\operatorname{Median}(\{\|\mathbf{x}-\mathbf{y}\|_2\mid\mathbf{x},\mathbf{y}\in\mathbf{X}_n\})$, since the median is robust and also compromises between extreme cases.
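
The median heuristic and the resulting Gram matrix can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the helper names are hypothetical, the Gaussian kernel is taken as $\exp(-\|\mathbf{x}-\mathbf{y}\|_2^2/\sigma^2)$, and the median is computed over distinct pairs only, since the text does not say whether pairs with $\mathbf{x}=\mathbf{y}$ are included.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(d2, 0.0)           # clip tiny negatives caused by round-off

def median_bandwidth(X):
    """Median pairwise distance over distinct pairs (assumption: x = y pairs excluded)."""
    d2 = squared_distances(X)
    rows, cols = np.triu_indices(d2.shape[0], k=1)
    return float(np.median(np.sqrt(d2[rows, cols])))

def gaussian_gram(X, sigma=None):
    """Gram matrix with entries exp(-||x - y||^2 / sigma^2)."""
    if sigma is None:
        sigma = median_bandwidth(X)      # fall back to the median heuristic
    return np.exp(-squared_distances(X) / sigma**2)
```

For large $n$ the full distance matrix may not fit in memory; estimating the median on a random subset of pairs is a common workaround, although the text does not state whether that was done here.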

Tuning trade-off hyperparameter $\alpha$. According to Theorem 5.6 and Lemma B.3, by straightforward deduction we have

$$\operatorname{MMD}_{k}\left(\mathbf{X}_{\mathcal{I}^*_m},\mathbf{X}_n\right)\leq C_{\alpha}+\mathcal{O}\left(\sqrt{\frac{\log m}{m}}\right)+(1-\alpha)\sqrt{K},$$

which upper bounds the MMD between the selected samples and the full dataset under a low-budget setting. We can simply set $\alpha\in[1-\frac{1}{\sqrt{m}},1)$ so that the upper bound of the MMD is no larger, in order of magnitude, than that of the $\alpha$-MMD.
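
To see why this range suffices, here is a short order-of-magnitude check. It assumes $C_\alpha=(1-\alpha)\sqrt{\overline{K}}$, which follows from the definition of $C^2_\alpha$ in Theorem 5.6 when $\alpha\leq 1$, and that $K$ and $\overline{K}$ do not grow with $m$ (an assumption made for this sketch):

```latex
\alpha \geq 1-\frac{1}{\sqrt{m}}
\;\Longrightarrow\;
1-\alpha \leq \frac{1}{\sqrt{m}}
\;\Longrightarrow\;
C_{\alpha} + (1-\alpha)\sqrt{K}
\leq \frac{\sqrt{\overline{K}}+\sqrt{K}}{\sqrt{m}}
= \mathcal{O}\!\left(\frac{1}{\sqrt{m}}\right)
\subseteq \mathcal{O}\!\left(\sqrt{\frac{\log m}{m}}\right).
```

Under these assumptions both $\alpha$-dependent terms are absorbed into the $\mathcal{O}(\sqrt{\log m/m})$ term, so the MMD bound has the same order as the $\alpha$-MMD bound.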

7 Experiments

In this section, we first explain the implementation details of our method RDSS in Section 7.1. Next, we compare RDSS with other sampling methods by integrating them into two state-of-the-art (SOTA) SSL approaches (FlexMatch [58] and FreeMatch [50]) on five datasets (CIFAR-10/100, SVHN, STL-10 and ImageNet-1k) in Section 7.2. The details of the datasets, the visualization results and the computational complexity of different sampling methods are shown in Appendix D.2, D.3, and D.4, respectively. We also compare against various AL/SSAL approaches in Section 7.3. Lastly, we make quantitative analyses of the trade-off parameter $\alpha$ in Section 7.4.

7.1 Implementation Details of Our Method

First, we leverage the pre-trained image feature extraction capabilities of CLIP [32], a vision transformer architecture, to extract features. Subsequently, the [CLS] token features produced by the model's final output are employed for sample selection. During the sample selection phase, the Gaussian kernel is chosen as the kernel function to compute the similarity of samples in an infinite-dimensional feature space. The value of $\sigma$ for the Gaussian kernel is set as explained in Section 6. To ensure diversity in the sampled data, we introduce a penalty factor given by $\alpha=1-\frac{1}{\sqrt{m}}$, where $m$ denotes the number of selected samples. Concretely, we set $m\in\{40,250,4000\}$ for CIFAR-10, $m\in\{400,2500,10000\}$ for CIFAR-100, $m\in\{250,1000\}$ for SVHN, $m\in\{40,250\}$ for STL-10 and $m=100000$ for ImageNet. Next, the selected samples are used for two SSL approaches, which are trained and evaluated on the datasets using the codebase Unified SSL Benchmark (USB) [49]. The optimizer for all experiments is standard stochastic gradient descent (SGD) with a momentum of $0.9$ [41]. The initial learning rate is $0.03$ with a learning rate decay of $0.0005$. We use ResNet-50 [15] for the ImageNet experiment and Wide ResNet-28-2 [57] for other datasets. Finally, we evaluate the performance with the Top-1 classification accuracy metric on the test set. Experiments are run on 8 NVIDIA Tesla A100 (40 GB) GPUs and 2 Intel 6248R 24-core processors. We average our results over five independent runs.
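
For concreteness, the penalty factor $\alpha=1-1/\sqrt{m}$ can be tabulated for the budgets listed above. The snippet below is only an illustrative sketch; the names `BUDGETS`, `penalty_factor` and `ALPHAS` are hypothetical and not taken from the authors' code.

```python
import math

# Annotation budgets m per dataset, as listed in the text.
BUDGETS = {
    "CIFAR-10": [40, 250, 4000],
    "CIFAR-100": [400, 2500, 10000],
    "SVHN": [250, 1000],
    "STL-10": [40, 250],
    "ImageNet": [100000],
}

def penalty_factor(m):
    """alpha = 1 - 1/sqrt(m) for a budget of m selected samples."""
    if m <= 0:
        raise ValueError("m must be positive")
    return 1.0 - 1.0 / math.sqrt(m)

# alpha for every (dataset, budget) pair
ALPHAS = {name: {m: penalty_factor(m) for m in ms} for name, ms in BUDGETS.items()}

if __name__ == "__main__":
    for name, table in ALPHAS.items():
        print(name, {m: round(a, 4) for m, a in table.items()})
```

For example, $m=400$ gives $\alpha=0.95$ and $m=10000$ gives $\alpha=0.99$, so $\alpha$ approaches $1$ (the classical MMD) as the budget grows.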

Table 1: Comparison with other sampling methods. Due to stratified sampling limitations, the results are marked in grey. Top and second-best performances are bolded and underlined, respectively, excluding stratified sampling. Metrics represent mean accuracy and standard deviation over five independent runs. Column headers give the dataset with the annotation budget in parentheses.

| Method | CIFAR-10 (40) | CIFAR-10 (250) | CIFAR-10 (4000) | CIFAR-100 (400) | CIFAR-100 (2500) | CIFAR-100 (10000) | SVHN (250) | SVHN (1000) | STL-10 (40) | STL-10 (250) |
|---|---|---|---|---|---|---|---|---|---|---|
| Applied to FlexMatch [58] | | | | | | | | | | |
| Stratified | 91.45±3.41 | 95.10±0.25 | 95.63±0.24 | 50.23±0.41 | 67.38±0.45 | 73.61±0.43 | 89.60±1.86 | 93.66±0.49 | 75.33±3.74 | 92.29±0.64 |
| Random | 87.30±4.61 | 93.95±0.91 | 95.17±0.59 | 45.58±0.97 | 66.48±0.98 | 72.61±0.83 | 87.67±1.16 | 94.06±1.14 | 65.81±1.21 | 90.70±0.79 |
| $k$-Means | 81.23±8.71 | 94.59±0.51 | 95.09±0.65 | 41.60±1.24 | 65.99±0.57 | 71.53±0.42 | 90.28±0.69 | 93.82±1.04 | 55.43±0.39 | 90.64±1.05 |
| USL [47] | 91.73±0.13 | 94.89±0.20 | 95.43±0.15 | 46.89±0.46 | 66.75±0.37 | 72.53±0.32 | 90.03±0.63 | 93.10±0.78 | 75.65±0.60 | 90.77±0.36 |
| ActiveFT [54] | 70.87±4.14 | 93.85±1.37 | 95.31±0.75 | 25.69±0.64 | 57.19±2.06 | 70.96±0.75 | 89.32±1.87 | 92.53±0.43 | 55.57±1.42 | 87.28±1.19 |
| RDSS (Ours) | 94.69±0.28 | 95.21±0.47 | 95.71±0.10 | 48.12±0.36 | 67.27±0.55 | 73.21±0.29 | 91.70±0.39 | 95.70±0.35 | 77.96±0.52 | 93.16±0.41 |
| Applied to FreeMatch [50] | | | | | | | | | | |
| Stratified | 95.05±0.15 | 95.40±0.23 | 95.80±0.29 | 51.29±0.56 | 67.69±0.58 | 73.90±0.53 | 92.58±1.05 | 94.22±0.78 | 79.16±5.01 | 91.36±0.18 |
| Random | 93.41±1.24 | 93.98±0.91 | 95.56±0.17 | 47.16±1.25 | 66.09±1.08 | 72.09±0.99 | 91.62±1.88 | 94.40±1.28 | 76.66±2.43 | 90.72±0.97 |
| $k$-Means | 88.05±5.07 | 94.80±0.48 | 95.51±0.37 | 44.07±1.94 | 66.09±0.39 | 71.69±0.72 | 93.30±0.46 | 94.68±0.72 | 63.22±4.92 | 89.99±0.87 |
| USL [47] | 93.81±0.62 | 95.19±0.18 | 95.78±0.29 | 47.07±0.78 | 66.92±0.33 | 72.59±0.36 | 93.36±0.53 | 94.44±0.44 | 76.95±0.86 | 90.58±0.58 |
| ActiveFT [54] | 78.13±2.87 | 94.54±0.81 | 95.33±0.53 | 26.67±0.46 | 56.23±0.85 | 71.20±0.68 | 92.60±0.51 | 93.71±0.54 | 63.31±2.99 | 86.60±0.30 |
| RDSS (Ours) | 95.05±0.13 | 95.50±0.20 | 95.98±0.28 | 48.41±0.59 | 67.40±0.23 | 73.13±0.19 | 94.54±0.46 | 95.83±0.37 | 81.90±1.72 | 92.22±0.40 |

7.2 Comparison with Other Sampling Methods

Main results. We apply RDSS to FlexMatch and FreeMatch and compare it with three baselines and two SOTA methods in SSL under different annotation budget settings. The baselines include Stratified, Random and $k$-Means, while the two SOTA methods are USL [47] and ActiveFT [54]. The results are shown in Table 1, from which we make several observations: (1) Our proposed RDSS achieves the highest accuracy among the sampling methods that do not rely on ground-truth labels (i.e., all except Stratified), which underscores the effectiveness of our approach; (2) USL attains the second-best results under most budget settings yet exhibits a significant gap compared to RDSS, particularly under severely constrained budgets. For instance, on STL-10 with a budget of 40, FreeMatch with RDSS is 4.95% more accurate than FreeMatch with USL (81.90% vs. 76.95%); (3) In most experiments, RDSS either approaches or surpasses the performance of stratified sampling, especially on SVHN and STL-10. However, stratified sampling is practically infeasible because the category labels of the data are not known a priori.

Results on ImageNet. We also compare the second-best method, USL, with RDSS on ImageNet. Following the settings of FreeMatch [50], we select 100k samples for annotation. With RDSS and USL as sampling methods, FreeMatch achieves 58.24% and 56.86% accuracy, respectively, i.e., a gain of 1.38 percentage points for our method over USL.

7.3 Comparison with AL/SSAL Approaches

First, we compare RDSS against various traditional AL approaches on CIFAR-10/100. The AL approaches include CoreSet [35], VAAL [36], LearnLoss [56] and MCDAL [7]. Since AL relies solely on labelled samples for supervised learning, for a fair comparison we train on only the samples selected by RDSS, again with supervised learning. The implementation details are given in Appendix D.5. The experimental results are presented in Table 2, from which we observe that RDSS achieves the highest accuracy under almost all budget settings when relying solely on labelled data for supervised learning, with notable improvements on CIFAR-100.

Second, we compare RDSS with sampling methods used in SSAL when applied to the same SSL framework (i.e., FlexMatch or FreeMatch). The sampling methods include CoreSetSSL [35], MMA [39], CBSSAL [11], and TOD-Semi [16]. In detail, we tune recent SSAL approaches with their public implementations and run experiments under an extremely low-budget setting, i.e., 40 samples in a 20-random-and-20-selected setting. Table 3 shows that most SSAL approaches perform worse than random sampling under this extremely low-budget setting. This inefficiency stems from the dependency of sample selection on model performance within the SSAL framework, which struggles when the model is weak. Our model-free method, in contrast, selects samples before training, avoiding these pitfalls.

Table 2: Comparison with AL approaches under Supervised Learning (SL) paradigm. The best performance is bold and the second best performance is underlined.
Dataset CIFAR-10 CIFAR-100
Budget 7500 10000 7500 10000
CoreSet 85.46 87.56 47.17 53.06
VAAL 86.82 88.97 47.02 53.99
LearnLoss 85.49 87.06 47.81 54.02
MCDAL 87.24 89.40 49.34 54.14
SL+RDSS (Ours) 87.18 89.77 50.13 56.04
Whole Dataset 95.62 78.83
Table 3: Comparison with SSAL approaches. The green (red) arrow represents the improvement (decrease) compared to the random sampling method.
Method FlexMatch FreeMatch
Stratified 91.45 95.05
Random 87.30 93.41
CoreSetSSL 87.66 (↑0.36) 91.24 (↓2.17)
MMA 74.61 (↓12.69) 87.37 (↓6.04)
CBSSAL 86.58 (↓0.72) 91.68 (↓1.73)
TOD-Semi 86.21 (↓1.09) 90.77 (↓2.64)
RDSS (Ours) 94.69 (↑7.39) 95.05 (↑1.64)

Third, we directly compare RDSS applied to SSL with the above AL/SSAL approaches, which may better reflect the paradigm differences. The experimental results and analysis are in Appendix D.6.

7.4 Trade-off Parameter $\alpha$

We analyze the effect of different $\alpha$ with FreeMatch on CIFAR-10/100. The results are presented in Table 4, from which we make several observations: (1) Our proposed choice of $\alpha$ achieves the highest accuracy under all budget conditions, matching or surpassing every fixed value; (2) The values of $\alpha$ that achieve the best or the second-best performance lie within the interval we set, which is in line with our theoretical derivation in Section 6; (3) Removing either the representativeness or the diversity term reduces accuracy to varying degrees compared to our approach.

Table 4: Effect of different $\alpha$. The grey results indicate that $\alpha$ is outside the interval we set in Section 6, i.e., $\alpha<1-1/\sqrt{m}$, while the black results indicate that $\alpha$ is within the interval we set, i.e., $1-1/\sqrt{m}\leq\alpha\leq 1$. Among them, $\alpha=0$ and $\alpha=1$ indicate the removal of the representativeness and diversity terms, respectively. The best performance is bold, and the second-best performance is underlined.
Dataset CIFAR-10 CIFAR-100
Budget ($m$) 40 250 4000 400 2500 10000
0 85.54±0.48 93.55±0.34 94.58±0.27 39.26±0.52 63.77±0.26 71.90±0.17
0.40 92.28±0.24 93.68±0.13 94.95±0.12 42.56±0.47 65.88±0.24 71.71±0.29
0.80 94.42±0.49 94.94±0.37 95.15±0.35 45.62±0.35 66.87±0.20 72.45±0.23
0.90 94.33±0.28 95.03±0.21 95.20±0.42 48.12±0.50 67.14±0.16 72.15±0.23
0.95 94.44±0.64 95.07±0.26 95.45±0.38 48.41±0.59 67.11±0.29 72.80±0.35
0.98 94.51±0.39 95.02±0.15 95.31±0.44 48.33±0.54 67.40±0.23 72.68±0.22
1 94.53±0.42 95.01±0.23 95.54±0.25 48.18±0.36 67.20±0.29 73.05±0.18
$1-1/\sqrt{m}$ (Ours) 95.05±0.13 95.50±0.20 95.98±0.28 48.41±0.59 67.40±0.23 73.13±0.19
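
As a quick arithmetic check of the interval used in Table 4, the following sketch computes the lower bound $1-1/\sqrt{m}$ for each budget and lists which of the fixed values of $\alpha$ fall inside $[1-1/\sqrt{m}, 1]$. The function name `alpha_lower_bound`, the dictionary `inside`, and the small numerical tolerance are illustrative assumptions, not part of the original implementation.

```python
import math

def alpha_lower_bound(m: int) -> float:
    """Lower end of the interval [1 - 1/sqrt(m), 1] for the trade-off parameter."""
    return 1.0 - 1.0 / math.sqrt(m)

# Budgets and fixed alpha values as listed in Table 4.
budgets = [40, 250, 4000, 400, 2500, 10000]
fixed_alphas = [0.0, 0.40, 0.80, 0.90, 0.95, 0.98, 1.0]

TOL = 1e-12  # assumed tolerance so boundary values are not lost to rounding

inside = {}
for m in budgets:
    lo = alpha_lower_bound(m)
    # Keep the fixed values that satisfy 1 - 1/sqrt(m) <= alpha <= 1.
    inside[m] = [a for a in fixed_alphas if lo - TOL <= a <= 1.0]
    print(f"m={m:>5}: lower bound = {lo:.3f}, fixed values inside: {inside[m]}")
```

For the CIFAR-100 budgets $m=400$ and $m=2500$ the bound equals 0.95 and 0.98, respectively, which is why the last row of Table 4 coincides with the corresponding fixed-value rows in those columns.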

8 Conclusion

In this work, we propose a model-free sampling method, RDSS, to select a subset from unlabeled data for annotation in SSL. The primary innovation of our approach lies in the introduction of $\alpha$-MMD, designed to evaluate the representativeness and diversity of selected samples. Under a low-budget setting, we develop a fast and efficient algorithm, GKHR, for this problem using the Frank-Wolfe algorithm. Both theoretical analyses and empirical experiments demonstrate the effectiveness of RDSS. In future research, we would like to apply our methodology to scenarios where labelling is cost-prohibitive, such as in the medical domain.

References

  • Aronszajn [1950] N. Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
  • Bach [2021] F. Bach. Learning theory from first principles. Draft of a book, version of Sept, 6:2021, 2021.
  • Bach et al. [2012] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.
  • Bietti and Mairal [2019] A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. The Journal of Machine Learning Research, 20(1):876–924, 2019.
  • Chan et al. [2021] Y.-C. Chan, M. Li, and S. Oymak. On the marginal benefit of active learning: Does self-supervision eat its cake? In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3455–3459. IEEE, 2021.
  • Chen et al. [2012] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. arXiv preprint arXiv:1203.3472, 2012.
  • Cho et al. [2022] J. W. Cho, D.-J. Kim, Y. Jung, and I. S. Kweon. Mcdal: Maximum classifier discrepancy for active learning. IEEE transactions on neural networks and learning systems, 2022.
  • Coates et al. [2011] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Freytag et al. [2014] A. Freytag, E. Rodner, and J. Denzler. Selecting influential examples: Active learning with expected model output changes. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 562–577. Springer, 2014.
  • Gao et al. [2020] M. Gao, Z. Zhang, G. Yu, S. Ö. Arık, L. S. Davis, and T. Pfister. Consistency-based semi-supervised active learning: Towards minimizing labeling cost. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 510–526. Springer, 2020.
  • Graf and Luschgy [2007] S. Graf and H. Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
  • Gretton et al. [2006] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in neural information processing systems, 19, 2006.
  • Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang et al. [2021] S. Huang, T. Wang, H. Xiong, J. Huan, and D. Dou. Semi-supervised active learning with temporal output discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3447–3456, 2021.
  • Joshi et al. [2009] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE conference on computer vision and pattern recognition, pages 2372–2379. IEEE, 2009.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Laine and Aila [2016] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2016.
  • Lee et al. [2013] D.-H. Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
  • Lewis and Catlett [1994] D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994.
  • Li et al. [2023] M. Li, R. Wu, H. Liu, J. Yu, X. Yang, B. Han, and T. Liu. Instant: Semi-supervised learning with instance-dependent thresholds. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Liu et al. [2021] Z. Liu, H. Ding, H. Zhong, W. Li, J. Dai, and C. He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
  • Lohr [2021] S. L. Lohr. Sampling: design and analysis. Chapman and Hall/CRC, 2021.
  • Luo et al. [2013] W. Luo, A. Schwing, and R. Urtasun. Latent structured active learning. Advances in Neural Information Processing Systems, 26, 2013.
  • Mahmood et al. [2021] R. Mahmood, S. Fidler, and M. T. Law. Low budget active learning via wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.
  • Mak and Joseph [2018] S. Mak and V. R. Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018.
  • Muandet et al. [2017] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
  • Netzer et al. [2011] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Paulsen and Raghupathi [2016] V. I. Paulsen and M. Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge university press, 2016.
  • Pronzato [2021] L. Pronzato. Performance analysis of greedy algorithms for minimising a maximum mean discrepancy. arXiv preprint arXiv:2101.07564, 2021.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Sajjadi et al. [2016] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 29, 2016.
  • Schmutz et al. [2022] H. Schmutz, O. Humbert, and P.-A. Mattei. Don’t fear the unlabelled: safe semi-supervised learning via debiasing. In The Eleventh International Conference on Learning Representations, 2022.
  • Sener and Savarese [2018] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Sinha et al. [2019] S. Sinha, S. Ebrahimi, and T. Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–5981, 2019.
  • Sohn et al. [2020] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
  • Song et al. [2009] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
  • Song et al. [2019] S. Song, D. Berthelot, and A. Rostamizadeh. Combining mixmatch and active learning for better accuracy with fewer labels. arXiv preprint arXiv:1912.00594, 2019.
  • Sriperumbudur et al. [2012] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On the empirical estimation of integral probability metrics. 2012.
  • Sutskever et al. [2013] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
  • Tarvainen and Valpola [2017] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
  • Thompson [2012] S. K. Thompson. Sampling, volume 755. John Wiley & Sons, 2012.
  • Tong and Koller [2001] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
  • Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. [2016] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.
  • Wang et al. [2022a] X. Wang, L. Lian, and S. X. Yu. Unsupervised selective labeling for more effective semi-supervised learning. In European Conference on Computer Vision, pages 427–445. Springer, 2022a.
  • Wang et al. [2022b] X. Wang, Z. Wu, L. Lian, and S. X. Yu. Debiased learning from naturally imbalanced pseudo-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14647–14657, 2022b.
  • Wang et al. [2022c] Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961, 2022c.
  • Wang et al. [2022d] Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, et al. Freematch: Self-adaptive thresholding for semi-supervised learning. arXiv preprint arXiv:2205.07246, 2022d.
  • Wu et al. [2023] X. Wu, Y. Huo, H. Ren, and C. Zou. Optimal subsampling via predictive inference. Journal of the American Statistical Association, (just-accepted):1–29, 2023.
  • Xie et al. [2020a] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le. Unsupervised data augmentation for consistency training. Advances in neural information processing systems, 33:6256–6268, 2020a.
  • Xie et al. [2020b] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020b.
  • Xie et al. [2023] Y. Xie, H. Lu, J. Yan, X. Yang, M. Tomizuka, and W. Zhan. Active finetuning: Exploiting annotation budget in the pretraining-finetuning paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23715–23724, 2023.
  • Yang et al. [2023] L. Yang, Z. Zhao, L. Qi, Y. Qiao, Y. Shi, and H. Zhao. Shrinking class space for enhanced certainty in semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16187–16196, 2023.
  • Yoo and Kweon [2019] D. Yoo and I. S. Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 93–102, 2019.
  • Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference 2016. British Machine Vision Association, 2016.
  • Zhang et al. [2021] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408–18419, 2021.
  • Zhang et al. [2023a] J. Zhang, C. Meng, J. Yu, M. Zhang, W. Zhong, and P. Ma. An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. Journal of Computational and Graphical Statistics, 32(1):329–339, 2023a.
  • Zhang et al. [2023b] M. Zhang, Y. Zhou, Z. Zhou, and A. Zhang. Model-free subsampling method based on uniform designs. IEEE Transactions on Knowledge and Data Engineering, 2023b.

Appendix A Algorithms

A.1 Derivation of Generalized Kernel Herding (GKH)

Proof.

The proof technique is borrowed from [31]. Let us firstly define a weighted modification of $\alpha$-MMD. For any $\mathbf{w}\in\mathbb{R}^{n}$ such that $\mathbf{w}^{\top}\mathbf{1}=1$, the weighted $\alpha$-MMD is defined by

\[
\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}\mathbf{w}-2\alpha\mathbf{w}^{\top}\mathbf{p}+\alpha^{2}\overline{K},
\]

where $\mathbf{K}=[k(\mathbf{x}_{i},\mathbf{x}_{j})]_{1\leq i,j\leq n}$, $\overline{K}=\mathbf{1}^{\top}\mathbf{K}\mathbf{1}/n^{2}$, $\mathbf{p}=(\mathbf{e}_{1}^{\top}\mathbf{K}\mathbf{1}/n,\cdots,\mathbf{e}_{n}^{\top}\mathbf{K}\mathbf{1}/n)$, $\{\mathbf{e}_{i}\}_{i=1}^{n}$ is the set of standard basis of $\mathbb{R}^{n}$. It is obvious that for any $\mathcal{I}_{p}\subset[n]$,

\[
\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}_{p})=\text{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{p}},\mathbf{X}_{n}),
\]

where $(\mathbf{w}_{p})_{i}=1/p$ if $i\in\mathcal{I}_{p}$, and $(\mathbf{w}_{p})_{i}=0$ if not. Therefore, weighted $\alpha$-MMD is indeed a generalization of $\alpha$-MMD. Let

\[
\mathbf{K}_{*}=\mathbf{K}-2\alpha\mathbf{p}\mathbf{1}^{\top}+\alpha^{2}\overline{K}\mathbf{1}\mathbf{1}^{\top}
\]

we obtain the quadratic form expression of weighted $\alpha$-MMD by $\text{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}_{*}\mathbf{w}$, where $\mathbf{K}_{*}$ is strictly positive definite if $\mathbf{w}\not=\mathbf{w}_{n}$ and $k$ is a characteristic kernel according to [31]. Recall our low-budget setting and choice of kernel, $\mathbf{K}_{*}$ is indeed a strictly positive definite matrix. Thus $\text{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}$ is a convex functional w.r.t. $\mathbf{w}$, leading to the fact that $\min_{\mathbf{w}^{\top}\mathbf{1}=1}\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w})$ can be solved by Frank-Wolfe algorithm. Then for $1\leq p<n$,

\[
\mathbf{s}_{p}\in\mathop{\arg\min}_{\mathbf{s}^{\top}\mathbf{1}=1}\mathbf{s}^{\top}(\mathbf{K}\mathbf{w}_{p}-\alpha\mathbf{p})=\mathop{\arg\min}_{\mathbf{e}_{i},i\in[n]}\mathbf{e}_{i}^{\top}(\mathbf{K}\mathbf{w}_{p}-\alpha\mathbf{p}).
\]

Let $\mathbf{e}_{i_{p}}=\mathbf{s}_{p}$, under uniform step size, we have

\[
\mathbf{w}_{p+1}=\left(\frac{p}{p+1}\right)\mathbf{w}_{p}+\frac{1}{p+1}\mathbf{e}_{i_{p}}
\]

as the update formula of Frank-Wolfe algorithm, which is equivalent to

\[
i^{*}_{p}\in\arg\min_{i\in[n]}\sum_{j\in\mathcal{I}_{p}}k(\mathbf{x}_{i},\mathbf{x}_{j})-\frac{\alpha p}{n}\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l}),
\]

since $\mathbf{e}_{i}^{\top}\mathbf{K}\mathbf{w}_{p}=\frac{1}{p}\sum_{j\in\mathcal{I}_{p}}k(\mathbf{x}_{i},\mathbf{x}_{j})$ and $\mathbf{e}_{i}^{\top}\mathbf{p}=\frac{1}{n}\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l})$, and multiplying the objective by $p>0$ does not change the minimiser. Setting $\mathbf{w}_{0}=\mathbf{0}$, we immediately derive the iterating formula in (7). ∎
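
The derivation above can be illustrated with a minimal NumPy sketch, given under stated assumptions rather than as the authors' implementation. It evaluates the weighted $\alpha$-MMD, runs the Frank-Wolfe iteration with uniform step size, and uses a Gaussian kernel with bandwidth `sigma=1.0` on synthetic data; the function names, the kernel bandwidth, and the data are all hypothetical choices made for illustration. The sketch recomputes $\mathbf{K}\mathbf{w}_{p}$ at every step for clarity and does not prevent an index from being selected more than once.

```python
import numpy as np

def gaussian_kernel(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def weighted_alpha_mmd2(K: np.ndarray, w: np.ndarray, alpha: float) -> float:
    """Weighted alpha-MMD^2: w^T K w - 2 alpha w^T p + alpha^2 K_bar."""
    n = K.shape[0]
    p_vec = K.sum(axis=1) / n      # p_i = e_i^T K 1 / n
    k_bar = K.sum() / n**2         # K_bar = 1^T K 1 / n^2
    return float(w @ K @ w - 2.0 * alpha * (w @ p_vec) + alpha**2 * k_bar)

def generalized_kernel_herding(K: np.ndarray, m: int, alpha: float) -> list:
    """Frank-Wolfe selection with uniform step size 1/(p+1), starting at w_0 = 0."""
    n = K.shape[0]
    p_vec = K.sum(axis=1) / n
    w = np.zeros(n)
    selected = []
    for step in range(m):
        # Linear minimisation over the standard basis: argmin_i e_i^T (K w - alpha p).
        scores = K @ w - alpha * p_vec
        i_star = int(np.argmin(scores))
        selected.append(i_star)
        # Uniform-step update: w_{p+1} = p/(p+1) * w_p + 1/(p+1) * e_{i_p}.
        e = np.zeros(n)
        e[i_star] = 1.0
        w = (step / (step + 1)) * w + e / (step + 1)
    return selected

# Synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = gaussian_kernel(X)
m = 10
alpha = 1.0 - 1.0 / np.sqrt(m)
idx = generalized_kernel_herding(K, m, alpha)
w = np.bincount(idx, minlength=len(X)) / m
print("selected indices:", idx)
print("weighted alpha-MMD^2:", weighted_alpha_mmd2(K, w, alpha))
```

Because $\mathbf{w}_{0}=\mathbf{0}$, the first selected index is simply the point with the largest mean kernel value $\mathbf{e}_{i}^{\top}\mathbf{p}$; later steps trade this off against similarity to the points already chosen.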

A.2 Pseudo Codes

Algorithm 2 Generalized Kernel Herding
Input:  Data set $\mathbf{X}_{n}=\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\}\subset\mathcal{X}$; the number of selected samples $m<n$; a positive definite, characteristic and radial kernel $k(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$; trade-off parameter $\alpha\leq 1$.
Output:  Selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\cdots,\mathbf{x}_{i^{*}_{m}}\}$.
1:  For each $\mathbf{x}_{i}\in\mathbf{X}_{n}$ calculate $\mu(\mathbf{x}_{i}):=\sum_{j=1}^{n}k(\mathbf{x}_{j},\mathbf{x}_{i})/n$.
2:  Set $\beta_{1}=1$, $S_{0}=0$, $\mathcal{I}=\emptyset$.
3:  for $p\in\{1,\cdots,m\}$ do
4:     $i^{*}_{p}\in\arg\min_{i\in[n]}S_{p-1}(\mathbf{x}_{i})-\alpha\mu(\mathbf{x}_{i})$
5:     For all $i\in[n]$, update $S_{p}(\mathbf{x}_{i})=(1-\beta_{p})S_{p-1}(\mathbf{x}_{i})+\beta_{p}k(\mathbf{x}_{i^{*}_{p}},\mathbf{x}_{i})$
6:     $\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p}\}$, $p\leftarrow p+1$, set $\beta_{p}=1/p$.
7:  end for
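
The steps of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a Gaussian kernel with a hypothetical `bandwidth` parameter (the algorithm only asks for a positive definite, characteristic and radial kernel), all function and variable names are invented for the example, and it stores the full $n\times n$ Gram matrix, so it is only suitable for small $n$. As in step 4, the argmin ranges over all indices in $[n]$.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel; an assumed choice of radial kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def generalized_kernel_herding(points, m, alpha=1.0, kernel=gaussian_kernel):
    """Return the indices chosen by the loop of Algorithm 2, in selection order."""
    n = len(points)
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")

    # Step 1: mu(x_i) = (1/n) * sum_j k(x_j, x_i), via the symmetric Gram matrix.
    gram = [[kernel(xi, xj) for xj in points] for xi in points]
    mu = [sum(row) / n for row in gram]

    # Step 2: S_0 = 0, beta_1 = 1, empty index set.
    s = [0.0] * n
    beta = 1.0
    selected = []

    for p in range(1, m + 1):
        # Step 4: minimise S_{p-1}(x_i) - alpha * mu(x_i) over all i in [n].
        best = min(range(n), key=lambda i: s[i] - alpha * mu[i])
        # Step 5: S_p = (1 - beta_p) * S_{p-1} + beta_p * k(x_best, .).
        s = [(1.0 - beta) * s[i] + beta * gram[best][i] for i in range(n)]
        # Step 6: record the index and set the next step size beta_{p+1} = 1/(p+1).
        selected.append(best)
        beta = 1.0 / (p + 1)
    return selected


if __name__ == "__main__":
    # Two well-separated 1-D clusters; two picks should land in different clusters.
    data = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
    print(generalized_kernel_herding(data, m=2, alpha=1.0))
```

With the step sizes $\beta_{p}=1/p$, the running score $S_{p}$ is simply the average kernel value between each point and the samples selected so far, which is why the update needs only one Gram-matrix row per iteration.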

Appendix B Technical Lemmas

Lemma B.1 (Lemma 2 [31]).

Let $\left(t_{k}\right)_{k}$ and $\left(\alpha_{k}\right)_{k}$ be two real positive sequences and $A$ be a strictly positive real. If $t_{k}$ satisfies

\[
t_{1}\leq A \quad\text{and}\quad t_{k+1}\leq\left(1-\alpha_{k+1}\right)t_{k}+A\alpha_{k+1}^{2},\quad k\geq 1,
\]

with $\alpha_{k}=1/k$ for all $k$, then $t_{k}<A(2+\log k)/(k+1)$ for all $k>1$.
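
As a sanity check on Lemma B.1, consider the extreme case where both conditions hold with equality. Multiplying the recursion by $k+1$ gives $(k+1)t_{k+1}=kt_{k}+A/(k+1)$, so $t_{k}=AH_{k}/k$ with $H_{k}$ the $k$-th harmonic number. The sketch below (function names are illustrative, and the value of $A$ is arbitrary) iterates the recursion and compares it with the stated bound.

```python
import math


def worst_case_sequence(A, k_max):
    """Return [t_1, ..., t_{k_max}] with t_1 = A and equality in the recursion."""
    t = [A]  # t_1 = A
    for k in range(1, k_max):
        a_next = 1.0 / (k + 1)  # alpha_{k+1} = 1 / (k + 1)
        t.append((1.0 - a_next) * t[-1] + A * a_next ** 2)
    return t


def lemma_bound(A, k):
    """Right-hand side of the bound in Lemma B.1: A * (2 + log k) / (k + 1)."""
    return A * (2.0 + math.log(k)) / (k + 1)


if __name__ == "__main__":
    A = 2.0
    t = worst_case_sequence(A, 1000)
    for k in (2, 10, 100, 1000):
        # Each row: k, the worst-case t_k, and the bound it must stay below.
        print(k, t[k - 1], lemma_bound(A, k))
```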

Lemma B.2.

The selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}$ generated by GKH (Algorithm 2) satisfy

\[
\operatorname{MMD}_{k,\alpha}^{2}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq M_{\alpha}^{2}+B\,\frac{2+\log m}{m+1}, \tag{10}
\]

where $B=2K$, $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\leq K$, and $M_{\alpha}^{2}$ is defined by

\[
M_{\alpha}^{2}:=\min_{\mathbf{w}^{\top}\mathbf{1}=1,\ \mathbf{w}\geq 0}\operatorname{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}\left(\mathbf{w}\right).
\]
Proof.

Following the notation in Appendix A, let $\mathbf{p}_{\alpha}=\alpha\mathbf{p}$. Lemma B.2 then follows directly from the proof of the finite-sample-size error bound for kernel herding with predefined step sizes given in [31], with no additional technique required. The detailed proof is omitted. ∎

Lemma B.3.

Let $\mathcal{H}$ be an RKHS over $\mathcal{X}$ associated with a positive definite kernel $k$, and $0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\leq K$. Let $\mathbf{X}_{m}=\{\mathbf{x}_{i}\}_{i=1}^{m}$, $\mathbf{Y}_{n}=\{\mathbf{y}_{j}\}_{j=1}^{n}$, with $\mathbf{x}_{i},\mathbf{y}_{j}\in\mathcal{X}$. Then for any $\alpha\leq 1$,

\[
\left|\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{m},\mathbf{Y}_{n})-\operatorname{MMD}_{k}(\mathbf{X}_{m},\mathbf{Y}_{n})\right|\leq(1-\alpha)\sqrt{K}.
\]
Proof.
\begin{align*}
&\left|\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{m},\mathbf{Y}_{n})-\operatorname{MMD}_{k}(\mathbf{X}_{m},\mathbf{Y}_{n})\right| \\
={}&\left|\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i=1}^{m}f\left(\mathbf{x}_{i}\right)-\frac{\alpha}{n}\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right)-\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i=1}^{m}f\left(\mathbf{x}_{i}\right)-\frac{1}{n}\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right)\right| \\
\leq{}&\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\frac{1-\alpha}{n}\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right|=\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\sum_{j=1}^{n}f\left(\mathbf{y}_{j}\right)\right| \\
={}&\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\sum_{j=1}^{n}\left\langle f,k(\cdot,\mathbf{y}_{j})\right\rangle_{\mathcal{H}}\right|\leq\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\sum_{j=1}^{n}\left|\left\langle f,k(\cdot,\mathbf{y}_{j})\right\rangle_{\mathcal{H}}\right| \\
\leq{}&\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\sum_{j=1}^{n}\|f\|_{\mathcal{H}}\,\|k(\cdot,\mathbf{y}_{j})\|_{\mathcal{H}}\leq(1-\alpha)\sqrt{K}.
\end{align*}
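
The bound of Lemma B.3 can be checked numerically. The sketch below assumes a Gaussian kernel (so $K=1$) and uses the closed form obtained by expanding the supremum in the proof as an RKHS norm, $\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{m},\mathbf{Y}_{n})=\frac{1}{m^{2}}\sum_{i,i'}k(\mathbf{x}_{i},\mathbf{x}_{i'})-\frac{2\alpha}{mn}\sum_{i,j}k(\mathbf{x}_{i},\mathbf{y}_{j})+\frac{\alpha^{2}}{n^{2}}\sum_{j,j'}k(\mathbf{y}_{j},\mathbf{y}_{j'})$. The function names, the `bandwidth` parameter and the random test data are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel; k(x, x) = 1, so K = 1 in Lemma B.3."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def alpha_mmd(xs, ys, alpha, kernel=gaussian_kernel):
    """alpha-MMD between xs and ys, i.e. || mean_x k(., x) - alpha * mean_y k(., y) ||."""
    m, n = len(xs), len(ys)
    xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    squared = xx - 2.0 * alpha * xy + alpha ** 2 * yy
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


if __name__ == "__main__":
    rng = random.Random(0)
    ys = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(30)]
    xs = ys[:5]  # an arbitrary subset standing in for the selected samples
    K = 1.0
    for alpha in (0.0, 0.5, 0.9, 1.0):
        gap = abs(alpha_mmd(xs, ys, alpha) - alpha_mmd(xs, ys, 1.0))
        # Each row: alpha, the observed gap, and the bound (1 - alpha) * sqrt(K).
        print(alpha, gap, (1.0 - alpha) * math.sqrt(K))
```

Setting `alpha=1.0` recovers the ordinary MMD, so the gap in the last row is zero by construction.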

Lemma B.4 (Proposition 12.31 [45]).

Suppose that $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ are reproducing kernel Hilbert spaces of real-valued functions with domains $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$, and equipped with kernels $k_{1}$ and $k_{2}$, respectively. Then the tensor product space $\mathcal{H}=\mathcal{H}_{1}\otimes\mathcal{H}_{2}$ is an RKHS of real-valued functions with domain $\mathcal{X}_{1}\times\mathcal{X}_{2}$, and with kernel function

\[
k\left(\left(x_{1},x_{2}\right),\left(x_{1}^{\prime},x_{2}^{\prime}\right)\right)=k_{1}\left(x_{1},x_{1}^{\prime}\right)k_{2}\left(x_{2},x_{2}^{\prime}\right).
\]
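
A familiar instance of the product kernel in Lemma B.4: the product of two Gaussian kernels with the same bandwidth equals the Gaussian kernel on the concatenated inputs, since the squared distances add inside the exponential. The sketch below checks this numerically; `product_kernel` and the sample points are hypothetical names and values chosen for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def product_kernel(k1, k2):
    """Kernel on pairs (x1, x2) defined by k1(x1, x1') * k2(x2, x2')."""
    def k(pair, pair_prime):
        (x1, x2), (x1p, x2p) = pair, pair_prime
        return k1(x1, x1p) * k2(x2, x2p)
    return k


if __name__ == "__main__":
    k = product_kernel(gaussian_kernel, gaussian_kernel)
    u = ((0.0, 1.0), (2.0,))    # a point (x1, x2) in X1 x X2
    v = ((0.5, -1.0), (1.5,))   # a point (x1', x2') in X1 x X2
    # Concatenating the coordinates gives the same value under a single Gaussian.
    print(k(u, v), gaussian_kernel(u[0] + u[1], v[0] + v[1]))
```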
Lemma B.5 (Theorem 5.7 [30]).

Let $f\in\mathcal{H}_{1}$ and $g\in\mathcal{H}_{2}$, where $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ are two RKHSs of real-valued functions on $\mathcal{X}$, associated with positive definite kernels $k_{1},k_{2}$ and canonical feature maps $\phi_{1},\phi_{2}$, respectively. Then for any $x\in\mathcal{X}$,

\[
f(x)+g(x)=\left\langle f,\phi_{1}(x)\right\rangle_{\mathcal{H}_{1}}+\left\langle g,\phi_{2}(x)\right\rangle_{\mathcal{H}_{2}}=\left\langle f+g,(\phi_{1}+\phi_{2})(x)\right\rangle_{\mathcal{H}_{1}+\mathcal{H}_{2}},
\]

where

\[
\mathcal{H}_{1}+\mathcal{H}_{2}=\{f_{1}+f_{2}\mid f_{i}\in\mathcal{H}_{i}\}
\]

and $\phi_{1}+\phi_{2}$ is the canonical feature map of $\mathcal{H}_{1}+\mathcal{H}_{2}$. Furthermore,

\[
\|f+g\|_{\mathcal{H}_{1}+\mathcal{H}_{2}}^{2}\leq\|f\|_{\mathcal{H}_{1}}^{2}+\|g\|_{\mathcal{H}_{2}}^{2}.
\]
Lemma B.6.

For any unlabeled dataset $\mathbf{X}_{n}\subset\mathcal{X}$ and any subset $\mathbf{X}_{\mathcal{I}_{m}}$,

\[
\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{n},\mathbf{X}_{n})=(1-\alpha)^{2}\overline{K},\qquad\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})\leq(1+\alpha^{2})K,
\]

where $\overline{K}=\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{j})/n^{2}$ and $K=\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})$.

Lemma B.6 follows directly from the definition of $\alpha$-MMD.
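
For completeness, the computation behind Lemma B.6 can be sketched as follows. The notation $\hat{\mu}_{n}=\frac{1}{n}\sum_{i=1}^{n}k(\cdot,\mathbf{x}_{i})$ and $\hat{\mu}_{\mathcal{I}_{m}}=\frac{1}{m}\sum_{i\in\mathcal{I}_{m}}k(\cdot,\mathbf{x}_{i})$ is introduced here for illustration, and the RKHS-norm form of $\alpha$-MMD is the one implied by the supremum in the proof of Lemma B.3. The inequality in the second line additionally assumes that the cross term $\alpha\langle\hat{\mu}_{\mathcal{I}_{m}},\hat{\mu}_{n}\rangle_{\mathcal{H}}$ is non-negative, which holds for instance for a non-negative kernel with $0\leq\alpha\leq 1$; this is an assumption of the sketch rather than a condition stated in the lemma.

```latex
\begin{align*}
\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{n},\mathbf{X}_{n})
  &= \left\|\hat{\mu}_{n}-\alpha\hat{\mu}_{n}\right\|_{\mathcal{H}}^{2}
   = (1-\alpha)^{2}\left\|\hat{\mu}_{n}\right\|_{\mathcal{H}}^{2}
   = (1-\alpha)^{2}\,\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{j})
   = (1-\alpha)^{2}\,\overline{K}, \\
\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})
  &= \left\|\hat{\mu}_{\mathcal{I}_{m}}\right\|_{\mathcal{H}}^{2}
     - 2\alpha\left\langle\hat{\mu}_{\mathcal{I}_{m}},\hat{\mu}_{n}\right\rangle_{\mathcal{H}}
     + \alpha^{2}\left\|\hat{\mu}_{n}\right\|_{\mathcal{H}}^{2}
   \leq K+\alpha^{2}K = (1+\alpha^{2})K.
\end{align*}
```

The last step uses $|k(\mathbf{x},\mathbf{y})|\leq\sqrt{k(\mathbf{x},\mathbf{x})\,k(\mathbf{y},\mathbf{y})}\leq K$, so that both squared norms are at most $K$.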

Appendix C Proof of Theorems

Proof for Theorem 5.4.

First, define $\mathcal{H}_{4}=\mathcal{H}_{1}\otimes\mathcal{H}_{1}+\mathcal{H}_{1}\otimes\mathcal{H}_{2}+\mathcal{H}_{3}$, with kernel $k_{4}=k_{1}^{2}+k_{1}k_{2}+k_{3}$ and canonical feature map $\phi_{4}=\phi_{1}\otimes\phi_{1}+\phi_{1}\otimes\phi_{2}+\phi_{3}$.

Under the assumptions of Theorem 5.4, and according to Theorem 4 in [38], we have for any $\mathbf{x}\in\mathcal{X}$,

\[
h(\mathbf{x})=\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}},\qquad\mathbb{E}[Y|\mathbf{x}]=\left\langle\mathbb{E}[Y|X],\phi_{2}(\mathbf{x})\right\rangle_{\mathcal{H}_{2}},
\]
\[
\operatorname{Var}(Y|\mathbf{x})=\left\langle\operatorname{Var}(Y|X),\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}},
\]

where $\phi_{1},\phi_{2},\phi_{3}$ are the canonical feature maps of $\mathcal{H}_{1},\mathcal{H}_{2},\mathcal{H}_{3}$. Write $m=\mathbb{E}[Y|X]$ and $s=\operatorname{Var}(Y|X)$. By definition,

\[
R(h)=\mathbb{E}\left[\ell(h(\mathbf{x}),y)\right]=\int_{\mathcal{X}}\int_{\mathcal{Y}}\ell(h(\mathbf{x}),y)\,p(y|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}\,dy=\int_{\mathcal{X}}f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x},
\]

where

\begin{align*}
f(\mathbf{x}) &=\int_{\mathcal{Y}}(y-h(\mathbf{x}))^{2}\,p(y|\mathbf{x})\,dy \\
&=\operatorname{Var}(Y|\mathbf{x})-2h(\mathbf{x})\,\mathbb{E}[Y|\mathbf{x}]+h^{2}(\mathbf{x}) \\
&=\left\langle s,\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}-2\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\left\langle m,\phi_{2}(\mathbf{x})\right\rangle_{\mathcal{H}_{2}}+\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}} \\
&=\left\langle s,\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}-\left\langle 2h\otimes m,(\phi_{1}\otimes\phi_{2})(\mathbf{x})\right\rangle_{\mathcal{H}_{1}\otimes\mathcal{H}_{2}}+\left\langle h\otimes h,(\phi_{1}\otimes\phi_{1})(\mathbf{x})\right\rangle_{\mathcal{H}_{1}\otimes\mathcal{H}_{1}} \\
&=\left\langle s-2h\otimes m+h\otimes h,\phi_{4}(\mathbf{x})\right\rangle_{\mathcal{H}_{4}},
\end{align*}

where the fourth equality holds by Lemma B.4 and the last by Lemma B.5. Hence $f\in\mathcal{H}_{4}$, and

$$
\begin{aligned}
\|f\|_{\mathcal{H}_{4}} &= \|s-2h\otimes m+h\otimes h\|_{\mathcal{H}_{4}} \\
&\leq \|s\|_{\mathcal{H}_{4}}+\|2h\otimes m\|_{\mathcal{H}_{4}}+\|h\otimes h\|_{\mathcal{H}_{4}} \\
&\leq \|s\|_{\mathcal{H}_{3}}+2\|m\|_{\mathcal{H}_{2}}\|h\|_{\mathcal{H}_{1}}+\|h\otimes h\|_{\mathcal{H}_{1}\otimes\mathcal{H}_{1}} \\
&= \|s\|_{\mathcal{H}_{3}}+2\|m\|_{\mathcal{H}_{2}}\|h\|_{\mathcal{H}_{1}}+\|h\|_{\mathcal{H}_{1}}^{2} \\
&\leq K_{h}^{2}+2K_{h}K_{m}+K_{s}
\end{aligned}
$$

where the second inequality holds by Lemma B.5. Therefore, letting $\beta=1/(K_{h}^{2}+2K_{h}K_{m}+K_{s})$, we have $\|\beta f\|_{\mathcal{H}_{4}}=\beta\|f\|_{\mathcal{H}_{4}}\leq 1$. Then

$$
\begin{aligned}
\left|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)\right|
&= \left|\int_{\mathcal{X}}f(\mathbf{x})\,dP_{T}(\mathbf{x})-\int_{\mathcal{X}}f(\mathbf{x})\,dP_{S}(\mathbf{x})\right| \\
&= (K_{h}^{2}+2K_{h}K_{m}+K_{s})\left|\int_{\mathcal{X}}\beta f(\mathbf{x})\,dP_{T}(\mathbf{x})-\int_{\mathcal{X}}\beta f(\mathbf{x})\,dP_{S}(\mathbf{x})\right| \\
&\leq (K_{h}^{2}+2K_{h}K_{m}+K_{s})\sup_{\|f\|_{\mathcal{H}_{4}}\leq 1}\left|\int_{\mathcal{X}}f(\mathbf{x})\,dP_{T}(\mathbf{x})-\int_{\mathcal{X}}f(\mathbf{x})\,dP_{S}(\mathbf{x})\right| \\
&= (K_{h}^{2}+2K_{h}K_{m}+K_{s})\operatorname{MMD}_{k_{4}}(\mathbf{X}_{S},\mathbf{X}_{T})
\end{aligned}
$$

where $P_{T}$ denotes the empirical distribution constructed from $\mathbf{X}_{T}$, and $P_{S}$ likewise from $\mathbf{X}_{S}$. Combining this bound with Lemma B.3 yields Theorem 5.4. ∎

Proof for Theorem 5.6.

Following the notations in Appendix A, we further define

$$
\mathbf{w}_{*}=\mathbf{1}/n,\qquad C^{2}_{\alpha}=\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w}_{*})=(1-\alpha)^{2}\overline{K} \tag{11}
$$
$$
\widehat{\mathbf{w}}=\mathop{\arg\min}_{\mathbf{1}^{\top}\mathbf{w}=1}\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\alpha\left(\mathbf{K}^{-1}-\frac{\mathbf{K}^{-1}\mathbf{1}\mathbf{1}^{\top}\mathbf{K}^{-1}}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}}\right)\mathbf{p}+\frac{\mathbf{K}^{-1}\mathbf{1}}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}}
$$

Let $\mathbf{p}_{\alpha}=\alpha\mathbf{p}$; then $(\mathbf{p}_{\alpha}-\mathbf{K}\widehat{\mathbf{w}})\propto\mathbf{1}$. Define

$$
\Delta_{\alpha}(\mathbf{w}):=\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})-C_{\alpha}^{2}=\widehat{g}(\mathbf{w})-\widehat{g}(\mathbf{w}_{*})
$$

where $\widehat{g}(\mathbf{w})=\left(\mathbf{w}-\widehat{\mathbf{w}}\right)^{\top}\mathbf{K}\left(\mathbf{w}-\widehat{\mathbf{w}}\right)$. The details of this equality are omitted, since they follow the proof of the alternative expression of MMD in Pronzato [31]. By the convexity of $\widehat{g}(\cdot)$, for $j=\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})$,

$$
\widehat{g}\left(\mathbf{w}_{*}\right)\geq\widehat{g}\left(\mathbf{w}_{p}\right)+2\left(\mathbf{w}_{*}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)\geq\widehat{g}\left(\mathbf{w}_{p}\right)+2\min_{j\in[n]\backslash\mathcal{I}^{*}_{p}}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)
$$

where the second inequality holds under the assumption of Theorem 5.6, since

$$
\begin{aligned}
\left(\mathbf{w}_{*}-\mathbf{e}_{j}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right) &= \left(\mathbf{w}_{*}-\mathbf{e}_{j}\right)^{\top}\left(\mathbf{K}\mathbf{w}_{p}-\mathbf{p}_{\alpha}\right) \\
&= \frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}-f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{p+1}})\geq\frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}-f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j})\geq 0
\end{aligned}
$$

Therefore, for $B=2K$, we have

$$
\begin{aligned}
\Delta_{\alpha}(\mathbf{w}_{p+1}) &= \widehat{g}\left(\mathbf{w}_{p}\right)-\widehat{g}\left(\mathbf{w}_{*}\right)+\frac{2}{p+1}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)+\frac{1}{(p+1)^{2}}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right) \\
&\leq \frac{p}{p+1}\left(\widehat{g}\left(\mathbf{w}_{p}\right)-\widehat{g}\left(\mathbf{w}_{*}\right)\right)+\frac{1}{(p+1)^{2}}B=\frac{p}{p+1}\Delta_{\alpha}(\mathbf{w}_{p})+\frac{1}{(p+1)^{2}}B
\end{aligned}
\tag{12}
$$

where $\mathbf{w}_{p+1}=p\mathbf{w}_{p}/(p+1)+\mathbf{e}_{j}/(p+1)$, and $B$ clearly upper bounds $\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)$. Since $\alpha\leq 1$, it follows from Lemma B.6 that

$$
\Delta_{\alpha}(\mathbf{w}_{1})\leq\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w}_{1})\leq(1+\alpha^{2})K\leq B
$$

therefore by Lemma B.1, we have

$$
\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n})=\operatorname{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}_{p})\leq C_{\alpha}^{2}+B\frac{2+\log m}{m+1}
$$
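As a sanity check on the $\log m/m$ rate, the one-step recursion can be unrolled by hand. The sketch below is not necessarily the argument of Lemma B.1, whose statement is not reproduced in this part of the appendix. It assumes only the initial bound $\Delta_{\alpha}(\mathbf{w}_{1})\leq B$ and the one-step bound $\Delta_{\alpha}(\mathbf{w}_{p+1})\leq\frac{p}{p+1}\Delta_{\alpha}(\mathbf{w}_{p})+\frac{B}{(p+1)^{2}}$; the summation index $q$ is introduced here for illustration.

```latex
\begin{aligned}
(p+1)\,\Delta_{\alpha}(\mathbf{w}_{p+1}) &\leq p\,\Delta_{\alpha}(\mathbf{w}_{p})+\frac{B}{p+1}
  && \text{(multiply the one-step bound by } p+1\text{)} \\
m\,\Delta_{\alpha}(\mathbf{w}_{m}) &\leq \Delta_{\alpha}(\mathbf{w}_{1})+B\sum_{q=2}^{m}\frac{1}{q}
  \leq B\sum_{q=1}^{m}\frac{1}{q}\leq B\,(1+\log m)
  && \text{(telescope over } p=1,\dots,m-1\text{)} \\
\Delta_{\alpha}(\mathbf{w}_{m}) &\leq B\,\frac{1+\log m}{m}\leq B\,\frac{2+\log m}{m+1}
  && \text{(since } 1+\log m\leq m\text{)}
\end{aligned}
```

Adding $C_{\alpha}^{2}$ to both sides of the last line gives a bound of the same form as the one stated above.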

Appendix D Additional Experimental Details and Results

D.1 Supplementary Numerical Experiments on GKHR

Since GKH is a convergent algorithm (Lemma B.2) and its finite-sample-size error bound (10) holds without any assumption on the data, we conduct numerical experiments to empirically compare GKHR with GKH on datasets generated from four different distributions on $\mathbb{R}^{2}$.

First, we define four distributions on $\mathbb{R}^{2}$:

1. Gaussian mixture model 1, which consists of four Gaussian distributions $G_{1},G_{2},G_{3},G_{4}$ with mixture weights $[0.95,0.01,0.02,0.02]$,

2. Gaussian mixture model 2, which consists of four Gaussian distributions $G_{1},G_{2},G_{3},G_{4}$ with mixture weights $[0.3,0.2,0.15,0.35]$,

3. Uniform distribution 1, which consists of a uniform distribution defined on a circle with radius $0.5$ and a uniform distribution defined on an annulus with inner radius $4$ and outer radius $6$,

4. Uniform distribution 2, defined on $[-10,10]^{2}$,

where

$$
G_{1}=\mathcal{N}\left(\begin{bmatrix}1\\ 2\end{bmatrix},\begin{bmatrix}2&0\\ 0&5\end{bmatrix}\right),\quad G_{2}=\mathcal{N}\left(\begin{bmatrix}-3\\ -5\end{bmatrix},\begin{bmatrix}1&0\\ 0&2\end{bmatrix}\right),
$$
$$
G_{3}=\mathcal{N}\left(\begin{bmatrix}-5\\ 4\end{bmatrix},\begin{bmatrix}8&0\\ 0&6\end{bmatrix}\right),\quad G_{4}=\mathcal{N}\left(\begin{bmatrix}15\\ 10\end{bmatrix},\begin{bmatrix}4&0\\ 0&9\end{bmatrix}\right).
$$
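For readers who want to reproduce data of this kind, the following is a minimal sketch of samplers for the four distributions, using only the Python standard library. It is not the authors' code. The function names are hypothetical, the circle and annulus are assumed to be centred at the origin, and the mixing weight between circle and annulus (`disk_weight=0.5`) is an assumption, since the text does not state it.

```python
import math
import random

# (mean, diagonal of covariance) for G1..G4 as listed in the text.
GAUSSIANS = [
    ((1.0, 2.0), (2.0, 5.0)),
    ((-3.0, -5.0), (1.0, 2.0)),
    ((-5.0, 4.0), (8.0, 6.0)),
    ((15.0, 10.0), (4.0, 9.0)),
]
GMM1_WEIGHTS = [0.95, 0.01, 0.02, 0.02]
GMM2_WEIGHTS = [0.3, 0.2, 0.15, 0.35]


def sample_gmm(n, weights, rng):
    """Draw n points from the mixture of G1..G4 with the given weights."""
    points = []
    for mean, var in rng.choices(GAUSSIANS, weights=weights, k=n):
        # Diagonal covariance: each coordinate is an independent Gaussian.
        points.append(tuple(rng.gauss(mu, math.sqrt(v)) for mu, v in zip(mean, var)))
    return points


def sample_ring(n, r_in, r_out, rng):
    """Draw n points uniformly (by area) from the set r_in <= r <= r_out."""
    points = []
    for _ in range(n):
        # Sampling r^2 uniformly keeps the density uniform in area.
        r = math.sqrt(rng.uniform(r_in ** 2, r_out ** 2))
        theta = rng.uniform(0.0, 2.0 * math.pi)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def sample_uniform1(n, rng, disk_weight=0.5):
    """Circle of radius 0.5 plus annulus [4, 6]; disk_weight is an assumption."""
    n_disk = sum(rng.random() < disk_weight for _ in range(n))
    return sample_ring(n_disk, 0.0, 0.5, rng) + sample_ring(n - n_disk, 4.0, 6.0, rng)


def sample_uniform2(n, rng):
    """Draw n points uniformly from the square [-10, 10]^2."""
    return [(rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0)) for _ in range(n)]
```

Passing an explicit `random.Random(seed)` instance as `rng` makes independent runs reproducible, which is convenient when averaging over several runs.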
Figure 2: Performance comparison between GKHR and GKH for different $m,n$ over ten independent runs. The blue line is the mean value of $D$; the red dotted lines above and below the blue line are the mean value of $D$ plus and minus its standard deviation; the pink area is the region between the upper and lower red dotted lines.

To consistently evaluate the performance gap between GKHR and GKH at the same order of magnitude, we propose the following criterion

$$
D=\frac{D_{1}-D_{2}}{D_{1}+D_{2}}
$$

where $D_{1}=\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}^{(1)}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n})$ and $D_{2}=\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}^{(2)}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n})$; $\mathbf{X}^{(1)}_{\mathcal{I}^{*}_{m}}$ denotes the samples selected by GKHR and $\mathbf{X}^{(2)}_{\mathcal{I}^{*}_{m}}$ the samples selected by GKH. A positive value of $D$ implies that GKH outperforms GKHR, and a negative value of $D$ implies that GKHR outperforms GKH. A large absolute value of $D$ indicates a large performance gap.

The experiments are conducted as follows. We generate 1000, 3000, 10000 and 30000 random samples from each of the four distributions, then use GKHR and GKH for sample selection under the low-budget setting, i.e., $m/n\leq 0.2$. The parameter $\alpha$ is set by $m/n$. We report the results over ten independent runs in Figure 2. Although the performance gap tends to grow as $m$ grows, the performance of GKHR is similar to that of GKH when $m$ is relatively small. Therefore, under the low-budget setting, GKHR and GKH perform similarly in minimizing $\alpha$-MMD over various types of distributions, which suggests that GKHR can work well in the sample selection task.
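The criterion $D$ can be computed with a few lines of code. The sketch below assumes that the squared $\alpha$-MMD of a selected subset takes the quadratic form $\mathbf{w}^{\top}\mathbf{K}\mathbf{w}-2\alpha\mathbf{w}^{\top}\mathbf{p}+\alpha^{2}\overline{K}$ with uniform weights on the selected indices, which is consistent with $C_{\alpha}^{2}=(1-\alpha)^{2}\overline{K}$ in (11). The Gaussian kernel, its bandwidth, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel; the kernel choice and bandwidth are assumptions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def alpha_mmd_sq(points, selected, alpha, kernel=gaussian_kernel):
    """Squared alpha-MMD between points[selected] and the full set of points.

    Evaluates w^T K w - 2*alpha*w^T p + alpha^2 * K_bar, where w puts weight
    1/m on each selected index, p = K 1 / n and K_bar = 1^T K 1 / n^2.
    """
    n, m = len(points), len(selected)
    k_bar = sum(kernel(x, y) for x in points for y in points) / n ** 2
    within = sum(kernel(points[i], points[j]) for i in selected for j in selected) / m ** 2
    cross = sum(kernel(points[i], y) for i in selected for y in points) / (m * n)
    return within - 2.0 * alpha * cross + alpha ** 2 * k_bar


def relative_gap(d1, d2):
    """D = (D1 - D2) / (D1 + D2); positive when the second method attains the smaller value."""
    return (d1 - d2) / (d1 + d2)
```

This naive version evaluates the kernel $O(n^{2})$ times and is only meant for small examples. Selecting every point reproduces $(1-\alpha)^{2}\overline{K}$, which gives a quick consistency check against (11).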

D.2 Datasets

For experiments, we choose five common datasets: CIFAR-10/100, SVHN, STL-10 and ImageNet. CIFAR-10 and CIFAR-100 each contain 60,000 images, with 10 and 100 categories respectively, of which 50,000 images are for training and 10,000 images are for testing; SVHN contains 73,257 images for training and 26,032 images for testing; STL-10 contains 5,000 images for training, 8,000 images for testing and 100,000 unlabeled images as extra training data. ImageNet spans 1,000 object classes and contains 1,281,167 training and 100,000 test images. The training set of each dataset is used as the unlabeled dataset for sample selection.

D.3 Visualization of Selected Samples

To offer a more intuitive comparison between sampling methods, we visualise the samples chosen by stratified sampling, random sampling, $k$-means, USL, ActiveFT and RDSS (ours). We generate 5000 samples from a Gaussian mixture model defined on $\mathbb{R}^{2}$ with 10 components and uniform mixture weights, and select one hundred samples from the entire dataset with each sampling method. The visualisation results in Figure 3 indicate that the samples selected by our method follow the distribution of the entire dataset more closely than those selected by the other methods.

Figure 3: Visualization of selected samples using different sampling methods. Points of different colours represent samples from different classes, while black points indicate the selected samples.

D.4 Computational Complexity and Running Time

We derive the time complexity of each sampling method and record the time required to select 400 samples on the CIFAR-100 dataset. The results are presented in Table 5, where $m$ represents the annotation budget, $n$ denotes the total number of samples, and $T$ indicates the number of iterations. The sampling time is the average over three independent runs of the sampling code on an idle server without any other workload. As the results show, our method is faster than all other methods except random and stratified sampling. This gap is likely because the execution time of the other algorithms also depends on the number of iterations $T$.

Table 5: Efficiency comparison with other sampling methods.
| Method | Time complexity | Time (s) |
|---|---|---|
| Random | $O(n)$ | $\approx 0$ |
| Stratified | $O(n)$ | $\approx 0$ |
| $k$-means | $O(mnT)$ | 579.97 |
| USL | $O(mnT)$ | 257.68 |
| ActiveFT | $O(mnT)$ | 224.35 |
| RDSS (Ours) | $O(mn)$ | 132.77 |
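
The timing protocol (mean wall-clock time over three independent runs) can be sketched as below. The helper `average_runtime` and the stand-in sampler `random_sampling` are hypothetical names introduced for illustration; the sampler draws 400 indices from 50,000, the size of the CIFAR-100 training set, and any of the compared selection routines could be passed in its place.

```python
import random
import statistics
import time


def average_runtime(select_fn, n_runs=3):
    """Return the mean wall-clock time in seconds over n_runs calls to select_fn."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        select_fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def random_sampling(n=50000, m=400, seed=0):
    """Stand-in sampler: pick m of n indices uniformly at random."""
    return random.Random(seed).sample(range(n), m)


mean_seconds = average_runtime(random_sampling)
print(f"random sampling: {mean_seconds:.4f} s (mean of 3 runs)")
```

Using `time.perf_counter` rather than `time.time` is a design choice here: it is a monotonic, high-resolution clock, which matters for the near-zero timings of random and stratified sampling.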

D.5 Implementation Details of Supervised Learning Experiments

We use ResNet-18 [15] as the classification model for all AL approaches and our method. Specifically, we train the models for 300 epochs using the SGD optimizer (initial learning rate = 0.1, weight decay = 5e-4, momentum = 0.9) with a batch size of 128. Finally, we evaluate performance using Top-1 classification accuracy on the test set.
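
To make the role of the three optimizer hyperparameters concrete, the sketch below collects the stated settings in a hypothetical `TrainConfig` and applies one SGD-with-momentum update to a scalar parameter. It assumes the common convention in which weight decay is added to the gradient before the momentum update (the convention of `torch.optim.SGD` with zero dampening); the learning-rate schedule is not specified in the text, so none is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in the text; the class itself is illustrative."""
    epochs: int = 300
    batch_size: int = 128
    initial_lr: float = 0.1
    weight_decay: float = 5e-4
    momentum: float = 0.9


def sgd_step(param, grad, velocity, lr, cfg):
    """One SGD-with-momentum update on a scalar parameter."""
    grad = grad + cfg.weight_decay * param      # weight decay folded into the gradient
    velocity = cfg.momentum * velocity + grad   # momentum buffer update
    param = param - lr * velocity               # parameter update
    return param, velocity


cfg = TrainConfig()
param, velocity = 1.0, 0.0
param, velocity = sgd_step(param, grad=0.5, velocity=velocity,
                           lr=cfg.initial_lr, cfg=cfg)
```

Starting from `param = 1.0` with gradient 0.5, the effective gradient is 0.5 + 5e-4 · 1.0 = 0.5005, so the first step moves the parameter to 1.0 − 0.1 · 0.5005 = 0.94995.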

D.6 Direct Comparison with AL/SSAL

The comparative results with AL/SSAL approaches on CIFAR-10 and CIFAR-100 are shown in Figure 4 and Figure 5, respectively. The specific values underlying these two figures are listed in Table 6. The results of the compared approaches are taken from [7], [11] and [16].

Table 6: Comparative results with AL/SSAL approaches.
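| Dataset | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-100 | CIFAR-100 | CIFAR-100 | CIFAR-100 | CIFAR-100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Budget | 40 | 250 | 500 | 1000 | 2000 | 4000 | 5000 | 7500 | 10000 | 400 | 2500 | 5000 | 7500 | 10000 |
| Active Learning (AL) | | | | | | | | | | | | | | |
| CoreSet [35] | - | - | - | - | - | - | 80.56 | 85.46 | 87.56 | - | - | 37.36 | 47.17 | 53.06 |
| VAAL [36] | - | - | - | - | - | - | 81.02 | 86.82 | 88.97 | - | - | 38.46 | 47.02 | 53.99 |
| LearnLoss [56] | - | - | - | - | - | - | 81.74 | 85.49 | 87.06 | - | - | 36.12 | 47.81 | 54.02 |
| MCDAL [7] | - | - | - | - | - | - | 81.01 | 87.24 | 89.40 | - | - | 38.90 | 49.34 | 54.14 |
| Semi-Supervised Active Learning (SSAL) | | | | | | | | | | | | | | |
| CoreSetSSL [35] | - | - | 90.94 | 92.34 | 93.30 | 94.02 | - | - | - | - | - | 63.14 | 66.29 | 68.63 |
| CBSSAL [11] | - | - | 91.84 | 92.93 | 93.78 | 94.55 | - | - | - | - | - | 63.73 | 67.14 | 69.34 |
| TOD-Semi [16] | - | - | - | - | - | - | 79.54 | 87.82 | 90.3 | - | - | 36.97 | 52.87 | 58.64 |
| Semi-Supervised Learning (SSL) with RDSS | | | | | | | | | | | | | | |
| FlexMatch+RDSS (Ours) | 94.69 | 95.21 | - | - | - | 95.71 | - | - | - | 48.12 | 67.27 | - | - | 73.21 |
| FreeMatch+RDSS (Ours) | 95.05 | 95.50 | - | - | - | 95.98 | - | - | - | 48.41 | 67.40 | - | - | 73.13 |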

From these results, we make two observations. (1) AL approaches often require significantly larger labelling budgets, exceeding that of RDSS by a factor of 125 or more on CIFAR-10 (at least 5000 labels versus 40). This is primarily because AL paradigms depend solely on labelled samples, not only for classification but also for feature learning. (2) SSAL and our method leverage unlabeled samples and surpass traditional AL approaches. However, this may not directly reflect the advantages of RDSS, as such gains could be attributed to the SSL paradigm itself. Nonetheless, these outcomes suggest that SSL may be a more promising paradigm in scenarios with limited annotation budgets.

Figure 4: Comparison with AL/SSAL approaches on CIFAR-10.
Figure 5: Comparison with AL/SSAL approaches on CIFAR-100.

Appendix E Limitation

The choice of $\alpha$ depends only on the total number of unlabeled data points and does not use any information about the shape of the data distribution. This may reduce the effectiveness of RDSS on datasets with complicated distribution structures. Nevertheless, this choice outperforms fixed-ratio approaches on the datasets under different budget settings.

Appendix F Potential Societal Impact

Positive societal impact. Our method ensures the representativeness and diversity of the selected samples and significantly improves the performance of SSL methods, especially under low-budget settings. This reduces the cost and time of data annotation and is particularly beneficial for resource-constrained research and development environments, such as medical image analysis.

Negative societal impact. When selecting representative data for analysis and annotation, the processing of sensitive data may be involved, increasing the risk of data leakage, especially in sensitive fields such as medical care and finance. It is worth noting that most algorithms applied in these sensitive areas are subject to this risk.