Aligning Logits Generatively for Principled Black-Box Knowledge Distillation

Jing Ma , Xiang Xiang
School of Artificial Intelligence and Automation,
Huazhong University of Science and Tech, Wuhan, China
Equal contribution, co-first author; also with Nat. Key Lab of MSIIPT. Correspondence to xex@hust.edu.cn; also with Peng Cheng Lab.
Ke Wang, Yuchuan Wu, Yongbin Li
DAMO Academy,
Alibaba Group, Beijing, China
Abstract

Black-Box Knowledge Distillation (B2KD) is the problem of cloud-to-edge model compression when the data and models hosted on the server are invisible. B2KD faces challenges such as limited Internet exchange and the edge-cloud disparity of data distributions. In this paper, we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically provide a new optimization direction, from logits to cell boundary, that differs from direct logits alignment. Guided by this, we propose a new method, Mapping-Emulation KD (MEKD), that distills a black-box cumbersome model into a lightweight one. Our method treats soft and hard responses in the same way, and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, and 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points. For different teacher-student pairs, our method yields encouraging distillation performance on various benchmarks and outperforms the previous state-of-the-art approaches.

1 Introduction

Knowledge Distillation (KD) is a widely accepted approach to the problem of model compression and acceleration, which has received sustained attention from both the academic and industrial research communities [15, 38, 47, 17]. The goal of KD is to extract knowledge from a cumbersome model or an ensemble of models, known as the teacher, and use it as supervision to guide the training of lightweight models, known as the student [5, 39, 1]. In applications of KD, privacy protection is a persistent concern for researchers and users; it covers not only the privacy of user data but also the model copyright of cloud service providers.

Black-Box Knowledge Distillation (B2KD) is a problem posed in the process of cloud-to-edge model compression [37, 51, 54]. The cloud server hosts a teacher model whose internal structure and composition, connections between layers, model parameters, and gradients used for back-propagation are all invisible and unavailable to edge devices, as shown in Fig. 1. Due to resource limitations, the edge device can only host a lightweight student model. At the same time, low-quality and unlabeled local data cannot be used to train a reliable deep neural network. As a result, it must rely on sending query samples to the APIs of cloud servers for heavy inference [49].

In practice, B2KD faces some key challenges. (a) Cloud servers and edge devices should maintain limited data exchange due to Internet latency and bandwidth constraints, as well as charges for the amount of queried data or API usage time. (b) In some cases, for query samples, these APIs only provide indexes or semantic tags for the category with the highest probability (i.e., hard responses), rather than probability vectors for all possible classes (i.e., soft responses). (c) Because users refuse to send sensitive data to cloud servers, the distribution gap between local and cloud data is difficult to measure, making the distilled student model inaccurate in the application.
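To make challenge (b) concrete, the following minimal Python sketch contrasts the two response types for a single query. The function names `soft_response` and `hard_response` and the example logits are illustrative assumptions; they do not correspond to any real cloud API.

```python
import math

def soft_response(logits):
    """Return a probability vector over all classes (softmax of the logits)."""
    m = max(logits)                                   # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_response(logits):
    """Return only the index of the most probable class."""
    return max(range(len(logits)), key=lambda k: logits[k])

# Hypothetical teacher logits for one query sample with three classes.
teacher_logits = [2.0, 1.0, 0.1]
print(soft_response(teacher_logits))   # full distribution over classes
print(hard_response(teacher_logits))   # top-1 index only
```

A hard response discards the relative confidence among the non-maximal classes, which is why a method that works under both response types is desirable in this setting.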

Figure 1: Schematic process of cloud-to-edge model compression. A cumbersome black-box model is deployed on a cloud server, trained with millions of samples and tags. The cloud server only provides APIs to receive query data and return inference responses of either soft or hard type. The edge device needs to distill a lightweight model using unlabeled local data.

Adversarial learning has been shown to be effective in generating pseudo samples and is widely used in data augmentation and low-shot learning [8, 54]. A well-trained generator can overcome the mode collapse problem and align the real and synthetic data distributions. In particular, we want to produce images relevant to training, whether or not they resemble real data [32]. Meanwhile, images generated to obtain high responses from the teacher model combine different patterns with highly generalized features instead of sample-specific idiosyncrasies [52]. Therefore, using a well-trained generator to synthesize pseudo images can automatically filter out privacy-related high-frequency information; we call this process deprivatization.

In this paper, we propose an approach to solve B2KD by mapping emulation. Our motivation is that reducing the distance between two generated images in the high-dimensional space drives the alignment of the corresponding low-dimensional logits. In addition, we argue that an image contains a lot of fine-grained information, which can be treated as another type of knowledge that provides different gradient directions for updating the parameters of the student model, as shown in Fig. 4. Combining the image-level loss with the coarse-grained logit-level loss can effectively improve the distillation effect. According to the Kolmogorov theorem [25, 6], a sufficiently complex neural network is capable of representing an arbitrary multivariate continuous function from any dimension to another. Thus, a well-trained generator can not only emulate the inverse mapping of the teacher function (Thm. 1) but also help update the logits of a student to converge to the logits of a teacher (Thm. 2), with reasonable generalizability (Thm. 3).

In practice, we use a generative adversarial network (GAN) for deprivatization and exploit its generator as an inverse mapping of the teacher function. The generator takes as inputs random variables sampled from a prior distribution with the same dimensionality as the logits. The well-trained generator is then frozen and grafted behind the teacher and student models, whose output logits for the same examples are used as the inputs of the generator, as shown in Fig. 2. Experimental results show that MEKD can effectively protect the privacy of local data and of models in the cloud, and that it performs well under either soft or hard responses. MEKD also gives robust results with limited query samples and out-of-domain data.

Overall, the contributions of this paper are: 1) We formalize the problem of B2KD and provide a two-step workflow of deprivatization and distillation. 2) We theoretically provide a new optimization direction from logits to cell boundary different from direct logits alignment. 3) We propose a new method of Mapping-Emulation Knowledge Distillation (MEKD). The improved experimental performance has demonstrated the effectiveness of our approach.

2 Related Work

Knowledge Distillation (KD). Hinton et al. [19] propose an original teacher-student architecture that uses the logits of the teacher model as the knowledge. Since then, some KD methods regard knowledge as final responses to input samples [3, 31, 58], some regard knowledge as features extracted from different layers of neural networks [24, 23, 41], and some regard knowledge as relations between such layers [57, 40, 9]. The purpose of defining different types of knowledge is to efficiently extract the underlying representation learned by the teacher model from the large-scale data. If we consider a network as a mapping function from the input distribution to the output, then different knowledge types help to approximate such a function. Based on the type of knowledge transferred, KD can be divided into response-based, feature-based, and relation-based [15]. The first two aim to drive the student to mimic the responses of the output layer or the feature maps of the hidden layers of the teacher, and the last approach uses the relationships between the teacher's different layers to guide the training of the student model. Feature-based and relation-based methods [24, 57], depending on the model utilized, may leak information about structures and parameters through the intermediate layers' data. For example, we can reconstruct a ResNet [18] based on the feature dimensions of different layers, and calculate each neuron's parameters using specific images and their responses in the feature maps.

Black-Box Knowledge Distillation (B2KD). Response-based KD methods [19, 58, 3] have the natural property of hiding models. Hinton et al. [19] use the Kullback-Leibler Divergence (KLD) between the softened logits of teacher and student models as the loss to align the output distributions, and Zhao et al. [58] decouple the KLD into two uncorrelated losses and combine them by weighted summation. These calculations do not take into account the details of the teacher model, which is therefore exactly a black box. The recently proposed approaches for B2KD also address the issue of hiding the teacher model deployed in the cloud server [37, 51, 54]. Orekondy et al. [37] use a reinforcement learning approach to improve query sample efficiency. Wang et al. [51] blend mixup and active learning to augment the few unlabeled images and choose hard examples for distillation. Wang [54] proposes a decision-based black-box model and constructs the soft label for each training sample by computing its distances to the decision boundaries of the teacher model. These existing approaches partially address the challenges of cloud-to-edge black-box model distillation, but none of them takes into account the privacy leak of user data when sending original local images to the cloud.
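As an illustration of why response-based losses only need the teacher's outputs, the following minimal sketch computes the KLD between temperature-softened teacher and student distributions in pure Python. The temperature value of 4.0 and the helper names are assumptions chosen for illustration, and the factor T^2 follows the common convention of rescaling the softened loss; this is not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                   # stabilize the exponentials
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(p_T || p_S) between softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    return temperature ** 2 * kl

# Only the teacher's output logits are used; no weights or gradients of the teacher.
print(kd_kl_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The loss is zero when the two logit vectors induce the same softened distribution and positive otherwise.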

Figure 2: The overall framework of MEKD. Lower left: two architectures of GAN-based KD. Upper right: the process of deprivatization. A GAN is used to synthesize high-response images for the teacher model within the distribution of data in edge devices. Lower right: the process of distillation with the frozen generator. The synthetic privacy-free images are query samples sent to the teacher model through the APIs of cloud servers. The student model is distilled by reducing the logit-level and image-level discrepancy.

Generative Adversarial Networks (GANs) have the capacity to handle sharp estimated density functions and generate realistic-looking images efficiently. A typical GAN [14] comprises a discriminator distinguishing real images and generated images, and a generator synthesizing images to fool the discriminator. GANs are divided into architecture-variant and loss-variant. The former focuses on network architectures [42, 7] or latent space [34, 10], e.g., some specific architectures are proposed for specific tasks [59, 22]. The latter utilizes different loss types and regularization tools [2, 16] to enable more stable learning.

Adversarial Distillation (AD) exploits an adversarial architecture to help the teacher and student models better understand the real data distribution [15, 55, 53, 4, 44]. AD methods can be divided into two types according to the generator-discriminator architecture, as shown in Fig. 2: (a) the generator is used to synthesize images that obey a real distribution, and these images are used to help distill models [8, 53]; (b) the teacher and student models are regarded as generators, and a discriminator is grafted behind them to judge whether the distribution of features or logits is consistent [55, 56]. AD has also been employed for low-shot knowledge distillation with encouraging results [8]. Our method provides an alternative adversarial architecture, which utilizes a well-trained generator to guide the alignment between the outputs of models.

3 Theory for Mapping-Emulation KD

First, we propose two definitions. Def. 1 states that two functions that map the same data distribution $\mu$ to the same latent distribution $\upsilon$ are equivalent. The ideal state of KD is to obtain a student function $f_S$ that is equivalent to the teacher function $f_T$. Def. 2 states that the mapping function of a generator $G$, which maps a prior distribution $p$ to the data manifold $\Sigma$ and guarantees that the generated image distribution $\mu'$ is the same as the real image distribution $\mu$, is considered to be the inverse mapping of the teacher function, i.e., $f_G = f_T^{-1}$. We call such a generator well-trained. The mapping relationships are shown in Fig. 3.

Figure 3: Mapping relationships of $f_S, f_T, f_G$. If $f_S$ and $f_T$ can map $\mu$ to the same distribution $\upsilon$, then $f_S = f_T$, and if $f_G$ can map the prior distribution $p$ to $\mu$, then $f_G = f_T^{-1}$.
Definition 1.

(Function Equivalence) Given the student and teacher models $f_S$ and $f_T$, let a data distribution $\mu \in \mathcal{X}$ in image space be mapped to $\mathbb{P}_S \in \mathcal{Y}$ and $\mathbb{P}_T \in \mathcal{Y}$ in latent space. If the Wasserstein distance between $\mathbb{P}_S$ and $\mathbb{P}_T$ equals zero,

$$W(\mathbb{P}_S, \mathbb{P}_T) = \inf_{\gamma \in \Pi(\mathbb{P}_S, \mathbb{P}_T)} \mathbb{E}_{(y_S, y_T) \sim \gamma}\left[\|y_S - y_T\|\right] = 0, \tag{1}$$

then the student and teacher models are equivalent, i.e., $f_S = f_T$, where $\Pi(\mathbb{P}_S, \mathbb{P}_T)$ is the set of all joint distributions $\gamma(y_S, y_T)$ whose marginals are $\mathbb{P}_S$ and $\mathbb{P}_T$, respectively.
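For intuition about Eq. (1), consider the simplest special case. For two one-dimensional empirical distributions with the same number of samples, the optimal coupling is known to match the sorted samples, so the Wasserstein-1 distance reduces to a mean absolute difference. The sketch below assumes this 1-D, equal-size setting, which is far simpler than the $C$-dimensional logit distributions in the definition; the function name is illustrative.

```python
def wasserstein_1d(samples_a, samples_b):
    """W1 between two equal-size 1-D empirical distributions via sorted matching."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# The same multiset of outputs in a different order: the distributions coincide.
print(wasserstein_1d([0.1, 0.5, 0.9], [0.9, 0.1, 0.5]))
# Every sample shifted by the same amount: the distance equals the shift.
print(wasserstein_1d([0, 1, 2], [1, 2, 3]))
```

The distance vanishes exactly when the two samples coincide as multisets, which mirrors the zero-distance condition used in the definition.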

Definition 2.

(Inverse Mapping) Given a prior distribution $p \in \mathbb{R}^C$ and a data distribution $\mu \in \mathbb{R}^n$, if the Wasserstein distance between the generated distribution $\mu' = (f_G)_{\#} p$ and $\mu$ equals zero,

$$W(\mu', \mu) = \inf_{\gamma \in \Pi(\mu', \mu)} \mathbb{E}_{(x', x) \sim \gamma}\left[\|x' - x\|\right] = 0, \tag{2}$$

then the generator $f_G : \mathbb{R}^C \rightarrow \mathbb{R}^n$ is the inverse mapping of the teacher function $f_T : \mathbb{R}^n \rightarrow \mathbb{R}^C$, denoted as $f_G = f_T^{-1}$, where $\Pi(\mu', \mu)$ is the set of all joint distributions $\gamma(x', x)$ whose marginals are $\mu'$ and $\mu$, respectively.

Fixing a decoding map $f_G$ for a well-trained generator $G$, the latent space $\mathcal{Z}$ is partitioned as

$$\mathcal{D}(f_G) : \mathcal{Z} = \bigcup_{\alpha} U_{\alpha}, \tag{3}$$

where $\mathcal{D}(f_G)$ is called the decomposition induced by the decoding map $f_G$ [29], and $\{U_{\alpha}\}$ are called cells. As shown in Fig. 4, $f_G$ maps a cell decomposition in the latent space $\mathcal{D}(f_G)$ to a cell decomposition in the image space $\frac{1}{n}\sum_{i}\delta_{x^{(i)}}$. Each cell $U_{\alpha}$ is mapped to a sample $\delta_{x^{(i)}}$ by the decoding map $f_G$ [28]. In other words, $f_G$ pushes the prior distribution $p$ to the exact empirical distribution,

$$(f_G)_{\#} p = \frac{1}{n}\sum_{i}\delta_{x^{(i)}}. \tag{4}$$
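A toy example may help to visualize Eqs. (3) and (4). The sketch below assumes a one-dimensional latent space $[0, 1)$ with a uniform prior and $n$ equal-length interval cells, each decoded to one stored sample. These choices, and the names `make_decoder` and `samples`, are assumptions made for illustration; the cells induced by a trained generator are generally not of this form.

```python
from collections import Counter

def make_decoder(samples):
    """Piecewise-constant decoding map on [0, 1): cell i = [i/n, (i+1)/n) -> samples[i]."""
    n = len(samples)

    def decode(z):
        if not 0.0 <= z < 1.0:
            raise ValueError("latent code must lie in [0, 1)")
        return samples[min(int(z * n), n - 1)]

    return decode

samples = ["x1", "x2", "x3", "x4"]                   # stand-ins for the images x^(i)
decode = make_decoder(samples)

# An even grid of latent codes approximates the uniform prior p.
latents = [(k + 0.5) / 1000 for k in range(1000)]
mass = Counter(decode(z) for z in latents)
pushforward = {x: mass[x] / len(latents) for x in samples}
print(pushforward)                                   # each sample receives mass 1/n
```

Every latent code inside a cell decodes to the same sample, so moving a latent point within a cell leaves the output unchanged, while crossing a cell boundary changes it.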
Theorem 1.

(Empirical Approximation) For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $g : \mathbb{R}^C \rightarrow \mathbb{R}^n$ be the mapping function of generator $G$ with $n \leq \frac{20 \log m}{\epsilon^2}$. For two sets $V_S = \{y_S : y_S \in \mathbb{P}_S\}$ and $V_T = \{y_T : y_T \in \mathbb{P}_T\}$, both of which have $m$ points in $\mathbb{R}^C$, if the empirical Wasserstein distance between $g(V_S)$ and $g(V_T)$ equals zero,

$$\hat{W}(g(V_S), g(V_T)) = \frac{1}{m}\sum_{i=1}^{m}\|g(y_S^i) - g(y_T^i)\| = 0, \tag{5}$$

then $W(\mathbb{P}_S, \mathbb{P}_T) = 0$.

Thm. 1 (see Appendix for proof) provides a method to approximate the expected Wasserstein distance $W(\mathbb{P}_S, \mathbb{P}_T)$ using the empirical Wasserstein distance $\hat{W}(g(V_S), g(V_T))$. By reducing the distance between the points $g(y_S^i)$ and $g(y_T^i)$ in high-dimensional space, an optimization direction $\nabla\mathcal{L}_F$ different from $\nabla\mathcal{L}_{KL}$ is produced for the logits $y_S^i$ and $y_T^i$ in low-dimensional space. The gradient update causes $y_S^i$ to move towards the boundary of the cell in which $y_T^i$ resides, as shown in Fig. 4.
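The empirical quantity in Eq. (5) is straightforward to compute once paired logits are available. The following sketch evaluates it with a hypothetical frozen generator, here a fixed linear map from $\mathbb{R}^2$ to $\mathbb{R}^4$ chosen only so that the example is self-contained; a real generator would be a trained network.

```python
import math

def toy_generator(y):
    """Hypothetical frozen generator g: R^2 -> R^4 (a fixed linear map, illustration only)."""
    rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    return [row[0] * y[0] + row[1] * y[1] for row in rows]

def empirical_distance(g, student_logits, teacher_logits):
    """Eq. (5): mean Euclidean distance between paired generated points."""
    total = 0.0
    for y_s, y_t in zip(student_logits, teacher_logits):
        x_s, x_t = g(y_s), g(y_t)
        total += math.sqrt(sum((u - v) ** 2 for u, v in zip(x_s, x_t)))
    return total / len(student_logits)

student = [[1.0, 0.0], [0.5, 0.5]]
teacher = [[0.0, 0.0], [0.5, 0.5]]
print(empirical_distance(toy_generator, student, teacher))
```

The pairing by index $i$ matters: each student logit is compared with the teacher logit of the same query sample, not with an arbitrary one.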

Theorem 2.

(Optimization Direction) Let $\mu \in \mathcal{X}$ be any distribution, and let $f_S, f_T, f_G$ be the mapping functions of the student, teacher, and generator, respectively, where $f_S$ is parameterized by $\theta_S \in \Theta_S$. Then, when

$$\min_{\theta_S \in \Theta_S} \mathbb{E}_{x \sim \mu}\left[\|f_G \circ f_S(x) - f_G \circ f_T(x)\|\right] \rightarrow 0, \tag{6}$$

it holds that $f_S \rightarrow f_T$, and we have

$$\begin{aligned}
\nabla_{\theta_S}\mathbb{E}_{x \sim \mu}[f_S(x)] &= \nabla_{\theta_S} W(\mathbb{P}_S, \mathbb{P}_T) \\
&= \mathbb{E}_{x \sim \mu}\left[\nabla_{\theta_S}\|f_G \circ f_S(x) - f_G \circ f_T(x)\|\right].
\end{aligned} \tag{7}$$

Thus, to achieve $f_S \rightarrow f_T$, it is sufficient to optimize $\mathbb{E}_{x \sim \mu}\left[\|f_G \circ f_S(x) - f_G \circ f_T(x)\|\right]$ in the parameter space $\Theta_S$. The global gradient of the parameter $\theta_S$ can be replaced by the gradient calculated on the empirical distance of high-dimensional image points; see the Appendix for the proof.
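To illustrate that the image-level objective yields a different update direction from the KL objective, the sketch below compares the two gradients with respect to the student logits for one sample. It assumes a hypothetical frozen linear generator $g(y) = Ay$ and temperature 1; the gradient with respect to $\theta_S$ in Eq. (7) would follow by the chain rule through the student network, which is omitted here. All names and numbers are illustrative.

```python
import math

# Hypothetical frozen linear generator g(y) = A y, mapping R^3 -> R^4.
A = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0],
     [1.0, 1.0, 1.0]]

def g(y):
    return [sum(a * v for a, v in zip(row, y)) for row in A]

def image_loss(y_s, y_t):
    """Image-level loss for one sample: ||g(y_s) - g(y_t)||."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(g(y_s), g(y_t))))

def image_loss_grad(y_s, y_t):
    """Analytic gradient w.r.t. y_s: A^T (A y_s - A y_t) / ||A y_s - A y_t||."""
    diff = [u - v for u, v in zip(g(y_s), g(y_t))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [sum(A[i][j] * diff[i] for i in range(len(A))) / norm
            for j in range(len(y_s))]

def softmax(y):
    m = max(y)
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]

def kl_grad(y_s, y_t):
    """Gradient of KL(softmax(y_t) || softmax(y_s)) w.r.t. y_s."""
    return [ps - pt for ps, pt in zip(softmax(y_s), softmax(y_t))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

y_student, y_teacher = [0.5, 1.5, -0.5], [2.0, 0.0, 1.0]
grad_f = image_loss_grad(y_student, y_teacher)
grad_kl = kl_grad(y_student, y_teacher)
print(cosine(grad_f, grad_kl))   # strictly below 1 when the two directions differ
```

In this toy setting the two gradients are positively correlated but not parallel, so combining them changes the update direction compared with using the KL term alone.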

Theorem 3.

(Generalization Bound) Let $H \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set for a $C$-way classification task. For any $0 < \epsilon < 1/2$ and a sample $S$ of size $m > 4$ drawn according to $\mu$, let $g : \mathbb{R}^C \rightarrow \mathbb{R}^n$ be a mapping function of generator $G$ with $n \leq \frac{20 \log m}{\epsilon^2}$. Fix $\rho > 0$. Then for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for all $h \in H$,

$$R(h) \leq \hat{R}_{\rho}(h) + \frac{2C^2}{\rho(1 - \epsilon)}\sqrt{\frac{r^2\Lambda^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{8}$$

Here $\Lambda \geq 0$ satisfies $\left(\sum_{y=1}^{C}\|h(x, y)\|^p\right)^{1/p} \leq \Lambda$ for any $x \in \mathcal{X}$ and any $p \geq 1$, and $r > 0$ satisfies $K(x, x) \leq r^2$ for any $x \in \mathcal{X}$, where the kernel $K : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is positive definite symmetric.

Figure 4: Cell $U_{\alpha}$ in the latent space is mapped via $f_G$ to an exact image $x^{(i)}$ of the same color. The move of point $x_S'$ to $x_T'$ causes the logits $y_S$ to align with $y_T$ from a direction different from $\mathcal{L}_{KL}$.

Thm. 3 (see the Appendix for the proof) gives the generalization bound for aligning low-dimensional logits by reducing the distance between high-dimensional image points, which guarantees generalizability to unseen samples.
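To give a feel for how the two slack terms on the right-hand side of Eqn. 8 scale with the sample size $m$, the following minimal sketch evaluates them numerically. All numeric values ($C$, $\rho$, $\epsilon$, $r$, $\Lambda$, $\delta$, $m$) and the function names are assumptions chosen purely for illustration; they are not values used in this paper, and the sketch is meant to show the $1/\sqrt{m}$ scaling rather than a tight numerical guarantee.

```python
import math


def complexity_term(C, rho, eps, r, Lam, m):
    """Second term on the right-hand side of Eqn. 8."""
    return (2 * C**2) / (rho * (1 - eps)) * math.sqrt(r**2 * Lam**2 / m)


def confidence_term(delta, m):
    """Third term on the right-hand side of Eqn. 8."""
    return math.sqrt(math.log(1 / delta) / (2 * m))


def dimension_limit(m, eps):
    """The quantity 20 log(m) / eps^2 that appears in the condition on n."""
    return 20 * math.log(m) / eps**2


# Illustrative values only (assumptions, not taken from the paper).
C, rho, eps, r, Lam, delta = 10, 1.0, 0.25, 1.0, 1.0, 0.05
for m in (10_000, 40_000):
    slack = complexity_term(C, rho, eps, r, Lam, m) + confidence_term(delta, m)
    print(f"m={m}: slack={slack:.4f}, 20 log(m)/eps^2={dimension_limit(m, eps):.1f}")
```

Both slack terms are proportional to $1/\sqrt{m}$, so quadrupling the sample size halves each of them.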

4 Algorithm of Mapping-Emulation KD

Hinton et al. [19] propose a simple but effective KD method that uses the softened logits of the teacher model as supervision to guide student training. They use the Kullback-Leibler Divergence (KLD) to measure the discrepancy between the logits of the two models, and the student model is trained to minimize this gap in the hope of achieving the same output. The loss is defined as

$$\begin{aligned}\mathcal{L}_{KL}&=\mathcal{KL}\left[p(c|\mathbf{x}_{i};\theta_{T})\,||\,p(c|\mathbf{x}_{i};\theta_{S})\right]\\&=\frac{1}{N}\sum_{i}^{N}\sum_{c}^{C}p(c|\mathbf{x}_{i};\theta_{T})\log\left[\frac{p(c|\mathbf{x}_{i};\theta_{T})}{p(c|\mathbf{x}_{i};\theta_{S})}\right],\end{aligned} \tag{9}$$

where $i$ is the sample index and $N$ is the number of samples. Regardless of the method used, the essence of KD is to learn the mapping function of the teacher model from input to output, i.e., $f_T$. However, it is hard to deduce the mapping function from the existing parameters of the teacher model. One can only guess the mapping process by using the responses of different network layers to the input samples, or the relations between features, and treat them as knowledge to guide the training of the student model [57]. In the black-box KD problem, moreover, the internal responses or relations between layers of the teacher model are not available, which makes effective distillation more challenging.
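As a concrete reading of Eqn. 9, the following minimal sketch computes the batch-averaged KL divergence between teacher and student class distributions. It assumes that $p(c|\mathbf{x}_i;\theta)$ is the softmax of the model's logits; the function names, the temperature argument `tau`, and the small constant `eps` are illustrative assumptions rather than part of the original formulation.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-softened softmax over one logit vector."""
    scaled = [v / tau for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_loss(teacher_logits, student_logits, tau=1.0, eps=1e-12):
    """Batch mean of KL(p_T || p_S), following Eqn. 9.

    teacher_logits, student_logits: lists of per-sample logit vectors.
    eps is an assumed guard against log(0).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_t = softmax(t_row, tau)
        p_s = softmax(s_row, tau)
        total += sum(pt * math.log((pt + eps) / (ps + eps))
                     for pt, ps in zip(p_t, p_s))
    return total / len(teacher_logits)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
    student = [[1.5, 0.7, -0.5], [0.3, 2.0, 0.4]]
    print(kl_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge.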

Deprivatization. For a $C$-way classification problem, we first train a GAN using the random noise variable $z$ sampled from the prior distribution $p$ in latent space $\mathcal{Y}$ as input. Note that the dimensionality of $z$ is the same as that of the output logits of the teacher model, i.e., $|z|=C$. The generator $G$ uses noise $z$ to synthesize images, and the discriminator $D$ minimizes the Wasserstein distance between the generated distribution $\mu'$ and the real distribution $\mu$. The synthetic privacy-free images are simultaneously sent to the cloud server for inference responses, which can be soft (probability vectors for all possible classes) or hard (indexes or semantic tags for the category with the highest probability). We expect the synthetic images to match the high responses of the teacher model so that they can maximize the containment of patterns in real data. We adopt the information maximization (IM) loss [21, 45], which is formulated as

$$\mathcal{L}_{IM}=-\frac{1}{m}\sum_{i=1}^{m}\hat{y}^{(i)}_{t}\log\left(D\left(G\left(z^{(i)}\right)\right)\right), \tag{10}$$

where $\hat{y}^{(i)}_{t}=\max_{c\in C}T\left(G\left(z^{(i)}\right)\right)$ with $0.0\leq\hat{y}^{(i)}_{t}\leq 1.0$ for soft responses and $\hat{y}^{(i)}_{t}=1.0$ for hard responses.
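The IM loss of Eqn. 10 can be sketched as follows. The sketch assumes that the teacher's soft response is a probability vector, that a hard response is a class index, and that the discriminator scores $D(G(z^{(i)}))$ are already available as numbers in $(0,1)$; the function names and the constant `eps` are hypothetical.

```python
import math


def teacher_confidence(response, hard=False):
    """y_hat_t in Eqn. 10: top class probability (soft) or 1.0 (hard)."""
    return 1.0 if hard else max(response)


def im_loss(teacher_responses, disc_scores, hard=False, eps=1e-12):
    """L_IM = -(1/m) * sum_i y_hat_t^(i) * log D(G(z^(i))).

    teacher_responses: probability vectors (soft) or class indices (hard).
    disc_scores: D(G(z^(i))) for the same synthetic samples.
    eps is an assumed guard against log(0).
    """
    m = len(disc_scores)
    total = 0.0
    for response, d in zip(teacher_responses, disc_scores):
        total += teacher_confidence(response, hard) * math.log(d + eps)
    return -total / m


if __name__ == "__main__":
    soft = [[0.9, 0.1], [0.6, 0.4]]
    scores = [0.5, 0.5]
    print(im_loss(soft, scores))               # weighted by teacher confidence
    print(im_loss([0, 1], scores, hard=True))  # every weight equals 1.0
```

With hard responses every sample receives the same weight, whereas soft responses weight each sample by the teacher's top class probability.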

Suppose the discriminator is capable of completely blurring the discrepancy between synthetic and genuine images. In this case, the resulting generator represents a function from the latent space to the image space, defined as $f_G:\mathcal{Y}\rightarrow\mathcal{X}$, which acts as an inverse mapping of the teacher function. Note that the generator and the discriminator are trained simultaneously: we adjust the parameters of the generator to minimize $\log(1-D(G(z)))$ and adjust the parameters of the discriminator to maximize $\log D(x)$. Their loss functions are

$$\mathcal{L}_{D}=-\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right)+\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right)\right], \tag{11}$$

$$\mathcal{L}_{G}=-\frac{1}{m}\sum_{i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right). \tag{12}$$

We introduce a trade-off hyperparameter $\alpha$ to balance $\mathcal{L}_{GAN}$ and $\mathcal{L}_{IM}$, and all the losses in the first step of deprivatization constitute

$$\mathcal{L}_{Dp}=(\mathcal{L}_{G}+\mathcal{L}_{D})+\alpha\mathcal{L}_{IM}. \tag{13}$$
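The adversarial part of the deprivatization step can be sketched numerically as follows. The discriminator loss follows Eqn. 11 directly. For the generator, the sketch follows the prose description (the generator pushes $\log(1-D(G(z)))$ down) and adds the $\alpha$-weighted IM loss, as in the generator update of Alg. 1; this sign convention, the function names, and the constant `eps` are assumptions of the sketch. The default $\alpha=0.5$ mirrors the value used in the ablation study.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """L_D of Eqn. 11 from D(x) on real images and D(G(z)) on synthetic ones."""
    m = len(d_real)
    return -sum(math.log(dr + eps) + math.log(1.0 - df + eps)
                for dr, df in zip(d_real, d_fake)) / m


def generator_adversarial_term(d_fake, eps=1e-12):
    """Batch mean of log(1 - D(G(z))), the quantity the generator pushes down."""
    return sum(math.log(1.0 - df + eps) for df in d_fake) / len(d_fake)


def generator_objective(d_fake, im_loss_value, alpha=0.5):
    """Adversarial term plus the alpha-weighted IM loss (assumed combination)."""
    return generator_adversarial_term(d_fake) + alpha * im_loss_value


if __name__ == "__main__":
    d_real = [0.9, 0.8]   # hypothetical discriminator scores on real images
    d_fake = [0.2, 0.3]   # hypothetical discriminator scores on synthetic images
    print(discriminator_loss(d_real, d_fake))
    print(generator_objective(d_fake, im_loss_value=0.7))
```

A generator that fools the discriminator more often (larger $D(G(z))$) obtains a lower adversarial term, while a discriminator that separates real from synthetic images well obtains a lower $\mathcal{L}_D$.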
Algorithm 1 MEKD optimization algorithm.

Input: Pre-trained teacher $T(x;\theta_T)$ deployed in the cloud server, randomly initialized student $S(x;\theta_S)$ and local dataset $X$ hosted in the edge device.
Output: An optimized student $S(x;\theta_S)$ on dataset $X$.

1:  ▷ Step 1: Deprivatization
2:  Initialize a generator $G(z;\theta_G)$ and a discriminator $D(x;\theta_D)$, and ensure that the dimensionality of $z$ equals the category count $C$.
3:  repeat
4:       Sample a batch of noises $\mathcal{Z}$ from a prior distribution $p$ and synthesize images $\mathcal{X}'=G(\mathcal{Z})$.
5:       Send $\mathcal{X}'$ to $T$ in the cloud to get soft or hard inference responses $\hat{\mathcal{Y}}'_t=T(\mathcal{X}')$.
6:       Sample a batch of examples $\mathcal{X}$ from dataset $X$.
7:       Update the discriminator $D$ to distinguish $\mathcal{X}$ and $\mathcal{X}'$ using $\mathcal{L}_D$ from Eqn. 11.
8:       Update the generator $G$ to fool the discriminator $D$ using $\mathcal{L}_G+\alpha\mathcal{L}_{IM}$ from Eqn. 10 and Eqn. 12.
9:  until convergence
10:  ▷ Step 2: Distillation
11:  Initialize the student $S$ and freeze the generator $G$.
12:  repeat
13:       Sample a batch of noises $\mathcal{Z}$ from a prior distribution $p$ and synthesize images $\mathcal{X}'=G(\mathcal{Z})$.
14:       Send $\mathcal{X}'$ to $T$ in the cloud to get soft or hard inference responses $\hat{\mathcal{Y}}'_t=T(\mathcal{X}')$.
15:       Update the student $S$ using $\mathcal{L}_{Dt}$ from Eqn. 14.
16:  until convergence
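The control flow of Alg. 1 can be sketched as a small driver that delegates all learning to callbacks and only tracks how many synthetic samples are sent to the cloud. Everything here is a hypothetical skeleton: the callback names (`generate`, `query_teacher`, `update_gan`, `update_student`), the fixed iteration counts that stand in for "until convergence", and the standard Gaussian prior are assumptions of the sketch, not the released implementation.

```python
import random


def run_mekd(generate, query_teacher, update_gan, update_student,
             num_classes, batch_size=4, gan_iters=3, distill_iters=2, seed=0):
    """Two-step MEKD loop; returns the total number of teacher queries."""
    rng = random.Random(seed)

    def sample_noise():
        # z has one coordinate per class, matching the teacher's logit dimension
        return [[rng.gauss(0.0, 1.0) for _ in range(num_classes)]
                for _ in range(batch_size)]

    queried = 0

    # Step 1: deprivatization (train G and D on the edge device)
    for _ in range(gan_iters):
        z = sample_noise()
        images = generate(z)
        responses = query_teacher(images)   # soft or hard responses from the cloud
        queried += len(images)
        update_gan(z, images, responses)    # D with L_D, then G with L_G + alpha * L_IM

    # Step 2: distillation (G is frozen, only the student is updated)
    for _ in range(distill_iters):
        z = sample_noise()
        images = generate(z)
        responses = query_teacher(images)
        queried += len(images)
        update_student(images, responses)   # student update with L_Dt

    return queried


if __name__ == "__main__":
    total = run_mekd(
        generate=lambda z: z,                        # stub generator
        query_teacher=lambda xs: [0] * len(xs),      # stub hard responses
        update_gan=lambda z, xs, ys: None,
        update_student=lambda xs, ys: None,
        num_classes=10,
    )
    print(total)
```

Counting queries in the driver makes the budget of cloud requests explicit, which is the quantity that matters when Internet exchange is limited.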
| Method | Data Size | MNIST | MNIST | CIFAR-10 | CIFAR-10 | CIFAR-100 | CIFAR-100 | Tiny ImageNet | Tiny ImageNet |
|---|---|---|---|---|---|---|---|---|---|
| Teacher | 50K~100K | ResNet32 | VGG13 | ResNet56 | ResNet56 | VGG13 | ResNet56 | ResNet110 | ResNet110 |
| | | 99.50 | 99.52 | 94.15 | 94.15 | 74.68 | 72.06 | 60.71 | 60.71 |
| Student | 50K~100K | ResNet8 | VGG11 | ResNet8 | VGG11 | VGG11 | VGG11 | ResNet32 | MobileNet |
| | | 99.24 | 99.41 | 87.74 | 91.81 | 69.12 | 69.12 | 55.47 | 56.07 |
| KD [19] | 50K~100K | 99.33 | 99.44 | 86.58 | 82.25 | 70.88 | 67.97 | 54.14 | 57.85 |
| ML [3] | 50K~100K | 99.49 | 99.40 | 87.89 | 91.91 | 67.78 | 70.18 | 56.56 | 60.07 |
| AL [53] | 50K~100K | 99.37 | 99.26 | 87.25 | 91.97 | 69.92 | 71.13 | 46.02 | 51.29 |
| DKD [58] | 50K~100K | 99.33 | 99.43 | 86.61 | 92.42 | 67.32 | 70.10 | 55.99 | 59.43 |
| DAFL [8] | 0K | 96.42 | 97.00 | 60.67 | 66.03 | 43.78 | 48.32 | 38.44 | 40.93 |
| KN [37] | 10K | 98.61 | 98.81 | 80.62 | 82.41 | 57.83 | 55.64 | 48.92 | 50.22 |
| AM [51] | 10K | 99.33 | 99.47 | 74.89 | 74.26 | 62.17 | 63.20 | 47.72 | 51.54 |
| DB3KD [54] | 10K | 98.94 | 99.16 | 78.47 | 85.84 | 63.48 | 62.76 | 47.95 | 50.49 |
| MEKD (soft) | 10K | 99.40 | 99.43 | 85.36 | 87.27 | 64.76 | 64.83 | 50.87 | 54.93 |
| MEKD (hard) | 10K | 99.40 | 99.45 | 84.45 | 87.25 | 64.72 | 65.32 | 49.89 | 54.71 |

Table 1: Top-1 classification accuracy (%) of the student model on MNIST, CIFAR-10, CIFAR-100 and Tiny ImageNet.

Distillation. The well-trained generator $G$ contains the knowledge that the teacher uses to make inferences. It is equivalent to a teacher assistant transferring the teacher's knowledge to the student. Fig. 2 illustrates the architecture of MEKD. We freeze the generator and graft it behind the teacher and student models in the same way, using the softened logits of both models as the generator input. A batch of synthetic images $X'=\{{x'}^{(i)}=f_G(z^{(i)})\}_{i=1}^{m}$ is fed into the embedded network to output high-dimensional points in the same image space simultaneously. The distance between the high-dimensional points output from the logits of the teacher model, $X_T''=f_G\circ f_T(X')$, and those from the student, $X_S''=f_G\circ f_S(X')$, is measured by the distance function $\mathcal{L}_F=\mathbbm{d}(X_S'',X_T'')$. We minimize the distance $\mathcal{L}_F$ to drive the student model to mimic the output logits of the teacher model, and use the $\mathcal{L}_1$-norm ($F=1$) of the difference between $X_S''$ and $X_T''$ as the loss function to distill the student,

$$\begin{aligned}\mathcal{L}_{Dt}&=\frac{1}{m}\sum_{i=1}^{m}\left\|G\left(S\left({x'}^{(i)}\right)/\tau\right)-G\left(T\left({x'}^{(i)}\right)/\tau\right)\right\|_{F}\\&\quad+\beta\frac{1}{m}\sum_{i=1}^{m}T\left({x'}^{(i)}\right)\log\left(\frac{T\left({x'}^{(i)}\right)}{S\left({x'}^{(i)}\right)}\right),\end{aligned} \tag{14}$$

where the query sample ${x'}^{(i)}=G(z^{(i)})$ is generated from noise $z^{(i)}$ and the temperature $\tau$ is used to soften the output logits. Through experiments, we found that the $\mathcal{L}_2$-norm has a similar effect to the $\mathcal{L}_1$-norm; refer to Tab. 5.

We also add logit-level knowledge (Eqn. 9) to induce distillation and use a hyperparameter $\beta$ to balance these two losses. Unlike most KD methods, we do not use the cross-entropy loss with ground-truth labels, because such labels are unavailable on edge devices. The algorithm is summarized in Alg. 1.
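The distillation loss can be sketched as follows for one batch of query samples. The sketch assumes that the teacher returns logits (soft responses), that the probabilities in the KL term are plain softmax outputs, and that the frozen generator is available as a callable mapping a $C$-dimensional vector to a flattened image; the function names, the argument `norm`, and the constant `eps` are hypothetical. The default $\beta=1.0$ mirrors the best-performing value in the ablation study.

```python
import math


def softmax(logits):
    """Softmax over one logit vector."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(generator, student_logits, teacher_logits,
                 tau=1.0, beta=1.0, norm=1, eps=1e-12):
    """Image-space distance between G(S(x')/tau) and G(T(x')/tau) plus beta * KL."""
    m = len(student_logits)
    image_term, kl_term = 0.0, 0.0
    for s, t in zip(student_logits, teacher_logits):
        x_s = generator([v / tau for v in s])   # G(S(x') / tau)
        x_t = generator([v / tau for v in t])   # G(T(x') / tau)
        diff = [abs(a - b) for a, b in zip(x_s, x_t)]
        if norm == 1:
            image_term += sum(diff)
        else:
            image_term += math.sqrt(sum(d * d for d in diff))
        p_s, p_t = softmax(s), softmax(t)
        kl_term += sum(pt * math.log((pt + eps) / (ps + eps))
                       for pt, ps in zip(p_t, p_s))
    return image_term / m + beta * kl_term / m


if __name__ == "__main__":
    toy_generator = lambda y: [2.0 * v for v in y]   # stand-in for the frozen G
    print(distill_loss(toy_generator, [[1.0, 0.0]], [[0.0, 1.0]]))
```

In an actual training loop the generator would stay frozen and only the student parameters would receive gradients from both terms.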

5 Experiments

5.1 Experiment Setup

In this section, we compare our method with response-based KD and black-box KD methods in an unsupervised environment. Experimental results show that when the cross-entropy loss based on ground-truth labels is removed, the distillation performance of these methods decreases.

Datasets Setup. We conduct experiments on MNIST [27], CIFAR [26], Tiny ImageNet [11], and ImageNet-1K [11], all of which are widely used for image classification. While training B2KD methods, we randomly select 10K images (100K for ImageNet-1K) from the training set, and all images in the test set (val set for ImageNet) are used as the benchmark to calculate accuracy. For other approaches, except DAFL [8] based on zero-shot learning, we use the whole training set. We mainly use top-1 classification accuracy as an evaluation metric to assess the distillation effect. To make a fair and intuitive comparison, we follow the same setup as previous B2KD methods in our main experiments. However, we find that the original settings in the B2KD experiments do not represent the challenges raised in practical applications, so we add extended experiments in Sec. 5.4 to illustrate the practicability of our proposed method.
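The data protocol above can be sketched in a few lines: draw a random subset of training indices and score predictions by top-1 accuracy. The helper names, the fixed seed, and the toy inputs are assumptions for illustration only.

```python
import random


def sample_subset(num_train, subset_size=10_000, seed=0):
    """Indices of a random training subset, drawn without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_train), subset_size))


def top1_accuracy(outputs, labels):
    """Top-1 accuracy (%) from per-class scores and integer labels."""
    correct = 0
    for scores, label in zip(outputs, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    subset = sample_subset(num_train=50_000)   # e.g. a training set with 50K images
    print(len(subset))
    print(top1_accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 1]))
```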

Implementation. See also the project page (https://github.com/HAIV-Lab/MEKD). We use ResNet [18], VGG [46] and MobileNet [20] as the backbones, and adopt standard data augmentation techniques (random crop and horizontal flip) and an SGD optimizer in all experiments. We consistently train the teacher and student models for 350 epochs, except for MNIST, where we use 12 epochs, and we adopt a multi-step LR scheduler following [23]. After training the teacher, we train a DCGAN [42] with Gaussian noise whose dimension equals the category count. The output logits of the teacher or student for samples in the same class follow a Gaussian distribution, and the logits center is the mean of the Gaussian. Since the conversion between different Gaussian distributions is a linear process, using a Gaussian as the prior distribution $p$ provides a smooth dual space for the student's logits update.

Competing Methods. To verify the effectiveness of our method, we compare it against several response-based KD and black-box KD methods. We select KD [19] proposed by Hinton et al. and ML [3] proposed by Ba and Caruana as the baselines, and we also compare against the recently published DKD [58] based on decoupled KLD. For the two GAN-based KD frameworks summarized in Sec. 2, we choose AL [53] and DAFL [8] as comparison methods. Meanwhile, we compete with black-box KD methods such as KN [37], AM [51] and DB3KD [54]. Of these methods, DB3KD and MEKD (hard) only utilize hard responses, while the other methods are based on soft responses.

| T-S Pairs | KD (soft) | AL (soft) | AM (soft) | DB3KD (hard) | MEKD (soft) |
|---|---|---|---|---|---|
| RN50 - RN34 | 52.08 | 53.50 | 56.92 | 58.61 | 59.89 |
| RX101 - RX50 | 54.90 | 50.88 | 55.64 | 59.90 | 61.21 |

Table 2: Top-1 classification accuracy (%) of the student model on ImageNet-1K with data size 100K. We use the pre-trained ResNet50 (76.13%) and ResNeXt101 (79.32%) as teachers.

5.2 Performance Evaluation

On MNIST, CIFAR, and Tiny ImageNet, we use ResNet32/56/110 and VGG13 as the teacher models and ResNet8/32, VGG11, and MobileNet as the student models. We compare the top-1 classification accuracy (ACC) of different teacher-student pairs; the results are shown in Tab. 1.

On relatively easy tasks, such as MNIST and CIFAR-10, our proposed method has only a small gap compared to response-based KD methods that use the full training set. This makes sense in applications of cloud-to-edge model compression because edge devices do not have enough capacity to store more than ten thousand pieces of data.

CIFAR-100 and Tiny ImageNet are more challenging. These tasks contain far more patterns than MNIST and CIFAR-10, and data distributions are so complex that it is difficult for a generator to capture all the patterns. However, as long as the mode collapse problem can be mitigated, it is possible to synthesize complex samples beneficial to distillation, so we exploit DCGAN [42] as our generator. DCGAN has a more stable training process and is more suitable for generating RGB images than a fully-connected GAN [42]. Experimental results show that MEKD can obtain an accuracy improvement of 5% to 10% compared to other B2KD methods, and the accuracy of MEKD with soft or hard responses is similar, with a difference of less than 1%.

We also conduct experiments on large-scale datasets and sophisticated networks. On ImageNet-1K, we use two teacher-student (T-S) pairs of ResNet50 (RN50) - ResNet34 (RN34) and ResNeXt101 (RX101) - ResNeXt50 (RX50). All methods are trained using a subset of 100K samples. The experimental results are shown in Tab. 2.

Uniformly, we set the number of query samples to 50K on CIFAR and MNIST, 300K on ImageNet, and discuss the performance impact of limited query samples in Sec. 5.4.

| Data Size | 0.1K | 1K | 10K | 50K (full) |
|---|---|---|---|---|
| KD [19] | 16.74 | 31.25 | 70.90 | 90.43 |
| AL [53] | 12.97 | 32.05 | 68.61 | 90.54 |
| AM [51] | 48.31 | 62.05 | 73.65 | 86.33 |
| DB3KD [54] | 43.05 | 64.28 | 81.67 | 92.46 |
| MEKD (soft) | 49.04 | 69.84 | 86.85 | 93.48 |
| MEKD (hard) | 47.12 | 68.66 | 86.53 | 93.09 |

Table 3: Ablation study of data size on CIFAR-10. We use the T-S pair of ResNet56 - MobileNet, and the full training set is 50K.

5.3 Ablation Study

We choose an effective T-S pair [35] of ResNet56 - MobileNet for ablation studies unless otherwise stated.

Ablation Study of Data Size. We explore the performance with different data sizes; the results are shown in Tab. 3. In general, B2KD methods are more robust to small data sizes than traditional KD methods, and among them MEKD achieves the highest distillation performance.

Ablation Study of Deprivatization. $\alpha$ is a hyperparameter that balances $\mathcal{L}_{GAN}$ and $\mathcal{L}_{IM}$. $\mathcal{L}_{IM}$ is used to maximize the responses of the teacher to the generated samples. Therefore, training the generator with or without $\mathcal{L}_{IM}$ affects the quality of the synthetic images. Fig. 5 (a) shows real images of CIFAR-10. Fig. 5 (b) shows synthetic images with $\alpha=0.5$ and Fig. 5 (c) shows synthetic images without $\mathcal{L}_{IM}$ (i.e., $\alpha=0$), both using the same noise vectors. The ResNet56 teacher responds with values in the range $0.72\sim 0.96$ to the synthetic images in Fig. 5 (b) and $0.41\sim 0.87$ to those in Fig. 5 (c). The effect of $\alpha$ is also reported in Tab. 4, which shows that using $\mathcal{L}_{IM}$ can improve the performance of model distillation.

Figure 5: Real images of CIFAR-10 (a) and synthetic images using MEKD with $\mathcal{L}_{IM}$ (b) and without $\mathcal{L}_{IM}$ (c).

Figure 6: Ablation study of temperature $\tau$ on CIFAR-100 (a) and CIFAR-10 (b). We use the T-S pair of ResNet56 - MobileNet.

| $\alpha$ | Soft response | Hard response |
|---|---|---|
| 0.0 | 60.87 | 61.31 |
| 0.1 | 66.06 | 66.86 |
| 0.5 | 67.07 | 67.36 |
| 1.0 | 67.01 | 67.11 |

| $\beta$ | Soft response | Hard response |
|---|---|---|
| 0.0 | 56.28 | 56.23 |
| 0.1 | 64.13 | 65.60 |
| 0.5 | 66.79 | 67.01 |
| 1.0 | 67.07 | 67.36 |

Table 4: Ablation study of hyperparameters $\alpha$ and $\beta$ on CIFAR-100. We use the T-S pair of ResNet56 - MobileNet.

Ablation Study of Distillation. In Eqn. 14, $\beta$ is a trade-off hyperparameter that balances $\mathcal{L}_F$ and $\mathcal{L}_{KL}$, which provide different gradient directions for $\theta_S$. As shown in Tab. 4, the distillation performance can be improved by introducing KLD as an additional loss function.

The temperature $\tau$ is another important hyperparameter for MEKD since it softens the output logits of both the teacher and student models. The results are shown in Fig. 6. Its validity comes from the fact that softened logits have a higher probability of being sampled from a standard normal distribution. Since GANs use a standard Gaussian distribution as input, samples generated from out-of-distribution noises with low sampling probability are usually fuzzy and incorporate few patterns [43], which makes them meaningless for distillation. Meanwhile, a high value of $\tau$ reduces the discrepancy between softened logits, and $\mathcal{L}_F=0$ when they are located in the same cell. This reduces the performance of distillation, especially for challenging tasks, such as ImageNet.
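The first effect of the temperature can be checked numerically: dividing logits by $\tau$ pulls them toward the origin, where the standard normal density is higher. The sketch below evaluates the log-density of softened logits under $\mathcal{N}(0, I)$; the logit values and function names are hypothetical and serve only as an illustration.

```python
import math


def std_normal_logpdf(z):
    """Log-density of a vector under the standard normal N(0, I)."""
    return -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)


def softened_logpdf(logits, tau):
    """Log-density of the temperature-softened logits z / tau under N(0, I)."""
    return std_normal_logpdf([v / tau for v in logits])


if __name__ == "__main__":
    logits = [8.0, -2.0, 1.0, -3.0]   # hypothetical 4-class logits
    for tau in (1.0, 2.0, 4.0):
        print(tau, round(softened_logpdf(logits, tau), 3))
```

The same division also shrinks the gap between teacher and student logits by a factor of $1/\tau$, which is the second effect described above: a very large $\tau$ leaves little discrepancy for $\mathcal{L}_F$ to act on.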

Ablation Study of Different Fsubscript𝐹\mathcal{L}_{F}caligraphic_L start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT. In Eqn. 4, we use Fsubscript𝐹\mathcal{L}_{F}caligraphic_L start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT to calculate the distance between generated samples XS′′superscriptsubscript𝑋𝑆′′X_{S}^{\prime\prime}italic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT and XT′′superscriptsubscript𝑋𝑇′′X_{T}^{\prime\prime}italic_X start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT. From the analysis of experimental results, as shown in Tab. 5, we argue that the effect on distillation is similar whether F𝐹Fitalic_F equals 1111 or 2222. The reason is that Fsubscript𝐹\mathcal{L}_{F}caligraphic_L start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT is used to measure the distance between logits of the student and the boundary of cells, in which logits of the teacher reside, and different Fsubscript𝐹\mathcal{L}_{F}caligraphic_L start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT represent similar gradient directions.

| Dataset | Method | Model | ACC ($\mathcal{L}_1$ / $\mathcal{L}_2$) |
|---|---|---|---|
| CIFAR-10 | MEKD (soft) | MobileNet | 86.85 / 86.63 |
| CIFAR-10 | MEKD (hard) | MobileNet | 86.53 / 86.88 |
| CIFAR-100 | MEKD (soft) | MobileNet | 67.07 / 66.95 |
| CIFAR-100 | MEKD (hard) | MobileNet | 67.36 / 66.94 |

Table 5: Ablation study of different $\mathcal{L}_F$. We use ResNet56 as the teacher model. ACC: top-1 classification accuracy (%).
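The claim about similar gradient directions can be checked on a toy example. The sketch below treats $\mathcal{L}_1$ as a mean absolute error and $\mathcal{L}_2$ as a mean squared error between two hypothetical flattened generated samples; these definitions, the helper names, and the pixel values are assumptions for illustration, not the implementation used in the experiments.

```python
def l1_loss(xs, xt):
    """Mean absolute error between two flattened samples."""
    return sum(abs(a - b) for a, b in zip(xs, xt)) / len(xs)

def l2_loss(xs, xt):
    """Mean squared error between two flattened samples."""
    return sum((a - b) ** 2 for a, b in zip(xs, xt)) / len(xs)

def l1_grad(xs, xt):
    """Gradient of l1_loss w.r.t. xs (subgradient 0 where xs == xt)."""
    n = len(xs)
    return [((a > b) - (a < b)) / n for a, b in zip(xs, xt)]

def l2_grad(xs, xt):
    """Gradient of l2_loss w.r.t. xs."""
    n = len(xs)
    return [2.0 * (a - b) / n for a, b in zip(xs, xt)]

# Hypothetical pixel values of the samples generated from student and teacher logits.
x_student = [0.2, 0.9, 0.4, 0.7]
x_teacher = [0.5, 0.6, 0.4, 0.1]

g1 = l1_grad(x_student, x_teacher)
g2 = l2_grad(x_student, x_teacher)

# Both gradients push every coordinate of x_student in the same direction.
same_sign = all((u > 0) == (v > 0) and (u < 0) == (v < 0) for u, v in zip(g1, g2))
print(l1_loss(x_student, x_teacher), l2_loss(x_student, x_teacher), same_sign)
```

The two losses differ in the magnitude they assign to each coordinate, but in this example their gradients agree in sign everywhere, so a descent step moves the student sample toward the teacher sample under either choice.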

Figure 7: Top-1 classification accuracy curves on CIFAR-100 (a, b) and CIFAR-10 (c, d), using MEKD with soft (a, c) or hard (b, d) responses, with or without $\mathcal{L}_{IM}$ and $\mathcal{L}_{KL}$. We use ResNet56 as the teacher and MobileNet as the student.

5.4 Extended Experiments

In the real-world application of cloud-to-edge model compression, there are some restrictions, such as the limitation of Internet data exchange and the domain shift in practical scenarios. We conduct additional experiments to explore the effect of MEKD under these constraints.

MEKD with Limited Query Samples. We distill a student MobileNet on CIFAR-10 and CIFAR-100 with a total query sample size ranging from $10K$ to $50K$ in steps of $10K$. We report the ACC of MEKD with or without $\mathcal{L}_{IM}$ and $\mathcal{L}_{KL}$. The curves in Fig. 7 show that the more query samples are sent to the cloud server, the more fully the student model on the edge device can be trained. The curves also suggest that $\mathcal{L}_{IM}$ is of limited use without KLD as an additional distillation loss, whereas together with KLD it gives a big boost to the overall MEKD owing to the extra gradient direction of the mapping emulation.

MEKD with Out-of-Domain Data. We train a teacher (ResNet56 or VGG13) with vanilla supervised learning on Syn. Digits [13], which contains about $500K$ software-synthesized images. We distill a student (MobileNet) on SVHN [36], which consists only of real-world photographs. Tab. 6 shows the ACC on the test set of SVHN. MEKD outperforms most methods in the task of out-of-domain distillation, while DB3KD achieves higher performance due to its use of robust labels [54]. However, DB3KD incurs a very high data exchange cost between the server and client, since it requires multiple queries to find a mixed image located on the decision boundary in order to compute robust labels. In contrast, the data exchange cost of MEKD is much lower.

| Teacher | ResNet56 | VGG13 | Data Exchange |
|---|---|---|---|
| | 74.37 | 79.86 | |
| Student | MobileNet | MobileNet | |
| KD [19] | 76.27 | 80.67 | $\sim$ 175 MB |
| ML [3] | 76.78 | 81.90 | $\sim$ 175 MB |
| AL [53] | 77.09 | 80.98 | $\sim$ 175 MB |
| DKD [58] | 75.47 | 80.64 | $\sim$ 175 MB |
| DAFL [8] | 69.20 | 67.07 | $\sim$ 28.4 GB |
| KN [37] | 79.65 | 83.37 | $\sim$ 145 MB |
| AM [51] | 84.05 | 86.70 | $\sim$ 11.6 GB |
| DB3KD [54] | 90.15 | 91.14 | $\sim$ 20.8 GB |
| MEKD (soft) | 86.45 | 88.65 | $\sim$ 120 MB |
| MEKD (hard) | 86.77 | 89.21 | $\sim$ 120 MB |

Table 6: Top-1 classification accuracy (%) of methods on SVHN. The teacher models are trained on Syn. Digits with vanilla supervised learning, and achieve top-1 classification accuracies of 99.56% for ResNet56 and 99.52% for VGG13 on Syn. Digits.

6 Conclusion

In this paper, we provide a two-step workflow of deprivatization and distillation for B2KD. Instead of aligning logits directly, we theoretically provide a new optimization direction from logits to cell boundaries, and propose a new method, MEKD. Taking a generator as an inverse mapping of the teacher function does not leak information about the internal structure or parameters of the teacher, because the generator has a completely different network structure.

Limitation. A well-trained generator is critical in MEKD, and GANs are known to suffer from mode collapse, especially for challenging tasks. We alleviate this problem with DCGAN. Although the parameter size and structural limitations of the model prevent the student from fully mimicking the function of the teacher, MEKD can still improve distillation performance compared with other B2KD methods.

Acknowledgement. This research was supported by Natural Science Fund of Hubei Province (Grant # 2022CFB823), Alibaba Innovation Research program under Grant Contract # CRAQ7WHZ11220001-20978282, and HUST Independent Innovation Research Fund (Grant # 2021XXJS096).

References

  • Aguilar et al. [2020] Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo. Knowledge distillation from internal representations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7350–7357, 2020.
  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
  • Ba and Caruana [2014] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in Neural Information Processing Systems, 27, 2014.
  • Belagiannis et al. [2018] Vasileios Belagiannis, Azade Farshad, and Fabio Galasso. Adversarial network compression. In Proceedings of the European Conference on Computer Vision Workshops, pages 0–0, 2018.
  • Bergmann et al. [2020] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4183–4192, 2020.
  • Braun and Griebel [2009] Jürgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive Approximation, 30(3):653–675, 2009.
  • Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
  • Chen et al. [2019] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3514–3522, 2019.
  • Chen et al. [2020] Hanting Chen, Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Learning student networks via feature embedding. IEEE Transactions on Neural Networks and Learning Systems, 32(1):25–35, 2020.
  • Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2016.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Frankl and Maehara [1988] Peter Frankl and Hiroshi Maehara. The johnson-lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Series B, 44(3):355–362, 1988.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • Gou et al. [2021] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, 2021.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. Advances in Neural Information Processing Systems, 30, 2017.
  • He et al. [2020] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33:14068–14080, 2020.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint:1503.02531, 2(7), 2015.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hu et al. [2017] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In International conference on machine learning, pages 1558–1567. PMLR, 2017.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • Kim et al. [2018] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems, 31, 2018.
  • Komodakis and Zagoruyko [2017] Nikos Komodakis and Sergey Zagoruyko. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
  • Köppen [2002] Mario Köppen. On the training of a kolmogorov network. In International Conference on Artificial Neural Networks, pages 474–479. Springer, 2002.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. University of Toronto: Technical Report, 2009.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lei et al. [2019] Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
  • Lei et al. [2020] Na Lei, Dongsheng An, Yang Guo, Kehua Su, Shixia Liu, Zhongxuan Luo, Shing-Tung Yau, and Xianfeng Gu. A geometric understanding of deep learning. Engineering, 6(3):361–374, 2020.
  • Lewis and Lucchetti [2000] Adrian Stephen Lewis and RE Lucchetti. Nonsmooth duality, sandwich, and squeeze theorems. SIAM Journal on Control and Optimization, 38(2):613–626, 2000.
  • Meng et al. [2019] Zhong Meng, Jinyu Li, Yong Zhao, and Yifan Gong. Conditional teacher-student learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6445–6449, 2019.
  • Micaelli and Storkey [2019] Paul Micaelli and Amos J Storkey. Zero-shot knowledge transfer via adversarial belief matching. Advances in Neural Information Processing Systems, 32, 2019.
  • Milgrom and Segal [2002] Paul Milgrom and Ilya Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
  • Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint:1411.1784, 2014.
  • Mirzadeh et al. [2020] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, pages 5191–5198, 2020.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Orekondy et al. [2019] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Knockoff nets: Stealing functionality of black-box models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4954–4963, 2019.
  • Ozkara et al. [2021] Kaan Ozkara, Navjot Singh, Deepesh Data, and Suhas Diggavi. Quped: Quantized personalization via distillation with applications to federated learning. Advances in Neural Information Processing Systems, 34, 2021.
  • Pan et al. [2020] Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020.
  • Passalis et al. [2020] Nikolaos Passalis, Maria Tzelepi, and Anastasios Tefas. Heterogeneous knowledge distillation using information flow modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2339–2348, 2020.
  • Passban et al. [2021] Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, and Qun Liu. Alp-kd: Attention-based layer projection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13657–13665, 2021.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint:1511.06434, 2015.
  • Schlegl et al. [2017] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pages 146–157. Springer, 2017.
  • Shen et al. [2019] Zhiqiang Shen, Zhankui He, and Xiangyang Xue. Meal: Multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4886–4893, 2019.
  • Shi and Sha [2012] Yuan Shi and Fei Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML, 2012.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Stanton et al. [2021] Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34, 2021.
  • Tenenbaum et al. [2000] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • Tramèr et al. [2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX security symposium (USENIX Security 16), pages 601–618, 2016.
  • Villani [2009] Cédric Villani. Optimal transport: old and new. Springer, 2009.
  • Wang et al. [2020a] Dongdong Wang, Yandong Li, Liqiang Wang, and Boqing Gong. Neural networks are more productive teachers than human raters: Active mixup for data-efficient knowledge distillation from a blackbox model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1498–1507, 2020a.
  • Wang et al. [2020b] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8684–8694, 2020b.
  • Wang et al. [2018] Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Adversarial learning of portable student networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Wang [2021] Zi Wang. Zero-shot knowledge distillation from a decision-based black-box model. In International Conference on Machine Learning, pages 10675–10685. PMLR, 2021.
  • Xu et al. [2017] Zheng Xu, Yen-Chang Hsu, and Jiawei Huang. Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint:1709.00513, 2017.
  • Ye et al. [2020] Jingwen Ye, Yixin Ji, Xinchao Wang, Xin Gao, and Mingli Song. Data-free knowledge amalgamation via group-stack dual-gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12516–12525, 2020.
  • Yim et al. [2017] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
  • Zhao et al. [2022] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11953–11962, 2022.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

Appendix

In the Appendix, we provide proof of theorems and more experimental results for MEKD. We also visualize the real and generated distributions of MEKD with DCGAN to verify the effectiveness of our method.

A. Proofs

The success of deep learning can be attributed to the discovery of intrinsic structures of data, which is defined as the manifold distribution hypothesis [48]. The data is concentrated on a manifold $\Sigma \in \mathbb{R}^n$, which is embedded in the image space $\mathcal{X}$, and the data distribution can be abstracted as a probability distribution $\mu$ over the data manifold. The encoding map $\varphi: \Sigma \rightarrow \Omega$ maps the data manifold $\Sigma$ to the label manifold $\Omega \in \mathbb{R}^C$ in a label space $\mathcal{Y}$, which is also called the latent space, while mapping the data distribution $\mu$ to the latent distribution $\upsilon = \varphi_{\#}\mu$. Each sample $x$ is mapped from the image space into the latent space, and its result $\varphi(x)$ is called a latent code. The decoding map $\varphi^{-1}$ remaps latent codes to the data manifold. Both $\varphi$ and $\varphi^{-1}$ are strongly nonlinear functions, which can be simulated with different neural networks [28, 29]. Meanwhile, the well-known Kolmogorov Theorem [25, 6] indicates that any multivariate continuous function can be represented as the sum of continuous real-valued functions with continuous one-dimensional outer and inner functions $\Phi_q$ and $\Psi_{q,p}$.

The teacher function $f_T \in \varphi$ can be considered as a kind of encoding map, and the generator function $f_G \in \varphi^{-1}$ can be considered as a kind of decoding map. Let $\mathcal{X} \in \mathbb{R}^n$ be the image space, from which data $x$ is sampled. For a $C$-way classification task, let $\mathcal{Y} \in \mathbb{R}^C$ be the latent space, where $|\mathcal{Y}| = C$. Defining the model as a complex mapping function from the image distribution to the latent distribution, we can consider the teacher model as $f_T: \mathcal{X} \rightarrow \mathcal{Y}$ parameterized by $\theta_T \in \Theta_T$, whose outputs indicate the probabilities (e.g., logits) of the category to which each sample belongs. The same holds for the student model $f_S: \mathcal{X} \rightarrow \mathcal{Y}$ parameterized by $\theta_S \in \Theta_S$.

Definition 3.

(Function Equivalence) Given the student and teacher models $f_S$ and $f_T$, let a data distribution $\mu \in \mathcal{X}$ in image space be mapped to $\mathbb{P}_S \in \mathcal{Y}$ and $\mathbb{P}_T \in \mathcal{Y}$ in latent space. If the Wasserstein distance between $\mathbb{P}_S$ and $\mathbb{P}_T$ equals zero,

$$W(\mathbb{P}_S, \mathbb{P}_T) = \inf_{\gamma \in \Pi(\mathbb{P}_S, \mathbb{P}_T)} \mathbb{E}_{(y_S, y_T) \sim \gamma}\left[\, \|y_S - y_T\| \,\right] = 0, \tag{15}$$

then the student and teacher models are equivalent, i.e., $f_S = f_T$, where $\Pi(\mathbb{P}_S, \mathbb{P}_T)$ is the set of all joint distributions $\gamma(y_S, y_T)$ whose marginals are $\mathbb{P}_S$ and $\mathbb{P}_T$, respectively.
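For intuition, the Wasserstein distance between two small empirical distributions can be computed by brute force. The sketch below assumes two equal-size point sets with uniform weights, in which case an optimal coupling can be taken to be a permutation, so it simply tries every matching. The function name and the toy logits are hypothetical, the search is only practical for a handful of points, and the sketch compares output distributions only.

```python
import itertools
import math

def empirical_wasserstein(points_a, points_b):
    """Wasserstein-1 distance between two equal-size point sets with uniform weights.

    Brute-force search over all one-to-one matchings; feasible only for small sets.
    """
    m = len(points_a)
    assert m == len(points_b), "both sets must contain the same number of points"
    best = math.inf
    for perm in itertools.permutations(range(m)):
        cost = sum(math.dist(points_a[i], points_b[j]) for i, j in enumerate(perm)) / m
        best = min(best, cost)
    return best

# Hypothetical teacher logits for three inputs in a 2-class problem.
teacher_out = [(2.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# A student producing the same set of logits (order does not matter for the distance).
student_same = [(0.0, 1.0), (2.0, 0.0), (1.0, 1.0)]
# A student whose output on one input is off by 0.5 in the second coordinate.
student_off = [(0.0, 1.5), (2.0, 0.0), (1.0, 1.0)]

w_same = empirical_wasserstein(student_same, teacher_out)
w_off = empirical_wasserstein(student_off, teacher_out)
print(w_same, w_off)
```

The distance is zero exactly when the two output sets coincide, and it grows with the displacement needed to move one set onto the other.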

Definition 4.

(Inverse Mapping) Given a prior distribution $p \in \mathbb{R}^C$ and a data distribution $\mu \in \mathbb{R}^n$, if the Wasserstein distance between the generated distribution $\mu' = (f_G)_{\#}p$ and $\mu$ equals zero,

$$W(\mu', \mu) = \inf_{\gamma \in \Pi(\mu', \mu)} \mathbb{E}_{(x', x) \sim \gamma}\left[\, \|x' - x\| \,\right] = 0, \tag{16}$$

then the generator $f_G: \mathbb{R}^C \rightarrow \mathbb{R}^n$ is the inverse mapping of the teacher function $f_T: \mathbb{R}^n \rightarrow \mathbb{R}^C$, denoted as $f_G = f_T^{-1}$, where $\Pi(\mu', \mu)$ is the set of all joint distributions $\gamma(x', x)$ whose marginals are $\mu'$ and $\mu$, respectively.
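The criterion in this definition can be illustrated in a deliberately simplified setting with $C = n = 1$. The sketch below pushes a few prior samples through two hypothetical scalar generators and compares the generated samples with toy data, using the fact that in one dimension sorting both samples yields an optimal matching. All names and values are assumptions for illustration.

```python
def wasserstein_1d(samples_a, samples_b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    With uniform weights in one dimension, matching sorted samples is optimal.
    """
    assert len(samples_a) == len(samples_b)
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)

def good_generator(z):
    """Hypothetical generator whose outputs reproduce the toy data exactly."""
    return 2.0 * z + 1.0

def bad_generator(z):
    """Hypothetical generator whose outputs are shifted away from the toy data."""
    return 2.0 * z

prior_samples = [-1.0, -0.5, 0.0, 0.5, 1.0]   # samples from the prior p
data_samples = [3.0, -1.0, 1.0, 0.0, 2.0]     # toy data from mu, in arbitrary order

w_good = wasserstein_1d([good_generator(z) for z in prior_samples], data_samples)
w_bad = wasserstein_1d([bad_generator(z) for z in prior_samples], data_samples)
print(w_good, w_bad)
```

Only the first generator reaches a distance of zero, which is the condition under which the definition would call it an inverse mapping on this toy data.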

A.1. Proof of Theorem 1

Theorem 4.

(Empirical Approximation) For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $g: \mathbb{R}^C \rightarrow \mathbb{R}^n$ be the mapping function of generator $G$ with $n \leq \frac{20 \log m}{\epsilon^2}$. For two sets $V_S = \{y_S : y_S \in \mathbb{P}_S\}$ and $V_T = \{y_T : y_T \in \mathbb{P}_T\}$, both of which have $m$ points in $\mathbb{R}^C$, if the empirical Wasserstein distance between $g(V_S)$ and $g(V_T)$ equals zero,

$$\hat{W}(g(V_S), g(V_T)) = \frac{1}{m}\sum_{i=1}^{m} \|g(y_S^i) - g(y_T^i)\| = 0, \tag{17}$$

then $W(\mathbb{P}_S, \mathbb{P}_T) = 0$.

Proof.

According to the Johnson-Lindenstrauss theorem, for $y_S \in V_S$ and $y_T \in V_T$, we have

$$\|y_S - y_T\| \leq (1 + \epsilon)\, \|g(y_S) - g(y_T)\|. \tag{18}$$

For the sets $V_S$ and $V_T$, we can get the empirical Wasserstein distance between them:

$$
\begin{aligned}
\hat{W}(V_S, V_T) &= \frac{1}{m}\sum_{i=1}^{m} \|y_S^i - y_T^i\| \\
&\leq \frac{1}{m}\sum_{i=1}^{m} (1 + \epsilon)\, \|g(y_S^i) - g(y_T^i)\| \\
&= \frac{1 + \epsilon}{m}\sum_{i=1}^{m} \|g(y_S^i) - g(y_T^i)\| \\
&= (1 + \epsilon)\, \hat{W}(g(V_S), g(V_T)) = 0.
\end{aligned}
\tag{19}
$$

The Wasserstein distance between $\mathbb{P}_S$ and $\mathbb{P}_T$ is the expectation of the empirical Wasserstein distance between $V_S$ and $V_T$, i.e.,

$$W(\mathbb{P}_S, \mathbb{P}_T) = \mathbb{E}_{(V_S, V_T) \sim \Pi(\mathbb{P}_S, \mathbb{P}_T)}\left[\hat{W}(V_S, V_T)\right], \tag{20}$$

so we obtain

$$W(\mathbb{P}_S, \mathbb{P}_T) \leq \hat{W}(V_S, V_T) = 0. \tag{21}$$

Since

$$W(\mathbb{P}_S, \mathbb{P}_T) = \inf_{\gamma \in \Pi(\mathbb{P}_S, \mathbb{P}_T)} \mathbb{E}_{(y_S, y_T) \sim \gamma}\left[\ \|y_S - y_T\|\ \right] \geq 0, \tag{22}$$

it follows that $W(\mathbb{P}_S, \mathbb{P}_T) = 0$. ∎
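
As a minimal numerical sketch of equation 19, the snippet below computes the empirical Wasserstein distance as the mean distance between paired points and checks the $(1+\epsilon)$ bound. The names `empirical_w`, `toy_g`, and `bound_holds` are hypothetical, and `toy_g` is an illustrative distance-preserving embedding standing in for the generator mapping $g$; it is an assumption made for the example, not the trained generator of the paper.

```python
import math

def empirical_w(v_s, v_t):
    """Mean Euclidean distance between paired points, as in equation 19."""
    if len(v_s) != len(v_t) or not v_s:
        raise ValueError("need two non-empty sets of equal size")
    return sum(math.dist(a, b) for a, b in zip(v_s, v_t)) / len(v_s)

def toy_g(y):
    """Hypothetical stand-in for g: duplicates each coordinate, scaled by 1/sqrt(2).

    This map preserves Euclidean distances, so it satisfies the
    Johnson-Lindenstrauss inequality for any epsilon in (0, 1/2).
    """
    s = 1.0 / math.sqrt(2.0)
    return [s * c for c in y] + [s * c for c in y]

def bound_holds(v_s, v_t, eps):
    """Check W_hat(V_S, V_T) <= (1 + eps) * W_hat(g(V_S), g(V_T))."""
    low = empirical_w(v_s, v_t)
    high = empirical_w([toy_g(y) for y in v_s], [toy_g(y) for y in v_t])
    return low <= (1.0 + eps) * high

v_s = [(0.0, 0.0), (1.0, 1.0)]
v_t = [(3.0, 4.0), (1.0, 1.0)]
print(empirical_w(v_s, v_t))         # mean of the paired distances 5.0 and 0.0
print(bound_holds(v_s, v_t, 0.25))
```

When the mapped sets coincide, the right-hand side of the bound is zero, which forces the logit-space distance to zero as well; this is the step the proof relies on.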

A.2. Proof of Theorem 2

Theorem 5.

(Optimization Direction) Let $\mu \in \mathcal{X}$ be any distribution. $f_S, f_T, f_G$ are the mapping functions of the student, teacher, and generator, respectively. $f_S$ is parameterized by $\theta_S \in \Theta_S$. Then, when

$$\min_{\theta_S \in \Theta_S} \mathbb{E}_{x \sim \mu}\left[\|f_G \circ f_S(x), f_G \circ f_T(x)\|\right] \rightarrow 0, \tag{23}$$

it holds that $f_S \rightarrow f_T$, and we have

$$
\begin{aligned}
\nabla_{\theta_S} \mathbb{E}_{x \sim \mu}[f_S(x)] &= \nabla_{\theta_S} W(\mathbb{P}_S, \mathbb{P}_T) \\
&= \mathbb{E}_{x \sim \mu}[\nabla_{\theta_S} \|f_G \circ f_S(x) - f_G \circ f_T(x)\|].
\end{aligned} \tag{24}
$$
Proof.

Let us define

$$V(f_S, \theta_S) = \mathbb{E}_{x \sim \mu}\left[\ \|f_S(x), f_T(x)\|\ \right], \tag{25}$$

$$V'(f_S, \theta_S) = \mathbb{E}_{x \sim \mu}\left[\ \|f_G \circ f_S(x), f_G \circ f_T(x)\|\ \right], \tag{26}$$

where $f_S$ lies in $\mathcal{F}_S = \{f_S : \mathcal{X} \rightarrow \mathcal{Y}\}$ and $\theta_S \in \Theta_S$.

According to the Johnson-Lindenstrauss Lemma [12], for any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $n = \frac{20 \log m}{\epsilon^2}$. Then, for any set $S$ of $m$ points in $\mathbb{R}^C$, the generator mapping function $f_G : \mathbb{R}^C \rightarrow \mathbb{R}^n$ satisfies, for all $f_S(x), f_T(x) \in S$,

$$
\begin{aligned}
(1-\epsilon)\ \|f_G \circ f_S(x), f_G \circ f_T(x)\| &\leq \|f_S(x), f_T(x)\| \\
&\leq (1+\epsilon)\ \|f_G \circ f_S(x), f_G \circ f_T(x)\|.
\end{aligned} \tag{27}
$$
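
To make the quantities in equation 27 concrete, the sketch below evaluates the target dimension $n = \frac{20 \log m}{\epsilon^2}$ and tests whether a pair of distances respects the two-sided distortion bound. The function names are hypothetical, and two choices are assumptions of this example rather than statements from the paper: the logarithm is taken to be natural, and $n$ is rounded up to an integer.

```python
import math

def jl_dimension(m, eps):
    """Target dimension n = 20 log(m) / eps^2, rounded up to an integer."""
    if not (0.0 < eps < 0.5):
        raise ValueError("eps must lie in (0, 1/2)")
    if m <= 4:
        raise ValueError("m must be an integer greater than 4")
    return math.ceil(20.0 * math.log(m) / eps ** 2)

def within_distortion(logit_dist, image_dist, eps):
    """Check (1 - eps) * image_dist <= logit_dist <= (1 + eps) * image_dist."""
    return (1.0 - eps) * image_dist <= logit_dist <= (1.0 + eps) * image_dist

print(jl_dimension(100, 0.25))
print(within_distortion(1.0, 1.1, 0.25))  # image distance close to logit distance
print(within_distortion(1.0, 2.0, 0.25))  # image distance too large for this eps
```

Here `logit_dist` plays the role of $\|f_S(x), f_T(x)\|$ and `image_dist` the role of $\|f_G \circ f_S(x), f_G \circ f_T(x)\|$.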

Using the Squeeze Theorem [30], we know that the minimizations of equation 25 and equation 26 converge to the same result, i.e.,

$$\inf V(f_S, \theta_S) = \inf V'(f_S, \theta_S). \tag{28}$$

We can rewrite equation 15 using $x \sim \mu$:

$$
\begin{aligned}
W(\mathbb{P}_S, \mathbb{P}_T) &= \inf_{\gamma \in \Pi(\mathbb{P}_S, \mathbb{P}_T)} \mathbb{E}_{(y_S, y_T) \sim \gamma}\left[\ \|y_S - y_T\|\ \right] \\
&= \inf_{\gamma \in \Pi(f_S(\mu), f_T(\mu))} \mathbb{E}_{x \sim \mu}\left[\ \|f_S(x), f_T(x)\|\ \right] \\
&= \inf_{\gamma \in \Pi(f_S(\mu), f_T(\mu))} V(f_S, \theta_S),
\end{aligned} \tag{29}
$$

where $f_S$ and $f_T$ map distribution $\mu$ to $\mathbb{P}_S$ and $\mathbb{P}_T$, respectively. So we can get

$$\inf\ V'(f_S, \theta_S) = \inf\ V(f_S, \theta_S) = W(\mathbb{P}_S, \mathbb{P}_T). \tag{30}$$

According to Def. 3, when $\inf V'(f_S, \theta_S) \rightarrow 0$, we have $W(\mathbb{P}_S, \mathbb{P}_T) \rightarrow 0$, and we can derive that $f_S \rightarrow f_T$.

The rest of the proof is dedicated to showing that the optimal solution of $\min V'(f_S, \theta_S)$ reduces the Wasserstein distance between $\mathbb{P}_S$ and $\mathbb{P}_T$, which drives $f_S$ to approximate $f_T$.

We know by the Kantorovich-Rubinstein duality [50] that there is an $\tilde{f}_S \in \mathcal{F}_S$ that attains

$$
\begin{aligned}
&\inf\ \mathbb{E}_{x \sim \mu}[\ \|\tilde{f}_S(x), f_T(x)\|\ ] \\
&= \sup\ \mathbb{E}_{x \sim \mu}[\ \tilde{f}_S(x)\ ] - \mathbb{E}_{x \sim \mu}[\ f_T(x)\ ].
\end{aligned} \tag{31}
$$

Let us define $\tilde{X}(\theta_S) = \{\tilde{f}_S \in \mathcal{F}_S : V(\tilde{f}_S, \theta_S) = W(\mathbb{P}_S, \mathbb{P}_T)\}$, which is non-empty. We know by a simple envelope theorem [33] that

$$\nabla_{\theta_S} W(\mathbb{P}_S, \mathbb{P}_T) = \nabla_{\theta_S} V(\tilde{f}_S, \theta_S), \tag{32}$$

for any $\tilde{f}_S \in \tilde{X}(\theta_S)$ when both terms are well-defined.

Let $\tilde{f}_S \in \tilde{X}(\theta_S)$, which we know exists since $\tilde{X}(\theta_S)$ is non-empty for all $\theta_S$. Then, we get

$$
\begin{aligned}
\nabla_{\theta_S} W(\mathbb{P}_S, \mathbb{P}_T) &= \nabla_{\theta_S} V(\tilde{f}_S, \theta_S) \\
&= \nabla_{\theta_S} \mathbb{E}_{x \sim \mu}[\ \|\tilde{f}_S(x), f_T(x)\|\ ] \\
&= \nabla_{\theta_S} \mathbb{E}_{x \sim \mu}[\ \tilde{f}_S(x)\ ] - \mathbb{E}_{x \sim \mu}\left[\ f_T(x)\ \right] \\
&= \nabla_{\theta_S} \mathbb{E}_{x \sim \mu}[\ \tilde{f}_S(x)\ ].
\end{aligned} \tag{33}
$$

In practice, we use the empirical distance between the generated images of the student and teacher as the loss to update $\theta_S$ by back-propagation, i.e.,

$$
\begin{aligned}
\nabla_{\theta_S} \mathbb{E}_{x \sim \mu}[f_S(x)] &= \nabla_{\theta_S} W(\mathbb{P}_S, \mathbb{P}_T) \\
&= \nabla_{\theta_S} W((f_G)_{\#}\mathbb{P}_S, (f_G)_{\#}\mathbb{P}_T) \\
&= \nabla_{\theta_S} \mathbb{E}_{x \sim \mu}[\|f_G \circ f_S(x) - f_G \circ f_T(x)\|] \\
&= \mathbb{E}_{x \sim \mu}[\nabla_{\theta_S} \|f_G \circ f_S(x) - f_G \circ f_T(x)\|],
\end{aligned} \tag{34}
$$

when $W(\mathbb{P}_S, \mathbb{P}_T) \rightarrow 0$, the student function $f_S$ converges to the teacher function $f_T$. ∎
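
The loss in equation 34 can be illustrated with a toy example. The sketch below is only a hedged illustration: `f_teacher`, `f_student`, and `f_generator` are hypothetical hand-written maps (a one-parameter student, a fixed teacher, and a frozen linear generator), and a finite-difference estimate stands in for back-propagation. None of these are the networks used in the paper.

```python
import math

def f_teacher(x):
    """Hypothetical teacher: maps a scalar input to C = 2 logits."""
    return [2.0 * x, -1.0 * x]

def f_student(theta, x):
    """Hypothetical student with one trainable parameter theta."""
    return [theta * x, -1.0 * x]

def f_generator(y):
    """Hypothetical frozen generator: a fixed linear map from R^2 to R^3."""
    return [y[0], y[1], y[0] + y[1]]

def image_space_loss(theta, xs):
    """Empirical mean of ||f_G(f_S(x)) - f_G(f_T(x))|| over the samples xs."""
    total = 0.0
    for x in xs:
        total += math.dist(f_generator(f_student(theta, x)),
                           f_generator(f_teacher(x)))
    return total / len(xs)

def finite_diff_grad(theta, xs, h=1e-6):
    """Central finite-difference estimate of d(loss)/d(theta)."""
    return (image_space_loss(theta + h, xs)
            - image_space_loss(theta - h, xs)) / (2.0 * h)

xs = [1.0, 2.0, 3.0]
print(image_space_loss(1.0, xs))  # positive while the student differs from the teacher
print(image_space_loss(2.0, xs))  # zero when the student matches the teacher
print(finite_diff_grad(1.0, xs))  # negative: moving theta toward 2 lowers the loss
```

In this toy setting the image-space loss vanishes exactly when the student logits equal the teacher logits, mirroring the statement that $f_S \rightarrow f_T$ as the distance goes to zero.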

A.3. Proof of Theorem 3

Theorem 6.

(Generalization Bound) Let $H \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set for a $C$-way classification task. For any $0 < \epsilon < 1/2$ and a sample $S$ of size $m > 4$ drawn according to $\mu$, let $g : \mathbb{R}^C \rightarrow \mathbb{R}^n$ be a mapping function of generator $G$ with $n \leq \frac{20 \log m}{\epsilon^2}$. Fix $\rho > 0$. For any $1 > \delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:

$$R(h) \leq \hat{R}_{\rho}(h) + \frac{2C^2}{\rho(1-\epsilon)}\sqrt{\frac{r^2 \Lambda^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{35}$$

Here, for any $x \in \mathcal{X}$, $\Lambda \geq 0$ satisfies $(\sum_{y=1}^{C} \|h(x, y)\|^p)^{1/p} \leq \Lambda$ for any $p \geq 1$, and $r > 0$ satisfies $K(x, x) \leq r^2$, where the kernel $K : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is positive definite symmetric.
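
To give a feel for the size of the two additive terms in equation 35, the following sketch evaluates them for chosen constants. The function name `bound_gap` and all numerical values are hypothetical and picked only for illustration; the logarithm is assumed to be natural.

```python
import math

def bound_gap(C, rho, eps, r, lam, m, delta):
    """Sum of the complexity term and the confidence term in equation 35."""
    complexity = 2.0 * C ** 2 / (rho * (1.0 - eps)) * math.sqrt(r ** 2 * lam ** 2 / m)
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return complexity + confidence

# Hypothetical constants: 10 classes, rho = 1, eps = 0.25, r = Lambda = 1.
small = bound_gap(C=10, rho=1.0, eps=0.25, r=1.0, lam=1.0, m=10_000, delta=0.05)
large = bound_gap(C=10, rho=1.0, eps=0.25, r=1.0, lam=1.0, m=40_000, delta=0.05)
print(small, large)  # the gap shrinks as the sample size m grows
```

Both terms decay as $1/\sqrt{m}$, so quadrupling the sample size halves the gap between $R(h)$ and $\hat{R}_{\rho}(h)$ allowed by the bound.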

Proof.

For the $C$-way classification task, a hypothesis $h : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ assigns to $x$ the label $y$ with the minimum distance, i.e., $\arg\min_{y \in \mathcal{Y}} \|\overline{h}(x) - \overline{h}_y\|$, which is equivalent to $\arg\min_{y \in \mathcal{Y}} (1+\epsilon)\|g(\overline{h}(x)) - g(\overline{h}_y)\|$ by the Johnson-Lindenstrauss theorem. We define the margin $\rho_h(x, y)$ of the hypothesis $h$ as

$$\rho_h(x, y) = \|g(\overline{h}(x)) - g(\overline{h}_y)\| - \min_{y' \neq y} \|g(\overline{h}(x)) - g(\overline{h}_{y'})\|, \tag{36}$$

where $\overline{h}(x)$ is the vector of $h(x, y)$, $y \in \mathcal{Y}$, and $\overline{h}_y$ uses the mean of the $x$ that belong to class $y$ as input. $g$ is the mapping function of generator $G$.
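
A small sketch of the margin in equation 36 follows. It assumes the generator outputs are already available as plain coordinate lists; `margin`, `g_hx`, and `g_prototypes` are hypothetical names, with `g_prototypes[k]` standing for $g(\overline{h}_k)$.

```python
import math

def margin(g_hx, g_prototypes, y):
    """Distance to the class-y prototype minus the distance to the nearest other one."""
    own = math.dist(g_hx, g_prototypes[y])
    other = min(math.dist(g_hx, p)
                for k, p in enumerate(g_prototypes) if k != y)
    return own - other

g_prototypes = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical g(h_bar_y) for two classes
g_hx = (1.0, 0.0)                        # hypothetical g(h_bar(x))
print(margin(g_hx, g_prototypes, 0))  # negative: x is closer to class 0
print(margin(g_hx, g_prototypes, 1))  # positive: x is farther from class 1
```

Under this definition a negative margin corresponds to a correct prediction, which matches the sign convention of the margin loss used in the remainder of the proof.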

For any $\rho < 0$, we can define the empirical margin loss of hypothesis $h$ for multi-class classification as

$$\hat{R}_{\rho}(h) = \frac{1}{m}\sum_{i=1}^{m} \Phi_{\rho}(\rho_h(x_i, y_i)), \tag{37}$$

where $\Phi_{\rho}$ is the margin loss function

$$
\Phi_{\rho}(x) = \begin{cases}
1 & 0 \leq x, \\
1 - x/\rho & \rho \leq x \leq 0, \\
0 & x \leq \rho.
\end{cases} \tag{41}
$$

Thus, the empirical margin loss is upper bounded by

$$\hat{R}_{\rho}(h) \leq \frac{1}{m}\sum_{i=1}^{m} \mathbbm{1}_{\rho_h(x_i, y_i) \geq \rho}. \tag{42}$$
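
Equations 37, 41, and 42 can be checked on a handful of margins. The sketch below follows the proof's convention $\rho < 0$; the function names and the sample margins are hypothetical.

```python
def margin_loss(x, rho):
    """Piecewise-linear margin loss of equation 41, for rho < 0."""
    if rho >= 0:
        raise ValueError("this sketch follows the proof's convention rho < 0")
    if x >= 0:
        return 1.0
    if x <= rho:
        return 0.0
    return 1.0 - x / rho

def empirical_margin_loss(margins, rho):
    """Average margin loss over a sample, as in equation 37."""
    return sum(margin_loss(v, rho) for v in margins) / len(margins)

def indicator_bound(margins, rho):
    """Fraction of samples whose margin is at least rho (right side of equation 42)."""
    return sum(1 for v in margins if v >= rho) / len(margins)

margins = [-2.0, -0.5, 0.5]  # hypothetical values of rho_h(x_i, y_i)
rho = -1.0
print(empirical_margin_loss(margins, rho))
print(indicator_bound(margins, rho))
```

Because $\Phi_{\rho}$ never exceeds 1 and is zero whenever the margin is at most $\rho$, the average loss cannot exceed the fraction of samples with margin at least $\rho$.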

Let $\tilde{H}=\{(x,y)\mapsto\rho_{h}(x,y):h\in H\}$, and consider the family of functions $\tilde{\mathcal{H}}=\{\Phi_{\rho}\circ r:r\in\tilde{H}\}$ derived from $\tilde{H}$, which take values in $[0,1]$. By the Rademacher complexity generalization bound, with probability at least $1-\delta$, for all $h\in H$,

$$
E[\Phi_{\rho}(\rho_{h}(x,y))]\leq\hat{R}_{\rho}(h)+2\mathcal{R}_{m}(\Phi_{\rho}\circ\tilde{H})+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{43}
$$

Since $\mathbbm{1}_{\mu\geq 0}\leq\Phi_{\rho}(\mu)$ for all $\mu\in\mathbb{R}$, the generalization error $R(h)$ is a lower bound on the left-hand side. By the Johnson-Lindenstrauss theorem, $R(h)=E\left[\mathbbm{1}_{\|\overline{h}(x)-\overline{h}_{y}\|-\min_{y^{\prime}\neq y}\|\overline{h}(x)-\overline{h}_{y^{\prime}}\|\geq 0}\right]\leq E[\Phi_{\rho}(\rho_{h}(x,y))]$, and we get

$$
R(h)\leq\hat{R}_{\rho}(h)+2\mathcal{R}_{m}(\Phi_{\rho}\circ\tilde{H})+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{44}
$$

Replace $\rho$ by $-\rho$, so that $\rho$ is positive. Since $\Phi_{\rho}$ is $(1/\rho)$-Lipschitz, we have $\mathcal{R}_{m}(\Phi_{\rho}\circ\tilde{H})\leq\frac{1}{\rho}\mathcal{R}_{m}(\tilde{H})$. Here, $\mathcal{R}_{m}(\tilde{H})$ can be upper bounded as follows:

$$
\begin{aligned}
\mathcal{R}_{m}(\tilde{H})
&=\frac{1}{m}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y_{i})\Big]\\
&=\frac{1}{m}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sum_{y\in\mathcal{Y}}\sigma_{i}\rho_{h}(x_{i},y)\mathbbm{1}_{y=y_{i}}\Big]\\
&\leq\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)\mathbbm{1}_{y=y_{i}}\Big]\\
&=\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)\Big(\frac{2(\mathbbm{1}_{y=y_{i}})-1}{2}+\frac{1}{2}\Big)\Big]\\
&\leq\frac{1}{2m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}(2(\mathbbm{1}_{y=y_{i}})-1)\rho_{h}(x_{i},y)\Big]\\
&\quad+\frac{1}{2m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)\Big]\\
&=\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)\Big],
\end{aligned}
\tag{45}
$$

where $\sigma=(\sigma_{1},\ldots,\sigma_{m})^{T}$, with the $\sigma_{i}$ independent uniform random variables taking values in $\{-1,+1\}$. The last equality follows from observing that $\sigma_{i}$ and $-\sigma_{i}$ are distributed in the same way, so multiplying $\sigma_{i}$ by $2(\mathbbm{1}_{y=y_{i}})-1\in\{-1,+1\}$ does not change the expectation.
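The sign-symmetry argument behind the last equality of Eq. (45) can be checked exactly on a toy problem. The sketch below enumerates every sign vector $\sigma\in\{-1,+1\}^{m}$ for a small finite hypothesis class and confirms that multiplying each $\sigma_{i}$ by a fixed sign $a_{i}$ (playing the role of $2(\mathbbm{1}_{y=y_{i}})-1$) leaves the expected supremum unchanged. The matrix `values`, the vector `a`, and the helper name are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def expected_sup(values, signs=None):
    """Exact E_sigma[ max_h sum_i sigma_i * signs_i * values[h, i] ].

    values: array of shape (num_hypotheses, m), one row per hypothesis.
    signs:  optional fixed vector in {-1, +1}^m multiplied into sigma.
    """
    values = np.asarray(values, dtype=float)
    m = values.shape[1]
    signs = np.ones(m) if signs is None else np.asarray(signs, dtype=float)
    total = 0.0
    # Enumerate all 2^m sign vectors, so the expectation is exact.
    for sigma in itertools.product([-1.0, 1.0], repeat=m):
        scores = values @ (np.array(sigma) * signs)  # one score per hypothesis
        total += scores.max()
    return total / 2 ** m

rng = np.random.default_rng(0)
values = rng.normal(size=(5, 6))       # 5 toy hypotheses evaluated on m = 6 samples
a = np.array([1, -1, -1, 1, -1, 1])    # stands in for 2 * 1[y = y_i] - 1

plain = expected_sup(values)
flipped = expected_sup(values, signs=a)
print(np.isclose(plain, flipped))      # True
```

The two values agree because the map $\sigma\mapsto(a_{1}\sigma_{1},\ldots,a_{m}\sigma_{m})$ only permutes the set of sign vectors.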

Let $\Pi_{1}(H)^{(C-1)}=\{\min\{h_{1},\ldots,h_{l}\}:h_{i}\in\Pi_{1}(H),i\in[1,C-1]\}$. By the Johnson-Lindenstrauss theorem, we get

$$
\begin{aligned}
\mathcal{R}_{m}(\tilde{H})
&\leq\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\Big(\|g(\overline{h}(x_{i}))-g(\overline{h}_{y})\|\\
&\qquad-\min_{y^{\prime}\neq y}\|g(\overline{h}(x_{i}))-g(\overline{h}_{y^{\prime}})\|\Big)\Big]\\
&\leq\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\frac{1}{1-\epsilon}\Big(\|\overline{h}(x_{i})-\overline{h}_{y}\|\\
&\qquad-\min_{y^{\prime}\neq y}\|\overline{h}(x_{i})-\overline{h}_{y^{\prime}}\|\Big)\Big]\\
&\leq\frac{1}{(1-\epsilon)m}\sum_{y\in\mathcal{Y}}\Big[\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\|\overline{h}(x_{i})-\overline{h}_{y}\|\Big]\\
&\qquad+\mathop{E}_{S,\sigma}\Big[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\min_{y^{\prime}\neq y}\|\overline{h}(x_{i})-\overline{h}_{y^{\prime}}\|\Big]\Big]\\
&\leq\frac{1}{(1-\epsilon)m}\sum_{y\in\mathcal{Y}}\Big[\mathop{E}_{S,\sigma}\Big[\sup_{h\in\Pi_{1}(H)}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\Big]\\
&\qquad+\mathop{E}_{S,\sigma}\Big[\sup_{h\in\Pi_{1}(H)^{(C-1)}}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\Big]\Big]\\
&\leq\frac{C}{(1-\epsilon)m}\Big[C\mathop{E}_{S,\sigma}\Big[\sup_{h\in\Pi_{1}(H)}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\Big]\Big]\\
&=\frac{C^{2}}{1-\epsilon}\Big[\frac{1}{m}\mathop{E}_{S,\sigma}\Big[\sup_{h\in\Pi_{1}(H)}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\Big]\Big]\\
&=\frac{C^{2}}{1-\epsilon}\mathcal{R}_{m}(\Pi_{1}(H)).
\end{aligned}
\tag{46}
$$
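The second inequality in Eq. (46) trades distances between mapped points for distances between the original points at the cost of a factor $1/(1-\epsilon)$. The following sketch illustrates the kind of distance preservation that the Johnson-Lindenstrauss lemma provides, using a random Gaussian projection. The projection, the dimensions, and all variable names are assumptions made for illustration; this is not the mapping $g$ used in the proof.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
n, d_high, d_low = 20, 1000, 500

points = rng.normal(size=(n, d_high))
# Random Gaussian projection, scaled so squared norms are preserved in expectation.
projection = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)
projected = points @ projection

upper = np.triu_indices(n, k=1)  # each unordered pair once
original = pairwise_distances(points)[upper]
mapped = pairwise_distances(projected)[upper]

# Smallest eps with (1 - eps) * original <= mapped <= (1 + eps) * original.
eps = np.max(np.abs(mapped / original - 1.0))
print(f"largest relative distortion over {original.size} pairs: {eps:.3f}")
```

With such an $\epsilon$, every original distance is at most the mapped distance divided by $1-\epsilon$, which is the form of the factor appearing in Eq. (46).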

Let $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ be a positive definite symmetric kernel and let $h(x,y)=\arg\max_{y\in\mathcal{Y}}w_{y}\cdot\Phi(x)$, where $\Phi:\mathcal{X}\rightarrow\mathbb{R}^{n}$ is a feature mapping associated with $K$. We write $W=(w_{1}^{\top},\ldots,w_{C}^{\top})$. For any $p\geq 1$, the family of kernel-based hypotheses is

$$
H=\{h\in\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}:h(x,y)\in\mathbb{R}^{n},\|h\|_{p}\leq\Lambda\}, \tag{47}
$$

where $\|h\|_{p}=(\sum_{y=1}^{C}\|h(x,y)\|^{p})^{1/p}$.

Observe that for all $l\in[1,C]$, we have $\|w_{l}\|\leq(\sum_{l=1}^{C}\|w_{l}\|^{p})^{1/p}=\|W\|_{p}\leq\|h\|_{p}\leq\Lambda$. Moreover, for $i\neq j$, $\mathop{E}_{\sigma}[\sigma_{i}\sigma_{j}]=0$. The Rademacher complexity of the hypothesis set $\Pi_{1}(H)$ can be expressed and bounded as follows:

$$
\begin{aligned}
\mathcal{R}_{m}(\Pi_{1}(H))
&=\frac{1}{m}\mathop{E}_{S,\sigma}\left[\sup_{y\in\mathcal{Y},\|W\|\leq\Lambda}\left\langle w_{y},\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\rangle\right]\\
&\leq\frac{1}{m}\mathop{E}_{S,\sigma}\left[\sup_{y\in\mathcal{Y},\|W\|\leq\Lambda}\|w_{y}\|\left\|\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\|\right]\\
&\leq\frac{\Lambda}{m}\mathop{E}_{S,\sigma}\left[\left\|\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\|\right]\\
&\leq\frac{\Lambda}{m}\left[\mathop{E}_{S,\sigma}\left[\left\|\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\|^{2}\right]\right]^{1/2}\\
&=\frac{\Lambda}{m}\left[\mathop{E}_{S,\sigma}\left[\sum_{i=1}^{m}\|\Phi(x_{i})\|^{2}\right]\right]^{1/2}\\
&=\frac{\Lambda}{m}\left[\mathop{E}_{S,\sigma}\left[\sum_{i=1}^{m}K(x_{i},x_{i})\right]\right]^{1/2}\\
&\leq\frac{\Lambda\sqrt{mr^{2}}}{m}=\sqrt{\frac{r^{2}\Lambda^{2}}{m}},
\end{aligned}
\tag{48}
$$

which concludes the proof.
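Eq. (48) can be sanity-checked numerically for a fixed sample. The sketch below assumes a linear kernel $K(x,x^{\prime})=\langle x,x^{\prime}\rangle$ (so $\Phi$ is the identity) and points with $\|x_{i}\|=r$; both are assumptions made only for illustration. For this class, $\sup_{\|w\|\leq\Lambda}\langle w,v\rangle=\Lambda\|v\|$, so the third line of Eq. (48) can be computed exactly by enumerating $\sigma$. The helper `chained_bound` simply substitutes the Lipschitz step, Eq. (46) and Eq. (48) into Eq. (44) with the constants as they appear in this proof; its name and signature are hypothetical.

```python
import itertools
import math
import numpy as np

def empirical_rademacher_linear(X, lam):
    """Exact (lam / m) * E_sigma || sum_i sigma_i x_i || for a fixed sample X.

    Uses sup_{||w|| <= lam} <w, v> = lam * ||v||, valid for the linear kernel.
    """
    m = X.shape[0]
    total = 0.0
    for sigma in itertools.product([-1.0, 1.0], repeat=m):
        total += np.linalg.norm(np.array(sigma) @ X)
    return lam * total / (m * 2 ** m)

def kernel_bound(r, lam, m):
    """Right-hand side of Eq. (48): sqrt(r^2 * Lambda^2 / m)."""
    return math.sqrt(r ** 2 * lam ** 2 / m)

def chained_bound(emp_margin_loss, num_classes, eps, rho, r, lam, m, delta):
    """Eq. (44) with the Lipschitz step, Eq. (46) and Eq. (48) substituted in.

    Assumes rho > 0 (after the sign change) and 0 < eps < 1.
    """
    complexity = num_classes ** 2 / ((1.0 - eps) * rho) * kernel_bound(r, lam, m)
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return emp_margin_loss + 2.0 * complexity + confidence

rng = np.random.default_rng(0)
m, n, r, lam = 10, 4, 2.0, 1.5
X = rng.normal(size=(m, n))
X = r * X / np.linalg.norm(X, axis=1, keepdims=True)  # every x_i has norm r

exact = empirical_rademacher_linear(X, lam)
print(exact <= kernel_bound(r, lam, m))  # True, by Jensen's inequality
```

A call such as `chained_bound(0.1, 10, 0.1, 1.0, 1.0, 1.0, 10_000, 0.05)` then evaluates the resulting right-hand side for hypothetical constants.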

| Method | Data Size | MNIST | MNIST | MNIST | MNIST | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 |
|---|---|---|---|---|---|---|---|---|---|
| Teacher | 50K~100K | ResNet32 | VGG13 | ResNet32 | ResNet32 | ResNet56 | VGG13 | ResNet56 | ResNet56 |
| | | 99.50 | 99.52 | 99.50 | 99.50 | 94.15 | 94.42 | 94.15 | 94.15 |
| Student | 50K~100K | ResNet8 | VGG11 | VGG11 | MobileNet | ResNet8 | VGG11 | VGG11 | MobileNet |
| | | 99.24 | 99.41 | 99.41 | 99.18 | 87.74 | 91.81 | 91.81 | 90.04 |
| KD [19] | 50K~100K | 99.33 | 99.44 | 99.31 | 99.30 | 86.58 | 92.16 | 92.25 | 90.43 |
| ML [3] | 50K~100K | 99.49 | 99.40 | 99.44 | 99.40 | 87.89 | 91.58 | 91.91 | 91.19 |
| AL [53] | 50K~100K | 99.37 | 99.26 | 99.26 | 99.21 | 87.25 | 91.96 | 91.97 | 90.54 |
| DKD [58] | 50K~100K | 99.33 | 99.43 | 99.48 | 99.42 | 86.61 | 92.06 | 92.42 | 90.50 |
| DAFL [8] | 0K | 96.42 | 97.00 | 96.14 | 97.85 | 60.67 | 65.41 | 66.03 | 69.59 |
| KN [37] | 10K | 98.61 | 98.81 | 98.07 | 98.54 | 80.62 | 81.83 | 82.41 | 85.07 |
| AM [51] | 10K | 99.33 | 99.47 | 99.50 | 99.42 | 74.89 | 77.25 | 74.26 | 73.65 |
| DB3KD [54] | 10K | 98.94 | 99.16 | 98.91 | 98.91 | 78.47 | 83.72 | 85.84 | 81.67 |
| MEKD (soft) | 10K | 99.40 | 99.43 | 99.36 | 99.25 | 85.36 | 86.11 | 87.27 | 86.85 |
| MEKD (hard) | 10K | 99.40 | 99.45 | 99.28 | 99.27 | 84.45 | 86.16 | 87.25 | 86.53 |

Table 7: Top-1 classification accuracy (%) of the student model on MNIST and CIFAR-10.

| Method | Data Size | CIFAR-100 | CIFAR-100 | CIFAR-100 | CIFAR-100 | Tiny ImageNet | Tiny ImageNet | Tiny ImageNet | Tiny ImageNet |
|---|---|---|---|---|---|---|---|---|---|
| Teacher | 50K~100K | ResNet56 | VGG13 | ResNet56 | ResNet56 | ResNet110 | VGG13 | ResNet110 | ResNet110 |
| | | 72.06 | 74.68 | 72.06 | 72.06 | 60.71 | 59.89 | 60.71 | 60.71 |
| Student | 50K~100K | ResNet8 | VGG11 | VGG11 | MobileNet | ResNet32 | VGG11 | VGG11 | MobileNet |
| | | 59.92 | 69.12 | 69.12 | 68.14 | 55.47 | 54.14 | 54.14 | 56.07 |
| KD [19] | 50K~100K | 53.31 | 70.88 | 67.97 | 71.86 | 54.14 | 54.40 | 49.63 | 57.85 |
| ML [3] | 50K~100K | 54.44 | 67.78 | 70.18 | 73.08 | 56.56 | 57.46 | 56.78 | 60.07 |
| AL [53] | 50K~100K | 58.36 | 69.92 | 71.13 | 71.33 | 46.02 | 46.26 | 45.60 | 51.29 |
| DKD [58] | 50K~100K | 54.28 | 67.32 | 70.10 | 72.38 | 55.99 | 55.88 | 56.52 | 59.43 |
| DAFL [8] | 0K | 42.44 | 43.78 | 48.32 | 54.10 | 38.44 | 31.93 | 34.13 | 40.93 |
| KN [37] | 10K | 48.75 | 57.83 | 55.64 | 58.49 | 48.92 | 46.99 | 45.05 | 50.22 |
| AM [51] | 10K | 50.69 | 62.17 | 63.20 | 65.58 | 47.72 | 49.26 | 47.32 | 51.54 |
| DB3KD [54] | 10K | 50.49 | 63.48 | 62.76 | 63.67 | 47.95 | 48.46 | 46.93 | 50.49 |
| MEKD (soft) | 10K | 51.87 | 64.76 | 64.83 | 67.07 | 50.87 | 51.85 | 49.95 | 54.93 |
| MEKD (hard) | 10K | 51.67 | 64.72 | 65.32 | 67.36 | 49.89 | 51.33 | 49.36 | 54.71 |

Table 8: Top-1 classification accuracy (%) of the student model on CIFAR-100 and Tiny ImageNet.

| Dataset | T - S Pairs | Data Size | KD (soft) | ML (soft) | AL (soft) | DKD (soft) | KN (soft) | AM (soft) | DB3KD (hard) | MEKD (soft) | MEKD (hard) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet-1K | RN50 - RN34 | 100K | 52.08 | 54.97 | 53.50 | 53.57 | 56.77 | 56.92 | 58.61 | 59.89 | 59.32 |
| ImageNet-1K | RX101 - RX50 | 100K | 54.90 | 56.58 | 50.88 | 55.31 | 57.43 | 55.64 | 59.90 | 61.21 | 60.54 |

Table 9: Top-1 classification accuracy (%) of the student model on ImageNet-1K. We use pretrained RN50 (76.13%) and RX101 (79.31%) as the teacher models, respectively. RN is ResNet and RX is ResNeXt.

Dataset T - S Data Size KD ML AL DKD KN AM DB3KD MEKD MEKD
Pairs (soft) (soft) (soft) (soft) (soft) (soft) (hard) (soft) (hard)
CIFAR10 T: ResNet56 0.1K 16.74 17.78 12.97 20.66 27.67 48.31 43.05 49.04 47.12
(94.15%) 1K 31.25 31.57 32.05 31.09 58.65 62.05 64.28 69.84 68.66
S: MobileNet 10K 70.90 73.06 68.61 75.44 85.07 73.65 81.67 86.85 86.53
(90.04%) 50K(full) 90.43 91.19 90.54 90.50 92.19 86.33 92.46 93.48 93.09
CIFAR100 T: ResNet56 0.1K 01.96 01.88 01.72 02.56 13.23 36.73 30.72 33.56 34.60
(72.06%) 1K 10.36 10.06 09.62 10.81 35.80 52.09 50.14 53.84 54.52
S: MobileNet 10K 44.32 48.08 40.57 47.24 58.49 65.58 63.67 67.07 67.36
(68.14%) 50K(full) 71.86 73.08 71.33 72.38 70.85 71.77 73.36 73.84 73.27
Table 10: Ablation study of data size with top-1 classification accuracy (%) of the student model on CIFAR-10 and CIFAR-100.

B. More Results

B.1. Complete Distillation Experiments

We run distillation experiments with different teacher-student pairs, using ResNet32 / ResNet56 / VGG13 / ResNet110 / ResNet50 / ResNeXt101 as teacher models and ResNet8 / ResNet32 / VGG11 / MobileNet / ResNet34 / ResNeXt50 as student models. Distillation performance is evaluated on MNIST, CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-1K, with top-1 classification accuracy as the evaluation metric. The experimental results are shown in Tab. 7, Tab. 8 and Tab. 9. Teacher and student models are trained with the same hyperparameter settings, so that student models distilled with different methods can be compared, under the same conditions, with the teacher model trained with vanilla supervised learning.
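As a reminder of how the metric reported in these tables is computed, the following is a minimal sketch of top-1 accuracy in plain Python. The function name `top1_accuracy` and the toy logits are illustrative assumptions and are not taken from the paper's code.

```python
from typing import Sequence


def top1_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return top-1 accuracy in percent: the share of samples whose
    highest-scoring class equals the ground-truth label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 samples, 3 classes (values are made up for illustration).
toy_logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.3, 0.2, 1.5],   # predicts class 2
    [0.0, 4.0, 1.0],   # predicts class 1
    [1.0, 0.9, 0.8],   # predicts class 0
]
toy_labels = [0, 2, 1, 2]
print(top1_accuracy(toy_logits, toy_labels))  # 3 of 4 predictions are correct
```

The same computation applies whether the student was distilled from soft or hard teacher responses, since only the student's predicted class enters the metric.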

We also provide complete ablation results for different data sizes on CIFAR-10 and CIFAR-100, as shown in Tab. 10. We use an effective teacher-student pair, ResNet56 - MobileNet, for these experiments. The results show that B2KD methods are generally more robust than traditional KD methods when little data is available, and that they make the most of the information in the available samples for model compression in extreme cases. Across the compared methods, MEKD achieves the best accuracy in nearly all settings, which supports the effectiveness and robustness of our proposed method.
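The comparison in Tab. 10 can be checked with a short script. The sketch below transcribes the accuracies from Tab. 10 (column order as in the table header) and reports, for each dataset and data size, the highest-scoring method and the margin of the better MEKD variant over KD. The names `METHODS`, `TABLE_10` and `best_method` are illustrative assumptions.

```python
# Column order follows the header of Tab. 10.
METHODS = ["KD", "ML", "AL", "DKD", "KN", "AM", "DB3KD", "MEKD (soft)", "MEKD (hard)"]

# Top-1 accuracy (%) of the MobileNet student, transcribed from Tab. 10.
TABLE_10 = {
    ("CIFAR-10", "0.1K"): [16.74, 17.78, 12.97, 20.66, 27.67, 48.31, 43.05, 49.04, 47.12],
    ("CIFAR-10", "1K"): [31.25, 31.57, 32.05, 31.09, 58.65, 62.05, 64.28, 69.84, 68.66],
    ("CIFAR-10", "10K"): [70.90, 73.06, 68.61, 75.44, 85.07, 73.65, 81.67, 86.85, 86.53],
    ("CIFAR-10", "50K"): [90.43, 91.19, 90.54, 90.50, 92.19, 86.33, 92.46, 93.48, 93.09],
    ("CIFAR-100", "0.1K"): [1.96, 1.88, 1.72, 2.56, 13.23, 36.73, 30.72, 33.56, 34.60],
    ("CIFAR-100", "1K"): [10.36, 10.06, 9.62, 10.81, 35.80, 52.09, 50.14, 53.84, 54.52],
    ("CIFAR-100", "10K"): [44.32, 48.08, 40.57, 47.24, 58.49, 65.58, 63.67, 67.07, 67.36],
    ("CIFAR-100", "50K"): [71.86, 73.08, 71.33, 72.38, 70.85, 71.77, 73.36, 73.84, 73.27],
}


def best_method(accs):
    """Return (method name, accuracy) for the highest accuracy in one row."""
    acc, name = max(zip(accs, METHODS))
    return name, acc


for (dataset, size), accs in TABLE_10.items():
    name, acc = best_method(accs)
    # Margin of the better MEKD variant (last two columns) over KD (first column).
    margin = max(accs[-2:]) - accs[0]
    print(f"{dataset:9s} {size:>4s}  best: {name:11s} {acc:5.2f}  MEKD - KD: {margin:+.2f}")
```

Running it shows where the margin over KD is largest, which is at the smallest data sizes, in line with the robustness observation above.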

Figure 8: t-SNE visualization of synthetic (blue) and genuine (red) images of MEKD with DCGAN on MNIST. Panels: epoch 0, epoch 10, epoch 20, epoch 30.

Figure 9: t-SNE visualization of synthetic (blue) and genuine (red) images of MEKD with DCGAN on CIFAR-10. Panels: epoch 0, epoch 50, epoch 100, epoch 200.

In all experiments, teacher and student models are trained for 350 epochs, except on MNIST, where they are trained for 12 epochs. We use Nesterov SGD with momentum 0.9 and weight decay 0.0005, and a mini-batch size of 128 images, on a single NVIDIA GeForce RTX 3090 GPU. The initial learning rate is 0.1, except 0.01 for MNIST, and we use a multi-step learning rate schedule that decays the learning rate by a factor of 0.1 at the 116th and 233rd epochs; no learning rate schedule is used for MNIST. For the training of student models, we follow the unsupervised setting and use only the soft or hard responses of the teacher models for distillation. All experiments are run three times and we report the mean accuracy.
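The learning rate schedule described above can be written out as a small function. This is a minimal sketch under two assumptions that the text does not settle: epochs are counted from 0, and the decay takes effect from the milestone epoch onward. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 0.1,
                  milestones: tuple = (116, 233), gamma: float = 0.1) -> float:
    """Multi-step schedule: multiply base_lr by gamma once for every
    milestone that has been reached (assumes 0-indexed epochs)."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Non-MNIST setting: 350 epochs, initial rate 0.1, decay at epochs 116 and 233.
for epoch in (0, 115, 116, 232, 233, 349):
    print(epoch, learning_rate(epoch))

# MNIST setting: 12 epochs, constant rate 0.01 (no milestones).
for epoch in (0, 11):
    print(epoch, learning_rate(epoch, base_lr=0.01, milestones=()))
```

In PyTorch, the same behavior would typically be obtained with `torch.optim.SGD(..., momentum=0.9, weight_decay=5e-4, nesterov=True)` combined with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[116, 233], gamma=0.1)`; whether the authors used exactly these calls is not stated in the text.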

For the training of DCGAN, we follow the hyperparameter settings of [42]. DCGAN consists of a generator built from transposed convolution layers and a discriminator built from ordinary convolution layers, which greatly reduces the number of network parameters and improves the quality of the generated images. As an extension of our method, we believe that generative models with other architectures can also be used as emulators to learn the inverse mapping of the teacher function, with an information maximization (IM) loss added to alleviate mode collapse and achieve deprivatization. We leave this to future work.
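To illustrate how a generator made of transposed convolutions and a discriminator made of ordinary convolutions mirror each other, the sketch below traces spatial sizes through both stacks using the standard output-size formulas (dilation 1, no output padding). The layer configuration (kernel 4, stride 2, padding 1, with a 64x64 image) is a commonly used DCGAN layout and is an assumption here; it is not taken from the paper or confirmed against [42].

```python
def conv_transpose_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of an ordinary convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical layer specs as (kernel, stride, padding).
GENERATOR = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
DISCRIMINATOR = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]


def trace(size: int, layers, step) -> list:
    """Apply `step` layer by layer and collect the spatial size after each one."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = step(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


gen_sizes = trace(1, GENERATOR, conv_transpose_out)   # 1x1 input map up to an image
disc_sizes = trace(64, DISCRIMINATOR, conv_out)       # image down to a 1x1 score
print("generator:", gen_sizes)
print("discriminator:", disc_sizes)
```

With this layout each stride-2 layer doubles (generator) or halves (discriminator) the spatial size, so the two networks traverse the same resolutions in opposite order.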

B.2. Visualization Results

We evaluate the training process of DCGAN in terms of whether the generated distribution is consistent with the real distribution, and visualize the synthetic and genuine images with a t-SNE projection. As shown in Fig. 8 and Fig. 9, the generated distribution moves gradually closer to the real distribution as DCGAN training proceeds. This supports the effectiveness of using DCGAN as the emulator to learn the inverse mapping of the teacher function, and indicates that DCGAN can alleviate mode collapse and generate images consistent with the distribution of real images. These synthetic images not only integrate the various patterns present in genuine images, but also serve as effective query samples to support the distillation of student models.