Aligning Logits Generatively for Principled Black-Box Knowledge Distillation

Jing Ma, Xiang Xiang^∗
School of Artificial Intelligence and Automation,
Huazhong University of Science and Tech, Wuhan, China

Ke Wang, Yuchuan Wu, Yongbin Li
DAMO Academy,
Alibaba Group, Beijing, China

Equal contribution, co-first author; also with Nat. Key Lab of MSIIPT.
Correspondence to xex@hust.edu.cn; also with Peng Cheng Lab.

Abstract

Black-Box Knowledge Distillation (B2KD) is a problem formulated for cloud-to-edge model compression, where the data and models hosted on the server are invisible. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity of data distributions. In this paper, we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically provide a new optimization direction from logits to cell boundary, different from direct logits alignment. With its guidance, we propose a new method, Mapping-Emulation KD (MEKD), that distills a black-box cumbersome model into a lightweight one. Our method treats soft and hard responses in the same way, and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, and 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points. For different teacher-student pairs, our method yields promising distillation performance on various benchmarks and outperforms the previous state-of-the-art approaches.

1 Introduction

Knowledge Distillation (KD) is a widely accepted approach to the problem of model compression and acceleration, which has received sustained attention from both the academic and industrial research communities [15, 38, 47, 17]. The goal of KD is to extract knowledge from a cumbersome model or an ensemble of models, known as the teacher, and use it as supervision to guide the training of lightweight models, known as the student [5, 39, 1]. In the application of KD, privacy protection has always been a very concerning issue for researchers and users, which not only refers to the privacy of user data but also includes the model copyright of cloud service providers.

Black-Box Knowledge Distillation (B2KD) is a problem posed in the process of cloud-to-edge model compression [37, 51, 54]. The cloud server hosts a teacher model whose internal structure and composition, connections between layers, model parameters, and gradients used for back-propagation are all invisible and unavailable to edge devices, as shown in Fig. 1. Due to resource limitations, the edge device can only host a lightweight student model. At the same time, low-quality and unlabeled local data cannot be used to train a reliable deep neural network. As a result, it must rely on sending query samples to the APIs of cloud servers for heavy inference [49].

In practice, B2KD faces some key challenges. (a) Cloud servers and edge devices should maintain limited data exchange due to Internet latency and bandwidth constraints, as well as charges for the amount of queried data or API usage time. (b) In some cases, for query samples, these APIs only provide indexes or semantic tags for the category with the highest probability (i.e., hard responses), rather than probability vectors for all possible classes (i.e., soft responses). (c) Because users refuse to send sensitive data to cloud servers, the distribution gap between local and cloud data is difficult to measure, making the distilled student model inaccurate in the application.

Figure 1: Schematic process of cloud-to-edge model compression. A cumbersome black-box model is deployed on a cloud server, trained with millions of samples and tags. The cloud server only provides APIs to receive query data and return inference responses of either soft or hard type. The edge device needs to distill a lightweight model using unlabeled local data.

Adversarial learning has been shown to be effective in generating pseudo samples, which is widely used in data augmentation and low-shot learning [8, 54]. A well-trained generator can overcome the mode collapse problem and align real and synthetic data distribution. In particular, we want to produce images relevant to training, whether or not they resemble real data [32]. Meanwhile, images generated to obtain high responses from the teacher model combine different patterns with highly generalized features instead of sample-specific idiosyncrasies [52]. Therefore, using a well-trained generator to synthesize pseudo images can automatically filter out privacy-related high-frequency information. We call this process deprivatization.

In this paper, we propose an approach to solve B2KD by mapping emulation. Our motivation is that reducing the distance between two generated images in the high-dimensional space drives alignment between the low-dimensional logits. In addition, we argue that an image contains a lot of fine-grained information, which can be treated as another type of knowledge to provide different gradient directions for updating the parameters of the student model, as shown in Fig. 4. Combining image-level loss with coarse-grained logit-level loss can effectively improve the distillation effect. According to the Kolmogorov theorem [25, 6], a sufficiently complex neural network is capable of representing an arbitrary multivariate continuous function from any dimension to another. Thus, a well-trained generator can not only emulate the inverse mapping of the teacher function (Thm. 1) but also help update the logits of a student to converge to the logits of a teacher (Thm. 2), with reasonable generalizability (Thm. 3).

In practice, we use a generative adversarial network (GAN) for deprivatization and exploit its generator as an inverse mapping of the teacher function. The generator takes as input random variables sampled from a prior distribution with the same dimensionality as the logits. The well-trained generator is then frozen and grafted behind the teacher and student models, whose output logits for the same examples are used as the inputs of the generator, as shown in Fig. 2. Experimental results show that MEKD can effectively protect the privacy of local data and models in the cloud, and it performs well under either soft or hard responses. At the same time, MEKD has robust results in the case of limited query samples and out-of-domain data.

Overall, the contributions of this paper are: 1) We formalize the problem of B2KD and provide a two-step workflow of deprivatization and distillation. 2) We theoretically provide a new optimization direction from logits to cell boundary different from direct logits alignment. 3) We propose a new method of Mapping-Emulation Knowledge Distillation (MEKD). The improved experimental performance has demonstrated the effectiveness of our approach.

2 Related Work

Knowledge Distillation (KD). Hinton et al. [19] propose an original teacher-student architecture that uses the logits of the teacher model as the knowledge. Since then, some KD methods regard knowledge as final responses to input samples [3, 31, 58], some regard knowledge as features extracted from different layers of neural networks [24, 23, 41], and some regard knowledge as relations between such layers [57, 40, 9]. The purpose of defining different types of knowledge is to efficiently extract the underlying representation learned by the teacher model from the large-scale data. If we consider a network as a mapping function of input distribution to output, then different knowledge types help to approximate such a function. Based on the type of knowledge transferred, KD can be divided into response-based, feature-based, and relation-based [15]. The first two aim to drive the student to mimic the responses of the output layer or the feature maps of the hidden layers of the teacher, and the last approach uses the relationships between the teacher’s different layers to guide the training of the student model. Feature-based and relation-based methods [24, 57], depending on the model utilized, may leak the information of structures and parameters through the intermediate layers’ data. For example, we can reconstruct a ResNet [18] based on the feature dimensions of different layers, and calculate each neuron’s parameter using specific images and their responses in the feature maps.

Black-Box Knowledge Distillation (B2KD). Response-based KD methods [19, 58, 3] have the natural property of hiding models. Hinton et al. [19] use Kullback-Leibler Divergence (KLD) between the softened logits of teacher and student models as the loss to align the output distribution, and Zhao et al. [58] decouple the KLD into two uncorrelated losses and combine them by weighted summation. These calculations do not take into account the details of the teacher model, which is exactly a black box. The recently proposed approaches for B2KD also address the issue of hiding the teacher model deployed in the cloud server [37, 51, 54]. Orekondy et al. [37] use a reinforcement learning approach to improve query sample efficiency. Wang et al. [51] blend mixup and active learning to augment the few unlabeled images and choose hard examples for distillation. Wang [54] proposes a decision-based black-box model and constructs the soft label for each training sample by computing its distances to the decision boundaries of the teacher model. These existing approaches partially address the challenges of cloud-to-edge black-box model distillation, but none of them takes into account the privacy leak of user data when sending original local images to the cloud.

Generative Adversarial Networks (GANs) have the capacity to handle sharp estimated density functions and generate realistic-looking images efficiently. A typical GAN [14] comprises a discriminator distinguishing real images and generated images, and a generator synthesizing images to fool the discriminator. GANs are divided into architecture-variant and loss-variant. The former focuses on network architectures [42, 7] or latent space [34, 10], e.g., some specific architectures are proposed for specific tasks [59, 22]. The latter utilizes different loss types and regularization tools [2, 16] to enable more stable learning.

Adversarial Distillation (AD) exploits adversarial architecture to help the teacher and student model have a better understanding of the real data distribution [15, 55, 53, 4, 44]. The methods of AD can be divided into two types according to the generator-discriminator architecture, as shown in Fig. 2: (a) the generator is used to synthesize images that obey a real distribution, and these images are used to help distill models [8, 53]; (b) the teacher and student models are regarded as generators and another discriminator is grafted behind them to judge whether the distribution of features or logits is consistent [55, 56]. AD has also been employed for low-shot knowledge distillation and achieved promising results [8]. Our method provides an alternative adversarial architecture, which utilizes a well-trained generator to guide the alignment between the outputs of models.

3 Theory for Mapping-Emulation KD

First, we propose two definitions. Def. 1 states that two functions that map the same data distribution $\mu$ to the same latent distribution $\upsilon$ are equivalent. The ideal state of KD is to obtain a student function $f_{S}$ that is equivalent to the teacher function $f_{T}$ . Def. 2 states that the mapping function of a generator $G$ , which can map a prior distribution $p$ to data manifold $\Sigma$ and guarantee that the generated image distribution $\mu^{\prime}$ is the same as the real image distribution $\mu$ , is considered to be the inverse mapping of the teacher function, i.e. $f_{G}=f_{T}^{-1}$ . We call such a generator well-trained. The mapping relationships are shown in Fig. 3.

Definition 1.

(Function Equivalence) Given the student and teacher models $f_{S}$ and $f_{T}$ , let a data distribution $\mu\in\mathcal{X}$ in image space be mapped to $\mathbb{P}_{S}\in\mathcal{Y}$ and $\mathbb{P}_{T}\in\mathcal{Y}$ in latent space, respectively. If the Wasserstein distance between $\mathbb{P}_{S}$ and $\mathbb{P}_{T}$ equals zero,

$$W(\mathbb{P}_{S},\mathbb{P}_{T})=\inf_{\gamma\in\Pi(\mathbb{P}_{S},\mathbb{P}_{T})}\mathbb{E}_{(y_{S},y_{T})\sim\gamma}\left[\,\|y_{S}-y_{T}\|\,\right]=0, \tag{1}$$

then the student and teacher models are equivalent, i.e., $f_{S}=f_{T}$ , where $\Pi(\mathbb{P}_{S},\mathbb{P}_{T})$ is the set of all joint distributions $\gamma(y_{S},y_{T})$ whose marginals are $\mathbb{P}_{S}$ and $\mathbb{P}_{T}$ , respectively.

Definition 2.

(Inverse Mapping) Given a prior distribution $p\in\mathbb{R}^{C}$ and a data distribution $\mu\in\mathbb{R}^{n}$ , if the Wasserstein distance between the generated distribution $\mu^{\prime}=(f_{G})_{\#}p$ and $\mu$ equals zero,

$$W(\mu^{\prime},\mu)=\inf_{\gamma\in\Pi(\mu^{\prime},\mu)}\mathbb{E}_{(x^{\prime},x)\sim\gamma}\left[\,\|x^{\prime}-x\|\,\right]=0, \tag{2}$$

then the generator $f_{G}:\mathbb{R}^{C}\rightarrow\mathbb{R}^{n}$ is the inverse mapping of the teacher function $f_{T}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{C}$ , denoted as $f_{G}=f_{T}^{-1}$ , where $\Pi(\mu^{\prime},\mu)$ is the set of all joint distributions $\gamma(x^{\prime},x)$ whose marginals are $\mu^{\prime}$ and $\mu$ , respectively.

Fixing a decoding map $f_{G}$ for a well-trained generator $G$ , the latent space $\mathcal{Z}$ is partitioned as

$$\mathcal{D}(f_{G}):\mathcal{Z}=\bigcup_{\alpha}U_{\alpha}, \tag{3}$$

where $\mathcal{D}(f_{G})$ is called the decomposition induced by the decoding map $f_{G}$ [29], and $\{U_{\alpha}\}$ are called cells. As shown in Fig. 4, $f_{G}$ maps a cell decomposition in the latent space $\mathcal{D}(f_{G})$ to a cell decomposition in the image space $\frac{1}{n}\sum_{i}\delta_{x^{(i)}}$ . Each cell $U_{\alpha}$ is mapped to a sample $\delta_{x^{(i)}}$ by the decoding map $f_{G}$ [28]. In other words, $f_{G}$ pushes the prior distribution $p$ to the exact empirical distribution,

$$(f_{G})_{\#}p=\frac{1}{n}\sum_{i}\delta_{x^{(i)}}. \tag{4}$$

Theorem 1.

(Empirical Approximation) For any $0<\epsilon<1/2$ and any integer $m>4$ , let $g:\mathbb{R}^{C}\rightarrow\mathbb{R}^{n}$ be the mapping function of generator $G$ with $n\leq\frac{20\log m}{\epsilon^{2}}$ . For two sets $V_{S}=\{y_{S}:y_{S}\in\mathbb{P}_{S}\}$ and $V_{T}=\{y_{T}:y_{T}\in\mathbb{P}_{T}\}$ , both of which have $m$ points in $\mathbb{R}^{C}$ , if the empirical Wasserstein distance between $g(V_{S})$ and $g(V_{T})$ equals zero,

$$\hat{W}(g(V_{S}),g(V_{T}))=\frac{1}{m}\sum_{i=1}^{m}\|g(y_{S}^{i})-g(y_{T}^{i})\|=0, \tag{5}$$

then $W(\mathbb{P}_{S},\mathbb{P}_{T})=0$ .

Thm. 1 (see Appendix for proof) provides a method to approximate the expected Wasserstein distance $W(\mathbb{P}_{S},\mathbb{P}_{T})$ using the empirical Wasserstein distance $\hat{W}(g(V_{S}),g(V_{T}))$ . By reducing the distance between points $g(y_{S}^{i})$ and $g(y_{T}^{i})$ in high-dimensional space, an optimization direction $\nabla\mathcal{L}_{F}$ different from $\nabla\mathcal{L}_{KL}$ is produced for logits $y_{S}^{i}$ and $y_{T}^{i}$ in low-dimensional space. The gradient update causes $y_{S}^{i}$ to move towards the boundary of the cell in which $y_{T}^{i}$ resides, as shown in Fig. 4.
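As a minimal sketch of the empirical distance in Eqn. 5, the snippet below averages the distances between index-paired points after both logit sets have been passed through a generator. The `toy_generator` (a fixed linear map followed by `tanh`) and its weights are assumptions made purely for illustration; they stand in for a trained $g$ and are not the generator used in this paper.

```python
import math


def toy_generator(y, weights):
    """Hypothetical stand-in for g: R^C -> R^n (linear map followed by tanh)."""
    return [math.tanh(sum(w * v for w, v in zip(row, y))) for row in weights]


def l2_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def empirical_wasserstein(student_logits, teacher_logits, weights):
    """(1/m) * sum_i ||g(y_S^i) - g(y_T^i)|| over index-paired logits (Eqn. 5)."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("both sets must contain the same number of points")
    total = 0.0
    for y_s, y_t in zip(student_logits, teacher_logits):
        total += l2_distance(toy_generator(y_s, weights),
                             toy_generator(y_t, weights))
    return total / len(student_logits)
```

The pairing by index reflects that $y_{S}^{i}$ and $y_{T}^{i}$ are the two models' logits for the same input, so no optimal-transport matching has to be solved in this sketch.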

Theorem 2.

(Optimization Direction) Let $\mu\in\mathcal{X}$ be any distribution. $f_{S},f_{T},f_{G}$ are the mapping functions of the student, teacher, and generator, respectively. $f_{S}$ is parameterized by $\theta_{S}\in\Theta_{S}$ . Then, when

$$\min_{\theta_{S}\in\Theta_{S}}\mathbb{E}_{x\sim\mu}\left[\|f_{G}\circ f_{S}(x)-f_{G}\circ f_{T}(x)\|\right]\rightarrow 0, \tag{6}$$

it holds that $f_{S}\rightarrow f_{T}$ , and we have

$$\begin{aligned}\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[f_{S}(x)]&=\nabla_{\theta_{S}}W(\mathbb{P}_{S},\mathbb{P}_{T})\\ &=\mathbb{E}_{x\sim\mu}[\nabla_{\theta_{S}}\|f_{G}\circ f_{S}(x)-f_{G}\circ f_{T}(x)\|].\end{aligned} \tag{7}$$

Thus, to achieve $f_{S}\rightarrow f_{T}$ , it is sufficient to optimize $\mathbb{E}_{x\sim\mu}\left[\|f_{G}\circ f_{S}(x)-f_{G}\circ f_{T}(x)\|\right]$ in the parameter space $\Theta_{S}$ . The global gradient of parameter $\theta_{S}$ can be replaced by the gradient calculated on the empirical distance of high-dimensional image points (see Appendix for proof).
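To illustrate why an image-level distance can yield a different update direction for the logits than a distance measured directly on the logits, the sketch below compares the two gradients in closed form. It assumes a hypothetical linear generator $G(y)=Ay$ and an $\mathcal{L}_{2}$ distance, and it uses a plain logit-space $\mathcal{L}_{2}$ distance as a simple stand-in for direct alignment (the paper contrasts against $\nabla\mathcal{L}_{KL}$); none of these choices come from the paper.

```python
import math


def matvec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def grad_image_level(y_s, y_t, a):
    """Gradient of ||A y_s - A y_t||_2 w.r.t. y_s, i.e. A^T A d / ||A d||.

    Assumes A d != 0, where d = y_s - y_t.
    """
    d = [s - t for s, t in zip(y_s, y_t)]
    ad = matvec(a, d)
    scale = norm(ad)
    return [g / scale for g in matvec(transpose(a), ad)]


def grad_logit_level(y_s, y_t):
    """Gradient of ||y_s - y_t||_2 w.r.t. y_s, i.e. d / ||d|| (assumes d != 0)."""
    d = [s - t for s, t in zip(y_s, y_t)]
    scale = norm(d)
    return [x / scale for x in d]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
```

With `A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]`, `y_s = [1.0, 0.0]` and `y_t = [0.0, 1.0]`, the cosine between the two gradients is $5/\sqrt{34}$, so both reduce the gap but point in different directions.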

Theorem 3.

(Generalization Bound) Let $H\subseteq\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ be a hypothesis set for $C$ -way classification task. For any $0<\epsilon<1/2$ and a sample $S$ of size $m>4$ drawn according to $\mu$ , let $g:\mathbb{R}^{C}\rightarrow\mathbb{R}^{n}$ be a mapping function of generator $G$ with $n\leq\frac{20\log m}{\epsilon^{2}}$ . Fix $\rho>0$ , for any $1>\delta>0$ , with probability at least $1-\delta$ , the following holds for all $h\in H$ ,

$$R(h)\leq\hat{R}_{\rho}(h)+\frac{2C^{2}}{\rho(1-\epsilon)}\sqrt{\frac{r^{2}\Lambda^{2}}{m}}+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{8}$$

Here $\Lambda\geq 0$ is such that $(\sum_{y=1}^{C}\|h(x,y)\|^{p})^{1/p}\leq\Lambda$ for any $x\in\mathcal{X}$ and any $p\geq 1$ , and $r>0$ is such that $K(x,x)\leq r^{2}$ for any $x\in\mathcal{X}$ , where the kernel $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is positive definite symmetric.

Thm. 3 (see Appendix for proof) gives the generalization bound of aligning low-dimensional logits by reducing the distance of high-dimensional image points, which guarantees generalizability to unseen samples.

4 Algorithm of Mapping-Emulation KD

Hinton et al. [19] propose a simple but effective KD method that uses the softened logits of the teacher model as a supervision to guide student training. They use the Kullback-Leibler Divergence (KLD) to measure the discrepancy between the logits of the two models, where the student model is trained to minimize the gap in the hope of achieving the same output. The loss is defined as

$$\begin{aligned}\mathcal{L}_{KL}&=\mathcal{KL}[p(c|\mathbf{x}_{i};\theta_{T})\,\|\,p(c|\mathbf{x}_{i};\theta_{S})]\\ &=\frac{1}{N}\sum_{i}^{N}\sum_{c}^{C}p(c|\mathbf{x}_{i};\theta_{T})\log\left[\frac{p(c|\mathbf{x}_{i};\theta_{T})}{p(c|\mathbf{x}_{i};\theta_{S})}\right],\end{aligned} \tag{9}$$

where $i$ is the sample index and $N$ is the number of samples. Regardless of the method used, the essence of KD is to learn the mapping function of the teacher model from input to output, i.e., $f_{T}$ . However, it is hard to deduce the mapping function from the existing parameters of the teacher model. One can only guess the mapping process by using the responses to the input samples of different network layers or the relations between features and treat them as knowledge to guide the training of the student model [57]. However, in the black-box KD problem, the internal responses or relations between layers of the teacher model are not available, which makes effective distillation more challenging.
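The loss in Eqn. 9 can be sketched in a few lines of plain Python. The `temperature` argument is an assumption added to reflect the "softened logits" mentioned above; with the default of 1.0 the function reduces to the formula as written. The sketch also assumes moderate logit values, so that no student probability underflows to zero.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one logit vector."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean over N samples of sum_c p_T(c|x_i) * log(p_T(c|x_i) / p_S(c|x_i))."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_t = softmax(t_row, temperature)
        p_s = softmax(s_row, temperature)
        # Terms with p_T = 0 contribute nothing to the divergence.
        total += sum(pt * math.log(pt / ps)
                     for pt, ps in zip(p_t, p_s) if pt > 0.0)
    return total / len(teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, which is why it only needs the teacher's soft outputs and no internal features.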

Deprivatization. For a $C$ -way classification problem, we first train a GAN using the random noise variable $z$ sampled from the prior distribution $p$ in latent space $\mathcal{Y}$ as input. Note that the dimensionality of $z$ is the same as the output logits of the teacher model, i.e. $|z|=C$ . The generator $G$ uses noise $z$ to synthesize images, and the discriminator $D$ minimizes the Wasserstein distance between the generated $\mu^{\prime}$ and the real distribution $\mu$ . The synthetic privacy-free images are simultaneously sent to the cloud server for inference responses, which can be soft (probability vectors for all possible classes) or hard (indexes or semantic tags for the category with the highest probability). We expect the synthetic images to match the high responses of the teacher model so that they can maximize the containment of patterns in real data. We adopt the information maximization (IM) loss [21, 45], which is formulated as

$$\mathcal{L}_{IM}=-\frac{1}{m}\sum_{i=1}^{m}\hat{y}^{(i)}_{t}\log\left(D\left(G\left(z^{(i)}\right)\right)\right), \tag{10}$$

where $\hat{y}^{(i)}_{t}=\max_{c\in C}T\left(G\left(z^{(i)}\right)\right)$ for $0.0\leq\hat{y}^{(i)}_{t}\leq 1.0$ in soft responses and $\hat{y}^{(i)}_{t}=1.0$ in hard responses.
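A minimal sketch of Eqn. 10 follows, covering both response types. It assumes that a soft response arrives as a probability vector, that a hard response arrives as a class index (so its weight is fixed to 1.0), and that the discriminator scores $D(G(z^{(i)}))$ lie in $(0,1]$; the function names are hypothetical.

```python
import math


def teacher_confidence(response, hard):
    """Weight y_hat_t: top-class probability (soft) or 1.0 (hard)."""
    return 1.0 if hard else max(response)


def im_loss(teacher_responses, disc_scores, hard=False):
    """-(1/m) * sum_i y_hat_t^(i) * log D(G(z^(i))) as in Eqn. 10."""
    total = 0.0
    for response, score in zip(teacher_responses, disc_scores):
        total += teacher_confidence(response, hard) * math.log(score)
    return -total / len(disc_scores)
```

For example, `im_loss([[0.9, 0.1], [0.6, 0.4]], [0.5, 0.5])` weights the two samples by 0.9 and 0.6, whereas `im_loss([0, 1], [0.5, 0.5], hard=True)` weights both by 1.0.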

Suppose training reaches the point where the discriminator can no longer separate synthetic images from genuine ones. In this case, the resulting generator represents a function from the latent space to the image space, defined as $f_{G}:\mathcal{Y}\rightarrow\mathcal{X}$ , which acts as an inverse mapping of the teacher function. Note that the generator and the discriminator are trained simultaneously: we adjust parameters for the generator to minimize $\log(1-D(G(z)))$ and adjust parameters for the discriminator to maximize $\log D(x)$ . Their loss functions are

$$\mathcal{L}_{D}=-\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right)+\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right)\right], \tag{11}$$

$$\mathcal{L}_{G}=\frac{1}{m}\sum_{i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right). \tag{12}$$

We introduce a trade-off hyperparameter $\alpha$ to balance $\mathcal{L}_{GAN}=\mathcal{L}_{G}+\mathcal{L}_{D}$ and $\mathcal{L}_{IM}$ , and all the losses in the first step of deprivatization constitute

$$\mathcal{L}_{Dp}=(\mathcal{L}_{G}+\mathcal{L}_{D})+\alpha\mathcal{L}_{IM}. \tag{13}$$
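The deprivatization objectives can be sketched as below, operating on discriminator scores only. The generator term is written to follow the sentence above, i.e. it is the mean of $\log(1-D(G(z)))$ that the generator minimizes; this sign convention, the helper names, and the assumption that all scores lie strictly inside $(0,1)$ are choices made for this sketch rather than details confirmed by the paper.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """L_D (Eqn. 11): -(1/m) * sum [log D(x) + log(1 - D(G(z)))]."""
    return -mean(math.log(r) + math.log(1.0 - f)
                 for r, f in zip(d_real, d_fake))


def generator_loss(d_fake):
    """Mean of log(1 - D(G(z))), the quantity the generator minimizes."""
    return mean(math.log(1.0 - f) for f in d_fake)


def generator_objective(d_fake, confidences, alpha):
    """L_G + alpha * L_IM, the generator update listed in Alg. 1."""
    l_im = -mean(c * math.log(f) for c, f in zip(confidences, d_fake))
    return generator_loss(d_fake) + alpha * l_im
```

Both generator terms decrease as $D(G(z))$ approaches 1, so the information-maximization term reinforces the adversarial term in proportion to the teacher's confidence.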

Algorithm 1 MEKD optimization algorithm.

Input: Pre-trained teacher $T(x;\theta_{T})$ deployed in the cloud server, randomly initialized student $S(x;\theta_{S})$ and local dataset $X$ hosted in the edge device.
Output: An optimized student $S(x;\theta_{S})$ on dataset $X$ .

1: $\triangleright$ Step 1: Deprivatization
2: Initialize a generator $G(z;\theta_{G})$ and a discriminator $D(x;\theta_{D})$ , and ensure the dimensionality of $z$ equals the category count $C$
3: repeat
4:     Sample a batch of noises $\mathcal{Z}$ from a prior distribution $p$ and synthesize images $\mathcal{X}^{\prime}=G(\mathcal{Z})$
5:     Send $\mathcal{X}^{\prime}$ to $T$ in the cloud to get soft or hard inference responses $\hat{\mathcal{Y}}^{\prime}_{t}=T(\mathcal{X}^{\prime})$
6:     Sample a batch of examples $\mathcal{X}$ from dataset $X$
7:     Update the discriminator $D$ to distinguish $\mathcal{X}$ and $\mathcal{X}^{\prime}$ using $\mathcal{L}_{D}$ from Eqn. 11
8:     Update the generator $G$ to fool the discriminator $D$ using $\mathcal{L}_{G}+\alpha\mathcal{L}_{IM}$ from Eqn. 10 and Eqn. 12
9: until convergence
10: $\triangleright$ Step 2: Distillation
11: Initialize the student $S$ and freeze the generator $G$
12: repeat
13:     Sample a batch of noises $\mathcal{Z}$ from a prior distribution $p$ and synthesize images $\mathcal{X}^{\prime}=G(\mathcal{Z})$
14:     Send $\mathcal{X}^{\prime}$ to $T$ in the cloud to get soft or hard inference responses $\hat{\mathcal{Y}}^{\prime}_{t}=T(\mathcal{X}^{\prime})$
15:     Update the student $S$ using $\mathcal{L}_{Dt}$ from Eqn. 14
16: until convergence
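The control flow of Alg. 1 can be outlined as follows, with every component injected as a callable. This is only a skeleton under stated assumptions: the function and parameter names are hypothetical, fixed step counts replace the "until convergence" conditions, sampling of real batches is assumed to happen inside `update_gan`, and the returned query count is added here to make the cost of teacher API calls visible.

```python
def run_mekd(sample_noise, generator, query_teacher, update_gan,
             update_student, gan_steps, distill_steps):
    """Run the two MEKD steps and return the number of teacher queries made."""
    queries = 0

    # Step 1: deprivatization. G and D are trained with teacher feedback.
    for _ in range(gan_steps):
        noise = sample_noise()
        synthetic = [generator(z) for z in noise]
        responses = [query_teacher(x) for x in synthetic]
        queries += len(synthetic)
        update_gan(noise, synthetic, responses)

    # Step 2: distillation. G is treated as frozen; only the student changes.
    for _ in range(distill_steps):
        noise = sample_noise()
        synthetic = [generator(z) for z in noise]
        responses = [query_teacher(x) for x in synthetic]
        queries += len(synthetic)
        update_student(synthetic, responses)

    return queries
```

Only synthetic images ever reach `query_teacher` in this outline, which mirrors the point that local data are not sent to the cloud.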

Method	Data Size	MNIST	MNIST	CIFAR-10	CIFAR-10	CIFAR-100	CIFAR-100	Tiny ImageNet	Tiny ImageNet
Teacher (architecture)	50K $\sim$ 100K	ResNet32	VGG13	ResNet56	ResNet56	VGG13	ResNet56	ResNet110	ResNet110
Teacher (accuracy)	50K $\sim$ 100K	99.50	99.52	94.15	94.15	74.68	72.06	60.71	60.71
Student (architecture)	50K $\sim$ 100K	ResNet8	VGG11	ResNet8	VGG11	VGG11	VGG11	ResNet32	MobileNet
Student (accuracy)	50K $\sim$ 100K	99.24	99.41	87.74	91.81	69.12	69.12	55.47	56.07
KD [19]	50K $\sim$ 100K	99.33	99.44	86.58	82.25	70.88	67.97	54.14	57.85
ML [3]	50K $\sim$ 100K	99.49	99.40	87.89	91.91	67.78	70.18	56.56	60.07
AL [53]	50K $\sim$ 100K	99.37	99.26	87.25	91.97	69.92	71.13	46.02	51.29
DKD [58]	50K $\sim$ 100K	99.33	99.43	86.61	92.42	67.32	70.10	55.99	59.43
DAFL [8]	0K	96.42	97.00	60.67	66.03	43.78	48.32	38.44	40.93
KN [37]	10K	98.61	98.81	80.62	82.41	57.83	55.64	48.92	50.22
AM [51]	10K	99.33	99.47	74.89	74.26	62.17	63.20	47.72	51.54
DB3KD [54]	10K	98.94	99.16	78.47	85.84	63.48	62.76	47.95	50.49
MEKD (soft)	10K	99.40	99.43	85.36	87.27	64.76	64.83	50.87	54.93
MEKD (hard)	10K	99.40	99.45	84.45	87.25	64.72	65.32	49.89	54.71

Table 1: Top-1 classification accuracy (%) of the student model on MNIST, CIFAR-10, CIFAR-100 and Tiny ImageNet.

Distillation. The well-trained generator $G$ contains the knowledge that the teacher uses to make inferences. It is equivalent to a teacher assistant transferring the teacher’s knowledge to the student. Fig. 2 illustrates the architecture of MEKD. We freeze the generator and graft it behind the teacher and student models in the same way, using the softened logits of both models as the generator input. A batch of synthetic images $X^{\prime}=\{{x^{\prime}}^{(i)}=f_{G}(z^{(i)})\}_{i=1}^{m}$ is fed into the embedded network, which simultaneously outputs high-dimensional points in the same image space. The distance between the high-dimensional points produced from the teacher's logits, $X_{T}^{\prime\prime}=f_{G}\circ f_{T}(X^{\prime})$ , and those produced from the student's logits, $X_{S}^{\prime\prime}=f_{G}\circ f_{S}(X^{\prime})$ , is measured by $\mathcal{L}_{F}=\mathbbm{d}(X_{S}^{\prime\prime},X_{T}^{\prime\prime})$ . We minimize the distance $\mathcal{L}_{F}$ to drive the student model to mimic the output logits of the teacher model, and use the $\mathcal{L}_{1}$ -norm ( $F=1$ ) of the difference between $X_{S}^{\prime\prime}$ and $X_{T}^{\prime\prime}$ as the loss function to distill the student,

$$\begin{aligned}\mathcal{L}_{Dt}&=\frac{1}{m}\sum_{i=1}^{m}\left\|G\left(S\left({x^{\prime}}^{(i)}\right)/\tau\right)-G\left(T\left({x^{\prime}}^{(i)}\right)/\tau\right)\right\|_{F}\\ &\quad+\beta\frac{1}{m}\sum_{i=1}^{m}T\left({x^{\prime}}^{(i)}\right)\log\left(\frac{T\left({x^{\prime}}^{(i)}\right)}{S\left({x^{\prime}}^{(i)}\right)}\right),\end{aligned} \tag{14}$$

where query sample ${x^{\prime}}^{(i)}=G(z^{(i)})$ is generated from noise $z^{(i)}$ and temperature $\tau$ is used to soften the output logits. In our experiments, we found that the $\mathcal{L}_{2}$ -norm has a similar effect to the $\mathcal{L}_{1}$ -norm (see Tab. 5).

We also add logit-level knowledge (Eqn. 9) to induce distillation and use a hyperparameter $\beta$ to balance these two losses. Unlike most KD methods, we do not use cross-entropy loss with ground-truth labels, because such labels are unavailable on edge devices. The full procedure is summarized in Alg. 1.
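The two terms of Eqn. 14 can be sketched as below. Several details are assumptions made for this sketch rather than statements about the released implementation: the generator is passed in as an arbitrary callable, `toy_generator` is a hypothetical linear-plus-`tanh` stand-in, teacher and student outputs are treated as logit vectors, and the logit-level term applies a softmax to them before computing the divergence.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_generator(z, weights):
    """Hypothetical frozen generator: linear map followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, z))) for row in weights]


def mekd_distillation_loss(student_logits, teacher_logits, generator,
                           tau=1.0, beta=1.0):
    """Image-level L1 term plus beta times the logit-level KL term (Eqn. 14)."""
    m = len(student_logits)
    image_term = 0.0
    logit_term = 0.0
    for y_s, y_t in zip(student_logits, teacher_logits):
        # Softened logits of both models are fed to the same frozen generator.
        img_s = generator([v / tau for v in y_s])
        img_t = generator([v / tau for v in y_t])
        image_term += sum(abs(a - b) for a, b in zip(img_s, img_t))  # F = 1
        p_s, p_t = softmax(y_s), softmax(y_t)
        logit_term += sum(pt * math.log(pt / ps)
                          for pt, ps in zip(p_t, p_s) if pt > 0.0)
    return image_term / m + beta * logit_term / m
```

Setting `beta=0.0` isolates the image-level term, which is the configuration reported in the first row of the $\beta$ ablation.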

5 Experiments

5.1 Experiment Setup

In this section, we compare our method with response-based KD and black-box KD methods in an unsupervised environment. Experimental results show that when the cross-entropy loss based on ground-truth labels is removed, the distillation performance of these methods decreases.

Datasets Setup. We conduct experiments on MNIST [27], CIFAR [26], Tiny ImageNet [11], and ImageNet-1K [11], all of which are widely used for image classification. While training B2KD methods, we randomly select $10K$ images ( $100K$ for ImageNet-1K) from the training set, and all images in the test set (val set for ImageNet) are used as the benchmark to calculate accuracy. For other approaches, except DAFL [8] based on zero-shot learning, we use the whole training set. We mainly use top-1 classification accuracy as an evaluation metric to assess the distillation effect. To make a fair and intuitive comparison, we follow the same setup as previous B2KD methods in our main experiments. However, we find that the original settings in the B2KD experiments do not represent the challenges raised in practical applications, so we add extended experiments in Sec. 5.4 to illustrate the practicability of our proposed method.

Implementation. See also the project page (https://github.com/HAIV-Lab/MEKD). We use ResNet [18], VGG [46] and MobileNet [20] as the backbone, and adopt standard data augmentation techniques (random crop and horizontal flip) and an SGD optimizer in all experiments. We consistently train the teacher and student model for $350$ epochs, except for $12$ epochs for MNIST, and we adopt a multi-step LR scheduler following the paper [23]. After training the teacher, we train a DCGAN [42] with Gaussian noise whose dimensionality equals the category count. The output logits of teacher or student for samples in the same class follow a Gaussian distribution, and the logits center is the mean of the Gaussian. Since the conversion between different Gaussian distributions is a linear process, using Gaussian as the prior distribution $p$ provides a smooth dual space for the student’s logits update.

Competing Methods. To verify the effectiveness of our method, we compare against several response-based KD and black-box KD methods. We select KD [19] proposed by Hinton et al. and ML [3] proposed by Ba and Caruana as the baselines, and we also compare with the recently published DKD [58] based on decoupled KLD. For the two GAN-based KD frameworks summarized in Sec. 2, we choose AL [53] and DAFL [8] as comparison methods. We further compare with black-box KD methods such as KN [37], AM [51] and DB3KD [54]. Of these methods, DB3KD and MEKD (hard) only utilize hard responses, while the other methods are based on soft responses.

T-S Pair	KD (soft)	AL (soft)	AM (soft)	DB3KD (hard)	MEKD (soft)
RN50 - RN34	52.08	53.50	56.92	58.61	59.89
RX101 - RX50	54.90	50.88	55.64	59.90	61.21

Table 2: Top-1 classification accuracy (%) of the student model on ImageNet-1K with data size $100K$ . We use the pre-trained ResNet50 (76.13%) and ResNeXt101 (79.32%) as teachers.

5.2 Performance Evaluation

On MNIST, CIFAR, and Tiny ImageNet, we use ResNet32/56/110 and VGG13 as the teacher model and use ResNet8/32, VGG11, and MobileNet as the student model. We compare the top-1 classification accuracy (ACC) of different teacher-student pairs; the results are shown in Tab. 1.

On relatively easy tasks, such as MNIST and CIFAR-10, our proposed method has only a small gap compared to response-based KD methods that use the full training set. This is meaningful for cloud-to-edge model compression because edge devices rarely have the capacity to store more than ten thousand samples.

CIFAR-100 and Tiny ImageNet are more challenging. These tasks contain far more patterns than MNIST and CIFAR-10, and data distributions are so complex that it is difficult for a generator to capture all the patterns. However, as long as the mode collapse problem can be mitigated, it is possible to synthesize complex samples beneficial to distillation, so we exploit DCGAN [42] as our generator. DCGAN has a more stable training process and is more suitable for generating RGB images than a fully-connected GAN [42]. Experimental results show that MEKD can obtain an accuracy improvement of $5\%\sim 10\%$ compared to other B2KD methods, and the accuracy of MEKD with soft or hard responses is similar, with a difference of less than $1\%$ .

We also conduct experiments on large-scale datasets and sophisticated networks. On ImageNet-1K, we use two teacher-student (T-S) pairs of ResNet50 (RN50) - ResNet34 (RN34) and ResNeXt101 (RX101) - ResNeXt50 (RX50). All methods are trained using a subset of $100K$ samples. The experimental results are shown in Tab. 2.

Uniformly, we set the number of query samples to $50K$ on CIFAR and MNIST, $300K$ on ImageNet, and discuss the performance impact of limited query samples in Sec. 5.4.

Data Size	0.1K	1K	10K	50K (full)
KD [19]	16.74	31.25	70.90	90.43
AL [53]	12.97	32.05	68.61	90.54
AM [51]	48.31	62.05	73.65	86.33
DB3KD [54]	43.05	64.28	81.67	92.46
MEKD (soft)	49.04	69.84	86.85	93.48
MEKD (hard)	47.12	68.66	86.53	93.09

Table 3: Ablation study of data size on CIFAR-10. We use the T-S pair of ResNet56 - MobileNet, and the full training set is $50K$ .

5.3 Ablation Study

We choose an effective T-S pair [35] of ResNet56 - MobileNet for ablation studies unless otherwise stated.

Ablation Study of Data Size. We explore the performance with different data sizes; the results are shown in Tab. 3. In general, B2KD methods are more robust to small data sizes than traditional KD methods, and among them MEKD achieves the highest distillation performance.

Ablation Study of Deprivatization. The hyperparameter $\alpha$ balances $\mathcal{L}_{GAN}$ and $\mathcal{L}_{IM}$. $\mathcal{L}_{IM}$ is used to maximize the responses of the teacher to the generated samples, so training the generator with or without $\mathcal{L}_{IM}$ affects the quality of the synthetic images. Fig. 5 (a) shows real images of CIFAR-10. Fig. 5 (b) shows synthetic images with $\alpha=0.5$ and Fig. 5 (c) shows synthetic images without $\mathcal{L}_{IM}$ (i.e. $\alpha=0$), both using the same noise vectors. The ResNet56 teacher gives responses in the range $0.72\sim 0.96$ for the synthetic images in Fig. 5 (b) and $0.41\sim 0.87$ for those in Fig. 5 (c). The effect of $\alpha$ is also reported in Tab. 4, which shows that using $\mathcal{L}_{IM}$ improves distillation performance.

$\alpha$	Soft response	Hard response
0.0	60.87	61.31
0.1	66.06	66.86
0.5	67.07	67.36
1.0	67.01	67.11

$\beta$	Soft response	Hard response
0.0	56.28	56.23
0.1	64.13	65.60
0.5	66.79	67.01
1.0	67.07	67.36

Table 4: Ablation study of hyperparameters $\alpha$ and $\beta$ on CIFAR-100. We use the T-S pair of ResNet56 - MobileNet.

Ablation Study of Distillation. In Eqn. 4, the $\beta$ is a trade-off hyperparameter to balance $\mathcal{L}_{F}$ and $\mathcal{L}_{KL}$ , which provide different gradient directions for $\theta_{S}$ . As shown in Tab. 4, the distillation performance can be improved by introducing KLD as an additional loss function.

The temperature $\tau$ is another important hyperparameter for MEKD since it softens the output logits of both the teacher and student models. The results are shown in Fig. 6. Its validity comes from the fact that softened logits have a higher probability of being sampled under a standard normal distribution. Since GANs use a standard Gaussian distribution as input, samples generated from out-of-distribution noise with low sampling probability are usually fuzzy and contain few patterns [43], which makes them meaningless for distillation. Meanwhile, a high value of $\tau$ reduces the discrepancy between softened logits, and $\mathcal{L}_{F}=0$ when they lie in the same cell. This reduces the performance of distillation, especially for challenging tasks such as ImageNet.
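The trade-off described for $\tau$ can be illustrated numerically. The following minimal sketch assumes that softening means a softmax over logits divided by $\tau$; the logit values are hypothetical and are not taken from our experiments. It shows that, for this example, the gap between the softened teacher and student responses shrinks as $\tau$ grows, which is the loss of discrepancy discussed above.

```python
import math

def softened(logits, tau):
    """Softmax of logits / tau, computed in a numerically stable way."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def l1_gap(p, q):
    """L1 distance between two response vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical teacher and student logits for one sample (illustrative only).
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [3.0, 2.5, 1.0]

gaps = {
    tau: l1_gap(softened(teacher_logits, tau), softened(student_logits, tau))
    for tau in (1.0, 4.0, 16.0)
}
for tau, gap in gaps.items():
    print(f"tau={tau:>4}: L1 gap between softened responses = {gap:.4f}")
```

A moderate $\tau$ keeps the softened responses in a region the generator handles well while still leaving a usable gap between teacher and student.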

Ablation Study of Different $\mathcal{L}_{F}$ . In Eqn. 4, we use $\mathcal{L}_{F}$ to calculate the distance between generated samples $X_{S}^{\prime\prime}$ and $X_{T}^{\prime\prime}$ . From the analysis of experimental results, as shown in Tab. 5, we argue that the effect on distillation is similar whether $F$ equals $1$ or $2$ . The reason is that $\mathcal{L}_{F}$ is used to measure the distance between logits of the student and the boundary of cells, in which logits of the teacher reside, and different $\mathcal{L}_{F}$ represent similar gradient directions.
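To make the roles of $\beta$ and of the choice of $\mathcal{L}_{F}$ concrete, the sketch below assumes that the objective has the form $\mathcal{L}_{F}+\beta\,\mathcal{L}_{KL}$, with $\mathcal{L}_{F}$ an $\mathcal{L}_{1}$ or $\mathcal{L}_{2}$ distance between images generated from the student and teacher responses. The function `toy_generator` is a hypothetical fixed linear map standing in for the trained generator, and all names and values are illustrative assumptions rather than our implementation.

```python
import math

def softmax(logits, tau=1.0):
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_generator(response):
    """Hypothetical stand-in for the generator: fixed linear map from 3 responses to 4 'pixels'."""
    weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, -1.0, 0.5]]
    return [sum(w * r for w, r in zip(row, response)) for row in weights]

def lf_distance(img_a, img_b, order):
    """L1 (order=1) or L2 (order=2) distance between two generated images."""
    diffs = [abs(a - b) for a, b in zip(img_a, img_b)]
    return sum(diffs) if order == 1 else math.sqrt(sum(d * d for d in diffs))

def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing, so one-hot (hard) p is allowed."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(student_logits, teacher_response, beta, order=1, tau=1.0):
    """Assumed objective: image-space distance plus beta times the KL term."""
    p_s = softmax(student_logits, tau)
    image_term = lf_distance(toy_generator(p_s), toy_generator(teacher_response), order)
    return image_term + beta * kl_divergence(teacher_response, p_s)

teacher_soft = softmax([4.0, 1.0, 0.0])   # soft response
teacher_hard = [1.0, 0.0, 0.0]            # hard (one-hot) response
for order in (1, 2):
    far = distillation_loss([0.0, 3.0, 1.0], teacher_soft, beta=1.0, order=order)
    near = distillation_loss([3.5, 1.0, 0.2], teacher_soft, beta=1.0, order=order)
    print(f"L{order}: far student = {far:.4f}, near student = {near:.4f}")
print(f"hard response: {distillation_loss([3.5, 1.0, 0.2], teacher_hard, beta=1.0):.4f}")
```

In this toy setting both distance orders rank the two students in the same way, which is in line with the similar accuracies reported for $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$.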

Dataset	Method	Model	ACC ( $\mathcal{L}_{1}$ / $\mathcal{L}_{2}$ )
CIFAR-10	MEKD (soft)	MobileNet	86.85 / 86.63
CIFAR-10	MEKD (hard)	MobileNet	86.53 / 86.88
CIFAR-100	MEKD (soft)	MobileNet	67.07 / 66.95
CIFAR-100	MEKD (hard)	MobileNet	67.36 / 66.94

Table 5: Ablation study of different $\mathcal{L}_{F}$. We use ResNet56 as the teacher model. ACC: top-1 classification accuracy (%).

5.4 Extended Experiments

In the real-world application of cloud-to-edge model compression, there are some restrictions, such as the limitation of Internet data exchange and the domain shift in practical scenarios. We conduct additional experiments to explore the effect of MEKD under these constraints.

MEKD with Limited Query Samples. We distill a student MobileNet on CIFAR-10 and CIFAR-100 with a total query sample size ranging from $10K$ to $50K$ at intervals of $10K$. We report the ACC of MEKD with or without $\mathcal{L}_{IM}$ and $\mathcal{L}_{KL}$. The curves in Fig. 7 show that the more query samples are sent to the cloud server, the more fully the student model on the edge device can be trained. The curves also suggest that $\mathcal{L}_{IM}$ contributes little when KLD is not used as an additional distillation loss function, whereas in the full MEKD it gives a large boost due to the extra gradient direction of the mapping emulation.

MEKD with Out-of-Domain Data. We train a teacher (ResNet56 or VGG13) with vanilla supervised learning on Syn. Digits [13], which contains about $500K$ software-synthesized images. We distill a student (MobileNet) on SVHN [36], which consists only of real-world photographs. Tab. 6 shows the ACC on the test set of SVHN. MEKD outperforms most methods in the task of out-of-domain distillation, while DB3KD achieves higher performance due to the use of robust labels [54]. However, DB3KD incurs a very high data exchange cost between the server and client, since it requires multiple queries to find a mixed image on the decision boundary in order to compute robust labels. In contrast, the data exchange cost of MEKD is much lower.
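The reason multiple queries per sample inflate the exchange cost can be seen from a rough back-of-the-envelope estimate. The sketch below is purely illustrative: the sample count, the uint8 image encoding, the float32 response encoding, and the figure of 100 queries per sample are all hypothetical assumptions, and the result is not meant to reproduce the values in Tab. 6.

```python
def exchange_bytes(num_samples, image_bytes, response_bytes, queries_per_sample=1):
    """Bytes uploaded as query images plus bytes downloaded as responses."""
    return num_samples * queries_per_sample * (image_bytes + response_bytes)

IMAGE_BYTES = 32 * 32 * 3   # one 32x32 RGB image stored as uint8 (assumption)
SOFT_BYTES = 10 * 4         # ten float32 scores per soft response (assumption)

# Hypothetical budget of 50,000 samples.
single = exchange_bytes(50_000, IMAGE_BYTES, SOFT_BYTES)
multi = exchange_bytes(50_000, IMAGE_BYTES, SOFT_BYTES, queries_per_sample=100)

print(f"one query per sample:   {single / 1e6:.1f} MB")
print(f"100 queries per sample: {multi / 1e9:.2f} GB")
```

Under these assumptions the cost scales linearly with the number of queries per sample, so a method that probes the decision boundary repeatedly moves from the megabyte to the gigabyte range.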

Teacher	ResNet56	VGG13	Data Exchange
Teacher	74.37	79.86
Student	MobileNet	MobileNet
KD [19]	76.27	80.67	$\sim$ 175 MB
ML [3]	76.78	81.90	$\sim$ 175 MB
AL [53]	77.09	80.98	$\sim$ 175 MB
DKD [58]	75.47	80.64	$\sim$ 175 MB
DAFL [8]	69.20	67.07	$\sim$ 28.4 GB
KN [37]	79.65	83.37	$\sim$ 145 MB
AM [51]	84.05	86.70	$\sim$ 11.6 GB
DB3KD [54]	90.15	91.14	$\sim$ 20.8 GB
MEKD (soft)	86.45	88.65	$\sim$ 120 MB
MEKD (hard)	86.77	89.21	$\sim$ 120 MB

Table 6: Top-1 classification accuracy (%) of methods on SVHN. The teacher models are trained on Syn. Digits with vanilla supervised learning, and achieve the top-1 classification accuracy of 99.56% for ResNet56 and 99.52% for VGG13 on Syn. Digits.

6 Conclusion

In this paper, we provide a two-step workflow of deprivatization and distillation for B2KD. Different from aligning logits directly, we theoretically provide a new optimization direction from logits to cell boundaries, and propose a new method of MEKD. Taking a generator as an inverse mapping of the teacher function does not leak information about the internal structure or parameters of the teacher, because it has a completely different network structure.

Limitation. A well-trained generator is critical in MEKD, and GANs are known to suffer from mode collapse, especially for challenging tasks. We alleviate this problem with DCGAN. Although the parameter size and structural limitations of the model prevent the student from fully mimicking the function of the teacher, MEKD can still improve distillation performance compared with other B2KD methods.

Acknowledgement. This research was supported by Natural Science Fund of Hubei Province (Grant # 2022CFB823), Alibaba Innovation Research program under Grant Contract # CRAQ7WHZ11220001-20978282, and HUST Independent Innovation Research Fund (Grant # 2021XXJS096).

References

Aguilar et al. [2020] Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo. Knowledge distillation from internal representations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7350–7357, 2020.
Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
Ba and Caruana [2014] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Advances in Neural Information Processing Systems, 27, 2014.
Belagiannis et al. [2018] Vasileios Belagiannis, Azade Farshad, and Fabio Galasso. Adversarial network compression. In Proceedings of the European Conference on Computer Vision Workshops, pages 0–0, 2018.
Bergmann et al. [2020] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4183–4192, 2020.
Braun and Griebel [2009] Jürgen Braun and Michael Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive Approximation, 30(3):653–675, 2009.
Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
Chen et al. [2019] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3514–3522, 2019.
Chen et al. [2020] Hanting Chen, Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Learning student networks via feature embedding. IEEE Transactions on Neural Networks and Learning Systems, 32(1):25–35, 2020.
Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2016.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
Frankl and Maehara [1988] Peter Frankl and Hiroshi Maehara. The johnson-lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Series B, 44(3):355–362, 1988.
Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
Gou et al. [2021] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, 2021.
Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. Advances in Neural Information Processing Systems, 30, 2017.
He et al. [2020] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33:14068–14080, 2020.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint:1503.02531, 2(7), 2015.
Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Hu et al. [2017] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In International conference on machine learning, pages 1558–1567. PMLR, 2017.
Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
Kim et al. [2018] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Advances in Neural Information Processing Systems, 31, 2018.
Komodakis and Zagoruyko [2017] Nikos Komodakis and Sergey Zagoruyko. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
Köppen [2002] Mario Köppen. On the training of a kolmogorov network. In International Conference on Artificial Neural Networks, pages 474–479. Springer, 2002.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. University of Toronto: Technical Report, 2009.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Lei et al. [2019] Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and Xianfeng Gu. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 68:1–21, 2019.
Lei et al. [2020] Na Lei, Dongsheng An, Yang Guo, Kehua Su, Shixia Liu, Zhongxuan Luo, Shing-Tung Yau, and Xianfeng Gu. A geometric understanding of deep learning. Engineering, 6(3):361–374, 2020.
Lewis and Lucchetti [2000] Adrian Stephen Lewis and RE Lucchetti. Nonsmooth duality, sandwich, and squeeze theorems. SIAM Journal on Control and Optimization, 38(2):613–626, 2000.
Meng et al. [2019] Zhong Meng, Jinyu Li, Yong Zhao, and Yifan Gong. Conditional teacher-student learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6445–6449, 2019.
Micaelli and Storkey [2019] Paul Micaelli and Amos J Storkey. Zero-shot knowledge transfer via adversarial belief matching. Advances in Neural Information Processing Systems, 32, 2019.
Milgrom and Segal [2002] Paul Milgrom and Ilya Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint:1411.1784, 2014.
Mirzadeh et al. [2020] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, pages 5191–5198, 2020.
Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
Orekondy et al. [2019] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Knockoff nets: Stealing functionality of black-box models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4954–4963, 2019.
Ozkara et al. [2021] Kaan Ozkara, Navjot Singh, Deepesh Data, and Suhas Diggavi. Quped: Quantized personalization via distillation with applications to federated learning. Advances in Neural Information Processing Systems, 34, 2021.
Pan et al. [2020] Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020.
Passalis et al. [2020] Nikolaos Passalis, Maria Tzelepi, and Anastasios Tefas. Heterogeneous knowledge distillation using information flow modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2339–2348, 2020.
Passban et al. [2021] Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, and Qun Liu. Alp-kd: Attention-based layer projection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13657–13665, 2021.
Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint:1511.06434, 2015.
Schlegl et al. [2017] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, pages 146–157. Springer, 2017.
Shen et al. [2019] Zhiqiang Shen, Zhankui He, and Xiangyang Xue. Meal: Multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4886–4893, 2019.
Shi and Sha [2012] Yuan Shi and Fei Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML, 2012.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Stanton et al. [2021] Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34, 2021.
Tenenbaum et al. [2000] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
Tramèr et al. [2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX security symposium (USENIX Security 16), pages 601–618, 2016.
Villani [2009] Cédric Villani. Optimal transport: old and new. Springer, 2009.
Wang et al. [2020a] Dongdong Wang, Yandong Li, Liqiang Wang, and Boqing Gong. Neural networks are more productive teachers than human raters: Active mixup for data-efficient knowledge distillation from a blackbox model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1498–1507, 2020a.
Wang et al. [2020b] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8684–8694, 2020b.
Wang et al. [2018] Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Adversarial learning of portable student networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
Wang [2021] Zi Wang. Zero-shot knowledge distillation from a decision-based black-box model. In International Conference on Machine Learning, pages 10675–10685. PMLR, 2021.
Xu et al. [2017] Zheng Xu, Yen-Chang Hsu, and Jiawei Huang. Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint:1709.00513, 2017.
Ye et al. [2020] Jingwen Ye, Yixin Ji, Xinchao Wang, Xin Gao, and Mingli Song. Data-free knowledge amalgamation via group-stack dual-gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12516–12525, 2020.
Yim et al. [2017] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
Zhao et al. [2022] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11953–11962, 2022.
Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.

Appendix

In the Appendix, we provide proof of theorems and more experimental results for MEKD. We also visualize the real and generated distributions of MEKD with DCGAN to verify the effectiveness of our method.

A. Proofs

The success of deep learning can be attributed to the discovery of intrinsic structures of data, which is defined as the manifold distribution hypothesis [48]. The data is concentrated on a manifold $\Sigma\in\mathbb{R}^{n}$ , which is embedded in the image space $\mathcal{X}$ , and data distribution can be abstracted as a probability distribution $\mu$ over the data manifold. The encoding-map $\varphi:\Sigma\rightarrow\Omega$ maps the data manifold $\Sigma$ to the label manifold $\Omega\in\mathbb{R}^{C}$ in a label space $\mathcal{Y}$ which is also called latent space, while mapping the data distribution $\mu$ to latent distribution $\upsilon=\varphi_{\#}\mu$ . Each sample $x$ is mapped from the image space into the latent space, and its result $\varphi(x)$ is called a latent code. The decoding-map $\varphi^{-1}$ remaps latent codes to the data manifold. Both $\varphi$ and $\varphi^{-1}$ are strongly nonlinear functions, which can be simulated with different neural networks [28, 29]. Meanwhile, the well-known Kolmogorov Theorem [25, 6] indicates that any multivariate continuous function can be represented as the sum of continuous real-valued functions with continuous one-dimensional outer and inner functions $\Phi_{q}$ and $\Psi_{q,p}$ .

The teacher function $f_{T}\in\varphi$ can be considered as a kind of encoding map, and the generator function $f_{G}\in\varphi^{-1}$ can be considered as a kind of decoding map. Let $\mathcal{X}\in\mathbb{R}^{n}$ be the image space, where data $x$ is sampled from. For a $C$ -way classification task, let $\mathcal{Y}\in\mathbb{R}^{C}$ be the latent space, where $|\mathcal{Y}|=C$ . Defining the model as a complex mapping function from the image distribution to the latent distribution, we can consider the teacher model as $f_{T}:\mathcal{X}\rightarrow\mathcal{Y}$ parameterized by $\theta_{T}\in\Theta_{T}$ , whose outputs indicate the probabilities (e.g., logits) of what category the samples belong to. The same for the student model $f_{S}:\mathcal{X}\rightarrow\mathcal{Y}$ parameterized by $\theta_{S}\in\Theta_{S}$ .

Definition 3.

$$W(\mathbb{P}_{S},\mathbb{P}_{T})=\inf_{\gamma\in\Pi(\mathbb{P}_{S},\mathbb{P}_{T})}\mathbb{E}_{(y_{S},y_{T})\sim\gamma}\left[\|y_{S}-y_{T}\|\right]=0, \tag{15}$$

Definition 4.

$$W(\mu^{\prime},\mu)=\inf_{\gamma\in\Pi(\mu^{\prime},\mu)}\mathbb{E}_{(x^{\prime},x)\sim\gamma}\left[\|x^{\prime}-x\|\right]=0, \tag{16}$$

A.1. Proof of Theorem 1

Theorem 4.

$$\hat{W}(g(V_{S}),g(V_{T}))=\frac{1}{m}\sum_{i=1}^{m}\|g(y_{S}^{i})-g(y_{T}^{i})\|=0, \tag{17}$$

then $W(\mathbb{P}_{S},\mathbb{P}_{T})=0$ .

Proof.

According to the Johnson-Lindenstrauss theorem, for $y_{S}\in V_{S}$ and $y_{T}\in V_{T}$, we have

$$\|y_{S}-y_{T}\|\leq(1+\epsilon)\|g(y_{S})-g(y_{T})\|. \tag{18}$$

For the sets $V_{S}$ and $V_{T}$, we can get the empirical Wasserstein distance between them:

$$\begin{aligned}\hat{W}(V_{S},V_{T})&=\frac{1}{m}\sum_{i=1}^{m}\|y_{S}^{i}-y_{T}^{i}\|\\ &\leq\frac{1}{m}\sum_{i=1}^{m}(1+\epsilon)\|g(y_{S}^{i})-g(y_{T}^{i})\|\\ &=\frac{1+\epsilon}{m}\sum_{i=1}^{m}\|g(y_{S}^{i})-g(y_{T}^{i})\|\\ &=(1+\epsilon)\hat{W}(g(V_{S}),g(V_{T}))=0.\end{aligned} \tag{19}$$

The Wasserstein distance between $\mathbb{P}_{S}$ and $\mathbb{P}_{T}$ is the expectation of the empirical Wasserstein distance between $V_{S}$ and $V_{T}$, i.e.,

$$W(\mathbb{P}_{S},\mathbb{P}_{T})=\mathbb{E}_{(V_{S},V_{T})\sim\Pi(\mathbb{P}_{S},\mathbb{P}_{T})}\left[\hat{W}(V_{S},V_{T})\right], \tag{20}$$

so we can get

$$W(\mathbb{P}_{S},\mathbb{P}_{T})\leq\hat{W}(V_{S},V_{T})=0. \tag{21}$$

Since

$$W(\mathbb{P}_{S},\mathbb{P}_{T})=\inf_{\gamma\in\Pi(\mathbb{P}_{S},\mathbb{P}_{T})}\mathbb{E}_{(y_{S},y_{T})\sim\gamma}\left[\|y_{S}-y_{T}\|\right]\geq 0, \tag{22}$$

then the result $W(\mathbb{P}_{S},\mathbb{P}_{T})=0$ is derived. ∎
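The inequality chain of Eqn. 18 and Eqn. 19 can be checked numerically on synthetic data. The sketch below is only an illustration under stated assumptions: a random Gaussian linear map stands in for the generator mapping $g$ (the trained generator is nonlinear), the logit sets are random, and the values of $C$, $n$, $m$ and $\epsilon$ are hypothetical.

```python
import math
import random

def gaussian_map(out_dim, in_dim, rng):
    """Random linear stand-in for g with entries N(0, 1/out_dim), so norms are preserved on average."""
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def empirical_w(set_a, set_b):
    """Paired empirical distance (1/m) * sum_i ||a_i - b_i||."""
    return sum(dist(a, b) for a, b in zip(set_a, set_b)) / len(set_a)

rng = random.Random(0)
C, n, m, eps = 10, 1024, 20, 0.2   # hypothetical sizes and distortion
g = gaussian_map(n, C, rng)
V_T = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(m)]   # teacher logits
V_S = [[t + rng.gauss(0.0, 0.3) for t in y] for y in V_T]           # perturbed student logits

low = empirical_w(V_S, V_T)
high = empirical_w([project(g, y) for y in V_S], [project(g, y) for y in V_T])
print(f"W_hat(V_S, V_T)       = {low:.4f}")
print(f"W_hat(g(V_S), g(V_T)) = {high:.4f}")
print("bound of Eqn. 19 holds:", low <= (1 + eps) * high)
```

Since the distance in the logit space is bounded by a constant multiple of the distance in the image space, driving the latter to zero forces the former to zero as well, which is the step used in the proof.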

A.2. Proof of Theorem 2

Theorem 5.

$$\min_{\theta_{S}\in\Theta_{S}}\mathbb{E}_{x\sim\mu}\left[\|f_{G}\circ f_{S}(x),f_{G}\circ f_{T}(x)\|\right]\rightarrow 0, \tag{23}$$

it holds that $f_{S}\rightarrow f_{T}$ , and we have

$$\begin{aligned}\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[f_{S}(x)]&=\nabla_{\theta_{S}}W(\mathbb{P}_{S},\mathbb{P}_{T})\\ &=\mathbb{E}_{x\sim\mu}[\nabla_{\theta_{S}}\|f_{G}\circ f_{S}(x)-f_{G}\circ f_{T}(x)\|].\end{aligned} \tag{24}$$

Proof.

Let us define

$$V(f_{S},\theta_{S})=\mathbb{E}_{x\sim\mu}\left[\|f_{S}(x),f_{T}(x)\|\right], \tag{25}$$

$$V^{\prime}(f_{S},\theta_{S})=\mathbb{E}_{x\sim\mu}\left[\|f_{G}\circ f_{S}(x),f_{G}\circ f_{T}(x)\|\right], \tag{26}$$

where $f_{S}$ lies in $\mathcal{F}_{S}=\{f_{S}:\mathcal{X}\rightarrow\mathcal{Y}\}$ and $\theta_{S}\in\Theta_{S}$.

According to the Johnson-Lindenstrauss Lemma [12], for any $0<\epsilon<1/2$ and any integer $m>4$, let $n=\frac{20\log m}{\epsilon^{2}}$. Then, for any set $S$ of $m$ points in $\mathbb{R}^{C}$, the generator mapping function $f_{G}:\mathbb{R}^{C}\rightarrow\mathbb{R}^{n}$ satisfies, for all $f_{S}(x),f_{T}(x)\in S$,

$$(1-\epsilon)\|f_{G}\circ f_{S}(x),f_{G}\circ f_{T}(x)\|\leq\|f_{S}(x),f_{T}(x)\|\leq(1+\epsilon)\|f_{G}\circ f_{S}(x),f_{G}\circ f_{T}(x)\|. \tag{27}$$

Using the Squeeze Theorem [30], we know that the minimizations of Eqn. 25 and Eqn. 26 converge to the same result, i.e.,

$$\inf V(f_{S},\theta_{S})=\inf V^{\prime}(f_{S},\theta_{S}). \tag{28}$$

We can rewrite Eqn. 15 using $x\sim\mu$:

$$\begin{aligned}W(\mathbb{P}_{S},\mathbb{P}_{T})&=\inf_{\gamma\in\Pi(\mathbb{P}_{S},\mathbb{P}_{T})}\mathbb{E}_{(y_{S},y_{T})\sim\gamma}\left[\|y_{S}-y_{T}\|\right]\\ &=\inf_{\gamma\in\Pi(f_{S}(\mu),f_{T}(\mu))}\mathbb{E}_{x\sim\mu}\left[\|f_{S}(x),f_{T}(x)\|\right]\\ &=\inf_{\gamma\in\Pi(f_{S}(\mu),f_{T}(\mu))}V(f_{S},\theta_{S}),\end{aligned} \tag{29}$$

where $f_{S}$ and $f_{T}$ map distribution $\mu$ to $\mathbb{P}_{S}$ and $\mathbb{P}_{T}$ , respectively. So we can get

$$\inf\ V^{\prime}(f_{S},\theta_{S})=\inf\ V(f_{S},\theta_{S})=W(\mathbb{P}_{S},\mathbb{P}_{T}). \tag{30}$$

According to Def. 3, when $\inf V^{\prime}(f_{S},\theta_{S})\rightarrow 0$ , then $W(\mathbb{P}_{S},\mathbb{P}_{T})\rightarrow 0$ , and we can derive that $f_{S}\rightarrow f_{T}$ .

The rest of the proof is dedicated to showing that the optimal solution of $\min V^{\prime}(f_{S},\theta_{S})$ reduces the Wasserstein distance between $\mathbb{P}_{S}$ and $\mathbb{P}_{T}$, which drives $f_{S}$ to approximate $f_{T}$.

We know by the Kantorovich-Rubinstein duality [50] that there is an $\tilde{f}_{S}\in\mathcal{F}_{S}$ that attains

$$\inf\ \mathbb{E}_{x\sim\mu}[\|\tilde{f}_{S}(x),f_{T}(x)\|]=\sup\ \mathbb{E}_{x\sim\mu}[\tilde{f}_{S}(x)]-\mathbb{E}_{x\sim\mu}[f_{T}(x)]. \tag{31}$$

Let us define $\tilde{X}(\theta_{S})=\{\tilde{f}_{S}\in\mathcal{F}_{S}:V(\tilde{f}_{S},\theta_{S})=W(\mathbb{P}_{S},\mathbb{P}_{T})\}$, which is non-empty. We know by a simple envelope theorem [33] that

$$\nabla_{\theta_{S}}W(\mathbb{P}_{S},\mathbb{P}_{T})=\nabla_{\theta_{S}}V(\tilde{f}_{S},\theta_{S}), \tag{32}$$

for any $\tilde{f}_{S}\in\tilde{X}(\theta_{S})$ when both terms are well-defined.

Let $\tilde{f}_{S}\in\tilde{X}(\theta_{S})$, which we know exists since $\tilde{X}(\theta_{S})$ is non-empty for all $\theta_{S}$. Then, we get

$$\begin{aligned}\nabla_{\theta_{S}}W(\mathbb{P}_{S},\mathbb{P}_{T})&=\nabla_{\theta_{S}}V(\tilde{f}_{S},\theta_{S})\\ &=\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[\|\tilde{f}_{S}(x),f_{T}(x)\|]\\ &=\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[\tilde{f}_{S}(x)]-\mathbb{E}_{x\sim\mu}\left[f_{T}(x)\right]\\ &=\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[\tilde{f}_{S}(x)].\end{aligned} \tag{33}$$

In practice, we use empirical distance between generated images of the student and teacher as loss to update $\theta_{S}$ by back-propagation, i.e.,

$$\begin{aligned}\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[f_{S}(x)]&=\nabla_{\theta_{S}}W(\mathbb{P}_{S},\mathbb{P}_{T})\\ &=\nabla_{\theta_{S}}W((f_{G})_{\#}\mathbb{P}_{S},(f_{G})_{\#}\mathbb{P}_{T})\\ &=\nabla_{\theta_{S}}\mathbb{E}_{x\sim\mu}[\|f_{G}\circ f_{S}(x)-f_{G}\circ f_{T}(x)\|]\\ &=\mathbb{E}_{x\sim\mu}[\nabla_{\theta_{S}}\|f_{G}\circ f_{S}(x)-f_{G}\circ f_{T}(x)\|],\end{aligned} \tag{34}$$

when $W(\mathbb{P}_{S},\mathbb{P}_{T})\rightarrow 0$ , the student function $f_{S}$ converges to the teacher function $f_{T}$ . ∎
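The practical update rule can be illustrated with a toy example in which gradients of the image-space distance flow back to the student only. The sketch below makes several simplifying assumptions that are not part of the proof: the student and the hidden teacher are linear maps, the frozen generator is a fixed linear map `A`, the squared $\mathcal{L}_{2}$ distance replaces the norm so that the gradient is smooth, and all sizes and the learning rate are hypothetical. The teacher is used only through its recorded responses, as in the black-box setting.

```python
import random

# Toy frozen generator f_G: R^2 -> R^4 (hypothetical; chosen so that A^T A = 3I).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

rng = random.Random(0)
d, m = 3, 50
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
W_T = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]    # hidden teacher, only queried
responses = [matvec(W_T, x) for x in xs]       # black-box answers f_T(x)
W_S = [[0.0] * d for _ in range(2)]            # student parameters theta_S

def image_loss_and_grad(W):
    """Mean squared image-space distance and its gradient w.r.t. the student weights."""
    loss = 0.0
    grad = [[0.0] * d for _ in range(2)]
    for x, y_t in zip(xs, responses):
        y_s = matvec(W, x)
        diff_img = [a - b for a, b in zip(matvec(A, y_s), matvec(A, y_t))]
        loss += sum(v * v for v in diff_img) / m
        back = matvec(transpose(A), diff_img)  # A^T (f_G(y_S) - f_G(y_T))
        for r in range(2):
            for c in range(d):
                grad[r][c] += 2.0 * back[r] * x[c] / m
    return loss, grad

initial_loss, _ = image_loss_and_grad(W_S)
for _ in range(300):
    _, grad = image_loss_and_grad(W_S)
    W_S = [[w - 0.05 * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W_S, grad)]
final_loss, _ = image_loss_and_grad(W_S)
max_gap = max(abs(s - t) for s_row, t_row in zip(W_S, W_T) for s, t in zip(s_row, t_row))
print(f"image-space loss: {initial_loss:.4f} -> {final_loss:.2e}; max |W_S - W_T| = {max_gap:.2e}")
```

In this toy case minimizing the image-space distance recovers the teacher weights, which mirrors the statement that $f_{S}$ converges to $f_{T}$ as the distance vanishes.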

A.3. Proof of Theorem 3

Theorem 6.

$$R(h)\leq\hat{R}_{\rho}(h)+\frac{2C^{2}}{\rho(1-\epsilon)}\sqrt{\frac{r^{2}\Lambda^{2}}{m}}+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{35}$$

Proof.

For the $C$-way classification task, a hypothesis $h:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$ predicts for $x$ the label $y$ with the minimum distance, i.e. $\arg\min_{y\in\mathcal{Y}}\|\overline{h}(x)-\overline{h}_{y}\|$, which is equivalent to $\arg\min_{y\in\mathcal{Y}}(1+\epsilon)\|g(\overline{h}(x))-g(\overline{h}_{y})\|$ by the Johnson-Lindenstrauss theorem. We define the margin $\rho_{h}(x,y)$ of the hypothesis $h$ as

$$\rho_{h}(x,y)=\|g(\overline{h}(x))-g(\overline{h}_{y})\|-\min_{y^{\prime}\neq y}\|g(\overline{h}(x))-g(\overline{h}_{y^{\prime}})\|, \tag{36}$$

where $\overline{h}(x)$ is the vector of $h(x,y),y\in\mathcal{Y}$, and $\overline{h}_{y}$ is obtained by using the mean of the samples $x$ that belong to class $y$ as input. $g$ is the mapping function of the generator $G$.

For any $\rho<0$ , we can define the empirical margin loss of hypothesis $h$ for multi-class classification as

$$\hat{R}_{\rho}(h)=\frac{1}{m}\sum_{i=1}^{m}\Phi_{\rho}(\rho_{h}(x_{i},y_{i})), \tag{37}$$

where $\Phi_{\rho}$ is the margin loss function

$$\Phi_{\rho}(x)=\begin{cases}1, & 0\leq x,\\ 1-x/\rho, & \rho\leq x\leq 0,\\ 0, & x\leq\rho.\end{cases} \tag{41}$$
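For concreteness, the margin of Eqn. 36 and the margin loss $\Phi_{\rho}$ with $\rho<0$ can be written as a short sketch. It assumes that the image-space distances $\|g(\overline{h}(x))-g(\overline{h}_{y})\|$ to each class have already been computed; the distance values and labels below are hypothetical.

```python
def margin(distances, label):
    """rho_h(x, y): distance to the labelled class minus the smallest distance to any other class."""
    others = [d for k, d in enumerate(distances) if k != label]
    return distances[label] - min(others)

def margin_loss(value, rho):
    """Phi_rho for rho < 0: 1 for value >= 0, 0 for value <= rho, linear in between."""
    if value >= 0.0:
        return 1.0
    if value <= rho:
        return 0.0
    return 1.0 - value / rho

def empirical_margin_loss(all_distances, labels, rho):
    """Average of Phi_rho over the sample, as in Eqn. 37."""
    values = [margin(d, y) for d, y in zip(all_distances, labels)]
    return sum(margin_loss(v, rho) for v in values) / len(values)

# Hypothetical per-class distances for three samples and their labels.
distances = [[0.2, 1.5, 2.0], [1.0, 0.6, 1.4], [0.9, 0.8, 2.0]]
labels = [0, 1, 0]
print(empirical_margin_loss(distances, labels, rho=-1.0))
```

With this sign convention a correctly classified sample has a negative margin, and only samples whose margin lies below $\rho$ incur no loss.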

Thus, empirical margin loss is upper bounded by

$$\hat{R}_{\rho}(h)\leq\frac{1}{m}\sum_{i=1}^{m}\mathbbm{1}_{\rho_{h}(x_{i},y_{i})\geq\rho}. \tag{42}$$

Let $\tilde{H}=\{(x,y)\mapsto\rho_{h}(x,y):h\in H\}$, and consider the family of functions $\tilde{\mathcal{H}}=\{\Phi_{\rho}\circ r:r\in\tilde{H}\}$ derived from $\tilde{H}$, which take values in $[0,1]$. By the Rademacher complexity bound, with probability at least $1-\delta$, for all $h\in H$,

$$E[\Phi_{\rho}(\rho_{h}(x,y))]\leq\hat{R}_{\rho}(h)+2\mathcal{R}_{m}(\Phi_{\rho}\circ\tilde{H})+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{43}$$

Since $\mathbbm{1}_{\mu\geq 0}\leq\Phi_{\rho}(\mu)$ for all $\mu\in\mathbb{R}$, the generalization error $R(h)$ is a lower bound on the left-hand side by the Johnson-Lindenstrauss theorem, i.e. $R(h)=E\left[\mathbbm{1}_{\|\overline{h}(x)-\overline{h}_{y}\|-\min_{y^{\prime}\neq y}\|\overline{h}(x)-\overline{h}_{y^{\prime}}\|\geq 0}\right]\leq E[\Phi_{\rho}(\rho_{h}(x,y))]$, and we get

$$R(h)\leq\hat{R}_{\rho}(h)+2\mathcal{R}_{m}(\Phi_{\rho}\circ\tilde{H})+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{44}$$

Let $\rho=-\rho$ (so that $\rho>0$ from here on). Because $\Phi_{\rho}$ is $(1/\rho)$-Lipschitz, we have $\mathcal{R}_{m}(\Phi_{\rho}\circ\tilde{H})\leq\frac{1}{\rho}\mathcal{R}_{m}(\tilde{H})$. Here, $\mathcal{R}_{m}(\tilde{H})$ can be upper bounded as follows:

$$\begin{aligned}\mathcal{R}_{m}(\tilde{H})&=\frac{1}{m}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y_{i})]\\ &=\frac{1}{m}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sum_{y\in\mathcal{Y}}\sigma_{i}\rho_{h}(x_{i},y)\mathbbm{1}_{y=y_{i}}]\\ &\leq\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)\mathbbm{1}_{y=y_{i}}]\\ &=\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)(\frac{2(\mathbbm{1}_{y=y_{i}})-1}{2}+\frac{1}{2})]\\ &\leq\frac{1}{2m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}(2(\mathbbm{1}_{y=y_{i}})-1)\rho_{h}(x_{i},y)]\\ &\qquad+\frac{1}{2m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)]\\ &=\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\rho_{h}(x_{i},y)],\end{aligned} \tag{45}$$

where $\mathbf{\sigma}=(\sigma_{1},\ldots,\sigma_{m})^{T}$ with the $\sigma_{i}$ independent uniform random variables taking values in $\{-1,+1\}$; the last equality uses the fact that $\sigma_{i}$ and $-\sigma_{i}$ have the same distribution.

Let $\Pi_{1}(H)^{(C-1)}=\{\min\{h_{1},\ldots,h_{C-1}\}:h_{i}\in\Pi_{1}(H),i\in[1,C-1]\}$. By the Johnson-Lindenstrauss theorem, we get

$$\begin{aligned}
\mathcal{R}_{m}(\tilde{H})&\leq\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\left[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\left(\|g(\overline{h}(x_{i}))-g(\overline{h}_{y})\|-\min_{y^{\prime}\neq y}\|g(\overline{h}(x_{i}))-g(\overline{h}_{y^{\prime}})\|\right)\right]\\
&\leq\frac{1}{m}\sum_{y\in\mathcal{Y}}\mathop{E}_{S,\sigma}\left[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\frac{1}{1-\epsilon}\left(\|\overline{h}(x_{i})-\overline{h}_{y}\|-\min_{y^{\prime}\neq y}\|\overline{h}(x_{i})-\overline{h}_{y^{\prime}}\|\right)\right]\\
&\leq\frac{1}{(1-\epsilon)m}\sum_{y\in\mathcal{Y}}\left[\mathop{E}_{S,\sigma}\left[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\|\overline{h}(x_{i})-\overline{h}_{y}\|\right]\right.\\
&\quad\left.+\mathop{E}_{S,\sigma}\left[\sup_{h\in H}\sum_{i=1}^{m}\sigma_{i}\min_{y^{\prime}\neq y}\|\overline{h}(x_{i})-\overline{h}_{y^{\prime}}\|\right]\right]\\
&\leq\frac{1}{(1-\epsilon)m}\sum_{y\in\mathcal{Y}}\left[\mathop{E}_{S,\sigma}\left[\sup_{h\in\Pi_{1}(H)}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\right]\right.\\
&\quad\left.+\mathop{E}_{S,\sigma}\left[\sup_{h\in\Pi_{1}(H)^{(C-1)}}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\right]\right]\\
&\leq\frac{C}{(1-\epsilon)m}\left[C\mathop{E}_{S,\sigma}\left[\sup_{h\in\Pi_{1}(H)}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\right]\right]\\
&=\frac{C^{2}}{1-\epsilon}\left[\frac{1}{m}\mathop{E}_{S,\sigma}\left[\sup_{h\in\Pi_{1}(H)}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\right]\right]\\
&=\frac{C^{2}}{1-\epsilon}\mathcal{R}_{m}(\Pi_{1}(H)).
\end{aligned}\tag{46}$$

Let $K:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ be a positive definite symmetric kernel and let $h(x,y)=\arg\max_{y\in\mathcal{Y}}w_{y}\cdot\Phi(x)$, where $\Phi:\mathcal{X}\rightarrow\mathbb{R}^{n}$ is a feature mapping associated with $K$. We denote $W=(w_{1}^{\top},\ldots,w_{C}^{\top})$. For any $p\geq 1$, the family of kernel-based hypotheses is

$$H=\{h\in\mathcal{R}^{\mathcal{X}\times\mathcal{Y}}:h(x,y)\in\mathbb{R}^{n},\|h\|_{p}\leq\Lambda\},\tag{47}$$

where $\|h\|_{p}=(\sum_{y=1}^{C}\|h(x,y)\|^{p})^{1/p}$ .

Observe that for all $l\in[1,C]$, we have $\|w_{l}\|\leq(\sum_{l=1}^{C}\|w_{l}\|^{p})^{1/p}=\|W\|_{p}\leq\|h\|_{p}\leq\Lambda$. Moreover, for $i\neq j$, $\mathop{E}_{\sigma}[\sigma_{i}\sigma_{j}]=0$. The Rademacher complexity of the hypothesis set $\Pi_{1}(H)$ can be expressed and bounded as follows:

$$\begin{aligned}
\mathcal{R}_{m}(\Pi_{1}(H))&=\frac{1}{m}\mathop{E}_{S,\sigma}\left[\sup_{y\in\mathcal{Y},\|W\|\leq\Lambda}\left\langle w_{y},\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\rangle\right]\\
&\leq\frac{1}{m}\mathop{E}_{S,\sigma}\left[\sup_{y\in\mathcal{Y},\|W\|\leq\Lambda}\|w_{y}\|\left\|\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\|\right]\\
&\leq\frac{\Lambda}{m}\mathop{E}_{S,\sigma}\left[\left\|\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\|\right]\\
&\leq\frac{\Lambda}{m}\left[\mathop{E}_{S,\sigma}\left[\left\|\sum_{i=1}^{m}\sigma_{i}\Phi(x_{i})\right\|^{2}\right]\right]^{1/2}\\
&=\frac{\Lambda}{m}\left[\mathop{E}_{S,\sigma}\left[\sum_{i=1}^{m}\|\Phi(x_{i})\|^{2}\right]\right]^{1/2}\\
&=\frac{\Lambda}{m}\left[\mathop{E}_{S,\sigma}\left[\sum_{i=1}^{m}K(x_{i},x_{i})\right]\right]^{1/2}\\
&\leq\frac{\Lambda\sqrt{mr^{2}}}{m}=\sqrt{\frac{r^{2}\Lambda^{2}}{m}},
\end{aligned}\tag{48}$$

which concludes the proof.

∎
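As a numerical sanity check of the last chain of inequalities, the following sketch computes the empirical Rademacher complexity of a norm-bounded linear class exactly on a small fixed sample and compares it with $\sqrt{r^{2}\Lambda^{2}/m}$. It rests on simplifying assumptions that are not part of the proof: a single weight vector instead of the $C$ class vectors, the identity feature map (so $K(x,x)=\|x\|^{2}$), a fixed toy sample instead of the expectation over $S$, and illustrative names such as `empirical_rademacher` and `kernel_bound`.

```python
import itertools
import math
import random


def empirical_rademacher(points, lam):
    """Exact empirical Rademacher complexity of {x -> <w, x> : ||w|| <= lam}.

    Assumption: identity feature map, so Phi(x) = x and K(x, x) = ||x||^2.
    """
    m = len(points)
    dim = len(points[0])
    total = 0.0
    # Enumerate all 2^m sign vectors, so the expectation over sigma is exact.
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        v = [sum(s * p[k] for s, p in zip(signs, points)) for k in range(dim)]
        # sup over ||w|| <= lam of <w, v> equals lam * ||v|| (Cauchy-Schwarz).
        total += lam * math.sqrt(sum(c * c for c in v))
    return total / (2 ** m) / m


def kernel_bound(points, lam):
    """Right-hand side sqrt(r^2 * lam^2 / m), with r^2 = max_i K(x_i, x_i)."""
    m = len(points)
    r_squared = max(sum(c * c for c in p) for p in points)
    return math.sqrt(r_squared * lam ** 2 / m)


rng = random.Random(0)  # fixed seed; hypothetical toy data
sample = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(10)]
LAMBDA = 2.0  # hypothetical norm budget

complexity = empirical_rademacher(sample, LAMBDA)
bound = kernel_bound(sample, LAMBDA)
print(f"empirical complexity = {complexity:.4f}, bound = {bound:.4f}")
```

Enumerating all sign vectors keeps the check deterministic, at the cost of being feasible only for small $m$; the gap between the two numbers comes from the Jensen step and from replacing each $K(x_{i},x_{i})$ by $r^{2}$.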

Method	Data Size	MNIST				CIFAR-10
Teacher (model)	50K $\sim$ 100K	ResNet32	VGG13	ResNet32	ResNet32	ResNet56	VGG13	ResNet56	ResNet56
Teacher (acc.)	50K $\sim$ 100K	99.50	99.52	99.50	99.50	94.15	94.42	94.15	94.15
Student (model)	50K $\sim$ 100K	ResNet8	VGG11	VGG11	MobileNet	ResNet8	VGG11	VGG11	MobileNet
Student (acc.)	50K $\sim$ 100K	99.24	99.41	99.41	99.18	87.74	91.81	91.81	90.04
KD [19]	50K $\sim$ 100K	99.33	99.44	99.31	99.30	86.58	92.16	92.25	90.43
ML [3]	50K $\sim$ 100K	99.49	99.40	99.44	99.40	87.89	91.58	91.91	91.19
AL [53]	50K $\sim$ 100K	99.37	99.26	99.26	99.21	87.25	91.96	91.97	90.54
DKD [58]	50K $\sim$ 100K	99.33	99.43	99.48	99.42	86.61	92.06	92.42	90.50
DAFL [8]	0K	96.42	97.00	96.14	97.85	60.67	65.41	66.03	69.59
KN [37]	10K	98.61	98.81	98.07	98.54	80.62	81.83	82.41	85.07
AM [51]	10K	99.33	99.47	99.50	99.42	74.89	77.25	74.26	73.65
DB3KD [54]	10K	98.94	99.16	98.91	98.91	78.47	83.72	85.84	81.67
MEKD (soft)	10K	99.40	99.43	99.36	99.25	85.36	86.11	87.27	86.85
MEKD (hard)	10K	99.40	99.45	99.28	99.27	84.45	86.16	87.25	86.53

Table 7: Top-1 classification accuracy (%) of the student model on MNIST and CIFAR-10.

Method	Data Size	CIFAR-100				Tiny ImageNet
Teacher (model)	50K $\sim$ 100K	ResNet56	VGG13	ResNet56	ResNet56	ResNet110	VGG13	ResNet110	ResNet110
Teacher (acc.)	50K $\sim$ 100K	72.06	74.68	72.06	72.06	60.71	59.89	60.71	60.71
Student (model)	50K $\sim$ 100K	ResNet8	VGG11	VGG11	MobileNet	ResNet32	VGG11	VGG11	MobileNet
Student (acc.)	50K $\sim$ 100K	59.92	69.12	69.12	68.14	55.47	54.14	54.14	56.07
KD [19]	50K $\sim$ 100K	53.31	70.88	67.97	71.86	54.14	54.40	49.63	57.85
ML [3]	50K $\sim$ 100K	54.44	67.78	70.18	73.08	56.56	57.46	56.78	60.07
AL [53]	50K $\sim$ 100K	58.36	69.92	71.13	71.33	46.02	46.26	45.60	51.29
DKD [58]	50K $\sim$ 100K	54.28	67.32	70.10	72.38	55.99	55.88	56.52	59.43
DAFL [8]	0K	42.44	43.78	48.32	54.10	38.44	31.93	34.13	40.93
KN [37]	10K	48.75	57.83	55.64	58.49	48.92	46.99	45.05	50.22
AM [51]	10K	50.69	62.17	63.20	65.58	47.72	49.26	47.32	51.54
DB3KD [54]	10K	50.49	63.48	62.76	63.67	47.95	48.46	46.93	50.49
MEKD (soft)	10K	51.87	64.76	64.83	67.07	50.87	51.85	49.95	54.93
MEKD (hard)	10K	51.67	64.72	65.32	67.36	49.89	51.33	49.36	54.71

Table 8: Top-1 classification accuracy (%) of the student model on CIFAR-100 and Tiny ImageNet.

Dataset	T - S Pairs	Data Size	KD (soft)	ML (soft)	AL (soft)	DKD (soft)	KN (soft)	AM (soft)	DB3KD (hard)	MEKD (soft)	MEKD (hard)
ImageNet-1K	RN50 - RN34	100K	52.08	54.97	53.50	53.57	56.77	56.92	58.61	59.89	59.32
ImageNet-1K	RX101 - RX50	100K	54.90	56.58	50.88	55.31	57.43	55.64	59.90	61.21	60.54

Table 9: Top-1 classification accuracy (%) of the student model on ImageNet-1K. We use pretrained RN50 (76.13%) and RX101 (79.31%) as the teacher models, respectively. RN is ResNet and RX is ResNeXt.

Dataset	T - S Pairs	Data Size	KD (soft)	ML (soft)	AL (soft)	DKD (soft)	KN (soft)	AM (soft)	DB3KD (hard)	MEKD (soft)	MEKD (hard)
CIFAR10	T: ResNet56	0.1K	16.74	17.78	12.97	20.66	27.67	48.31	43.05	49.04	47.12
	(94.15%)	1K	31.25	31.57	32.05	31.09	58.65	62.05	64.28	69.84	68.66
	S: MobileNet	10K	70.90	73.06	68.61	75.44	85.07	73.65	81.67	86.85	86.53
	(90.04%)	50K(full)	90.43	91.19	90.54	90.50	92.19	86.33	92.46	93.48	93.09
CIFAR100	T: ResNet56	0.1K	01.96	01.88	01.72	02.56	13.23	36.73	30.72	33.56	34.60
	(72.06%)	1K	10.36	10.06	09.62	10.81	35.80	52.09	50.14	53.84	54.52
	S: MobileNet	10K	44.32	48.08	40.57	47.24	58.49	65.58	63.67	67.07	67.36
	(68.14%)	50K(full)	71.86	73.08	71.33	72.38	70.85	71.77	73.36	73.84	73.27

Table 10: Ablation study of data size with top-1 classification accuracy (%) of the student model on CIFAR-10 and CIFAR-100.

B. More Results

B.1. Complete Distillation Experiments

We run distillation experiments on different teacher-student pairs, using ResNet32 / ResNet56 / VGG13 / ResNet110 / ResNet50 / ResNeXt101 as teacher models and ResNet8 / ResNet32 / VGG11 / MobileNet / ResNet34 / ResNeXt50 as student models. Distillation performance is evaluated on MNIST, CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-1K, with top-1 classification accuracy as the evaluation metric. The experimental results are shown in Tab. 7, Tab. 8 and Tab. 9. Teacher and student models are trained with the same hyperparameter settings, so that student models distilled with different methods can be compared, under the same conditions, with the teacher model trained by vanilla supervised learning.

We also provide complete ablation results for different data sizes on CIFAR-10 and CIFAR-100, as shown in Tab. 10, using the ResNet56 - MobileNet teacher-student pair. The results show that B2KD methods are generally more robust than traditional KD methods at small data sizes, making fuller use of the information in the available samples when data is extremely scarce. Across all methods, MEKD achieves the best accuracy in every setting except 0.1K samples on CIFAR-100, where AM is higher, which supports the effectiveness and robustness of our proposed method.

In all experiments, teacher and student models are trained for $350$ epochs, except on MNIST, where they are trained for $12$ epochs. We use Nesterov SGD with momentum $0.9$ and weight decay $0.0005$, with a mini-batch size of $128$ images on a single NVIDIA GeForce RTX 3090 GPU. The initial learning rate is $0.1$ ($0.01$ for MNIST), and a multi-step schedule multiplies the learning rate by $0.1$ at the $116^{th}$ and $233^{rd}$ epochs; no learning rate schedule is used for MNIST. For the training of student models, we follow the unsupervised setting and use only the soft or hard responses of teacher models for distillation. Every experiment is run three times, and we report the mean accuracy.
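The training schedule described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' training script: epochs are taken to be 0-indexed, the decay is assumed to apply from the milestone epoch onward, and the names `learning_rate`, `num_epochs` and `OPTIMIZER` are hypothetical. In PyTorch, the same settings would roughly correspond to `torch.optim.SGD(..., nesterov=True)` combined with `torch.optim.lr_scheduler.MultiStepLR(milestones=[116, 233], gamma=0.1)`, although the text does not state which framework was used.

```python
import math

# Optimizer settings quoted from the text; the dict layout is an assumption.
OPTIMIZER = {
    "type": "nesterov_sgd",
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "batch_size": 128,
}

MILESTONES = (116, 233)  # epochs at which the rate is multiplied by GAMMA
GAMMA = 0.1


def num_epochs(dataset="cifar10"):
    """Number of training epochs: 12 for MNIST, 350 otherwise."""
    return 12 if dataset == "mnist" else 350


def learning_rate(epoch, dataset="cifar10"):
    """Learning rate for a 0-indexed epoch under the described schedule."""
    if dataset == "mnist":
        return 0.01  # constant rate; no schedule is used for MNIST
    drops = sum(1 for milestone in MILESTONES if epoch >= milestone)
    return 0.1 * GAMMA ** drops


# Print the rate around each milestone and at the last epoch.
for epoch in (0, 115, 116, 232, 233, num_epochs() - 1):
    print(epoch, learning_rate(epoch))
```

Whether the drop takes effect at epoch 116 or 117 depends on the indexing convention, which the text leaves open.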

For the training of DCGAN, we follow the hyperparameter settings of [42]. DCGAN is composed of a generator built from transposed convolution layers and a discriminator built from ordinary convolution layers, which greatly reduces the number of network parameters and improves the quality of the generated images. As an extension of our method, we believe that generative models of other architectures can also serve as emulators that learn the inverse mapping of the teacher function, with an information maximization (IM) loss added to alleviate mode collapse and achieve deprivatization. We leave this to future work.
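The text does not define the IM loss. One common formulation, assumed here purely for illustration, is the mean per-sample prediction entropy minus the entropy of the batch-averaged prediction, so that a low value means confident predictions spread evenly over classes. The sketch below evaluates this quantity on class-probability vectors; the function names and the toy batches are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def information_maximization_loss(batch_probs):
    """Mean per-sample entropy minus entropy of the batch-mean prediction.

    Lower is better: each prediction is confident (small first term) while
    the batch as a whole covers the classes evenly (large second term).
    """
    n = len(batch_probs)
    num_classes = len(batch_probs[0])
    conditional = sum(entropy(p) for p in batch_probs) / n
    mean_pred = [sum(p[c] for p in batch_probs) / n for c in range(num_classes)]
    return conditional - entropy(mean_pred)


# A mode-collapsed batch (every sample in one class) versus a diverse batch.
collapsed = [[1.0, 0.0, 0.0]] * 3
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print("collapsed:", information_maximization_loss(collapsed))
print("diverse:  ", information_maximization_loss(diverse))
```

Under this formulation the diverse batch scores lower than the collapsed one, which is the sense in which such a term could discourage a generator from producing samples of a single class.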

B.2. Visualization Results

We evaluate the training process of DCGAN in terms of whether the generated distribution is consistent with the real distribution, and visualize the synthetic and genuine images by t-SNE projection. As shown in Fig. 8 and Fig. 9, the generated distribution gradually approaches the real distribution over the course of DCGAN training. This supports the use of DCGAN as the emulator that learns the inverse mapping of the teacher function, and indicates that DCGAN can alleviate mode collapse and generate images consistent with the distribution of real images. These synthetic images not only integrate the various patterns found in genuine images, but also serve as effective query samples for the distillation of student models.

$$\begin{aligned}
(1-\epsilon)\,\|f_{G}\circ f_{S}(x),f_{G}\circ f_{T}(x)\|&\leq\|f_{S}(x),f_{T}(x)\|\\
&\leq(1+\epsilon)\,\|f_{G}\circ f_{S}(x),f_{G}\circ f_{T}(x)\|.
\end{aligned}\tag{27}$$
