
Instance Temperature Knowledge Distillation

Zhengbo Zhang1, Yuxi Zhou2, Jia Gong1, Jun Liu1, Zhigang Tu2
1Singapore University of Technology and Design, 2Wuhan University
zhengbo_zhang@mymail.sutd.edu.sg, yuxizhou@whu.edu.cn
jia_gong@mymail.sutd.edu.sg, jun_liu@sutd.edu.sg, tuzhigang@whu.edu.cn
Zhengbo Zhang and Yuxi Zhou contributed equally. Corresponding author.
Abstract

Knowledge distillation (KD) enhances the performance of a student network by allowing it to incrementally learn the knowledge transferred from a teacher network. Existing methods dynamically adjust the temperature to enable the student network to adapt to the varying learning difficulties at different learning stages of KD. KD is a continuous process, but when adjusting the temperature, these methods consider only the immediate benefits of the operation in the current learning phase and fail to take into account its future returns. To address this issue, we formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning, termed RLKD. Importantly, we design a novel state representation that enables the agent to take better-informed actions (i.e., instance temperature adjustments). To handle the problem of delayed rewards that arises in our method from the KD setting, we explore an instance reward calibration approach. In addition, we devise an efficient exploration strategy that enables the agent to learn a valuable instance temperature adjustment policy more efficiently. Our framework can serve as a plug-and-play technique that is easily inserted into various KD methods, and we validate its effectiveness on both image classification and object detection tasks. Our project is at https://www.zayx.me/ITKD.github.io/.

1 Introduction

Figure 1: KD is a continual process. However, the previous KD methods [23, 20] do not consider the future benefits of instance temperature adjustment during the KD process.

Over the past few decades, the field of computer vision has undergone a transformative shift thanks to the remarkable progress of deep neural networks (DNNs). Nonetheless, the significant computation and storage demands of DNNs present great challenges, especially in industrial applications where efficient and lightweight models are preferred. Typically, lightweight networks do not perform as well as deeper networks. To solve this issue, knowledge distillation (KD), which aims to enable smaller (student) models to compete with larger (teacher) models in performance, has been introduced [14]. Due to its remarkable effectiveness in boosting the capability of lightweight models, KD has been widely used in various tasks, e.g., object detection [7, 39], semantic segmentation [24, 36, 41], and natural language processing [30, 11].

KD enhances the student network by transferring knowledge from a higher-capacity teacher network. During the process of KD, the capability of the student network is constantly changing. As a result, the same piece of knowledge (training instance) has varying degrees of value to the student network at different learning stages [19]. Moreover, even within the same learning stage, the difficulty of learning varies between instances. The student network should assign more weight to examples that are difficult to learn [17]. However, most previous KD methods [14, 42, 35, 10] have not simultaneously taken into account the learning difficulty of each training instance and the learning stage it is in.

To address this issue, recent efforts [23, 20] adjust the temperature for each instance to match its respective learning difficulty. This is because temperature, as a critical hyperparameter in KD, modulates the smoothness of the predictive distribution and sets the difficulty of the KD process. However, as illustrated in Fig. 1, the previous methods [23, 20] do not take into account the continuous nature of KD. When adjusting the instance temperature, it is important to consider the future benefits of this operation. Our key insight is to formulate the instance temperature adjustment in the KD process as a sequential decision-making task, with the adjustment of instance temperature treated as the action in this task. For this sequential decision-making task, we set the reward as the performance improvement of the student network between two learning stages. Our goal is to maximize cumulative rewards, that is, to maximize the enhancement of the student network’s performance over the course of the task. To achieve this goal, we propose a novel method based on reinforcement learning (RL), termed RLKD.

In the proposed RLKD, we employ an agent network to determine the instance temperature. To aid the agent in making prudent decisions regarding instance temperature, we devise a comprehensive state representation which encompasses two performance features and an uncertainty feature. The two performance features respectively represent the teacher and student networks’ performance on each instance, while the uncertainty feature is used to measure the student network’s mastery over the instance. This uncertainty feature design is inspired by uncertainty-based sampling in active learning [27, 4]. Nevertheless, we discover that applying the RL framework directly to instance temperature adjustment in KD can cause a significant delayed reward issue.

Given the setting of our reward (performance improvement of the student network) and the setup of KD, we calculate the reward at the end of training for each batch. Since a batch typically contains many training instances, often 32, the agent receives a reward only after the 32nd action. This poses the challenge of delayed rewards, causing difficulties in credit assignment [16, 13]. We explore an instance reward calibration method to handle this challenge, building upon the refinement of reward decomposition [3]. Due to the absence of ground truth for the instance temperature, we adopt an online training mode to update the agent’s policy. However, in the initial phase of training, the agent may engage in random exploration within a vast and inefficient action space [2]. To address this issue, we design an efficient exploration strategy that guides the agent to learn on high-quality training instances during the early phase of training. This strategy aims to expedite the agent’s learning process, enabling it to quickly acquire a valuable instance temperature adjustment policy.

In summary, our main contributions are: 1) In KD, to account for the future benefits of adjusting instance temperature at the current stage, we formulate the instance temperature adjustment as a sequential decision-making task, and propose a novel RL-based method, RLKD, to handle this task. 2) To overcome the challenge of delayed rewards in our RLKD, we exploit a mechanism for instance reward calibration. Furthermore, we design an efficient exploration strategy that helps the agent learn a valuable temperature adjustment policy quickly. 3) Our RLKD can serve as a plug-and-play technique to boost the performance of KD algorithms. We validate its effectiveness on three benchmarks for image classification and object detection, obtaining state-of-the-art results on all of them.

2 Related Work

KD, as a model compression method, traces its origin back to [14]. In the KD process, the KL-divergence loss between teacher and student model predictions is minimized using a key hyperparameter known as “temperature”. As stated in [14, 6, 23, 20], temperature adjusts the smoothness of the prediction distribution and effectively sets the difficulty of the KD process. Because the student model’s learning capacity varies at different stages [19], some works [23, 20] adjust the temperature dynamically based on the current learning stage to help the student network learn better from the teacher network.
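To make the role of temperature concrete, the following is a minimal sketch, not the authors' implementation, of a temperature-softened softmax and the KL divergence between teacher and student predictions. The function names and example logits are assumptions chosen for illustration, and the $T^2$ scaling factor used in many KD implementations is omitted for brevity.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; a larger temperature gives a smoother output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a 3-class problem (illustrative values only).
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.0, 1.5, 0.5]

sharp = softmax_with_temperature(teacher_logits, 1.0)   # peaked distribution
smooth = softmax_with_temperature(teacher_logits, 4.0)  # flatter distribution
loss = kd_kl_loss(teacher_logits, student_logits, 4.0)
```

Raising the temperature flattens the teacher distribution, which is the smoothness effect described above.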

MKD [23] learns a dynamically varying temperature via meta-learning [33] as the KD process progresses, but it is primarily designed for scenarios involving vision transformers [9] and strong data augmentation. The limitations of MKD preclude its effective application to temperature adjustment within the majority of KD methods, and a previous study [20] has confirmed that directly applying MKD to KD models results in a significant degradation in performance. CTKD [20] utilizes a curriculum learning approach [5] to progressively learn a dynamic temperature parameter, moving from simple to complex scenarios. In particular, CTKD progressively learns two versions of temperature: the global temperature and the instance temperature. However, CTKD does not take into account the future benefits (the performance enhancement of the student network between adjacent learning stages) when adjusting the instance temperature, and it also does not consider the student network’s mastery of the instance. These shortcomings make CTKD’s instance temperature less robust, preventing the student network trained with CTKD from learning knowledge effectively. To overcome these drawbacks, we formulate the instance temperature adjustment in the KD process as a sequential decision-making task, where adjusting instance temperature is considered as an action. Besides, we present a novel state representation, which includes a feature that reflects the student network’s degree of mastery over a training instance.

3 Preliminary on Reinforcement Learning

Figure 2: Overview of our RLKD method. Solid lines represent the processing flow of the training instances in our framework, and dashed lines indicate the backpropagation process used for model (student model and agent) updates. The workflow of our RLKD method is as follows: 1) Given a batch of training instances, we first calculate the state from the outputs of the teacher and student networks, including the teacher’s predicted probabilities, the student’s predicted probabilities, and the student’s uncertainty score. 2) Our agent makes the decision (action) on the temperature for each training instance based on the state. 3) The reward is calculated depending on the action taken by the agent, followed by the instance reward calibration. 4) According to the knowledge value of each training instance, we select the top 10-20% highest-valued instances and the latter 40-50% instances to perform a mix-up operation [38], thereby obtaining high-quality training instances for the efficient exploration strategy.

Reinforcement Learning (RL) involves an agent aiming to gain maximum cumulative rewards through interactions with its environment. Key components include the Agent (the decision-maker), Environment (the context in which the agent operates), Action $A$ (choices made by the agent), State $S$ (the current situation), Reward $R$ (feedback from the environment), and Policy $\pi$ (the agent’s strategy at a specific instance).

In RL, an agent at state $s_t$ decides on action $a_t$ based on its policy, transitions to a new state $s_{t+1}$, and receives reward $r_{t+1}$. The agent’s objective is to maximize its expected cumulative reward, taking future gains into account rather than focusing only on the immediate reward. This is captured by the equation:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \tag{1}$$

Here, $\gamma$ is the discount factor, lying between 0 and 1. It weights the importance of future rewards: values close to 1 emphasize long-term rewards, while values near 0 emphasize immediate rewards.
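As a small worked illustration of Eq. (1), the sketch below computes the discounted return for a finite list of future rewards. Truncating the infinite sum to a finite horizon and the name `discounted_return` are assumptions made for the example.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], i.e. Eq. (1) truncated to a finite horizon.

    rewards[0] plays the role of r_{t+1}, rewards[1] of r_{t+2}, and so on.
    """
    g = 0.0
    # Accumulate backwards so each step applies one extra factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

future_rewards = [1.0, 1.0, 1.0]
far_sighted = discounted_return(future_rewards, 1.0)  # 3.0: all rewards count equally
balanced = discounted_return(future_rewards, 0.5)     # 1.75 = 1 + 0.5 + 0.25
myopic = discounted_return(future_rewards, 0.0)       # 1.0: only the immediate reward
```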

The Value Function $V(s)$ for a policy $\pi$ predicts the return from state $s$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] \tag{2}$$

The Q-Function $Q(s,a)$ for a state-action pair $(s,a)$ with policy $\pi$ is:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right], \tag{3}$$

where $\mathbb{E}_{\pi}$ represents the expectation under the policy $\pi$. The value function satisfies the Bellman Equation:

$$V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]. \tag{4}$$
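Eq. (4) can be turned into an iterative update. The sketch below applies it repeatedly to a toy two-state problem until the values settle. The toy MDP, the dictionary layout, and the name `policy_evaluation` are all assumptions made for illustration and are unrelated to the KD environment used later.

```python
def policy_evaluation(policy, transitions, gamma, sweeps=500):
    """Iterate the Bellman equation for a fixed policy.

    policy[s][a]      = pi(a | s)
    transitions[s][a] = list of (probability, next_state, reward) triples
    """
    values = {s: 0.0 for s in policy}
    for _ in range(sweeps):
        values = {
            s: sum(
                pi_a * sum(p * (r + gamma * values[s2])
                           for p, s2, r in transitions[s][a])
                for a, pi_a in policy[s].items()
            )
            for s in policy
        }
    return values

# Toy MDP: from "A" the agent either stays (reward 0) or moves to "B" (reward 1);
# "B" is absorbing with reward 0.
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 0.0)]},
}
policy = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 1.0}}
values = policy_evaluation(policy, transitions, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: $V(B) = 0$ and $V(A) = 0.5 / (1 - 0.5 \cdot 0.9)$.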

Our proposed RLKD method is based on the Proximal Policy Optimization (PPO) [28] framework. PPO, stemming from the policy gradient technique, tackles RL’s issues of stability and efficiency. The actor in PPO optimizes the policy based on the feedback from the critic. Standard policy gradient techniques might make large policy updates that lead to erratic results. In contrast, PPO restrains the updates by using a clipped function.

The PPO objective is defined as:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_{t}\left[ \min\left( r_t(\theta) \hat{A}_t, \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right] \tag{5}$$

where $r_t(\theta)$ is the ratio of the action’s probability under the current policy to its probability under the old policy, and $\hat{A}_t$ is the estimate of the advantage function at time $t$, which the critic provides as feedback. The clip function keeps the ratio within the interval $[1-\epsilon, 1+\epsilon]$.
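The clipped objective in Eq. (5) can be sketched in a few lines of plain Python. The default $\epsilon = 0.2$ is an assumed value for illustration (the text does not state the setting used in RLKD), and the empirical expectation is taken as a simple mean over samples.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)

# Positive advantage: a ratio above 1 + eps earns no extra credit.
capped_gain = ppo_clip_objective([1.5], [1.0])
# Negative advantage: the min keeps the more pessimistic (clipped) term.
capped_loss = ppo_clip_objective([0.5], [-1.0])
```

In both cases the objective stops rewarding movement of the ratio beyond the clipping interval, which is what restrains the size of each policy update.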

PPO’s moderate updates ensure consistent learning, making it suitable for our work, especially where consistent learning is needed across varied settings.

4 Method

The previous KD methods [23, 20] attempt to adjust the temperature to improve the student network’s knowledge acquisition, but they overlook the continuous nature of KD. When adjusting the temperature, they only consider the benefits in the current stage, neglecting the potential rewards of temperature adjustments in future learning stages. To address this issue, we treat the adjustment of instance temperature during the KD process as a sequential decision-making task, where the temperature adjustment for each instance is considered as the action within the task. Based on this insight, we propose the RL-based RLKD method (see Fig. 2) with a novel state representation (described in Sec. 4.1), allowing us to take into account the future rewards of temperature adjustment on training instances at the current stage. In our RLKD method, the reward is designed to measure the improvement in the student network’s performance; thus, we calculate the reward during the parameter update of the student network. According to the KD setup, the student network updates its parameters after training on each batch of data (typically comprising 32 training instances), which means that we can only compute the reward once after every 32 actions. This leads to a significant delayed reward issue. To solve this problem, we design an instance reward calibration scheme (described in Sec. 4.2). Furthermore, we formulate a strategy for efficient exploration, enabling the agent to rapidly learn an effective temperature adjustment policy (described in Sec. 4.3).

4.1 Instance Temperature Adjustment as a Sequential Decision-making Task

In this work, we aim to learn a policy that directly maximizes the performance of the student network, driven by the maximization of our designed reward. To achieve this goal, we formulate the instance temperature adjustment in the KD process as a sequential decision-making task: $\left(s_t, a_t, r_{t+1}, s_{t+1}\right)$. Specifically, the process includes the following steps: 1) Estimate the state $s_t$ based on the performance of the teacher and student networks on the current training instance. 2) Given the current state $s_t$ and informed by prior experience, the agent evaluates each state-action pair and executes the action $a_t$ of temperature adjustment for each training instance. 3) After the agent performs the optimized action $a_t$, the environment transitions to a subsequent state $s_{t+1}$ and provides a reward $r_{t+1}$ to the agent. 4) The agent updates its policy based on the received reward $r_{t+1}$ and the newly observed state $s_{t+1}$.

We utilize the PPO framework [28] to model this process. Subsequently, we provide a detailed introduction to the definitions of state $s_t$, action $a_t$, and reward $r_t$.

State. The state $s_t$ serves as input for the agent, providing critical support for the agent in making the instance temperature decision. The design of the state should align with the needs of the instance temperature decision-making policy. Intuitively, when the policy makes a temperature decision for a training instance $x$, it needs to consider the performance of both the teacher and the student networks on this instance. Moreover, due to the varying difficulty of the knowledge embodied in each training instance, the student network’s mastery over each instance differs [34, 17]. The state $s_t$ should also include a measure of the student network’s grasp on that particular instance.

Based on these intuitions, given an instance $x$, we collect cues from three aspects to form the state $s_t$: the performance of the teacher network, the performance of the student network, and the extent to which the student network has mastered the instance. In particular, the teacher network’s prediction $p_t$ for instance $x$ is expressed as:

$$p_t = \operatorname{argmax}_{i \in [k]} f_{teacher}(x)_i, \tag{6}$$

where $k$ represents the total number of categories, and $f_{teacher}$ denotes the teacher network. We use the probability $f_{teacher}(x)_{p_t}$ associated with the teacher network’s prediction $p_t$ for the instance $x$ to measure the performance of the teacher network. Similarly, to measure the performance of the student network, we use the probability $f_{student}(x)_{p_s}$ associated with the student network’s prediction $p_s$ for that instance $x$. To assess the mastery level of a student network over the instance $x$, we draw inspiration from uncertainty-based sampling in active learning [27, 4], and determine the mastery level by measuring the uncertainty score $\upsilon_{student}(x)$ in the student network’s prediction distribution for the instance. The uncertainty score for the student network with respect to instance $x$ is calculated according to:

$$\upsilon_{student}(x) = 1 - \left( f_{student}(x)_{p_s} - \max_{i \in [k] \backslash p_s} f_{student}(x)_i \right). \tag{7}$$

Our uncertainty score $\upsilon_{student}(x)$ is positively correlated with the degree of uncertainty exhibited by the student network towards instance $x$. If the student network has a good grasp of the instance, it is very confident in its prediction, resulting in a prediction distribution with a single high-probability value. Conversely, if the mastery is poor, the student network exhibits uncertainty in its prediction, leading to multiple high-probability values that are close to each other.

In summary, for a given instance $x$, we define our state $s_t$ as $(f_{teacher}(x)_{p_t}, f_{student}(x)_{p_s}, \upsilon_{student}(x))$, encompassing the predicted probabilities from the teacher network, the predicted probabilities from the student network, and the uncertainty score of the student network.
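The three state features follow directly from Eqs. (6) and (7). The sketch below assumes the teacher and student outputs are already probability vectors over $k \geq 2$ classes. The helper name `build_state` and the example probabilities are hypothetical.

```python
def build_state(teacher_probs, student_probs):
    """Return (teacher top prob, student top prob, student uncertainty score)."""
    # Eq. (6): index of the most probable class, for teacher and student alike.
    p_t = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    p_s = max(range(len(student_probs)), key=student_probs.__getitem__)

    top = student_probs[p_s]
    # Largest probability among all classes except the predicted one.
    runner_up = max(p for i, p in enumerate(student_probs) if i != p_s)
    # Eq. (7): a small margin between the top two classes means high uncertainty.
    uncertainty = 1.0 - (top - runner_up)
    return (teacher_probs[p_t], top, uncertainty)

# Illustrative probabilities for a 3-class problem.
confident_state = build_state([0.7, 0.2, 0.1], [0.9, 0.05, 0.05])
hesitant_state = build_state([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
```

With these inputs the hesitant student, whose top two classes are nearly tied, receives the larger uncertainty score.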

Action. Our action is the decision on the instance temperature $\mathcal{T}$. To overcome the limitations of exploration in a discrete action space, we opt to explore the instance temperature $\mathcal{T}$ in a continuous action space. Below, we elaborate on how to obtain the instance temperature $\mathcal{T}$.

Upon receiving the state $s_t$ for the instance $x$, we use $s_t$ as input to the actor network within the PPO framework. To better explore various actions of the actor network and to smooth its learning process, we design our actions to follow a Gaussian distribution $\mathcal{N}\left(\mu, \sigma^{2}\right)$. Thus our actor outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian distribution. To boost the flexibility and randomness of action exploration, we randomly sample a value from the Gaussian distribution to serve as our instance temperature $\mathcal{T}$. Finally, based on our experience that almost all instance temperatures vary within the range of 0 to 10, we restrict the temperature to this range with the following formula:

$$\mathcal{T} = 10 \cdot \operatorname{sigmoid}(\mathcal{T}), \tag{8}$$

where $\operatorname{sigmoid}$ refers to the sigmoid activation function.
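The sampling-and-squashing step can be sketched as follows. Here `mu` and `sigma` stand in for the actor network's outputs (the actor itself is not modeled), `sigma` is interpreted as a standard deviation, and the function names are assumptions for illustration.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sample_temperature(mu, sigma, rng):
    """Draw a raw action from N(mu, sigma^2) and map it into (0, 10) as in Eq. (8)."""
    raw = rng.gauss(mu, sigma)
    return 10.0 * sigmoid(raw)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
temperatures = [sample_temperature(0.0, 1.0, rng) for _ in range(1000)]
```

Because the sigmoid maps any real number into $(0, 1)$, every sampled temperature lies strictly between 0 and 10 regardless of the actor's outputs.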

Reward. The reward function is a critical component of our framework, providing feedback regarding the quality of the agent’s action, thereby assisting the agent in refining its action policy. The action of our agent is to select an appropriate instance temperature $\mathcal{T}$ based on the state $s_t$ of the instance $x$, which can facilitate knowledge acquisition by the student network, aiming to maximize the performance of the student network as much as possible. To achieve this objective, we follow the KD setting and take the improvement of the student network’s performance between two consecutive batches as the reward. Moreover, a common characteristic in deep learning is that the student network’s performance improves significantly during the initial stages of training, which may produce disproportionately large reward values. However, these large values do not necessarily reflect astute action choices by the agent. To mitigate the impact of this phenomenon, we progressively increase the reward size during the early training stages. The formula for the reward is defined as:

$$reward = \operatorname{sigmoid}(\mathcal{E}/n) \cdot reward. \tag{9}$$

Herein, $\mathcal{E}$ represents the current epoch number, and $n$ is a hyperparameter denoting the number of initial epochs during which the reward incrementally grows.
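A literal reading of Eq. (9) can be sketched as below. The name `scaled_reward`, the raw reward value, and the choice $n = 5$ are assumptions for illustration, not settings reported in the text.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def scaled_reward(raw_reward, epoch, n):
    """Eq. (9): damp the raw reward by sigmoid(epoch / n)."""
    return sigmoid(epoch / n) * raw_reward

# The scaling factor starts at sigmoid(0) = 0.5 and rises towards 1 as epochs pass.
factors = [scaled_reward(1.0, epoch, n=5) for epoch in range(0, 50, 5)]
```

Under this reading, the same raw improvement counts for about half as much at epoch 0 as it does late in training.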

4.2 Instance Reward Calibration

In our RLKD method, the action is to adjust the temperature for each training instance. To evaluate the quality of a particular action, we should calculate the corresponding reward for that action. However, we are unable to directly obtain the instance reward for each action. This is because our reward is based on the performance improvement of the student network. The student model is trained on batches of instance data and updates its parameters accordingly. The reward can only be computed after the student model updates its parameters. Typically, in KD, the batch size is set to 32, meaning that we have to go through 32 actions before we can receive a reward. This delayed reward characteristic (known as the credit assignment problem [16, 13]) makes it difficult to assess and improve the policy network.

To address this issue, we design a reward corrector $\mathcal{C}$ based on the refinement of the reward decomposition [3]. The reward corrector redistributes the reward $r^{b}$ for the current batch based on the state $s^{b}$ of the current batch and the action $a^{b}$ taken by the agent for each instance, to obtain the corrected reward $r^{b'}$ that corresponds to the action for each instance. The corrected reward $r^{b'}$ is calculated as:

$$r^{b'} = \mathcal{C}(s^{b}, a^{b}, r^{b}). \tag{10}$$

To account for the contribution of each instance’s action to the reward of the entire batch, we introduce an auxiliary task that allows the reward corrector to predict the sequence-wide return G𝐺Gitalic_G at each time step. The loss function for our reward corrector 𝒞𝒞\mathcal{C}caligraphic_C is defined as:

𝒞=α(rnbrb)2+βni=1n(ribGi)2.subscript𝒞𝛼superscriptsubscriptsuperscript𝑟superscript𝑏𝑛superscript𝑟𝑏2𝛽𝑛superscriptsubscript𝑖1𝑛superscriptsubscriptsuperscript𝑟superscript𝑏𝑖subscript𝐺𝑖2\mathcal{L}_{\mathcal{C}}=\alpha\cdot(r^{b^{\prime}}_{n}-r^{b})^{2}+\frac{% \beta}{n}\cdot\sum_{i=1}^{n}(r^{b^{\prime}}_{i}-G_{i})^{2}.caligraphic_L start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT = italic_α ⋅ ( italic_r start_POSTSUPERSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_r start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG italic_β end_ARG start_ARG italic_n end_ARG ⋅ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_r start_POSTSUPERSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (11)

Here, rnbsubscriptsuperscript𝑟superscript𝑏𝑛r^{b^{\prime}}_{n}italic_r start_POSTSUPERSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT represents the n𝑛nitalic_n-th corrected reward, and n𝑛nitalic_n is the batch size. The variables alpha𝑎𝑙𝑝𝑎alphaitalic_a italic_l italic_p italic_h italic_a and beta𝑏𝑒𝑡𝑎betaitalic_b italic_e italic_t italic_a are weights, which we set to 1 and 0.5, respectively. The variable Gisubscript𝐺𝑖G_{i}italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the return at the i𝑖iitalic_i-th time step. Additionally, to ensure that the states sbsuperscript𝑠𝑏s^{b}italic_s start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT recorded in the replay buffer match the corrected rewards, we devise a state updater 𝒰𝒰\mathcal{U}caligraphic_U and update the states sbsuperscript𝑠𝑏s^{b}italic_s start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT accordingly. The updated state sbsuperscript𝑠superscript𝑏s^{b^{\prime}}italic_s start_POSTSUPERSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT is calculated as follows:

sb=𝒰(sb)superscript𝑠superscript𝑏𝒰superscript𝑠𝑏s^{b^{\prime}}=\mathcal{U}(s^{b})italic_s start_POSTSUPERSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT = caligraphic_U ( italic_s start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ) (12)

The loss function for our state updater 𝒰𝒰\mathcal{U}caligraphic_U is defined as:

𝒰=((sb)G)2.subscript𝒰superscriptsuperscript𝑠superscript𝑏𝐺2\mathcal{L}_{\mathcal{U}}=(\mathcal{E}(s^{b^{\prime}})-G)^{2}.caligraphic_L start_POSTSUBSCRIPT caligraphic_U end_POSTSUBSCRIPT = ( caligraphic_E ( italic_s start_POSTSUPERSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) - italic_G ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (13)

Here, \mathcal{E}caligraphic_E refers to an estimator to predict the corresponding return G𝐺Gitalic_G based on the input state.
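As an illustration of Eq. 11, the reward-corrector loss can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `reward_corrector_loss` and its argument names are hypothetical, and since this section does not specify how the per-step returns $G_i$ are computed, they are simply passed in as an array.

```python
import numpy as np


def reward_corrector_loss(r_corrected, r_batch, returns, alpha=1.0, beta=0.5):
    """Sketch of Eq. 11 (hypothetical helper, not the authors' code).

    r_corrected: corrected per-instance rewards r'_1..r'_n (length n)
    r_batch:     scalar batch reward r^b
    returns:     per-step returns G_1..G_n (assumed to be given)
    alpha, beta: weights; the paper sets them to 1 and 0.5
    """
    r_corrected = np.asarray(r_corrected, dtype=float)
    returns = np.asarray(returns, dtype=float)
    if r_corrected.ndim != 1 or r_corrected.shape != returns.shape:
        raise ValueError("r_corrected and returns must be 1-D arrays of equal length")

    n = r_corrected.shape[0]
    # Main term: the last corrected reward should match the batch reward.
    main_term = alpha * (r_corrected[-1] - r_batch) ** 2
    # Auxiliary term: every corrected reward should predict its return G_i.
    aux_term = (beta / n) * np.sum((r_corrected - returns) ** 2)
    return float(main_term + aux_term)


if __name__ == "__main__":
    loss = reward_corrector_loss([0.1, 0.2, 0.4], r_batch=0.5, returns=[0.5, 0.5, 0.5])
    print(f"corrector loss: {loss:.6f}")
```

In a trainable setting, the same expression would be written with an automatic-differentiation framework so that the gradient can flow into the parameters of $\mathcal{C}$.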

4.3 Efficient Exploration

Due to the lack of ground truth for instance temperature, the update process in our RL component is conducted online. In this training setup, it is imperative for the agent within the RL framework to quickly learn an effective temperature adjustment policy. To enable our agent to adjust the temperature for each instance with higher accuracy, we set the action space as a continuous space. This often implies that, in the initial stages of training, the agent may engage in inefficient exploration across a vast action space [2], which is not conducive to rapidly learning a valuable instance temperature adjustment policy. To solve this problem, we propose an efficient exploration strategy: during the early stages of training the RL component, we guide the agent to learn on high-quality training instances, which helps drive the agent towards more effective exploration.

Firstly, we need to define what constitutes high-quality training data in the context of KD. We consider that, in KD, high-quality training samples are those that can provide more knowledge to the student network. Prior work [19] reveals that the predictive entropy of a student network for a training instance can be used to measure the knowledge value of that instance: the higher the predictive entropy, the greater the knowledge value of the instance. The predictive entropy of a student network for a training instance is:

$$H(y\mid x)=-\sum_{c=1}^{C}p(y=c\mid x)\log p(y=c\mid x). \tag{14}$$
Input: agent network $Q$, student network $f_S$, teacher network $f_T$, dataset $D$, batch size $N$, reward corrector $\mathcal{C}$
for batch $d_{t+1}$ in $D$ do
    for $i=0$ to $N-1$ do
        Build state observation $s_i$ from $f_S(d_{t+1}^{i})$, $f_T(d_{t+1}^{i})$
        Compute the action $a_i$ and values $V_i \leftarrow Q(s_i)$
        Use $a_i$ as instance temperature
        Collect $s_i$, $a_i$, $V_i$ as a buffer to update $Q$ (Sec. 4.1)
    end for
    Update the student network $f_S$
    Obtain the reward $r_{t+1}$ following Eq. 9
    Calibrate the instance reward following Eq. 10
    Update the reward $r_{t+1}$ in the replay buffer
    Update the state $s_t$ following Eq. 12 (Sec. 4.2)
    while not done do
        Update $Q$
    end while
end for
Compute prediction entropy $H_t \leftarrow f_S(d_{t+1})$
Obtain high-quality samples following Eq. 16 (Sec. 4.3)
Implement the efficient exploration strategy
Update $Q$
Algorithm 1: Training and usage of RLKD

Here, $H(y\mid x)$ is the predictive entropy for a given instance $x$, $C$ is the number of classes, and $p(y=c\mid x)$ is the predicted probability that instance $x$ belongs to class $c$. To identify the high-quality data, we first compute the prediction entropy $H$ of the student model for all training instances, and then sort these prediction entropies in descending order to form a sequence $S_e$. $S_e$ is defined as:

$$S_{e}=\{I_{1},...,I_{n}\}. \tag{15}$$

Here, $n$ represents the total number of training examples, and $I_n$ represents the training instance at rank $n$. To mitigate the risk of overfitting, we take the training samples ranked from the top 10% to the top 20% of the sequence, denoted $S_{e}^{(10\sim 20)\%}$, as our high-quality training samples for efficient exploration.
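The entropy-based ranking described above (Eqs. 14 and 15 followed by the 10% to 20% selection) can be sketched as follows. This is only an illustrative sketch: the helper names `predictive_entropy` and `select_rank_band` are hypothetical, the student's softmax outputs are assumed to be available as a NumPy array of shape (instances, classes), and the small constant `eps` is added purely to avoid `log(0)`.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Eq. 14 per instance: H(y|x) = -sum_c p(y=c|x) log p(y=c|x)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)


def select_rank_band(entropies, lo=0.10, hi=0.20):
    """Return indices of instances ranked in [lo, hi) by descending entropy.

    With lo=0.10 and hi=0.20 this picks the band between the top 10% and
    the top 20% of the sorted sequence S_e (band edges are floored).
    """
    entropies = np.asarray(entropies, dtype=float)
    order = np.argsort(-entropies)  # indices sorted by descending entropy
    n = order.shape[0]
    return order[int(lo * n):int(hi * n)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(1000, 100))  # toy student logits
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    band = select_rank_band(predictive_entropy(probs))
    print(band.shape)  # 100 of the 1000 toy instances fall in the band
```

Calling `select_rank_band(entropies, 0.40, 0.50)` returns the 40% to 50% band in the same way.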

Secondly, since the student network is typically a small model, to prevent overfitting through our high-quality learning and to enhance the robustness of the student model, we utilize mix-up [38] on our high-quality training samples with the training instances ranked from 40% to 50%, denoted as $S_{e}^{(40\sim 50)\%}$, in the sequence $S_e$. The sequence of training instances $S_{e}^{t}$ after mix-up is calculated as:

$$S_{e}^{t}=\lambda\cdot S_{e}^{(10\sim 20)\%}+(1-\lambda)\cdot S_{e}^{(40\sim 50)\%} \tag{16}$$

The parameter $\lambda$ is set to ensure that the knowledge in $S_{e}^{t}$ predominantly comes from higher-ranked training instances (i.e., higher-quality training instances).
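A minimal sketch of the mix-up in Eq. 16 is given below. Several details are assumptions rather than facts from the paper: the function name `mix_rank_bands` is hypothetical, the two bands are assumed to contain the same number of samples and to be paired element by element, and the default value 0.7 for $\lambda$ is only a placeholder satisfying $\lambda > 0.5$ (this section does not state the value used). How the labels or teacher targets of mixed samples are handled is not covered here.

```python
import numpy as np


def mix_rank_bands(high_band, mid_band, lam=0.7):
    """Eq. 16: S_e^t = lam * S_e^(10~20)% + (1 - lam) * S_e^(40~50)%.

    high_band: samples ranked 10%-20% by entropy, shape (m, ...)
    mid_band:  samples ranked 40%-50% by entropy, same shape
    lam:       mixing weight; 0.7 is a placeholder, not the paper's value
    """
    high_band = np.asarray(high_band, dtype=float)
    mid_band = np.asarray(mid_band, dtype=float)
    if high_band.shape != mid_band.shape:
        raise ValueError("both bands must have the same shape")
    if not 0.5 < lam <= 1.0:
        # lam > 0.5 keeps the higher-ranked band dominant in the mixture.
        raise ValueError("lam must lie in (0.5, 1.0]")
    return lam * high_band + (1.0 - lam) * mid_band


if __name__ == "__main__":
    high = np.ones((4, 3, 32, 32))   # toy CIFAR-sized images
    mid = np.zeros((4, 3, 32, 32))
    mixed = mix_rank_bands(high, mid)
    print(mixed.shape, float(mixed.mean()))
```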

4.4 Training and Usage of RLKD

Given a dataset $D$, a teacher network $f_T$, and a student network $f_S$, our RLKD proceeds as follows. At the beginning, we calculate the instance temperature $\mathcal{T}$ for all training instances in the batch, as described in Sec. 4.1. We record the state $s_t$, action $a_t$, and value $V_t$ of this batch in the replay buffer, which serves as a reference for the agent’s subsequent decision-making. Next, we calibrate the instance reward and update the state, as described in Sec. 4.2. Finally, as described in Sec. 4.3, we select high-quality training samples based on the performance of each training instance during this training stage and execute our efficient exploration strategy using these high-quality training instances. The procedure is depicted in Algorithm 1.
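To make the step "use $a_i$ as instance temperature" concrete, the sketch below computes a distillation loss in which every instance has its own temperature. It is an assumption-laden illustration, not the authors' loss: it uses the standard temperature-scaled KL divergence of Vanilla KD [14], including the customary $\tau^2$ scaling, applied per instance, and the function names are hypothetical. In RLKD the temperatures would be the actions produced by the agent; here they are passed in as a plain array.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def instance_temperature_kd_loss(student_logits, teacher_logits, temperatures):
    """Mean over the batch of tau_i^2 * KL(p_T(tau_i) || p_S(tau_i)).

    student_logits, teacher_logits: arrays of shape (batch, classes)
    temperatures: positive per-instance temperatures, shape (batch,)
    """
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    tau = np.asarray(temperatures, dtype=float).reshape(-1, 1)
    if s.shape != t.shape or tau.shape[0] != s.shape[0]:
        raise ValueError("logit shapes and temperature count must agree")
    if np.any(tau <= 0):
        raise ValueError("temperatures must be positive")

    log_p_s = log_softmax(s / tau)  # softened student log-probabilities
    log_p_t = log_softmax(t / tau)  # softened teacher log-probabilities
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=1)
    return float(np.mean(kl * tau[:, 0] ** 2))


if __name__ == "__main__":
    student = np.array([[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]])
    teacher = np.array([[1.0, 2.0, 0.1], [0.5, 1.5, 0.5]])
    print(instance_temperature_kd_loss(student, teacher, [1.0, 4.0]))
```

The per-instance temperature only rescales each row of logits before the softmax, so the same function reduces to the usual fixed-temperature KD loss when all entries of `temperatures` are equal.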

5 Experiments

| Teacher | RN-56 | RN-110 | RN-110 | WRN-40-2 | WRN-40-2 | VGG-13 | WRN-40-2 | VGG-13 | RN-50 | RN-32×4 | RN-32×4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher Acc | 72.34 | 74.31 | 74.31 | 75.61 | 75.61 | 74.64 | 75.61 | 74.64 | 79.34 | 79.42 | 79.42 |
| Student | RN-20 | RN-32 | RN-20 | WRN-16-2 | WRN-40-1 | VGG-8 | SN-V1 | MN-V2 | MN-V2 | SN-V1 | SN-V2 |
| Student Acc | 69.06 | 71.14 | 69.06 | 73.26 | 71.98 | 70.36 | 70.50 | 64.60 | 64.60 | 70.50 | 71.82 |
| Vanilla KD | 70.66 | 73.08 | 70.66 | 74.92 | 73.54 | 72.98 | 74.83 | 67.37 | 67.35 | 74.07 | 74.45 |
| +CTKD | 71.11 | 73.47 | 71.08 | 75.40 | 73.97 | 73.48 | 75.70 | 68.42 | 68.51 | 74.52 | 75.26 |
| +Ours | 71.40 | 73.81 | 71.44 | 75.79 | 74.17 | 73.75 | 76.01 | 68.73 | 68.75 | 74.84 | 75.55 |

Table 1: Student network top-1 accuracy on CIFAR-100. We compare Vanilla KD with Vanilla KD augmented by instance temperature adjustment using CTKD and our RLKD, respectively.

| | Teacher | Student | Vanilla KD | +CTKD | +Ours | PKT | +CTKD | +Ours | RKD | +CTKD | +Ours | SRRL | +CTKD | +Ours | DKD | +CTKD | +Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Top-1 | 73.96 | 70.26 | 70.83 | 71.28 | 71.39 | 70.92 | 71.31 | 71.53 | 70.94 | 71.13 | 71.37 | 71.01 | 71.25 | 71.38 | 71.13 | 71.47 | 71.62 |
| Top-5 | 91.58 | 89.50 | 90.31 | 90.33 | 90.51 | 90.25 | 90.30 | 90.42 | 90.33 | 90.34 | 90.45 | 90.41 | 90.42 | 90.52 | 90.31 | 90.44 | 90.56 |

Table 2: Top-1 and Top-5 accuracy on ImageNet with ResNet-34 as teacher and ResNet-18 as student.

| Teacher | RN-56 | RN-110 | RN-110 | WRN-40-2 | WRN-40-2 | RN-32×4 | RN-32×4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher Acc | 72.34 | 74.31 | 74.31 | 75.61 | 75.61 | 79.42 | 79.42 |
| Student | RN-20 | RN-32 | RN-20 | WRN-16-2 | WRN-40-1 | SN-V1 | SN-V2 |
| Student Acc | 69.06 | 71.14 | 69.06 | 73.26 | 71.98 | 70.70 | 71.82 |
| PKT | 70.85 | 73.36 | 70.88 | 74.82 | 74.01 | 74.39 | 75.10 |
| +CTKD | 71.13 | 73.49 | 71.07 | 75.34 | 74.11 | 74.63 | 75.52 |
| +Ours | 71.41 | 73.68 | 71.34 | 75.62 | 74.23 | 74.89 | 75.78 |
| SP | 70.84 | 73.09 | 70.74 | 74.88 | 73.77 | 74.97 | 75.59 |
| +CTKD | 71.29 | 73.42 | 71.17 | 75.30 | 73.97 | 75.28 | 75.79 |
| +Ours | 71.65 | 73.70 | 71.51 | 75.61 | 74.22 | 75.31 | 76.04 |
| VID | 70.62 | 73.02 | 70.59 | 74.89 | 73.60 | 74.81 | 75.24 |
| +CTKD | 70.81 | 73.38 | 71.11 | 75.20 | 73.75 | 75.23 | 75.48 |
| +Ours | 71.09 | 73.70 | 71.39 | 75.48 | 74.02 | 75.58 | 75.81 |
| CRD | 71.69 | 73.63 | 71.38 | 75.53 | 74.36 | 75.13 | 75.90 |
| +CTKD | 72.13 | 74.08 | 72.02 | 75.71 | 74.72 | 75.41 | 76.20 |
| +Ours | 72.29 | 74.41 | 72.28 | 76.03 | 74.98 | 75.68 | 76.55 |
| SRRL | 71.13 | 73.48 | 71.09 | 75.69 | 74.18 | 75.36 | 75.90 |
| +CTKD | 71.41 | 73.81 | 71.52 | 75.90 | 74.38 | 75.62 | 75.97 |
| +Ours | 71.61 | 74.02 | 71.81 | 76.23 | 74.64 | 75.90 | 76.06 |
| DKD | 71.43 | 73.66 | 71.28 | 75.70 | 74.54 | 75.44 | 76.48 |
| +CTKD | 71.62 | 73.91 | 71.65 | 75.85 | 74.57 | 75.88 | 76.91 |
| +Ours | 71.89 | 74.27 | 71.91 | 76.02 | 74.90 | 76.02 | 77.21 |

Table 3: Student network Top-1 accuracy on the CIFAR-100 dataset.

For a fair comparison, we follow the experimental settings of CTKD [20] to verify the effectiveness of our RLKD. Experiments are conducted on a variety of well-known neural network architectures, such as VGG [29], ResNet (RN) [12], Wide ResNet (WRN) [37], ShuffleNet (SN) [40], and MobileNet (MN) [15]. We also evaluate RLKD as a plug-and-play technique across various distillation frameworks, including Vanilla KD [14], PKT [25], SP [32], VID [1], CRD [31], SRRL [36], and DKD [42]. Furthermore, we perform ablation studies to validate the effectiveness of our designed state representation, instance reward calibration, efficient exploration strategy, and selection of high-quality training examples.

Tasks and datasets. Following [20], we conduct experiments on two tasks: image classification and object detection. For the image classification task, we carry out experiments on CIFAR-100 [18] and ImageNet [8]. For the object detection task, we conduct our experiments on MS-COCO [21]. CIFAR-100 is a prominent dataset for image classification, comprising 32×32 pixel images across 100 different categories, with a training set of 50,000 images and a validation set of 10,000 images. ImageNet, another significant dataset for large-scale image classification, encompasses 1,000 categories with a training set of approximately 1.28 million images and a validation set of 50,000 images. MS-COCO is a widely used dataset for general object detection that includes 80 categories. It has a training set (train 2017) with 118,000 images and a validation set (val 2017) with 5,000 images.

5.1 Main results

CIFAR-100: image classification. As shown in Tab. 1, we conduct image classification on the CIFAR-100 dataset to demonstrate the generalization performance of our RLKD method across 11 teacher-student pairs, including RN-56 & RN-20, etc. Among them, 5 pairs of teacher and student models (VGG-13 & MN-V2, etc.) have different architectures. This experimental design provides a diverse and comprehensive assessment environment. When the teacher and student networks share the same architecture, the experimental results show that our RLKD method has a strong generalization capacity and outperforms CTKD. Specifically, in the case of RN-110 & RN-20, our method outperforms Vanilla KD by 0.78% (71.44% vs 70.66%) and CTKD by 0.36% (71.44% vs 71.08%). Moreover, when the teacher and student networks have different architectures, the strong generalization capacity of our RLKD is also validated.

To validate the generalization of our RLKD method across different KD frameworks, we conduct experiments on 6 currently leading KD frameworks (see Tab. 3), including DKD, PKT, etc. When applied to the teacher-student pair RN-110 & RN-32, our RLKD brings an improvement of 0.61% (74.27% vs 73.66%) in the DKD framework, which surpasses the accuracy of CTKD by 0.36% (74.27% vs 73.91%). Experiments conducted on the other 5 KD frameworks (e.g., PKT) further confirm the strong generalization of our RLKD. Both the accuracy and stability of the proposed RLKD are superior to those of CTKD, which we attribute to the fact that RLKD considers the future rewards of the instance temperature adjustment operations.

| Method | mAP | AP50 | AP75 | APl | APm | APs |
| --- | --- | --- | --- | --- | --- | --- |
| T: RN-101 | 42.04 | 62.48 | 45.88 | 54.60 | 45.55 | 25.22 |
| S: RN-18 | 33.26 | 53.61 | 35.26 | 43.16 | 35.68 | 18.96 |
| Vanilla KD | 33.97 | 54.66 | 36.62 | 44.14 | 36.67 | 18.71 |
| +CTKD | 34.51 | 55.32 | 36.95 | 44.76 | 37.17 | 19.01 |
| +Ours | 34.73 | 55.61 | 37.19 | 45.27 | 37.30 | 19.12 |
| T: RN-50 | 40.22 | 61.02 | 43.81 | 51.98 | 43.53 | 24.16 |
| S: MN-V2 | 29.47 | 48.87 | 30.90 | 38.86 | 30.77 | 16.33 |
| Vanilla KD | 30.13 | 50.28 | 31.35 | 39.56 | 31.91 | 16.69 |
| +CTKD | 31.21 | 52.12 | 32.01 | 41.11 | 33.44 | 18.09 |
| +Ours | 31.49 | 52.57 | 33.23 | 41.71 | 33.65 | 18.31 |

Table 4: Results of our RLKD on the MS-COCO dataset, utilizing Faster-RCNN [26] with FPN [22]. We conduct experiments with the following teacher-student pairings: RN-101 paired with RN-18, and RN-50 paired with MN-V2.

ImageNet: image classification. To validate the scalability of our method and its applicability in complex scenarios involving large datasets, we further conduct image classification on ImageNet. Tab. 2 details the top-1 and top-5 accuracy. Using CTKD and our RLKD as adaptable plug-in approaches, we incorporate them into 5 current leading distillation frameworks (i.e., KD, PKT, RKD, SRRL, and DKD). The experimental results obtained from these 5 KD frameworks demonstrate the scalability of our method: our RLKD exhibits robust performance on a large dataset like ImageNet. For instance, in the Vanilla KD and SRRL frameworks, our method improves top-5 accuracy by 0.2% (90.51% vs 90.31%) and 0.11% (90.52% vs 90.41%), respectively. In contrast, CTKD obtains much smaller top-5 improvements on these KD frameworks, with gains of just 0.02% (90.33% vs 90.31%) and 0.01% (90.42% vs 90.41%), respectively, about 10 times lower. We think the superior performance of RLKD can be attributed to its RL-based framework for instance temperature adjustment, which considers the future benefits of these adjustments. Additionally, unlike CTKD, our RLKD also takes into account the student model’s grasp of individual instances during instance temperature adjustment.

MS-COCO: object detection. To verify whether our RLKD method is robust across other visual tasks, we perform object detection on the MS-COCO dataset. As shown in Tab. 4, in the case of RN-50 & MN-V2, regarding the mAP metric, our RLKD outperforms Vanilla KD by 1.36% (31.49% vs 30.13%) and CTKD by 0.28% (31.49% vs 31.21%). Additionally, for detecting objects of varying sizes (evaluated by the AP metrics for large (APl), medium (APm), and small (APs) objects), our RLKD also improves over Vanilla KD and consistently surpasses CTKD across all size categories. The results demonstrate the robustness of our approach, where instance temperature adjustment is treated as a sequential decision-making task, enabling consideration of future benefits.

| Teacher | RN-56 | RN-110 | WRN-40-2 | VGG-13 |
| --- | --- | --- | --- | --- |
| Student | RN-20 | RN-32 | WRN-16-2 | VGG-8 |
| Ours w/o US | 71.16 | 73.68 | 75.61 | 73.57 |
| Ours w US | 71.40 | 73.81 | 75.79 | 73.75 |

Table 5: Ablation study of the uncertainty score (US) feature.

| Teacher | RN-56 | RN-110 | WRN-40-2 | VGG-13 |
| --- | --- | --- | --- | --- |
| Student | RN-20 | RN-32 | WRN-16-2 | VGG-8 |
| Ours w/o IRA | 70.91 | 73.26 | 75.39 | 73.32 |
| Ours w IRA | 71.40 | 73.81 | 75.79 | 73.75 |

Table 6: Ablation on instance reward calibration (IRA) strategy.

| Teacher | RN-56 | RN-110 | WRN-40-2 | VGG-13 |
| --- | --- | --- | --- | --- |
| Student | RN-20 | RN-32 | WRN-16-2 | VGG-8 |
| Ours w/o EE | 71.03 | 73.52 | 75.50 | 73.45 |
| Ours w EE | 71.40 | 73.81 | 75.79 | 73.75 |

Table 7: Ablation study of the efficient exploration (EE) strategy.

5.2 Ablation studies

In the ablation studies, we evaluate the performance of the uncertainty score that is included in our state representation, the instance reward calibration scheme, the efficient exploration strategy, and different high-quality training example selection strategies. All experiments are conducted on the CIFAR-100 dataset with respect to the image classification task, and utilize the Vanilla KD framework.

Uncertainty score. We conduct experiments on 4 teacher-student network pairs to test the effectiveness of the uncertainty score in our state representation. As shown in Tab. 5, when the uncertainty score is incorporated into the state representation, our method shows an improvement of 0.24% (71.40% vs 71.16%) for the RN-56 & RN-20 teacher-student pair. This enhancement verifies the effectiveness of our designed uncertainty score, which enables the agent to make wiser decisions by taking into account the student model’s mastery of the training instances.

Instance reward calibration. As shown in Tab. 6, incorporating the instance reward calibration strategy into our RLKD method improves performance across 4 different teacher-student pairs (RN-56 & RN-20, etc.). For example, our instance reward calibration strategy boosts the performance of the RN-110 & RN-32 pair by 0.55% (73.81% vs 73.26%). We believe the effectiveness of the instance reward calibration strategy lies in its ability to enable the agent to more accurately perceive the rewards resulting from each of its instance temperature adjustment actions, thereby enhancing its capacity to update its policy for performing the action.

Efficient exploration. As shown in Tab. 7, we conduct ablation experiments on our efficient exploration strategy across 4 teacher-student pairs. The experimental results demonstrate that our efficient exploration strategy improves the performance of the student model across all 4 teacher-student pairs. In the experiments involving the RN-56 & RN-20 teacher-student pair, our efficient exploration strategy results in a performance improvement of 0.37% (71.40% vs 71.03%). We attribute this success to the fact that the strategy enables the agent to learn a valuable instance temperature adjustment policy faster, allowing the student model to acquire more useful knowledge during the early stages of KD.

| Teacher | Student | 0∼10% | 10∼20% | 10∼20% $\mathcal{M}$ 30∼40% | 10∼20% $\mathcal{M}$ 40∼50% |
| --- | --- | --- | --- | --- | --- |
| 72.34 | 69.06 | 70.92 | 71.21 | 71.27 | 71.40 |
| 75.61 | 73.26 | 75.33 | 75.57 | 75.61 | 75.79 |

Table 8: Comparison of different high-quality training sample selection strategies. The two data rows correspond to the teacher-student pairs RN-56 & RN-20 and WRN-40-2 & WRN-16-2, respectively. “$\mathcal{M}$” denotes the mix-up operation.

Selection of high-quality training examples. As shown in Tab. 8, we conduct experiments on CIFAR-100 to compare different strategies for selecting the high-quality training examples. Interestingly, we observe that when using the top 10% of high-quality training data, the accuracy of the student model in the teacher-student pair RN-56 & RN-20 is 70.92%, which is lower than the 71.21% obtained when using the training data ranked from 10% to 20%. This phenomenon is also observed in the teacher-student pair WRN-40-2 & WRN-16-2. We think this may be because utilizing the top 10% of samples causes overfitting in the agent. Furthermore, in the teacher-student pair RN-56 & RN-20, applying mix-up to the training data ranked from 10% to 20% with the training data ranked from 40% to 50% yields a performance increase of 0.19% (71.40% vs 71.21%). The experimental results verify that our mix-up method, which combines instances of varying knowledge values, can produce high-quality training data.

6 Conclusion

In the current knowledge distillation domain, the methods [20, 23] applied to temperature adjustment neglect the future benefits associated with the adjustment. To address this issue, we approach instance temperature adjustment as a sequential decision-making task and propose a novel method, RLKD. Specifically, we design a comprehensive state representation to enable the agent in our framework to make informed adjustments to the instance temperature. Besides, we explore an instance reward calibration scheme to provide the agent with more accurate reward signals. In addition, we develop an efficient exploration strategy to boost the agent’s capability to learn a valuable temperature adjustment policy quickly. Extensive experiments are conducted on three well-known datasets for the tasks of image classification and object detection, demonstrating the effectiveness of our plug-and-play instance temperature adjustment method RLKD.

References

  • Ahn et al. [2019] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9163–9171, 2019.
  • Amin et al. [2021] Susan Amin, Maziar Gomrokchi, Harsh Satija, Herke van Hoof, and Doina Precup. A survey of exploration methods in reinforcement learning. arXiv preprint arXiv:2109.00157, 2021.
  • Arjona-Medina et al. [2019] Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems, 32, 2019.
  • Balcan et al. [2007] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.
  • Bengio et al. [2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  • Chandrasegaran et al. [2022] Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, and Ngai-Man Cheung. Revisiting label smoothing and knowledge distillation compatibility: What was missing? In International Conference on Machine Learning, pages 2890–2916. PMLR, 2022.
  • Chen et al. [2017] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems, 30, 2017.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Guo et al. [2023] Ziyao Guo, Haonan Yan, Hui Li, and Xiaodong Lin. Class attention transfer based knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11868–11877, 2023.
  • Hahn and Choi [2019] Sangchul Hahn and Heeyoul Choi. Self-knowledge distillation in natural language processing. arXiv preprint arXiv:1908.01851, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hernandez-Leal et al. [2018] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Is multiagent deep reinforcement learning the answer or the question? a brief survey. learning, 21:22, 2018.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Ke et al. [2018] Nan Rosemary Ke, Anirudh Goyal ALIAS PARTH GOYAL, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. Advances in neural information processing systems, 31, 2018.
  • Kim et al. [2021] Kyungyul Kim, ByeongMoon Ji, Doyoung Yoon, and Sangheum Hwang. Self-knowledge distillation with progressive refinement of targets. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6567–6576, 2021.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. [2022] Chenxin Li, Mingbao Lin, Zhiyuan Ding, Nie Lin, Yihong Zhuang, Yue Huang, Xinghao Ding, and Liujuan Cao. Knowledge condensation distillation. In European Conference on Computer Vision, pages 19–35. Springer, 2022.
  • Li et al. [2023] Zheng Li, Xiang Li, Lingfeng Yang, Borui Zhao, Renjie Song, Lei Luo, Jun Li, and Jian Yang. Curriculum temperature for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1504–1512, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • Liu et al. [2022] Jihao Liu, Boxiao Liu, Hongsheng Li, and Yu Liu. Meta knowledge distillation. arXiv preprint arXiv:2202.07940, 2022.
  • Liu et al. [2019] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2604–2613, 2019.
  • Passalis and Tefas [2018] Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 268–284, 2018.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  • Roth and Small [2006] Dan Roth and Kevin Small. Margin-based active learning for structured output spaces. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, pages 413–424. Springer, 2006.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Tang et al. [2019] Raphael Tang, Yao Lu, and Jimmy Lin. Natural language generation for effective knowledge distillation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 202–208, 2019.
  • Tian et al. [2019] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
  • Tung and Mori [2019] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1365–1374, 2019.
  • Vilalta and Drissi [2002] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial intelligence review, 18:77–95, 2002.
  • Xu et al. [2023] Guodong Xu, Ziwei Liu, and Chen Change Loy. Computation-efficient knowledge distillation via uncertainty-aware mixup. Pattern Recognition, 138:109338, 2023.
  • Yang et al. [2022] Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. Cross-image relational knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.
  • Yang et al. [2020] Jing Yang, Brais Martinez, Adrian Bulat, and Georgios Tzimiropoulos. Knowledge distillation via softmax regression representation learning. In International Conference on Learning Representations, 2020.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhang and Ma [2020] Linfeng Zhang and Kaisheng Ma. Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In International Conference on Learning Representations, 2020.
  • Zhang et al. [2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6848–6856, 2018.
  • Zhang et al. [2022] Zhengbo Zhang, Chunluan Zhou, and Zhigang Tu. Distilling inter-class distance for semantic segmentation. arXiv preprint arXiv:2205.03650, 2022.
  • Zhao et al. [2022] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11953–11962, 2022.