
GrOD: Deep Learning with Gradients Orthogonal Decomposition for Knowledge Transfer, Distillation, and Adversarial Training

Published: 08 September 2022
Abstract

    Regularization that incorporates a linear combination of the empirical loss and explicit regularization terms as the loss function has been frequently used for many machine learning tasks. The explicit regularization term is designed in different forms, depending on the application. While regularized learning often boosts performance with higher accuracy and faster convergence, the regularization sometimes hurts the empirical loss minimization and leads to poor performance. To deal with such issues, in this work we propose a novel strategy, namely Gradients Orthogonal Decomposition (GrOD), that improves the training procedure of regularized deep learning. Instead of linearly combining the gradients of the two terms, GrOD re-estimates, through orthogonal decomposition, a new descent direction for each iteration that does not hurt the empirical loss minimization while preserving the regularization effects. We have performed extensive experiments using GrOD to improve the commonly used algorithms of transfer learning [2], knowledge distillation [3], and adversarial learning [4]. The experiment results based on large datasets, including Caltech 256 [5], MIT Indoor 67 [6], CIFAR-10 [7], and ImageNet [8], show significant improvements made by GrOD for all three algorithms in all cases.

    1 Introduction

    Deep learning has been widely used as a major workhorse for a variety of pattern recognition applications, such as image classification [7, 8], face recognition [9, 10], human parsing [11, 12], biomarker identification [13, 14], and spatiotemporal pattern mining [15, 16]. In many real-world practices of deep learning [17], regularization has been frequently used to improve the performance of deep neural network training by incorporating explicit regularization terms beyond empirical loss minimization (see also Chapter 7 of [17]). To regularize the training of deep neural networks, a simple yet effective approach is to build a regularization term that augments the empirical loss through a weighted linear combination. Such weights are typically used to make a tradeoff between empirical loss and model complexity.
    Compared to regularized statistical learning [18], which was originally proposed to avoid over-fitting, regularization nowadays is redesigned in deep learning to enable a wide range of novel learning applications, such as knowledge transfer from pre-trained neural networks [2, 19, 20, 21], knowledge distillation via Teacher–Student training [3, 22], and adversarial learning for robustness [4, 23]. While the use of regularization yields deep learning with better performance and new functionalities, regularized deep learning might sometimes hurt the performance of the deep neural network and achieve even worse results than empirical risk minimization (ERM) [18], especially when the regularizer weight is inappropriately large.

    1.1 Our Observations

    While one can fix the over-regularization issue by lowering the regularizer weight in statistical learning, the problem remains difficult for regularized deep learning [17]. For example, using the starting point as the reference (SPAR), i.e., incorporating an \( L^2 \)-norm regularizer that constrains the distance between the parameters and the starting point of optimization [2], is frequently used to fine-tune deep neural networks for deep transfer learning. Using an inappropriate pre-trained model for SPAR leads to even worse performance than training from scratch [24, 25], as the \( L^2 \)-norm regularization centered at the starting point of optimization affects which local minimum the learning procedure finally converges to, and the selection of a poor local minimum may significantly hurt the generalization performance of deep learning.
    We illustrate the above observation with an example based on \( L^2 \)-SP [2], shown in Figure 1. The Black Line refers to the empirical loss descent flow of common gradient-based learning algorithms with pre-trained weights as the starting point. It shows that, with the empirical loss gradient as the descent direction, such a method quickly converges to a local minimum in a narrow cone, which is usually considered an over-fitting solution. Meanwhile, the Blue Line demonstrates a possible empirical loss descent path of the \( L^2 \)-SP algorithm, where a strong regularization prevents the learning algorithm from continuing to lower the empirical loss while traversing the area around the pre-trained weights. An ideal case is illustrated by the Red Line, where the \( L^2 \)-SP regularizer helps the learning algorithm avoid over-fitting solutions: the overall descent direction, adapting the \( L^2 \)-SP regularizer with respect to the empirical loss, leads to generalizable solutions. A technique that keeps both the empirical loss and the regularizer descending, balancing the two terms during the deep learning procedure, is thus needed to boost the performance of deep transfer learning.
    Fig. 1.
    Fig. 1. Flows of descent directions on the empirical loss. Black Line: the flow via the gradient descent direction of the empirical loss (i.e., the gradient of the empirical loss) from a common starting point, where the descent direction quickly leads the learning procedure to converge to a local minimum without considering the regularization term and hence may over-fit; Blue Line: the flow via the descent direction linearly combining the gradients of the empirical loss and the regularization term, where the regularization term diminishes the minimization of the empirical loss; and Red Line: the flow via the descent direction balancing the gradients of the empirical loss and the regularization term, where the descent direction leads to a flat area with low empirical loss (i.e., potential for improved generalizability).

    1.2 Our Contributions

    Motivated by the above observations, we propose a simple yet effective training paradigm, namely Gradients Orthogonal Decomposition (GrOD), which provides a new descent direction estimator for the regularized learning of over-parameterized deep neural networks. GrOD follows a simple ERM-preserving descent direction principle: in every iteration of the learning procedure, the empirical loss of regularized deep learning should descend as fast as it would under ERM alone. Specifically, GrOD decomposes the gradient of the regularizer term and removes the part that opposes the empirical loss gradient. With the remaining parts combined, the regularizer and empirical loss terms are expected to be minimized simultaneously, while the minimization of the empirical loss is preferred in GrOD.
    Inspired by the observation that a regularizer may hurt the model’s fit by preventing the empirical loss from descending, we propose GrOD, a novel regularized deep learning framework that achieves a better balance between the gradients of the loss function and the regularization term, i.e., better fitness to the training data without simply lowering the weight of the regularization term. During the learning procedure, when the angle between the empirical loss gradient and the regularizer gradient is obtuse (larger than \( 90^\circ \)), GrOD decomposes the regularizer gradient into two components: a hurting part (parallel to the empirical loss gradient) and a safely regularized part (orthogonal to the empirical loss gradient); it discards the hurting part and preserves the safely regularized part. In this way, GrOD preserves the regularization effects without preventing the empirical loss from descending.
    In terms of methodology, the most relevant work to our study is Gradient Episodic Memory (GEM) for continual learning [26], which continuously learns new tasks using the well-trained models of past tasks. In terms of objectives, GrOD aims at preventing the regularization from hurting ERM, while GEM prevents ERM from hurting the regularization effects (i.e., the accuracy on old tasks). In terms of algorithms, in every iteration of learning, GEM estimates the descent direction with respect to the gradients of the new task and all past tasks using a time-consuming Quadratic Program (QP), while GrOD re-estimates the descent direction from the gradients of the regularizer term and the empirical loss term with a low-complexity orthogonal decomposition. All in all, GEM can be considered a special case of GrOD using the \( L^2 \)-SP regularizer [2] based on two tasks.
    Extensive experiments have been performed using state-of-the-art algorithms for deep transfer learning [2], knowledge distillation [3], and adversarial learning [4], on top of a wide range of deep learning benchmark datasets including Caltech 256, MIT Indoor 67, CIFAR-10, Fashion MNIST, and ImageNet. The experiments show that GrOD consistently improves the performance of the three tasks. Specifically, for transfer learning tasks, GrOD improves the \( L^2 \)-SP algorithm [2] with 0.1%–7% higher accuracy (even when transferring from a network pre-trained on an inappropriate dataset). Through knowledge distillation [3], the network trained by GrOD outperforms the vanilla one with 0.5%–5% higher accuracy by aligning the generated feature maps. For the adversarial learning task, GrOD enhances the state-of-the-art algorithm [4] with a significant Pareto improvement in both accuracy and robustness. Besides, the gradient direction analysis based on the experiments verifies our assumptions about the descent direction’s behavior during the neural network’s training process.

    2 Related Work and Backgrounds

    In this section, we first introduce the preliminary setting of regularized deep learning, then introduce the regularization terms for deep transfer learning, knowledge distillation, and adversarial learning that are used in our studies.

    2.1 Regularized Deep Learning

    Deep convolutional networks usually consist of a great number of parameters that need to fit the dataset. For example, ResNet-110 has more than one million free parameters. Such a large number of free parameters creates a risk of over-fitting. Regularized deep learning is a technique to reduce this risk by constraining the parameters within a limited space with respect to a set of regularization terms. The general learning problem is usually formulated as follows.
    Definition 1 (Regularized Deep Learning).
    We first denote the dataset for the desired task as \( \mathbf {D}=\lbrace (\mathbf {x}_1,y_1),(\mathbf {x}_2,y_2),(\mathbf {x}_3,y_3),\ldots ,(\mathbf {x}_n,y_n)\rbrace \) , where n tuples are offered in total and each tuple \( (\mathbf {x}_i,y_i) \) refers to an input image and its label in the dataset, \( \mathbf {x} \in \mathbb {R}^D \) , \( y \in \lbrace 1, 2, \ldots , N\rbrace \) for multi-class classification, and D is the dimensionality of the input data. We then let \( \mathbf {\omega }\in \mathbb {R}^{d} \) be the d-dimensional parameter vector containing all d parameters of the training model. Furthermore, given a regularization term \( \Omega (\mathbf {\omega }):\mathbb {R}^{d}\rightarrow \mathbb {R} \) , one estimates the parameters of the target network through the regularized deep learning paradigm. The optimization objective of regularized deep learning is to obtain the minimizer of \( \mathcal {L}(\mathbf {\omega }): \)
    \( \begin{equation} \underset{\mathbf {\omega }}{\mathrm{min}}\ \mathcal {L}(\mathbf {\omega })=\left\lbrace \frac{1}{n}\sum _{i=1}^n L(z(\mathbf {x}_{i}, \mathbf {\omega }), y_{i}) + \lambda \cdot \Omega (\mathbf {\omega })\right\rbrace , \end{equation} \)
    (1)
    where (i) the first term \( \frac{1}{n}\sum _{i=1}^n L(z(\mathbf {x}_{i}, \mathbf {\omega }), y_{i}) \) refers to the empirical loss of data fitting, while (ii) the second term \( \Omega (\mathbf {\omega }) \) characterizes the effects for transfer learning, knowledge distillation, adversarial learning, and so on. z maps \( \mathbb {R}^D \times \mathbb {R}^d \) to \( \lbrace 1, 2, \ldots , N\rbrace \) . The tuning parameter \( \lambda \gt 0 \) balances the tradeoff between the empirical loss and the regularizer.
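    To make Equation (1) concrete, the following is a minimal PyTorch sketch of the regularized objective; the function and argument names (e.g., regularized_loss, omega) are our own illustrations and not from any released code.
```python
import torch
import torch.nn.functional as F

def regularized_loss(model, x, y, omega, lam):
    """Eq. (1): empirical loss plus a weighted regularization term.

    omega: any callable mapping the model to a scalar regularizer value
    lam:   the tradeoff weight lambda
    """
    logits = model(x)                       # z(x_i, w) over a mini-batch
    empirical = F.cross_entropy(logits, y)  # (1/n) sum_i L(z(x_i, w), y_i)
    return empirical + lam * omega(model)
```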

    2.2 Transfer Learning

    When the training dataset is relatively small, we often need to transfer knowledge learned from large datasets to small tasks [27, 28, 29, 30, 31]. Given the weights of a deep neural network pre-trained on a large dataset (e.g., ImageNet), a recent work [2] proposed to first use the pre-trained weights as the starting point of the training procedure, and then leverage the squared Euclidean distance from the training weights to the pre-trained weights as the regularization term for deep transfer learning. Such an approach helps the training procedure find a generalizable solution with higher accuracy, even on a small training set.
    In terms of regularization, given the weights (denoted as \( \mathbf {\omega }_s \) ) of a neural network pre-trained on a large dataset, the \( L^2 \) -SP algorithm [2] uses the squared Euclidean distance from \( \mathbf {\omega } \) to the pre-trained weights \( \mathbf {\omega }_s \) of the source network (listed in Equation (2)) to constrain the learning procedure, where
    \( \begin{equation} \Omega (\mathbf {\omega }) = \left\Vert \mathbf {\omega } - \mathbf {\omega }_{s}\right\Vert _2^2. \end{equation} \)
    (2)
    In terms of the optimization procedure, \( L^2 \) -SP makes the learning procedure start from the pre-trained weights (i.e., using \( \mathbf {\omega }_s \) to initialize the learning procedure).
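    A minimal sketch of the \( L^2 \) -SP regularizer of Equation (2) might look as follows; source_params, a snapshot of the pre-trained weights \( \mathbf {\omega }_s \) taken before fine-tuning, is a hypothetical helper of ours, not the authors' released code.
```python
import torch

def l2_sp_regularizer(model, source_params):
    """Eq. (2): squared Euclidean distance between the current weights
    and the pre-trained starting point omega_s."""
    reg = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        reg = reg + (p - source_params[name]).pow(2).sum()
    return reg

# source_params would be snapshotted once, before fine-tuning begins:
# source_params = {k: v.detach().clone() for k, v in model.named_parameters()}
```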
    In addition to the above regularization, other methods have been used for deep transfer learning [19, 32, 33, 34, 35, 36]. As early as 2014, the authors of [32] reported significant performance improvement from directly reusing the weights of a pre-trained source network for the target task when training a large CNN with a tremendous number of filters and parameters. However, when reusing all pre-trained weights, the target network might be overloaded with learning inappropriate features (that cannot be used for classification in the target task), while the key features of the target task are probably ignored. Yosinski et al. [37] therefore proposed to understand whether a feature can be transferred to the target network by quantifying the “transferability” of features from each layer in terms of the performance gain. Furthermore, Huh et al. [19] performed an empirical study analyzing how features that a CNN learned from the ImageNet dataset transfer to other computer vision tasks, so as to detail the factors affecting deep transfer learning accuracy. More recently, this line of research has been further developed with an increasing number of algorithms and tools that improve the performance of deep transfer learning, including subset selection [33, 38], sparse transfer [34], filter distribution constraining [35], parameter transfer [36], and transfer learning over manifolds [39]. Moreover, [29] studies the memorability of images using transfer learning, while the authors of [30, 40] work on knowledge transfer across modalities. Overall surveys on transfer learning can be found in [25, 41].

    2.3 Knowledge Distillation

    To achieve similar goals, instead of adopting the weights directly, the authors of [3] propose the so-called knowledge distillation mechanism: given a pre-trained network as the teacher network, the training objective is treated as a student network that learns from the teacher. More specifically, the squared Euclidean distance between the feature maps generated by the convolutional layers of the teacher and student networks is used as the regularization [22]. The feature-wise knowledge distillation algorithm proposed in [3] enables effective knowledge transfer through learning the behaviors of the pre-trained network, as a gift of knowledge distillation. A similar mechanism is also used for neural network compression [42], using the original network as the teacher and the compression target model as the student with feature map quantization.
    Given the training dataset \( \lbrace (\mathbf {x}_1,y_1),\ldots ,(\mathbf {x}_n,y_n)\rbrace \) and N filters in the teacher/student networks, the knowledge distillation algorithm [3] models the regularization as the aggregation of squared Euclidean distances between the feature maps output by the N filters of the teacher/student networks, such that
    \( \begin{equation} \Omega (\mathbf {\omega })=\frac{1}{n} \sum _{j=1}^N\sum _{i=1}^n\left\Vert \mathbf {F}_{j}(\mathbf {\omega }, \mathbf {x}_i) - \mathbf {F}_{j}(\mathbf {\omega _s}, \mathbf {x}_i)\right\Vert _2^2, \end{equation} \)
    (3)
    where \( \mathbf {F}_j(\mathbf {\omega },\mathbf {x}_i) \) refers to the feature map output by the jth filter ( \( 1\le j\le N \) ) of the target network based on weights \( \mathbf {\omega } \) using input image \( \mathbf {x}_i \) ( \( 1\le i\le n \) ).
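    As an illustration, a sketch of the regularizer in Equation (3) could be written as follows, assuming the per-layer feature maps of both networks have already been collected (e.g., via forward hooks, which we omit); the function name is ours.
```python
import torch

def distillation_regularizer(student_feats, teacher_feats):
    """Eq. (3): averaged squared Euclidean distance between the feature
    maps F_j(w, x_i) of the student and F_j(w_s, x_i) of the teacher.

    Both arguments are lists of tensors, one per monitored filter/layer,
    each of shape (batch, ...).
    """
    reg = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        # sum over feature dimensions, mean over the batch (the 1/n factor)
        reg = reg + (fs - ft.detach()).pow(2).flatten(1).sum(dim=1).mean()
    return reg
```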
    In terms of methodologies, knowledge distillation was originally proposed to compress deep neural networks [22, 31, 43] through teacher–student network training, where the teacher and student networks are usually based on the same task [22]. In terms of inductive transfer learning, the authors of [3] were the first to investigate the possibility of using the distance between intermediate results (e.g., feature maps generated by the same layers) of the source and target networks as the regularization term. Furthermore, [44] proposed to use the distance between activation maps as the regularization term for so-called “attention transfer”. Note that in our experiment settings, we mainly focus on the application of knowledge distillation to knowledge transfer, i.e., the source model is pre-trained on other datasets.

    2.4 Adversarial Learning

    In addition to accuracy improvement, it is frequently necessary to enhance the robustness of deep learning under adversarial attacks [45, 46]. With a strategy that perturbs the training data to generate adversarial samples, [46] proposed to incorporate the training loss based on the adversarial samples generated via the Fast Gradient Sign Method (FGSM) as a regularization term to augment the loss for deep adversarial learning. Reference [4] indicated that, instead of FGSM, using adversarial examples generated by Projected Gradient Descent (PGD, [47]) on the negative loss function yields a more robust model.
    Given the training dataset \( \lbrace (\mathbf {x}_1,y_1),\ldots ,(\mathbf {x}_n,y_n)\rbrace \) , one state-of-the-art adversarial learning algorithm [4], studied in this article, first synthesizes the adversarial sample set \( \lbrace (\mathbf {x}^{\prime }_1,y_1),\ldots ,(\mathbf {x}^{\prime }_n,y_n)\rbrace \) through perturbation. The algorithm then uses the empirical loss based on the adversarial samples as the objective function to minimize, where
    \( \begin{equation} \mathcal {L}(\mathbf {\omega })=\frac{1}{n}\sum _{i=1}^n L(z(\mathbf {x}^{\prime }_{i}, \mathbf {\omega }), y_{i}). \end{equation} \)
    (4)
    Using a first-order Taylor expansion, we can approximate (4) in the regularized form
    \( \begin{equation} \begin{split} \mathcal {L}(\mathbf {\omega }) &= \frac{1}{n}\sum _{i=1}^n L(z(\mathbf {x}^{\prime }_{i},\mathbf {\omega }), y_{i})\\ &\approx \ \frac{1}{n}\sum _{i=1}^n \left[L(z(\mathbf {x}_{i},\mathbf {\omega }), y_{i}) + (\mathbf {x}^{\prime }_{i} -\mathbf {x}_{i})\frac{\partial }{\partial x}L(z(\mathbf {x}_{i},\mathbf {\omega }), y_{i})\right]\\ &=\ \frac{1}{n}\sum _{i=1}^n L(z(\mathbf {x}_{i},\mathbf {\omega }), y_{i}) \\ &\quad \ +\ \left\lbrace \frac{1}{n}\sum _{i=1}^n (\mathbf {x}^{\prime }_{i} -\mathbf {x}_{i})\frac{\partial }{\partial x}L(z(\mathbf {x}_{i},\mathbf {\omega }), y_{i})\right\rbrace \\ &=\ \frac{1}{n}\sum _{i=1}^n L(z(\mathbf {x}_{i}, \mathbf {\omega }), y_{i}) + \lambda \cdot \Omega (\mathbf {\omega }). \end{split} \end{equation} \)
    (5)
    Thus, the regularization part of adversarial training is \( \Omega (\mathbf {\omega }) = \tfrac{1}{n}\sum _{i=1}^n (\mathbf {x}^{\prime }_{i} -\mathbf {x}_{i})\tfrac{\partial }{\partial x}L(z(\mathbf {x}_{i},\mathbf {\omega }), y_{i}) \) with \( \lambda =1 \) . Specifically, a model pre-trained on the original dataset is frequently required as the target network for defense, where the gradients and/or Hessian matrices of the loss function are used to perturb the input space of the training data, with optional noise added to the labels. One can also generate the perturbations for adversarial learning under black-box/derivative-free settings [48, 49]. In addition to the empirical loss over the perturbed set, knowledge distillation over feature maps can also be adopted for defense [50, 51]. More definitions and details can be found in a comprehensive survey [52].
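    As a sketch, the adversarial empirical loss of Equation (4) with PGD-generated perturbations [4, 47] could be implemented as follows; the radius eps, step size alpha, and step count are illustrative assumptions of ours, not the settings used in the paper's experiments.
```python
import torch
import torch.nn.functional as F

def pgd_adversarial_loss(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Empirical loss of Eq. (4) evaluated on PGD-perturbed inputs."""
    # random start inside the L-inf ball of radius eps
    x_adv = x.detach() + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                # ascend the loss
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)  # project back
    return F.cross_entropy(model(x_adv), y)
```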

    2.5 Discussion on the Connection to Our Work

    Compared to the above work and other transfer learning studies, our work aims at providing a generic descent direction estimation strategy that improves the performance of regularization-based deep transfer learning. The intuition of GrOD is, in each iteration of the learning procedure, to re-estimate a new descent direction that preserves the effect of the regularizers while not hurting the reduction/minimization of the empirical loss. In our work, we demonstrate the capacity of GrOD working with two recent deep transfer learning regularizers— \( L^2 \) -SP [2] and knowledge distillation [3]—which are based on two typical deep learning philosophies (i.e., constraining weights and feature maps, respectively), using a wide range of transfer learning tasks. The consistent performance boosts from GrOD in all cases of our experiments suggest that GrOD can improve the above regularization-based deep transfer learning with higher accuracy.
    Other techniques, including continual learning [20, 21] and attention mechanisms for CNN models [44, 53, 54, 55], can also improve the performance of knowledge transfer between tasks, and we believe our work makes complementary contributions in this area. Furthermore, compared to the earlier version of this manuscript [1], we have significantly extended the previous work, which primarily focused on deep transfer learning, to improve regularized deep learning in general. New regularized deep learning applications, such as knowledge distillation and adversarial learning, are studied here. This manuscript includes our most recent efforts on improving deep transfer learning, adversarial learning, and network distillation with GrOD, from both theoretical and empirical aspects. Additional experiments with new results are provided to demonstrate our new findings.

    3 GrOD: Gradient Orthogonal Decomposition

    In this section, we formalize the technical details of our research and then present the design of our solution, GrOD.

    3.1 Definitions, Intuitions, and Assumptions

    Prior to presenting the algorithms, this section introduces the settings of the problem.
    Definition 2 (Descent Directions).
    Gradient-based learning algorithms are frequently used in regularized deep learning to minimize the loss function listed in Equation (1). In each iteration of the learning procedure, the algorithms estimate a descent direction \( \mathbf {d}(\mathbf {\omega }) \) , such as a stochastic gradient, at the current parameters \( \mathbf {\omega } \) that approximates the gradient, such that
    \( \begin{equation} \begin{aligned}\mathbf {d}(\mathbf {\omega })&\approx \nabla \mathcal {L}(\mathbf {\omega })\\ & = \frac{1}{n}\sum _{i=1}^n \nabla L(z(\mathbf {x}_{i}, \mathbf {\omega }), y_{i}) + \lambda \nabla \Omega (\mathbf {\omega })\\ & = \nabla J(\mathbf {\omega }) +\lambda \cdot \nabla \Omega (\mathbf {\omega }), \end{aligned} \end{equation} \)
    (6)
    where \( \nabla J(\mathbf {\omega })= \frac{1}{n}\sum _{i=1}^n \nabla L(z(\mathbf {x}_{i}, \mathbf {\omega }), y_{i}) \) refers to the gradient of the empirical loss on the training set and \( \nabla \Omega (\mathbf {\omega }) \) is the gradient of the regularization term, both evaluated at the parameters \( \mathbf {\omega } \) .
    Following the above definition, we reduce our research problem to finding a new descent direction based on the gradients of the empirical loss and the regularizer term. The new descent direction is expected to preserve the effects of regularization while avoiding hurting the empirical loss minimization. Due to the effect of the regularization \( \Omega (\mathbf {\omega }) \) , the angle between the actual descent direction \( \mathbf {d}(\mathbf {\omega }) \) and the gradient of the empirical loss \( \nabla J(\mathbf {\omega }) \) , i.e., \( \measuredangle (\mathbf {d}(\mathbf {\omega }),\nabla J(\mathbf {\omega })) \) , can be large. It is intuitive that when \( \measuredangle (\mathbf {d}(\mathbf {\omega }),\nabla J(\mathbf {\omega })) \) is large, the descent direction cannot effectively lower the empirical loss, causing a potential performance bottleneck for deep transfer learning. We thus formulate the technical problem with the following assumptions.
    Assumption 1 (Effective Empirical Loss Minimization).
    It is reasonable to assume that the actual descent direction \( \mathbf {\widehat{d}}(\mathbf {\omega }) \) having a smaller angle with the gradient of empirical loss, i.e., a smaller \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }),\nabla J(\mathbf {\omega })) \) , can lower the empirical loss more efficiently.
    Assumption 2 (Regularization Effect Preservation).
    It is also reasonable to assume that an actual descent direction \( \mathbf {\widehat{d}}(\mathbf {\omega }) \) having a smaller angle with the gradient of the regularizer term, i.e., a smaller \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }),\nabla \Omega (\mathbf {\omega })) \) , could strengthen the effects of regularization for deep learning.
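    Both assumptions involve angles between gradient directions; a small helper for measuring such angles in practice (our own illustration, not part of the paper's released code) is sketched below.
```python
import torch

def gradient_angle(g1, g2):
    """Angle in degrees between two flattened gradient vectors."""
    cos = torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))
```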

    3.2 Problem Formulation

    Based on the above definitions and assumptions, we propose a new descent direction algorithm: every iteration of the algorithm re-estimates a new descent direction \( \mathbf {\widehat{d}} \) to effectively lower the training loss at the current parameters \( \mathbf {\omega } \) (Assumption 1) while preserving the effect of the regularizer \( \Omega (\mathbf {\omega }) \) (Assumption 2). Note that, to avoid the use of any threshold bounding the two angles between \( \mathbf {\widehat{d}} \) and \( \nabla J(\omega) \) and between \( \mathbf {\widehat{d}} \) and \( \nabla \Omega (\omega) \) , we formulate the research problem as follows.

    3.2.1 ERM-Effective Descent Direction.

    We formulate the research problem as finding an ERM-effective descent direction, as follows.
    Definition 3 (ERM-effective Descent Direction).
    We define the ERM-effective descent direction \( \mathbf {d}(\omega) \) as a direction that is derived from the overall loss gradient \( \nabla \mathcal {L}(\omega) \) and descends the empirical loss \( J(\omega) \) at least as fast as the gradient of the empirical loss \( \nabla J(\omega) \) , such that
    \( \begin{equation} \begin{aligned}\mathbf {d}(\omega) = & \arg \min _{\mathbf {d}}\ \left\Vert \mathbf {d} - \nabla \mathcal {L}(\omega)\right\Vert ^2_2 & s.t. \quad J(\omega - \epsilon \mathbf {d}) \le J(\omega - \epsilon \nabla J(\omega)), \ \end{aligned} \end{equation} \)
    (7)
    where \( \epsilon \) denotes the learning rate.
    Such an ERM-effective descent direction \( \mathbf {d}(\omega) \) can be estimated by solving the constrained optimization problem (7). Intuitively, problem (7) aims at finding the descent direction that is closest to the overall loss gradient and lowers the empirical loss by no less than using the empirical loss gradient as the descent direction in the iteration at \( \omega \) .

    3.2.2 Low-Complexity ERM-Effective Descent Direction via Relaxed Constraint Programming.

    While the proposed descent direction straightforwardly meets our assumptions, the computational complexity of solving the constrained program is high. Thus, we relax the constrained programming problem through a first-order approximation.
    Assumption 3 (Relaxation with First-order Taylor Approximation).
    For simplicity, we assume the loss function enjoys a tight first-order approximation based on Taylor expansion, such that with \( \Vert \Delta \Vert _2 \) close to zero, \( J(\omega +\Delta) = J(\omega)\ + \langle \nabla J(\omega),\Delta \rangle + O(\Vert \Delta \Vert _2^2), \) where \( \langle \cdot ,\cdot \rangle \) denotes the inner product. Thus, with a vanishing learning rate \( \epsilon \) and any descent direction \( \mathbf {d} \) , we have \( J(\omega -\epsilon \mathbf {d})\approx J(\omega) - \epsilon \langle \mathbf {d},\nabla J(\omega)\rangle \) .
    Definition 4 (ERM-Effective Descent Direction via Relaxed Constraint Programming).
    Based on the above assumption, we can rewrite problem (7) as the following relaxed constraint programming problem.
    \( \begin{equation} \begin{aligned}\mathbf {d}(\omega) = & \arg \min _{\mathbf {d}}\ \left\Vert \mathbf {d} - \nabla \mathcal {L}(\omega)\right\Vert ^2_2 & s.t. \quad \langle \mathbf {d}, \nabla J(\omega) \rangle \ge \left\Vert \nabla J(\omega)\right\Vert _2^2. \end{aligned} \end{equation} \)
    (8)
    In this work, we solve problem (8) by an orthogonal decomposition of the regularization gradient \( \nabla \Omega (\omega) \) .

    3.3 GrOD: Descent Direction Estimation via Orthogonal Decompositions

    In this section, we present the design of GrOD as a descent direction estimator that solves the relaxed constraint programming problem addressed in Section 3.2.2. Given the empirical loss function \( J(\omega) \) , the regularization term \( \Omega (\mathbf {\omega }) \) , the set of training data \( \mathbf {D}=\lbrace (\mathbf {x}_1,y_1),(\mathbf {x}_2,y_2),\ldots ,(\mathbf {x}_n,y_n)\rbrace \) , the mini-batch size b, and the regularization coefficient \( \lambda \) , we propose to use Algorithm 1 to estimate the descent direction at the point \( \omega _t \) in the tth iteration of regularized deep learning.
    With such a descent direction estimator, the learning algorithm can replace the original stochastic gradient estimators used in stochastic gradient descent (SGD), Momentum, and/or Adam for deep learning. Specifically, in each (e.g., the tth) iteration of the learning procedure, GrOD estimates the gradients of the empirical loss and the regularization term (i.e., \( \nabla \widehat{J}_t \) and \( \nabla \widehat{\Omega }_t \) ) separately, and proceeds as follows.
    Acute Angle: When the angle between the gradients of the empirical loss and the regularization term is acute, i.e., \( \measuredangle (\nabla \widehat{J}_t,\nabla \widehat{\Omega }_t)\le 90^\circ \) , GrOD uses the original stochastic gradient as the descent direction (Line 8 in Algorithm 1). In this case, we believe the regularization does not over-penalize the empirical loss minimization procedure.
    Obtuse Angle: On the other hand, when the angle is obtuse, GrOD decomposes the gradient of the regularization term \( \nabla \widehat{\Omega }_t \) into two orthogonal components, where the first is orthogonal to the gradient of the empirical loss while the second is parallel to it (i.e., \( \tfrac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \nabla \hat{J}_t \) ). GrOD truncates the component opposing the gradient of the empirical loss, keeping \( \nabla \hat{\Omega }_t - \tfrac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \nabla \hat{J}_t \) , and then combines this orthogonal component with the gradient of the empirical loss as the actual descent direction (Line 10 of Algorithm 1).
    Note that the complexity of GrOD’s descent direction estimation is low. Given the two gradient vectors \( \nabla \widehat{J}_t \) and \( \nabla \widehat{\Omega }_t \) , GrOD uses Line 10 of Algorithm 1 to estimate the descent direction, which involves an inner product of two vectors, scalar-vector products, and a few vector additions/subtractions. Thus, the computational complexity of Line 10 of GrOD for descent direction estimation is \( O(d) \) , where d refers to the number of dimensions.
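    Since the pseudocode of Algorithm 1 is only referenced above, we include a minimal PyTorch reconstruction of its two cases based on the description in this section; the function and variable names are ours, and the flattening/unflattening of per-parameter gradients is omitted.
```python
import torch

def grod_direction(grad_j, grad_omega, lam):
    """GrOD descent direction for one iteration (our reconstruction of
    the two cases of Algorithm 1 from the description above).

    grad_j:     flattened empirical loss gradient  (nabla J_t)
    grad_omega: flattened regularizer gradient     (nabla Omega_t)
    lam:        regularizer weight lambda
    """
    dot = torch.dot(grad_j, grad_omega)
    if dot >= 0:
        # Acute angle: keep the vanilla combined gradient (Line 8).
        return grad_j + lam * grad_omega
    # Obtuse angle: remove the component of grad_omega that opposes
    # grad_j, keeping only the orthogonal "safe" part (Line 10).
    omega_parallel = (dot / grad_j.pow(2).sum()) * grad_j
    return grad_j + lam * (grad_omega - omega_parallel)
```
    The returned vector can be unflattened and written back into each parameter's .grad buffer before optimizer.step(), which is how such a direction estimator composes with SGD, Momentum, or Adam without modifying the optimizer itself.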
    To understand the theoretical properties of GrOD, please refer to Section 3.4 for our analysis. For empirical validation, please refer to Section 4.5.2, where the experiment results validate the effect of the gradient orthogonal decomposition to the regularized deep learning.

    3.4 Understanding the Effects of GrOD for Regularization

    Based on the algorithm introduced in Algorithm 1, we can state the following lemmas.
    Lemma 1 (Acute Angle with Gain).
    In the tth iteration of GrOD, given (1) a positive regularizer weight \( \lambda \gt 0 \) , (2) the empirical loss gradient \( \nabla \hat{J}_t \) , (3) the regularizer’s gradient \( \nabla \widehat{\Omega }_t \) , and (4) the actual descent direction \( \mathbf {\widehat{d}}_t \) computed by GrOD, the angle between the actual descent direction \( \mathbf {\widehat{d}}_t \) and the empirical loss gradient \( \nabla \hat{J}_t \) is acute, and the inner product of \( \mathbf {\widehat{d}}_t \) and \( \nabla \hat{J}_t \) is no smaller than the squared norm of \( \nabla \hat{J}_t \) , such that
    \( \begin{equation} \langle \mathbf {\widehat{d}_t},\nabla \hat{J}_t\rangle \ge \left\Vert \nabla \hat{J}_t\right\Vert _2^2. \end{equation} \)
    (9)
    The above lemma can be proved as follows.
    Proof.
    We prove the lemma in two cases.
    When \( \measuredangle (\nabla \widehat{\Omega }_t,\nabla \hat{J}_t)\le 90^\circ \) (i.e., \( \langle \nabla \widehat{\Omega }_t,\nabla \hat{J}_t\rangle \ge 0 \) ), then \( \mathbf {\widehat{d}}_t=\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t \) and \( \langle \mathbf {\widehat{d}}_t,\nabla \hat{J}_t\rangle = \langle \nabla \hat{J}_t,\nabla \hat{J}_t\rangle +\ \lambda \cdot \langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle \ge \Vert \nabla \hat{J}_t\Vert _2^2\ge 0 \) .
    Otherwise, when \( \measuredangle (\nabla \widehat{\Omega }_t,\nabla \hat{J}_t)\gt 90^\circ \) (i.e., \( \langle \nabla \widehat{\Omega }_t,\nabla \hat{J}_t\rangle \lt 0 \) ), then \( \mathbf {\widehat{d}}_t= \nabla \hat{J}_t + \lambda \cdot [\nabla \hat{\Omega }_t - \tfrac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \nabla \hat{J}_t] \) and \( \langle \mathbf {\widehat{d}}_t,\nabla \hat{J}_t\rangle = \langle \nabla \hat{J}_t,\nabla \hat{J}_t\rangle +\ \lambda \cdot [\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle -\ \tfrac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \langle \nabla \hat{J}_t,\nabla \hat{J}_t\rangle ]= \Vert \nabla \hat{J}_t\Vert _2^2\ge 0 \) .
    In both cases, we have \( \langle \mathbf {\widehat{d}}_t,\nabla \hat{J}_t\rangle \ge \Vert \nabla \hat{J}_t\Vert _2^2\ge 0 \) and thus \( \measuredangle (\mathbf {\widehat{d}}_t,\nabla \hat{J}_t)\le 90^\circ \) (an acute angle).□
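    As a quick numerical sanity check of inequality (9) (our own verification script, reusing the grod_direction sketch from Section 3.3), one can draw random gradient pairs and test the bound directly:
```python
import torch

# verify <d_t, grad_j> >= ||grad_j||^2 over random gradient pairs
torch.manual_seed(0)
for _ in range(1000):
    g_j = torch.randn(64)
    g_o = torch.randn(64)
    d = grod_direction(g_j, g_o, lam=0.5)
    assert torch.dot(d, g_j) >= g_j.pow(2).sum() - 1e-5
```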
    Lemma 2 (Strengthened Descent Direction).
    We have \( \Vert \mathbf {\widehat{d}}_t\Vert _2^2\ge \Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t\Vert _2^2 \) , i.e., the norm of the GrOD descent direction is no shorter than that of the original loss gradient.
    Proof.
    When \( \measuredangle (\nabla \widehat{\Omega }_t,\nabla \hat{J}_t)\le 90^\circ \) (i.e., \( \langle \nabla \widehat{\Omega }_t,\nabla \hat{J}_t\rangle \ge 0 \) ), then \( \mathbf {\widehat{d}}_t=\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t \) . Thus \( \Vert \mathbf {\widehat{d}}_t\Vert _2^2=\Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t\Vert _2^2 \) .
    Otherwise, when \( \measuredangle (\nabla \widehat{\Omega }_t,\nabla \hat{J}_t)\gt 90^\circ \) (i.e., \( \langle \nabla \widehat{\Omega }_t,\nabla \hat{J}_t\rangle \lt 0 \) ), then \( \mathbf {\widehat{d}}_t= \nabla \hat{J}_t + \lambda \cdot [\nabla \hat{\Omega }_t - \tfrac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \nabla \hat{J}_t] \) . Let us decompose \( \nabla \widehat{\Omega }_t \) into two orthogonal vectors \( \nabla \widehat{\Omega }_x=\tfrac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\nabla \hat{J}_t \) and \( \nabla \widehat{\Omega }_y=\nabla \widehat{\Omega }_t-\tfrac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\nabla \hat{J}_t \) , along the direction of \( \nabla \hat{J}_t \) and orthogonal to it, respectively. Then, we have
    \( \begin{equation} \begin{aligned}& \Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t\Vert _2^2\\ = & \Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_x+\lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2,\ \text{considering the orthogonal directions}\\ = & \Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_x\Vert _2^2+\Vert \lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2\\ = & \left\Vert \left(1+\lambda \frac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\right)\cdot \nabla \hat{J}_t\right\Vert _2^2+\Vert \lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2,\ \text{as}\ \langle \nabla \widehat{\Omega }_t,\nabla \hat{J}_t\rangle \lt 0\\ \le & \Vert \nabla \hat{J}_t\Vert _2^2+\Vert \lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2,\ \text{considering the orthogonal directions}\\ =&\Vert \mathbf {\widehat{d}}_t\Vert _2^2. \end{aligned} \end{equation} \)
    (10)
    Finally, we can obtain our theoretical result as follows.
    Proposition 1 (The GrOD Descent Direction is an ERM-Effective Descent Direction via Relaxed Constraint Programming).
    In every tth iteration, given a positive regularizer weight \( \lambda \) , the empirical loss gradient \( \nabla \hat{J}_t \) , and the regularizer’s gradient \( \nabla \widehat{\Omega }_t \) , the actual descent direction \( \mathbf {\widehat{d}}_t \) computed by GrOD is a solution of problem (8) in Definition 4.
    Proof.
    Lemma 1 shows that the GrOD descent direction \( \mathbf {\widehat{d}}_t \) satisfies \( \langle \mathbf {\widehat{d}}_t,\nabla \hat{J}_t\rangle \ge \Vert \nabla \hat{J}_t\Vert _2^2 \) , i.e., the constraint of problem (8) in Definition 4. Thus, we reduce our proof to testing whether \( \mathbf {\widehat{d}}_t \) is a minimizer of \( \Vert \mathbf {d}-(\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t)\Vert _2^2 \) among all vectors satisfying the constraint. We test this in the following two cases.
    When \( \measuredangle (\nabla \widehat{\Omega }_t,\nabla \hat{J}_t)\le 90^\circ \) , then \( \mathbf {\widehat{d}}_t=\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t \) (as Line 8 in Algorithm 1). Thus, \( \Vert \mathbf {\widehat{d}}_t-(\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t)\Vert _2^2=0 \) (minimal) in this case.
    Otherwise, when \( \measuredangle (\nabla \widehat{\Omega }_t,\nabla \hat{J}_t)\gt 90^\circ \) , then \( \mathbf {\widehat{d}}_t= \nabla \hat{J}_t + \lambda \cdot [\nabla \hat{\Omega }_t - \tfrac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \nabla \hat{J}_t] \) (as Line 10 in Algorithm 1). Let us decompose \( \nabla \widehat{\Omega }_t \) into two orthogonal vectors \( \nabla \widehat{\Omega }_x=\tfrac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\nabla \hat{J}_t \) and \( \nabla \widehat{\Omega }_y=\nabla \widehat{\Omega }_t-\tfrac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\nabla \hat{J}_t \) , along the direction of \( \nabla \hat{J}_t \) and orthogonal to it, respectively; thus,
    \( \begin{equation} \left\Vert \mathbf {\widehat{d}}_t-(\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t)\right\Vert _2^2=\left\Vert \lambda \cdot \frac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\left\Vert \nabla \hat{J}_t \right\Vert _2^2} \cdot \nabla \hat{J}_t\right\Vert _2^2 =\lambda ^2\Vert \nabla \widehat{\Omega }_x\Vert _2^2. \end{equation} \)
    (11)
    Meanwhile, we can obtain the following inequality.
    \( \begin{equation} \begin{aligned}&\ \Vert \mathbf {\widehat{d}}_t-(\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t)\Vert _2^2,\ \text{by Lemma 2 and the triangle inequality}\\ & \ge \ \Vert \mathbf {\widehat{d}}_t\Vert _2^2 -\Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_t\Vert _2^2\\ & =\ \Vert \mathbf {\widehat{d}}_t\Vert _2^2 -\Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_x+\lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2,\ \text{as $(\nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_x)\perp \nabla \widehat{\Omega }_y$}\\ & =\ \Vert \mathbf {\widehat{d}}_t\Vert _2^2 -(\Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_x\Vert _2^2+\Vert \lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2),\ \text{as $\frac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\lt 0$ in this case}\\ & =\ \Vert \mathbf {\widehat{d}}_t\Vert _2^2 -(\Vert \nabla \hat{J}_t\Vert _2^2-\Vert \lambda \cdot \nabla \widehat{\Omega }_x\Vert _2^2+\Vert \lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2),\ \text{as $\nabla \hat{J}_t\perp \nabla \widehat{\Omega }_y$}\\ & =\ \Vert \mathbf {\widehat{d}}_t\Vert _2^2 -(\Vert \nabla \hat{J}_t+\lambda \cdot \nabla \widehat{\Omega }_y\Vert _2^2-\Vert \lambda \cdot \nabla \widehat{\Omega }_x\Vert _2^2)\\ & =\ \lambda ^2\Vert \nabla \widehat{\Omega }_x\Vert _2^2. \end{aligned} \end{equation} \)
    (12)
    Considering (11) and (12), we can conclude that \( \mathbf {\widehat{d}}_t \) is the solution of problem (8), with \( \lambda ^2\Vert \nabla \widehat{\Omega }_x\Vert _2^2=\min _{\mathbf {d}}\ \lbrace \Vert \mathbf {d} - \nabla \mathcal {L}(\omega)\Vert ^2_2 \ \text{s.t.}\ \langle \mathbf {d}, \nabla J(\omega) \rangle \ge \Vert \nabla J(\omega)\Vert _2^2\rbrace \) .□
    In this way, we can conclude that GrOD yields the solution desired in problem (8). Furthermore, from the perspective of descent directions, we also find that the behavior of GrOD is not achievable by tuning the weight of the regularizer alone.
    Proposition 2 (GrOD is Not Achievable by Tuning the Weight of the Regularizer).
    In the tth iteration of GrOD, given (1) any two positive regularizer weights \( \lambda _1,\ \lambda _2\gt 0 \) , for GrOD and the vanilla loss of regularized deep learning respectively, (2) the empirical loss gradient \( \nabla \hat{J}_t \) with \( \Vert \nabla \hat{J}_t\Vert _2^2\gt 0 \) , and (3) the regularizer’s gradient \( \nabla \widehat{\Omega }_t \) with \( \Vert \nabla \widehat{\Omega }_t\Vert _2^2\gt 0 \) , we denote the GrOD descent direction based on \( \lambda _1 \) as \( \mathbf {\widehat{d}}_t(\lambda _1) \) and the vanilla regularized loss gradient as \( \nabla \hat{J}_t+\lambda _2\cdot \nabla \widehat{\Omega }_t \) . We argue that, when \( \measuredangle (\nabla \hat{J}_t,\nabla \widehat{\Omega }_t)\gt 90^\circ \) , for any positive weights \( \lambda _1 \) and \( \lambda _2 \) , we have
    \( \begin{equation} \mathbf {\widehat{d}}_t(\lambda _1)\ne \nabla \hat{J}_t+\lambda _2\cdot \nabla \widehat{\Omega }_t. \end{equation} \)
    (13)
    In this way, the descent direction of GrOD is not achievable by the vanilla gradient of the regularized deep learning loss for any pair of regularizer weights.
    Proof.
    Let us decompose \( \nabla \widehat{\Omega }_t \) into two orthogonal vectors \( \nabla \widehat{\Omega }_x=\tfrac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\nabla \hat{J}_t \) and \( \nabla \widehat{\Omega }_y=\nabla \widehat{\Omega }_t-\tfrac{\langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle }{\Vert \nabla \hat{J}_t\Vert _2^2}\nabla \hat{J}_t \) , along the direction of \( \nabla \hat{J}_t \) and orthogonal to it, respectively. Since \( \measuredangle (\nabla \hat{J}_t,\nabla \widehat{\Omega }_t)\gt 90^\circ \) , \( \Vert \nabla \hat{J}_t\Vert _2^2\gt 0 \) , and \( \Vert \nabla \widehat{\Omega }_t\Vert _2^2\gt 0 \) , we have \( \langle \nabla \hat{J}_t,\nabla \widehat{\Omega }_t\rangle \lt 0 \) and \( \Vert \nabla \widehat{\Omega }_x\Vert _2^2\gt 0 \) . Given any two positive weights \( \lambda _1,\ \lambda _2\gt 0 \) , we can obtain the following inequality.
    \( \begin{equation} \begin{aligned}&\left\Vert \mathbf {\widehat{d}}_t(\lambda _1)-(\nabla \hat{J}_t+\lambda _2\cdot \nabla \widehat{\Omega }_t)\right\Vert _2^2,\ \text{Consider $\measuredangle (\nabla \hat{J}_t,\nabla \widehat{\Omega }_t)\gt 90^\circ $}\\ =& \left\Vert \lambda _1\cdot \left(\nabla \hat{\Omega }_t - \frac{\langle \nabla \hat{J}_t, \nabla \hat{\Omega }_t \rangle }{\Vert \nabla \hat{J}_t \Vert _2^2} \cdot \nabla \hat{J}_t\right)-\lambda _2\cdot \nabla \widehat{\Omega }_t\right\Vert _2^2\\ = & \left\Vert (\lambda _1-\lambda _2)\cdot (\nabla \hat{\Omega }_x+\nabla \hat{\Omega }_y) - \lambda _1\cdot \nabla \widehat{\Omega }_x\right\Vert _2^2\\ =&\left\Vert (\lambda _1-\lambda _2)\cdot \nabla \widehat{\Omega }_y-\lambda _2\cdot \nabla \widehat{\Omega }_x\right\Vert _2^2,\ \text{Consider $\nabla \widehat{\Omega }_x\perp \nabla \widehat{\Omega }_y$}\\ =&\left\Vert (\lambda _1-\lambda _2)\cdot \nabla \widehat{\Omega }_y\right\Vert _2^2+\Vert \lambda _2\cdot \nabla \widehat{\Omega }_x\Vert _2^2\\ \gt & 0. \end{aligned} \end{equation} \)
    (14)
    In this way, we can conclude that \( \mathbf {\widehat{d}}_t(\lambda _1)\ne \nabla \hat{J}_t+\lambda _2\cdot \nabla \widehat{\Omega }_t \) for all \( \lambda _1,\ \lambda _2\gt 0 \) , when \( \measuredangle (\nabla \hat{J}_t,\nabla \widehat{\Omega }_t)\gt 90^\circ \) , \( \Vert \nabla \hat{J}_t\Vert _2^2\gt 0 \) , and \( \Vert \nabla \widehat{\Omega }_t\Vert _2^2\gt 0 \) .□
    To interpret the theoretical results, we use an example to visualize our intuition. Figure 2 illustrates an example of GrOD descent direction estimation when the angle between the gradients of the empirical loss and the regularization term is obtuse ( \( \gt \!\!90^\circ \) ). As shown in Figure 2(a), the regularization term forms a direction that might slow down the empirical loss descent. As shown in Figure 2(b), GrOD decomposes the gradient of the regularization term and truncates the conflicting component to estimate the actual descent direction. The angle between the actual descent direction and the regularization gradient and the angle between the actual descent direction and the empirical loss gradient are both acute ( \( \le \!\!90^\circ \) ), so as to secure the regularization effect while ensuring empirical loss descent. In this way, we can understand GrOD as the solution to the relaxed constraint programming problem for ERM-preserved descent direction estimation addressed in Section 3.2. Furthermore, in this example, due to the truncated component of the regularizer’s gradient, the GrOD descent direction cannot be achieved by any linear combination of the ERM loss gradient and the regularizer’s gradient.
    Fig. 2.
    Fig. 2. Example of GrOD descent direction estimation.

    4 Experiment

    In this section, we report our experiment results for GrOD with three types of regularized deep learning paradigms, i.e., \( L^2 \) -SP for transfer learning [2], knowledge distillation [3], and adversarial training [46].

    4.1 Dataset and Experiment Setups

    In the transfer learning and knowledge distillation experiments, we used ResNet-18 [56] as our base model, with three widely used source datasets, ImageNet [8], Places 365 [57], and Stanford Dogs 120 [58], for weight pre-training. To evaluate the performance of transfer learning, we selected four target datasets: Caltech 256 [5], MIT Indoors 67 [6], Flowers 102 [59], and CIFAR-10 [7]. Note that we follow the settings used in [2] for the Caltech 256 setup, where 30 or 60 samples are randomly drawn from each category for training, with 20 remaining samples per category for testing. In the adversarial training experiments, we used a small model consisting of two CNN layers and two fully connected layers with Fashion MNIST [60], and ResNet-18 with CIFAR-10. Table 1 presents basic statistics on all the datasets used in the experiments.
    Table 1.
    Datasets | Domains | # Train/Test
    ImageNet | Visual objects | 1,419K+/100K
    Place 365 | Indoor scenes | 10,000K+
    Stanford Dog 120 | Dogs | 12K/8.5K
    CIFAR-10 | Visual objects | 50K/10K
    Caltech 256 | Visual objects | 30K+
    MIT Indoors 67 | Indoor scenes | 5K+/1K+
    Flowers 102 | Flowers | 1K+/6K+
    Fashion-MNIST | Clothes | 50K/60K
    Table 1. Statistics on Datasets

    4.1.1 Source/Target Tasks Pairing.

    The above configuration leads to 15 source/target task pairs, and regularization would hurt the performance of transfer learning in some of these cases. For example, the image contents of ImageNet and CIFAR-10 are quite similar, so knowledge transfer from ImageNet to CIFAR-10 can improve performance. On the other hand, the images in Stanford Dog 120 and MIT Indoor 67 are quite different (dogs vs. indoor scenes), so regularization based on the pre-trained weights of the Stanford Dog 120 task would hurt the learning of the MIT Indoor 67 task.

    4.1.2 Pre-Trained Models and Weights.

    Furthermore, to obtain the pre-trained weights for all source tasks, we adopt the pre-trained models of ImageNet, Place 365, and Stanford Dog 120 released online. Interestingly, the pre-trained models of Place 365 and Stanford Dog 120 were themselves trained from the pre-trained ImageNet model; the pre-trained models for Place 365 and Stanford Dog 120 have therefore already been enhanced by ImageNet.

    4.1.3 Image Classification Tasks Setups.

    In the transfer learning and knowledge distillation tasks, all images are re-sized to \( 256 \times 256 \) and re-scaled to \( [-2, 2] \) for each channel, followed by data augmentation operations of random mirroring and random cropping to \( 224 \times 224 \) . We use a batch size of 64, and SGD with a momentum of 0.9 for optimizing all models [61]. The learning rate for the base model starts at 0.01 and is divided by 10 after 6,000 iterations. Training terminates after 8,000 iterations for Caltech 256, MIT Indoor 67, and Flowers 102, and after 20,000 iterations for CIFAR-10 (i.e., 18 epochs). The pre-trained weights obtained from the source task were used not only as the initialization of the model (i.e., the starting point of optimization) but also in the regularization terms. Under the best configuration, each experiment is repeated five times, and we report the average accuracy with standard deviations. In the adversarial training task, all images are resized to \( 224 \times 224 \) and re-scaled to \( [-2, 2] \) , with random horizontal flips.
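    Under our reading of this setup, a torchvision preprocessing pipeline for the training data could look as follows; the normalization constants are an assumption of ours, chosen so that each channel lands in roughly \( [-2, 2] \) .
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),   # "random mirror"
    transforms.ToTensor(),               # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.25] * 3),  # [0,1] -> [-2,2]
])
```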

    4.1.4 Hyper-Parameter Tuning for Regularizer Weights.

    The regularizer weights for all experiments were tuned via cross validation or set to the default values from the officially released code of the models. Our later experiments will show that, with the same hyper-parameter settings, GrOD does not always outperform descent along the overall loss gradient. To compare under varying hyper-parameters, the experiment results in Section 4.4 will demonstrate the effectiveness of GrOD, which dominantly outperforms common regularized deep learning for adversarial learning with varying regularizer weights.

    4.2 Performance of GrOD on Transfer Learning with L2-SP [2]

    In this section, we report the results of the overall performance comparison on the above tasks using \( L^2 \) -SP [2] and its GrOD-based variant for knowledge transfer from pre-trained models. We primarily focus on evaluating the performance improvement contributed by GrOD on top of \( L^2 \) -SP, compared to the vanilla implementation. Both source and target tasks are trained on a typical ResNet-18 architecture. Knowledge transfer from ImageNet works well for all target tasks, as ImageNet contains more than 1,000 classes of images with broad category coverage and rich features. However, the performance of knowledge transfer from Stanford Dog 120 to MIT Indoor 67 might be quite limited or even negatively affect the learning procedure, as these two datasets contain quite different images (dogs vs. indoor scenes). Further discussion on negative transfer effects is provided in Section 4.2.2.

    4.2.1 Overall Comparison.

    We present the accuracy of all source/target pairs in Table 2. GrOD improves the performance of deep transfer learning in all of the above cases. For example, for CIFAR-10 (target task) with ImageNet (source task), the \( L^2 \) -SP algorithm achieved 93.30% accuracy, while GrOD ( \( L^2 \) -SP) improved the accuracy to 96.41% (more than 3.1% accuracy improvement). To the best of our knowledge, this is the best known result [62] for CIFAR-10 trained on ResNet-18 from ImageNet sources with only 18 epochs. Even using Stanford Dog 120 as the source task performs similarly to sourcing from ImageNet, since the pre-trained model of Stanford Dog 120 was itself pre-trained from ImageNet. Overall, GrOD significantly improves the performance of \( L^2 \) -SP in all transfer learning settings that we evaluated.
    Table 2.
    Method | Caltech 256 | MIT Indoors 67 | Flowers 102 | CIFAR-10
    Source Dataset: ImageNet [8]
    Fine-Tune [37] | \( 82.68 \pm 0.20 \) | \( 76.73 \pm 0.77 \) | \( 90.24 \pm 0.31 \) | \( 96.40 \pm 0.4 \)
    \( L^2 \) -SP [2] | \( 83.69 \pm 0.09 \) | \( 75.11 \pm 0.43 \) | \( 88.96 \pm 0.21 \) | \( 93.30 \pm 0.16 \)
    GrOD + \( L^2 \) -SP [2] | \( \mathbf {84.14 \pm 0.08} \) | \( \mathbf {77.46 \pm 0.29} \) | \( \mathbf {90.68 \pm 0.31} \) | \( \mathbf {96.41 \pm 0.11} \)
    Source Dataset: Places 365 [57]
    Fine-Tune [37] | \( 73.13 \pm 0.20 \) | \( 82.64 \pm 0.16 \) | \( 83.77 \pm 0.68 \) | \( 89.35 \pm 0.59 \)
    \( L^2 \) -SP [2] | \( 66.99 \pm 0.20 \) | \( 84.09 \pm 0.09 \) | \( 77.66 \pm 0.13 \) | \( 89.78 \pm 0.05 \)
    GrOD + \( L^2 \) -SP [2] | \( \mathbf {73.32 \pm 0.1} \) | \( 84.09 \pm 0.09 \) | \( \mathbf {84.11 \pm 0.06} \) | \( \mathbf {90.85 \pm 0.11} \)
    Source Dataset: Stanford Dogs 120 [58]
    Fine-Tune [37] | \( 82.29 \pm 0.04 \) | \( 75.69 \pm 0.21 \) | \( 90.20 \pm 0.39 \) | \( 96.34 \pm 0.13 \)
    \( L^2 \) -SP [2] | \( 83.44 \pm 0.23 \) | \( 74.64 \pm 0.07 \) | \( 88.14 \pm 0.06 \) | \( 94.16 \pm 0.10 \)
    GrOD + \( L^2 \) -SP [2] | \( \mathbf {83.84 \pm 0.08} \) | \( \mathbf {76.46 \pm 0.22} \) | \( \mathbf {89.98 \pm 0.04} \) | \( \mathbf {96.39 \pm 0.08} \)
    Table 2. Accuracy Comparison on Knowledge Transfer from Different Source Datasets
    An interesting fact observed in the experiments is that, across both algorithms and the 15 source/task pairs, using Stanford Dog 120 as the source task performs similarly to sourcing from ImageNet. We attribute this to the fact that the public release of the Stanford Dog 120 pre-trained model is itself pre-trained from ImageNet, while the Stanford Dog 120 dataset is relatively small (i.e., it cannot “wash out” the knowledge obtained from ImageNet, so the model preserves knowledge from both the ImageNet and Stanford Dog 120 datasets). In this way, knowledge transfer from Stanford Dog 120 can be as good as that based on ImageNet. Meanwhile, GrOD can still improve the performance of \( L^2 \) -SP, gaining 0.12% \( \sim \) 2.2% higher accuracy with low variance, even given the well-trained Stanford Dog 120 model.

    4.2.2 Performance with Negative Transfer Effect.

    According to the results presented in Table 2, we find that negative transfer may happen in the cross-domain cases “Visual Objects/Dogs \( \Leftrightarrow \) Indoor Scenes” (see the domain definitions in Table 1), while GrOD can improve the performance of \( L^2 \) -SP to relieve such negative effects. Two detailed cases are addressed as follows.
    Cases of Negative Transfer. For the \( L^2 \) -SP algorithm, when using ImageNet or Stanford Dogs 120 as the source task while transferring to MIT Indoors 67 as the target task, we observe significant performance degradation compared to knowledge transfer from Place 365 to MIT Indoor 67. For example (Case I), the accuracy on MIT Indoor 67 using \( L^2 \) -SP is 84.09% based on the pre-trained weights of Place 365, while the accuracy degrades to 75.11% and 74.64% under the same settings with ImageNet and Stanford Dog 120 as the pre-trained models, respectively. Furthermore, we observe similar negative transfer effects when using Place 365 as the source while transferring to the target tasks based on Caltech 256, Flowers 102, and CIFAR-10. For example (Case II), the accuracy on Flowers 102 is 77.66% using Place 365 as the source, while sourcing from ImageNet and Stanford Dog achieves as high as 88.96% and 88.14%, respectively, all based on \( L^2 \) -SP.
Relieving Negative Transfer Effects. We believe the performance degradation in Cases I and II is due to negative transfer, as the domains of these datasets are quite different. GrOD can, however, relieve such negative transfer. GrOD + \( L^2 \)-SP [2] achieves 84.11% on the Flowers 102 dataset even when sourcing from Places 365, i.e., an accuracy improvement of nearly 7% over vanilla \( L^2 \)-SP under the same settings. For the remaining negative transfer cases, GrOD still improves performance, with around 2% higher accuracy than the vanilla implementations of \( L^2 \)-SP. We thus conclude that GrOD improves the performance of \( L^2 \)-SP in negative transfer cases.
Note that we do not claim that GrOD eliminates negative transfer effects. It does, however, improve the performance of regularization-based deep transfer learning even with inappropriate source/target pairs, and such accuracy improvement partially alleviates the problem of negative transfer.

    4.3 Performance of GrOD on Feature-Wise Knowledge Distillation with [3]

We report the results of the overall performance comparison based on the aforementioned tasks, using Feature-wise Knowledge Distillation [3, 22] and its variant based on GrOD for Teacher–Student training of deep neural networks. We focus on evaluating the performance improvement contributed by GrOD on top of Knowledge Distillation (denoted as “KnowDist” in this article), compared to [3]. We use a ResNet-18 pre-trained on ImageNet as the Teacher Network; a sketch of the feature-wise regularizer is given below.
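To make the setting concrete, the snippet below is a minimal sketch of a feature-wise distillation regularizer in the spirit of [3, 22]. It is our simplification, not the authors' code: [3] itself matches flow (FSP) matrices between layers, whereas this sketch matches feature maps directly, and the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def feature_distill_reg(student_feats, teacher_feats):
    """Simplified feature-wise distillation term: mean squared error
    between intermediate feature maps of the Student and the (frozen)
    Teacher. This stands in for the regularizer Omega in the training
    loss; the exact formulation of [3] uses FSP matrices instead."""
    return sum(F.mse_loss(s, t.detach())          # stop gradients to the Teacher
               for s, t in zip(student_feats, teacher_feats))
```

The training loss would then be the Student's cross-entropy plus \( \lambda \) times this term; GrOD replaces the plain linear combination of the two gradients with the orthogonally decomposed direction.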
We present the overall accuracy comparisons in Table 3. GrOD improves the performance of Teacher–Student training for all Student networks (also ResNet-18). For example, when training a Student network on the CIFAR-10 dataset, common knowledge distillation achieves 96.43% accuracy while GrOD further improves the accuracy to 96.57%. These two numbers are quite close to the state-of-the-art performance of ResNet-18 on CIFAR-10 (without additional training data or data augmentation methods) [62]. As shown in Table 3, GrOD brings consistent improvement in all tasks. We also tested the performance of GrOD with Teacher networks pre-trained on other datasets; GrOD achieves performance improvement in all cases, on top of [3].
| Teacher Pre-trained on | Method | Caltech 256 | MIT Indoors 67 | Flowers 102 | CIFAR-10 |
| ImageNet [8] | KnowDist [3] | \( 82.93 \pm 0.08 \) | \( 78.05 \pm 0.32 \) | \( 90.43 \pm 0.4 \) | \( 96.43 \pm 0.08 \) |
| ImageNet [8] | GrOD + KnowDist [3] | \( \mathbf {83.27 \pm 0.4} \) | \( \mathbf {78.77 \pm 0.31} \) | \( \mathbf {90.91 \pm 0.4} \) | \( \mathbf {96.57 \pm 0.2} \) |
| Places 365 [57] | KnowDist [3] | \( 72.8 \pm 0.22 \) | \( 83.29 \pm 0.42 \) | \( 83.50 \pm 0.26 \) | \( 94.96 \pm 0.05 \) |
| Places 365 [57] | GrOD + KnowDist [3] | \( \mathbf {73.18 \pm 0.24} \) | \( \mathbf {84.40 \pm 0.41} \) | \( \mathbf {84.12 \pm 0.56} \) | \( \mathbf {95.02 \pm 0.13} \) |
| Stanford Dogs 120 [58] | KnowDist [3] | \( 82.73 \pm 0.26 \) | \( 76.36 \pm 0.19 \) | \( 89.86 \pm 0.07 \) | \( 96.11 \pm 0.53 \) |
| Stanford Dogs 120 [58] | GrOD + KnowDist [3] | \( \mathbf {82.85 \pm 0.27} \) | \( \mathbf {76.74 \pm 0.26} \) | \( \mathbf {90.29 \pm 0.34} \) | \( \mathbf {96.41 \pm 0.18} \) |
Table 3. Classification Accuracy Comparison for Knowledge Distillation from Teacher Networks Pre-Trained by Various Datasets

    4.4 Performance of GrOD on Adversarial Learning with Advt [4]

In this section, we report the results of a performance comparison based on Fashion MNIST and CIFAR-10 under adversarial learning settings, using advt [4] and its variant based on GrOD. We focus on evaluating the performance improvement contributed by GrOD on top of advt. We use a simple two-layer CNN and a ResNet-18 for this experiment.

    4.4.1 Adversarial Learning Setups with GrOD.

The experiment setups for adversarial learning with GrOD are a bit different from the previous settings. Reference [46] found that training with an adversarial objective function acts as an effective form of regularization, which we formulate in Equation (4). To generate perturbations more efficiently, [4] provides a state-of-the-art adversarial learning method that uses PGD [47] to generate adversarial examples, where two key factors, \( \epsilon \) and \( \lambda \), control the level of noise in adversarial attacks and the strength of the regularization effect in adversarial learning, respectively.
In this way, one can trade off accuracy against robustness of the model by tuning the regularization coefficient \( \lambda \) and the attack strength \( \epsilon \). We further adopt GrOD to obtain better accuracy while preserving robustness. Note that we model the gradient of the regularization term as \( \nabla \Omega (\omega) =\tfrac{1}{n}\sum _{i=1}^n \nabla _\omega L(z(\tilde{\mathbf {x}}_i,\omega),y_i) -\tfrac{1}{n}\sum _{i=1}^n \nabla _\omega L(z(\mathbf {x}_i,\omega),y_i) \), where \( \tilde{\mathbf {x}}_i \) denotes the adversarial example generated from \( \mathbf {x}_i \) via PGD, so as to remove the major component parallel to the original loss gradient.
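As a minimal sketch of this decomposition (our simplification; the exact update rule is given in Section 3), the snippet below projects the regularizer gradient onto the empirical-loss gradient, drops the parallel component, and combines the remainder with the empirical-loss gradient:

```python
import torch

def grod_direction(grad_emp, grad_reg, lam=1.0):
    """Sketch of a GrOD-style descent direction. `grad_emp` is the
    (stochastic) gradient of the empirical loss, `grad_reg` the gradient
    of the regularizer, and `lam` the regularizer weight."""
    g = grad_emp.flatten()
    r = grad_reg.flatten()
    coef = torch.dot(r, g) / torch.dot(g, g).clamp_min(1e-12)
    r_orth = r - coef * g            # component of grad_reg orthogonal to g
    return (g + lam * r_orth).view_as(grad_emp)
```

By construction the returned direction \( d \) satisfies \( \langle d, \nabla J \rangle = \Vert \nabla J \Vert ^2 > 0 \), so it never points against empirical-loss descent, which matches the acute angles reported in Section 4.5.2.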

    4.4.2 Experimental Results.

We tested adversarial training with GrOD on Fashion-MNIST [60] and CIFAR-10, respectively. All images are re-scaled to \( [-2, 2] \). All adversarial examples are generated via [4] with seven PGD steps. Figure 3 presents the experimental results for adversarial training on the Fashion-MNIST and CIFAR-10 datasets.
Fig. 3. Overall performance of Adversarial Learning with GrOD: (a) Robustness vs. Accuracy of the models on the Fashion MNIST dataset with varying \( \lambda \) (regularizer weight) and fixed perturbation size \( \epsilon = 0.05 \); and (b) Robustness of the model trained on CIFAR-10 with varying perturbation size \( \epsilon \in [0,0.1] \).
For the Fashion MNIST dataset, we fix the perturbation size \( \epsilon =0.05 \) and vary the regularization coefficient \( \lambda \), so as to see whether GrOD improves the performance of advt [4] in terms of both robustness (i.e., the accuracy on adversarial samples) and accuracy (i.e., the accuracy on original testing samples). The results show that GrOD achieves a “Pareto improvement” on top of advt: when both algorithms achieve the same accuracy, GrOD leads to higher robustness; when they achieve the same robustness, advt(GrOD) outperforms advt with higher accuracy. Note that, without adversarial attack, the evaluated CNN obtains 0.92 accuracy on the testing set.
For the experiments on the CIFAR-10 dataset, we fix the regularization coefficient while varying the perturbation size \( \epsilon \) of the adversarial attack from 0 to 0.1. The experiments show that advt(GrOD) always outperforms advt under the same level of noise (perturbation). Overall, the experiments on both datasets demonstrate the improvement brought by GrOD in adversarial learning tasks, on top of the state-of-the-art [4].

    4.5 Case Studies

We report the results of the following two case studies, which provide further evidence that GrOD works in the way we assumed.

    4.5.1 Empirical Loss Minimization.

As elaborated in the Introduction, we suspect that using a regularizer might restrict the learning procedure from lowering the empirical loss. Such restriction helps regularized deep learning avoid over-fitting but, in the meanwhile, hurts the learning procedure. Following this insight, we study the trends of empirical loss minimization with and without GrOD in the \( L^2 \)-SP [2] case. Note that the empirical loss here is NOT the training loss; it refers to the data-fitting error part of the training loss.
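To make the distinction concrete, the sketch below (an illustrative helper with hypothetical names, not the experiment code) logs the data-fitting term and the full \( L^2 \)-SP training loss separately; only the latter is optimized, but Figure 4 tracks the former:

```python
import torch
import torch.nn as nn

def l2_sp_penalty(model, start_params):
    """L2-SP regularizer [2]: squared L2 distance between the current
    weights and detached copies of the pre-trained starting point."""
    return sum(((p - p0) ** 2).sum()
               for p, p0 in zip(model.parameters(), start_params))

def loss_terms(model, x, y, start_params, lam=0.01):
    """Returns (empirical loss, training loss); `lam` is a placeholder
    regularizer weight, not a value used in the paper."""
    empirical = nn.functional.cross_entropy(model(x), y)  # data-fitting error
    training = empirical + lam * l2_sp_penalty(model, start_params)
    return empirical, training
```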
Figure 4 illustrates the trends of both empirical loss and testing loss with an increasing number of iterations, based on both \( L^2 \)-SP and GrOD( \( L^2 \)-SP), for the Places 365 \( \Longrightarrow \) MIT Indoors 67 case. As expected, the empirical loss of both vanilla \( L^2 \)-SP and GrOD( \( L^2 \)-SP) decreases with the number of iterations, while the empirical loss of \( L^2 \)-SP is always higher than that of GrOD( \( L^2 \)-SP). In the meanwhile, GrOD( \( L^2 \)-SP) always enjoys a lower testing loss than vanilla \( L^2 \)-SP. These phenomena indicate that, compared to GrOD( \( L^2 \)-SP), the \( L^2 \)-SP regularization term restricts the procedure of empirical loss minimization and finally hurts the learning procedure with lower testing accuracy. Furthermore, we also observed that GrOD could be further improved through early stopping.
Fig. 4. Empirical loss minimization.

    4.5.2 Angles Between Descent Directions.

The design of GrOD is based on the two assumptions made in Section 3.2: it is possible to find a new descent direction that is very close to the direction of the empirical loss gradient (Assumption 1), while always sharing an angle with the gradient of the regularization term that is as small as that of the original descent direction (Assumption 2).
Figure 5 plots the comparison of the two types of angles for the \( L^2 \)-SP and advt algorithms with and without GrOD. The results show that, with GrOD, both algorithms always enjoy a smaller angle between the actual descent direction and the (stochastic) gradient of the empirical loss, i.e., \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }),\nabla J(\mathbf {\omega })) \) of both algorithms with GrOD is smaller than for the vanilla ones. We thus confirm the validity of Assumption 1. To demonstrate the validity of Assumption 2, we measure \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }),\nabla \Omega (\mathbf {\omega })) \) for the two cases using the \( L^2 \)-SP and advt algorithms on CIFAR-10. It shows that, no matter whether GrOD is used, the trends of the angles over the number of iterations are quite similar for the same algorithm under the same settings. Please note that the values of the angles highly depend on the choice of hyper-parameters (e.g., \( \lambda \) for \( L^2 \)-SP). However, we can still verify that, by design, the angle between GrOD's actual descent direction and the empirical loss's gradient is always acute.
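The angles in Figure 5 can be computed directly from flattened gradient vectors; a small sketch (our helper, with hypothetical names) is:

```python
import torch

def angle_deg(u, v, eps=1e-12):
    """Angle in degrees between two descent/gradient vectors, each
    obtained by concatenating the per-parameter gradients."""
    u, v = u.flatten(), v.flatten()
    cos = torch.dot(u, v) / (u.norm() * v.norm() + eps)
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))
```

At each logging step, `angle_deg(d_hat, grad_J)` would check Assumption 1 and `angle_deg(d_hat, grad_Omega)` Assumption 2, where `d_hat` is the actual descent direction and `grad_J`, `grad_Omega` the two gradients.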
Fig. 5. \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }_t), \nabla \Omega (\mathbf {\omega }_t)) \) vs. \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }_t),\nabla J(\mathbf {\omega }_t)) \) over the number of iterations t with varying \( \mathbf {\omega }_t \): blue lines represent the vanilla implementations of \( L^2 \)-SP and advt, red lines represent the GrOD-based solutions; dashed lines represent \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }_t), \nabla \Omega (\mathbf {\omega }_t)) \), and solid lines represent \( \measuredangle (\mathbf {\widehat{d}}(\mathbf {\omega }_t),\nabla J(\mathbf {\omega }_t)) \).

    5 Discussion

    In this article, we proposed GrOD, which can improve regularized deep learning through orthogonal decomposition of loss gradients. We have included extensive experiments using regularized learning paradigms [2, 3, 4] for knowledge transfer, knowledge distillation, and adversarial training. In this section, we discuss several open issues in this work.

    5.1 Performance Improvement of GrOD and Analysis

All in all, the performance improvement made by GrOD on top of L \( ^2 \) -SP [2], KnowDist [3], and adversarial learning [4] is quite marginal. However, we point out that the performance improvements exist consistently in all cases, especially in relieving “negative transfer”, where the use of the regularizer hurts transfer learning. Please note that, in our study, we include experiments based on inappropriate pairs of source and target datasets/pre-trained models for knowledge transfer and knowledge distillation (e.g., in Tables 2 and 3), while most existing works use ImageNet as the only source dataset or pre-trained model.
With inappropriate pairs of source and target datasets/pre-trained models, regularized learning with L \( ^2 \) -SP or KnowDist might hurt the performance compared to directly fine-tuning from pre-trained weights. In all negative transfer cases, GrOD improves the performance of regularized knowledge transfer or knowledge distillation while always achieving performance better than vanilla fine-tuning. Furthermore, for adversarial training with the advt regularization [4], GrOD achieves a better tradeoff between accuracy and robustness, with Pareto dominance in these two factors, under varying strengths of perturbation. Our theoretical analysis in Section 3.4 states how GrOD ensures the effectiveness of the ERM learning procedure while preserving the regularization effects (Proposition 1), which explains the performance improvement. Note that our theoretical analysis relies on the two assumptions made in Section 3.1; we conducted case studies with experiments to validate these two assumptions empirically.

    5.2 Stability of GrOD Performance

Though GrOD enjoys higher accuracy on average, in some cases it also incurs higher variance. For example, in Table 2, with weights pre-trained on ImageNet, GrOD( \( L^2 \)-SP) achieves 90.68% with 0.31% STD for the target task based on Flowers 102, while \( L^2 \)-SP achieves 88.96% with 0.21% STD in the same settings. GrOD clearly incurs higher variance; however, the lower bound of the confidence interval of GrOD is still higher than the upper bound of the confidence interval of the original algorithm. Furthermore, we also tried to improve weight decay using GrOD; the results showed that GrOD cannot improve weight decay. (Note that weight decay, i.e., L \( ^2 \) -regularization, has frequently been considered a stabilizer [63, 64] of the training procedure, as a Ridge-style regularization.) We believe both of these observations are due to the use of orthogonal decomposition on stochastic gradients. In practical deep learning, stochastic gradients, the noisy evaluations of the loss function's derivatives, are used to accelerate the learning procedure with mini-batch sampling. However, the gradient noise [65] after orthogonal decomposition might perturb the training procedure and lead to instability.

    5.3 Hyper-Parameters Tuning and Fair Comparisons

In our experiments, to enable fair comparisons, we use hyper-parameters, including learning rates and the weights of regularizers, according to the default settings released with the open-source implementations of the algorithms [2, 3, 4] (most of which were tuned through cross-validation in the original research). In the same set of experiments, both GrOD and the original algorithms used the same set of hyper-parameters, especially the weights of regularizers, for fair comparisons. Note that the performance of GrOD could be further improved by tuning the hyper-parameters (rather than using the default ones).
Furthermore, our theoretical analysis also shows that the descent direction of GrOD is not achievable by tuning the weight of the regularization term (Proposition 2). That is, no matter how one sets the hyper-parameters of vanilla regularized deep learning, an algorithm based on the vanilla loss gradient cannot behave the same as GrOD.

    5.4 Connections to Optimization Algorithms

Note that the GrOD strategy is derived from the common stochastic gradient estimation used in stochastic gradient-based learning algorithms, such as SGD, Momentum, conditioned SGD, Adam, and so on. It can be considered an alternative approach to descent direction estimation on top of vanilla stochastic gradient estimation, where one can still use natural gradient-like methods to condition the descent direction or adopt Momentum-like acceleration methods for the weight update. We do not intend to compare GrOD with any gradient-based learning algorithm, as the contributions are complementary: one can freely use GrOD to improve any gradient-based optimization algorithm by correcting the descent direction.
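As a sketch of this plug-in view (our hypothetical helper, under the simplified decomposition rule above; any PyTorch optimizer would do), one can write the GrOD-corrected direction into the parameters' `.grad` fields before calling `optimizer.step()`:

```python
import torch

def grod_step(params, empirical_loss, reg_loss, optimizer, lam=1.0):
    """Compute both gradients, orthogonally correct the regularizer
    gradient against the empirical one, and let any gradient-based
    optimizer (SGD, Momentum, Adam, ...) consume the result.
    Assumes both losses depend on every tensor in `params`."""
    g = torch.autograd.grad(empirical_loss, params, retain_graph=True)
    r = torch.autograd.grad(reg_loss, params)
    g_flat = torch.cat([t.flatten() for t in g])
    r_flat = torch.cat([t.flatten() for t in r])
    coef = torch.dot(r_flat, g_flat) / torch.dot(g_flat, g_flat).clamp_min(1e-12)
    d_flat = g_flat + lam * (r_flat - coef * g_flat)  # GrOD-style direction
    offset = 0
    for p in params:                        # scatter back into .grad
        n = p.numel()
        p.grad = d_flat[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```

Because the correction happens before the optimizer consumes the gradients, momentum buffers, adaptive preconditioning, and learning-rate schedules all apply unchanged.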
Furthermore, according to the ERM-effective descent direction assumption, GrOD can further lower the empirical loss while preserving the regularization effects simultaneously, as the final descent direction is close to both the empirical loss gradient and the regularizer gradient. Our experiments on adversarial learning show that, no matter how the regularizer weights are tuned, GrOD still outperforms traditional regularized deep learning algorithms that linearly combine the gradients of the two terms as the descent direction. In future work, we intend to study the asymptotic properties and convergence of GrOD, using the Neural Tangent Kernel as a proxy [66] to lower the complexity of non-convex optimization analysis.

    5.5 Improving Advanced Regularization Methods and Other Applications

Please be advised that the regularized deep learning algorithms for transfer learning ( \( L^2 \)-SP) [2], knowledge distillation [3], and adversarial training [4] are not the state-of-the-art algorithms in their fields. In future work, we are interested in incorporating more advanced methods, such as DELTA and its variants [24, 67], BSS [68], Co-Tuning [69], learning without forgetting [21], and deep ensemble learning [70], where more advanced and complicated regularizers have been proposed, incorporating constrained features, singular value decomposition, category relationships, and so on. We would also study the use of GrOD and its variants to improve regularized learning in applications other than image classification, such as segmentation and parsing [10, 12], regularized graphical learning [15, 16, 71], and network interpretability [72, 73].

    6 Conclusions

In this article, we studied GrOD, a descent direction estimation strategy that improves common regularized deep learning techniques, with applications to transfer learning [2], knowledge distillation [3], and adversarial learning [4]. Consistent improvements have been observed over existing methods that simply aggregate the empirical loss for data fitting and the regularization terms through linear combination, such as [2, 3, 4].
Specifically, we designed a new method to re-estimate the descent direction based on the (stochastic) gradient estimates of the empirical loss and the regularizers, where orthogonal decomposition is applied to the gradient of the regularization term so as to eliminate the component that conflicts with empirical loss descent. The design of the algorithm is based on an intuitive assumption, namely the ERM-preserved descent direction, under which, in every iteration of the learning procedure, the empirical loss of regularized deep learning is expected to descend as fast as under pure empirical loss minimization. We have conducted extensive experiments to evaluate GrOD on several real-world datasets with classical convolutional neural networks. The experimental results and comparisons show that GrOD improves the state-of-the-art algorithms for the three applications with higher accuracy and robustness.

    References

    [1]
    Ruosi Wan, Haoyi Xiong, Xingjian Li, Zhanxing Zhu, and Jun Huan. 2019. Towards making deep transfer learning never hurt. In Proceedings of the 2019 IEEE International Conference on Data Mining. IEEE.
    [2]
    Xuhong Li, Yves Grandvalet, and Franck Davoine. 2018. Explicit inductive bias for transfer learning with convolutional networks. In Proceedings of the 35th International Conference on Machine Learning.
    [3]
    Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. 2017. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2.
    [4]
    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=rJzIBfZAb.
    [5]
Gregory Griffin, Alex Holub, and Pietro Perona. 2007. Caltech-256 object category dataset. Retrieved 28 July, 2022 from https://authors.library.caltech.edu/7694/1/CNS-TR-2007-001.pdf.
    [6]
    Ariadna Quattoni and Antonio Torralba. 2009. Recognizing indoor scenes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 413–420.
    [7]
    Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 2014. The CIFAR-10 dataset. Retrieved 28 July, 2022 from http://www.cs.toronto.edu/kriz/cifar.html.
    [8]
    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
    [9]
Qingzhong Wang, Pengfei Zhang, Haoyi Xiong, and Jian Zhao. 2022. Face.evolve: A high-performance face recognition library. Neurocomputing 494 (2022), 443–445.
    [10]
    Jian Zhao, Yu Cheng, Yan Xu, Lin Xiong, Jianshu Li, Fang Zhao, Karlekar Jayashree, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, and Jiashi Feng. 2018. Towards pose invariant face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2207–2216.
    [11]
    Jian Zhao, Jianshu Li, Hengzhu Liu, Shuicheng Yan, and Jiashi Feng. 2020. Fine-grained multi-human parsing. International Journal of Computer Vision 128, 8 (2020), 2185–2203.
    [12]
    Lutao Chu, Yi Liu, Zewu Wu, Shiyu Tang, Guowei Chen, Yuying Hao, Juncai Peng, Zhiliang Yu, Zeyu Chen, Baohua Lai, and Haoyi Xiong. 2022. PP-HumanSeg: Connectivity-aware portrait segmentation with a large-scale teleconferencing video dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 202–209.
    [13]
    Jingbo Zhou, Shuangli Li, Liang Huang, Haoyi Xiong, Fan Wang, Tong Xu, Hui Xiong, and Dejing Dou. 2020. Distance-aware molecule graph attention network for drug-target binding affinity prediction. arXiv:2012.09624. Retrieved from https://arxiv.org/abs/2012.09624.
    [14]
    Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong. 2021. Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 975–985.
    [15]
    Congxi Xiao, Jingbo Zhou, Jizhou Huang, An Zhuo, Ji Liu, Haoyi Xiong, and Dejing Dou. 2021. C-Watcher: A framework for early detection of high-risk neighborhoods ahead of COVID-19 outbreak. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4892–4900.
    [16]
Jindong Han, Hao Liu, Haoyi Xiong, and Jing Yang. 2022. Semi-supervised air quality forecasting via self-supervised hierarchical graph neural network. IEEE Transactions on Knowledge and Data Engineering 1, 1 (2022), 1–10.
    [17]
    Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep Learning. Vol. 1. MIT Press, Cambridge.
    [18]
    Vladimir Naumovich Vapnik. 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10, 5 (1999), 988–999.
    [19]
    Minyoung Huh, Pulkit Agrawal, and Alexei A. Efros. 2016. What makes ImageNet good for transfer learning? arXiv:1608.08614. Retrieved from https://arxiv.org/abs/1608.08614.
    [20]
    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
    [21]
    Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 12 (2017), 2935–2947.
    [22]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
    [23]
    Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. 2018. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=rJUYGxbCW.
    [24]
    Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, and Jun Huan. 2019. DELTA: Deep learning transfer using feature map with attention for convolutional networks. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=rkgbwsAcYm.
    [25]
    Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
    [26]
    David Lopez-Paz et al. 2017. Gradient episodic memory for continual learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6467–6476.
    [27]
    Zhongjie Yu, Lin Chen, Zhongwei Cheng, and Jiebo Luo. 2020. TransMatch: A transfer-learning scheme for semi-supervised few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12856–12864.
    [28]
    Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2015. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
    [29]
    P. Jing, Y. Su, L. Nie, and H. Gu. 2017. Predicting image memorability through adaptive transfer learning from external sources. IEEE Transactions on Multimedia 19, 5 (2017), 1050–1062.
    [30]
    C. Yan, L. Li, C. Zhang, B. Liu, Y. Zhang, and Q. Dai. 2019. Cross-modality bridging and knowledge transferring for image understanding. IEEE Transactions on Multimedia 21, 10 (2019), 2675–2685.
    [31]
    S. Lin, R. Ji, C. Chen, D. Tao, and J. Luo. 2019. Holistic CNN compression via low-rank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 12 (2019), 2889–2905.
    [32]
    Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on International Conference on Machine Learning. 647–655.
    [33]
    Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. 2018. Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4109–4118.
    [34]
    Jiaming Liu, Yali Wang, and Yu Qiao. 2017. Sparse deep transfer learning for convolutional neural network. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2245–2251.
    [35]
    Mehmet Aygun, Yusuf Aytar, and Hazim Kemal Ekenel. 2017. Exploiting convolution filter patterns for transfer learning. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 2674–2680.
    [36]
    Yinghua Zhang, Yu Zhang, and Qiang Yang. 2018. Parameter transfer unit for deep neural networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 82–95.
    [37]
    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems. 3320–3328.
    [38]
    Weifeng Ge and Yizhou Yu. 2017. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10–19.
    [39]
    Yuwu Lu, Wenjing Wang, Chun Yuan, Xuelong Li, and Zhihui Lai. 2020. Manifold transfer learning via discriminant regression analysis. IEEE Transactions on Multimedia 23 (2020), 2056–2070.
    [40]
    M. Yuan and Y. Peng. 2020. CKD: Cross-task knowledge distillation for text-to-image synthesis. IEEE Transactions on Multimedia 22, 8 (2020), 1955–1968.
    [41]
    Rich Caruana. 1997. Multitask learning. Machine Learning 28, 1 (1997), 41–75.
    [42]
    Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=S1XolQbRW.
    [43]
    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550.
    [44]
Sergey Zagoruyko and Nikos Komodakis. 2017. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=Sks9_ajex.
    [45]
    Daniel Lowd and Christopher Meek. 2005. Adversarial learning. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 641–647.
    [46]
    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
    [47]
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial machine learning at scale. In Proceedings of the International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=BJm4T4Kgx.
    [48]
    Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. 2017. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 15–26.
    [49]
    Haishan Ye, Zhichao Huang, Cong Fang, Chris Junchi Li, and Tong Zhang. 2018. Hessian-aware zeroth-order optimization for black-box adversarial attack. arXiv preprint arXiv:1812.11377.
    [50]
    Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a defense to adversarial perturbations against deep neural networks. In Proceedings of the 2016 IEEE Symposium on Security and Privacy. IEEE, 582–597.
    [51]
    Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 506–519.
    [52]
Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. 2018. Making machine learning robust against adversarial inputs. Communications of the ACM 61, 7 (June 2018), 56–66.
    [53]
    Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems. 2204–2212.
    [54]
    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048–2057.
    [55]
    Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 21–29.
    [56]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
    [57]
    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.
    [58]
    Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. 2011. Novel dataset for fine-grained image categorization: Stanford dogs. In Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization, Vol. 2. 1.
    [59]
    M.-E. Nilsback and A. Zisserman. 2008. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing.
    [60]
    Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
    [61]
    Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning. 1139–1147.
    [62]
    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2018. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451.
    [63]
    Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger B. Grosse. 2018. Three mechanisms of weight decay regularization. In Proceedings of the International Conference on Learning Representations.
    [64]
    Aditya Sharad Golatkar, Alessandro Achille, and Stefano Soatto. 2019. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vol. 32. 10677–10687.
    [65]
    Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, and Zhanxing Zhu. 2020. On the noisy gradient descent that generalizes as SGD. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research). Hal Daumé III and Aarti Singh (Eds.), Vol. 119, PMLR, 10367–10376. Retrieved from http://proceedings.mlr.press/v119/wu20c.html.
    [66]
    Arthur Jacot, Franck Gabriel, and Clément Hongler. 2021. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing. 6–6.
    [67]
    Xingjian Li, Haoyi Xiong, Zeyu Chen, Jun Huan, Ji Liu, Cheng-Zhong Xu, and Dejing Dou. 2021. Knowledge distillation with attention for deep transfer learning of convolutional networks. ACM Transactions on Knowledge Discovery from Data 16, 3 (2021), 1–20.
    [68]
    Xinyang Chen, Sinan Wang, Bo Fu, Mingsheng Long, and Jianmin Wang. 2019. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vol. 32, 1908–1918.
    [69]
    Kaichao You, Zhi Kou, Mingsheng Long, and Jianmin Wang. 2020. Co-tuning for transfer learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems. Vol. 33.
    [70]
    Xingjian Li, Haoyi Xiong, Zeyu Chen, Jun Huan, Cheng-Zhong Xu, and Dejing Dou. 2021. “In-Network Ensemble”: Deep ensemble learning with diversified knowledge distillation. ACM Transactions on Intelligent Systems and Technology 12, 5 (2021), 1–19.
    [71]
    Yihang Yin, Qingzhong Wang, Siyu Huang, Haoyi Xiong, and Xiang Zhang. 2022. AutoGCL: Automated graph contrastive learning via learnable view generators. Proceedings of the AAAI Conference on Artificial Intelligence 36, 8 (2022), 8892–8900.
    [72]
    Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Xiao Zhang, Ji Liu, Jiang Bian, and Dejing Dou. 2021. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. arXiv:2103.10689. Retrieved from https://arxiv.org/abs/2103.10689.
    [73]
    Xiao Zhang, Haoyi Xiong, and Dongrui Wu. 2021. Rethink the connections among generalization, memorization and the spectral bias of DNNs. In Proceedings of the International Joint Conference on Artificial Intelligence.
