3.2 Composite Policy LRs
There are two types of composite LR schemes, depending on whether the LRs are composed using the same LR function or using two or more different LR functions. We refer to the former as homogeneous multi-policy LRs and the latter as heterogeneous multi-policy LRs.
Decaying LRs address a limitation of fixed LRs by using decreasing LR values during training. Similar to simulated annealing, training with a decaying LR starts with a relatively large LR value, which is reduced gradually throughout training, aiming to accelerate the learning process while ensuring that the training converges with good accuracy or meets the target accuracy. A decaying LR policy is defined by a decay function \(g(t)\) and a constant coefficient \(k\), denoted by \(\eta (t) = k g(t)\). \(g(t)\) gradually decreases from the upper bound of 1 as the number of iterations (\(t\)) increases, and the constant \(k\) serves as the starting LR.
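To make this general form concrete, the following is a minimal Python sketch of \(\eta(t) = k g(t)\); the function names are illustrative only and are not LRBench's actual API.

```python
# Minimal sketch of a decaying LR policy eta(t) = k * g(t):
# g(t) decays from 1 as the iteration count t grows, and the
# constant k is the starting LR. Names are illustrative only.

def decaying_lr(t, k, g):
    """LR at iteration t for decay function g and starting LR k."""
    return k * g(t)

# Example: EXP-style decay g(t) = gamma**t with gamma close to 1,
# so the LR shrinks slightly at every iteration.
gamma = 0.99994
for t in (0, 10_000, 50_000):
    print(t, decaying_lr(t, k=0.1, g=lambda t: gamma ** t))
```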
Table 1 lists the five most popular decaying LRs supported in LRBench. The STEP function defines the LR policy at iteration \(t\) with two parameters: a fixed step size \(l\) (\(l\gt 1\)) and an exponential factor \(\gamma\). The LR value is initialized to \(k\) and decays by \(\gamma\) every \(l\) iterations. NSTEP enriches STEP by introducing \(n\) variable step sizes, denoted by \(l_0, l_1, \ldots , l_{n-1}\), instead of the single fixed step size \(l\). NSTEP is initialized to \(k\) (\(g(t)=1\) when \(i=0, t\lt l_0\)) and computed by \(\gamma ^i\) (when \(i\gt 0\) and \(l_{i-1} \le t \lt l_i\)). EXP is an LR function defined by an exponential function (\(\gamma ^ t\)). Although EXP, STEP, and NSTEP all use an exponential function to define \(g(t)\), their choices of the concrete \(\gamma\) differ. To avoid the LR decaying too fast due to exponential explosion, EXP uses a \(\gamma\) close to 1, e.g., 0.99994, and reduces the LR value every iteration. In contrast, STEP and NSTEP employ a small \(\gamma\), e.g., 0.1, and decay the LR value using one fixed step size \(l\) or \(n\) variable step sizes \(l_i\). For STEP, the total number of steps is determined by the step size and the pre-defined training #Iterations (or #Epochs); for NSTEP, \(n\) is typically small, e.g., 2–5 steps. The other decaying LRs are based on the inverse time function (INV) and the polynomial function (POLY) with parameter \(p\), as shown in Table 1. A good selection of the value settings for these parameters is challenging and yet critical for achieving effective training performance when using decaying LR policies.
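As a rough illustration of the Table 1 decay functions, the sketch below implements \(g(t)\) for STEP, NSTEP, and EXP exactly as described above; the INV and POLY forms shown follow the common Caffe-style parameterization and should be checked against Table 1.

```python
# Sketch of the decay functions g(t) from Table 1 (illustrative, not LRBench's API).

def step(t, gamma, l):
    """STEP: decay by gamma every l iterations, g(t) = gamma**floor(t / l)."""
    return gamma ** (t // l)

def nstep(t, gamma, steps):
    """NSTEP: g(t) = gamma**i where steps[i-1] <= t < steps[i] (g(t) = 1 before steps[0])."""
    i = sum(1 for s in steps if t >= s)   # number of step boundaries already passed
    return gamma ** i

def exp_decay(t, gamma):
    """EXP: g(t) = gamma**t, with gamma close to 1 (e.g., 0.99994)."""
    return gamma ** t

# INV and POLY below follow the common Caffe-style forms; the exact
# parameterization in Table 1 may differ slightly.
def inv(t, gamma, p):
    """INV (inverse time): g(t) = (1 + gamma * t) ** (-p)."""
    return (1 + gamma * t) ** (-p)

def poly(t, p, max_iter):
    """POLY (polynomial): g(t) = (1 - t / max_iter) ** p."""
    return (1 - t / max_iter) ** p

# Example: STEP with gamma=0.1, l=10000 and NSTEP with two variable step sizes.
print(step(25_000, gamma=0.1, l=10_000))                  # 0.1**2 = 0.01
print(nstep(35_000, gamma=0.1, steps=[32_000, 48_000]))   # 0.1**1 = 0.1
```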
Several studies [5, 17, 25] show that, with a good selection of decaying LR policies, they are more effective than fixed LRs for improving training performance in terms of accuracy and training time. However, [30, 36, 37, 41] recently showed that decaying LRs tend to miss the opportunity to accelerate training progress on plateaus in the middle of training when initialized with too small LR values, resulting in slow convergence and/or low accuracy, while larger initial values for decaying LRs can lead to the same problems as those for fixed LRs.
STEP is an example of a homogeneous multi-policy LR created from a single LR function, FIX, by employing multiple fixed LRs according to an LR update schedule based on the training iteration \(t\) and the step size \(l\). NSTEP is another example of a homogeneous multi-policy LR: it employs \(n\) different FIX policies, each changing the LR value based on \(t\) and the \(n\) different step sizes \(l_i\) (\(i=0,\dots , n-1\)).
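To illustrate this multi-policy view, the hypothetical sketch below expresses NSTEP as a schedule of FIX policies, each active over one iteration interval; the helper names are illustrative and not LRBench's API.

```python
# Illustrative sketch: NSTEP viewed as a homogeneous multi-policy LR that
# switches among n+1 FIX policies according to the step sizes l_0..l_{n-1}.
# (Hypothetical helper names; not LRBench's API.)

def fix(value):
    """A FIX policy: constant LR regardless of iteration."""
    return lambda t: value

def nstep_as_multi_fix(k, gamma, steps):
    """Build NSTEP(k, gamma, steps) as a list of (start_iter, FIX policy) pairs."""
    schedule = [(0, fix(k))]
    for i, start in enumerate(steps, start=1):
        schedule.append((start, fix(k * gamma ** i)))
    return schedule

def lr_at(schedule, t):
    """Pick the FIX policy whose interval contains iteration t."""
    active = max((s for s in schedule if s[0] <= t), key=lambda s: s[0])
    return active[1](t)

sched = nstep_as_multi_fix(k=0.1, gamma=0.1, steps=[32_000, 48_000])
print(lr_at(sched, 10_000), lr_at(sched, 40_000), lr_at(sched, 60_000))  # 0.1 0.01 0.001
```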
Cyclic LRs (CLRs) were recently proposed by [30, 36, 37] to address the above issue of decaying LRs. CLRs by design change the LR cyclically within a pre-defined value range, instead of using a fixed value or reducing the LR by following a decaying policy, and some target accuracy thresholds can be reached earlier with CLRs, in shorter training time and fewer #Epochs or #Iterations. In general, cyclic LRs can be defined by \(\eta (t) = |k_0 - k_1| g(t) + min(k_0, k_1)\), where \(k_0\) and \(k_1\) specify the upper and lower value boundaries, \(g(t)\) represents the cyclic function whose values range from 0 to 1, and \(min(k_0, k_1) \le \eta (t) \le max(k_0, k_1)\). For each CLR policy, three important parameters should be specified: \(k_0\) and \(k_1\), which specify the initial cyclic boundaries, and the half-cycle length \(l\), defined as half of the cycle interval, similar to the step size \(l\) used in decaying LRs. A good selection of these LR parameter settings is challenging and yet critical for DNN training effectiveness under a cyclic LR policy (see Section 5 for details).
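The general cyclic form above can be sketched directly. In the minimal sketch below, the names are illustrative (not LRBench's API), and the sine-based \(g(t)\) is merely one example cyclic function in the spirit of a sine-CLR.

```python
import math

# Minimal sketch of the general cyclic LR form
#   eta(t) = |k0 - k1| * g(t) + min(k0, k1),
# where g(t) is a cyclic function with values in [0, 1] and l is the
# half-cycle length.

def cyclic_lr(t, k0, k1, g):
    return abs(k0 - k1) * g(t) + min(k0, k1)

def sine_wave(t, l):
    """|sin(pi * t / (2l))|: rises from 0 to 1 at t = l and returns to 0 at t = 2l."""
    return abs(math.sin(math.pi * t / (2 * l)))

# Example: the LR oscillates between k1 = 0.01 and k0 = 0.1 with half-cycle l = 2000.
for t in (0, 1_000, 2_000, 3_000, 4_000):
    print(t, round(cyclic_lr(t, k0=0.1, k1=0.01, g=lambda t: sine_wave(t, l=2_000)), 4))
```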
LRBench currently supports three types of CLRs: triangle-LRs, sine-LRs, and cosine-LRs, as Table 2 shows. TRI is formulated with a triangle wave function \(TRI(t)\) bounded by \(k_0\) and \(k_1\). TRI2 and TRIEXP are two variants of TRI obtained by multiplying \(TRI(t)\) with a decaying function: \(\frac{1}{2^{floor(\frac{t}{2l})}}\) for TRI2 and \(\gamma ^{t}\) for TRIEXP. TRI2 reduces the LR boundary (\(|k_0 - k_1|\)) every \(2 l\) iterations, while TRIEXP decreases the LR boundary exponentially. [30] proposed a cosine function with warm restarts, which can be seen as another type of CLR; we denote it as COS. We implement COS2 and COSEXP in LRBench as the two variants of COS corresponding to TRI2 and TRIEXP, and also implement SIN, SIN2, and SINEXP as the sine-CLRs.
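The sketch below illustrates how TRI2 and TRIEXP modulate the triangle wave. One common triangle-wave formulation is used for \(TRI(t)\); the exact expression in Table 2 may be written differently, and the parameter names are illustrative.

```python
# Sketch of TRI and its decaying variants TRI2 and TRIEXP (illustrative only).

def tri(t, l):
    """Triangle wave in [0, 1] with half-cycle length l: peaks at t = l, 3l, 5l, ..."""
    cycle = t // (2 * l)
    return max(0.0, 1.0 - abs(t / l - 2 * cycle - 1))

def tri2(t, l):
    """TRI2: halve the triangle amplitude every 2*l iterations."""
    return tri(t, l) / (2 ** (t // (2 * l)))

def triexp(t, l, gamma):
    """TRIEXP: shrink the triangle amplitude exponentially by gamma**t."""
    return tri(t, l) * gamma ** t

def cyclic_lr(t, k0, k1, g):
    """eta(t) = |k0 - k1| * g(t) + min(k0, k1)."""
    return abs(k0 - k1) * g(t) + min(k0, k1)

# Example: successive cycle peaks (l = 2000) shrink under TRI2 and TRIEXP,
# while TRI keeps returning to the upper bound k0 = 0.1.
for t in (2_000, 6_000, 10_000):
    print(t,
          round(cyclic_lr(t, 0.1, 0.01, lambda t: tri(t, 2_000)), 4),
          round(cyclic_lr(t, 0.1, 0.01, lambda t: tri2(t, 2_000)), 4),
          round(cyclic_lr(t, 0.1, 0.01, lambda t: triexp(t, 2_000, 0.9994)), 4))
```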
We visualize the optimization process of three LR functions, FIX, NSTEP, and TRIEXP, from the above three categories in Figure 1. The corresponding LR policies are FIX (\(k=0.025\)) in black, NSTEP (\(k=0.05, \gamma =0.5, l=[120, 130]\)) in red, and TRIEXP (\(k_0=0.05, k_1=0.25, \gamma =0.9, l=25\)) in yellow. All three LRs start from the same initial point, and the optimization process lasts for 150 iterations. The color on the grid marks the value of the cost function to be minimized, where the global optimum corresponds to red. In general, we observe that different LRs lead to different optimization paths. Although the three LRs exhibit little difference in their optimization paths up to the 70th iteration, FIX lands at a different local optimum at the 149th iteration rather than the global optimum (red). This shows that the accumulated impact of LR values can result in sub-optimal training results. Also, at the beginning of the optimization process, TRIEXP achieves the fastest progress, as shown in Figure 1(a) and (b). This observation indicates that starting with a high LR value may help accelerate the training process. However, the relatively high LR values of TRIEXP in the late stage of training also introduce high "kinetic energy" into the optimization; therefore, the cost values bounce around chaotically, as the TRIEXP optimization path (yellow) in Figure 1(b)–(d) shows, and the model may not converge. This example also illustrates that simply using a single LR function may not achieve optimal optimization progress due to the lack of flexibility to adapt to different training states.
Studies by [30, 36, 37, 41] confirm independently that increasing the LR cyclically during training is more effective than simply decaying the LR continuously as training progresses. On the one hand, with relatively large LR values, cyclic LRs can accelerate training when the model is trapped on a plateau and can escape local optima by updating the model parameters more aggressively. On the other hand, cyclic LRs, especially decaying cyclic LRs (e.g., TRI2, TRIEXP), can further reduce the boundaries of the LR values as training progresses through different stages, such as the training initialization phase, the middle stage of training in which the model being trained is trapped on a plateau or at a local optimum, and the final convergence phase. We argue that the LR should be actively changed to accommodate these different training phases. However, existing LRs, such as fixed LRs, decaying LRs, and cyclic LRs, are all defined by one function with one set of parameters. This design limits their adaptability to different training phases and often results in slow training and sub-optimal training results.
According to our characterization of single-policy and multi-policy LRs, we can view TRI2, COS2, and SIN2 as examples of homogeneous multi-policy LRs, because they are created by combining multiple LR policies: a decaying LR policy is integrated in a simple and straightforward way to refine the cyclic LR policy, decaying (\(k_0, k_1\)) by one half every \(2 l\) iterations. In this section, we have analyzed how and why these multi-policy LRs provide more flexibility and adaptability in selecting and composing a multi-policy LR mechanism for more effective training of DNNs. Our empirical results also confirm consistently that multi-policy LRs hold the potential to further improve the test accuracy of the trained DNN models.
Advanced Composite LRs. Different combinations of multi-policy LRs can be beneficial for further boosting DNN training performance. Consider a new multi-policy LR mechanism created by composing three triangle LRs in three different training stages, with decaying cyclic upper and lower bounds and varying decaying step sizes over the total training iterations. Table 3 shows the effectiveness of this multi-policy LR through a comparative experiment on ResNet-32 with the CIFAR-10 dataset using the Caffe DNN framework. The experiment compares four scenarios. (1) The single-policy LR with the fixed value of 0.1 achieves an accuracy of 86.08% in 61,000 iterations out of the default total of 64,000 training iterations. (2) The decaying LR policy of the NSTEP function with specific \(k, \gamma , l\) achieves a test accuracy of 92.38% with only 53,000 of the 64,000 total iterations. (3) The cyclic LR policy of the SINEXP function with specific LR parameters (\(k_0, k_1, \gamma\)) achieves an accuracy of 92.81% at the end of the 64,000 total training iterations. (4) The advanced composite multi-policy LR mechanism created by LRBench for CIFAR-10 (ResNet-32) achieves the highest test accuracy of 92.91% at the total training rounds of 64,000, compared to the other three LR mechanisms. Figure 2 further illustrates the empirical comparison results. It shows the LR (green curves) and Top-1/Top-5 accuracy (red/purple curves) for the top three performing LR policies: NSTEP, SINEXP, and the advanced composite LR.
We highlight three interesting observations.
First, this multi-policy LR achieved the highest accuracy of 92.91%, followed by the cyclic LR (SINEXP, 92.81%) and the decaying LR (NSTEP, 92.38%), while the fixed LR achieved the lowest accuracy of 86.08%.
Second, from the LR value \(y\)-axis (left) and the accuracy \(y\)-axis (right) in Figure 2(a) and (b), we observe that the accuracy increases as the overall LRs decrease over the iterative training process (\(t\) on the \(x\)-axis). However, without dynamically adjusting the LR values in a cyclic manner, NSTEP missed the opportunity to achieve higher accuracy by simply decaying its initial LR value over the iterative training process.
Third, Figure 2(b) shows that even a cyclic LR policy such as SINEXP fails to compete with the new multi-policy LR that we have designed, because it does not decay fast enough and still uses much higher LR values at the end of the training iterations. In comparison, Figure 2(c) shows the training efficiency of our new multi-policy LR for CIFAR-10 on ResNet-32. It uses three different CLR functions in three different stages of the overall 64,000 training iterations, each stage reducing the cyclic range by decaying \(k_0\) and \(k_1\) by one-tenth and reducing the step size \(l\) by 500 iterations.
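To illustrate how such a staged composite policy could be expressed, below is a hypothetical sketch with three cyclic stages whose bounds and half-cycle lengths shrink from stage to stage. The stage boundaries and parameter values are illustrative assumptions, not the exact schedule produced by LRBench.

```python
# Hypothetical sketch of a staged composite (multi-policy) LR: three cyclic
# policies over consecutive iteration ranges, each stage shrinking the cyclic
# bounds (k0, k1) by one-tenth and the half-cycle length l by 500 iterations.
# Stage boundaries and values are illustrative, not LRBench's exact schedule.

def tri(t, l):
    """Triangle wave in [0, 1] with half-cycle length l."""
    cycle = t // (2 * l)
    return max(0.0, 1.0 - abs(t / l - 2 * cycle - 1))

def make_stage(start, end, k0, k1, l):
    """One cyclic stage: (active range, LR as a function of the global iteration t)."""
    return (start, end,
            lambda t: abs(k0 - k1) * tri(t - start, l) + min(k0, k1))

# Three stages over 64,000 iterations; each stage decays k0 and k1 by 1/10
# and shortens the half-cycle length l by 500 iterations.
stages = [
    make_stage(0,      30_000, k0=0.1,   k1=0.01,   l=2_000),
    make_stage(30_000, 50_000, k0=0.01,  k1=0.001,  l=1_500),
    make_stage(50_000, 64_000, k0=0.001, k1=0.0001, l=1_000),
]

def composite_lr(t):
    """Dispatch to the stage whose iteration range contains t."""
    for start, end, policy in stages:
        if start <= t < end:
            return policy(t)
    return stages[-1][2](t)   # clamp to the last stage beyond the final boundary

for t in (1_000, 31_500, 51_000, 63_999):
    print(t, round(composite_lr(t), 5))
```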
Recent research has shown that variable LRs are advantageous: they empower training with the adaptability to dynamically change the LR value throughout the entire training process, adapting to the needs of different learning modes (learning speed, step size, and direction for changing learning speed) at different training phases. A key challenge in defining a good multi-policy LR, be it homogeneous or heterogeneous, is how to determine the dynamic settings of its parameters, such as \(l\) and \(k_0, k_1\). Existing hyperparameter search tools are typically designed for tuning parameters that are set at the initialization of DNN training and do not change during training, such as the number of DNN layers, the batch size, the number of iterations/epochs, and the optimizer (SGD, Adam, etc.). Selecting and composing LR policies with evolving LR parameters therefore remains an open problem that calls for a systematic approach. In the next section, we present our framework, called LRBench, which creates a good LR policy by dynamically tuning, composing, and selecting LR policies for a learning task and a given dataset on a chosen DNN model.