
Learning Accurate Performance Predictors for Ultrafast Automated Model Compression

Published in: International Journal of Computer Vision

Abstract

In this paper, we propose an ultrafast automated model compression framework called SeerNet for flexible network deployment. Conventional non-differentiable methods discretely search for the desirable compression policy based on the accuracy of exhaustively trained lightweight models, while existing differentiable methods optimize an extremely large supernet to obtain the required compressed model for deployment. Both incur heavy computational cost due to the complex compression policy search and evaluation process. In contrast, we obtain optimal efficient networks by directly optimizing the compression policy with an accurate performance predictor, achieving ultrafast automated model compression for various computational cost constraints without complex compression policy search and evaluation. Specifically, we first train the performance predictor on the accuracy of uncertain compression policies actively selected by an efficient evolutionary search, so that informative supervision is provided to learn an accurate performance predictor at acceptable cost. We then leverage the gradient that maximizes the predicted performance under the barrier complexity constraint for ultrafast acquisition of the desirable compression policy, where adaptive update stepsizes with momentum are employed to enhance the optimality of the acquired pruning and quantization strategy. Compared with state-of-the-art automated model compression methods, experimental results on image classification and object detection show that our method achieves competitive accuracy-complexity trade-offs with a significant reduction in search cost. Code is available at https://github.com/ZiweiWangTHU/SeerNet.
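The optimization described in the abstract, gradient steps that maximize the predicted accuracy under a barrier complexity constraint with momentum, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predictor` and `complexity` are hypothetical differentiable surrogates, and finite differences stand in for automatic differentiation.

```python
import numpy as np

def optimize_policy(predictor, complexity, budget, dim, steps=200,
                    lr=0.1, momentum=0.9, barrier_weight=0.01, eps=1e-4):
    """Gradient ascent on predicted accuracy plus a log-barrier on complexity.
    Finite differences approximate the gradient in this sketch."""
    s = np.random.uniform(0.3, 0.7, dim)  # continuous relaxation of the policy
    v = np.zeros(dim)                     # momentum buffer

    def objective(x):
        slack = budget - complexity(x)
        if slack <= 0:
            return -np.inf                # barrier: infeasible policies excluded
        return predictor(x) + barrier_weight * np.log(slack)

    for _ in range(steps):
        grad = np.array([(objective(s + eps * e) - objective(s - eps * e)) / (2 * eps)
                         for e in np.eye(dim)])
        v = momentum * v + lr * grad      # momentum-accelerated ascent direction
        s = np.clip(s + v, 0.01, 0.99)
    return s
```

With a toy quadratic predictor, this loop converges to a feasible policy near the unconstrained optimum while the barrier keeps it inside the budget.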


References

  • Abbasnejad, E., Teney, D., Parvaneh, A., Shi, J., & van den Hengel, A. (2020). Counterfactual vision and language learning. In: CVPR, pp. 10044–10054.

  • Balcan, M.-F., Broder, A., & Zhang, T. (2007). Margin based active learning. In: COLT, pp. 35–50.

  • Bell, S., Lawrence, Z. C., Bala, K., & Girshick, R. (2016). Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In: CVPR, pp. 2874–2883.

  • Beluch, W. H., Genewein, T., Nürnberger, A., & Köhler, J. M. (2018). The power of ensembles for active learning in image classification. In: CVPR, pp. 9368–9377.

  • Bethge, J., Bartz, C., Yang, H., Chen, Y., & Meinel, C. (2020). Meliusnet: Can binary neural networks achieve mobilenet-level accuracy? arXiv preprint arXiv:2001.05936.

  • Bulat, A., & Tzimiropoulos, G. (2021). Bit-mixer: Mixed-precision networks with runtime bit-width selection. In: ICCV, pp. 5188–5197.

  • Cai, H., Gan, C., Wang, T., Zhang, Z., & Han, S. (2019). Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791.

  • Cai, Z., & Vasconcelos, N. (2020). Rethinking differentiable search for mixed-precision neural networks. In: CVPR, pp. 2349–2358.

  • Chen, G., Choi, W., Yu, X., Han, T., & Chandraker, M. (2017). Learning efficient object detection models with knowledge distillation. In: NIPS, pp. 742–751.

  • Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., & Gopalakrishnan, K. (2018). Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.

  • Dai, X., Zhang, P., Wu, B., Yin, H., Sun, F., Wang, Y., Dukhan, M., Hu, Y., Wu, Y., Jia, Y., et al. (2019). Chamnet: Towards efficient network design through platform-aware model adaptation. In: CVPR, pp. 11398–11407.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: CVPR, pp. 248–255.

  • Denil, M., Shakibi, B., Dinh, L., De Freitas, N., et al. (2013). Predicting parameters in deep learning. In: NIPS, pp. 2148–2156.

  • Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., & Li, J. (2018). Boosting adversarial attacks with momentum. In: CVPR, pp. 9185–9193.

  • Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., & Keutzer, K. (2019). Hawq: Hessian aware quantization of neural networks with mixed-precision. In: ICCV, pp. 293–302.

  • Duch, W., & Korczak, J. (1998). Optimization and global minimization methods suitable for neural networks. Neural Computing Surveys, 2, 163–212.


  • Erin Liong, V., Lu, J., Wang, G., Moulin, P., & Zhou, J. (2015). Deep hashing for compact binary codes learning. In: CVPR, pp. 2475–2483.

  • Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., & Modha, D. S. (2019). Learned step size quantization. arXiv preprint arXiv:1902.08153.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. IJCV, 88(2), 303–338.


  • Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In: ICCV, pp. 6202–6211.

  • Finlay, C., Pooladian, A.-A., & Oberman, A. (2019). The logbarrier adversarial attack: making effective use of decision boundary information. In: ICCV, pp. 4862–4870.

  • Gal, Y., Islam, R., & Ghahramani, Z. (2017). Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910.

  • Gong, R., Liu, X., Jiang, S., Li, T., Hu, P., Lin, J., Yu, F., & Yan, J. (2019). Differentiable soft quantization: Bridging full-precision and low-bit neural networks. arXiv preprint arXiv:1908.05033.

  • Goyal, Y., Wu, Z., Ernst, J., Batra, D., Parikh, D., & Lee, S. (2019). Counterfactual visual explanations. arXiv preprint arXiv:1904.07451.

  • Habi, H. V., Jennings, R. H., & Netzer, A. (2020). Hmq: Hardware friendly mixed precision quantization block for cnns. arXiv preprint arXiv:2007.09952.

  • Han, S., Mao, H., & Dally, W. J. (2015a). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.

  • Han, S., Pool, J., Tran, J., & Dally, W. (2015b). Learning both weights and connections for efficient neural network. In: NIPS, pp. 1135–1143.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR, pp. 770–778.

  • He, Y., Kang, G., Dong, X., Fu, Y., & Yang, Y. (2018a). Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866.

  • He, Y., Zhang, X., & Sun, J. (2017). Channel pruning for accelerating very deep neural networks. In: ICCV, pp. 1389–1397.

  • He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., & Han, S. (2018b). Amc: Automl for model compression and acceleration on mobile devices. In: ECCV, pp. 784–800.

  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

  • Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks. In: NIPS, pp. 4107–4115.

  • Jin, Q., Yang, L., & Liao, Z. (2020). Adabits: Neural network quantization with adaptive bit-widths. In: CVPR, pp. 2146–2156.

  • Joshi, A. J., Porikli, F., & Papanikolopoulos, N. (2009). Multi-class active learning for image classification. In: CVPR, pp. 2372–2379.

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.

  • Li, J., Qi, Q., Wang, J., Ge, C., Li, Y., Yue, Z., & Sun, H. (2019a). Oicsr: Out-in-channel sparsity regularization for compact deep neural networks. In: CVPR, pp. 7046–7055.

  • Li, R., Wang, Y., Liang, F., Qin, H., Yan, J., & Fan, R. (2019b). Fully quantized network for object detection. In: CVPR, pp. 2810–2819.

  • Li, X., & Guo, Y. (2014). Multi-level adaptive active learning for scene classification. In: ECCV, pp. 234–249.

  • Li, Y., Gu, S., Mayer, C., Van Gool, L., & Timofte, R. (2020a). Group sparsity: The hinge between filter pruning and decomposition for network compression. In: CVPR, pp. 8018–8027.

  • Li, Y., Dong, X., & Wei, W. (2020b). Additive powers-of-two quantization: A non-uniform discretization for neural networks. ICLR.

  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Lawrence, Z. C. (2014). Microsoft coco: Common objects in context. In: ECCV, pp. 740–755.

  • Liu, B., Wang, M., Foroosh, H., Tappen, M., & Pensky, M. (2015). Sparse convolutional neural networks. In: CVPR, pp. 806–814.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: Single shot multibox detector. In: ECCV, pp. 21–37.

  • Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., & Cheng, K.-T. (2018a). Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In: ECCV, pp. 722–737.

  • Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K.-T., & Sun, J. (2019). Metapruning: Meta learning for automatic neural network channel pruning. In: ICCV, pp. 3296–3305.

  • Liu, Z., Sun, M., Zhou, T., Huang, G., & Darrell, T. (2018b). Rethinking the value of network pruning. In: ICLR.

  • Lou, Q., Guo, F., Kim, M., Liu, L., & Lei, J. (2019). Autoq: Automated kernel-wise neural network quantization. In: ICLR.

  • Louizos, C., Welling, M., & Kingma, D. P. (2017). Learning sparse neural networks through \( l_0 \) regularization. arXiv preprint arXiv:1712.01312.

  • Louizos, C., Reisser, M., Blankevoort, T., Gavves, E., & Welling, M. (2018). Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875.

  • Luo, W., Schwing, A., & Urtasun, R. (2013). Latent structured active learning. NIPS, 26, 728–736.


  • Melville, P., & Mooney, R. J. (2004). Diverse ensembles for active learning. In: ICML.

  • Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.

  • Molchanov, P., Mallya, A., Tyree, S., Frosio, I., & Kautz, J. (2019). Importance estimation for neural network pruning. In: CVPR, pp. 11264–11272.

  • Peng, H., Wu, J., Chen, S., & Huang, J. (2019). Collaborative channel pruning for deep networks. In: ICML, pp. 5113–5122.

  • Phan, H., Huynh, D., He, Y., Savvides, M., & Shen, Z. (2019). Mobinet: A mobile binary network for image classification. arXiv preprint arXiv:1907.12629.

  • Qu, Z., Zhou, Z., Cheng, Y., & Thiele, L. (2020). Adaptive loss-aware quantization for multi-bit networks. In: CVPR, pp. 7988–7997.

  • Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). Xnor-net: Imagenet classification using binary convolutional neural networks. In: ECCV, pp. 525–542.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520.

  • Settles, B., & Craven, M. (2008). An analysis of active learning strategies for sequence labeling tasks. In: EMNLP, pp. 1070–1079.

  • Siddiqui, Y., Valentin, J., & Nießner, M. (2020). Viewal: Active learning with viewpoint entropy for semantic segmentation. In: CVPR, pp. 9433–9443.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In: ICML, pp. 1139–1147.

  • Uhlich, S., Mauch, L., Yoshiyama, K., Cardinaux, F., Garcia, J. A., Tiedemann, S., Kemp, T., & Nakamura, A. (2019). Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452.

  • Vasisht, D., Damianou, A., Varma, M., & Kapoor, A. (2014). Active learning for sparse bayesian multilabel classification. In: KDD, pp. 472–481.

  • Vijayanarasimhan, S., & Grauman, K. (2014). Large-scale live active learning: Training object detectors with crawled data and crowds. IJCV, 108(1–2), 97–114.

  • Wang, K., Liu, Z., Lin, Y., Lin, J., & Han, S. (2019a). Haq: Hardware-aware automated quantization with mixed precision. In: CVPR, pp. 8612–8620.

  • Wang, T., Wang, K., Cai, H., Lin, J., Liu, Z., Wang, H., Lin, Y., & Han, S. (2020a). Apq: Joint search for network architecture, pruning and quantization policy. In: CVPR, pp. 2078–2087.

  • Wang, W., Song, H., Zhao, S., Shen, J., Zhao, S., Hoi, S. C. H., & Ling, H. (2019b). Learning unsupervised video object segmentation through visual attention. In: CVPR, pp. 3064–3074.

  • Wang, Y., Lu, Y., & Blankevoort, T. (2020b). Differentiable joint pruning and quantization for hardware efficiency. In: ECCV, pp. 259–277.

  • Wang, Z., Zheng, Q., Lu, J., & Zhou, J. (2020c). Deep hashing with active pairwise supervision. In: ECCV, pp. 522–538.

  • Wang, Z., Lu, J., & Zhou, J. (2021). Learning channel-wise interactions for binary convolutional neural networks. TPAMI, 43(10), 3432–3445.


  • Wang, Z., Xiao, H., Lu, J., & Zhou, J. (2021b). Generalizable mixed-precision quantization via attribution rank preservation. In: ICCV, pp. 5291–5300.

  • Wang, Z., Lu, J., Wu, Z., & Zhou, J. (2022). Learning efficient binarized object detectors with information compression. TPAMI, 44(6), 3082–3095.


  • Wang, Z., Wang, C., Xu, X., Zhou, J., & Lu, J. (2020b). Quantformer: Learning extremely low-precision vision transformers. TPAMI, pp. 1–14. https://doi.org/10.1109/TPAMI.2022.3229313.

  • Wen, W., Liu, H., Chen, Y., Li, H., Bender, G., & Kindermans, P.-J. (2020). Neural predictor for neural architecture search. In: ECCV, pp. 660–676.

  • Wu, Z., Wang, Z., Wei, Z., Wei, Y., & Yan, H. (2020). Smart explorer: Recognizing objects in dense clutter via interactive exploration. In: IROS, pp. 6600–6607.

  • Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sandler, M., Sze, V., & Adam, H. (2018). Netadapt: Platform-aware neural network adaptation for mobile applications. In: ECCV, pp. 285–300.

  • Yu, H., Han, Q., Li, J., Shi, J., Cheng, G., & Fan, B. (2020). Search what you want: Barrier penalty nas for mixed precision quantization. arXiv preprint arXiv:2007.10026.

  • Zhang, D., Yang, J., Ye, D., & Hua, G. (2018). Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In: ECCV, pp. 365–382.


Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 62125603, and in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI).

Author information


Corresponding author

Correspondence to Jiwen Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals.

Additional information

Communicated by Kong Hui.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices


Appendix A: Mathematical Formulation from (11) to (12) of the Manuscript

Since we leverage deterministic neural networks to predict the accuracy of various lightweight models, we present an alternative objective function to tractably calculate the objective (11) in the manuscript based on importance sampling. We first rewrite the importance weight in (11) of the manuscript via the analytical form of the Dirac delta function:

$$\begin{aligned} \frac{p(a|\varvec{s})}{p(a|\hat{\varvec{s}})}=\frac{\lim \limits _{\epsilon \rightarrow 0}\frac{1}{\pi }\frac{\epsilon }{\epsilon ^2+(a-a_{10})^2}}{\lim \limits _{\epsilon \rightarrow 0}\frac{1}{\pi }\frac{\epsilon }{\epsilon ^2+(a-a_{20})^2}} \end{aligned}$$
(16)

where \(a_{10}\) and \(a_{20}\) are distribution parameters of the original and perturbed policies parameterized by the performance predictor. As \(\epsilon \) is a higher-order infinitesimal of \(a-a_{10}\) and \(a-a_{20}\) according to the definition of the Dirac delta function, the importance weight can be rewritten as follows:

$$\begin{aligned}&\frac{p(a|\varvec{s})}{p(a|\varvec{\hat{s}})}=\frac{(a-a_{20})^2}{(a-a_{10})^2}\nonumber \\&=\frac{(a-a_{10})^2+(a_{10}-a_{20})^2+2(a-a_{10})(a_{10}-a_{20})}{(a-a_{10})^2} \end{aligned}$$

Since \(a\) denotes the predicted accuracy parameterized by \(a_{10}\), the difference between \(a\) and \(a_{10}\) can be assumed to be a small constant \(\gamma _{10}\) in the deterministic setting. Therefore, the difference between \(a\) and \(a_{10}\) is far smaller than that between \(a_{10}\) and \(a_{20}\). We obtain the following relationship between \(\frac{p(a|\varvec{s})}{p(a|\hat{\varvec{s}})}\) and \(a_{10}-a_{20}\):

$$\begin{aligned} \frac{p(a|\varvec{s})}{p(a|\hat{\varvec{s}})}=(a_{10}-a_{20})^2/\gamma _{10} \propto (a_{10}-a_{20})^2 \end{aligned}$$
(17)

Here \(a_{10}\) and \(a_{20}\) are represented by \(f(\varvec{s})\) and \(f(\hat{\varvec{s}})\), and the loss function \(l(f(\varvec{s}),a)\) for accuracy prediction is set to the squared difference between the predicted and actual accuracy, \((f(\varvec{s}^i)-a^i)^2\). Therefore, we optimize the alternative objective (12) in the manuscript to provide informative supervision for performance predictor learning.
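As an illustrative numeric check (not from the paper), the exact importance weight can be compared against the proportional form in (17): when the hypothetical deviation gamma between \(a\) and \(a_{10}\) is small relative to \(a_{10}-a_{20}\), the weight is dominated by the \((a_{10}-a_{20})^2\) term.

```python
# Illustrative check of the approximation behind (17), with hypothetical values:
# when a - a10 = gamma is small compared with a10 - a20, the exact weight
# (a - a20)^2 / (a - a10)^2 is dominated by the (a10 - a20)^2 term.
def importance_weight(a, a10, a20):
    return (a - a20) ** 2 / (a - a10) ** 2

gamma = 1e-3
a10, a20 = 0.92, 0.85            # hypothetical predicted accuracies
a = a10 + gamma                  # prediction deviates from a10 by a small constant
exact = importance_weight(a, a10, a20)
dominant = (a10 - a20) ** 2 / gamma ** 2
assert abs(exact - dominant) / exact < 0.05   # within 5% for this gamma
```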

Fig. 6

The actual and predicted accuracy for different compression policies across various datasets and architectures, where both random sampling and our active sampling are evaluated. The architectures for evaluation on CIFAR-10 include VGG-small and ResNet20, and those on ImageNet contain MobileNet-V2, ResNet18 and ResNet50. The horizontal axis represents the predicted accuracy and the vertical axis the actual accuracy. The mean squared errors (MSE) are also shown in the figure

Fig. 7

Visualization of the optimal compression policies obtained by our ultrafast SeerNet under given computational cost constraints on image classification. We utilized the VGG-small and ResNet20 architectures on CIFAR-10 and the MobileNet-V2, ResNet18 and ResNet50 networks on ImageNet

Appendix B: The Actual and Predicted Accuracies for Sampled Lightweight Networks

We employed the VGG-small (Zhang et al., 2018) and ResNet20 (He et al., 2016) architectures for automated model compression on CIFAR-10, and compressed the MobileNet-V2 (Sandler et al., 2018), ResNet18 and ResNet50 architectures on ImageNet. We trained the performance predictor on the accuracy of 800 randomly sampled and 800 actively sampled compressed models respectively, and regressed the accuracy of 50 randomly sampled lightweight models via the well-trained performance predictor. We show the actual and predicted accuracy of random sampling and our active sampling in Fig. 6, where the MSE between the actual and predicted accuracy is also reported. The predicted accuracy is generally close to the actual one across the datasets, which shows the effectiveness of the performance predictor for automated model compression. Meanwhile, our active sampling strategy selects uncertain compression policies that provide informative supervision for performance predictor learning, so that the predicted accuracy is more precise than with the random sampling strategy. Since the performance variance across different quantization and pruning strategies is larger on large-scale datasets, our active sampling strategy offers greater benefits for performance predictor learning on ImageNet. Although deeper architectures with large search spaces such as MobileNet-V2 and ResNet50 yield higher MSE for their performance predictors, the active sampling policy is still capable of providing informative supervision for accurate performance predictor learning, as the MSE remains below \(5\times 10^{-4}\).
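The predictor-fidelity measure reported in Fig. 6 is a plain mean squared error between actual and predicted accuracies; a sketch with hypothetical held-out values:

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted model accuracies."""
    assert len(actual) == len(predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# hypothetical held-out accuracies for five sampled lightweight models
actual    = [0.912, 0.874, 0.935, 0.901, 0.888]
predicted = [0.910, 0.880, 0.930, 0.905, 0.885]
error = mse(actual, predicted)   # on the order of 1e-5, below the 5e-4 level
```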

Table 7 The accuracy (%) and BOPs (G) variance for Table 1 in the manuscript, acquired by running the experiments 5 times
Table 8 The accuracy (%) and BOPs (K) variance for Table 2 in the manuscript, acquired by running the experiments 5 times

Appendix C: Visualization of the Optimal Compression Policy

We show the bitwidths of weights and activations and the pruning ratio across different layers for image classification in Fig. 7, where the computational cost constraint is 2.4G and 0.2G BOPs for compressing VGG-small and ResNet20 on CIFAR-10, and 8G, 33G and 62G BOPs for compressing MobileNet-V2, ResNet18 and ResNet50 on ImageNet respectively.
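The budget check above can be sketched per layer. Note this follows the common convention BOPs ≈ MACs × weight bits × activation bits, scaled here linearly by the fraction of kept channels; the paper's exact accounting may differ, and the layer numbers below are hypothetical.

```python
def layer_bops(macs, w_bits, a_bits, prune_ratio):
    """Approximate bit-operations of a pruned, quantized conv layer:
    MACs scaled by the kept-channel fraction and by the weight/activation
    bitwidths (a common convention, not necessarily the paper's exact one)."""
    kept = 1.0 - prune_ratio
    return macs * kept * w_bits * a_bits

# hypothetical 3-layer policy: (MACs, weight bits, activation bits, pruning ratio)
policy = [(110e6, 4, 6, 0.5), (230e6, 3, 5, 0.6), (90e6, 5, 6, 0.3)]
total_bops = sum(layer_bops(*layer) for layer in policy)
assert total_bops <= 8e9   # e.g. within the 8G BOPs budget used for MobileNet-V2
```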

Because sparse networks require precise weights and activations to maintain representational capacity, and increasing the bitwidth of layers with high pruning ratios adds only slight computational cost, layers with high pruning ratios are usually assigned large bitwidths in the optimal compression policy. The optimal compression policy searched by the state-of-the-art method DJPQ (Wang et al., 2020b) prunes only the bottom layers, since it prunes the network sequentially from bottom to top, whereas our SeerNet simultaneously optimizes the pruning strategy for all layers and preserves the informative channels while removing redundant ones.

For the VGG-small and ResNet20 architectures trained on CIFAR-10, the bitwidth varies only slightly across layers. Meanwhile, the pruning ratio is high and the bitwidth low for all layers, indicating significant over-parameterization of both architectures on CIFAR-10. The large bitwidths and low pruning ratios in the optimal compression policy for MobileNet-V2 indicate that this compact architecture is hard to compress further without a sizable accuracy drop. In contrast, ResNet50 is compressed to extremely low bitwidths, which demonstrates its significant redundancy. Moreover, the bitwidth of activations is usually larger than that of weights, which suggests that model accuracy is more sensitive to activation quantization than to weight quantization.

Appendix D: Implementation Details in Section 4.2.2

To validate the effectiveness of the presented performance predictor, we utilized reinforcement learning and evolutionary algorithms to search for the optimal compression policy via our performance predictor. For reinforcement learning, we leveraged deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015). Compared with Wang et al. (2019a), we added an extra state \(p_k\) representing the pruning ratio of the \(k\)-th convolutional layer to the state space, and an extra action \(a_k^p\) to sample the pruning ratio of the compression strategy to the action space. The accuracy of the compressed model used in the reward function was obtained via our performance predictor. Other implementation details were the same as in Wang et al. (2019a). For the evolutionary algorithm, we followed the implementation details in Wang et al. (2020a) to search for the optimal compression policy, except that we removed the network architecture components from each candidate. The accuracy of each candidate used in the fitness function was acquired via our performance predictor. We imposed the resource constraint by limiting the BOPs of the compressed models during the search process.
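The evolutionary variant can be sketched as a toy loop in which the trained predictor replaces exhaustive evaluation in the fitness function and infeasible candidates are rejected by a BOPs check. All names and hyperparameters below are hypothetical, not those of Wang et al. (2020a).

```python
import random

def evolutionary_search(predict_acc, within_budget, dim,
                        pop=50, gens=20, mut=0.1, seed=0):
    """Toy evolutionary search: candidates are vectors of per-layer compression
    decisions in [0, 1]; `predict_acc` stands in for exhaustive evaluation and
    `within_budget` enforces the complexity constraint."""
    rng = random.Random(seed)

    def sample():
        while True:
            c = [rng.random() for _ in range(dim)]
            if within_budget(c):
                return c

    population = [sample() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=predict_acc, reverse=True)
        parents = population[:pop // 2]           # elitist selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, dim)
            child = a[:cut] + b[cut:]             # one-point crossover
            child = [min(1.0, max(0.0, g + rng.gauss(0, mut))) for g in child]
            if within_budget(child):              # reject infeasible offspring
                children.append(child)
        population = parents + children
    return max(population, key=predict_acc)
```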

Table 9 The accuracy (%), MACs (K), BOPs (G) variance for Table 3 in the manuscript, acquired by running the experiments 5 times
Table 10 The top-1 accuracy (%), MACs (G), BOPs (G) variance for Table 4 in the manuscript, acquired by running the experiments 5 times
Table 11 The mAP (%), MACs (G), BOPs (G) variance for Table 5 in the manuscript, acquired by running the experiments 5 times
Table 12 The mAP (%), MACs (G), BOPs (G) variance for Table 6 in the manuscript, acquired by running the experiments 5 times
Table 13 The MACs (G), BOPs (G), top-1 classification accuracy and search cost on ImageNet with pruning-only methods on MobileNet-V2

Appendix E: Performance Variance of SeerNet

To show the performance variance of our SeerNet, we ran SeerNet 5 times, including compression policy search and backbone training, for the results in Tables 1–6. We report the mean and standard deviation of the accuracy and computational complexity, while the search cost and training cost are almost identical across runs. We leveraged the same complexity budgets as in Tables 1–6 of the manuscript, and Tables 7–12 show the experimental results with mean and standard deviation.

Appendix F: Performance of Pruning-only and Quantization-only Strategies

In this section, we evaluate SeerNet in experimental settings with pruning-only and quantization-only strategies, where the backbone networks are compressed only by pruning or only by quantization policies respectively. We implemented SeerNet following the details in Section 4.1 of the manuscript, except that the compression policy sampling for performance predictor training contains only pruning or only quantization in the two settings. The compared methods include AMC (He et al., 2018b), NetAdapt (Yang et al., 2018) and MetaPruning (Liu et al., 2019) for pruning, and HAQ (Wang et al., 2019a) and HAWQ (Dong et al., 2019) for quantization. For the pruning-only setting, since the only experimental setting shared by AMC, NetAdapt and MetaPruning employs MobileNet-V2 with 0.22G MACs, we assign a similar complexity constraint for fair comparison. For the quantization-only setting, we compare SeerNet with HAQ and HAWQ on MobileNet-V2, ResNet18 and ResNet50. Tables 13 and 14 show the results for the pruning-only and quantization-only strategies respectively, where our SeerNet still outperforms the baseline methods by a sizable margin with much lower search cost.

Table 14 The MACs (G), BOPs (G), top-1 classification accuracy and search cost on ImageNet with quantization-only methods on MobileNet-V2, ResNet18 and ResNet50
Fig. 8

Several examples of the compression policy complexity during optimization with different policy initializations, where no optimization path stops early by exceeding the computational cost budget. Different colors represent different initializations; the BOPs constraint is 30G

Table 15 The BOPs (G), top-1 classification accuracy and search cost on ImageNet with the random search baseline and our SeerNet on ResNet18

Appendix G: Visualization of Policy Optimization

The presented barrier complexity loss in the compression policy optimization is amplified significantly as the model complexity approaches the cost budget, which strictly limits the acquired pruning and quantization strategies to the complexity constraint. The complexity of the pruning and quantization policy along the optimization path therefore usually keeps a margin from the computational cost budget. To empirically demonstrate the effectiveness of the barrier complexity loss, we ran the compression policy search 50 times with different initializations, and compression policies with complexity above the budget were never observed during optimization. Figure 8 shows several examples of the compression policy complexity during optimization with different policy initializations, where no optimization path stops early due to exceeding the computational cost budget.
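A log-barrier is one standard way to realize the behavior described above: the loss grows without bound as the complexity approaches the budget, so gradient steps are repelled from the boundary. This sketch uses a hypothetical weight and the 30G BOPs budget from Fig. 8, and is not necessarily the paper's exact form.

```python
import math

def barrier_loss(bops, budget, weight=0.05):
    """Log-barrier on the complexity constraint: the loss diverges as `bops`
    approaches `budget`, so feasible policies keep a margin from the budget."""
    slack = budget - bops
    if slack <= 0:
        return math.inf        # infeasible policies are assigned infinite loss
    return -weight * math.log(slack / budget)

# the closer the policy's complexity is to the 30G BOPs budget, the larger the loss
losses = [barrier_loss(b, 30e9) for b in (10e9, 25e9, 29.9e9)]
assert losses[0] < losses[1] < losses[2]
```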

Appendix H: Comparison with the Random Selection Baseline Method

To show the effectiveness of our search method, we compared our SeerNet with a random selection baseline (RS) using ResNet18 on ImageNet. The RS pipeline is as follows: (a) randomly sample k compression policies that satisfy the BOPs constraint, (b) exhaustively evaluate the resulting lightweight architectures, and (c) select the one with the highest accuracy. Under the same complexity constraint as in Table 4 of the manuscript, we randomly sampled 5 compression strategies satisfying the BOPs budget for each random selection. Table 15 shows the results: the BOPs of RS remain far from the budget, and RS underperforms our SeerNet by a large margin in accuracy.
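The RS pipeline in steps (a)-(c) can be sketched as follows, with `sample_policy`, `within_budget` and `evaluate` as hypothetical placeholders for the policy sampler, the BOPs check and the costly exhaustive evaluation:

```python
import random

def random_selection(sample_policy, within_budget, evaluate, k=5, seed=0):
    """RS baseline sketch: draw k feasible compression policies at random,
    evaluate each exhaustively, and keep the most accurate one."""
    rng = random.Random(seed)
    candidates = []
    while len(candidates) < k:
        policy = sample_policy(rng)
        if within_budget(policy):          # step (a): feasible random policies
            candidates.append(policy)
    return max(candidates, key=evaluate)   # steps (b)+(c): evaluate, pick best
```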

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, Z., Lu, J., Xiao, H. et al. Learning Accurate Performance Predictors for Ultrafast Automated Model Compression. Int J Comput Vis 131, 1761–1783 (2023). https://doi.org/10.1007/s11263-023-01783-0
