
Hierarchical Multi-Attention Transfer for Knowledge Distillation

Published: 27 September 2023

Abstract

Knowledge distillation (KD) is a powerful and widely applicable technique for compressing deep learning models. Its main idea is to transfer knowledge from a large teacher model to a small student model, and the attention mechanism has been intensively explored for this purpose because of its great flexibility in handling different teacher-student architectures. However, existing attention-based methods usually transfer similar attention knowledge from the intermediate layers of deep neural networks, leaving the hierarchical structure of deep representation learning poorly investigated for knowledge distillation. In this paper, we propose a hierarchical multi-attention transfer (HMAT) framework, in which different types of attention are used to transfer knowledge at different levels of deep representation learning. Specifically, position-based and channel-based attention knowledge characterize the knowledge in low-level and high-level feature representations, respectively, while activation-based attention knowledge characterizes the knowledge in both mid-level and high-level feature representations. Extensive experiments on three popular visual recognition tasks, image classification, image retrieval, and object detection, demonstrate that HMAT significantly outperforms recent state-of-the-art KD methods.
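
To make the hierarchy described in the abstract concrete, below is a minimal PyTorch sketch of how position-, activation-, and channel-based attention losses could be attached to low-, mid-, and high-level teacher-student feature pairs. This is not the authors' HMAT implementation: the specific attention definitions, normalizations, combination weights, and helper names (activation_attention, position_attention, channel_attention, hierarchical_attention_loss) are illustrative assumptions, and only the activation-based map follows the widely used attention-transfer form (channel-wise sum of squared activations).

# Minimal sketch (illustrative assumptions, not the authors' code) of combining
# position-, activation-, and channel-based attention losses over low-, mid-,
# and high-level teacher/student feature pairs.
import torch
import torch.nn.functional as F


def activation_attention(feat: torch.Tensor, p: int = 2) -> torch.Tensor:
    # Activation-based attention: |F|^p summed over channels, flattened and
    # L2-normalized per sample. Shape: (B, C, H, W) -> (B, H*W).
    att = feat.abs().pow(p).sum(dim=1).flatten(1)
    return F.normalize(att, dim=1)


def position_attention(feat: torch.Tensor) -> torch.Tensor:
    # Position-based attention (assumed form): a softmax distribution over the
    # spatial positions of the mean squared activation. Shape: (B, H*W).
    energy = feat.pow(2).mean(dim=1).flatten(1)
    return F.softmax(energy, dim=1)


def channel_attention(feat: torch.Tensor) -> torch.Tensor:
    # Channel-based attention (assumed form): globally pooled squared activations
    # per channel, L2-normalized. Shape: (B, C). Comparing teacher and student this
    # way assumes matching channel counts (otherwise insert a 1x1 conv adapter).
    desc = feat.pow(2).mean(dim=(2, 3))
    return F.normalize(desc, dim=1)


def hierarchical_attention_loss(student_feats, teacher_feats,
                                weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    # student_feats / teacher_feats: (low, mid, high) feature maps with matching
    # spatial sizes at each level (interpolate first if they differ).
    s_low, s_mid, s_high = student_feats
    t_low, t_mid, t_high = teacher_feats
    # Position-based attention on low-level features.
    loss_pos = F.mse_loss(position_attention(s_low), position_attention(t_low))
    # Activation-based attention on mid- and high-level features.
    loss_act = (F.mse_loss(activation_attention(s_mid), activation_attention(t_mid))
                + F.mse_loss(activation_attention(s_high), activation_attention(t_high)))
    # Channel-based attention on high-level features.
    loss_ch = F.mse_loss(channel_attention(s_high), channel_attention(t_high))
    return weights[0] * loss_pos + weights[1] * loss_act + weights[2] * loss_ch


if __name__ == "__main__":
    # Toy usage with random feature maps; channel counts differ except at the
    # high-level stage, where channel attention is compared directly.
    s = (torch.randn(4, 32, 56, 56), torch.randn(4, 64, 28, 28), torch.randn(4, 128, 14, 14))
    t = (torch.randn(4, 64, 56, 56), torch.randn(4, 256, 28, 28), torch.randn(4, 128, 14, 14))
    print(hierarchical_attention_loss(s, t).item())

In a full training setup, such a loss would typically be added with a tunable weight to the student's task loss (and often a logit-based KD term), with the feature maps taken from matching stages of the teacher and student backbones.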

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
February 2024
548 pages
EISSN:1551-6865
DOI:10.1145/3613570
  • Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2023
Online AM: 20 October 2022
Accepted: 02 October 2022
Revised: 16 August 2022
Received: 28 March 2022
Published in TOMM Volume 20, Issue 2


Author Tags

  1. Model compression
  2. knowledge distillation
  3. hierarchical attention transfer

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Qing Lan Project of Colleges and Universities of Jiangsu Province in 2020
  • Australian Research Council


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1,174
  • Downloads (Last 6 weeks): 110
Reflects downloads up to 10 Nov 2024


Citations

Cited By

  • (2025) Explainability-based knowledge distillation. Pattern Recognition, 159, Article 111095. DOI: 10.1016/j.patcog.2024.111095. Online publication date: Mar-2025.
  • (2025) Applications of knowledge distillation in remote sensing: A survey. Information Fusion, 115, Article 102742. DOI: 10.1016/j.inffus.2024.102742. Online publication date: Mar-2025.
  • (2024) Recognition of sports and daily activities through deep learning and convolutional block attention. PeerJ Computer Science, 10, e2100. DOI: 10.7717/peerj-cs.2100. Online publication date: 31-May-2024.
  • (2024) A Multi-Level Adaptive Lightweight Net for Damaged Road Marking Detection Based on Knowledge Distillation. Remote Sensing, 16(14), 2593. DOI: 10.3390/rs16142593. Online publication date: 16-Jul-2024.
  • (2024) Denoising Multiscale Back-Projection Feature Fusion for Underwater Image Enhancement. Applied Sciences, 14(11), 4395. DOI: 10.3390/app14114395. Online publication date: 22-May-2024.
  • (2024) Air Traffic Flow Prediction with Spatiotemporal Knowledge Distillation Network. Journal of Advanced Transportation, 2024, 1-17. DOI: 10.1155/2024/4349402. Online publication date: 15-May-2024.
  • (2024) EC-YOLOX: A Deep-Learning Algorithm for Floating Objects Detection in Ground Images of Complex Water Environments. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 7359-7370. DOI: 10.1109/JSTARS.2024.3367713. Online publication date: 2024.
  • (2024) MAFIKD: A Real-Time Pest Detection Method Based on Knowledge Distillation. IEEE Sensors Journal, 24(20), 33715-33735. DOI: 10.1109/JSEN.2024.3449628. Online publication date: 15-Oct-2024.
  • (2024) Federated Learning With Selective Knowledge Distillation Over Bandwidth-constrained Wireless Networks. ICC 2024 - IEEE International Conference on Communications, 3476-3481. DOI: 10.1109/ICC51166.2024.10622906. Online publication date: 9-Jun-2024.
  • (2024) Fine-Tuning Optimization of Small Language Models: A Novel Graph-Theoretical Approach for Efficient Prompt Engineering. 2024 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 1-7. DOI: 10.1109/BMSB62888.2024.10608341. Online publication date: 19-Jun-2024.
