
Hierarchical Multi-Attention Transfer for Knowledge Distillation

Published: 27 September 2023

Abstract

Knowledge distillation (KD) is a powerful and widely applicable technique for compressing deep learning models. Its main idea is to transfer knowledge from a large teacher model to a small student model, and the attention mechanism has been intensively explored for this purpose because of its great flexibility in handling different teacher-student architectures. However, existing attention-based methods usually transfer similar attention knowledge from the intermediate layers of deep neural networks, leaving the hierarchical structure of deep representation learning poorly investigated for knowledge distillation. In this paper, we propose a hierarchical multi-attention transfer (HMAT) framework, in which different types of attention are used to transfer knowledge at different levels of deep representation learning. Specifically, position-based and channel-based attention knowledge characterize the knowledge in low-level and high-level feature representations, respectively, while activation-based attention knowledge characterizes the knowledge in both mid-level and high-level feature representations. Extensive experiments on three popular visual recognition tasks, image classification, image retrieval, and object detection, demonstrate that HMAT significantly outperforms recent state-of-the-art KD methods.
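
To make the hierarchy described in the abstract concrete, below is a minimal PyTorch sketch of how position-, activation-, and channel-based attention losses could be attached to low-, mid-, and high-level teacher-student feature pairs. This is not the authors' HMAT implementation: the specific attention definitions, normalizations, combination weights, and helper names (activation_attention, position_attention, channel_attention, hierarchical_attention_loss) are illustrative assumptions, and only the activation-based map follows the widely used attention-transfer form (channel-wise sum of squared activations).

# Minimal sketch (illustrative assumptions, not the authors' code) of combining
# position-, activation-, and channel-based attention losses over low-, mid-,
# and high-level teacher/student feature pairs.
import torch
import torch.nn.functional as F


def activation_attention(feat: torch.Tensor, p: int = 2) -> torch.Tensor:
    # Activation-based attention: |F|^p summed over channels, flattened and
    # L2-normalized per sample. Shape: (B, C, H, W) -> (B, H*W).
    att = feat.abs().pow(p).sum(dim=1).flatten(1)
    return F.normalize(att, dim=1)


def position_attention(feat: torch.Tensor) -> torch.Tensor:
    # Position-based attention (assumed form): a softmax distribution over the
    # spatial positions of the mean squared activation. Shape: (B, H*W).
    energy = feat.pow(2).mean(dim=1).flatten(1)
    return F.softmax(energy, dim=1)


def channel_attention(feat: torch.Tensor) -> torch.Tensor:
    # Channel-based attention (assumed form): globally pooled squared activations
    # per channel, L2-normalized. Shape: (B, C). Comparing teacher and student this
    # way assumes matching channel counts (otherwise insert a 1x1 conv adapter).
    desc = feat.pow(2).mean(dim=(2, 3))
    return F.normalize(desc, dim=1)


def hierarchical_attention_loss(student_feats, teacher_feats,
                                weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    # student_feats / teacher_feats: (low, mid, high) feature maps with matching
    # spatial sizes at each level (interpolate first if they differ).
    s_low, s_mid, s_high = student_feats
    t_low, t_mid, t_high = teacher_feats
    # Position-based attention on low-level features.
    loss_pos = F.mse_loss(position_attention(s_low), position_attention(t_low))
    # Activation-based attention on mid- and high-level features.
    loss_act = (F.mse_loss(activation_attention(s_mid), activation_attention(t_mid))
                + F.mse_loss(activation_attention(s_high), activation_attention(t_high)))
    # Channel-based attention on high-level features.
    loss_ch = F.mse_loss(channel_attention(s_high), channel_attention(t_high))
    return weights[0] * loss_pos + weights[1] * loss_act + weights[2] * loss_ch


if __name__ == "__main__":
    # Toy usage with random feature maps; channel counts differ except at the
    # high-level stage, where channel attention is compared directly.
    s = (torch.randn(4, 32, 56, 56), torch.randn(4, 64, 28, 28), torch.randn(4, 128, 14, 14))
    t = (torch.randn(4, 64, 56, 56), torch.randn(4, 256, 28, 28), torch.randn(4, 128, 14, 14))
    print(hierarchical_attention_loss(s, t).item())

In a full training setup, such a loss would typically be added with a tunable weight to the student's task loss (and often a logit-based KD term), with the feature maps taken from matching stages of the teacher and student backbones.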

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
February 2024
548 pages
EISSN:1551-6865
DOI:10.1145/3613570
  • Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2023
Online AM: 20 October 2022
Accepted: 02 October 2022
Revised: 16 August 2022
Received: 28 March 2022
Published in TOMM Volume 20, Issue 2


Author Tags

  1. Model compression
  2. knowledge distillation
  3. hierarchical attention transfer

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Qing Lan Project of Colleges and Universities of Jiangsu Province in 2020
  • Australian Research Council


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1,174
  • Downloads (Last 6 weeks): 110
Reflects downloads up to 10 Nov 2024


Citations

Cited By

  • (2025) Explainability-based knowledge distillation. Pattern Recognition, 159, Article 111095. DOI: 10.1016/j.patcog.2024.111095. Online publication date: Mar-2025.
  • (2025) Applications of knowledge distillation in remote sensing: A survey. Information Fusion, 115, Article 102742. DOI: 10.1016/j.inffus.2024.102742. Online publication date: Mar-2025.
  • (2024) Recognition of sports and daily activities through deep learning and convolutional block attention. PeerJ Computer Science, 10, e2100. DOI: 10.7717/peerj-cs.2100. Online publication date: 31-May-2024.
  • (2024) A Multi-Level Adaptive Lightweight Net for Damaged Road Marking Detection Based on Knowledge Distillation. Remote Sensing, 16(14), 2593. DOI: 10.3390/rs16142593. Online publication date: 16-Jul-2024.
  • (2024) Denoising Multiscale Back-Projection Feature Fusion for Underwater Image Enhancement. Applied Sciences, 14(11), 4395. DOI: 10.3390/app14114395. Online publication date: 22-May-2024.
  • (2024) Air Traffic Flow Prediction with Spatiotemporal Knowledge Distillation Network. Journal of Advanced Transportation, 2024, 1-17. DOI: 10.1155/2024/4349402. Online publication date: 15-May-2024.
  • (2024) EC-YOLOX: A Deep-Learning Algorithm for Floating Objects Detection in Ground Images of Complex Water Environments. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17, 7359-7370. DOI: 10.1109/JSTARS.2024.3367713. Online publication date: 2024.
  • (2024) MAFIKD: A Real-Time Pest Detection Method Based on Knowledge Distillation. IEEE Sensors Journal, 24(20), 33715-33735. DOI: 10.1109/JSEN.2024.3449628. Online publication date: 15-Oct-2024.
  • (2024) Federated Learning With Selective Knowledge Distillation Over Bandwidth-constrained Wireless Networks. ICC 2024 - IEEE International Conference on Communications, 3476-3481. DOI: 10.1109/ICC51166.2024.10622906. Online publication date: 9-Jun-2024.
  • (2024) Fine-Tuning Optimization of Small Language Models: A Novel Graph-Theoretical Approach for Efficient Prompt Engineering. 2024 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 1-7. DOI: 10.1109/BMSB62888.2024.10608341. Online publication date: 19-Jun-2024.
