Hierarchical Multi-Attention Transfer for Knowledge Distillation

Published: 27 September 2023

    Abstract

    Knowledge distillation (KD) is a powerful and widely applicable technique for compressing deep learning models. The main idea is to transfer knowledge from a large teacher model to a small student model, and the attention mechanism has been intensively explored for this purpose owing to its flexibility in handling different teacher-student architectures. However, existing attention-based methods usually transfer similar attention knowledge from the intermediate layers of deep neural networks, leaving the hierarchical structure of deep representation learning poorly investigated for knowledge distillation. In this paper, we propose a hierarchical multi-attention transfer (HMAT) framework, where different types of attention are used to transfer knowledge at different levels of deep representation learning. Specifically, position-based and channel-based attention knowledge characterize the knowledge in low-level and high-level feature representations, respectively, while activation-based attention knowledge characterizes the knowledge in both mid-level and high-level feature representations. Extensive experiments on three popular visual recognition tasks, image classification, image retrieval, and object detection, demonstrate that the proposed HMAT significantly outperforms recent state-of-the-art KD methods.
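
    The abstract describes the three attention types only at a high level, so the following is a minimal PyTorch-style sketch rather than the authors' implementation: position-based, channel-based, and activation-based attention are rendered with common formulations from the attention-transfer literature, and the function names, the exact level-to-attention assignment, and the equal loss weights are assumptions for illustration only.

    # Minimal sketch (not the authors' code) of the three attention terms named in the
    # abstract, assuming standard feature maps of shape (N, C, H, W).
    import torch
    import torch.nn.functional as F

    def activation_attention(feat, p=2):
        # Activation-based attention: channel-wise p-norm pooling of the feature map
        # (as in Zagoruyko & Komodakis's attention transfer), flattened and L2-normalized.
        att = feat.pow(p).mean(dim=1)              # (N, H, W)
        return F.normalize(att.flatten(1), dim=1)  # (N, H*W)

    def position_attention(feat):
        # Position-based attention: cosine affinities between spatial locations,
        # a common way to encode low-level spatial structure.
        x = F.normalize(feat.flatten(2), dim=1)    # (N, C, H*W), unit-norm per location
        return torch.bmm(x.transpose(1, 2), x)     # (N, H*W, H*W)

    def channel_attention(feat):
        # Channel-based attention: cosine affinities between channels, often used to
        # capture high-level semantic relationships.
        x = F.normalize(feat.flatten(2), dim=2)    # (N, C, H*W), unit-norm per channel
        return torch.bmm(x, x.transpose(1, 2))     # (N, C, C)

    def hierarchical_attention_loss(student_feats, teacher_feats):
        # Combine the attention terms over (low, mid, high)-level feature pairs, roughly
        # following the abstract: position at the low level, activation at the mid and
        # high levels, channel at the high level. Equal weights are an assumption.
        # Spatial sizes must match per level; the channel term additionally assumes
        # matching channel counts (otherwise a 1x1 projection would be needed).
        (low_s, mid_s, high_s), (low_t, mid_t, high_t) = student_feats, teacher_feats
        loss = F.mse_loss(position_attention(low_s), position_attention(low_t))
        loss = loss + F.mse_loss(activation_attention(mid_s), activation_attention(mid_t))
        loss = loss + F.mse_loss(activation_attention(high_s), activation_attention(high_t))
        loss = loss + F.mse_loss(channel_attention(high_s), channel_attention(high_t))
        return loss

    if __name__ == "__main__":
        # Toy check with random feature maps of matching shapes per level.
        s = [torch.randn(2, 16, 32, 32), torch.randn(2, 32, 16, 16), torch.randn(2, 64, 8, 8)]
        t = [torch.randn(2, 16, 32, 32), torch.randn(2, 32, 16, 16), torch.randn(2, 64, 8, 8)]
        print(hierarchical_attention_loss(s, t).item())

    In practice such a distillation term would be added, with a tunable weight, to the student's task loss (e.g., cross-entropy, possibly together with the usual soft-label KD loss).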

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
    February 2024
    548 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3613570
    • Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 September 2023
    Online AM: 20 October 2022
    Accepted: 02 October 2022
    Revised: 16 August 2022
    Received: 28 March 2022
    Published in TOMM Volume 20, Issue 2

    Author Tags

    1. Model compression
    2. knowledge distillation
    3. hierarchical attention transfer

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Qing Lan Project of Colleges and Universities of Jiangsu Province in 2020
    • Australian Research Council
