Knowledge Distillation via Hierarchical Matching for Small Object Detection

Ma, Yong-Chi; Ma, Xiao; Hao, Tian-Ran; Cui, Li-Sha; Jin, Shao-Hui; Lyu, Pei

doi:10.1007/s11390-024-4158-5

Regular Paper
Special Section of CVM 2024
Published: 20 September 2024

Journal of Computer Science and Technology, Volume 39, pages 798–810 (2024)

Abstract

Knowledge distillation is widely used for model compression and has achieved great breakthroughs in image classification, but there is still room for improvement in object detection, especially in extracting knowledge about small objects. The main problem is that the features of small objects are often polluted by background noise and become less prominent after the down-sampling of a convolutional neural network (CNN), so small-object features are insufficiently refined during distillation. In this paper, we propose the Hierarchical Matching Knowledge Distillation Network (HMKD), which operates on pyramid levels P2 to P4 of the feature pyramid network (FPN) and aims to intervene on small-object features before they are degraded. We employ an encoder-decoder network to encapsulate low-resolution, highly semantic information, analogous to extracting knowledge from the deep layers of the teacher network, and then match this encapsulated information, as the query, against high-resolution feature values of small objects from shallow layers, which serve as the key. During matching, an attention mechanism measures the relevance of the query to the feature values, and knowledge is distilled to the student during decoding. In addition, we introduce a supplementary distillation module to mitigate the effects of background noise. Experiments show that our method achieves substantial improvements for both one-stage and two-stage object detectors. Specifically, applying the proposed method to Faster R-CNN with a ResNet50 backbone achieves 41.7% mAP on COCO2017, 3.8% higher than the baseline.
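To make the matching step concrete, here is generic scaled dot-product attention in which a deep, low-resolution feature vector acts as the query and shallow, high-resolution feature vectors act as keys and values, followed by a mean-squared-error loss that pulls the student's attention-pooled features toward the teacher's. This is a minimal sketch in plain Python under stated assumptions, not the authors' HMKD implementation: the function names (`attend`, `matching_distill_loss`), the use of shallow features as both keys and values, the MSE loss, and the equal student/teacher feature widths are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float], keys: Matrix, values: Matrix) -> Tuple[Vector, Vector]:
    """Scaled dot-product attention for a single query vector.

    Returns the attention-weighted sum of `values` and the weights themselves.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return pooled, weights


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def matching_distill_loss(t_deep: Matrix, t_shallow: Matrix,
                          s_deep: Matrix, s_shallow: Matrix) -> float:
    """Hypothetical matching-based distillation loss (illustrative only).

    t_deep / s_deep: one query vector per deep (low-resolution) position.
    t_shallow / s_shallow: key/value vectors from shallow (high-resolution) positions.
    Assumes teacher and student share the same feature width (no adapter layer).
    """
    total = 0.0
    for tq, sq in zip(t_deep, s_deep):
        # Teacher: deep query attends over its own shallow features.
        t_out, _ = attend(tq, t_shallow, t_shallow)
        # Student: the same matching on the student's features.
        s_out, _ = attend(sq, s_shallow, s_shallow)
        total += mse(s_out, t_out)
    return total / len(t_deep)


if __name__ == "__main__":
    teacher_deep = [[1.0, 0.0], [0.0, 1.0]]
    teacher_shallow = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    student_deep = [[0.8, 0.1], [0.2, 0.9]]
    student_shallow = [[0.9, 0.1], [0.1, 0.8], [0.4, 0.6]]
    print(matching_distill_loss(teacher_deep, teacher_shallow,
                                student_deep, student_shallow))
```

Running the file prints a small positive loss for the toy student features; the loss is exactly zero only when the student's queries and shallow features yield the same pooled vectors as the teacher's. A real detector would operate on batched FPN tensors (for example, the P2 to P4 maps) and would typically need a learned adapter when student and teacher channel counts differ; both are omitted here for brevity.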

Author information

Authors and Affiliations

School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450000, China
Yong-Chi Ma (马永驰), Xiao Ma (马啸), Tian-Ran Hao (郝天然), Li-Sha Cui (崔丽莎), Shao-Hui Jin (靳少辉) & Pei Lyu (吕培)

Corresponding author

Correspondence to Pei Lyu (吕培).

Ethics declarations

Conflict of Interest The authors declare that they have no conflict of interest.

Additional information

This work was supported in part by the Joint Fund of the Ministry of Education for Equipment Pre-Research of China under Grant No. 8091B032257, the National Natural Science Foundation of China under Grant Nos. 62106232 and 62372415, the China Postdoctoral Science Foundation under Grant No. 2021TQ0301, and the Outstanding Youth Science Fund of Henan Province of China under Grant No. 242300421050.

Yong-Chi Ma received his B.S. degree and M.S. degree from the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, in 2020 and 2024, respectively, both in computer science and technology. His research interests are computer vision and knowledge distillation.

Xiao Ma received his B.S. degree in computer science and technology from Henan University, Kaifeng, in 2022. He is currently a master's student in the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou. His current research interests mainly focus on object detection and knowledge distillation.

Tian-Ran Hao received his B.S. degree from the Information Engineering Institute, Zhengzhou University, Zhengzhou, in 2017. He is currently pursuing his Ph.D. degree in the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou. His current research interests include object detection and computer vision.

Li-Sha Cui received her Ph.D. degree in software engineering from Zhengzhou University, Zhengzhou, in 2020. She is currently an associate professor with the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou. Her current research interests include object detection, deep learning, and computer vision.

Shao-Hui Jin received her Ph.D. degree in control science and engineering from Xidian University, Xi’an, in 2016. She now works at the School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou. Her current research interests include image processing, computer vision, and non-line-of-sight imaging.

Pei Lyu received his Ph.D. degree from the State Key Laboratory of CAD&CG, Zhejiang University, Hangzhou, in 2013. He is currently a full professor with the School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou. His research interests include computer vision and computer graphics.

Electronic supplementary material