Abstract
Class activation maps (CAMs) have been widely used in weakly supervised object localization, where they generate attention maps for specific categories in an image. Because CAMs can be obtained from category annotations, which are already part of the annotation information in fully supervised object detection, how to exploit the attention information in CAMs to improve fully supervised object detection is an interesting problem. In this paper, we propose CAM R-CNN, in which the category-aware attention maps provided by CAMs are integrated into the object detection process. CAM R-CNN follows the common pipeline of recent query-based object detectors in an end-to-end fashion, with two key CAM modules embedded into the process. Specifically, the E-CAM module provides embedding-level attention by fusing proposal features with the attention information in CAMs through a transformer encoder, and the S-CAM module supplies spatial-level attention by multiplying the feature maps with the top-activated attention map provided by CAMs. In our experiments, CAM R-CNN demonstrates its superiority over several strong baselines on the challenging COCO dataset. Furthermore, we show that the S-CAM module can be applied to two-stage detectors such as Faster R-CNN and Cascade R-CNN with consistent gains.
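The two operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows, under simplified assumptions, (a) how a CAM is classically computed as a classifier-weighted sum over convolutional feature channels, and (b) the kind of spatial reweighting the S-CAM module describes (multiplying feature maps by a normalized attention map). All function names here are hypothetical.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Classic CAM: weight the (C, H, W) feature maps by the
    classifier weights of one class, then ReLU and normalize to [0, 1]."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)          # keep positively contributing regions
    if cam.max() > 0:
        cam = cam / cam.max()           # scale to [0, 1] for use as attention
    return cam

def spatial_attention(features, cam):
    """S-CAM-style spatial attention (sketch): broadcast-multiply every
    feature channel by the single (H, W) attention map."""
    return features * cam[None, :, :]

# Toy usage with random data
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))   # C=8 channels, 4x4 spatial grid
weights = rng.standard_normal((3, 8))    # 3 classes, 8 channels
cam = class_activation_map(feats, weights, class_idx=1)
attended = spatial_attention(feats, cam)
```

In the paper's setting the attention map would come from the top-activated category of the CAM branch rather than a fixed `class_idx`, and the multiplication would sit inside the detector's feature pipeline.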
Data Availability
The datasets analysed during the current study are available at https://cocodataset.org.
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (Nos. U21B2037, U22B2051, 62176222, 62176223, 62176226, 62072386, 62072387, 62072389, 62002305 and 62272401), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J01002 and 2022J06001).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, S., Yu, S., Ding, H. et al. CAM R-CNN: End-to-End Object Detection with Class Activation Maps. Neural Process Lett 55, 10483–10499 (2023). https://doi.org/10.1007/s11063-023-11335-9