Abstract
Class activation maps (CAMs) have been widely used in weakly supervised object localization, where they generate attention maps for specific categories in an image. Because CAMs can be obtained from category annotations, which are already part of the annotation information in fully supervised object detection, how to exploit the attention information in CAMs to improve fully supervised object detection is an interesting problem. In this paper, we propose CAM R-CNN, in which the category-aware attention maps provided by CAMs are integrated into the object detection process. CAM R-CNN follows the common pipeline of recent query-based object detectors in an end-to-end fashion, with two key CAM modules embedded into the process. Specifically, the E-CAM module provides embedding-level attention by fusing proposal features with the attention information in CAMs through a transformer encoder, and the S-CAM module supplies spatial-level attention by multiplying the feature maps with the top-activated attention map provided by CAMs. In our experiments, CAM R-CNN demonstrates its superiority over several strong baselines on the challenging COCO dataset. Furthermore, we show that the S-CAM module can be applied to two-stage detectors such as Faster R-CNN and Cascade R-CNN with consistent gains.
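The two operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows, under simplified assumptions, (a) how a CAM is classically computed as a classifier-weighted sum over convolutional feature channels, and (b) the kind of spatial reweighting the S-CAM module describes (multiplying feature maps by a normalized attention map). All function names here are hypothetical.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Classic CAM: weight the (C, H, W) feature maps by the
    classifier weights of one class, then ReLU and normalize to [0, 1]."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)          # keep positively contributing regions
    if cam.max() > 0:
        cam = cam / cam.max()           # scale to [0, 1] for use as attention
    return cam

def spatial_attention(features, cam):
    """S-CAM-style spatial attention (sketch): broadcast-multiply every
    feature channel by the single (H, W) attention map."""
    return features * cam[None, :, :]

# Toy usage with random data
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))   # C=8 channels, 4x4 spatial grid
weights = rng.standard_normal((3, 8))    # 3 classes, 8 channels
cam = class_activation_map(feats, weights, class_idx=1)
attended = spatial_attention(feats, cam)
```

In the paper's setting the attention map would come from the top-activated category of the CAM branch rather than a fixed `class_idx`, and the multiplication would sit inside the detector's feature pipeline.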
Data Availability
The datasets analysed during the current study are available at https://cocodataset.org.
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2022ZD0118202), the National Science Fund for Distinguished Young Scholars (No. 62025603), the National Natural Science Foundation of China (Nos. U21B2037, U22B2051, 62176222, 62176223, 62176226, 62072386, 62072387, 62072389, 62002305 and 62272401), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J01002 and 2022J06001).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, S., Yu, S., Ding, H. et al. CAM R-CNN: End-to-End Object Detection with Class Activation Maps. Neural Process Lett 55, 10483–10499 (2023). https://doi.org/10.1007/s11063-023-11335-9