DOI: 10.1145/3474085.3475485

Improving Weakly Supervised Object Localization via Causal Intervention

Published: 17 October 2021

Abstract

Recently emerged weakly-supervised object localization (WSOL) methods can learn to localize an object in an image using only image-level labels. Previous works endeavor to perceive the entire object from the small and sparse discriminative attention map, yet they ignore the co-occurrence confounder (e.g., duck and water), which makes it hard for model inspection (e.g., CAM) to distinguish between the object and its context. In this paper, we make an early attempt to tackle this challenge via causal intervention (CI). Our proposed method, dubbed CI-CAM, explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps, thus improving the accuracy of object localization. Extensive experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning clear object boundaries from confounding contexts. In particular, on CUB-200-2011, which suffers severely from the co-occurrence confounder, CI-CAM significantly outperforms the traditional CAM-based baseline (58.39% vs. 52.4% in Top-1 localization accuracy). In more general scenarios such as ILSVRC 2016, CI-CAM also performs on par with the state of the art.
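
For readers unfamiliar with the baseline, the following is a minimal sketch of how a plain class activation map (CAM) can be turned into a bounding box and scored. It illustrates the traditional CAM pipeline only, not CI-CAM itself, whose intervention step the abstract does not detail. The function names, the 0.5 CAM threshold, and the scoring rule (correct class and IoU of at least 0.5) are assumptions based on common WSOL practice, not taken from the paper.

```python
import numpy as np


def class_activation_map(features, fc_weights, class_idx):
    """Weighted sum of the last conv feature maps for one class.

    features:   array of shape (C, H, W)
    fc_weights: array of shape (num_classes, C), the classifier weights
    Returns an (H, W) map rescaled to [0, 1].
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def cam_to_box(cam, threshold=0.5):
    """Tightest box (x1, y1, x2, y2) around all pixels >= threshold.

    Assumption: a single box around every activated pixel. Real pipelines
    often keep only the largest connected component instead.
    """
    ys, xs = np.where(cam >= threshold)
    if xs.size == 0:  # flat map: fall back to the whole image
        h, w = cam.shape
        return 0, 0, w - 1, h - 1
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def iou(a, b):
    """Intersection over union of two boxes with inclusive pixel coordinates."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = iw * ih
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def top1_loc_correct(pred_class, true_class, pred_box, true_box, iou_thresh=0.5):
    """Assumed Top-1 localization rule: right class and enough box overlap."""
    return bool(pred_class == true_class and iou(pred_box, true_box) >= iou_thresh)
```

Under this rule, a context-heavy CAM (for example, one that also fires on water around a duck) yields an oversized box and a lower IoU, which is the failure mode the abstract attributes to the co-occurrence confounder.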

Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. causal intervention
  2. object localization
  3. weakly-supervised learning

Qualifiers

  • Research-article

Conference

MM '21
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 39
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 02 Feb 2025

Cited By

  • (2024) A Ship Detection Method in Infrared Remote Sensing Images Based on Image Generation and Causal Inference. Electronics 13:7 (1293). DOI: 10.3390/electronics13071293. Online publication date: 30-Mar-2024
  • (2024) Efficient Dual-Confounding Eliminating for Weakly-supervised Temporal Action Localization. Proceedings of the 32nd ACM International Conference on Multimedia (8179-8188). DOI: 10.1145/3664647.3681571. Online publication date: 28-Oct-2024
  • (2024) NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:10 (6873-6888). DOI: 10.1109/TPAMI.2024.3387349. Online publication date: Oct-2024
  • (2024) Logit Variated Product Quantization Based on Parts Interaction and Metric Learning With Knowledge Distillation for Fine-Grained Image Retrieval. IEEE Transactions on Multimedia 26 (10406-10419). DOI: 10.1109/TMM.2024.3407661. Online publication date: 2024
  • (2024) Knowledge-Guided Causal Intervention for Weakly-Supervised Object Localization. IEEE Transactions on Knowledge and Data Engineering 36:11 (6477-6489). DOI: 10.1109/TKDE.2024.3389668. Online publication date: Nov-2024
  • (2024) Clustering-inspired channel selection method for weakly supervised object localization. Pattern Recognition Letters 182 (46-52). DOI: 10.1016/j.patrec.2024.04.005. Online publication date: Jun-2024
  • (2024) Semantic-Constraint Matching for transformer-based weakly supervised object localization. Pattern Recognition (110971). DOI: 10.1016/j.patcog.2024.110971. Online publication date: Sep-2024
  • (2024) An explainable deep reinforcement learning algorithm for the parameter configuration and adjustment in the consortium blockchain. Engineering Applications of Artificial Intelligence 129:C. DOI: 10.1016/j.engappai.2023.107606. Online publication date: 16-May-2024
  • (2024) Weakly Supervised Object Localization with Background Suppression Erasing for Art Authentication and Copyright Protection. Machine Intelligence Research 21:1 (89-103). DOI: 10.1007/s11633-023-1455-3. Online publication date: 15-Jan-2024
  • (2023) Deconfounded Multimodal Learning for Spatio-temporal Video Grounding. Proceedings of the 31st ACM International Conference on Multimedia (7521-7529). DOI: 10.1145/3581783.3613822. Online publication date: 26-Oct-2023
