INTRA: Interaction Relationship-Aware Weakly Supervised Affordance Grounding

Jang, Ji Ha; Seo, Hoigi; Chun, Se Young

doi:10.1007/978-3-031-73039-9_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15122)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Affordance denotes the potential interactions inherent in objects. The perception of affordance can enable intelligent agents to navigate and interact with new environments efficiently. Weakly supervised affordance grounding teaches agents the concept of affordance using exocentric images instead of costly pixel-level annotations. Although recent advances in weakly supervised affordance grounding have yielded promising results, challenges remain, including the requirement for paired exocentric and egocentric image datasets and the complexity of grounding diverse affordances for a single object. To address them, we propose INTeraction Relationship-aware weakly supervised Affordance grounding (INTRA). Unlike prior art, INTRA recasts this problem as representation learning to identify unique features of interactions through contrastive learning with exocentric images only, eliminating the need for paired datasets. Moreover, we leverage vision-language model embeddings to perform affordance grounding flexibly with any text, designing text-conditioned affordance map generation to reflect interaction relationships for contrastive learning and enhancing robustness with our text synonym augmentation. Our method outperformed prior art on diverse datasets such as AGD20K, IIT-AFF, CAD, and UMD. Additionally, experimental results demonstrate that our method has remarkable domain scalability for synthesized images/illustrations and is capable of performing affordance grounding for novel interactions and objects. Project page: https://jeeit17.github.io/INTRA.

J. H. Jang and H. Seo contributed equally.
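
The abstract frames affordance grounding as representation learning driven by contrastive learning over interaction labels. As a rough illustration of that idea only, the sketch below implements a generic supervised contrastive loss in plain Python: embeddings that share an interaction label are pulled together and the rest are pushed apart. It is not the loss used in INTRA; the function name `supervised_contrastive_loss`, the `temperature` value, and the toy labels "hold" and "cut" are all assumptions made for this example.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the paper's loss).

    embeddings: list of equal-length, non-zero float vectors.
    labels: one label per embedding (here, hypothetical interaction names).
    Returns the mean loss over anchors that have at least one positive.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing

        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }

        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))

        # Average negative log-probability assigned to the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1

    return total / counted if counted else 0.0


# Toy usage: two clusters of 2-D embeddings with hypothetical interaction labels.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(toy_embeddings, ["hold", "hold", "cut", "cut"])
loss_mixed = supervised_contrastive_loss(toy_embeddings, ["hold", "cut", "hold", "cut"])
```

In this toy setup, `loss_aligned` comes out smaller than `loss_mixed`, because in the first call the labels agree with the geometric clusters. Which features are contrasted and how text conditioning enters are described in the full paper, not on this page.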

Acknowledgements

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2022M3C1A309202211), and the AI-Bio Research Grant through Seoul National University. The authors also acknowledge financial support from the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
Ji Ha Jang, Hoigi Seo & Se Young Chun
INMC & IPAI, Seoul National University, Seoul, Republic of Korea
Se Young Chun

Corresponding author

Correspondence to Se Young Chun.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9243 KB)

About this paper

Cite this paper

Jang, J.H., Seo, H., Chun, S.Y. (2025). INTRA: Interaction Relationship-Aware Weakly Supervised Affordance Grounding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15122. Springer, Cham. https://doi.org/10.1007/978-3-031-73039-9_2

DOI: https://doi.org/10.1007/978-3-031-73039-9_2
Published: 31 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73038-2
Online ISBN: 978-3-031-73039-9
eBook Packages: Computer Science, Computer Science (R0)
