NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

Zhang, Zhongqun; Wang, Hengfei; Yu, Ziwei; Cheng, Yihua; Yao, Angela; Chang, Hyung Jin

doi:10.1007/978-3-031-73390-1_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15086)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Modeling the physical contacts between the hand and object is standard practice for refining inaccurate hand poses and generating novel human grasps in 3D hand-object reconstruction. However, existing methods rely on geometric constraints that cannot be specified or controlled. This paper introduces a novel task of controllable 3D hand-object contact modeling with natural language descriptions. Challenges include i) the complexity of cross-modal modeling from language to contact, and ii) a lack of descriptive text for contact patterns. To address these issues, we propose NL2Contact, a model that generates controllable contacts by leveraging staged diffusion models. Given a language description of the hand and contact, NL2Contact generates realistic and faithful 3D hand-object contacts. To train the model, we build ContactDescribe, the first dataset with hand-centered contact descriptions. It contains multi-level and diverse descriptions generated by large language models based on carefully designed prompts (e.g., grasp action, grasp type, contact location, free finger status). We show applications of our model to grasp pose optimization and novel human grasp generation, both based on a textual contact description.
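
As an illustration of the prompt fields listed in the abstract (grasp action, grasp type, contact location, free finger status), the following minimal sketch shows one way such fields could be assembled into a prompt for a large language model. It is not the authors' prompt or pipeline: the `ContactAnnotation` record, the `build_prompt` function, the prompt wording, and the example values are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContactAnnotation:
    """Hypothetical record holding the four prompt fields named in the abstract."""
    grasp_action: str        # e.g. "hold"
    grasp_type: str          # e.g. "power grasp"
    contact_location: dict   # finger name -> object part it touches
    free_fingers: tuple      # fingers that are not in contact


def build_prompt(obj_name: str, ann: ContactAnnotation) -> str:
    """Assemble a text prompt asking for a hand-centered contact description."""
    # Sort by finger name so the prompt is deterministic for a given annotation.
    contacts = "; ".join(
        f"{finger} touches the {part}"
        for finger, part in sorted(ann.contact_location.items())
    )
    free = ", ".join(ann.free_fingers) if ann.free_fingers else "none"
    return (
        f"Describe how a hand grasps a {obj_name}.\n"
        f"Grasp action: {ann.grasp_action}\n"
        f"Grasp type: {ann.grasp_type}\n"
        f"Contact location: {contacts}\n"
        f"Free fingers: {free}\n"
        "Write one short sentence from the hand's point of view."
    )


# Illustrative values only; they are not taken from ContactDescribe.
example = ContactAnnotation(
    grasp_action="hold",
    grasp_type="power grasp",
    contact_location={"thumb": "handle", "index finger": "handle"},
    free_fingers=("little finger",),
)
prompt = build_prompt("mug", example)
print(prompt)
```

Keeping the fields in a structured record, rather than in free text, would make it straightforward to vary one level of the description (for example, only the free finger status) while holding the others fixed.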

Notes

1. We slightly abuse the notation $\theta$ to represent learning of model parameters.

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-2020-0-01789) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation); by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (Award Number: AISG2-RP-2020-016); and by China Scholarship Council (CSC) Grants No. 202208060266 and No. 202006210057.

Author information

Authors and Affiliations

University of Birmingham, Birmingham, UK
Zhongqun Zhang, Hengfei Wang, Yihua Cheng & Hyung Jin Chang
National University of Singapore, Singapore, Singapore
Ziwei Yu & Angela Yao

Corresponding author

Correspondence to Yihua Cheng.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

About this paper

Cite this paper

Zhang, Z., Wang, H., Yu, Z., Cheng, Y., Yao, A., Chang, H.J. (2025). NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15086. Springer, Cham. https://doi.org/10.1007/978-3-031-73390-1_17

DOI: https://doi.org/10.1007/978-3-031-73390-1_17
Published: 31 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73389-5
Online ISBN: 978-3-031-73390-1
eBook Packages: Computer Science, Computer Science (R0)
