Abstract
Modeling the physical contacts between the hand and object is standard for refining inaccurate hand poses and generating novel human grasps in 3D hand-object reconstruction. However, existing methods rely on geometric constraints that cannot be specified or controlled. This paper introduces a novel task of controllable 3D hand-object contact modeling with natural language descriptions. The challenges include i) the complexity of cross-modal modeling from language to contact, and ii) a lack of descriptive text for contact patterns. To address these issues, we propose NL2Contact, a model that generates controllable contacts by leveraging staged diffusion models. Given a language description of the hand and contact, NL2Contact generates realistic and faithful 3D hand-object contacts. To train the model, we build ContactDescribe, the first dataset with hand-centered contact descriptions. It contains multi-level and diverse descriptions generated by large language models based on carefully designed prompts (e.g., grasp action, grasp type, contact location, free finger status). We show applications of our model to grasp pose optimization and novel human grasp generation, both based on a textual contact description.
Notes
1. We slightly abuse the notation \(\theta \) to represent learning of model parameters.
Acknowledgements
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-2020-0-01789) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (Award Number: AISG2-RP-2020-016), and by the China Scholarship Council (CSC) under Grant No. 202208060266 and No. 202006210057.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Wang, H., Yu, Z., Cheng, Y., Yao, A., Chang, H.J. (2025). NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15086. Springer, Cham. https://doi.org/10.1007/978-3-031-73390-1_17
DOI: https://doi.org/10.1007/978-3-031-73390-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73389-5
Online ISBN: 978-3-031-73390-1
eBook Packages: Computer Science (R0)