Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video

Published: 08 March 2024

Abstract

    Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works leverage information only from a single RGB image and do not model the physically plausible relation between the two hands, which leads to inferior reconstruction results. In this work, we explicitly exploit spatial-temporal information to achieve better interacting hand reconstruction. On the one hand, we leverage temporal context to complement the insufficient information provided by a single frame, designing a novel temporal framework with a temporal constraint that enforces smooth interacting hand motion. On the other hand, we propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments validate the effectiveness of the proposed framework, which achieves new state-of-the-art performance on public benchmarks.


    Cited By

    • (2024) Fair and Robust Federated Learning via Decentralized and Adaptive Aggregation based on Blockchain. ACM Transactions on Sensor Networks. DOI: 10.1145/3673656. Online publication date: 17-Jun-2024.

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
      June 2024, 715 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3613638
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 08 March 2024
      Online AM: 24 January 2024
      Accepted: 15 December 2023
      Revised: 12 December 2023
      Received: 28 March 2023
      Published in TOMM Volume 20, Issue 6


      Author Tags

      1. Interacting hand
      2. model-based 3D hand reconstruction
      3. temporal context

      Qualifiers

      • Research-article

      Funding Sources

      • GPU cluster built by MCC Lab of Information Science and Technology Institution and the Supercomputing Center of the USTC

