
Bidirectional Transformer GAN for Long-term Human Motion Prediction

Published: 15 April 2023

Abstract

    Mainstream motion prediction methods usually focus on short-term prediction, and their predicted long-term motions often collapse to an average pose, i.e., the freezing forecasting problem [27]. To mitigate this problem, we propose a novel Bidirectional Transformer-based Generative Adversarial Network (BiTGAN) for long-term human motion prediction. The bidirectional setup leads to consistent and smooth generation in both the forward and backward directions. Besides, to make full use of the historical motions, we split them into two parts: the first part is fed to the Transformer encoder in our BiTGAN, while the second part is used as the decoder input. This strategy alleviates the exposure bias problem [37]. Additionally, to better maintain both the local (i.e., frame-level pose) and global (i.e., video-level semantic) similarities between the predicted motion sequence and the real one, the soft dynamic time warping (Soft-DTW) loss is introduced into the generator. Finally, we utilize a dual discriminator to distinguish the predicted sequence from the real one at both the frame and sequence levels. Extensive experiments on the public Human3.6M dataset demonstrate that our proposed BiTGAN achieves state-of-the-art performance on long-term (4s) human motion prediction and reduces the average error across all actions by 4%.
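
    The abstract describes two mechanisms that lend themselves to a short illustration: splitting the observed history between the Transformer encoder and decoder, and adding a Soft-DTW term that compares whole predicted and ground-truth sequences. The following is a minimal PyTorch sketch of those two ideas only; it is not the authors' implementation, and the module name, pose dimensionality, layer sizes, and frame counts are hypothetical. The Soft-DTW routine is a readability-first version of the soft-min recursion of Cuturi and Blondel [6].

```python
# Illustrative sketch only, not the authors' released code.
# All sizes and module names below are hypothetical.
import torch
import torch.nn as nn


class SplitHistoryGenerator(nn.Module):
    """Toy generator: the first half of the history conditions the Transformer
    encoder, the second half seeds the decoder (the history-split idea)."""

    def __init__(self, pose_dim=48, d_model=128, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)      # pose vector -> token
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, pose_dim)       # token -> pose vector

    def forward(self, history, pred_len):
        t = history.size(1) // 2
        src, tgt = history[:, :t], history[:, t:]      # encoder / decoder parts
        out = self.transformer(self.embed(src), self.embed(tgt))
        poses = self.head(out)
        # Tile/trim to the requested horizon, purely for illustration.
        reps = -(-pred_len // poses.size(1))           # ceiling division
        return poses.repeat(1, reps, 1)[:, :pred_len]


def soft_dtw(x, y, gamma=0.1):
    """Minimal Soft-DTW: soft-min dynamic programming over squared pairwise
    distances between two sequences of shape (T, D) and (S, D)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # (T, S) cost matrix
    T, S = cost.shape
    r = torch.full((T + 1, S + 1), float("inf"))
    r[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            prev = torch.stack([r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]])
            # soft-min(a, b, c) = -gamma * log(exp(-a/gamma) + exp(-b/gamma) + exp(-c/gamma))
            r[i, j] = cost[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return r[T, S]


# Usage: predict 100 future frames (e.g., 4 s at 25 fps) from 50 observed ones.
gen = SplitHistoryGenerator()
history = torch.randn(2, 50, 48)             # (batch, frames, pose coordinates)
future_gt = torch.randn(2, 100, 48)
pred = gen(history, pred_len=100)
seq_loss = soft_dtw(pred[0], future_gt[0])   # sequence-level similarity term
```

    In the full BiTGAN, a sequence-level term like this would be combined with the adversarial losses from the frame- and sequence-level discriminators; the sketch omits the GAN components and the bidirectional generation.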

    References

    [1]
    Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations.
    [2]
    Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3722–3731.
    [3]
    Judith Butepage, Michael J. Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166.
    [4]
    Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, Ding Liu, Jing Liu, and Nadia Magnenat-Thalmann. 2020. Learning progressive joint propagation for human motion prediction. In Proceedings of the European Conference on Computer Vision. Springer, 226–242.
    [5]
    Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, and Jitendra Malik. 2020. Long-term human motion prediction with scene context. In Proceedings of the European Conference on Computer Vision. Springer, 387–404.
    [6]
    Marco Cuturi and Mathieu Blondel. 2017. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning. PMLR, 894–903.
    [7]
    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR abs/1901.02860 (2019).
    [8]
    Lingwei Dang, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 2021. MSR-GCN: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11467–11476.
    [9]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
    [10]
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations.
    [11]
    Yinglin Duan, Tianyang Shi, Zhengxia Zou, Yenan Lin, Zhehui Qian, Bohan Zhang, and Yi Yuan. 2021. Single-Shot Motion Completion with Transformer. CoRR abs/2103.00776 (2021).
    [12]
    Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
    [13]
    Kaifeng Gao, Long Chen, Yifeng Huang, and Jun Xiao. 2021. Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM International Conference on Multimedia. 4833–4837.
    [14]
    Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G. Ororbia. 2019. A neural temporal model for human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12116–12125.
    [15]
    Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José M. F. Moura. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision. 786–803.
    [16]
    James N. Ingram, Konrad P. Körding, Ian S. Howard, and Daniel M. Wolpert. 2008. The statistics of natural hand movements. Experimental Brain Research 188, 2 (2008), 223–236.
    [17]
    Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2013. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1325–1339.
    [18]
    Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
    [19]
    Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54, 10s (2021), 200:1–200:41.
    [20]
    Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980.
    [21]
    Hema S. Koppula and Ashutosh Saxena. 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1 (2015), 14–29.
    [22]
    Jogendra Nath Kundu, Maharshi Gor, and R. Venkatesh Babu. 2019. BiHMP-GAN: Bidirectional 3D human motion prediction GAN. In Proceedings of the AAAI Conference on Artificial Intelligence. 8553–8560.
    [23]
    Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
    [24]
    Jiachen Li, Fan Yang, Hengbo Ma, Srikanth Malla, Masayoshi Tomizuka, and Chiho Choi. 2021. Rain: Reinforced hybrid attention inference network for motion forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16096–16106.
    [25]
    Jiaman Li, Yihang Yin, Hang Chu, Yi Zhou, Tingwu Wang, Sanja Fidler, and Hao Li. 2020. Learning to generate diverse dance motions with transformer. CoRR abs/2008.08171 (2020).
    [26]
    Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. 2020. Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 214–223.
    [27]
    Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. 2021. Learn to dance with AIST++: Music conditioned 3D dance generation. arXiv:2101.08779. Retrieved from https://arxiv.org/abs/2101.08779.
    [28]
    Hongyi Liu and Lihui Wang. 2017. Human motion prediction for human-robot collaboration. Journal of Manufacturing Systems 44 (2017), 287–294. https://www.sciencedirect.com/science/article/pii/S0278612517300481.
    [29]
    Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. 2021. Multimodal motion prediction with stacked transformers. CoRR abs/2103.11624 (2021).
    [30]
    Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10004–10012.
    [31]
    Kedi Lyu, Zhenguang Liu, Shuang Wu, Haipeng Chen, Xuhong Zhang, and Yuyu Yin. 2021. Learning human motion prediction via stochastic differential equations. In Proceedings of the 29th ACM International Conference on Multimedia. 4976–4984.
    [32]
    Xin Man, Deqiang Ouyang, Xiangpeng Li, Jingkuan Song, and Jie Shao. 2022. Scenario-aware recurrent transformer for goal-directed video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 4 (2022), 1–17.
    [33]
    Wei Mao, Miaomiao Liu, and Mathieu Salzmann. 2020. History repeats itself: Human motion prediction via motion attention. In Proceedings of the European Conference on Computer Vision. Springer, 474–489.
    [34]
    Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9489–9497.
    [35]
    Julieta Martinez, Michael J. Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
    [36]
    Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and Stéphane Marchand-Maillet. 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–23.
    [37]
    Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.
    [38]
    Xuanchi Ren, Haoran Li, Zijian Huang, and Qifeng Chen. 2020. Self-supervised dance video synthesis conditioned on music. In Proceedings of the 28th ACM International Conference on Multimedia. 46–54.
    [39]
    Theodoros Sofianos, Alessio Sampieri, Luca Franco, and Fabio Galasso. 2021. Space-time-separable graph convolutional network for pose forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11209–11218.
    [40]
    Pengxiang Su, Zhenguang Liu, Shuang Wu, Lei Zhu, Yifang Yin, and Xuanjing Shen. 2021. Motion prediction via joint dependency modeling in phase space. In Proceedings of the 29th ACM International Conference on Multimedia. 713–721.
    [41]
    Hao Tang, Song Bai, Li Zhang, Philip H. S. Torr, and Nicu Sebe. 2020. XingGAN for person image generation. In Proceedings of the European Conference on Computer Vision. Springer, 717–734.
    [42]
    Hao Tang, Dan Xu, Wei Wang, Yan Yan, and Nicu Sebe. 2018. Dual generator generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Asian Conference on Computer Vision. Springer, 3–21.
    [43]
    Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. 2018. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 935–941.
    [44]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 5998–6008.
    [45]
    Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. 2018. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision. 601–617.
    [46]
    Borui Wang, Ehsan Adeli, Hsu-kuang Chiu, De-An Huang, and Juan Carlos Niebles. 2019. Imitation learning for human pose prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7124–7133.
    [47]
    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237. Retrieved from https://arxiv.org/abs/1906.08237.
    [48]
    Bo Zhang, Rui Zhang, Niccolo Bisagno, Nicola Conci, Francesco G. B. De Natale, and Hongbo Liu. 2021. Where are they going? Predicting human behaviors in crowded scenes. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–19.
    [49]
    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 5
      September 2023
      262 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3585398
      • Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 15 April 2023
      Online AM: 10 January 2023
      Accepted: 24 December 2022
      Revised: 20 October 2022
      Received: 08 April 2022
      Published in TOMM Volume 19, Issue 5

      Author Tags

      1. Long-term human motion prediction
      2. bidirectional generation
      3. Transformer
      4. GAN
      5. DTW

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research & Development Program of China

      Article Metrics

      • Downloads (last 12 months): 609
      • Downloads (last 6 weeks): 43
      Reflects downloads up to 09 Aug 2024

      Cited By

      • (2024) Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4 (2024), 1–21. https://doi.org/10.1145/3633781. Online publication date: 11-Jan-2024.
      • (2024) Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction. IEEE Transactions on Image Processing 33 (2024), 3907–3920. https://doi.org/10.1109/TIP.2024.3414935
      • (2024) Human Motion Prediction: Assessing Direct and Geometry-Aware Approaches in 3D Space. IEEE Access 12 (2024), 104643–104662. https://doi.org/10.1109/ACCESS.2024.3434695
      • (2024) AMHGCN: Adaptive multi-level hypergraph convolution network for human motion prediction. Neural Networks 172 (2024), 106153. https://doi.org/10.1016/j.neunet.2024.106153. Online publication date: Apr-2024.
      • (2024) Simplified neural architecture for efficient human motion prediction in human-robot interaction. Neurocomputing 588 (2024), 127683. https://doi.org/10.1016/j.neucom.2024.127683. Online publication date: Jul-2024.
      • (2024) Recent advances in deterministic human motion prediction. Image and Vision Computing 143, C (2024). https://doi.org/10.1016/j.imavis.2024.104926. Online publication date: 2-Jul-2024.
      • (2023) Interaction Transformer for Human Reaction Generation. IEEE Transactions on Multimedia 25 (2023), 8842–8854. https://doi.org/10.1109/TMM.2023.3242152. Online publication date: 1-Jan-2023.
      • (2023) A human-like action learning process. Knowledge-Based Systems 280, C (2023). https://doi.org/10.1016/j.knosys.2023.110948. Online publication date: 25-Nov-2023.
