
Bidirectional Transformer GAN for Long-term Human Motion Prediction

Published: 15 April 2023

Abstract

    Mainstream motion prediction methods usually focus on short-term prediction, and their predicted long-term motions often collapse to an average pose, i.e., the freezing forecasting problem [27]. To mitigate this problem, we propose a novel Bidirectional Transformer-based Generative Adversarial Network (BiTGAN) for long-term human motion prediction. The bidirectional setup leads to consistent and smooth generation in both the forward and backward directions. Besides, to make full use of the historical motions, we split them into two parts: the first part is fed to the Transformer encoder in our BiTGAN, while the second part is used as the decoder input. This strategy alleviates the exposure bias problem [37]. Additionally, to better maintain both the local (i.e., frame-level pose) and global (i.e., video-level semantic) similarities between the predicted motion sequence and the real one, the soft dynamic time warping (Soft-DTW) loss is introduced into the generator. Finally, we utilize a dual discriminator to distinguish the predicted sequence from the real one at both the frame and sequence levels. Extensive experiments on the public Human3.6M dataset demonstrate that our proposed BiTGAN achieves state-of-the-art performance on long-term (4s) human motion prediction and reduces the average error across all actions by 4%.
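
    The abstract describes two mechanisms that lend themselves to a short illustration: splitting the observed history between the Transformer encoder and decoder, and adding a Soft-DTW term that compares whole predicted and ground-truth sequences. The following is a minimal PyTorch sketch of those two ideas only; it is not the authors' implementation, and the module name, pose dimensionality, layer sizes, and frame counts are hypothetical. The Soft-DTW routine is a readability-first version of the soft-min recursion of Cuturi and Blondel [6].

```python
# Illustrative sketch only, not the authors' released code.
# All sizes and module names below are hypothetical.
import torch
import torch.nn as nn


class SplitHistoryGenerator(nn.Module):
    """Toy generator: the first half of the history conditions the Transformer
    encoder, the second half seeds the decoder (the history-split idea)."""

    def __init__(self, pose_dim=48, d_model=128, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)      # pose vector -> token
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, pose_dim)       # token -> pose vector

    def forward(self, history, pred_len):
        t = history.size(1) // 2
        src, tgt = history[:, :t], history[:, t:]      # encoder / decoder parts
        out = self.transformer(self.embed(src), self.embed(tgt))
        poses = self.head(out)
        # Tile/trim to the requested horizon, purely for illustration.
        reps = -(-pred_len // poses.size(1))           # ceiling division
        return poses.repeat(1, reps, 1)[:, :pred_len]


def soft_dtw(x, y, gamma=0.1):
    """Minimal Soft-DTW: soft-min dynamic programming over squared pairwise
    distances between two sequences of shape (T, D) and (S, D)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # (T, S) cost matrix
    T, S = cost.shape
    r = torch.full((T + 1, S + 1), float("inf"))
    r[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            prev = torch.stack([r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]])
            # soft-min(a, b, c) = -gamma * log(exp(-a/gamma) + exp(-b/gamma) + exp(-c/gamma))
            r[i, j] = cost[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return r[T, S]


# Usage: predict 100 future frames (e.g., 4 s at 25 fps) from 50 observed ones.
gen = SplitHistoryGenerator()
history = torch.randn(2, 50, 48)             # (batch, frames, pose coordinates)
future_gt = torch.randn(2, 100, 48)
pred = gen(history, pred_len=100)
seq_loss = soft_dtw(pred[0], future_gt[0])   # sequence-level similarity term
```

    In the full BiTGAN, a sequence-level term like this would be combined with the adversarial losses from the frame- and sequence-level discriminators; the sketch omits the GAN components and the bidirectional generation.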

    References

    [1]
    Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. In Proceedings of the 5th International Conference on Learning Representations.
    [2]
    Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3722–3731.
    [3]
    Judith Butepage, Michael J. Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6158–6166.
    [4]
    Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, Ding Liu, Jing Liu, and Nadia Magnenat-Thalmann. 2020. Learning progressive joint propagation for human motion prediction. In Proceedings of the European Conference on Computer Vision. Springer, 226–242.
    [5]
    Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, and Jitendra Malik. 2020. Long-term human motion prediction with scene context. In Proceedings of the European Conference on Computer Vision. Springer, 387–404.
    [6]
    Marco Cuturi and Mathieu Blondel. 2017. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning. PMLR, 894–903.
    [7]
    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR abs/1901.02860 (2019).
    [8]
    Lingwei Dang, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. 2021. MSR-GCN: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11467–11476.
    [9]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
    [10]
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations.
    [11]
    Yinglin Duan, Tianyang Shi, Zhengxia Zou, Yenan Lin, Zhehui Qian, Bohan Zhang, and Yi Yuan. 2021. Single-Shot Motion Completion with Transformer. CoRR abs/2103.00776 (2021).
    [12]
    Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
    [13]
    Kaifeng Gao, Long Chen, Yifeng Huang, and Jun Xiao. 2021. Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM International Conference on Multimedia. 4833–4837.
    [14]
    Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G. Ororbia. 2019. A neural temporal model for human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12116–12125.
    [15]
    Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José M. F. Moura. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision. 786–803.
    [16]
    James N. Ingram, Konrad P. Körding, Ian S. Howard, and Daniel M. Wolpert. 2008. The statistics of natural hand movements. Experimental Brain Research 188, 2 (2008), 223–236.
    [17]
    Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2013. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1325–1339.
    [18]
    Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
    [19]
    Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54, 10s (2021), 200:1–200:41.
    [20]
    Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980.
    [21]
    Hema S. Koppula and Ashutosh Saxena. 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1 (2015), 14–29.
    [22]
    Jogendra Nath Kundu, Maharshi Gor, and R. Venkatesh Babu. 2019. BiHMP-GAN: Bidirectional 3D human motion prediction GAN. In Proceedings of the AAAI Conference on Artificial Intelligence. 8553–8560.
    [23]
    Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5226–5234.
    [24]
    Jiachen Li, Fan Yang, Hengbo Ma, Srikanth Malla, Masayoshi Tomizuka, and Chiho Choi. 2021. Rain: Reinforced hybrid attention inference network for motion forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16096–16106.
    [25]
    Jiaman Li, Yihang Yin, Hang Chu, Yi Zhou, Tingwu Wang, Sanja Fidler, and Hao Li. 2020. Learning to generate diverse dance motions with transformer. CoRR abs/2008.08171 (2020).
    [26]
    Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. 2020. Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 214–223.
    [27]
    Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. 2021. Learn to dance with AIST++: Music conditioned 3D dance generation. arXiv:2101.08779. Retrieved from https://arxiv.org/abs/2101.08779.
    [28]
    Hongyi Liu and Lihui Wang. 2017. Human motion prediction for human-robot collaboration. Journal of Manufacturing Systems 44 (2017), 287–294. https://www.sciencedirect.com/science/article/pii/S0278612517300481.
    [29]
    Yicheng Liu, Jinghuai Zhang, Liangji Fang, Qinhong Jiang, and Bolei Zhou. 2021. Multimodal motion prediction with stacked transformers. CoRR abs/2103.11624 (2021).
    [30]
    Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10004–10012.
    [31]
    Kedi Lyu, Zhenguang Liu, Shuang Wu, Haipeng Chen, Xuhong Zhang, and Yuyu Yin. 2021. Learning human motion prediction via stochastic differential equations. In Proceedings of the 29th ACM International Conference on Multimedia. 4976–4984.
    [32]
    Xin Man, Deqiang Ouyang, Xiangpeng Li, Jingkuan Song, and Jie Shao. 2022. Scenario-aware recurrent transformer for goal-directed video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 4 (2022), 1–17.
    [33]
    Wei Mao, Miaomiao Liu, and Mathieu Salzmann. 2020. History repeats itself: Human motion prediction via motion attention. In Proceedings of the European Conference on Computer Vision. Springer, 474–489.
    [34]
    Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9489–9497.
    [35]
    Julieta Martinez, Michael J. Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
    [36]
    Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and Stéphane Marchand-Maillet. 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–23.
    [37]
    Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.
    [38]
    Xuanchi Ren, Haoran Li, Zijian Huang, and Qifeng Chen. 2020. Self-supervised dance video synthesis conditioned on music. In Proceedings of the 28th ACM International Conference on Multimedia. 46–54.
    [39]
    Theodoros Sofianos, Alessio Sampieri, Luca Franco, and Fabio Galasso. 2021. Space-time-separable graph convolutional network for pose forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11209–11218.
    [40]
    Pengxiang Su, Zhenguang Liu, Shuang Wu, Lei Zhu, Yifang Yin, and Xuanjing Shen. 2021. Motion prediction via joint dependency modeling in phase space. In Proceedings of the 29th ACM International Conference on Multimedia. 713–721.
    [41]
    Hao Tang, Song Bai, Li Zhang, Philip H. S. Torr, and Nicu Sebe. 2020. XingGAN for person image generation. In Proceedings of the European Conference on Computer Vision. Springer, 717–734.
    [42]
    Hao Tang, Dan Xu, Wei Wang, Yan Yan, and Nicu Sebe. 2018. Dual generator generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Asian Conference on Computer Vision. Springer, 3–21.
    [43]
    Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. 2018. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 935–941.
    [44]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 5998–6008.
    [45]
    Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. 2018. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision. 601–617.
    [46]
    Borui Wang, Ehsan Adeli, Hsu-kuang Chiu, De-An Huang, and Juan Carlos Niebles. 2019. Imitation learning for human pose prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7124–7133.
    [47]
    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237. Retrieved from https://arxiv.org/abs/1906.08237.
    [48]
    Bo Zhang, Rui Zhang, Niccolo Bisagno, Nicola Conci, Francesco G. B. De Natale, and Hongbo Liu. 2021. Where are they going? Predicting human behaviors in crowded scenes. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1–19.
    [49]
    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 5
      September 2023
      262 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3585398
      • Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 15 April 2023
      Online AM: 10 January 2023
      Accepted: 24 December 2022
      Revised: 20 October 2022
      Received: 08 April 2022
      Published in TOMM Volume 19, Issue 5

      Author Tags

      1. Long-term human motion prediction
      2. bidirectional generation
      3. Transformer
      4. GAN
      5. DTW

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research & Development Program of China

      Article Metrics

      • Downloads (last 12 months): 609
      • Downloads (last 6 weeks): 43
      Reflects downloads up to 09 Aug 2024

      Cited By

      • (2024) Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4 (2024), 1–21. https://doi.org/10.1145/3633781. Online publication date: 11-Jan-2024.
      • (2024) Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction. IEEE Transactions on Image Processing 33 (2024), 3907–3920. https://doi.org/10.1109/TIP.2024.3414935
      • (2024) Human Motion Prediction: Assessing Direct and Geometry-Aware Approaches in 3D Space. IEEE Access 12 (2024), 104643–104662. https://doi.org/10.1109/ACCESS.2024.3434695
      • (2024) AMHGCN: Adaptive multi-level hypergraph convolution network for human motion prediction. Neural Networks 172 (2024), 106153. https://doi.org/10.1016/j.neunet.2024.106153. Online publication date: Apr-2024.
      • (2024) Simplified neural architecture for efficient human motion prediction in human-robot interaction. Neurocomputing 588 (2024), 127683. https://doi.org/10.1016/j.neucom.2024.127683. Online publication date: Jul-2024.
      • (2024) Recent advances in deterministic human motion prediction. Image and Vision Computing 143, C (2024). https://doi.org/10.1016/j.imavis.2024.104926. Online publication date: 2-Jul-2024.
      • (2023) Interaction Transformer for Human Reaction Generation. IEEE Transactions on Multimedia 25 (2023), 8842–8854. https://doi.org/10.1109/TMM.2023.3242152. Online publication date: 1-Jan-2023.
      • (2023) A human-like action learning process. Knowledge-Based Systems 280, C (2023). https://doi.org/10.1016/j.knosys.2023.110948. Online publication date: 25-Nov-2023.
