MotionLCM: Real-Time Controllable Motion Generation via Latent Consistency Model

Dai, Wenxun; Chen, Ling-Hao; Wang, Jingbo; Liu, Jinpeng; Dai, Bo; Tang, Yansong

doi:10.1007/978-3-031-72640-8_22

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15074)

Included in the following conference series:

European Conference on Computer Vision

Abstract

This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial-temporal control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the latent diffusion model [9]. By adopting one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM and enable explicit control signals (e.g., initial poses) in the vanilla motion space to control the generation process directly, similar to controlling other latent-free diffusion models [29, 73] for motion generation. By employing these techniques, our approach can generate human motions with text and control signals in real-time. Experimental results demonstrate the remarkable generation and controlling capabilities of MotionLCM while maintaining real-time runtime efficiency.
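The one-step (or few-step) inference mentioned in the abstract can be illustrated with a toy sketch of generic multistep consistency sampling, in the style of the consistency models cited in the references. This is not MotionLCM's network, latent space, or noise schedule: the 1-D Gaussian "data", the closed-form `consistency_fn`, and the constants `SIGMA_DATA`, `EPS`, and `T_MAX` are all assumptions chosen so the example runs without a trained model.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
SIGMA_DATA = 0.5   # std of the toy 1-D "data" distribution N(0, SIGMA_DATA^2)
EPS = 0.002        # smallest noise level, where the function must be the identity
T_MAX = 80.0       # largest noise level, where sampling starts


def consistency_fn(x, t):
    """Map a noisy point at noise level t to the start of its ODE trajectory.

    For Gaussian data this map has a closed form; it stands in for a trained
    consistency network in this sketch.
    """
    return x * math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)


def sample(timesteps, rng):
    """Multistep consistency sampling.

    `timesteps` lists extra noise levels in decreasing order; an empty list
    gives one-step generation.
    """
    x = consistency_fn(rng.gauss(0.0, T_MAX), T_MAX)  # single function call
    for t in timesteps:
        # Re-noise the current sample to level t, then denoise in one call.
        noisy = x + math.sqrt(t**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(noisy, t)
    return x


rng = random.Random(0)
one_step = sample([], rng)                  # 1 function evaluation
four_step = sample([20.0, 5.0, 1.0], rng)   # 4 function evaluations
```

Each sample costs as many function evaluations as there are noise levels visited, which is why a consistency model can be much cheaper at inference time than a diffusion sampler that runs tens or hundreds of denoising steps.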

L.-H. Chen—Project lead.


Notes

1. EMA operation: $\mathbf{\Theta}^{-} \leftarrow \texttt{sg}(\mu \mathbf{\Theta}^{-} + (1 - \mu)\mathbf{\Theta})$, where $\texttt{sg}(\cdot)$ denotes the stopgrad operation and $\mu$ satisfies $0 \le \mu < 1$.
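As a minimal sketch of the EMA operation in the note, the update can be written in plain Python. It reads $\mathbf{\Theta}^{-}$ as the slowly updated (target) parameters and $\mathbf{\Theta}$ as the trained (online) parameters, which is the usual convention and an assumption here. The function name `ema_update` and the flat-list parameter layout are hypothetical. Plain Python has no automatic differentiation, so the stopgrad is implicit: the result is a list of plain numbers with no link back to the online parameters.

```python
def ema_update(target, online, mu):
    """Return mu * target + (1 - mu) * online, element by element.

    `target` and `online` are flat lists of floats (a hypothetical layout);
    `mu` is the decay rate and must satisfy 0 <= mu < 1, as in the note.
    """
    if not 0.0 <= mu < 1.0:
        raise ValueError("mu must satisfy 0 <= mu < 1")
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [mu * t + (1.0 - mu) * o for t, o in zip(target, online)]


target = [1.0, 2.0]
online = [3.0, 4.0]
target = ema_update(target, online, mu=0.5)
# target is now [2.0, 3.0]
```

With `mu` close to 1 the target parameters move slowly toward the online ones; with `mu = 0` they are simply a copy.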

References

Ahn, H., Ha, T., Choi, Y., Yoo, H., Oh, S.: Text2action: generative adversarial synthesis from language to action. In: ICRA, pp. 5915–5920 (2018)
Ahuja, C., Morency, L.P.: Language2pose: natural language grounded pose forecasting. In: 3DV, pp. 719–728 (2019)
Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: Teach: temporal action composition for 3D humans. In: 3DV, pp. 414–423 (2022)
Barquero, G., Escalera, S., Palmero, C.: Seamless human motion composition with blended positional encodings. In: CVPR, pp. 457–469 (2024)
Bhattacharya, U., Rewkowski, N., Banerjee, A., Guhan, P., Bera, A., Manocha, D.: Text2gestures: a transformer-based network for generating emotive body gestures for virtual agents. In: VR, pp. 1–10 (2021)
Cervantes, P., Sekikawa, Y., Sato, I., Shinoda, K.: Implicit neural representations for variable length human motion generation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13677, pp. 356–372. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19790-1_22
Chen, L.H., et al.: Motionllm: understanding human behaviors from human motions and videos. arXiv preprint arXiv:2405.20340 (2024)
Chen, L.H., Zhang, J., Li, Y., Pang, Y., Xia, X., Liu, T.: HumanMAC: masked motion completion for human motion prediction. In: ICCV, pp. 9544–9555 (2023)
Chen, X., et al.: Executing your commands via motion diffusion in latent space. In: CVPR, pp. 18000–18010 (2023)
Cong, P., et al.: Laserhuman: language-guided scene-aware human motion generation in free environment. arXiv preprint arXiv:2403.13307 (2024)
Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: Mofusion: a framework for denoising-diffusion-based motion synthesis. In: CVPR, pp. 9760–9770 (2023)
Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: C·ASE: learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia, pp. 1–11 (2023)
Fan, K., et al.: Freemotion: a unified framework for number-free text-to-motion synthesis. arXiv preprint arXiv:2405.15763 (2024)
Ghosh, A., Cheema, N., Oguz, C., Theobalt, C., Slusallek, P.: Synthesis of compositional animations from textual descriptions. In: ICCV, pp. 1396–1406 (2021)
Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MOMask: generative masked modeling of 3D human motions. In: CVPR, pp. 1900–1910 (2024)
Guo, C., et al.: Generative human motion stylization in latent space. arXiv preprint arXiv:2401.13505 (2024)
Guo, C., et al.: Generating diverse and natural 3D human motions from text. In: CVPR, pp. 5152–5161 (2022)
Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13695, pp. 580–597. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_34
Guo, C., et al.: Action2motion: conditioned generation of 3D human motions. In: ACMMM, pp. 2021–2029 (2020)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS, pp. 6840–6851 (2020)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Holden, D., Komura, T., Saito, J.: Phase-functioned neural networks for character control. TOG 36(4), 1–13 (2017)
Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. TOG 35(4), 1–11 (2016)
Huang, Y., et al.: Stablemofusion: towards robust and efficient diffusion-based motion generation framework. arXiv preprint arXiv:2405.05691 (2024)
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
Ji, Y., Xu, F., Yang, Y., Shen, F., Shen, H.T., Zheng, W.S.: A large-scale RGB-D database for arbitrary-view human action recognition. In: ACMMM, pp. 1510–1518 (2018)
Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: human motion as a foreign language. In: NeurIPS (2024)
Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. In: NeurIPS, pp. 26565–26577 (2022)
Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis. In: CVPR, pp. 2151–2162 (2023)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
Lee, T., Moon, G., Lee, K.M.: Multiact: long-term 3D human motion generation from multiple action labels. In: AAAI, pp. 1231–1239 (2023)
Li, B., Zhao, Y., Zhelun, S., Sheng, L.: Danceformer: music conditioned 3D dance generation with parametric motion transformer. In: AAAI, pp. 1272–1279 (2022)
Li, R., et al.: Lodge: a coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. In: CVPR, pp. 1524–1534 (2024)
Li, R., et al.: Finedance: a fine-grained choreography dataset for 3D full body dance generation. In: ICCV, pp. 10234–10243 (2023)
Li, R., Yang, S., Ross, D.A., Kanazawa, A.: AI choreographer: music conditioned 3D dance generation with aist++. In: ICCV, pp. 13401–13412 (2021)
Li, T., Qiao, C., Ren, G., Yin, K., Ha, S.: AAMDM: accelerated auto-regressive motion diffusion model. In: CVPR, pp. 1813–1823 (2024)
Lin, A.S., Wu, L., Corona, R., Tai, K., Huang, Q., Mooney, R.J.: Generating animated videos of human activities from natural language descriptions. Learning 1(2018), 1 (2018)
Lin, X., Amer, M.R.: Human motion modeling using DVGANs. arXiv preprint arXiv:1804.10652 (2018)
Liu, J., Dai, W., Wang, C., Cheng, Y., Tang, Y., Tong, X.: Plan, posture and go: towards open-world text-to-motion generation. arXiv preprint arXiv:2312.14828 (2023)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: NeurIPS, pp. 5775–5787 (2022)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-solver++: fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095 (2022)
Lu, S., et al.: Humantomato: text-aligned whole-body motion generation. arXiv preprint arXiv:2310.12978 (2023)
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: archive of motion capture as surface shapes. In: ICCV, pp. 5442–5451 (2019)
Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: ICML, pp. 8162–8171 (2021)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion synthesis with transformer VAE. In: ICCV, pp. 10985–10995 (2021)
Petrovich, M., Black, M.J., Varol, G.: Temos: generating diverse human motions from textual descriptions. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 480–497. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20047-2_28
Petrovich, M., Black, M.J., Varol, G.: TMR: text-to-motion retrieval using contrastive 3D human motion synthesis. In: ICCV, pp. 9488–9497 (2023)
Petrovich, M., et al.: Multi-track timeline control for text-driven 3D human motion generation. In: CVPRW, pp. 1911–1921 (2024)
Plappert, M., Mandery, C., Asfour, T.: The kit motion-language dataset. Big Data 4(4), 236–252 (2016)
Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: Babel: bodies, action and behavior with English labels. In: CVPR, pp. 722–731 (2021)
Raab, S., Leibovitch, I., Li, P., Aberman, K., Sorkine-Hornung, O., Cohen-Or, D.: Modi: unconditional motion synthesis from diverse data. In: CVPR, pp. 13873–13883 (2023)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior. In: ICLR (2024)
Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: CVPR, pp. 1010–1019 (2016)
Shi, Y., Wang, J., Jiang, X., Dai, B.: Controllable motion diffusion model. arXiv preprint arXiv:2306.00416 (2023)
Siyao, L., et al.: Bailando: 3D dance generation by actor-critic GPT with choreographic memory. In: CVPR, pp. 11050–11059 (2022)
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML, pp. 2256–2265 (2015)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: ICML (2023)
Tang, Y., et al.: Flag3D: a 3D fitness activity dataset with language instruction. In: CVPR, pp. 22106–22117 (2023)
Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: Motionclip: exposing human motion generation to clip space. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 358–374. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20047-2_21
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. In: ICLR (2022)
Tseng, J., Castellon, R., Liu, K.: Edge: editable dance generation from music. In: CVPR, pp. 448–458 (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: Tlcontrol: trajectory and language control for human motion synthesis. arXiv preprint arXiv:2311.17135 (2023)
Wang, Z., et al.: Move as you say interact as you can: language-guided human motion generation with scene affordance. In: CVPR, pp. 433–444 (2024)
Wang, Z., Chen, Y., Liu, T., Zhu, Y., Liang, W., Huang, S.: Humanise: language-conditioned human motion generation in 3D scenes. In: NeurIPS, pp. 14959–14971 (2022)
Wang, Z., Wang, J., Lin, D., Dai, B.: Intercontrol: generate human motion interactions by controlling every joint. arXiv preprint arXiv:2311.15864 (2023)
Xiao, Z., et al.: Unified human-scene interaction via prompted chain-of-contacts. In: ICLR (2024)
Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: OmniControl: control any joint at any time for human motion generation. In: ICLR (2024)
Xu, L., et al.: Inter-x: towards versatile human-human interaction analysis. In: CVPR, pp. 22260–22271 (2024)
Xu, L., et al.: Actformer: a GAN-based transformer towards general action-conditioned 3D human motion generation. In: ICCV, pp. 2228–2238 (2023)
Xu, L., et al.: RegenNet: towards human action-reaction synthesis. In: CVPR, pp. 1759–1769 (2024)
Yan, S., Li, Z., Xiong, Y., Yan, H., Lin, D.: Convolutional sequence generation for skeleton-based action synthesis. In: ICCV, pp. 4394–4402 (2019)
Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: PhysDiff: physics-guided human motion diffusion model. In: ICCV, pp. 16010–16021 (2023)
Zhang, B., et al.: RodinHD: high-fidelity 3D avatar generation with diffusion models. arXiv preprint arXiv:2407.06938 (2024)
Zhang, B., et al.: Gaussiancube: structuring gaussian splatting using optimal transport for 3D generative modeling. arXiv preprint arXiv:2403.19655 (2024)
Zhang, J., et al.: Generating human motion from textual descriptions with discrete representations. In: CVPR, pp. 14730–14740 (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV, pp. 3836–3847 (2023)
Zhang, M., et al.: Motiondiffuse: text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)
Zhao, R., Su, H., Ji, Q.: Bayesian adversarial human motion synthesis. In: CVPR, pp. 6225–6234 (2020)
Zhong, L., Xie, Y., Jampani, V., Sun, D., Jiang, H.: Smoodi: stylized motion diffusion model. arXiv preprint arXiv:2407.12783 (2024)
Zhou, W., et al.: EMDM: efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)

Acknowledgements

The research is supported by Shenzhen Ubiquitous Data Enabling Key Lab under grant ZDSYS20220527171406015 and CCF-Tencent Rhino-Bird Open Research Fund. This project is also supported by Shanghai Artificial Intelligence Laboratory. The author team would like to acknowledge Yiming Xie, Zhiyang Dou, and Shunlin Lu for their helpful technical discussions and suggestions.

Author information

Authors and Affiliations

Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Shenzhen, China
Wenxun Dai, Jinpeng Liu & Yansong Tang
Tsinghua University, Beijing, China
Wenxun Dai, Ling-Hao Chen, Jinpeng Liu & Yansong Tang
Shanghai AI Laboratory, Shanghai, China
Jingbo Wang & Bo Dai

Authors

Wenxun Dai
Ling-Hao Chen
Jingbo Wang
Jinpeng Liu
Bo Dai
Yansong Tang

Corresponding authors

Correspondence to Jingbo Wang or Bo Dai.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 55274 KB)

Cite this paper

Dai, W., Chen, LH., Wang, J., Liu, J., Dai, B., Tang, Y. (2025). MotionLCM: Real-Time Controllable Motion Generation via Latent Consistency Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15074. Springer, Cham. https://doi.org/10.1007/978-3-031-72640-8_22

DOI: https://doi.org/10.1007/978-3-031-72640-8_22
Published: 29 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72639-2
Online ISBN: 978-3-031-72640-8
eBook Packages: Computer Science, Computer Science (R0)
