Abstract
This work introduces MotionLCM, which extends controllable motion generation to the real-time level. Existing methods for spatial-temporal control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the latent diffusion model [9]. By adopting one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM, enabling explicit control signals (e.g., initial poses) in the vanilla motion space to directly steer the generation process, similar to how latent-free diffusion models [29, 73] are controlled for motion generation. With these techniques, our approach generates human motions from text and control signals in real time. Experimental results demonstrate the remarkable generation and controlling capabilities of MotionLCM while maintaining real-time runtime efficiency.
L.-H. Chen—Project lead.
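To make the inference path described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of one-step consistency sampling in a motion latent space with a ControlNet-style conditioning branch. The module names (ConsistencyDenoiser, ControlBranch), dimensions (LATENT_DIM, text_dim, control_dim), and the simple MLP backbones are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch: one-step latent consistency sampling with a ControlNet-style branch.
import torch
import torch.nn as nn

LATENT_DIM = 256  # assumed size of the motion latent from the motion VAE encoder


class ConsistencyDenoiser(nn.Module):
    """Stand-in for the latent consistency network f_theta(z_t, t, text)."""

    def __init__(self, dim=LATENT_DIM, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + text_dim + 1, 512), nn.SiLU(), nn.Linear(512, dim)
        )

    def forward(self, z_t, t, text_emb, control_residual=None):
        # Concatenate the noisy latent, text embedding, and timestep into one vector.
        h = torch.cat([z_t, text_emb, t.expand(z_t.size(0), 1)], dim=-1)
        out = self.net(h)
        if control_residual is not None:
            # ControlNet-style conditioning: add a residual computed from control signals.
            out = out + control_residual
        return out


class ControlBranch(nn.Module):
    """Maps explicit control signals (e.g., initial poses) to a latent-space residual."""

    def __init__(self, control_dim=66, dim=LATENT_DIM):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(control_dim, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, control):
        return self.proj(control)


@torch.no_grad()
def one_step_sample(model, control_branch, text_emb, control, sigma_max=80.0):
    """One-step consistency sampling: map noise at the terminal time directly to z_0."""
    z_T = torch.randn(text_emb.size(0), LATENT_DIM) * sigma_max
    t = torch.tensor([sigma_max])
    z_0 = model(z_T, t, text_emb, control_residual=control_branch(control))
    return z_0  # a motion VAE decoder would turn z_0 into a motion sequence


# Usage with random stand-ins for the text embedding and control signal:
model, ctrl = ConsistencyDenoiser(), ControlBranch()
latents = one_step_sample(model, ctrl, torch.randn(2, 512), torch.randn(2, 66))
print(latents.shape)  # torch.Size([2, 256])
```

In a few-step variant, the predicted latent would be re-noised to an intermediate timestep and passed through the consistency network again before decoding.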
Notes
1. EMA operation: \(\mathbf{\Theta}^{-} \leftarrow \texttt{sg}(\mu \mathbf{\Theta}^{-} + (1 - \mu) \mathbf{\Theta})\), where \(\texttt{sg}(\cdot)\) denotes the stopgrad operation and \(\mu\) satisfies \(0 \le \mu < 1\).
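As a concrete reading of this update rule, here is a minimal PyTorch sketch (an illustrative assumption, not the paper's code): `target_params` plays the role of \(\mathbf{\Theta}^{-}\), `online_params` of \(\mathbf{\Theta}\), and `torch.no_grad()`/`detach()` stand in for \(\texttt{sg}(\cdot)\).

```python
# Minimal sketch of the EMA update in note 1 (illustrative, not the authors' code).
import torch


@torch.no_grad()  # no gradients flow through the update, mirroring sg(.)
def ema_update(target_params, online_params, mu=0.95):
    """Theta^- <- sg(mu * Theta^- + (1 - mu) * Theta), with 0 <= mu < 1."""
    for p_target, p_online in zip(target_params, online_params):
        p_target.mul_(mu).add_(p_online.detach(), alpha=1 - mu)


# Usage with two copies of the same (hypothetical) network:
online = torch.nn.Linear(4, 4)
target = torch.nn.Linear(4, 4)
target.load_state_dict(online.state_dict())
ema_update(target.parameters(), online.parameters(), mu=0.95)
```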
References
Ahn, H., Ha, T., Choi, Y., Yoo, H., Oh, S.: Text2Action: generative adversarial synthesis from language to action. In: ICRA, pp. 5915–5920 (2018)
Ahuja, C., Morency, L.P.: Language2Pose: natural language grounded pose forecasting. In: 3DV, pp. 719–728 (2019)
Athanasiou, N., Petrovich, M., Black, M.J., Varol, G.: Teach: temporal action composition for 3D humans. In: 3DV, pp. 414–423 (2022)
Barquero, G., Escalera, S., Palmero, C.: Seamless human motion composition with blended positional encodings. In: CVPR, pp. 457–469 (2024)
Bhattacharya, U., Rewkowski, N., Banerjee, A., Guhan, P., Bera, A., Manocha, D.: Text2Gestures: a transformer-based network for generating emotive body gestures for virtual agents. In: VR, pp. 1–10 (2021)
Cervantes, P., Sekikawa, Y., Sato, I., Shinoda, K.: Implicit neural representations for variable length human motion generation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13677, pp. 356–372. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19790-1_22
Chen, L.H., et al.: MotionLLM: understanding human behaviors from human motions and videos. arXiv preprint arXiv:2405.20340 (2024)
Chen, L.H., Zhang, J., Li, Y., Pang, Y., Xia, X., Liu, T.: HumanMAC: masked motion completion for human motion prediction. In: ICCV, pp. 9544–9555 (2023)
Chen, X., et al.: Executing your commands via motion diffusion in latent space. In: CVPR, pp. 18000–18010 (2023)
Cong, P., et al.: LaserHuman: language-guided scene-aware human motion generation in free environment. arXiv preprint arXiv:2403.13307 (2024)
Dabral, R., Mughal, M.H., Golyanik, V., Theobalt, C.: MoFusion: a framework for denoising-diffusion-based motion synthesis. In: CVPR, pp. 9760–9770 (2023)
Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: C·ASE: learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia, pp. 1–11 (2023)
Fan, K., et al.: FreeMotion: a unified framework for number-free text-to-motion synthesis. arXiv preprint arXiv:2405.15763 (2024)
Ghosh, A., Cheema, N., Oguz, C., Theobalt, C., Slusallek, P.: Synthesis of compositional animations from textual descriptions. In: ICCV, pp. 1396–1406 (2021)
Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: generative masked modeling of 3D human motions. In: CVPR, pp. 1900–1910 (2024)
Guo, C., et al.: Generative human motion stylization in latent space. arXiv preprint arXiv:2401.13505 (2024)
Guo, C., et al.: Generating diverse and natural 3D human motions from text. In: CVPR, pp. 5152–5161 (2022)
Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13695, pp. 580–597. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_34
Guo, C., et al.: Action2Motion: conditioned generation of 3D human motions. In: ACMMM, pp. 2021–2029 (2020)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS, pp. 6840–6851 (2020)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Holden, D., Komura, T., Saito, J.: Phase-functioned neural networks for character control. TOG 36(4), 1–13 (2017)
Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. TOG 35(4), 1–11 (2016)
Huang, Y., et al.: StableMoFusion: towards robust and efficient diffusion-based motion generation framework. arXiv preprint arXiv:2405.05691 (2024)
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
Ji, Y., Xu, F., Yang, Y., Shen, F., Shen, H.T., Zheng, W.S.: A large-scale RGB-D database for arbitrary-view human action recognition. In: ACMMM, pp. 1510–1518 (2018)
Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: human motion as a foreign language. In: NeurIPS (2024)
Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. In: NeurIPS, pp. 26565–26577 (2022)
Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis. In: CVPR, pp. 2151–2162 (2023)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
Lee, T., Moon, G., Lee, K.M.: MultiAct: long-term 3D human motion generation from multiple action labels. In: AAAI, pp. 1231–1239 (2023)
Li, B., Zhao, Y., Zhelun, S., Sheng, L.: DanceFormer: music conditioned 3D dance generation with parametric motion transformer. In: AAAI, pp. 1272–1279 (2022)
Li, R., et al.: Lodge: a coarse to fine diffusion network for long dance generation guided by the characteristic dance primitives. In: CVPR, pp. 1524–1534 (2024)
Li, R., et al.: FineDance: a fine-grained choreography dataset for 3D full body dance generation. In: ICCV, pp. 10234–10243 (2023)
Li, R., Yang, S., Ross, D.A., Kanazawa, A.: AI choreographer: music conditioned 3D dance generation with AIST++. In: ICCV, pp. 13401–13412 (2021)
Li, T., Qiao, C., Ren, G., Yin, K., Ha, S.: AAMDM: accelerated auto-regressive motion diffusion model. In: CVPR, pp. 1813–1823 (2024)
Lin, A.S., Wu, L., Corona, R., Tai, K., Huang, Q., Mooney, R.J.: Generating animated videos of human activities from natural language descriptions. Learning 1(2018), 1 (2018)
Lin, X., Amer, M.R.: Human motion modeling using DVGANs. arXiv preprint arXiv:1804.10652 (2018)
Liu, J., Dai, W., Wang, C., Cheng, Y., Tang, Y., Tong, X.: Plan, posture and go: towards open-world text-to-motion generation. arXiv preprint arXiv:2312.14828 (2023)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In: NeurIPS, pp. 5775–5787 (2022)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver++: fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095 (2022)
Lu, S., et al.: HumanTOMATO: text-aligned whole-body motion generation. arXiv preprint arXiv:2310.12978 (2023)
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: ICCV, pp. 5442–5451 (2019)
Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: ICML, pp. 8162–8171 (2021)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
Petrovich, M., Black, M.J., Varol, G.: Action-conditioned 3D human motion synthesis with transformer VAE. In: ICCV, pp. 10985–10995 (2021)
Petrovich, M., Black, M.J., Varol, G.: TEMOS: generating diverse human motions from textual descriptions. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 480–497. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20047-2_28
Petrovich, M., Black, M.J., Varol, G.: TMR: text-to-motion retrieval using contrastive 3D human motion synthesis. In: ICCV, pp. 9488–9497 (2023)
Petrovich, M., et al.: Multi-track timeline control for text-driven 3D human motion generation. In: CVPRW, pp. 1911–1921 (2024)
Plappert, M., Mandery, C., Asfour, T.: The KIT motion-language dataset. Big Data 4(4), 236–252 (2016)
Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: bodies, action and behavior with English labels. In: CVPR, pp. 722–731 (2021)
Raab, S., Leibovitch, I., Li, P., Aberman, K., Sorkine-Hornung, O., Cohen-Or, D.: MoDi: unconditional motion synthesis from diverse data. In: CVPR, pp. 13873–13883 (2023)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
Shafir, Y., Tevet, G., Kapon, R., Bermano, A.H.: Human motion diffusion as a generative prior. In: ICLR (2024)
Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: CVPR, pp. 1010–1019 (2016)
Shi, Y., Wang, J., Jiang, X., Dai, B.: Controllable motion diffusion model. arXiv preprint arXiv:2306.00416 (2023)
Siyao, L., et al.: Bailando: 3D dance generation by actor-critic GPT with choreographic memory. In: CVPR, pp. 11050–11059 (2022)
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML, pp. 2256–2265 (2015)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: ICML (2023)
Tang, Y., et al.: FLAG3D: a 3D fitness activity dataset with language instruction. In: CVPR, pp. 22106–22117 (2023)
Tevet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: MotionCLIP: exposing human motion generation to CLIP space. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 358–374. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20047-2_21
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. In: ICLR (2022)
Tseng, J., Castellon, R., Liu, K.: EDGE: editable dance generation from music. In: CVPR, pp. 448–458 (2023)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: TLControl: trajectory and language control for human motion synthesis. arXiv preprint arXiv:2311.17135 (2023)
Wang, Z., et al.: Move as you say, interact as you can: language-guided human motion generation with scene affordance. In: CVPR, pp. 433–444 (2024)
Wang, Z., Chen, Y., Liu, T., Zhu, Y., Liang, W., Huang, S.: HUMANISE: language-conditioned human motion generation in 3D scenes. In: NeurIPS, pp. 14959–14971 (2022)
Wang, Z., Wang, J., Lin, D., Dai, B.: InterControl: generate human motion interactions by controlling every joint. arXiv preprint arXiv:2311.15864 (2023)
Xiao, Z., et al.: Unified human-scene interaction via prompted chain-of-contacts. In: ICLR (2024)
Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: OmniControl: control any joint at any time for human motion generation. In: ICLR (2024)
Xu, L., et al.: Inter-X: towards versatile human-human interaction analysis. In: CVPR, pp. 22260–22271 (2024)
Xu, L., et al.: ActFormer: a GAN-based transformer towards general action-conditioned 3D human motion generation. In: ICCV, pp. 2228–2238 (2023)
Xu, L., et al.: ReGenNet: towards human action-reaction synthesis. In: CVPR, pp. 1759–1769 (2024)
Yan, S., Li, Z., Xiong, Y., Yan, H., Lin, D.: Convolutional sequence generation for skeleton-based action synthesis. In: ICCV, pp. 4394–4402 (2019)
Yuan, Y., Song, J., Iqbal, U., Vahdat, A., Kautz, J.: PhysDiff: physics-guided human motion diffusion model. In: ICCV, pp. 16010–16021 (2023)
Zhang, B., et al.: RodinHD: high-fidelity 3D avatar generation with diffusion models. arXiv preprint arXiv:2407.06938 (2024)
Zhang, B., et al.: GaussianCube: structuring Gaussian splatting using optimal transport for 3D generative modeling. arXiv preprint arXiv:2403.19655 (2024)
Zhang, J., et al.: Generating human motion from textual descriptions with discrete representations. In: CVPR, pp. 14730–14740 (2023)
Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV, pp. 3836–3847 (2023)
Zhang, M., et al.: MotionDiffuse: text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001 (2022)
Zhao, R., Su, H., Ji, Q.: Bayesian adversarial human motion synthesis. In: CVPR, pp. 6225–6234 (2020)
Zhong, L., Xie, Y., Jampani, V., Sun, D., Jiang, H.: SMooDi: stylized motion diffusion model. arXiv preprint arXiv:2407.12783 (2024)
Zhou, W., et al.: EMDM: efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)
Acknowledgements
The research is supported by Shenzhen Ubiquitous Data Enabling Key Lab under grant ZDSYS20220527171406015 and CCF-Tencent Rhino-Bird Open Research Fund. This project is also supported by Shanghai Artificial Intelligence Laboratory. The author team would like to acknowledge Yiming Xie, Zhiyang Dou, and Shunlin Lu for their helpful technical discussions and suggestions.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dai, W., Chen, L.H., Wang, J., Liu, J., Dai, B., Tang, Y. (2025). MotionLCM: Real-Time Controllable Motion Generation via Latent Consistency Model. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15074. Springer, Cham. https://doi.org/10.1007/978-3-031-72640-8_22
DOI: https://doi.org/10.1007/978-3-031-72640-8_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72639-2
Online ISBN: 978-3-031-72640-8
eBook Packages: Computer Science, Computer Science (R0)