DOI: 10.1145/3503161.3548400
Research article · Open access

DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gestures Synthesis

Published: 10 October 2022

Abstract

Current co-speech gesture synthesis methods, which train customized models and strategies on the original data distribution, struggle to generate diverse motions and typically collapse to a single or a few frequent motion sequences. We tackle this problem by temporally clustering motion sequences into content and rhythm segments and then training on a content-balanced data distribution. In particular, by clustering motion sequences we observe that, for each rhythm pattern, some motions appear frequently while others appear rarely. This imbalance makes low-frequency motions difficult to generate, and it cannot easily be solved by resampling because of the inherent many-to-many mapping between content and rhythm. We therefore present DisCo, which disentangles motion into implicit content and rhythm features via a contrastive loss, so that different data-balancing strategies can be applied to each. In addition, to model the inherent mapping between content and rhythm features, we design a diversity-and-inclusion network (DIN) that first generates content-feature candidates and then selects one candidate by learned voting. Experiments on two public datasets, Trinity and S2G-Ellen, show that DisCo generates more realistic and diverse motions than state-of-the-art methods. Code and data are available at https://pantomatrix.github.io/DisCo/
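
To make the two mechanisms concrete, below is a minimal PyTorch sketch of what the abstract describes: a pair of encoders that disentangle a motion clip into content and rhythm features, trained with a triplet-style contrastive loss, and a DIN-style head that proposes several content candidates and soft-selects one by learned voting. All names, layer choices, and dimensions (`DisentangledEncoder`, `DiversityAndInclusion`, `pose_dim`, `k`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch, assuming the structure described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledEncoder(nn.Module):
    """Split a motion clip into implicit content and rhythm features.
    The temporal-conv branches and feature sizes are placeholders."""

    def __init__(self, pose_dim=48, feat_dim=128):
        super().__init__()

        def branch():
            return nn.Sequential(
                nn.Conv1d(pose_dim, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )

        self.content_enc = branch()
        self.rhythm_enc = branch()

    def forward(self, motion):                  # motion: (B, T, pose_dim)
        x = motion.transpose(1, 2)              # (B, pose_dim, T)
        content = self.content_enc(x).squeeze(-1)
        rhythm = self.rhythm_enc(x).squeeze(-1)
        return content, rhythm


def contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style contrastive objective: features of clips from the same
    cluster are pulled together, features from other clusters pushed apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()


class DiversityAndInclusion(nn.Module):
    """DIN-like head: propose k content-feature candidates from an audio
    feature, then soft-select one by a learned vote (k and the voting head
    are assumptions for illustration)."""

    def __init__(self, audio_dim=64, feat_dim=128, k=8):
        super().__init__()
        self.k = k
        self.proposer = nn.Linear(audio_dim, k * feat_dim)
        self.voter = nn.Linear(audio_dim + feat_dim, 1)

    def forward(self, audio_feat):              # audio_feat: (B, audio_dim)
        b = audio_feat.size(0)
        cands = self.proposer(audio_feat).view(b, self.k, -1)  # (B, k, feat)
        ctx = audio_feat.unsqueeze(1).expand(-1, self.k, -1)
        votes = self.voter(torch.cat([ctx, cands], dim=-1)).squeeze(-1)
        weights = F.softmax(votes, dim=-1)                     # (B, k)
        return (weights.unsqueeze(-1) * cands).sum(dim=1)      # selected content


if __name__ == "__main__":
    enc, din = DisentangledEncoder(), DiversityAndInclusion()
    motion = torch.randn(4, 64, 48)             # 4 clips of 64 pose frames
    audio = torch.randn(4, 64)
    content, rhythm = enc(motion)
    loss = contrastive_loss(content[0:2], content[1:3], content[2:4])
    selected = din(audio)                       # content feature for decoding
```

Note the soft (differentiable) vote here stands in for the paper's "learned voting"; a hard argmax selection at inference time would be a straightforward variant.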

Supplementary Material

MP4 File (MM22_fp3051.mp4)
Presentation video





          Published In

          MM '22: Proceedings of the 30th ACM International Conference on Multimedia
          October 2022, 7537 pages
          ISBN: 9781450392037
          DOI: 10.1145/3503161

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. co-speech gestures synthesis
          2. cross-modal mapping
          3. disentangled representation learning


          Conference

          MM '22

          Acceptance Rates

          Overall Acceptance Rate 995 of 4,171 submissions, 24%


          Article Metrics

          • Downloads (Last 12 months)418
          • Downloads (Last 6 weeks)40
          Reflects downloads up to 01 Sep 2024

          Cited By

          • (2024) Research progress in human-like indoor scene interaction. Journal of Image and Graphics 29(6), 1575-1606. DOI: 10.11834/jig.240004
          • (2024) Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis. ACM Transactions on Graphics 43(4), 1-17. DOI: 10.1145/3658134 (19 Jul 2024)
          • (2024) DR2: Disentangled Recurrent Representation Learning for Data-efficient Speech Video Synthesis. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6192-6202. DOI: 10.1109/WACV57701.2024.00609 (3 Jan 2024)
          • (2024) Learning hierarchical discrete prior for co-speech gesture generation. Neurocomputing 595, 127831. DOI: 10.1016/j.neucom.2024.127831 (Aug 2024)
          • (2024) Key points trajectory and multi-level depth distinction based refinement for video mirror and glass segmentation. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19627-5 (20 Jun 2024)
          • (2023) Hierarchical Spatio-Temporal Neural Network with Displacement Based Refinement for Monocular Head Pose Prediction. 18th International Conference on Machine Vision and Applications (MVA), 1-5. DOI: 10.23919/MVA57639.2023.10216167 (23 Jul 2023)
          • (2023) Intra-frame Skeleton Constraints Modeling and Grouping Strategy Based Multi-Scale Graph Convolution Network for 3D Human Motion Prediction. 18th International Conference on Machine Vision and Applications (MVA), 1-5. DOI: 10.23919/MVA57639.2023.10216076 (23 Jul 2023)
          • (2023) ACT2G. Proceedings of the ACM on Computer Graphics and Interactive Techniques 6(3), 1-17. DOI: 10.1145/3606940 (24 Aug 2023)
          • (2023) Bodyformer: Semantics-guided 3D Body Gesture Synthesis with Transformer. ACM Transactions on Graphics 42(4), 1-12. DOI: 10.1145/3592456 (26 Jul 2023)
          • (2023) GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents. ACM Transactions on Graphics 42(4), 1-18. DOI: 10.1145/3592097 (26 Jul 2023)
