DOI: 10.1145/566570.566594

Trainable videorealistic speech animation

Published: 01 July 2002

Abstract

We describe how to create a generative, videorealistic speech animation module using machine learning techniques. A human subject is first recorded with a video camera while uttering a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; it can synthesize the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence containing natural head and eye movement. The final output is videorealistic in the sense that it looks like a video-camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as it has been phonetically aligned.

The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) that synthesizes new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is trained automatically from the recorded video corpus and can synthesize trajectories in MMM space corresponding to any desired utterance.
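The two contributions can be illustrated with a small, hedged sketch. This is not the authors' implementation: the paper's MMM also morphs mouth shape via optical-flow correspondences, and its regularized trajectory model is trained from the corpus, whereas here the prototypes, targets, and smoothness weight `lam` are all invented for illustration. The sketch (a) smooths sparse per-phone targets into a trajectory by solving a Tikhonov-regularized least-squares problem, then (b) renders each frame as a convex combination of mouth-image prototypes.

```python
import numpy as np

def synthesize_trajectory(targets, lam=5.0):
    """Minimize ||x - targets||^2 + lam * ||D x||^2, where D is a
    second-difference operator (smoothness penalty). The minimizer solves
    the linear system (I + lam * D^T D) x = targets, one column per
    trajectory dimension."""
    T, _ = targets.shape
    D = np.zeros((T - 2, T))
    for t in range(T - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]  # second difference at frame t
    A = np.eye(T) + lam * D.T @ D
    return np.linalg.solve(A, targets)

def render_frames(trajectory, prototypes):
    """Appearance-only stand-in for MMM synthesis: each frame is a convex
    combination of prototype mouth images, weighted by the trajectory."""
    w = np.clip(trajectory, 0.0, None)
    w = w / w.sum(axis=1, keepdims=True)       # convex weights per frame
    return np.einsum('tk,khw->thw', w, prototypes)

# Toy data: 3 prototype "mouth images" and one-hot per-frame targets that
# jump abruptly between prototypes (the trajectory smooths the jumps).
rng = np.random.default_rng(0)
prototypes = rng.random((3, 8, 8))
targets = np.zeros((20, 3))
targets[:7, 0] = 1.0
targets[7:14, 1] = 1.0
targets[14:, 2] = 1.0

traj = synthesize_trajectory(targets)
frames = render_frames(traj, prototypes)
print(frames.shape)  # (20, 8, 8)
```

A real MMM would also interpolate prototype optical-flow fields so that mouth geometry, not just texture, morphs between prototypes; this convex image blend is only the simplest stand-in for that step.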


Published In

SIGGRAPH '02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques
July 2002
574 pages
ISBN: 1581135211
DOI: 10.1145/566570
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. facial animation
  2. facial modeling
  3. lip synchronization
  4. morphing
  5. optical flow
  6. speech synthesis

Qualifiers

  • Article

Conference

SIGGRAPH '02

Acceptance Rates

SIGGRAPH '02 Paper Acceptance Rate: 67 of 358 submissions, 19%
Overall Acceptance Rate: 1,822 of 8,601 submissions, 21%

Cited By

  • (2023) SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement. Neural Processing Letters 55:6, 7529-7542. DOI: 10.1007/s11063-023-11272-7. Online publication date: 11-Apr-2023
  • (2023) Audio-Driven Lips and Expression on 3D Human Face. Advances in Computer Graphics, 15-26. DOI: 10.1007/978-3-031-50072-5_2. Online publication date: 29-Dec-2023
  • (2022) Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data. IEEE Transactions on Multimedia 24, 2539-2552. DOI: 10.1109/TMM.2021.3087020. Online publication date: 2022
  • (2021) Virtual Learning Tools for Students with Delimited Ability. Cooperative Design, Visualization, and Engineering, 342-347. DOI: 10.1007/978-3-030-88207-5_34. Online publication date: 1-Oct-2021
  • (2019) Capture, Learning, and Synthesis of 3D Speaking Styles. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10093-10103. DOI: 10.1109/CVPR.2019.01034. Online publication date: Jun-2019
  • (2019) Augmented Reality as a Higher Education Form for Students with Delimited Ability. Smart Education and e-Learning 2019, 461-469. DOI: 10.1007/978-981-13-8260-4_41. Online publication date: 1-Jun-2019
  • (2018) End-to-end Learning for 3D Facial Animation from Speech. Proceedings of the 20th ACM International Conference on Multimodal Interaction, 361-365. DOI: 10.1145/3242969.3243017. Online publication date: 2-Oct-2018
  • (2018) Visemenet. ACM Transactions on Graphics 37:4, 1-10. DOI: 10.1145/3197517.3201292. Online publication date: 30-Jul-2018
  • (2018) A Taxonomy of Audiovisual Fake Multimedia Content Creation Technology. 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 372-377. DOI: 10.1109/MIPR.2018.00082. Online publication date: Apr-2018
  • (2018) Visual Speech Animation. Handbook of Human Motion, 2115-2144. DOI: 10.1007/978-3-319-14418-4_1. Online publication date: 5-Apr-2018
