DOI: 10.1145/566570.566594

Trainable videorealistic speech animation

Published: 01 July 2002

Abstract

We describe how to create a generative, videorealistic speech animation module using machine learning techniques. A human subject is first recorded with a video camera while uttering a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; it can synthesize the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence containing natural head and eye movement. The final output is videorealistic in the sense that it looks like a video-camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as it has been phonetically aligned.

The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) that synthesizes new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is trained automatically from the recorded video corpus and can synthesize trajectories in MMM space corresponding to any desired utterance.
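The two contributions can be illustrated with a small, hedged sketch. This is not the authors' implementation: the paper's MMM also morphs mouth shape via optical-flow correspondences, and its regularized trajectory model is trained from the corpus, whereas here the prototypes, targets, and smoothness weight `lam` are all invented for illustration. The sketch (a) smooths sparse per-phone targets into a trajectory by solving a Tikhonov-regularized least-squares problem, then (b) renders each frame as a convex combination of mouth-image prototypes.

```python
import numpy as np

def synthesize_trajectory(targets, lam=5.0):
    """Minimize ||x - targets||^2 + lam * ||D x||^2, where D is a
    second-difference operator (smoothness penalty). The minimizer solves
    the linear system (I + lam * D^T D) x = targets, one column per
    trajectory dimension."""
    T, _ = targets.shape
    D = np.zeros((T - 2, T))
    for t in range(T - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]  # second difference at frame t
    A = np.eye(T) + lam * D.T @ D
    return np.linalg.solve(A, targets)

def render_frames(trajectory, prototypes):
    """Appearance-only stand-in for MMM synthesis: each frame is a convex
    combination of prototype mouth images, weighted by the trajectory."""
    w = np.clip(trajectory, 0.0, None)
    w = w / w.sum(axis=1, keepdims=True)       # convex weights per frame
    return np.einsum('tk,khw->thw', w, prototypes)

# Toy data: 3 prototype "mouth images" and one-hot per-frame targets that
# jump abruptly between prototypes (the trajectory smooths the jumps).
rng = np.random.default_rng(0)
prototypes = rng.random((3, 8, 8))
targets = np.zeros((20, 3))
targets[:7, 0] = 1.0
targets[7:14, 1] = 1.0
targets[14:, 2] = 1.0

traj = synthesize_trajectory(targets)
frames = render_frames(traj, prototypes)
print(frames.shape)  # (20, 8, 8)
```

A real MMM would also interpolate prototype optical-flow fields so that mouth geometry, not just texture, morphs between prototypes; this convex image blend is only the simplest stand-in for that step.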


Published In

SIGGRAPH '02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques
July 2002
574 pages
ISBN: 1581135211
DOI: 10.1145/566570
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. facial animation
  2. facial modeling
  3. lip synchronization
  4. morphing
  5. optical flow
  6. speech synthesis

Qualifiers

  • Article

Conference

SIGGRAPH '02

Acceptance Rates

SIGGRAPH '02 Paper Acceptance Rate: 67 of 358 submissions, 19%
Overall Acceptance Rate: 1,822 of 8,601 submissions, 21%

Cited By

  • (2023) SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement. Neural Processing Letters 55:6, 7529-7542. DOI: 10.1007/s11063-023-11272-7. Online publication date: 11-Apr-2023
  • (2023) Audio-Driven Lips and Expression on 3D Human Face. Advances in Computer Graphics, 15-26. DOI: 10.1007/978-3-031-50072-5_2. Online publication date: 29-Dec-2023
  • (2022) Speaker-Independent Speech Animation Using Perceptual Loss Functions and Synthetic Data. IEEE Transactions on Multimedia 24, 2539-2552. DOI: 10.1109/TMM.2021.3087020. Online publication date: 2022
  • (2021) Virtual Learning Tools for Students with Delimited Ability. Cooperative Design, Visualization, and Engineering, 342-347. DOI: 10.1007/978-3-030-88207-5_34. Online publication date: 1-Oct-2021
  • (2019) Capture, Learning, and Synthesis of 3D Speaking Styles. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10093-10103. DOI: 10.1109/CVPR.2019.01034. Online publication date: Jun-2019
  • (2019) Augmented Reality as a Higher Education Form for Students with Delimited Ability. Smart Education and e-Learning 2019, 461-469. DOI: 10.1007/978-981-13-8260-4_41. Online publication date: 1-Jun-2019
  • (2018) End-to-end Learning for 3D Facial Animation from Speech. Proceedings of the 20th ACM International Conference on Multimodal Interaction, 361-365. DOI: 10.1145/3242969.3243017. Online publication date: 2-Oct-2018
  • (2018) Visemenet. ACM Transactions on Graphics 37:4, 1-10. DOI: 10.1145/3197517.3201292. Online publication date: 30-Jul-2018
  • (2018) A Taxonomy of Audiovisual Fake Multimedia Content Creation Technology. 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 372-377. DOI: 10.1109/MIPR.2018.00082. Online publication date: Apr-2018
  • (2018) Visual Speech Animation. Handbook of Human Motion, 2115-2144. DOI: 10.1007/978-3-319-14418-4_1. Online publication date: 5-Apr-2018
