
Visual Speech Synthesis by Morphing Visemes

Published: 30 June 2000

Abstract

We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject, designed specifically to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.
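As a rough sketch of the pipeline the abstract describes, the code below generates one viseme-to-viseme transition by warping two mouth images along dense optical flow and cross-dissolving the results. This is a minimal approximation under stated assumptions, not the authors' implementation: OpenCV's Farneback flow stands in for the paper's optical flow computation, the intermediate correspondence is approximated by linearly scaling the flow fields, and the names (morph_visemes, img_a, img_b, n_frames) are hypothetical.

import cv2
import numpy as np

def morph_visemes(img_a, img_b, n_frames=8):
    # Hypothetical sketch: transition from viseme image img_a to img_b.
    # Farneback flow is a stand-in for the paper's optical flow method.
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow_ab = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                           0.5, 4, 21, 3, 5, 1.1, 0)
    flow_ba = cv2.calcOpticalFlowFarneback(gray_b, gray_a, None,
                                           0.5, 4, 21, 3, 5, 1.1, 0)
    h, w = gray_a.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        # Backward-warp each endpoint partway along the (linearly
        # scaled) flow field, then cross-dissolve the two warps so
        # that mouth shape and texture both change smoothly.
        warp_a = cv2.remap(img_a, xs + t * flow_ba[..., 0],
                           ys + t * flow_ba[..., 1], cv2.INTER_LINEAR)
        warp_b = cv2.remap(img_b, xs + (1.0 - t) * flow_ab[..., 0],
                           ys + (1.0 - t) * flow_ab[..., 1], cv2.INTER_LINEAR)
        frames.append(cv2.addWeighted(warp_a, 1.0 - t, warp_b, t, 0.0))
    return frames

A full utterance would concatenate such transitions, choosing n_frames for each from the phoneme durations reported by the text-to-speech synthesizer (Festival, in the paper); it is this timing information that keeps the visual stream synchronized with the audio stream.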



Published In

International Journal of Computer Vision, Volume 38, Issue 1
Special Issue on Learning and Vision at the Center for Biological and Computational Learning, Massachusetts Institute of Technology
June 2000
82 pages
ISSN: 0920-5691

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 30 June 2000

Author Tags

  1. computer vision
  2. facial animation
  3. facial modelling
  4. lip synchronization
  5. machine learning
  6. morphing
  7. optical flow
  8. speech synthesis

Qualifiers

  • Article


Cited By

  • (2023) SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces. Proceedings of the 31st ACM International Conference on Multimedia, pp. 5292-5301. DOI: 10.1145/3581783.3611734. Online publication date: 26-Oct-2023.
  • (2023) LipLearner: Customizable Silent Speech Interactions on Mobile Devices. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1-21. DOI: 10.1145/3544548.3581465. Online publication date: 19-Apr-2023.
  • (2021) PTeacher: a Computer-Aided Personalized Pronunciation Training System with Exaggerated Audio-Visual Corrective Feedback. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-14. DOI: 10.1145/3411764.3445490. Online publication date: 6-May-2021.
  • (2019) You Said That?: Synthesising Talking Faces from Audio. International Journal of Computer Vision, 127(11-12), pp. 1767-1779. DOI: 10.1007/s11263-019-01150-y. Online publication date: 1-Dec-2019.
  • (2017) A review on data-driven learning of a talking head model. International Journal of Intelligent Systems Technologies and Applications, 16(2), pp. 169-190. DOI: 10.1504/IJISTA.2017.084239. Online publication date: 1-Jan-2017.
  • (2016) High-fidelity facial and speech animation for VR HMDs. ACM Transactions on Graphics, 35(6), pp. 1-14. DOI: 10.1145/2980179.2980252. Online publication date: 5-Dec-2016.
  • (2016) Labeled Graph Kernel for Behavior Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8), pp. 1640-1650. DOI: 10.1109/TPAMI.2015.2481404. Online publication date: 1-Aug-2016.
  • (2016) Emotional head motion predicting from prosodic and linguistic features. Multimedia Tools and Applications, 75(9), pp. 5125-5146. DOI: 10.1007/s11042-016-3405-3. Online publication date: 1-May-2016.
  • (2016) A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications, 75(9), pp. 5287-5309. DOI: 10.1007/s11042-015-2944-3. Online publication date: 1-May-2016.
  • (2015) Video-audio driven real-time facial animation. ACM Transactions on Graphics, 34(6), pp. 1-10. DOI: 10.1145/2816795.2818122. Online publication date: 2-Nov-2015.
