Abstract
Humans use both auditory and facial cues to perceive speech, especially when the auditory input is degraded, indicating a direct association between visual articulatory and acoustic speech information. This study investigates how well the audio signal of a word can be synthesized from visual speech cues. Specifically, we synthesized audio waveforms of the vowels in monosyllabic English words from motion trajectories extracted from image sequences in video recordings of the same words. The articulatory movements were recorded in two speech styles: plain and clear. We designed a deep network that takes mouth landmark motion trajectories as input and is trained with a spectrogram- and formant-based custom loss, separately for each speech style. Human and automatic evaluations show that our framework, using visual cues, can generate identifiable audio of the target vowels from distinct mouth landmark movements. Our results further demonstrate that intelligible audio can be synthesized for novel talkers unseen in the training data.
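For readers who want a concrete picture of this kind of video-to-audio pipeline, the Python sketch below shows one plausible realization: mouth landmarks are tracked per frame, a sequence model maps the landmark trajectory to a magnitude spectrogram, and a waveform is recovered with Griffin–Lim inversion (Griffin & Lim, 1984). The dlib landmark detector (King, 2009) and Griffin–Lim are cited in the references, but the mouth-point indices, the GRU architecture, and the STFT parameters here are illustrative assumptions, not the network proposed in the paper.

```python
# Illustrative sketch only: landmark tracking with dlib, a toy GRU mapping
# landmark trajectories to spectrogram frames, and Griffin-Lim inversion.
import dlib
import librosa
import numpy as np
import torch
import torch.nn as nn

detector = dlib.get_frontal_face_detector()
# Assumes the standard 68-point dlib landmark model file is available locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_landmarks(frame_gray):
    """Return the 20 mouth landmarks (points 48-67) of the first detected face
    as a flat (40,) vector of x/y coordinates."""
    face = detector(frame_gray, 1)[0]
    shape = predictor(frame_gray, face)
    pts = [[shape.part(i).x, shape.part(i).y] for i in range(48, 68)]
    return np.asarray(pts, dtype=np.float32).flatten()

class LandmarksToSpectrogram(nn.Module):
    """Toy sequence model: landmark trajectory -> non-negative magnitude frames."""
    def __init__(self, n_landmark_feats=40, n_freq_bins=513, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_landmark_feats, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_freq_bins)

    def forward(self, traj):                 # traj: (batch, frames, 40)
        h, _ = self.rnn(traj)
        return torch.relu(self.proj(h))      # (batch, frames, n_freq_bins)

def spectrogram_to_audio(mag_spec, n_iter=60, hop_length=200):
    """Phase reconstruction with Griffin-Lim; mag_spec has shape (freq, frames)."""
    return librosa.griffinlim(mag_spec, n_iter=n_iter, hop_length=hop_length)
```

In use, the predicted spectrogram for one clip would be transposed to (freq, frames) and converted to a NumPy array before being passed to spectrogram_to_audio.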






Data availability
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request. The code used in the current study is available in the repository: https://github.com/srbhgarg/VowSynth.git.
Notes
To add talker-related information to the output speech, we could condition the network by feeding the speaker ID to the model as a one-hot encoding. At test time, when a particular speaker’s voice is to be generated, we would provide the corresponding speaker’s one-hot vector. This is similar to what is done in WaveNet (van den Oord et al., 2016).
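A minimal sketch of such one-hot speaker conditioning, under the assumption of a simple GRU decoder over visual features, is given below; the layer sizes and the point where the speaker code is injected are illustrative choices, analogous to WaveNet’s global conditioning rather than a description of the model used in this work.

```python
# Minimal sketch of one-hot speaker conditioning (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerConditionedDecoder(nn.Module):
    def __init__(self, n_speakers, feat_dim=256, n_freq_bins=513):
        super().__init__()
        self.speaker_proj = nn.Linear(n_speakers, feat_dim)  # one-hot -> conditioning vector
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, n_freq_bins)

    def forward(self, visual_feats, speaker_onehot):
        # Broadcast the speaker code over time and add it to the visual features,
        # similar in spirit to WaveNet's global conditioning.
        cond = self.speaker_proj(speaker_onehot).unsqueeze(1)  # (batch, 1, feat_dim)
        h, _ = self.rnn(visual_feats + cond)
        return self.out(h)

# At test time, the desired voice is selected by passing that speaker's one-hot vector:
# speaker_onehot = F.one_hot(torch.tensor([speaker_id]), num_classes=n_speakers).float()
```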
References
Akbari, H., Arora, H., Cao, L., & Mesgarani, N. (2018). Lip2Audspec: Speech reconstruction from silent lip movements video. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2516–2520). https://doi.org/10.1109/icassp.2018.8461856
Anumanchipalli, G. K., Chartier, J., & Chang, E. F. (2018). Intelligible speech synthesis from neural decoding of spoken sentences. BioRxiv. https://doi.org/10.1101/481267
Assael, Y. M., Shillingford, B., Whiteson, S., & De Freitas, N. (2016). Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599.
Bernstein, L. E., Auer, E. T., Jr., & Takayanagi, S. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44(1–4), 5–18.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9), 341–345.
Bond, Z. S., & Moore, T. J. (1994). A note on the acoustic-phonetic characteristics of inadvertently clear speech. Speech Communication, 14, 325–337. https://doi.org/10.1016/0167-6393(94)90026-4
Bradlow, A. R., Torretta, G. M., & Pisoni, D. B. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20(3–4), 255–272.
Burris, C., Vorperian, H. K., Fourakis, M., Kent, R. D., & Bolt, D. M. (2014). Quantitative and descriptive comparison of four acoustic analysis systems: Vowel measurements. Journal of Speech, Language, and Hearing Research, 57(1), 26–45.
Chen, L., Su, H., & Ji, Q. (2019). Deep structured prediction for facial landmark detection. Advances in Neural Information Processing Systems, 32, 158.
Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421–2424.
Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America, 116(6), 3668–3678.
Dubnov, S. (2004). Generalization of spectral flatness measure for non-gaussian linear processes. IEEE Signal Processing Letters, 11(8), 698–701.
Ephrat, A., & Peleg, S. (2017). Vid2speech: Speech reconstruction from silent video. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5095–5099). IEEE.
Feng, S., Kudina, O., Halpern, B. M., & Scharenborg, O. (2021). Quantifying bias in automatic speech recognition. arXiv preprint arXiv:2103.15122.
Ferguson, S. H. (2012). Talker differences in clear and conversational speech: Vowel intelligibility for older adults with hearing loss. Journal of Speech Language and Hearing Research, 55(3), 779–790.
Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 112, 259–271.
Ferguson, S. H., & Kewley-Port, D. (2007). Talker differences in clear and conversational speech: Acoustic characteristics of vowels. Journal of Speech Language and Hearing Research, 50, 1241–1255.
Ferguson, S. H., & Quené, H. (2014). Acoustic correlates of vowel intelligibility in clear and conversational speech for young normal-hearing and elderly hearing-impaired listeners. The Journal of the Acoustical Society of America, 135(6), 3570–3584.
Freitas, J., Teixeira, A., Dias, M. S., & Silva, S. (2017). An introduction to silent speech interfaces. Springer.
Gagné, J. P., Rochette, A. J., & Charest, M. (2002). Auditory, visual and audiovisual clear speech. Speech Communication, 37(3–4), 213–230.
Garg, S., Tang, L., Hamarneh, G., Jongman, A., Sereno, J. A., & Wang, Y. (2019). Computer-vision analysis shows different facial movements for the production of different Mandarin tones. The Journal of the Acoustical Society of America, 144(3), 1720–1720.
Gonzalez-Lopez, J. A., Gomez-Alanis, A., Doñas, J. M. M., Pérez-Córdoba, J. L., & Gomez, A. M. (2020). Silent speech interfaces for speech restoration: A review. IEEE Access, 8, 177995–178021.
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236–243.
Harte, C., Sandler, M., & Gasser, M. (2006). Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on audio and music computing multimedia (pp. 21–26).
Heald, S., & Nusbaum, H. C. (2014). Speech perception as an active cognitive process. Frontiers in Systems Neuroscience, 8, 35.
Herff, C., Heger, D., De Pesters, A., Telaar, D., Brunner, P., Schalk, G., & Schultz, T. (2015). Brain-to-text: Decoding spoken phrases from phone representations in the brain. Frontiers in Neuroscience, 9, 217.
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099–3111.
Hueber, T., Benaroya, E. L., Chollet, G., Denby, B., Dreyfus, G., & Stone, M. (2010). Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips. Speech Communication, 52(4), 288–300.
Jongman, A., Wang, Y., & Kim, B. H. (2003). Contributions of semantic and facial information to perception of nonsibilant fricatives. Journal of Speech Language and Hearing Research, 46, 1367–1377.
Kawase, T., Hori, Y., Ogawa, T., Sakamoto, S., Suzuki, Y., & Katori, Y. (2015). Importance of visual cues in hearing restoration by auditory prosthesis. In Interface oral health science 2014 (pp. 119–127). Springer.
Kim, J., & Davis, C. (2014). Comparing the consistency and distinctiveness of speech produced in quiet and in noise. Computer Speech & Language, 28(2), 598–606.
King, D. E. (2009). Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10, 1755–1758.
Laitinen, M. V., Disch, S., & Pulkki, V. (2013). Sensitivity of human hearing to changes in phase spectrum. Journal of the Audio Engineering Society, 61(11), 860–877.
Lam, J., Tjaden, K., & Wilding, G. (2012). Acoustics of clear speech: Effect of instruction. Journal of Speech, Language, and Hearing Research, 55(6), 1807–1821. https://doi.org/10.1044/1092-4388(2012/11-0154)
Le Cornu, T., & Milner, B. (2015). Reconstructing intelligible audio speech from visual speech features. In Interspeech (pp. 3355–3359).
Leung, K. K., Redmon, C., Wang, Y., Jongman, A., & Sereno, J. (2016). Cross-linguistic perception of clearly spoken English tense and lax vowels based on auditory, visual, and auditory-visual information. The Journal of the Acoustical Society of America, 140(4), 3335–3335.
Lu, Y., & Cooke, M. (2008). Speech production modifications produced by competing talkers, babble, and stationary noise. The Journal of the Acoustical Society of America, 124(5), 3261–3275.
Maniwa, K., Jongman, A., & Wade, T. (2008). Perception of clear fricatives by normal-hearing and simulated hearing-impaired listeners. The Journal of the Acoustical Society of America, 123, 1114–1125.
Maniwa, K., Jongman, A., & Wade, T. (2009). Acoustic characteristics of clearly spoken English fricatives. The Journal of the Acoustical Society of America, 125(6), 3962–3973.
Mira, R., Vougioukas, K., Ma, P., Petridis, S., Schuller, B. W., & Pantic, M. (2022). End-to-end video-to-speech synthesis using generative adversarial networks. IEEE Transactions on Cybernetics. arXiv:2104.13332 [cs.LG]
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15(2), 133–137.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184.
Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research, 29(4), 434–446. https://doi.org/10.1044/jshr.2904.434
Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (2020). Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13796–13805).
Redmon, C., Leung, K., Wang, Y., McMurray, B., Jongman, A., & Sereno, J. A. (2020). Cross-linguistic perception of clearly spoken English tense and lax vowels based on auditory, visual, and auditory-visual information. Journal of Phonetics, 81, 100980.
Roesler, L. (2013). Acoustic characteristics of tense and lax vowels across sentence position in clear speech. Unpublished master’s thesis, University of Wisconsin-Milwaukee.
Saleem, N., Gao, J., Irfan, M., Verdu, E., & Fuente, J. P. (2022). E2E–V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis. Image and Vision Computing, 119, 104389.
Savitzky, A., & Golay, M. J. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639.
Schultz, T., & Wand, M. (2010). Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication, 52(4), 341–353.
Smiljanić, R., & Bradlow, A. R. (2009). Speaking and hearing clearly: Talker and listener factors in speaking style changes. Language and Linguistics Compass, 3(1), 236–264.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2), 212–215.
Tang, L. Y., Hannah, B., Jongman, A., Sereno, J., Wang, Y., & Hamarneh, G. (2015). Examining visible articulatory features in clear and plain speech. Speech Communication, 75, 1–13.
Tasko, S. M., & Greilick, K. (2010). Acoustic and articulatory features of diphthong production: A speech clarity study. Journal of Speech Language and Hearing Research, 53, 84–99.
Traunmüller, H., & Öhrström, N. (2007). Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics, 35(2), 244–258.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In Proceedings of the 9th ISCA speech synthesis workshop (SSW 9) (p. 125).
Vougioukas, K., Ma, P., Petridis, S., & Pantic, M. (2019). Video-driven speech reconstruction using generative adversarial networks. arXiv preprint arXiv:1906.06301.
Wang, D., Yang, S., Su, D., Liu, X., Yu, D., & Meng, H. (2022). VCVTS: Multi-speaker video-to-speech synthesis via cross-modal knowledge transfer from voice conversion.
Watson, C. I., & Harrington, J. (1999). Acoustic evidence for dynamic formant trajectories in Australian English vowels. The Journal of the Acoustical Society of America, 106(1), 458–468.
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5934–5938). IEEE.
Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1–2), 23–43.
Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–1503.
Acknowledgements
This research has been supported by the Big Data Next Big Question Fund at Simon Fraser University (SFU) and a grant from the Social Sciences and Humanities Research Council of Canada (SSHRC Insight Grant 435-2019-1065). We thank Shubam Sachdeva, Jetic Gu, Keith Leung, Lisa Tang, and members of the Language and Brain Lab at SFU for their assistance.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Garg, S., Ruan, H., Hamarneh, G. et al. Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation. Int J Speech Technol 26, 459–474 (2023). https://doi.org/10.1007/s10772-023-10030-3