Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation

International Journal of Speech Technology

Abstract

Humans use both auditory and facial cues to perceive speech, especially when the auditory input is degraded, indicating a direct association between visual articulatory and acoustic speech information. This study investigates how well the audio signal of a word can be synthesized from visual speech cues. Specifically, we synthesized audio waveforms of the vowels in monosyllabic English words from motion trajectories extracted from the image sequences of video recordings of the same words. The articulatory movements were recorded in two speech styles: plain and clear. We designed a deep network trained on mouth landmark motion trajectories with a spectrogram- and formant-based custom loss, trained separately for each speech style. Human and automatic evaluations show that our framework can generate identifiable audio of the target vowels from distinct mouth landmark movements using visual cues alone. Our results further demonstrate that intelligible audio can be synthesized for novel, unseen talkers who were not part of the training data.
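The following is a minimal, illustrative sketch in PyTorch of the kind of mapping described above, not the authors’ implementation: a recurrent network that converts mouth landmark motion trajectories into spectrogram frames, trained with a combined spectrogram and formant-region loss. The class and function names, layer sizes, number of landmarks, and the use of low-frequency bins as a stand-in for the formant term are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkToSpectrogram(nn.Module):
    """Maps per-frame mouth landmark (x, y) trajectories to spectrogram frames."""
    def __init__(self, n_landmarks=20, n_freq_bins=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=2 * n_landmarks, hidden_size=hidden,
                          num_layers=2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq_bins)

    def forward(self, landmarks):               # landmarks: (batch, time, 2 * n_landmarks)
        h, _ = self.rnn(landmarks)
        return self.proj(h)                     # (batch, time, n_freq_bins)

def spectrogram_formant_loss(pred, target, formant_bins=slice(0, 32), alpha=0.5):
    # L1 loss over the full spectrogram plus an extra penalty on the low-frequency
    # bins where F1/F2 energy concentrates (a hypothetical stand-in for a formant loss).
    full = F.l1_loss(pred, target)
    formant = F.l1_loss(pred[..., formant_bins], target[..., formant_bins])
    return full + alpha * formant

In such a setup, one model could be trained per speech style (plain and clear), and a predicted spectrogram could then be inverted to a waveform with a phase-reconstruction method such as Griffin–Lim (Griffin & Lim, 1984).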


Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request. The code used during the current study is available in the repository: https://github.com/srbhgarg/VowSynth.git.

Notes

  1. https://github.com/srbhgarg/VowSynth.git.

  2. To add talker-related information to the output speech, we could condition the network by feeding the speaker ID to the model as a one-hot encoding. At test time, when a particular speaker’s voice is to be generated, we could provide the corresponding speaker’s one-hot encoding vector. This would be similar to what is already done in WaveNet (Oord et al., 2016); a minimal sketch of this idea appears below.
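The following is a minimal sketch, assuming a PyTorch pipeline, of the speaker conditioning described in Note 2; it is illustrative only and not the authors’ code. The function name, tensor shapes, and feature dimensions are hypothetical: the one-hot speaker vector is simply repeated over time and concatenated with the visual features.

import torch
import torch.nn.functional as F

def condition_on_speaker(visual_feats, speaker_id, n_speakers):
    # visual_feats: (batch, time, feat_dim); speaker_id: (batch,) integer tensor
    one_hot = F.one_hot(speaker_id, num_classes=n_speakers).float()      # (batch, n_speakers)
    one_hot = one_hot.unsqueeze(1).expand(-1, visual_feats.size(1), -1)  # repeat across time steps
    return torch.cat([visual_feats, one_hot], dim=-1)                    # (batch, time, feat_dim + n_speakers)

At inference time, supplying the one-hot vector of the desired speaker would steer the synthesized voice toward that speaker, analogous to the global conditioning used in WaveNet.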

References

  • Akbari, H., Himani A., Cao, L. & Mesgarani, N. (2018). Lip2Audspec: Speech reconstruction from silent lip movements video. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), (pp. 2516–2520). https://doi.org/10.1109/icassp.2018.8461856.

  • Anumanchipalli, G. K., Chartier, J., & Chang, E. F. (2018). Intelligible speech synthesis from neural decoding of spoken sentences. BioRxiv. https://doi.org/10.1101/481267

  • Assael, Y. M., Shillingford, B., Whiteson, S., & De Freitas, N. (2016). Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599.

  • Bernstein, L. E., Auer, E. T., Jr., & Takayanagi, S. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44(1–4), 5–18.

  • Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9), 341–345.

  • Bond, Z. S., & Moore, T. J. (1994). A note on the acoustic-phonetic characteristics of inadvertently clear speech. Speech Communication, 14, 325–337. https://doi.org/10.1016/0167-6393(94)90026-4

  • Bradlow, A. R., Torretta, G. M., & Pisoni, D. B. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20(3–4), 255–272.

  • Burris, C., Vorperian, H. K., Fourakis, M., Kent, R. D., & Bolt, D. M. (2014). Quantitative and descriptive comparison of four acoustic analysis systems: Vowel measurements. Journal of Speech, Language, and Hearing Research, 57(1), 26–45.

  • Chen, L., Su, H., & Ji, Q. (2019). Deep structured prediction for facial landmark detection. Advances in Neural Information Processing Systems, 32, 158.

  • Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421–2424.

  • Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America, 116(6), 3668–3678.

  • Dubnov, S. (2004). Generalization of spectral flatness measure for non-gaussian linear processes. IEEE Signal Processing Letters, 11(8), 698–701.

  • Ephrat, A., & Peleg, S. (2017). Vid2speech: speech reconstruction from silent video. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5095–5099). IEEE.

  • Feng, S., Kudina, O., Halpern, B. M., & Scharenborg, O. (2021). Quantifying bias in automatic speech recognition. arXiv preprint arXiv:2103.15122.

  • Ferguson, S. H. (2012). Talker differences in clear and conversational speech: Vowel intelligibility for older adults with hearing loss. Journal of Speech Language and Hearing Research, 55(3), 779–790.

  • Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 112, 259–271.

  • Ferguson, S. H., & Kewley-Port, D. (2007). Talker differences in clear and conversational speech: Acoustic characteristics of vowels. Journal of Speech Language and Hearing Research, 50, 1241–1255.

  • Ferguson, S. H., & Quené, H. (2014). Acoustic correlates of vowel intelligibility in clear and conversational speech for young normal-hearing and elderly hearing-impaired listeners. The Journal of the Acoustical Society of America, 135(6), 3570–3584.

  • Freitas, J., Teixeira, A., Dias, M. S., & Silva, S. (2017). An introduction to silent speech interfaces. Springer.

  • Gagné, J. P., Rochette, A. J., & Charest, M. (2002). Auditory, visual and audiovisual clear speech. Speech Communication, 37(3–4), 213–230.

  • Garg, S., Tang, L., Hamarneh, G., Jongman, A., Sereno, J. A., & Wang, Y. (2019). Computer-vision analysis shows different facial movements for the production of different Mandarin tones. The Journal of the Acoustical Society of America, 144(3), 1720–1720.

  • Gonzalez-Lopez, J. A., Gomez-Alanis, A., Doñas, J. M. M., Pérez-Córdoba, J. L., & Gomez, A. M. (2020). Silent speech interfaces for speech restoration: A review. IEEE Access, 8, 177995–178021.

  • Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236–243.

  • Harte, C., Sandler, M., & Gasser, M. (2006). Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on audio and music computing multimedia (pp. 21–26).

  • Heald, S., & Nusbaum, H. C. (2014). Speech perception as an active cognitive process. Frontiers in Systems Neuroscience, 8, 35.

  • Herff, C., Heger, D., De Pesters, A., Telaar, D., Brunner, P., Schalk, G., & Schultz, T. (2015). Brain-to-text: Decoding spoken phrases from phone representations in the brain. Frontiers in Neuroscience, 9, 217.

  • Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099–3111.

  • Hueber, T., Benaroya, E. L., Chollet, G., Denby, B., Dreyfus, G., & Stone, M. (2010). Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips. Speech Communication, 52(4), 288–300.

  • Jongman, A., Wang, Y., & Kim, B. H. (2003). Contributions of semantic and facial information to perception of nonsibilant fricatives. Journal of Speech Language and Hearing Research, 46, 1367–1377.

  • Kawase, T., Hori, Y., Ogawa, T., Sakamoto, S., Suzuki, Y., & Katori, Y. (2015). Importance of visual cues in hearing restoration by auditory prosthesis. In Interface Oral Health Science 2014 (pp. 119–127). Springer.

  • Kim, J., & Davis, C. (2014). Comparing the consistency and distinctiveness of speech produced in quiet and in noise. Computer Speech & Language, 28(2), 598–606.

  • King, D. E. (2009). Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10, 1755–1758.

  • Laitinen, M. V., Disch, S., & Pulkki, V. (2013). Sensitivity of human hearing to changes in phase spectrum. Journal of the Audio Engineering Society, 61(11), 860–877.

  • Lam, J., Tjaden, K., & Wilding, G. (2012). Acoustics of clear speech: Effect of instruction. Journal of Speech Language and Hearing Research, 55(6), 1807–1821. https://doi.org/10.1044/1092-4388(2012/11-0154)

  • Le Cornu, T., & Milner, B. (2015). Reconstructing intelligible audio speech from visual speech features. In Interspeech (pp. 3355–3359).

  • Leung, K. K., Redmon, C., Wang, Y., Jongman, A., & Sereno, J. (2016). Cross-linguistic perception of clearly spoken English tense and lax vowels based on auditory, visual, and auditory-visual information. The Journal of the Acoustical Society of America, 140(4), 3335–3335.

  • Lu, Y., & Cooke, M. (2008). Speech production modifications produced by competing talkers, babble, and stationary noise. The Journal of the Acoustical Society of America, 124(5), 3261–3275.

  • Maniwa, K., Jongman, A., & Wade, T. (2008). Perception of clear fricatives by normal-hearing and simulated hearing-impaired listeners. The Journal of the Acoustical Society of America, 123, 1114–1125.

  • Maniwa, K., Jongman, A., & Wade, T. (2009). Acoustic characteristics of clearly spoken English fricatives. The Journal of the Acoustical Society of America, 125(6), 3962–3973.

  • Mira, R., Vougioukas, K., Ma, P., Petridis, S., Schuller, B. W., & Pantic, M. (2022). End-to-end video-to-speech synthesis using generative adversarial networks. IEEE Transactions on Cybernetics. arXiv:2104.13332 [cs.LG]

  • Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15(2), 133–137.

  • Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184.

  • Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of hearing II. Journal of Speech Language and Hearing Research, 29(4), 434–446. https://doi.org/10.1044/jshr.2904.434

  • Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (2020). Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13796–13805).

  • Redmon, C., Leung, K., Wang, Y., McMurray, B., Jongman, A., & Sereno, J. A. (2020). Cross-linguistic perception of clearly spoken English tense and lax vowels based on auditory, visual, and auditory-visual information. Journal of Phonetics, 81, 100980.

  • Roesler, L. (2013). Acoustic characteristics of tense and lax vowels across sentence position in clear speech. Unpublished Master’s thesis, University of Wisconsin-Milwaukee.

  • Saleem, N., Gao, J., Irfan, M., Verdu, E., & Fuente, J. P. (2022). E2E–V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis. Image and Vision Computing, 119, 104389.

  • Savitzky, A., & Golay, M. J. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639.

  • Schultz, T., & Wand, M. (2010). Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication, 52(4), 341–353.

  • Smiljanić, R., & Bradlow, A. R. (2009). Speaking and hearing clearly: Talker and listener factors in speaking style changes. Language and Linguistics Compass, 3(1), 236–264.

  • Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2), 212–215.

  • Tang, L. Y., Hannah, B., Jongman, A., Sereno, J., Wang, Y., & Hamarneh, G. (2015). Examining visible articulatory features in clear and plain speech. Speech Communication, 75, 1–13.

  • Tasko, S. M., & Greilick, K. (2010). Acoustic and articulatory features of diphthong production: A speech clarity study. Journal of Speech Language and Hearing Research, 53, 84–99.

  • Traunmüller, H., & Öhrström, N. (2007). Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics, 35(2), 244–258.

  • van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In Proceedings of the 9th ISCA Speech Synthesis Workshop (SSW 9), 125.

  • Vougioukas, K., Ma, P., Petridis, S., & Pantic, M. (2019). Video-driven speech reconstruction using generative adversarial networks. arXiv preprint arXiv:1906.06301.

  • Wang, D., Yang, S., Su, D., Liu, X., Yu, D., & Meng, H. (2022). VCVTS: Multi-speaker video-to-speech synthesis via cross-modal knowledge transfer from voice conversion.

  • Watson, C. I., & Harrington, J. (1999). Acoustic evidence for dynamic formant trajectories in Australian English vowels. The Journal of the Acoustical Society of America, 106(1), 458–468.

  • Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5934–5938). IEEE.

  • Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1–2), 23–43.

  • Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–1503.

Acknowledgements

This research has been supported by the Big Data Next Big Question Fund at Simon Fraser University (SFU) and a grant from the Social Sciences and Humanities Research Council of Canada (SSHRC Insight Grant 435-2019-1065). We thank Shubam Sachdeva, Jetic Gu, Keith Leung, Lisa Tang, and members of the Language and Brain Lab at SFU for their assistance.

Author information

Corresponding author

Correspondence to Saurabh Garg.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Garg, S., Ruan, H., Hamarneh, G. et al. Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation. Int J Speech Technol 26, 459–474 (2023). https://doi.org/10.1007/s10772-023-10030-3
