The practice of speech and language processing in China

Published: 25 October 2021

Abstract

Several companies are trying to push automatic speech recognition and other technologies past their current limitations.

Published In

Communications of the ACM, Volume 64, Issue 11
November 2021
130 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3494050
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Popular
  • Refereed
