Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system

Kadyan, Virender; Bala, Shashi; Bawa, Puneet

doi:10.1007/s10772-021-09797-0

Published: 02 February 2021

Volume 24, pages 473–481, (2021)

International Journal of Speech Technology

Virender Kadyan¹,
Shashi Bala² &
Puneet Bawa²

Abstract

Pre- and post-processing of low-resource acoustic signals has always faced data scarcity in the training module. It is difficult to obtain high system accuracy with a limited training corpus, which results in the extraction of large discriminative feature vectors. The information in these vectors is distorted by the acoustic mismatch that arises from real environments and inter-speaker variation. In this paper, the context-independent information of an input speech signal is pre-processed using bottleneck features, and a Tandem-NN model is then employed in the modelling phase to improve system accuracy. To address the shortage of training data, in-domain training augmentation is performed by fusing the original clean training data with artificially created noisy versions of it. To boost the training data further, the tempo of the input speech signal is modified while preserving its spectral envelope and pitch. Experimental results show that the Tandem-NN system achieves a relative improvement of 13.53% in clean conditions and 32.43% in noisy conditions over the baseline system.
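
The abstract describes pooling clean training data with artificially created noisy copies but does not give the procedure. The sketch below illustrates one common way to do this: additive noise mixing at a chosen signal-to-noise ratio (SNR). The function names `mix_at_snr` and `augment_train_set`, the SNR value, and the use of NumPy arrays are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` so that the result has the requested SNR in dB.

    Hypothetical helper: the paper does not specify how its noisy data
    was created; this is plain additive mixing.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent; cannot set an SNR")

    # Choose the gain so that clean_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def augment_train_set(utterances, noise, snr_db):
    """Return the original clean utterances followed by their noisy copies."""
    noisy = [mix_at_snr(u, noise, snr_db) for u in utterances]
    return list(utterances) + noisy


# Usage with synthetic signals (assumed 16 kHz sampling rate, 10 dB SNR).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean_utts = [np.sin(2 * np.pi * 220.0 * t), np.sin(2 * np.pi * 440.0 * t)]
noise_clip = rng.normal(size=4000)
train_set = augment_train_set(clean_utts, noise_clip, snr_db=10.0)
```

Pooling the clean and noisy copies doubles the number of training utterances in this sketch. Tempo modification that preserves pitch and spectral envelope needs a time-scale modification algorithm and is not covered here.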

Author information

Authors and Affiliations

Department of Informatics, School of Computer Science, University of Petroleum & Energy Studies (UPES), Bidholi, Dehradun, 248007, India
Virender Kadyan
Centre of Excellence for Speech and Multimodal Laboratory, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
Shashi Bala & Puneet Bawa

Corresponding author

Correspondence to Virender Kadyan.

About this article

Cite this article

Kadyan, V., Bala, S. & Bawa, P. Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system. Int J Speech Technol 24, 473–481 (2021). https://doi.org/10.1007/s10772-021-09797-0

Received: 05 February 2020
Accepted: 02 January 2021
Published: 02 February 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s10772-021-09797-0
