Abstract
Audio processing technology is pervasive in everyday life. We ask our car to place a call for us while driving, or we let Alexa turn off the lights so we do not have to get out of bed. In all of these audio-based applications, it is artificial intelligence (AI), and in particular machine learning (ML), that enables a computer or smartphone to understand us through our voice [1]. Deep learning (DL), an important branch of ML, has strongly influenced many areas of AI research and applications. This paper focuses on deep learning structures and applications for audio classification. We conduct a detailed review of the literature on audio-based DL and deep reinforcement learning (DRL) approaches and applications, and we discuss the limitations of and possible future directions for audio-based DL approaches.
References
Latif, S., Cuayáhuitl, H., Pervez, F., Shamshad, F., Ali, H.S., Cambria, E.: A survey on deep reinforcement learning for audio-based applications. arXiv preprint arXiv:2101.00240 (2021)
Sharma, G., Umapathy, K., Krishnan, S.: Trends in audio signal feature extraction methods. Appl. Acoust. 158, 107020 (2020)
Nguyen, G., et al.: Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif. Intell. Rev. 52(1), 77–124 (2019)
Zhang, S., Zhang, C., Yang, Q.: Data preparation for data mining. Appl. Artif. Intell. 17(5–6), 375–381 (2003)
Ying, X.: An overview of overfitting and its solutions. In: Journal of Physics: Conference Series, vol. 1168, no. 2, p. 022022. IOP Publishing (2019)
Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning, vol. 4, no. 4. Springer, Cham (2006)
Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168 (2006)
Hastie, T., Tibshirani, R., Friedman, J.: Unsupervised learning. In: Hastie, T., Tibshirani, R., Friedman, J. (eds.) The Elements of Statistical Learning. SSS, pp. 485–585. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7_14
Wiering, M.A., Van Otterlo, M.: Reinforcement learning. Adapt. Learn. Optim. 12(3), 729 (2012)
Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 (2016)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 53(8), 5455–5516 (2020). https://doi.org/10.1007/s10462-020-09825-6
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176 (2017)
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014)
Dong, M.: Convolutional neural network achieves human-level accuracy in music genre classification. arXiv preprint arXiv:1802.09697 (2018)
Park, S.R., Lee, J.: A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132 (2016)
Chen, Y., Guo, Q., Liang, X., Wang, J., Qian, Y.: Environmental sound classification with dilated convolutions. Appl. Acoust. 148, 123–132 (2019)
Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015)
Latif, S., Qadir, J., Qayyum, A., Usama, M., Younis, S.: Speech technology for healthcare: opportunities, challenges, and state of the art. IEEE Rev. Biomed. Eng. 14, 342–356 (2020)
Sherstinsky, A.: Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D 404, 132306 (2020)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Sainath, T.N., Li, B.: Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In: Interspeech (2016)
Li, J., Mohamed, A., Zweig, G., Gong, Y.: LSTM time and frequency recurrence for automatic speech recognition. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 187–191. IEEE (2015)
Ghosal, D., Kolekar, M.H.: Music genre recognition using deep neural networks and transfer learning. In: Interspeech, pp. 2087–2091 (2018)
Qian, Y., Bi, M., Tan, T., Yu, K.: Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(12), 2263–2276 (2016)
Sun, T.-W.: End-to-end speech emotion recognition with gender information. IEEE Access 8, 152423–152438 (2020)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
Raffel, C., Luong, M.-T., Liu, P.J., Weiss, R.J., Eck, D.: Online and linear-time attention by enforcing monotonic alignments. In: International Conference on Machine Learning, pp. 2837–2846. PMLR (2017)
Graves, A.: Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012)
Pham, N.-Q., Nguyen, T.-S., Niehues, J., Müller, M., Stüker, S., Waibel, A.: Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377 (2019)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Shannon, M., Zen, H., Byrne, W.: Autoregressive models for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 21(3), 587–597 (2012)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. arXiv preprint arXiv:1811.12560 (2018)
Kaiser, L., et al.: Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374 (2019)
Farquhar, G., Rocktäschel, T., Igl, M., Whiteson, S.: TreeQN and ATreeC: differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417 (2018)
Kala, T., Shinozaki, T.: Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5759–5763. IEEE (2018)
Tjandra, A., Sakti, S., Nakamura, S.: Sequence-to-sequence ASR optimization via reinforcement learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5829–5833. IEEE (2018)
Chung, H., Jeon, H.-B., Park, J.G.: Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2020)
Fakoor, R., He, X., Tashev, I., Zarar, S.: Reinforcement learning to adapt speech enhancement to instantaneous input signal quality. arXiv preprint arXiv:1711.10791 (2017)
Alamdari, N., Lobarinas, E., Kehtarnavaz, N.: Personalization of hearing aid compression by human-in-the-loop deep reinforcement learning. IEEE Access 8, 203503–203515 (2020)
Kotecha, N.: Bach2Bach: generating music using a deep reinforcement learning approach. arXiv preprint arXiv:1812.01060 (2018)
Jaques, N., Gu, S., Turner, R.E., Eck, D.: Generating music by fine-tuning recurrent neural networks with reinforcement learning (2016)
Xie, J., Zhu, M.: Handcrafted features and late fusion with deep learning for bird sound classification. Eco. Inform. 52, 74–81 (2019)
Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)
Nam, J., Choi, K., Lee, J., Chou, S.-Y., Yang, Y.-H.: Deep learning for audio-based music classification and tagging: teaching computers to distinguish rock from Bach. IEEE Signal Process. Mag. 36(1), 41–51 (2018)
Konda, V., Tsitsiklis, J.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, vol. 12 (1999)
Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937. PMLR (2016)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Seno, T.: Welcome to deep reinforcement learning part 1: DQN (2017). https://towardsdatascience.com/welcome-to-deep-reinforcement-learning-part-1-dqn-c3cab4d41b6b
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1 (2016)
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)
Abeßer, J.: A review of deep learning based methods for acoustic scene classification. Appl. Sci. 10(6) (2020)
Seo, H., Park, J., Park, Y.: Acoustic scene classification using various pre-processed features and convolutional neural networks. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), New York, NY, USA, pp. 25–26 (2019)
Lostanlen, V., et al.: Per-channel energy normalization: why and how. IEEE Signal Process. Lett. 26(1), 39–43 (2018)
Wu, Y., Lee, T.: Enhancing sound texture in CNN-based acoustic scene classification. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), pp. 815–819. IEEE (2019)
Mariotti, O., Cord, M., Schwander, O.: Exploring deep vision models for acoustic scene classification. In: Proceedings of the DCASE, pp. 103–107 (2018)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE (2017)
Koutini, K., Eghbal-zadeh, H., Widmer, G.: Receptive-field-regularized CNN variants for acoustic scene classification. arXiv preprint arXiv:1909.02859 (2019)
Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)
Lasseck, M.: Acoustic bird detection with deep convolutional neural networks. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 143–147 (2018)
Li, J., Deng, L., Haeb-Umbach, R., Gong, Y.: Robust automatic speech recognition: a bridge to practical applications (2015)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, Y., Wei-Kocsis, J., Springer, J.A., Matson, E.T. (2022). Deep Learning in Audio Classification. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2022. Communications in Computer and Information Science, vol 1665. Springer, Cham. https://doi.org/10.1007/978-3-031-16302-9_5
Print ISBN: 978-3-031-16301-2
Online ISBN: 978-3-031-16302-9