Abstract
Audio processing technology is pervasive in everyday life. We ask our car to place a call for us while driving, or we let Alexa turn off the lights so we do not have to get out of bed. In all of these audio-based applications, it is artificial intelligence (AI), and in particular machine learning (ML), that enables a computer or smartphone to understand us through our voice [1]. Deep learning (DL), an important branch of ML, has strongly influenced many areas of AI research and applications. This paper focuses on deep learning structures and applications for audio classification. We conduct a detailed review of the literature on audio-based DL and deep reinforcement learning (DRL) approaches and applications, and we discuss the limitations of and possible future directions for audio-based DL approaches.
References
Latif, S., Cuayáhuitl, H., Pervez, F., Shamshad, F., Ali, H.S., Cambria, E.: A survey on deep reinforcement learning for audio-based applications. arXiv preprint arXiv:2101.00240 (2021)
Sharma, G., Umapathy, K., Krishnan, S.: Trends in audio signal feature extraction methods. Appl. Acoust. 158, 107020 (2020)
Nguyen, G., et al.: Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif. Intell. Rev. 52(1), 77–124 (2019)
Zhang, S., Zhang, C., Yang, Q.: Data preparation for data mining. Appl. Artif. Intell. 17(5–6), 375–381 (2003)
Ying, X.: An overview of overfitting and its solutions. In: Journal of Physics: Conference Series, vol. 1168, no. 2, p. 022022. IOP Publishing (2019)
Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning, vol. 4, no. 4. Springer, Cham (2006)
Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168 (2006)
Hastie, T., Tibshirani, R., Friedman, J.: Unsupervised learning. In: Hastie, T., Tibshirani, R., Friedman, J. (eds.) The Elements of Statistical Learning. SSS, pp. 485–585. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7_14
Wiering, M.A., Van Otterlo, M.: Reinforcement learning. Adapt. Learn. Optim. 12(3), 729 (2012)
Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 (2016)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 53(8), 5455–5516 (2020). https://doi.org/10.1007/s10462-020-09825-6
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176 (2017)
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014)
Dong, M.: Convolutional neural network achieves human-level accuracy in music genre classification. arXiv preprint arXiv:1802.09697 (2018)
Park, S.R., Lee, J.: A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132 (2016)
Chen, Y., Guo, Q., Liang, X., Wang, J., Qian, Y.: Environmental sound classification with dilated convolutions. Appl. Acoust. 148, 123–132 (2019)
Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015)
Latif, S., Qadir, J., Qayyum, A., Usama, M., Younis, S.: Speech technology for healthcare: opportunities, challenges, and state of the art. IEEE Rev. Biomed. Eng. 14, 342–356 (2020)
Sherstinsky, A.: Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D 404, 132306 (2020)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Sainath, T.N., Li, B.: Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In: Interspeech (2016)
Li, J., Mohamed, A., Zweig, G., Gong, Y.: LSTM time and frequency recurrence for automatic speech recognition. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 187–191. IEEE (2015)
Ghosal, D., Kolekar, M.H.: Music genre recognition using deep neural networks and transfer learning. In: Interspeech, pp. 2087–2091 (2018)
Qian, Y., Bi, M., Tan, T., Yu, K.: Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(12), 2263–2276 (2016)
Sun, T.-W.: End-to-end speech emotion recognition with gender information. IEEE Access 8, 152423–152438 (2020)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
Raffel, C., Luong, M.-T., Liu, P.J., Weiss, R.J., Eck, D.: Online and linear-time attention by enforcing monotonic alignments. In: International Conference on Machine Learning, pp. 2837–2846. PMLR (2017)
Graves, A.: Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012)
Pham, N.-Q., Nguyen, T.-S., Niehues, J., Müller, M., Stüker, S., Waibel, A.: Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377 (2019)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Shannon, M., Zen, H., Byrne, W.: Autoregressive models for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 21(3), 587–597 (2012)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. arXiv preprint arXiv:1811.12560 (2018)
Kaiser, L., et al.: Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374 (2019)
Farquhar, G., Rocktäschel, T., Igl, M., Whiteson, S.: TreeQN and ATreeC: differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417 (2018)
Kala, T., Shinozaki, T.: Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5759–5763. IEEE (2018)
Tjandra, A., Sakti, S., Nakamura, S.: Sequence-to-sequence ASR optimization via reinforcement learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5829–5833. IEEE (2018)
Chung, H., Jeon, H.-B., Park, J.G.: Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2020)
Fakoor, R., He, X., Tashev, I., Zarar, S.: Reinforcement learning to adapt speech enhancement to instantaneous input signal quality. arXiv preprint arXiv:1711.10791 (2017)
Alamdari, N., Lobarinas, E., Kehtarnavaz, N.: Personalization of hearing aid compression by human-in-the-loop deep reinforcement learning. IEEE Access 8, 203503–203515 (2020)
Kotecha, N.: Bach2Bach: generating music using a deep reinforcement learning approach. arXiv preprint arXiv:1812.01060 (2018)
Jaques, N., Gu, S., Turner, R.E., Eck, D.: Generating music by fine-tuning recurrent neural networks with reinforcement learning (2016)
Xie, J., Zhu, M.: Handcrafted features and late fusion with deep learning for bird sound classification. Eco. Inform. 52, 74–81 (2019)
Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)
Nam, J., Choi, K., Lee, J., Chou, S.-Y., Yang, Y.-H.: Deep learning for audio-based music classification and tagging: teaching computers to distinguish rock from Bach. IEEE Signal Process. Mag. 36(1), 41–51 (2018)
Konda, V., Tsitsiklis, J.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, vol. 12 (1999)
Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937. PMLR (2016)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Seno, T.: Welcome to deep reinforcement learning part 1: DQN (2017). https://towardsdatascience.com/welcome-to-deep-reinforcement-learning-part-1-dqn-c3cab4d41b6b
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1 (2016)
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)
Abeßer, J.: A review of deep learning based methods for acoustic scene classification. Appl. Sci. 10(6) (2020)
Seo, H., Park, J., Park, Y.: Acoustic scene classification using various pre-processed features and convolutional neural networks. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), New York, NY, USA, pp. 25–26 (2019)
Lostanlen, V., et al.: Per-channel energy normalization: why and how. IEEE Signal Process. Lett. 26(1), 39–43 (2018)
Wu, Y., Lee, T.: Enhancing sound texture in CNN-based acoustic scene classification. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), pp. 815–819. IEEE (2019)
Mariotti, O., Cord, M., Schwander, O.: Exploring deep vision models for acoustic scene classification. In: Proceedings of the DCASE, pp. 103–107 (2018)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE (2017)
Koutini, K., Eghbal-zadeh, H., Widmer, G.: Receptive-field-regularized CNN variants for acoustic scene classification. arXiv preprint arXiv:1909.02859 (2019)
Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)
Lasseck, M.: Acoustic bird detection with deep convolutional neural networks. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 143–147 (2018)
Li, J., Deng, L., Haeb-Umbach, R., Gong, Y.: Robust automatic speech recognition: a bridge to practical applications (2015)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, Y., Wei-Kocsis, J., Springer, J.A., Matson, E.T. (2022). Deep Learning in Audio Classification. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2022. Communications in Computer and Information Science, vol 1665. Springer, Cham. https://doi.org/10.1007/978-3-031-16302-9_5
Print ISBN: 978-3-031-16301-2
Online ISBN: 978-3-031-16302-9