
Deep Learning in Audio Classification

  • Conference paper
Information and Software Technologies (ICIST 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1665)


Abstract

Audio processing technology is everywhere in our daily lives: we ask our car to place a call while driving, or have Alexa turn off the lights so we don't have to get out of bed. In all of these audio-based applications and research, it is artificial intelligence (AI) and machine learning (ML) that enable a computer or smartphone to understand us through our voice [1]. Deep learning (DL), an important branch of ML, has strongly influenced many areas of AI- and ML-based research and applications. This paper focuses on deep learning architectures and applications for audio classification. We conduct a detailed review of the literature on audio-based DL and deep reinforcement learning (DRL) approaches and applications, and we discuss the limitations of, and possible future work on, audio-based DL approaches.
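Most of the CNN-based classifiers surveyed here operate on time-frequency features rather than raw waveforms (see [2] on feature extraction). As a minimal, framework-free sketch of such a front-end, not taken from the paper itself, the following computes a log-magnitude spectrogram with NumPy; the frame and hop sizes (25 ms and 10 ms at 16 kHz) are illustrative assumptions:

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude spectrogram: frames x (frame_len // 2 + 1) bins."""
    # Slice the waveform into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude spectrum per frame via the real FFT, then log compression;
    # a small epsilon avoids log(0) in silent frames.
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-8)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
feats = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

A 2-D array like `feats` is what CNN-based approaches such as [14, 47] treat as an image-like input; mel-scale filtering and per-channel normalization [57] are common refinements on top of this.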


References

  1. Latif, S., Cuayáhuitl, H., Pervez, F., Shamshad, F., Ali, H.S., Cambria, E.: A survey on deep reinforcement learning for audio-based applications. arXiv preprint arXiv:2101.00240 (2021)

  2. Sharma, G., Umapathy, K., Krishnan, S.: Trends in audio signal feature extraction methods. Appl. Acoust. 158, 107020 (2020)

  3. Nguyen, G., et al.: Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif. Intell. Rev. 52(1), 77–124 (2019)

  4. Zhang, S., Zhang, C., Yang, Q.: Data preparation for data mining. Appl. Artif. Intell. 17(5–6), 375–381 (2003)

  5. Ying, X.: An overview of overfitting and its solutions. In: Journal of Physics: Conference Series, vol. 1168, no. 2, p. 022022. IOP Publishing (2019)

  6. Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)

  7. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168 (2006)

  8. Hastie, T., Tibshirani, R., Friedman, J.: Unsupervised learning. In: Hastie, T., Tibshirani, R., Friedman, J. (eds.) The Elements of Statistical Learning. SSS, pp. 485–585. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7_14

  9. Wiering, M.A., Van Otterlo, M.: Reinforcement learning. Adapt. Learn. Optim. 12(3), 729 (2012)

  10. Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 (2016)

  11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)

  12. Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 53(8), 5455–5516 (2020). https://doi.org/10.1007/s10462-020-09825-6

  13. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176 (2017)

  14. Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1533–1545 (2014)

  15. Dong, M.: Convolutional neural network achieves human-level accuracy in music genre classification. arXiv preprint arXiv:1802.09697 (2018)

  16. Park, S.R., Lee, J.: A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132 (2016)

  17. Chen, Y., Guo, Q., Liang, X., Wang, J., Qian, Y.: Environmental sound classification with dilated convolutions. Appl. Acoust. 148, 123–132 (2019)

  18. Lipton, Z.C., Berkowitz, J., Elkan, C.: A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019 (2015)

  19. Latif, S., Qadir, J., Qayyum, A., Usama, M., Younis, S.: Speech technology for healthcare: opportunities, challenges, and state of the art. IEEE Rev. Biomed. Eng. 14, 342–356 (2020)

  20. Sherstinsky, A.: Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D 404, 132306 (2020)

  21. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  22. Sainath, T.N., Li, B.: Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks (2016)

  23. Li, J., Mohamed, A., Zweig, G., Gong, Y.: LSTM time and frequency recurrence for automatic speech recognition. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 187–191. IEEE (2015)

  24. Ghosal, D., Kolekar, M.H.: Music genre recognition using deep neural networks and transfer learning. In: Interspeech, pp. 2087–2091 (2018)

  25. Qian, Y., Bi, M., Tan, T., Yu, K.: Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(12), 2263–2276 (2016)

  26. Sun, T.-W.: End-to-end speech emotion recognition with gender information. IEEE Access 8, 152423–152438 (2020)

  27. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27 (2014)

  28. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)

  29. Raffel, C., Luong, M.-T., Liu, P.J., Weiss, R.J., Eck, D.: Online and linear-time attention by enforcing monotonic alignments. In: International Conference on Machine Learning, pp. 2837–2846. PMLR (2017)

  30. Graves, A.: Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012)

  31. Pham, N.-Q., Nguyen, T.-S., Niehues, J., Müller, M., Stüker, S., Waibel, A.: Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377 (2019)

  32. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)

  33. Shannon, M., Zen, H., Byrne, W.: Autoregressive models for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 21(3), 587–597 (2012)

  34. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

  35. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

  36. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. arXiv preprint arXiv:1811.12560 (2018)

  37. Kaiser, L., et al.: Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374 (2019)

  38. Farquhar, G., Rocktäschel, T., Igl, M., Whiteson, S.: TreeQN and ATreeC: differentiable tree-structured models for deep reinforcement learning (2018)

  39. Kala, T., Shinozaki, T.: Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5759–5763. IEEE (2018)

  40. Tjandra, A., Sakti, S., Nakamura, S.: Sequence-to-sequence ASR optimization via reinforcement learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5829–5833. IEEE (2018)

  41. Chung, H., Jeon, H.-B., Park, J.G.: Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2020)

  42. Fakoor, R., He, X., Tashev, I., Zarar, S.: Reinforcement learning to adapt speech enhancement to instantaneous input signal quality. arXiv preprint arXiv:1711.10791 (2017)

  43. Alamdari, N., Lobarinas, E., Kehtarnavaz, N.: Personalization of hearing aid compression by human-in-the-loop deep reinforcement learning. IEEE Access 8, 203503–203515 (2020)

  44. Kotecha, N.: Bach2Bach: generating music using a deep reinforcement learning approach. arXiv preprint arXiv:1812.01060 (2018)

  45. Jaques, N., Gu, S., Turner, R.E., Eck, D.: Generating music by fine-tuning recurrent neural networks with reinforcement learning (2016)

  46. Xie, J., Zhu, M.: Handcrafted features and late fusion with deep learning for bird sound classification. Ecol. Inform. 52, 74–81 (2019)

  47. Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)

  48. Nam, J., Choi, K., Lee, J., Chou, S.-Y., Yang, Y.-H.: Deep learning for audio-based music classification and tagging: teaching computers to distinguish rock from Bach. IEEE Signal Process. Mag. 36(1), 41–51 (2018)

  49. Konda, V., Tsitsiklis, J.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, vol. 12 (1999)

  50. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937. PMLR (2016)

  51. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  52. Seno, T.: Welcome to deep reinforcement learning part 1: DQN (2017). https://towardsdatascience.com/welcome-to-deep-reinforcement-learning-part-1-dqn-c3cab4d41b6b

  53. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double q-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1 (2016)

  54. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015)

  55. Abeßer, J.: A review of deep learning based methods for acoustic scene classification. Appl. Sci. 10(6) (2020)

  56. Seo, H., Park, J., Park, Y.: Acoustic scene classification using various pre-processed features and convolutional neural networks. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), New York, NY, USA, pp. 25–26 (2019)

  57. Lostanlen, V., et al.: Per-channel energy normalization: why and how. IEEE Signal Process. Lett. 26(1), 39–43 (2018)

  58. Wu, Y., Lee, T.: Enhancing sound texture in CNN-based acoustic scene classification. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP 2019, pp. 815–819. IEEE (2019)

  59. Mariotti, O., Cord, M., Schwander, O.: Exploring deep vision models for acoustic scene classification. In: Proceedings of the DCASE, pp. 103–107 (2018)

  60. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

  61. Gemmeke, J.F., et al.: Audio set: an ontology and human-labeled dataset for audio events. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE (2017)

  62. Koutini, K., Eghbal-zadeh, H., Widmer, G.: Receptive-field-regularized CNN variants for acoustic scene classification. arXiv preprint arXiv:1909.02859 (2019)

  63. Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019)

  64. Lasseck, M.: Acoustic bird detection with deep convolutional neural networks. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), pp. 143–147 (2018)

  65. Li, J., Deng, L., Haeb-Umbach, R., Gong, Y.: Robust automatic speech recognition: a bridge to practical applications (2015)

Author information

Correspondence to Yaqin Wang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Y., Wei-Kocsis, J., Springer, J.A., Matson, E.T. (2022). Deep Learning in Audio Classification. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2022. Communications in Computer and Information Science, vol 1665. Springer, Cham. https://doi.org/10.1007/978-3-031-16302-9_5

  • DOI: https://doi.org/10.1007/978-3-031-16302-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16301-2

  • Online ISBN: 978-3-031-16302-9

  • eBook Packages: Computer Science, Computer Science (R0)
