
Sound event detection in real-life audio using joint spectral and temporal features

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Typical methods for overlapping sound event detection (SED) do not fully consider the joint spectral and temporal transition characteristics of the audio signal. They are generally based on models trained either on separate data from each event class or on mixed signals containing simultaneous sound events. This paper introduces a new approach for SED in real-life audio using Nonnegative Matrix Factor 2-D Deconvolution (NMF2D) and the RUSBoost technique. The idea is to capture the two-dimensional joint spectral and temporal information from the time-frequency representation while also separating the sound mixture into several sources where possible. In addition, the RUSBoost technique is used to address the class imbalance in the training data. The proposed approach is evaluated on the TUT Sound Event 2016 and 2017 datasets and outperforms the baseline methods: on the TUT Sound Event 2016 dataset it reduces the total error rate by 5% and increases the F1 score by 13.8%, and on the TUT Sound Event 2017 dataset it reduces the total error rate by 3% and increases the F1 score by 8.1%.
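The abstract names NMF2D as the front end that captures joint spectral and temporal structure, but this page gives no algorithmic detail. For orientation only, the following is a minimal NumPy sketch of NMF2D with Euclidean-cost multiplicative updates: a magnitude spectrogram V (frequency x time) is approximated as a sum of K components whose spectral templates are shifted in time (tau) and whose activations are shifted in frequency (phi). This is an illustrative reconstruction of the general technique, not the authors' implementation; all names and parameter choices (shift, nmf2d, K, T, P) are ours.

import numpy as np

def shift(X, s, axis):
    """Shift X by s positions along axis (positive = toward higher indices), zero-filling."""
    if s == 0:
        return X.copy()
    Y = np.zeros_like(X)
    src = [slice(None)] * X.ndim
    dst = [slice(None)] * X.ndim
    if s > 0:
        dst[axis], src[axis] = slice(s, None), slice(None, -s)
    else:
        dst[axis], src[axis] = slice(None, s), slice(-s, None)
    Y[tuple(dst)] = X[tuple(src)]
    return Y

def nmf2d(V, K=4, T=8, P=4, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative spectrogram V (F x N) into K components with T time-shifted
    spectral templates W[tau] (F x K) and P frequency-shifted activations H[phi] (K x N),
    so that V ~ sum over tau, phi of shift_down(W[tau], phi) @ shift_right(H[phi], tau)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((T, F, K)) + eps
    H = rng.random((P, K, N)) + eps

    def approx():
        L = np.zeros((F, N))
        for tau in range(T):
            for phi in range(P):
                L += shift(W[tau], phi, 0) @ shift(H[phi], tau, 1)
        return L

    for _ in range(n_iter):
        L = approx()
        for tau in range(T):  # multiplicative update of each time-shifted template
            num, den = np.zeros((F, K)), np.zeros((F, K))
            for phi in range(P):
                Ht = shift(H[phi], tau, 1).T   # (N x K), time-shifted activations
                num += shift(V, -phi, 0) @ Ht  # rows shifted up by phi
                den += shift(L, -phi, 0) @ Ht
            W[tau] *= num / (den + eps)
        L = approx()
        for phi in range(P):  # multiplicative update of each frequency-shifted activation
            num, den = np.zeros((K, N)), np.zeros((K, N))
            for tau in range(T):
                Wt = shift(W[tau], phi, 0).T   # (K x F), frequency-shifted templates
                num += Wt @ shift(V, -tau, 1)  # columns shifted left by tau
                den += Wt @ shift(L, -tau, 1)
            H[phi] *= num / (den + eps)
    return W, H

The time shift lets each component model a short spectro-temporal patch rather than a single spectral frame, and the frequency shift lets one template account for pitch-shifted occurrences of the same event, which is the "joint spectral and temporal" property the abstract emphasizes.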
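RUSBoost, the second named ingredient, combines random undersampling of the majority class with AdaBoost so that rare event classes are not swamped during training. A minimal usage sketch, assuming the imbalanced-learn package's RUSBoostClassifier as a stand-in for the authors' implementation, with synthetic features simulating a rare event class:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.ensemble import RUSBoostClassifier

# Synthetic stand-in for frame-level audio features with a rare event class (~5% positives)
X, y = make_classification(n_samples=5000, n_features=40,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# RUSBoost: random undersampling of the majority class inside each boosting round
clf = RUSBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
print("F1 on held-out frames:", f1_score(y_te, clf.predict(X_te)))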
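The total error rate and F1 score quoted in the abstract are, presumably, the segment-based metrics standard in DCASE evaluations of the TUT datasets: per segment, misses and false alarms are paired into substitutions S, with the remainders counted as deletions D and insertions I, and ER = (S + D + I) / N for N reference events. A small sketch of that computation under this assumption (helper name and binary activity-matrix encoding are ours):

import numpy as np

def segment_based_metrics(ref, est):
    """ref, est: (n_segments, n_classes) binary event-activity matrices."""
    ref, est = ref.astype(bool), est.astype(bool)
    tp = (ref & est).sum()
    fp = (~ref & est).sum()
    fn = (ref & ~est).sum()
    fn_k = (ref & ~est).sum(axis=1)       # misses per segment
    fp_k = (~ref & est).sum(axis=1)       # false alarms per segment
    S = np.minimum(fn_k, fp_k).sum()      # substitutions
    D = np.maximum(0, fn_k - fp_k).sum()  # deletions
    I = np.maximum(0, fp_k - fn_k).sum()  # insertions
    ER = (S + D + I) / ref.sum()
    P, R = tp / (tp + fp), tp / (tp + fn)
    return ER, 2 * P * R / (P + R)        # error rate, F1 score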




Author information

Corresponding author

Correspondence to Wenjun Yang.


Cite this article

Yang, W., Krishnan, S. Sound event detection in real-life audio using joint spectral and temporal features. SIViP 12, 1345–1352 (2018). https://doi.org/10.1007/s11760-018-1288-7
