Abstract
Typical methods for overlapping sound event detection (SED) do not fully exploit the joint spectral and temporal transition characteristics of the audio signal: they generally train models either on isolated data from each event class or on mixed signals containing simultaneous sound events. This paper introduces a new approach to SED in real-life audio based on Nonnegative Matrix Factor 2-D Deconvolution (NMF2D) and RUSBoost. The idea is to capture the two-dimensional joint spectral and temporal information in the time-frequency representation while simultaneously separating the sound mixture into several sources. In addition, RUSBoost is used to address the class imbalance of the training data. The proposed approach is evaluated on the TUT Sound Events 2016 and 2017 datasets, where it outperforms the baseline methods: on TUT Sound Events 2016 it reduces the total error rate by 5% while increasing the F1 score by 13.8%, and on TUT Sound Events 2017 it reduces the total error rate by 3% while increasing the F1 score by 8.1%.
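For orientation beyond the abstract: NMF2D generalizes nonnegative matrix factorization with shifts along both the time and frequency axes, so a single basis can model a sound event whose spectrum evolves over time. A common formulation of the model, shown below in illustrative notation that may differ from the paper's own, approximates the (log-frequency) magnitude spectrogram V as a sum of doubly shifted factor products.

    % Standard NMF2D model (illustrative, not copied from the paper):
    % the down-arrow shifts W^tau down by phi rows (a frequency shift);
    % the right-arrow shifts H^phi right by tau columns (a time shift).
    \mathbf{V} \approx \boldsymbol{\Lambda}
        = \sum_{\tau=0}^{T-1} \sum_{\phi=0}^{P-1}
          \overset{\downarrow \phi}{\mathbf{W}^{\tau}}\,
          \overset{\rightarrow \tau}{\mathbf{H}^{\phi}}

On the classification side, RUSBoost couples AdaBoost with random undersampling of the majority class in every boosting round. A minimal, self-contained sketch of that step follows, using the open-source imbalanced-learn package rather than the authors' implementation; the feature matrix and labels are placeholders standing in for per-frame features (e.g., NMF2D activations) and per-frame event labels.

    # Hedged sketch: RUSBoost for one binary event detector; real SED
    # would train one such classifier per event class.
    import numpy as np
    from imblearn.ensemble import RUSBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 40))        # placeholder frame features
    y = (rng.random(5000) < 0.05).astype(int)  # ~5% positives: heavy imbalance

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Each boosting round re-balances the training set by randomly
    # undersampling the majority (non-event) class, so the weak learners
    # are not swamped by negative frames.
    clf = RUSBoostClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr, y_tr)
    print("frame-level F1:", f1_score(y_te, clf.predict(X_te)))

Undersampling inside the boosting loop, rather than once before training, exposes each round to a different majority-class subsample; that is what distinguishes RUSBoost from plain random undersampling followed by boosting.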
Cite this article
Yang, W., Krishnan, S. Sound event detection in real-life audio using joint spectral and temporal features. SIViP 12, 1345–1352 (2018). https://doi.org/10.1007/s11760-018-1288-7