Abstract
This paper proposes a robust scream-sound detection scheme for acoustic surveillance applications. To enhance the discriminability between scream and non-scream sounds, a sound-event partitioning (SEP) method is developed that facilitates the extraction of multiple acoustic vectors from a single sound event. Regularized principal component analysis (PCA) and normalization are applied to the acoustic vectors, which are then classified by support vector machines (SVMs). Experimental results based on 1000 sound events show that the proposed scheme remains effective even when there are severe mismatches between the training and testing conditions. The results also show that the proposed scheme can reduce the equal error rate (EER) by up to 60% compared with a classical approach that uses mel-frequency cepstral coefficients (MFCCs) as features. Extensive analyses of the different processing stages of the proposed scheme further suggest that sound partitioning and feature normalization play important roles in boosting detection performance.
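Since the abstract reports performance in terms of the equal error rate, a minimal sketch of how an EER can be estimated from detector scores may help. The function name `equal_error_rate`, the threshold sweep over observed scores, and the convention that higher scores indicate a scream are all assumptions made for illustration. This is not the authors' evaluation code, and the scores used in the usage note are made up.

```python
def equal_error_rate(scream_scores, nonscream_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    Assumes higher scores mean "more scream-like". Returns (eer, threshold),
    taken at the threshold where the false-acceptance and false-rejection
    rates are closest to each other.
    """
    thresholds = sorted(set(scream_scores) | set(nonscream_scores))
    best = None
    for t in thresholds:
        # False rejection: a scream whose score falls below the threshold.
        frr = sum(s < t for s in scream_scores) / len(scream_scores)
        # False acceptance: a non-scream whose score reaches the threshold.
        far = sum(s >= t for s in nonscream_scores) / len(nonscream_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

As a usage example, calling `equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])` sweeps eight thresholds and picks the one at which both error rates coincide. With so few scores the estimate is coarse, and a real evaluation would use many more trials.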
Notes
It is important to note that individual frames do not contain sufficient information for differentiating scream from non-scream sounds. In fact, individual frames of scream and non-scream sounds overlap heavily in the feature space, which causes problems if they are used directly for training SVM classifiers.
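The note above motivates classifying vectors that summarize many frames rather than single frames. The sketch below assumes contiguous partitions and mean pooling, which is an illustrative stand-in and not the paper's SEP method, whose details are not given in this section. The function name `partition_and_pool` and its parameters are hypothetical.

```python
def partition_and_pool(frames, num_partitions):
    """Split a sequence of frame vectors into contiguous partitions and
    mean-pool each one, giving one summary vector per partition.

    `frames` is a list of equal-length lists of floats (one per frame).
    Mean pooling is an assumption here, chosen only for simplicity.
    """
    n = len(frames)
    if num_partitions < 1 or num_partitions > n:
        raise ValueError("num_partitions must be between 1 and len(frames)")
    pooled = []
    for k in range(num_partitions):
        # Integer boundaries keep partitions contiguous and non-empty.
        start = k * n // num_partitions
        end = (k + 1) * n // num_partitions
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return pooled
```

Each sound event then yields several pooled vectors instead of one vector per frame, which is the general idea of extracting multiple acoustic vectors from a single event.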
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 61402296), the Motorola Solutions Foundation (ID: 7186445), and The Hong Kong Polytechnic University (Grant No. G-YL78). The authors would like to thank Wing-Lung Leung for developing the sound recording system and part of the Android app.
Cite this article
Lei, B., Mak, MW. Robust scream sound detection via sound event partitioning. Multimed Tools Appl 75, 6071–6089 (2016). https://doi.org/10.1007/s11042-015-2555-z
DOI: https://doi.org/10.1007/s11042-015-2555-z