research-article

Spoofing detection goes noisy

Published: 01 December 2016

Abstract

  • TTS and VC attack detection under additive noise is studied for the first time.
  • Various front-ends for synthetic speech detection under additive noise are systematically analyzed on the ASVspoof 2015 database.
  • Relative phase shift (RPS) features perform better than the other features considered in the clean condition.
  • Mel-frequency cepstral coefficients (MFCCs) and subband spectral centroid magnitude coefficient (SCMC) features are the two best techniques among seven different front-ends under noisy conditions.
  • A standard GMM performs better than i-vector PLDA in both clean and noisy conditions.

Automatic speaker verification (ASV) technology has recently been finding its way into end-user applications for secure access to personal data, smart services or physical facilities. Like other biometric technologies, speaker verification is vulnerable to spoofing attacks, in which an attacker masquerades as a particular target speaker via impersonation, replay, text-to-speech (TTS) or voice conversion (VC) techniques to gain illegitimate access to the system. We focus on TTS and VC, which represent the most flexible, high-end spoofing attacks. Most prior studies on synthesized or converted speech detection report their findings using high-quality clean recordings. Meanwhile, the performance of spoofing detectors in the presence of additive noise, an important consideration in practical ASV implementations, remains largely unknown. To this end, our study provides a comparative analysis of existing state-of-the-art, off-the-shelf synthetic speech detectors under additive noise contamination, with a special focus on front-end processing, which has been found to be critical. Our comparison includes eight acoustic feature sets, five related to spectral magnitude and three to spectral phase information. All the methods contain a number of internal control parameters.
Except for feature post-processing steps (deltas and cepstral mean normalization) that we optimized for each method, we fix the internal control parameters to their default values based on the literature, and compare all the variants using exactly the same dimensionality and back-end system. In addition to the eight feature sets, we consider two alternative classifier back-ends: Gaussian mixture model (GMM) and i-vector, the latter with both cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. Our extensive analysis on the recent ASVspoof 2015 challenge provides new insights into the robustness of the spoofing detectors. Firstly, unlike in most other speech processing tasks, all the compared spoofing detectors break down even at relatively high signal-to-noise ratios (SNRs) and fail to generalize to noisy conditions even if they perform excellently on clean data. This indicates both the difficulty of the task and how easily the methods can be over-fitted. Secondly, speech enhancement pre-processing was not found to be helpful. Thirdly, the GMM back-end generally outperforms the more involved i-vector back-end. Fourthly, concerning the compared features, the Mel-frequency cepstral coefficient (MFCC) and subband spectral centroid magnitude coefficient (SCMC) features perform the best on average, though the winning method depends on SNR and noise type. Finally, a study with two score fusion strategies shows that combining systems based on different features improves recognition accuracy for known and unknown attacks in both clean and noisy conditions. In particular, simple score averaging fusion, as opposed to weighted fusion with logistic loss weight optimization, was found to work better on average. For clean speech, it provides 88% and 28% relative improvements over the best standalone features for known and unknown spoofing techniques, respectively.
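As a rough illustration of the simple score averaging fusion and the relative improvement figures mentioned above, the following Python sketch may help. The function names, the score layout, and every number in it are hypothetical and are not taken from the study; the error metric is left generic because the abstract does not name it, and the scores of the individual systems are assumed to be on comparable scales.

```python
from statistics import mean


def average_fusion(scores_per_system):
    """Fuse detection scores by simple (unweighted) averaging.

    scores_per_system: list of equal-length score lists, one list per
    feature-based system (hypothetical layout, assumed comparable scales).
    Returns one fused score per trial.
    """
    n_trials = len(scores_per_system[0])
    if any(len(scores) != n_trials for scores in scores_per_system):
        raise ValueError("all systems must score the same trials")
    # zip(*...) regroups the scores trial by trial across systems.
    return [mean(trial_scores) for trial_scores in zip(*scores_per_system)]


def relative_improvement(baseline_error, fused_error):
    """Relative error reduction in percent: 100 * (baseline - fused) / baseline."""
    return 100.0 * (baseline_error - fused_error) / baseline_error


# Made-up scores from two hypothetical systems on three trials.
fused = average_fusion([[1.0, -2.0, 0.5], [3.0, 0.0, -0.5]])

# Made-up error rates, chosen only to show what an 88% relative
# improvement means arithmetically.
gain = relative_improvement(baseline_error=1.0, fused_error=0.12)
```

Unweighted averaging has no parameters to train, which is one plausible reason it can hold up better than a learned weighting when test conditions differ from the training data; the abstract itself reports only that averaging worked better on average.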
If we consider the best score fusion of just two features, then RPS complements one of the magnitude features. To sum up, our study reveals a significant gap in the performance of state-of-the-art spoofing detectors between clean and noisy conditions.
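The additive noise contamination at a given SNR that the abstract refers to can be sketched as follows. This is a minimal, standard-library Python illustration under stated assumptions, not the authors' procedure: the helper names, the synthetic sine "speech", the Gaussian noise, and the 10 dB target are all hypothetical choices made for the example.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so that the mixture has the requested SNR.

    Uses SNR_dB = 10 * log10(P_speech / P_noise) to derive the noise gain.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of a mixture, given the clean signal it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins: a 220 Hz sine sampled at 16 kHz and white Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = add_noise_at_snr(speech, noise, snr_db=10.0)
```

Calling `measured_snr_db(speech, noisy)` afterwards is a quick sanity check that the mixture really sits at the requested SNR.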

Published In

Speech Communication  Volume 85, Issue C
December 2016
130 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Additive noise
  2. Anti spoofing
  3. Countermeasures
  4. Speaker recognition

Qualifiers

  • Research-article
