1. Introduction
In recent years, the performance of automatic speaker verification (ASV) systems has improved significantly, and ASV has become one of the most natural and convenient biometric recognition methods. ASV technologies are now widely deployed in many diverse applications and services, such as call centers, intelligent personalized services, payment security, and access authentication. However, many recent studies have found that state-of-the-art ASV systems fail to handle speech spoofing attacks, especially the three major types: replay [1], speech synthesis [2,3], and voice conversion. Therefore, great efforts are required to ensure adequate protection of ASV systems against spoofing. Because the distributions of genuine and spoofed speech largely overlap, spoofing attack detection is challenging. As commonly used acoustic features share many similarities between genuine and spoofed speech, features specifically designed for spoofing attack detection are in great demand.
Replay attacks use pre-recorded speech samples collected from genuine target speakers to attack ASV systems. They are the simplest and most accessible attacks, and they pose a huge threat to ASV because of the availability of high-quality, low-cost recording devices [1]. Unlike replay attacks, speech synthesis and voice conversion attacks are usually produced by sophisticated speech processing techniques: state-of-the-art text-to-speech (TTS) systems are used to generate artificial speech signals, while voice conversion (VC) systems adapt a given natural speech signal to a target speaker. Given sufficient training data, both speech synthesis and voice conversion can produce high-quality spoofing signals that deceive ASV systems [2].
To support research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures, the Automatic Speaker Verification Spoofing and Countermeasures (ASVSpoof) challenges have been organized biennially by the ASV community since 2015. The most recent, ASVSpoof 2019, was the first to consider replay, speech synthesis, and voice conversion attacks within a single challenge [2]. In ASVSpoof 2019, these attacks are divided into the logical access (LA: speech synthesis and voice conversion attacks) and physical access (PA: replay attacks) scenarios according to their use cases (https://www.asvspoof.org/ (accessed on 1 November 2021)).
In this study, we focus on using the ASVSpoof 2019 database to explore automatic spoofing detection methods for both logical and physical access spoofing attacks. We investigate two types of new acoustic features to enhance the discriminative information between bonafide and spoofed speech utterances.
The first proposed features consist of two cepstral coefficient features and one log spectrum feature, which are extracted from the linear prediction (LP) residual signals of speech utterances. The LP residual signal of each utterance implicitly contains its excitation source information. Although the latest TTS and VC algorithms can produce spoofed speech that is perceptually indistinguishable from bonafide speech under well-controlled conditions, there is still a large difference between the excitation source signals of spoofed and bonafide speech. Moreover, from an analysis of replayed speech, we find that most playback device distortions occur in the excitation signals and have little effect on the spectral trajectories. Therefore, compared with typical acoustic features extracted directly from the speech signal, we expect the cepstral features extracted from the residual signals to better discriminate between bonafide and spoofed speech, for synthetic and voice-converted attacks as well as for replayed attacks.
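As an illustration, a minimal sketch of extracting the LP residual from a waveform is given below. The LP order, frame length, and hop size here are illustrative assumptions, not the exact analysis settings of this work.

```python
import numpy as np
import scipy.signal
import librosa

def lp_residual_frames(y, order=16, frame_len=400, hop=160):
    """Inverse-filter each windowed frame with its own LP polynomial A(z),
    so each output row is the residual (excitation) e(n) = A(z) * s(n)."""
    frames = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * np.hamming(frame_len)
        a = librosa.lpc(frame, order=order)        # a[0] == 1 by convention
        frames.append(scipy.signal.lfilter(a, [1.0], frame))
    return np.stack(frames)

# Residual cepstral features (RCQCC, RLFCC) and residual log spectra would
# then be computed from these residual frames instead of the raw speech.
```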
The second new feature is the harmonic and noise subband ratio (HNSR) feature. It is motivated by the conventional Harmonic plus Noise Model (HNM) [4] for high-quality speech analysis. In the HNM, the speech signal is decomposed into a deterministic component and a stochastic (noise or residual) component. Based on this decomposition, we assume that differences in the interaction between the vocal tract and the glottal airflow of bonafide and spoofed speech are reflected in the HNSR features and can be used to distinguish the two. The usefulness of the proposed features is examined both in the t-distributed stochastic neighbor embedding (t-SNE) space and in the spoofing detection modeling space.
Besides these two new features, this paper also compares various single features on two major baseline systems: a Gaussian mixture model (GMM)-based system and a light convolutional neural network (LCNN)-based system. Then, to verify the complementarity between different features, a score-level fusion system is proposed, as sketched below.
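For concreteness, a minimal sketch of score-level fusion is shown here. The equal weights and the stand-in score arrays are illustrative assumptions; in practice, the fusion weights would be tuned on the development set.

```python
import numpy as np

def fuse_scores(score_lists, weights):
    """Weighted sum of per-utterance detection scores from several subsystems."""
    scores = np.stack(score_lists)             # (n_systems, n_utterances)
    w = np.asarray(weights)[:, None]
    return (w * scores).sum(axis=0)

# Stand-in scores for two subsystems over 1000 trial utterances.
rng = np.random.default_rng(0)
gmm_scores = rng.standard_normal(1000)
lcnn_scores = rng.standard_normal(1000)
fused = fuse_scores([gmm_scores, lcnn_scores], weights=[0.5, 0.5])
```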
4. Harmonic and Noise Interaction Features
From the speech production mechanism and from the success of the Harmonic plus Noise Model (HNM) [4] for high-quality speech analysis and synthesis, we know that speech generation can be regarded as the interaction of the vocal tract with the glottal airflow. Our previous work [32] proposed a feature called the spectral subband energy ratios (SSER) to reflect this interaction property, and experimental results proved its effectiveness in characterizing speaker identity. We therefore expect that the natural interaction property of synthesized or replayed spoofing speech may be distorted by the speech synthesis algorithms or the recording devices, and that these distortions may yield very different interaction features for bonafide and spoofed speech. In this section, we explore the effectiveness of such interaction features for the ASVSpoof task.
As in our previous work [32], we again use spectral subband energy ratios as the interaction features, but we extract them differently from [32]. To distinguish these features from SSER, we call them the "Harmonic and Noise Subband Ratio (HNSR)". The principle of the HNSR is shown in Figure 9.
Given a speech segment $s(n)$, we first estimate the fundamental frequency (F0) and make the unvoiced/voiced decision using the normalized cross-correlation function (NCCF). Both F0 and the NCCF are estimated using the Kaldi toolkit [33], and the pitch markers are computed as in [34,35]. All unvoiced frames are discarded; only voiced frames are used to extract the HNSRs. The pitch-synchronous frames are then extracted using Hamming windows of two pitch periods, centered at each pitch marker. Finally, the harmonic component $h(n)$ is estimated with the HNM analysis [4]. Instead of obtaining the stochastic noise part by high-pass filtering as in [4] for speech synthesis, in this study we directly subtract $h(n)$ from the original windowed speech signal $s(n)$ to obtain the noise component $e(n) = s(n) - h(n)$.
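For reference, the NCCF used for the voiced/unvoiced decision can be illustrated with the following minimal numpy sketch (the paper relies on Kaldi's implementation; the lag range here assumes 16 kHz audio and is an illustrative choice):

```python
import numpy as np

def nccf(frame, lag_min=32, lag_max=320):
    """Normalized cross-correlation of a frame with its lagged copies.
    Values near 1 indicate strong periodicity (voiced speech); the lag
    of the peak is a pitch-period candidate."""
    n = len(frame) - lag_max
    x0 = frame[:n]
    e0 = np.dot(x0, x0)
    vals = np.empty(lag_max - lag_min + 1)
    for i, k in enumerate(range(lag_min, lag_max + 1)):
        xk = frame[k:k + n]
        vals[i] = np.dot(x0, xk) / np.sqrt(e0 * np.dot(xk, xk) + 1e-12)
    return vals
```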
Once $h(n)$ and $e(n)$ are available, we transform them into the spectral domain and compute the frequency subband energies as

$$E_x(b) = \sum_{f = f_l(b)}^{f_h(b)} \left| \mathcal{F}\{x(n)\}(f) \right|^2, \quad x \in \{h, e\},$$

where $\mathcal{F}$ denotes the short-time Fourier transform, and $f_l(b)$ and $f_h(b)$ represent the starting and ending frequencies of the $b$-th spectral subband. For each frequency subband, we set the bandwidth to the average F0 value (225 Hz) over all the training datasets, so as to have as many spectral subbands as possible. The maximum voiced frequency used in the HNM analysis is fixed to 8 kHz, so we obtain a 35-dimensional HNSR feature vector for each voiced frame as

$$\mathrm{HNSR} = \left[ \frac{E_h(1)}{E_e(1)}, \frac{E_h(2)}{E_e(2)}, \ldots, \frac{E_h(35)}{E_e(35)} \right].$$
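A minimal sketch of this computation for one voiced frame is given below, assuming the harmonic part h and the noise part e = s - h are already available from the HNM analysis (not shown); the FFT size is an illustrative choice:

```python
import numpy as np

def hnsr(h, e, sr=16000, bandwidth=225.0, max_freq=8000.0, n_fft=2048):
    """Harmonic-to-noise energy ratio in consecutive spectral subbands."""
    H = np.abs(np.fft.rfft(h, n_fft)) ** 2        # harmonic power spectrum
    E = np.abs(np.fft.rfft(e, n_fft)) ** 2        # noise power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    n_bands = int(max_freq // bandwidth)          # 8000 / 225 -> 35 bands
    ratios = np.empty(n_bands)
    for b in range(n_bands):
        lo, hi = b * bandwidth, (b + 1) * bandwidth
        band = (freqs >= lo) & (freqs < hi)
        ratios[b] = H[band].sum() / (E[band].sum() + 1e-12)
    return ratios
```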
As with the RCQCC and RLFCC visualizations, we also apply t-SNE to examine the distribution of the HNSR features, as shown in Figure 10. Compared with Figure 7 and Figure 8, the discrimination of the HNSRs is much poorer. This indicates that the HNSR features alone cannot distinguish genuine from spoofed speech well; nevertheless, in this study we can still verify whether the HNSR features carry information complementary to other acoustic features.
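The visualization can be reproduced with a short scikit-learn sketch such as the following; the random stand-in arrays merely take the place of real HNSR vectors and their bonafide/spoof labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
feats = rng.standard_normal((400, 35))    # stand-in for 35-dim HNSR vectors
labels = rng.integers(0, 2, 400)          # 0 = bonafide, 1 = spoof

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(feats)
for lab, name in [(0, "bonafide"), (1, "spoof")]:
    idx = labels == lab
    plt.scatter(emb[idx, 0], emb[idx, 1], s=4, label=name)
plt.legend()
plt.show()
```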
6. Conclusions
In this paper, we investigate two types of new acoustic features to improve the detection of synthetic and replayed speech spoofing attacks. One type consists of the residual CQCC, LFCC, and LogSpec features; the other is the HNSR interaction feature. From the experimental results, we find that the acoustic features extracted from the residual signals perform much better than those extracted from the source speech signals. Although the single system built on the HNSR features performs poorly, it still provides a small amount of complementary information in score-level fusion. Furthermore, we investigate different score-level fusion strategies and find that all of the proposed residual features provide significantly complementary information to the official baselines, and that the GMM-based and LCNN-based systems offer further complementary information (more than 30% relative EER reduction) to improve the final system performance.
Although the experimental results demonstrate the effectiveness of the proposed features, these features may not outperform all other acoustic features for the ASVSpoof task. Future work will focus on evaluating other acoustic features to conduct a comprehensive feature-level comparison for the ASVSpoof task, and on improving the generalization ability of anti-spoofing countermeasures under different conditions.