Attacking Speaker Recognition With Deep Generative Models
Wilson Cai, Anish Doshi, Rafael Valle
UC Berkeley
Fig. 1: Architecture for CNN speaker verifier (input: 64 x 64 Mel-Spectrogram; L1: 3x3x32 convolution).

We train our speaker classifier using 64 by 64 Mel-Spectrograms (64 mel bands and 64 frames, 100 ms each) from 3 speech datasets, including 100 speakers from NIST 2004, speaker p280 from CSTR VCTK and the single speaker in Blizzard. Our speaker classifier has a rejection path, the "other" class, trained on environmental sounds using samples from the ESC-50 dataset. Our model achieves approximately 85% test set accuracy.
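As a rough illustration of such a verifier, the PyTorch sketch below uses only the pieces stated in the paper (64 x 64 Mel-Spectrogram input, a 3x3x32 first convolution L1, a 1024-dimensional first fully-connected layer L5, and an "other" rejection class); the remaining layer sizes and the class count of 103 are our assumptions rather than the exact architecture of [12]:

import torch
import torch.nn as nn

class SpeakerVerifierCNN(nn.Module):
    """Hypothetical CNN speaker verifier over 64x64 Mel-Spectrogram patches."""

    def __init__(self, num_classes=103):  # 100 NIST speakers + p280 + Blizzard + "other" (assumed count)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # L1: 3x3 convolution, 32 filters
            nn.ReLU(), nn.MaxPool2d(2),                    # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2),                    # 32x32 -> 16x16
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2),                    # 16x16 -> 8x8
        )
        self.embed = nn.Linear(128 * 8 * 8, 1024)          # L5: 1024-d deep feature (Phi)
        self.classify = nn.Linear(1024, num_classes)

    def forward(self, mel, return_embedding=False):
        phi = torch.relu(self.embed(self.features(mel).flatten(1)))
        return phi if return_embedding else self.classify(phi)

logits = SpeakerVerifierCNN()(torch.randn(8, 1, 64, 64))   # batch of 8 spectrogram patches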
3.2. Adversarial attacks

We define adversarial attacks on speaker recognition systems as targeted or untargeted. In targeted attacks, an adversary produces samples that the recognition system classifies as a specific target speaker; in untargeted attacks, the adversary produces samples that the system classifies as any of the real speakers it was trained on. In our attacks, this more specifically means Mel-Spectrogram synthesis.
4.2. Data pre-processing

Data pre-processing is dependent on the model being trained. For SampleRNN and WaveNet, the raw audio is reduced to 16kHz and quantized using the µ-law companding transformation as referenced in [7] and [6]. For the model based on the Wasserstein GAN, we pre-process the data by converting it to 16kHz and removing silences by using the WebRTC Voice Activity Detector (VAD) as referenced in [14]. For the CNN speaker recognition system, the data is pre-processed by resampling to 16kHz when necessary and removing silences by using the aforementioned VAD.
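The sketch below illustrates this pre-processing, assuming librosa for loading and resampling and the webrtcvad Python bindings for the VAD; the file path, VAD aggressiveness and frame length are placeholders:

import numpy as np
import librosa
import webrtcvad  # WebRTC VAD bindings [14]

def mu_law_quantize(x, mu=255):
    """8-bit mu-law companding of waveforms in [-1, 1], as used for WaveNet/SampleRNN inputs [6, 7]."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # integers in [0, 255]

def remove_silences(y, sr=16000, frame_ms=30, aggressiveness=2):
    """Keep only the frames that the WebRTC VAD marks as speech (16-bit mono PCM frames)."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_len = int(sr * frame_ms / 1000)
    pcm = (y * 32767).astype(np.int16)
    voiced = []
    for start in range(0, len(pcm) - frame_len, frame_len):
        frame = pcm[start:start + frame_len]
        if vad.is_speech(frame.tobytes(), sr):
            voiced.append(y[start:start + frame_len])
    return np.concatenate(voiced) if voiced else y

y, sr = librosa.load("utterance.wav", sr=16000)   # resample to 16 kHz
y_gan = remove_silences(y, sr)                    # GAN / CNN pipelines
y_wavenet = mu_law_quantize(y)                    # WaveNet / SampleRNN pipelines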
4.3. Feature extraction

SampleRNN and WaveNet operate at the sample level, i.e. waveform, thus requiring no feature extraction. The features used for the neural speaker recognition system are based on Mel-Spectrograms with dynamic range compression. The Mel-Spectrogram is obtained by projecting a spectrogram onto a mel scale. We use the python library librosa to project the spectrogram onto 64 mel bands, with window size equal to 1024 samples and hop size equal to 160 samples, i.e. 100ms long frames. Dynamic range compression is computed as described in [12], as log(1 + C * M), where C is a compression constant scalar set to 1000 and M is a matrix representing the Mel-Spectrogram. Training the GAN is also done with Mel-Spectrogram image patches of 64 bands and 64 frames.
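A minimal sketch of this feature pipeline with librosa is shown below; the default power-spectrogram settings, the non-overlapping patching step and the file path are our assumptions:

import numpy as np
import librosa

def mel_features(y, sr=16000, n_mels=64, n_fft=1024, hop_length=160, C=1000.0):
    """64-band Mel-Spectrogram with log(1 + C*M) dynamic range compression [12]."""
    M = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return np.log1p(C * M)

def patches_64x64(mel, n_frames=64):
    """Split a compressed Mel-Spectrogram into non-overlapping 64x64 patches."""
    n = mel.shape[1] // n_frames
    return np.stack([mel[:, i * n_frames:(i + 1) * n_frames] for i in range(n)])

y, _ = librosa.load("utterance.wav", sr=16000)
patches = patches_64x64(mel_features(y))   # shape: (num_patches, 64, 64)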
4.4. Models

4.4.1. WaveNet

Due to constraints on computing power and the extreme difficulty in training WaveNet (our community has not been able to replicate the results in Google's paper), we used samples from WaveNet models that had been pre-trained for 88 thousand iterations. Parameters of the models were kept the same as those in [6]. The ability of WaveNet to perform untargeted attacks amounts to using a model trained on an entire corpus. Targeted attacks are more difficult: we found that a single speaker's data was not enough to train WaveNet to converge successfully. To construct speaker-dependent samples, we relied on samples from pre-trained models that were globally conditioned on speaker ID. Based on informal listening experiments, such samples do sound very similar to the real speech of the speaker in question.

4.4.2. sampleRNN

Similarly to WaveNet, we found that the best (least noisy) sampleRNN samples came from models which were pretrained with a high number of iterations. Accordingly, we obtained samples from the three-tiered architecture, trained on the Blizzard 2013 dataset [15], which as mentioned in Section 3 is a 300 hour corpus of a single female speaker's narration. We also downloaded samples from online repositories, including samples from the original paper's online repository at https://soundcloud.com/samplernn/sets, which we qualitatively found to have less noise than ours.

4.4.3. WGAN

In all of our experiments, we use the Wasserstein GAN with gradient penalty (WGAN-GP), which we found makes the model converge better than regular WGAN [16] or GAN [17]. In our experiments, we trained a WGAN-GP to produce mel-spectrograms from 1 target speaker against a set of 101 speakers. On each critic iteration, we fed it with a batch of samples from one target speaker, and a batch of data uniformly sampled from the other speakers. We used two popular architectures for generator/critic pairs: DCGAN [18] and ResNet [19].

Performing untargeted attacks with the WGAN-GP (i.e., training the network to output speech samples that mimic the distribution of speech) is relatively straightforward: we simply train the WGAN-GP using all speakers in our dataset. However, the most natural attack is a targeted one, where the GAN is trained to directly fool a speaker recognition system, i.e., to produce samples that the system classifies as matching a target speaker with reasonable confidence.
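For reference, the standard WGAN-GP critic objective of [20] used here can be sketched in PyTorch as follows; the critic callable, tensor shapes and batching are placeholders:

import torch

def gradient_penalty(critic, real, fake):
    """Gradient penalty of [20]: penalize ||grad_x_hat D(x_hat)||_2 deviating from 1."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def wgan_gp_critic_loss(critic, real, fake, lam=10.0):
    """Standard WGAN-GP critic objective: E[D(fake)] - E[D(real)] + lam * gradient penalty."""
    return (critic(fake).mean() - critic(real).mean()
            + lam * gradient_penalty(critic, real, fake))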
4.4.4. WGAN-GP with modified objective function

A naive approach for targeted attacks is to train the GAN on the data of the single target speaker. A drawback of this approach is that the critic, and by consequence the generator, does not have access to universal properties of speech. To circumvent this problem, we rely on semi-supervised learning and propose a modification to the critic's objective function that allows it to learn to differentiate between not only real samples and generated samples, but also between real speech samples from a target speaker and real speech samples from other speakers. We do this by adding a term to the critic's loss that encourages it to classify real speech samples from untargeted speakers as fake:

\[
\underbrace{\mathbb{E}_{\tilde{x}\sim P_g}\!\left[D(\tilde{x})\right]}_{\text{Generated Samples}}
+ \alpha \, \underbrace{\mathbb{E}_{\dot{x}\sim P_{\dot{x}}}\!\left[D(\dot{x})\right]}_{\text{Different Speakers}}
- \underbrace{\mathbb{E}_{x\sim P_r}\!\left[D(x)\right]}_{\text{Real Speaker}}
+ \lambda \, \underbrace{\mathbb{E}_{\hat{x}\sim P_{\hat{x}}}\!\left[(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1)^2\right]}_{\text{Gradient Penalty}}
\tag{1}
\]

where P_ẋ is the distribution of samples from other speakers and α is a tunable scaling factor. Note that equation 1 is no longer a direct approximation of the Wasserstein distance. Rather, it provides a balance of the distance between the fake distribution and the real one, and the distance between other speakers' distribution and the target speaker's one. We refer to this objective function as mixed loss.

Initially, we were able to converge the targeted loss model using the same parameters as [20], namely 5 critic iterations per generator iteration, a gradient penalty weight of 10, and a batch size of 64. Both the generator and critic were trained using the Adam optimizer [21]. However, under these parameters we found that the highest α weight we could successfully use was 0.1 (we found that not including this scaling factor led to serious overfitting and poor convergence of the GAN).

In order to circumvent these problems and train a model with α set to 1, we made modifications to the setup, including setting the standard deviation of the DCGAN discriminator's weight initialization to 0.05 and the number of critic iterations to 20. To accommodate the critic's access to additional data in the mixed loss function (1), we increased the generator's learning rate. Finally, we added Gaussian noise to the target speaker data to prevent overfitting.
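A sketch of the mixed loss in equation (1), reusing the gradient_penalty helper from the sketch at the end of subsection 4.4.3; real_target, other_speakers and fake denote batches of target-speaker, non-target-speaker and generated Mel-Spectrograms respectively:

def mixed_critic_loss(critic, real_target, other_speakers, fake, alpha=1.0, lam=10.0):
    """Equation (1): real speech from non-target speakers is scored like fake data,
    scaled by alpha, on top of the usual WGAN-GP terms."""
    return (critic(fake).mean()                              # generated samples
            + alpha * critic(other_speakers).mean()          # different speakers
            - critic(real_target).mean()                     # real target speaker
            + lam * gradient_penalty(critic, real_target, fake))

def generator_loss(critic, fake):
    """The generator objective is unchanged: maximize the critic score of its samples."""
    return -critic(fake).mean()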
5. RESULTS

5.1. GAN Mel-Spectrogram

Using the improved Wasserstein GANs framework, we trained generators to construct 64x64 mel-spectrogram images from a noise vector. Visual results are demonstrated in Figure 2. We saw recognizable Mel-Spectrogram-like features in the generated samples.
5.2. GAN Adversarial attacks

Within the GAN framework, we train models for untargeted attacks by using all data available from speakers that the speaker recognition system was trained on, irrespective of class label. We show in subsection 5.2.1 that an untargeted model able to generate data from the real distribution with enough variety can be used to perform adversarial attacks. Figure 3a shows that our GAN-trained generator successfully learns all speakers across the dataset, without mode collapse.

As we described earlier, the models for targeted attacks can be trained in two manners: 1) conditioning the model on additional information, e.g. class labels, as described in [22]; 2) using only data from the label of interest. While the first approach might result in mode collapse, a drawback of the second approach is that the discriminator, and by consequence the generator, does not have access to universal properties of speech (we draw a parallel with Universal Background Models in speech). In the targeted attacks subsection 5.2.2 we show results using our new objective function described in equation 1, which allows using data from all speakers.

5.2.1. Untargeted attacks

For each speaker's audio data in the test set, we compute a Mel-Spectrogram as described in section 4.3. The resulting Mel-Spectrogram is then fed into the CNN recognizer and we extract a 1024-dimensional feature Φ from the first fully-connected layer (L5) of the pre-trained CNN model (Fig. 1) trained on the real speech dataset with all speaker IDs. This deep feature/embedding Φ is then used to train a K-nearest-neighbor (KNN) classifier, with K equal to 5.

To control the generator trained by our WGAN-GP, we feed the generated Mel-Spectrograms into the same CNN-L7 pipeline to extract their corresponding feature Φ̂. Utilizing the pre-trained KNN, each sample is assigned to the nearest speaker in the deep feature space. Therefore, we know which speaker our generated sample belongs to when we attack our CNN recognizer. We evaluate our controlled WGAN-GP samples against our CNN speaker recognition system, and the confusion matrix can be found in Figure 3a.
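A sketch of this controlled evaluation, assuming the 1024-dimensional embeddings have already been extracted from layer L5 for both real and generated Mel-Spectrograms; the array names, sizes and random placeholders are ours:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nearest_speakers(real_embeddings, real_labels, gen_embeddings, k=5):
    """Assign each generated sample to the nearest speaker in deep-feature space.

    real_embeddings: (N, 1024) features Phi of real test Mel-Spectrograms,
    real_labels: their speaker IDs, gen_embeddings: (M, 1024) features Phi-hat
    of generated samples."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(real_embeddings, real_labels)
    return knn.predict(gen_embeddings)

# Example with random placeholders standing in for the extracted embeddings.
preds = nearest_speakers(np.random.randn(500, 1024),
                         np.random.randint(0, 103, size=500),
                         np.random.randn(64, 1024))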
5.2.2. Targeted attacks

We trained the WGAN-GP on the entirety of the NIST 2004 corpus (100 speakers), a single speaker (P280) from the VCTK Corpus, and the single speaker from the Blizzard dataset. The samples from the other models were either downloaded from the web or created from WaveNet globally conditioned on the single VCTK corpus speaker, and from SampleRNN trained only on data from the Blizzard dataset. Results for the WGAN-GP are demonstrated in Figure 3. In the samples generated with the sampleRNN and WaveNet models, none of the predictions made by the classifier match the target speaker.

We also trained the WGAN-GP with and without the mixed loss on different speakers. The histogram of predictions in Figure 3b shows WGAN-GP results for speaker 0. The improved WGAN-GP loss achieves a 0.38 error rate and our mixed loss achieves a 0.12 error rate, a more than three-fold reduction in error.

6. DISCUSSION AND CONCLUSION

In this research we have investigated the use of generative speech models to perform adversarial attacks on speaker recognition systems. We show that the samples from the autoregressive models we trained (SampleRNN and WaveNet) or downloaded from the web were not able to fool the CNN speaker recognizers used in this research. On the other hand, we show that adversarial examples generated with GANs are successful in performing targeted and untargeted adversarial attacks on the speaker recognition system used herein.
7. REFERENCES

[1] Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li, "Spoofing and countermeasures for speaker verification: a survey," Speech Communication, vol. 66, pp. 130–153, 2015.

[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.

[3] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.

[4] Sanjit A Seshia, Dorsa Sadigh, and S Shankar Sastry, "Towards verified artificial intelligence."

[5] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," arXiv preprint arXiv:1703.10135, 2017.

[6] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "Wavenet: A generative model for raw audio," CoRR abs/1609.03499, 2016.

[7] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio, "Samplernn: An unconditional end-to-end neural audio generation model," arXiv preprint arXiv:1612.07837, 2016.

[8] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1528–1540.

[9] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou, "Hidden voice commands," in 25th USENIX Security Symposium (USENIX Security 16), Austin, TX, 2016.

[10] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, "Segan: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.

[11] Jonathan Chang and Stefan Scherer, "Learning representations of emotional speech with deep convolutional generative adversarial networks," arXiv preprint arXiv:1705.02394, 2017.

[12] Yanick Lukic, Carlo Vogt, Oliver Dürr, and Thilo Stadelmann, "Speaker identification and clustering using convolutional neural networks," in Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on. IEEE, 2016, pp. 1–6.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.

[14] Adham Zeidan, Armin Lehmann, and Ulrich Trick, "Webrtc enabled multimedia conferencing and collaboration solution," in WTC 2014; World Telecommunications Congress 2014; Proceedings of. VDE, 2014, pp. 1–6.

[15] Kishore Prahallad, Anandaswarup Vadapalli, Naresh Elluru, G Mantena, B Pulugundla, P Bhaskararao, HA Murthy, S King, V Karaiskos, and AW Black, "The blizzard challenge 2013–indian language task," in Blizzard Challenge Workshop, 2013, vol. 2013.

[16] Martin Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein gan," arXiv preprint arXiv:1701.07875, 2017.

[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in neural information processing systems, 2014, pp. 2672–2680.

[18] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.

[19] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint arXiv:1609.04802, 2016.

[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville, "Improved training of wasserstein gans," arXiv preprint arXiv:1704.00028, 2017.

[21] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[22] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.