Attacking Speaker Recognition With Deep Generative Models
Wilson Cai, Anish Doshi, Rafael Valle
UC Berkeley
Fig. 1: Architecture for CNN speaker verifier (input: 64 x 64 Mel-Spectrogram; L1: 3x3x32 convolution).

We train our speaker classifier using 64 by 64 Mel-Spectrograms (64 mel bands and 64 frames, 100 ms each) from 3 speech datasets, including 100 speakers from NIST 2004, speaker p280 from CSTR VCTK and the single speaker in Blizzard. Our speaker classifier has a rejection path, the "other" class, trained on environmental sounds using samples from the ESC-50 dataset. Our model achieves approximately 85% test set accuracy.
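As a rough illustration of such a verifier, the PyTorch sketch below uses only the pieces stated in the paper (64 x 64 Mel-Spectrogram input, a 3x3x32 first convolution L1, a 1024-dimensional first fully-connected layer L5, and an "other" rejection class); the remaining layer sizes and the class count of 103 are our assumptions rather than the exact architecture of [12]:

import torch
import torch.nn as nn

class SpeakerVerifierCNN(nn.Module):
    """Hypothetical CNN speaker verifier over 64x64 Mel-Spectrogram patches."""

    def __init__(self, num_classes=103):  # 100 NIST speakers + p280 + Blizzard + "other" (assumed count)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # L1: 3x3 convolution, 32 filters
            nn.ReLU(), nn.MaxPool2d(2),                    # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2),                    # 32x32 -> 16x16
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool2d(2),                    # 16x16 -> 8x8
        )
        self.embed = nn.Linear(128 * 8 * 8, 1024)          # L5: 1024-d deep feature (Phi)
        self.classify = nn.Linear(1024, num_classes)

    def forward(self, mel, return_embedding=False):
        phi = torch.relu(self.embed(self.features(mel).flatten(1)))
        return phi if return_embedding else self.classify(phi)

logits = SpeakerVerifierCNN()(torch.randn(8, 1, 64, 64))   # batch of 8 spectrogram patches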
3.2. Adversarial attacks

We define adversarial attacks on speaker recognition systems as targeted or untargeted. In targeted attacks, an adversary produces samples that the recognition system classifies as a specific target speaker; in untargeted attacks, the adversary produces samples that the system classifies as any of the real speakers it was trained on. In our attacks, this more specifically means Mel-Spectrogram synthesis.
4.2. Data pre-processing

Data pre-processing is dependent on the model being trained. For SampleRNN and WaveNet, the raw audio is reduced to 16kHz and quantized using the µ-law companding transformation as referenced in [7] and [6]. For the model based on the Wasserstein GAN, we pre-process the data by converting it to 16kHz and removing silences by using the WebRTC Voice Activity Detector (VAD) as referenced in [14]. For the CNN speaker recognition system, the data is pre-processed by resampling to 16kHz when necessary and removing silences by using the aforementioned VAD.
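The sketch below illustrates this pre-processing, assuming librosa for loading and resampling and the webrtcvad Python bindings for the VAD; the file path, VAD aggressiveness and frame length are placeholders:

import numpy as np
import librosa
import webrtcvad  # WebRTC VAD bindings [14]

def mu_law_quantize(x, mu=255):
    """8-bit mu-law companding of waveforms in [-1, 1], as used for WaveNet/SampleRNN inputs [6, 7]."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # integers in [0, 255]

def remove_silences(y, sr=16000, frame_ms=30, aggressiveness=2):
    """Keep only the frames that the WebRTC VAD marks as speech (16-bit mono PCM frames)."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_len = int(sr * frame_ms / 1000)
    pcm = (y * 32767).astype(np.int16)
    voiced = []
    for start in range(0, len(pcm) - frame_len, frame_len):
        frame = pcm[start:start + frame_len]
        if vad.is_speech(frame.tobytes(), sr):
            voiced.append(y[start:start + frame_len])
    return np.concatenate(voiced) if voiced else y

y, sr = librosa.load("utterance.wav", sr=16000)   # resample to 16 kHz
y_gan = remove_silences(y, sr)                    # GAN / CNN pipelines
y_wavenet = mu_law_quantize(y)                    # WaveNet / SampleRNN pipelines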
4.3. Feature extraction

SampleRNN and WaveNet operate at the sample level, i.e. waveform, thus requiring no feature extraction. The features used for the neural speaker recognition system are based on Mel-Spectrograms with dynamic range compression. The Mel-Spectrogram is obtained by projecting a spectrogram onto a mel scale. We use the python library librosa to project the spectrogram onto 64 mel bands, with window size equal to 1024 samples and hop size equal to 160 samples, i.e. 100ms long frames. Dynamic range compression is computed as described in [12], as log(1 + C * M), where C is a compression constant scalar set to 1000 and M is a matrix representing the Mel-Spectrogram. Training the GAN is also done with Mel-Spectrogram image patches of 64 bands and 64 frames.
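A minimal sketch of this feature pipeline with librosa is shown below; the default power-spectrogram settings, the non-overlapping patching step and the file path are our assumptions:

import numpy as np
import librosa

def mel_features(y, sr=16000, n_mels=64, n_fft=1024, hop_length=160, C=1000.0):
    """64-band Mel-Spectrogram with log(1 + C*M) dynamic range compression [12]."""
    M = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return np.log1p(C * M)

def patches_64x64(mel, n_frames=64):
    """Split a compressed Mel-Spectrogram into non-overlapping 64x64 patches."""
    n = mel.shape[1] // n_frames
    return np.stack([mel[:, i * n_frames:(i + 1) * n_frames] for i in range(n)])

y, _ = librosa.load("utterance.wav", sr=16000)
patches = patches_64x64(mel_features(y))   # shape: (num_patches, 64, 64)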
4.4. Models

4.4.1. WaveNet

Due to constraints on computing power and the extreme difficulty in training WaveNet (our community has not been able to replicate the results in Google's paper), we used samples from WaveNet models that had been pre-trained for 88 thousand iterations. Parameters of the models were kept the same as those in [6]. The ability of WaveNet to perform untargeted attacks amounts to using a model trained on an entire corpus. Targeted attacks are more difficult: we found that a single speaker's data was not enough to train WaveNet to converge successfully. To construct speaker-dependent samples, we relied on samples from pre-trained models that were globally conditioned on speaker ID. Based on informal listening experiments, such samples do sound very similar to the real speech of the speaker in question.

4.4.2. sampleRNN

Similarly to WaveNet, we found that the best (least noisy) sampleRNN samples came from models which were pretrained with a high number of iterations. Accordingly, we obtained samples from the three-tiered architecture, trained on the Blizzard 2013 dataset [15], which as mentioned in Section 3 is a 300 hour corpus of a single female speaker's narration. We also downloaded samples from online repositories, including samples from the original paper's online repository at https://soundcloud.com/samplernn/sets, which we qualitatively found to have less noise than ours.

4.4.3. WGAN

In all of our experiments, we use the Wasserstein GAN with gradient penalty (WGAN-GP), which we found makes the model converge better than regular WGAN [16] or GAN [17]. In our experiments, we trained a WGAN-GP to produce mel-spectrograms from 1 target speaker against a set of 101 speakers. On each critic iteration, we fed it with a batch of samples from one target speaker, and a batch of data uniformly sampled from the other speakers. We used two popular architectures for generator/critic pairs: DCGAN [18] and ResNet [19].

Performing untargeted attacks with the WGAN-GP (i.e., training the network to output speech samples that mimic the distribution of speech) is relatively straightforward: we simply train the WGAN-GP using all speakers in our dataset. However, the most natural attack is a targeted one, where the GAN is trained to directly fool a speaker recognition system, i.e., to produce samples that the system classifies as matching a target speaker with reasonable confidence.
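For reference, the standard WGAN-GP critic objective of [20] used here can be sketched in PyTorch as follows; the critic callable, tensor shapes and batching are placeholders:

import torch

def gradient_penalty(critic, real, fake):
    """Gradient penalty of [20]: penalize ||grad_x_hat D(x_hat)||_2 deviating from 1."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def wgan_gp_critic_loss(critic, real, fake, lam=10.0):
    """Standard WGAN-GP critic objective: E[D(fake)] - E[D(real)] + lam * gradient penalty."""
    return (critic(fake).mean() - critic(real).mean()
            + lam * gradient_penalty(critic, real, fake))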
4.4.4. WGAN-GP with modified objective function

A naive approach for targeted attacks is to train the GAN on the data of the single target speaker. A drawback of this approach is that the critic, and by consequence the generator, does not have access to universal properties of speech. To circumvent this problem, we rely on semi-supervised learning and propose a modification to the critic's objective function that allows it to learn to differentiate between not only real samples and generated samples, but also between real speech samples from a target speaker and real speech samples from other speakers. We do this by adding a term to the critic's loss that encourages it to classify real speech samples from untargeted speakers as fake:

\[
\underbrace{\mathbb{E}_{\tilde{x}\sim P_g}\!\left[D(\tilde{x})\right]}_{\text{Generated Samples}}
+ \alpha \, \underbrace{\mathbb{E}_{\dot{x}\sim P_{\dot{x}}}\!\left[D(\dot{x})\right]}_{\text{Different Speakers}}
- \underbrace{\mathbb{E}_{x\sim P_r}\!\left[D(x)\right]}_{\text{Real Speaker}}
+ \lambda \, \underbrace{\mathbb{E}_{\hat{x}\sim P_{\hat{x}}}\!\left[(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1)^2\right]}_{\text{Gradient Penalty}}
\tag{1}
\]

where P_ẋ is the distribution of samples from other speakers and α is a tunable scaling factor. Note that equation 1 is no longer a direct approximation of the Wasserstein distance. Rather, it provides a balance of the distance between the fake distribution and the real one, and the distance between other speakers' distribution and the target speaker's one. We refer to this objective function as mixed loss.

Initially, we were able to converge the targeted loss model using the same parameters as [20], namely 5 critic iterations per generator iteration, a gradient penalty weight of 10, and a batch size of 64. Both the generator and critic were trained using the Adam optimizer [21]. However, under these parameters we found that the highest α weight we could successfully use was 0.1 (we found that not including this scaling factor led to serious overfitting and poor convergence of the GAN).

In order to circumvent these problems and train a model with α set to 1, we made modifications to the setup, including setting the standard deviation of the DCGAN discriminator's weight initialization to 0.05 and the number of critic iterations to 20. To accommodate the critic's access to additional data in the mixed loss function (1), we increased the generator's learning rate. Finally, we added Gaussian noise to the target speaker data to prevent overfitting.
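A sketch of the mixed loss in equation (1), reusing the gradient_penalty helper from the sketch at the end of subsection 4.4.3; real_target, other_speakers and fake denote batches of target-speaker, non-target-speaker and generated Mel-Spectrograms respectively:

def mixed_critic_loss(critic, real_target, other_speakers, fake, alpha=1.0, lam=10.0):
    """Equation (1): real speech from non-target speakers is scored like fake data,
    scaled by alpha, on top of the usual WGAN-GP terms."""
    return (critic(fake).mean()                              # generated samples
            + alpha * critic(other_speakers).mean()          # different speakers
            - critic(real_target).mean()                     # real target speaker
            + lam * gradient_penalty(critic, real_target, fake))

def generator_loss(critic, fake):
    """The generator objective is unchanged: maximize the critic score of its samples."""
    return -critic(fake).mean()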
5. RESULTS

5.1. GAN Mel-Spectrogram

Using the improved Wasserstein GANs framework, we trained generators to construct 64x64 mel-spectrogram images from a noise vector. Visual results are demonstrated in Figure 2. We saw recognizable Mel-Spectrogram-like features in the generated samples.
5.2. GAN Adversarial attacks

Within the GAN framework, we train models for untargeted attacks by using all data available from speakers that the speaker recognition system was trained on, irrespective of class label. We show in subsection 5.2.1 that an untargeted model able to generate data from the real distribution with enough variety can be used to perform adversarial attacks. Figure 3a shows that our GAN-trained generator successfully learns all speakers across the dataset, without mode collapse.

As we described earlier, the models for targeted attacks can be trained in two manners: 1) conditioning the model on additional information, e.g. class labels, as described in [22]; 2) using only data from the label of interest. While the first approach might result in mode collapse, a drawback of the second approach is that the discriminator, and by consequence the generator, does not have access to universal properties of speech (we draw a parallel with Universal Background Models in speech). In the targeted attacks subsection 5.2.2 we show results using our new objective function described in equation 1, which allows using data from all speakers.

5.2.1. Untargeted attacks

For each speaker's audio data in the test set, we compute a Mel-Spectrogram as described in section 4.3. The resulting Mel-Spectrogram is then fed into the CNN recognizer and we extract a 1024-dimensional feature Φ from the first fully-connected layer (L5) of the pre-trained CNN model (Fig. 1) trained on the real speech dataset with all speaker IDs. This deep feature/embedding Φ is then used to train a K-nearest-neighbor (KNN) classifier, with K equal to 5.

To control the generator trained by our WGAN-GP, we feed the generated Mel-Spectrograms into the same CNN-L7 pipeline to extract their corresponding feature Φ̂. Utilizing the pre-trained KNN, each sample is assigned to the nearest speaker in the deep feature space. Therefore, we know which speaker our generated sample belongs to when we attack our CNN recognizer. We evaluate our controlled WGAN-GP samples against our CNN speaker recognition system, and the confusion matrix can be found in Figure 3a.
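A sketch of this controlled evaluation, assuming the 1024-dimensional embeddings have already been extracted from layer L5 for both real and generated Mel-Spectrograms; the array names, sizes and random placeholders are ours:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nearest_speakers(real_embeddings, real_labels, gen_embeddings, k=5):
    """Assign each generated sample to the nearest speaker in deep-feature space.

    real_embeddings: (N, 1024) features Phi of real test Mel-Spectrograms,
    real_labels: their speaker IDs, gen_embeddings: (M, 1024) features Phi-hat
    of generated samples."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(real_embeddings, real_labels)
    return knn.predict(gen_embeddings)

# Example with random placeholders standing in for the extracted embeddings.
preds = nearest_speakers(np.random.randn(500, 1024),
                         np.random.randint(0, 103, size=500),
                         np.random.randn(64, 1024))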
5.2.2. Targeted attacks

We trained the WGAN-GP on the entirety of the NIST 2004 corpus (100 speakers), a single speaker (P280) from the VCTK Corpus, and the single speaker from the Blizzard dataset. The samples from the other models were either downloaded from the web or created from WaveNet globally conditioned on the single VCTK corpus speaker, and from SampleRNN trained only on data from the Blizzard dataset. Results for the WGAN-GP are demonstrated in Figure 3. In the samples generated with the sampleRNN and WaveNet models, none of the predictions made by the classifier match the target speaker.

We also trained the WGAN-GP with and without the mixed loss on different speakers. The histogram of predictions in Figure 3b shows WGAN-GP results for speaker 0. The improved WGAN-GP loss achieves a 0.38 error rate and our mixed loss achieves a 0.12 error rate, a more than three-fold reduction in error.

6. DISCUSSION AND CONCLUSION

In this research we have investigated the use of generative speech models to perform adversarial attacks on speaker recognition systems. We show that the samples from the autoregressive models we trained (SampleRNN and WaveNet) or downloaded from the web were not able to fool the CNN speaker recognizers used in this research. On the other hand, we show that adversarial examples generated with GANs are successful in performing targeted and untargeted adversarial attacks on the speaker recognition system used herein.
7. REFERENCES

[1] Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li, "Spoofing and countermeasures for speaker verification: a survey," Speech Communication, vol. 66, pp. 130–153, 2015.

[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.

[3] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.

[4] Sanjit A Seshia, Dorsa Sadigh, and S Shankar Sastry, "Towards verified artificial intelligence."

[5] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," arXiv preprint arXiv:1703.10135, 2017.

[6] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "Wavenet: A generative model for raw audio," CoRR abs/1609.03499, 2016.

[7] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio, "Samplernn: An unconditional end-to-end neural audio generation model," arXiv preprint arXiv:1612.07837, 2016.

[8] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1528–1540.

[9] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou, "Hidden voice commands," in 25th USENIX Security Symposium (USENIX Security 16), Austin, TX, 2016.

[10] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, "Segan: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.

[11] Jonathan Chang and Stefan Scherer, "Learning representations of emotional speech with deep convolutional generative adversarial networks," arXiv preprint arXiv:1705.02394, 2017.

[12] Yanick Lukic, Carlo Vogt, Oliver Dürr, and Thilo Stadelmann, "Speaker identification and clustering using convolutional neural networks," in Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on. IEEE, 2016, pp. 1–6.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.

[14] Adham Zeidan, Armin Lehmann, and Ulrich Trick, "Webrtc enabled multimedia conferencing and collaboration solution," in WTC 2014; World Telecommunications Congress 2014; Proceedings of. VDE, 2014, pp. 1–6.

[15] Kishore Prahallad, Anandaswarup Vadapalli, Naresh Elluru, G Mantena, B Pulugundla, P Bhaskararao, HA Murthy, S King, V Karaiskos, and AW Black, "The blizzard challenge 2013–indian language task," in Blizzard Challenge Workshop, 2013, vol. 2013.

[16] Martin Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein gan," arXiv preprint arXiv:1701.07875, 2017.

[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in neural information processing systems, 2014, pp. 2672–2680.

[18] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.

[19] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint arXiv:1609.04802, 2016.

[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville, "Improved training of wasserstein gans," arXiv preprint arXiv:1704.00028, 2017.

[21] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[22] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.