Speaker Verification Using Convolutional Neural Networks

Hossein Salehghaffari
Control/Robotics Research Laboratory (CRRL),
Department of Electrical and Computer Engineering,
NYU Tandon School of Engineering (Polytechnic Institute), NY 11201, USA
Email: h.saleh@nyu.edu
arXiv:1803.05427v2 [eess.AS] 10 Aug 2018
Abstract—In this paper, a novel Convolutional Neural Network architecture has been developed for speaker verification, in order to simultaneously capture speaker information and discard non-speaker information. In the training phase, the network is trained to distinguish between different speaker identities, creating the background model. One of the crucial parts is creating the speaker models. Most previous approaches create speaker models by averaging the speaker representations provided by the background model. We address this problem by further fine-tuning the trained model in a Siamese framework, generating a discriminative feature space that distinguishes between same and different speakers regardless of their identity. This provides a mechanism which simultaneously captures the speaker-related information and creates robustness to within-speaker variations. It is demonstrated that the proposed method outperforms traditional verification methods which create speaker models directly from the background model.

I. INTRODUCTION

In speaker verification (SV), the identity of a query spoken utterance must be confirmed by comparison against a gallery of known speakers. Speaker verification can be categorized into text-dependent and text-independent settings. In the text-independent setting, no restriction is placed on the utterances; in the text-dependent setting, on the other hand, all speakers repeat the same phrase. Due to the variational nature of the former setup, it is considered the more challenging task, since the system must be able to clearly distinguish between the speaker and non-speaker characteristics of the uttered phrases.

The general procedure of speaker verification consists of three phases: development, enrollment, and evaluation. In development, a background model must be created to capture speaker-related information. In enrollment, the speaker models are created using the background model. Finally, in evaluation, query utterances are identified by comparison against the existing speaker models created in the enrollment phase.

Recently, with the advent of deep learning in applications such as speech and image recognition and network pruning [1]–[4], data-driven approaches using Deep Neural Networks (DNNs) have also been proposed for effective feature learning in Automatic Speech Recognition (ASR) [3] and Speaker Recognition (SR) [5], [6]. Although deep architectures have mostly been treated as black boxes, some approaches based on Information Theory [7] have been presented for multimodal feature extraction and have demonstrated promising results [8].

Some traditionally successful models for speaker verification are the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [9] and the i-vector [10]. The main disadvantage of these models is their unsupervised nature, since they are not trained with an objective tailored to the speaker verification setup. Some methods have been proposed to supervise the training of the aforementioned models, such as SVM-based GMM-UBMs [11] and PLDA for the i-vector model [12]. With the advent of Convolutional Neural Networks (CNNs) and their promising results in action recognition [13] and scene understanding [14], they have recently been proposed for speaker and speech recognition as well [6], [15].

In this work, we propose to use Siamese neural networks operating on traditional speech features such as MFCCs¹ rather than raw features, in order to obtain a higher-level representation of speaker-related characteristics. Moreover, we show the advantage of utilizing an effective pair selection method for verification purposes.

¹ Mel Frequency Cepstral Coefficients
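The Siamese training objective mentioned above can be sketched with the standard contrastive loss, which pulls embeddings of same-speaker pairs together and pushes different-speaker pairs at least a margin apart. This is a minimal numpy illustration only: the toy 3-D vectors stand in for the CNN embeddings, and the margin value is an assumption, not the paper's actual setting.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Standard contrastive loss on a pair of embeddings:
    genuine pairs are pulled together, impostor pairs are
    pushed at least `margin` apart."""
    d = np.linalg.norm(emb_a - emb_b)            # Euclidean distance
    if same_speaker:
        return 0.5 * d ** 2                      # genuine pair: minimize distance
    return 0.5 * max(0.0, margin - d) ** 2       # impostor pair: enforce margin

# Toy 3-D "embeddings" standing in for CNN outputs.
genuine = contrastive_loss(np.array([1.0, 0.0, 0.0]),
                           np.array([0.9, 0.1, 0.0]), same_speaker=True)
impostor = contrastive_loss(np.array([1.0, 0.0, 0.0]),
                            np.array([0.8, 0.6, 0.0]), same_speaker=False)
```

A close genuine pair thus incurs a small loss, while an impostor pair inside the margin is penalized until it is pushed out; this is what shapes the discriminative feature space described above.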
II. RELATED WORKS

Convolutional Neural Networks [16] have recently been used for speech recognition [17]. Deep models have also been proposed and effectively utilized for the text-independent setup in several research efforts [5], [18]. Locally Connected Networks (LCNs) have been utilized for SV as well [19], although the setup in [19] is text-dependent. In other works, such as [20], [21], deep networks have been employed as feature extractors to create speaker models for further evaluation. We investigate CNNs trained end-to-end specifically for verification purposes and furthermore employ them as feature extractors to distinguish between speaker and non-speaker information.

III. SPEAKER VERIFICATION PROCEDURE AND PROTOCOL

The speaker verification protocol can be categorized into three phases: development, enrollment, and evaluation. A general view of the protocol is depicted in Fig. 1. We explain these phases in this section, with special emphasis on how they can be adapted to deep learning. Different research efforts have proposed a variety of methods for implementing and adapting this protocol, such as the i-vector [10], [22] and d-vector [6] systems.

Evaluation: During the evaluation phase, test utterances are fed to the model for speaker-representation extraction. Each query test sample is compared against all speaker models using a score function, and the speaker with the highest score is the predicted speaker. Considering the one-vs-all setup, this stage is equivalent to a binary classification problem, for which the traditional Equal Error Rate (EER) is used for model evaluation. The false-reject and false-accept rates are determined by a predefined threshold; the operating point at which the two errors become equal is the EER. As the scoring function, the simple cosine similarity score is usually employed: the score measures the similarity between the representation of the test utterance and the targeted speaker model.

The aim is to utilize CNNs as powerful feature extractors. The input pipeline and the specific architecture are explained in this section.

TABLE I: Statistics of the VoxCeleb dataset.
TABLE II: The architecture used for verification purposes.
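The enrollment and evaluation steps described above can be sketched end to end: a speaker model is created by averaging embeddings, test utterances are scored against it with cosine similarity, and a threshold sweep locates the EER. This is a minimal numpy sketch; the random 16-dimensional vectors stand in for network embeddings, and all dimensions and noise scales here are hypothetical rather than the paper's actual configuration.

```python
import numpy as np

def cosine_score(u, v):
    """Cosine similarity between a test embedding and a speaker model."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def enroll(utterance_embeddings):
    """Create a speaker model by averaging enrollment embeddings."""
    return np.mean(utterance_embeddings, axis=0)

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep a decision threshold and return the operating point where
    the false-accept and false-reject rates are (approximately) equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_far, best_frr, best_diff = 1.0, 0.0, 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        frr = np.mean(genuine_scores < t)     # genuine trials wrongly rejected
        if abs(far - frr) < best_diff:
            best_diff = abs(far - frr)
            best_far, best_frr = far, frr
    return 0.5 * (best_far + best_frr)

rng = np.random.default_rng(0)
model = enroll(rng.normal(size=(5, 16)))      # 5 enrollment utterances
genuine = np.array([cosine_score(model + 0.1 * rng.normal(size=16), model)
                    for _ in range(50)])      # same speaker, small perturbation
impostor = np.array([cosine_score(rng.normal(size=16), model)
                     for _ in range(50)])     # unrelated speakers
eer = equal_error_rate(genuine, impostor)
```

Because genuine trials score near the model and impostor trials score near zero, the threshold sweep finds a low EER on this toy data; in practice the same sweep is applied to the scores produced by the trained network.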