1 Introduction

Speaker recognition (SR) is a biometric recognition technology that identifies a speaker's identity from speech information (Algabri et al., 2017; Lin & Zhang, 2019; Wang et al., 2023). Speaker recognition requires only simple equipment, is convenient to operate, supports real-time interaction, can resist replay attacks by impostors, and achieves high recognition accuracy. It has therefore been widely applied in fields such as financial payment, access control, and electronic locks (Anwer et al., 2015; Guapo et al., 2016; Khelif et al., 2017; Martinson & Lawson, 2011).

Research on speaker recognition focuses on two main areas: feature extraction and model building. For feature extraction, descriptors such as the Mel-frequency cepstral coefficients (MFCC), Mel filter bank coefficients (Fbank), linear prediction cepstral coefficients (LPCC), pitch period, and formants (resonance peaks) are widely used (Hansen & Hasan, 2015). However, these basic features ignore the possible internal structure of the signal, such as the strong correlation between adjacent frames, and each has disadvantages in some cases; for example, LPCC describes the periodic characteristics of voiced sounds accurately but describes consonants only vaguely. Therefore, several algorithms have been proposed to fuse different features, which improves the effective expression of speaker characteristics and, to a certain extent, the speaker recognition rate (Chowdhury & Ross, 2019; Yujin et al., 2010; Zhang & Zheng, 2013). These include acoustic feature splicing (Ahmed & Bawar, 2018), offline deep feature fusion (Al-Kaltakchi et al., 2016), and online feature fusion (Li et al., 2019). However, noise in the speaker's acoustic data may still deteriorate the accuracy of speaker recognition.
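As a concrete illustration (not taken from the works cited above), the following minimal Python sketch extracts MFCC and Fbank features with librosa; the 25 ms frame length, 10 ms shift, filter-bank size, and file name are assumptions.

```python
# A minimal sketch of MFCC and Fbank extraction with librosa; frame settings
# and the input file are illustrative assumptions.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical audio file
n_fft = int(0.025 * sr)                           # 25 ms frame length
hop = int(0.010 * sr)                             # 10 ms frame shift

# Fbank: log energies of a Mel filter bank applied to the power spectrum.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
fbank = librosa.power_to_db(mel)                  # shape: (40, T)

# MFCC: DCT of the log-Mel energies, keeping the first 20 coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mfcc=20)  # (20, T)
```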

For model construction, Gaussian mixture models (GMMs) can be used either for classification or for summarizing data, such as collecting sufficient statistics (Reynolds et al., 2000). Joint factor analysis (JFA) can be used to obtain speaker verification scores directly, but it is also used to extract embeddings from feature sequences (Kenny et al., 2007). The i-vector is purely an embedding model and usually requires PLDA as a backend (Dehak et al., 2010). With the rise of neural networks, deep neural network (DNN) based speaker recognition models, represented by the x-vector system, have emerged (Snyder et al., 2018). DNN-based methods use hidden layers to extract speaker features; they can learn speaker characteristics from a large number of samples and are robust to noise. In (You et al., 2019), a multi-task learning model based on the x-vector system is proposed; it uses both labeled and unlabeled data to build the model and improves the robustness and capability of the speaker recognition system. In (Garcia-Romero et al., 2019), an x-vector deep neural network model for long recordings is proposed; it modifies the network architecture and optimizes the loss function to address the duration mismatch between the training and test sets, achieving the best recognition performance on the Speakers in the Wild (SITW) benchmark. In (Kanagasundaram et al., 2019), a time-delay neural network is used for short-utterance evaluation in an x-vector based speaker verification system; it extracts the speaker embedding from a deeper, lower-dimensional layer and adopts a variance normalization approach to further improve performance.

However, the x-vector system improves operational efficiency by pooling variable-length frame-level features into fixed-length segment-level features. This statistical mechanism may lose the temporal information of the speech signal, which may degrade the accuracy of speaker recognition.

Therefore, we propose a framework based on multi-task learning and feature integration (MTFI). First, multiple input features are fed into the neural network simultaneously and concatenated at the statistical pooling layer, so that the pooling layer can learn complementary representations. Second, we develop an attention mechanism that computes frame weights for the statistical pooling layer, enhancing the information of key frames and weakening that of useless frames. Finally, to improve the robustness of multi-task learning, we propose a new shared unit (SU) that transmits task-specific representations between tasks during forward propagation and prevents the gradients of different tasks from interfering with one another during backward propagation.

The rest of this paper is structured as follows: Sect. 2 describes the proposed speaker recognition system, and Sect. 3 presents and discusses the experiments. Finally, Sect. 4 presents the conclusions and future work.

2 The proposed speaker recognition system

The proposed speaker recognition system uses an architecture similar to the x-vector deep neural network model. An x-vector system is based on a time-delay neural network (TDNN), which maps a variable-length utterance to a fixed-length speaker feature vector called the x-vector (Liu et al., 2018; Okabe et al., 2018). As shown in Fig. 1, the proposed system comprises a frame-level module and a segment-level module. To make the network more lightweight, the first three TDNN layers and the two fully connected segment-level layers are 128-dimensional, and the other layers are 256-dimensional. The differences between our system and the conventional x-vector system are: (1) the frame-level module uses different features, which are connected in series in the stitching layer; (2) the segment-level module uses an attention mechanism to compute the frame weights of the statistics layer; and (3) a new shared unit is proposed to support multi-task learning and improve its robustness.

Fig. 1
figure 1

The proposed speaker recognition system

2.1 Frame-level module

In the feature space, different features contain different information. The conventional x-vector system uses MFCC, which converts the linear frequency scale to the Mel scale through a series of Mel filters; its main drawback is low robustness to noise. To raise this upper bound, our network integrates additional features simultaneously and obtains a frame-level representation through the fully connected layers. The calculation is as follows:

$$Append\left({\mathcal{X}}_{feature1},{\mathcal{X}}_{feature2}\right) \to \mathcal{X}$$
(1)
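As a minimal sketch of Eq. (1), the snippet below concatenates two frame-level representations of the same utterance along the feature dimension; the batch size, frame count, and per-stream dimensions are illustrative assumptions, and PyTorch is used only as an example framework.

```python
# A minimal sketch of Eq. (1): concatenate ("Append") two frame-level streams.
import torch

T = 300                                    # number of frames in this utterance
x_feature1 = torch.randn(1, T, 128)        # frame-level representation of feature 1
x_feature2 = torch.randn(1, T, 128)        # frame-level representation of feature 2

x = torch.cat([x_feature1, x_feature2], dim=-1)   # resulting X: (1, T, 256)
```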

2.2 Segment-level module

The segment-level module creates a segment-level representation with a stack of fully connected layers. Before these layers, a statistical pooling layer converts the variable-length frame-level features into a fixed-dimensional vector. At present, most networks use average pooling to derive the speaker representation. Average pooling treats every frame as equally important, which is not true: some frames are more important for speaker recognition than others, since the ability to distinguish speakers varies between speech segments. Recent studies (Bhattacharya et al., 2017; Zhu et al., 2018) have applied the attention mechanism to speaker recognition and used it to compute the importance of each frame. In this paper, the attention mechanism is combined with the statistics layer to compute weighted statistics of the speech signal.

As shown in Fig. 2, the attention model first computes a scalar score \(e_t\) for each frame-level representation \(h_t\) (t = 1, 2, …, T), defined as follows:

$${e}_{t}={v}^{T}f\left(W{h}_{t}+b\right)+k$$
(2)

where \(f(\cdot)\) is a non-linear activation function. The scores are normalized over all frames using the softmax function:

Fig. 2
figure 2

Attention-based statistical layer

$${\alpha }_{t}=\frac{exp\left({e}_{t}\right)}{{\sum }_{\tau =1}^{T}exp\left({e}_{\tau }\right)}$$
(3)

The normalized \(\alpha_{t}\) values are used as weights in this layer to calculate the weighted mean and standard deviation vectors as follows:

$$\chi { = }\sum\limits_{t = 1}^{T} {\alpha_{t} } h_{t}$$
(4)
$$\psi { = }\sqrt {\sum\limits_{t = 1}^{T} {\alpha_{t} } h_{t} \odot h_{t} - \chi \odot \chi }$$
(5)

where \(\odot\) denotes the Hadamard product. The resulting segment-level representation is thus more discriminative for the speaker.
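The attention-based statistics layer of Eqs. (2)-(5) can be sketched in PyTorch as follows; the hidden attention dimension, the choice of tanh for \(f(\cdot)\), and the layer sizes are assumptions.

```python
# A minimal sketch of the attention-based statistics layer (Eqs. (2)-(5)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatsPool(nn.Module):
    def __init__(self, in_dim, att_dim=64):
        super().__init__()
        self.W = nn.Linear(in_dim, att_dim)   # W h_t + b
        self.v = nn.Linear(att_dim, 1)        # v^T f(.) + k

    def forward(self, h):                     # h: (batch, T, in_dim)
        e = self.v(torch.tanh(self.W(h)))     # Eq. (2): frame scores e_t
        alpha = F.softmax(e, dim=1)           # Eq. (3): normalize over frames
        mean = torch.sum(alpha * h, dim=1)    # Eq. (4): weighted mean
        var = torch.sum(alpha * h * h, dim=1) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-8)) # Eq. (5): weighted standard deviation
        return torch.cat([mean, std], dim=-1) # (batch, 2 * in_dim)

pooled = AttentiveStatsPool(256)(torch.randn(4, 300, 256))   # -> (4, 512)
```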

The weighted mean and standard deviation vectors are fed into a stack of fully connected layers. One of these hidden layers usually has fewer units in order to provide a lower-dimensional summary of the previous layer. The output is a softmax layer in which each output node corresponds to a speaker.

Some studies have implemented end-to-end neural networks by using contrastive loss or triplet loss (Li et al., 2017).

2.3 The shared unit for multi-task learning

In multi-task learning, different tasks share learned information through a shared representation to improve overall performance. Specifically, when the binary-decision loss learns from the shared representation, the x-vector in each task's branch retains a task-specific representation, which ensures that each task acquires task-relevant features and enhances its learning capability. In Fig. 1, the two tasks share the same underlying nodes. To propagate information between them, a shared segment-level block, shown in Fig. 3, transfers representations from one task to the other, promoting information exchange and sharing among tasks, which benefits the overall learning effectiveness. In addition, to preserve the robustness of the x-vector, the gradient of the current task passes through the shared unit (SU) during backpropagation, which controls how gradient information is transmitted and updated within the network. Accordingly, the softmax loss of the binary decision \(M_{t1}\), the speaker recognition loss \(M_{t2}\), and their partial derivatives are calculated as follows:

$$\hat{h}_{t1} = h_{t1} + m \cdot h_{t2}$$
(6)
$$\hat{h}_{t2} = h_{t2} + n \cdot h_{t1}$$
(7)
$$\frac{{\partial M_{t1} }}{{\partial \hat{h}_{t1} }} = \frac{{\partial M_{t1} }}{{\partial h_{t1} }} + \beta \cdot m \cdot \frac{{\partial M_{t1} }}{{\partial h_{t2} }}$$
(8)
$$\frac{{\partial M_{t2} }}{{\partial \hat{h}_{t2} }} = \frac{{\partial M_{t2} }}{{\partial h_{t2} }} + \beta \cdot n \cdot \frac{{\partial M_{t2} }}{{\partial h_{t1} }}$$
(9)

where \(\beta\) is the scaling factor applied to the clipped cross-task gradient, and m and n are the scaling parameters for forward propagation. When the network is trained with a representation shared between the two tasks, the gradient of each task remains confined to its own branch in the final layers, so the shared representation does not affect the robustness of the x-vector. In this work, we set \(\beta\) = 0 and m = n = 0.925 (Fig. 3).

Fig. 3
figure 3

The shared unit
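A minimal sketch of the shared unit in Eqs. (6)-(9) is given below; the class and function names are illustrative, not taken from the original implementation. The forward pass mixes the two task representations, while the backward pass scales the cross-task gradient by \(\beta\) (with \(\beta = 0\) blocking it entirely).

```python
# A minimal sketch of the shared unit (Eqs. (6)-(9)) with a gradient-scaling op.
import torch

class ScaleGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, beta):
        ctx.beta = beta
        return x.view_as(x)                   # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return ctx.beta * grad_output, None   # scale the gradient by beta

def shared_unit(h_t1, h_t2, m=0.925, n=0.925, beta=0.0):
    # Cross-task terms pass through unchanged in the forward direction,
    # but only a beta-scaled gradient flows back into the other branch.
    h_hat_t1 = h_t1 + m * ScaleGrad.apply(h_t2, beta)   # Eq. (6)
    h_hat_t2 = h_t2 + n * ScaleGrad.apply(h_t1, beta)   # Eq. (7)
    return h_hat_t1, h_hat_t2
```

With \(\beta = 0\) and m = n = 0.925, each task still receives the other task's representation in the forward pass, but its loss gradient never reaches the other branch.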

3 Experimental and result analysis

3.1 The description of the evaluation system

In the decision scoring process, the log-likelihood ratio of two embeddings \(\eta_{1}\) and \(\eta_{2}\) is defined as follows:

$$score = \lg \frac{{p\left( {\eta_{1} {,}\eta_{2} |R_{s} } \right)}}{{p\left( {\eta_{1} {,}\eta_{2} |R_{d} } \right)}}$$
(10)

where \(R_{s}\) denotes the hypothesis that \(\eta_{1}\) and \(\eta_{2}\) come from the same speaker, and \(R_{d}\) denotes the hypothesis that they come from different speakers.
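For illustration, the sketch below evaluates this log-likelihood ratio under a simple two-covariance Gaussian model; the between-speaker covariance B, within-speaker covariance W, and mean mu are hypothetical backend parameters, and the natural logarithm is used in place of lg (the two differ only by a constant factor).

```python
# A minimal two-covariance sketch of the score in Eq. (10); B, W, and mu are
# hypothetical backend parameters, not values from the paper.
import numpy as np
from scipy.stats import multivariate_normal

def llr_score(eta1, eta2, mu, B, W):
    x = np.concatenate([eta1, eta2])
    m = np.concatenate([mu, mu])
    # R_s: both embeddings share one latent speaker variable.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # R_d: independent latent speaker variables.
    cov_diff = np.block([[B + W, np.zeros_like(B)], [np.zeros_like(B), B + W]])
    return (multivariate_normal.logpdf(x, m, cov_same)
            - multivariate_normal.logpdf(x, m, cov_diff))

dim = 2
mu, B, W = np.zeros(dim), np.eye(dim), 0.5 * np.eye(dim)
score = llr_score(np.array([0.9, 1.1]), np.array([1.0, 0.8]), mu, B, W)
```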

We compare the results of the baseline systems and the proposed systems, including the equal error rate (EER) (Zhang et al., 2019) and the minimum of the normalized detection cost function (DCF) at \(P_{target} = 10^{-2}\) and \(P_{target} = 10^{-3}\).

The DCF is a performance evaluation method commonly used in NIST SRE, and is defined as follows:

$$DCF = C_{FR} \cdot FRR \cdot P_{target} + C_{FA} \cdot FAR \cdot \left( 1 - P_{target} \right)$$
(11)

where \(C_{FR}\) and \(C_{FA}\) are the penalty costs of false rejection and false acceptance, respectively, and \(P_{target}\) and \(1 - P_{target}\) are the prior probabilities of target and impostor trials. The false rejection rate (FRR) measures errors in which target speakers are misidentified as non-target speakers, and the false acceptance rate (FAR) measures errors in which non-target speakers are identified as target speakers. Once \(C_{FR}\), \(C_{FA}\), \(P_{target}\) and \(1 - P_{target}\) are fixed, the pair of FRR and FAR values that minimizes the DCF gives the minDCF. Unlike EER, minDCF accounts for both the different costs of the two error types and the prior probabilities of the two trial types, which makes it a more reasonable metric.
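A minimal numpy sketch of how EER and minDCF (Eq. (11)) can be computed from trial scores and labels is shown below; equal error costs \(C_{FR} = C_{FA} = 1\) and the toy score values are assumptions.

```python
# A minimal sketch of EER and minDCF computed by sweeping a decision threshold.
import numpy as np

def eer_and_min_dcf(scores, labels, p_target=0.01, c_fr=1.0, c_fa=1.0):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)          # True = target trial
    thresholds = np.sort(np.unique(scores))
    frr = np.array([np.mean(scores[labels] < t) for t in thresholds])    # false rejections
    far = np.array([np.mean(scores[~labels] >= t) for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))                 # operating point where FRR ~= FAR
    eer = (frr[i] + far[i]) / 2
    dcf = c_fr * frr * p_target + c_fa * far * (1 - p_target)            # Eq. (11)
    return eer, dcf.min()

eer, min_dcf = eer_and_min_dcf([2.1, 0.3, -1.2, 1.8], [1, 0, 0, 1], p_target=0.01)
```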

3.2 Database

In this paper, the Kaldi speech recognition toolkit (Povey et al., 2011) is used to conduct the experiments, including data processing, feature extraction, network training, and system testing.

In this paper, the VoxCeleb1 dataset (Nagrani et al., 2017) is used for the experiments. It contains utterances extracted from videos uploaded to YouTube. All audio is sampled at 16 kHz and stored as single-channel 16-bit WAV files. The speech contains real-world noise, such as environmental noise, background voices, indoor noise, and recording-equipment noise, occurring at irregular times. The dataset contains 1251 speakers and 153,516 utterances, with a total duration of 351 h; the average utterance length is 8.2 s, the maximum is 145 s, and the minimum is 4 s. The speakers cover different races, ages, accents, and genders, including 690 males and 561 females. The training set contains 1211 speakers with 148,642 utterances in total, and the test set contains 40 speakers with 4874 utterances in total. The structure of VoxCeleb1 is presented in Table 1.

Table 1 Voxceleb1 database structure

3.3 Experimental parameters

For the baseline system, the acoustic feature \(X = \left( x_{1}, x_{2}, \cdots, x_{i} \right)\) is a 40-dimensional MFCC with a frame length of 25 ms and a frame shift of 10 ms.

For the experimental system, the acoustic feature is the spectrogram with a frame length of 25 ms, a frame shift of 10 ms, and 257 dimensions per frame. The x-vector system does not require a fixed number of frames, so we directly extract features from speech data of different lengths; the feature size is 257 × T, where T is the length of the utterance. To keep the comparison fair, the training parameters of every network in this section are consistent with the TDNN training parameters of the baseline system. The experimental system adopts a 5-layer CNN: all convolution layers conv1 ~ conv5 use 3 × 3 kernels with a stride of 1, and max pooling uses a 2 × 2 window with a stride of 2. The conv1 and conv2 layers have 64 channels, and conv3 ~ conv5 have 128 channels.
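A minimal PyTorch sketch of this 5-layer CNN front end is shown below; the placement of the pooling layers, the use of padding, and the batch normalization inside each block are assumptions not specified above.

```python
# A minimal sketch of the 5-layer CNN frame-level front end (3x3 kernels,
# stride 1, 2x2 max pooling, 64/64/128/128/128 channels).
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

frame_level_cnn = nn.Sequential(
    conv_block(1, 64),                       # conv1
    conv_block(64, 64),                      # conv2
    nn.MaxPool2d(kernel_size=2, stride=2),
    conv_block(64, 128),                     # conv3
    conv_block(128, 128),                    # conv4
    conv_block(128, 128),                    # conv5
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Spectrogram input: (batch, 1, 257 frequency bins, T frames).
out = frame_level_cnn(torch.randn(2, 1, 257, 300))   # -> (2, 128, 64, 75)
```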

For the baseline system, we use a TDNN. The first four frame-level layers have 512 nodes each, the last frame-level layer has 1500 nodes, and the two fully connected layers have 512 nodes each; all non-linearities use the ReLU function. The training parameters are presented in Table 2.

Table 2 Network training parameters of x-vector system
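A minimal PyTorch sketch of this baseline x-vector TDNN is given below; the temporal contexts (kernel sizes and dilations) follow the usual Kaldi x-vector recipe and are assumptions here, as is taking statistics pooling as the mean and standard deviation over time.

```python
# A minimal sketch of the baseline x-vector TDNN: four 512-unit frame-level
# layers, a 1500-unit layer, statistics pooling, and two 512-unit FC layers.
import torch
import torch.nn as nn

def tdnn_layer(c_in, c_out, kernel, dilation=1):
    return nn.Sequential(nn.Conv1d(c_in, c_out, kernel, dilation=dilation),
                         nn.ReLU(), nn.BatchNorm1d(c_out))

class XVector(nn.Module):
    def __init__(self, feat_dim=40, num_speakers=1211):
        super().__init__()
        self.frame_level = nn.Sequential(
            tdnn_layer(feat_dim, 512, 5),            # context [-2, +2]
            tdnn_layer(512, 512, 3, dilation=2),     # context {-2, 0, +2}
            tdnn_layer(512, 512, 3, dilation=3),     # context {-3, 0, +3}
            tdnn_layer(512, 512, 1),
            tdnn_layer(512, 1500, 1),
        )
        self.segment_level = nn.Sequential(
            nn.Linear(3000, 512), nn.ReLU(),         # the x-vector is taken here
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_speakers),            # softmax over speakers
        )

    def forward(self, x):                            # x: (batch, feat_dim, T)
        h = self.frame_level(x)                      # (batch, 1500, T')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
        return self.segment_level(stats)             # speaker logits

logits = XVector()(torch.randn(2, 40, 300))          # -> (2, 1211)
```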

In this paper, the ReLU function, batch normalization, and dropout are used in each layer of the neural network to optimize training. The dropout ratio is selected randomly from 0.1, 0.15, and 0.2. Stochastic gradient descent is used as the optimizer, with a momentum of 0.5, a minibatch size of 128, an initial learning rate of 0.01, and a final learning rate of 0.001.
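As an illustrative sketch (not the actual Kaldi recipe), the optimization settings described above could be written in PyTorch as follows; the stand-in model, the number of epochs, and the exponential decay schedule are assumptions.

```python
# A minimal sketch of the optimizer set-up: SGD with momentum 0.5 and a
# learning rate decayed from 0.01 to 0.001 over training.
import torch
import torch.nn as nn

model = nn.Linear(512, 1211)                 # stand-in for the speaker network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
num_epochs = 10                              # assumed number of training epochs
gamma = (0.001 / 0.01) ** (1 / num_epochs)   # per-epoch decay factor
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
```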

In the experiment, we use the Kaldi speech recognition tool to extract the acoustic features and build the speaker recognition system.

3.4 Speaker recognition experiment based on attention

In this section, we compare different features as network input in the x-vector system. First, the same TDNN structure is used to compare the performance of the individual features. Second, different features are integrated for multi-feature input. The results are shown in Table 3. In all cases, average statistical pooling is used in the segment-level module, and the baseline x-vector speaker recognition system is based on the TDNN.

Table 3 Results of different acoustic features and network architecture on test set

As shown in Table 3, four different features are tested under the TDNN. MFCC performs best, with relative EER improvements of 67.1%, 35.3%, and 50.7% over LPCC, Fbank, and the spectrogram, respectively. We then retest the four features with the attention-based statistics layer. All four features improve in EER compared with the plain TDNN, with LPCC showing a relative improvement of 7.68%; this indicates that the attention-based statistics layer enhances the key-frame information more effectively.

3.5 Speaker recognition experiment with multi-task learning

In this section, we design two sets of experiments. First, we combine different features and compute their average score fusion; then we feed the same combinations into a single system without shared units. The experimental results are presented in Table 4.
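For clarity, average score fusion of two single-feature systems amounts to the simple operation below; the score arrays are hypothetical and assumed to cover the same trial list in the same order.

```python
# A minimal sketch of average score fusion for two systems.
import numpy as np

scores_sys_a = np.array([1.2, -0.4, 0.7])    # e.g. the MFCC-based system
scores_sys_b = np.array([0.9, -0.1, 0.5])    # e.g. the Fbank-based system
fused_scores = (scores_sys_a + scores_sys_b) / 2
```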

As shown in Table 4, different systems and feature combinations affect performance. The single-task systems (without the additional shared blocks) outperform all single-feature systems and their average score fusions; in particular, the integration of LPCC and Fbank shows a relative improvement of 21.5%.

Table 4 Test results of the second experimental scheme

Finally, since the two more robust single-feature systems, MFCC and Fbank, and their MFST system achieve reasonably good results, MTFI and SU-MTFI are tested below using MFCC and Fbank. The results are shown in Table 5.

Table 5 Results on the voxceleb1 test set

As shown in Table 5, on VoxCeleb1 the SU-MTFI system demonstrates the best performance when the same feature combination is used. With the integration of MFCC and Fbank, SU-MTFI is far superior to MTFI and MFST and propagates shared representations well, which verifies that the SU improves the robustness of multi-task learning. SU-MTFI yields a 19.7% relative improvement over MTFI.

4 Conclusion and future work

This paper presents a speaker recognition system based on multi-task learning and feature integration (MTFI). Its two key ideas are the integration of complementary features and the improved propagation of shared representations in multi-task learning through the proposed shared unit.

In the future, we will study additional features, such as CQCC. We also plan to further analyze the invariant characteristics of the voiceprint to improve the robustness of the system.