1 Introduction

Speaker recognition (SR) is a biometric recognition technology that identifies a speaker's identity from speech information (Algabri et al., 2017; Lin & Zhang, 2019; Wang et al., 2023). Speaker recognition requires only simple equipment, is convenient to operate, supports real-time interaction, can resist replay attacks by impostors, and achieves high recognition accuracy. It has therefore been widely applied in fields such as financial payment, access control, and electronic locks (Anwer et al., 2015; Guapo et al., 2016; Khelif et al., 2017; Martinson & Lawson, 2011).

Research on speaker recognition focuses on two main areas: feature extraction and model building. For feature extraction, descriptors such as the Mel-frequency cepstral coefficients (MFCC), Mel filter bank coefficients (Fbank), linear prediction cepstral coefficients (LPCC), pitch period, and formants (resonance peaks) are widely used (Hansen & Hasan, 2015). However, these basic features ignore the possible internal structure of the signal, such as the strong correlation between adjacent frames, and each has disadvantages in some cases; for example, LPCC describes the periodic characteristics of voiced sounds accurately but describes consonants only vaguely. Therefore, several algorithms have been proposed to fuse different features, which improves the effective expression of speaker characteristics and, to a certain extent, the speaker recognition rate (Chowdhury & Ross, 2019; Yujin et al., 2010; Zhang & Zheng, 2013). These include acoustic feature splicing (Ahmed & Bawar, 2018), offline deep feature fusion (Al-Kaltakchi et al., 2016), and online feature fusion (Li et al., 2019). However, noise in the speaker's acoustic data may still deteriorate the accuracy of speaker recognition.
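As a concrete illustration (not taken from the works cited above), the following minimal Python sketch extracts MFCC and Fbank features with librosa; the 25 ms frame length, 10 ms shift, filter-bank size, and file name are assumptions.

```python
# A minimal sketch of MFCC and Fbank extraction with librosa; frame settings
# and the input file are illustrative assumptions.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical audio file
n_fft = int(0.025 * sr)                           # 25 ms frame length
hop = int(0.010 * sr)                             # 10 ms frame shift

# Fbank: log energies of a Mel filter bank applied to the power spectrum.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
fbank = librosa.power_to_db(mel)                  # shape: (40, T)

# MFCC: DCT of the log-Mel energies, keeping the first 20 coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mfcc=20)  # (20, T)
```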

For model construction, Gaussian mixture models (GMMs) can be used either for classification or for summarizing data, such as collecting sufficient statistics (Reynolds et al., 2000). Joint factor analysis (JFA) can be used to obtain speaker verification scores directly, but it is also used to extract embeddings from feature sequences (Kenny et al., 2007). The i-vector is purely an embedding model and usually requires PLDA as a backend (Dehak et al., 2010). With the rise of neural networks, deep neural network (DNN) based speaker recognition models, represented by the x-vector system, have emerged (Snyder et al., 2018). DNN-based methods use hidden layers to extract speaker features; they can learn speaker characteristics from a large number of samples and are robust to noise. In (You et al., 2019), a multi-task learning model based on the x-vector system is proposed; it uses both labeled and unlabeled data to build the model and improves the robustness and capability of the speaker recognition system. In (Garcia-Romero et al., 2019), an x-vector deep neural network model for long recordings is proposed; it modifies the network architecture and optimizes the loss function to address the duration mismatch between the training and test sets, achieving the best recognition performance on the Speakers in the Wild (SITW) benchmark. In (Kanagasundaram et al., 2019), a time-delay neural network is used for short-utterance evaluation in an x-vector based speaker verification system; it extracts the speaker embedding from a deeper, lower-dimensional layer and adopts a variance normalization approach to further improve performance.

However, the x-vector system improves operational efficiency by pooling variable-length frame-level features into fixed-length segment-level features. This statistical mechanism may lose the temporal information of the speech signal, which may degrade the accuracy of speaker recognition.

Therefore, we propose a framework based on multi-task learning and feature integration (MTFI). First, multiple input features are fed into the neural network simultaneously and concatenated at the statistical pooling layer, so that the pooling layer can learn complementary representations. Second, we develop an attention mechanism that computes frame weights for the statistical pooling layer, enhancing the information of key frames and weakening that of useless frames. Finally, to improve the robustness of multi-task learning, we propose a new shared unit (SU) that transmits task-specific representations between tasks during forward propagation and prevents the gradients of different tasks from interfering with one another during backward propagation.

The rest of this paper is structured as follows: Sect. 2 describes the proposed speaker recognition system, and Sect. 3 presents and discusses the experiments. Finally, Sect. 4 presents the conclusions and future work.

2 The proposed speaker recognition system

The proposed speaker recognition system uses an architecture similar to the x-vector deep neural network model. An x-vector system is based on a time-delay neural network (TDNN), which maps a variable-length utterance to a fixed-length speaker feature vector called the x-vector (Liu et al., 2018; Okabe et al., 2018). As shown in Fig. 1, the proposed system comprises a frame-level module and a segment-level module. To make the network more lightweight, the first three TDNN layers and the two fully connected segment-level layers are 128-dimensional, and the other layers are 256-dimensional. The differences between our system and the conventional x-vector system are: (1) the frame-level module uses different features, which are connected in series in the stitching layer; (2) the segment-level module uses an attention mechanism to compute the frame weights of the statistics layer; and (3) a new shared unit is proposed to support multi-task learning and improve its robustness.

Fig. 1
figure 1

The proposed speaker recognition system

2.1 Frame-level module

In the feature space, different features contain different information. The conventional x-vector system uses MFCC, which converts the linear frequency scale to the Mel scale through a series of Mel filters; its main drawback is low robustness to noise. To raise this upper bound, our network integrates additional features simultaneously and obtains a frame-level representation through the fully connected layers. The calculation is as follows:

$$Append\left({\mathcal{X}}_{feature1},{\mathcal{X}}_{feature2}\right) \to \mathcal{X}$$
(1)
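As a minimal sketch of Eq. (1), the snippet below concatenates two frame-level representations of the same utterance along the feature dimension; the batch size, frame count, and per-stream dimensions are illustrative assumptions, and PyTorch is used only as an example framework.

```python
# A minimal sketch of Eq. (1): concatenate ("Append") two frame-level streams.
import torch

T = 300                                    # number of frames in this utterance
x_feature1 = torch.randn(1, T, 128)        # frame-level representation of feature 1
x_feature2 = torch.randn(1, T, 128)        # frame-level representation of feature 2

x = torch.cat([x_feature1, x_feature2], dim=-1)   # resulting X: (1, T, 256)
```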

2.2 Segment-level module

The segment-level module creates a segment-level representation with a stack of fully connected layers. Before these layers, a statistical pooling layer converts the variable-length frame-level features into a fixed-dimensional vector. At present, most networks use average pooling to derive the speaker representation. Average pooling treats every frame as equally important, which is not true: some frames are more important for speaker recognition than others, since the ability to distinguish speakers varies between speech segments. Recent studies (Bhattacharya et al., 2017; Zhu et al., 2018) have applied the attention mechanism to speaker recognition and used it to compute the importance of each frame. In this paper, the attention mechanism is combined with the statistics layer to compute weighted statistics of the speech signal.

As shown in Fig. 2, the attention model first computes a scalar score \(e_t\) for each frame-level representation \(h_t\) (t = 1, 2, …, T), defined as follows:

$${e}_{t}={v}^{T}f\left(W{h}_{t}+b\right)+k$$
(2)

where \(f(\cdot)\) is a non-linear activation function. The scores are normalized over all frames using the softmax function:

Fig. 2
figure 2

Attention-based statistical layer

$${\alpha }_{t}=\frac{exp\left({e}_{t}\right)}{{\sum }_{\tau =1}^{T}exp\left({e}_{\tau }\right)}$$
(3)

The normalized \(\alpha_{t}\) values are used as weights in this layer to calculate the weighted mean and standard deviation vectors as follows:

$$\chi { = }\sum\limits_{t = 1}^{T} {\alpha_{t} } h_{t}$$
(4)
$$\psi { = }\sqrt {\sum\limits_{t = 1}^{T} {\alpha_{t} } h_{t} \odot h_{t} - \chi \odot \chi }$$
(5)

where \(\odot\) denotes the Hadamard product. The resulting segment-level representation is thus more discriminative for the speaker.
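The attention-based statistics layer of Eqs. (2)-(5) can be sketched in PyTorch as follows; the hidden attention dimension, the choice of tanh for \(f(\cdot)\), and the layer sizes are assumptions.

```python
# A minimal sketch of the attention-based statistics layer (Eqs. (2)-(5)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatsPool(nn.Module):
    def __init__(self, in_dim, att_dim=64):
        super().__init__()
        self.W = nn.Linear(in_dim, att_dim)   # W h_t + b
        self.v = nn.Linear(att_dim, 1)        # v^T f(.) + k

    def forward(self, h):                     # h: (batch, T, in_dim)
        e = self.v(torch.tanh(self.W(h)))     # Eq. (2): frame scores e_t
        alpha = F.softmax(e, dim=1)           # Eq. (3): normalize over frames
        mean = torch.sum(alpha * h, dim=1)    # Eq. (4): weighted mean
        var = torch.sum(alpha * h * h, dim=1) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-8)) # Eq. (5): weighted standard deviation
        return torch.cat([mean, std], dim=-1) # (batch, 2 * in_dim)

pooled = AttentiveStatsPool(256)(torch.randn(4, 300, 256))   # -> (4, 512)
```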

The weighted mean and standard deviation vectors are fed into a stack of fully connected layers. One of these hidden layers usually has fewer units in order to provide a lower-dimensional summary of the previous layer. The output is a softmax layer in which each output node corresponds to a speaker.

Some studies have implemented end-to-end neural networks by using contrastive loss or triplet loss (Li et al., 2017).

2.3 The shared unit for multi-task learning

In multi-task learning, different tasks share learned information through a shared representation to improve overall performance. Specifically, when the binary-decision loss learns from the shared representation, the x-vector in each task's branch retains a task-specific representation, which ensures that each task acquires task-relevant features and enhances its learning capability. In Fig. 1, the two tasks share the same underlying nodes. To propagate information between them, a shared segment-level block, shown in Fig. 3, transfers representations from one task to the other, promoting information exchange and sharing among tasks, which benefits the overall learning effectiveness. In addition, to preserve the robustness of the x-vector, the gradient of the current task passes through the shared unit (SU) during backpropagation, which controls how gradient information is transmitted and updated within the network. Accordingly, the softmax loss of the binary decision \(M_{t1}\), the speaker recognition loss \(M_{t2}\), and their partial derivatives are calculated as follows:

$$\hat{h}_{t1} = h_{t1} + m \cdot h_{t2}$$
(6)
$$\hat{h}_{t2} = h_{t2} + n \cdot h_{t1}$$
(7)
$$\frac{{\partial M_{t1} }}{{\partial \hat{h}_{t1} }} = \frac{{\partial M_{t1} }}{{\partial h_{t1} }} + \beta \cdot m \cdot \frac{{\partial M_{t1} }}{{\partial h_{t2} }}$$
(8)
$$\frac{{\partial M_{t2} }}{{\partial \hat{h}_{t2} }} = \frac{{\partial M_{t2} }}{{\partial h_{t2} }} + \beta \cdot n \cdot \frac{{\partial M_{t2} }}{{\partial h_{t1} }}$$
(9)

where \(\beta\) is the scaling factor applied to the clipped cross-task gradient, and m and n are the scaling parameters for forward propagation. When the network is trained with a representation shared between the two tasks, the gradient of each task remains confined to its own branch in the final layers, so the shared representation does not affect the robustness of the x-vector. In this work, we set \(\beta\) = 0 and m = n = 0.925 (Fig. 3).

Fig. 3
figure 3

The shared unit
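A minimal sketch of the shared unit in Eqs. (6)-(9) is given below; the class and function names are illustrative, not taken from the original implementation. The forward pass mixes the two task representations, while the backward pass scales the cross-task gradient by \(\beta\) (with \(\beta = 0\) blocking it entirely).

```python
# A minimal sketch of the shared unit (Eqs. (6)-(9)) with a gradient-scaling op.
import torch

class ScaleGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, beta):
        ctx.beta = beta
        return x.view_as(x)                   # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return ctx.beta * grad_output, None   # scale the gradient by beta

def shared_unit(h_t1, h_t2, m=0.925, n=0.925, beta=0.0):
    # Cross-task terms pass through unchanged in the forward direction,
    # but only a beta-scaled gradient flows back into the other branch.
    h_hat_t1 = h_t1 + m * ScaleGrad.apply(h_t2, beta)   # Eq. (6)
    h_hat_t2 = h_t2 + n * ScaleGrad.apply(h_t1, beta)   # Eq. (7)
    return h_hat_t1, h_hat_t2
```

With \(\beta = 0\) and m = n = 0.925, each task still receives the other task's representation in the forward pass, but its loss gradient never reaches the other branch.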

3 Experimental and result analysis

3.1 The description of the evaluation system

In the decision scoring process, the log-likelihood ratio of two embeddings \(\eta_{1}\) and \(\eta_{2}\) is defined as follows:

$$score = \lg \frac{{p\left( {\eta_{1} {,}\eta_{2} |R_{s} } \right)}}{{p\left( {\eta_{1} {,}\eta_{2} |R_{d} } \right)}}$$
(10)

where \(R_{s}\) denotes the hypothesis that \(\eta_{1}\) and \(\eta_{2}\) come from the same speaker, and \(R_{d}\) denotes the hypothesis that they come from different speakers.
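For illustration, the sketch below evaluates this log-likelihood ratio under a simple two-covariance Gaussian model; the between-speaker covariance B, within-speaker covariance W, and mean mu are hypothetical backend parameters, and the natural logarithm is used in place of lg (the two differ only by a constant factor).

```python
# A minimal two-covariance sketch of the score in Eq. (10); B, W, and mu are
# hypothetical backend parameters, not values from the paper.
import numpy as np
from scipy.stats import multivariate_normal

def llr_score(eta1, eta2, mu, B, W):
    x = np.concatenate([eta1, eta2])
    m = np.concatenate([mu, mu])
    # R_s: both embeddings share one latent speaker variable.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # R_d: independent latent speaker variables.
    cov_diff = np.block([[B + W, np.zeros_like(B)], [np.zeros_like(B), B + W]])
    return (multivariate_normal.logpdf(x, m, cov_same)
            - multivariate_normal.logpdf(x, m, cov_diff))

dim = 2
mu, B, W = np.zeros(dim), np.eye(dim), 0.5 * np.eye(dim)
score = llr_score(np.array([0.9, 1.1]), np.array([1.0, 0.8]), mu, B, W)
```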

We compare the results of the baseline systems and the proposed systems, including the equal error rate (EER) (Zhang et al., 2019) and the minimum of the normalized detection cost function (DCF) at \(P_{target} = 10^{-2}\) and \(P_{target} = 10^{-3}\).

The DCF is a performance evaluation method commonly used in NIST SRE, and is defined as follows:

$$DCF = C_{FR} \cdot FRR \cdot P_{target} + C_{FA} \cdot FAR \cdot \left( 1 - P_{target} \right)$$
(11)

where \(C_{FR}\) and \(C_{FA}\) are the penalty costs of false rejection and false acceptance, respectively, and \(P_{target}\) and \(1 - P_{target}\) are the prior probabilities of target and impostor trials. The false rejection rate (FRR) measures errors in which target speakers are misidentified as non-target speakers, and the false acceptance rate (FAR) measures errors in which non-target speakers are identified as target speakers. Once \(C_{FR}\), \(C_{FA}\), \(P_{target}\) and \(1 - P_{target}\) are fixed, the pair of FRR and FAR values that minimizes the DCF gives the minDCF. Unlike EER, minDCF accounts for both the different costs of the two error types and the prior probabilities of the two trial types, which makes it a more reasonable metric.
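A minimal numpy sketch of how EER and minDCF (Eq. (11)) can be computed from trial scores and labels is shown below; equal error costs \(C_{FR} = C_{FA} = 1\) and the toy score values are assumptions.

```python
# A minimal sketch of EER and minDCF computed by sweeping a decision threshold.
import numpy as np

def eer_and_min_dcf(scores, labels, p_target=0.01, c_fr=1.0, c_fa=1.0):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)          # True = target trial
    thresholds = np.sort(np.unique(scores))
    frr = np.array([np.mean(scores[labels] < t) for t in thresholds])    # false rejections
    far = np.array([np.mean(scores[~labels] >= t) for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))                 # operating point where FRR ~= FAR
    eer = (frr[i] + far[i]) / 2
    dcf = c_fr * frr * p_target + c_fa * far * (1 - p_target)            # Eq. (11)
    return eer, dcf.min()

eer, min_dcf = eer_and_min_dcf([2.1, 0.3, -1.2, 1.8], [1, 0, 0, 1], p_target=0.01)
```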

3.2 Database

In this paper, the Kaldi speech recognition toolkit (Povey et al., 2011) is used to conduct the experiments, including data processing, feature extraction, network training, and system testing.

In this paper, the VoxCeleb1 dataset (Nagrani et al., 2017) is used for the experiments. It contains utterances extracted from videos uploaded to YouTube. All audio is sampled at 16 kHz and stored as single-channel 16-bit WAV files. The speech contains real-world noise, such as environmental noise, background voices, indoor noise, and recording-equipment noise, occurring at irregular times. The dataset contains 1251 speakers and 153,516 utterances, with a total duration of 351 h; the average utterance length is 8.2 s, the maximum is 145 s, and the minimum is 4 s. The speakers cover different races, ages, accents, and genders, including 690 males and 561 females. The training set contains 1211 speakers with 148,642 utterances in total, and the test set contains 40 speakers with 4874 utterances in total. The structure of VoxCeleb1 is presented in Table 1.

Table 1 Voxceleb1 database structure

3.3 Experimental parameters

For the baseline system, the acoustic feature \(X = \left( x_{1}, x_{2}, \cdots, x_{i} \right)\) is a 40-dimensional MFCC with a frame length of 25 ms and a frame shift of 10 ms.

For the experimental system, the acoustic feature is the spectrogram with a frame length of 25 ms, a frame shift of 10 ms, and 257 dimensions per frame. The x-vector system does not require a fixed number of frames, so we directly extract features from speech data of different lengths; the feature size is 257 × T, where T is the length of the utterance. To keep the comparison fair, the training parameters of every network in this section are consistent with the TDNN training parameters of the baseline system. The experimental system adopts a 5-layer CNN: all convolution layers conv1 ~ conv5 use 3 × 3 kernels with a stride of 1, and max pooling uses a 2 × 2 window with a stride of 2. The conv1 and conv2 layers have 64 channels, and conv3 ~ conv5 have 128 channels.
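A minimal PyTorch sketch of this 5-layer CNN front end is shown below; the placement of the pooling layers, the use of padding, and the batch normalization inside each block are assumptions not specified above.

```python
# A minimal sketch of the 5-layer CNN frame-level front end (3x3 kernels,
# stride 1, 2x2 max pooling, 64/64/128/128/128 channels).
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

frame_level_cnn = nn.Sequential(
    conv_block(1, 64),                       # conv1
    conv_block(64, 64),                      # conv2
    nn.MaxPool2d(kernel_size=2, stride=2),
    conv_block(64, 128),                     # conv3
    conv_block(128, 128),                    # conv4
    conv_block(128, 128),                    # conv5
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Spectrogram input: (batch, 1, 257 frequency bins, T frames).
out = frame_level_cnn(torch.randn(2, 1, 257, 300))   # -> (2, 128, 64, 75)
```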

For the baseline system, we use a TDNN. The first four frame-level layers have 512 nodes each, the last frame-level layer has 1500 nodes, and the two fully connected layers have 512 nodes each; all non-linearities use the ReLU function. The training parameters are presented in Table 2.

Table 2 Network training parameters of x-vector system
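A minimal PyTorch sketch of this baseline x-vector TDNN is given below; the temporal contexts (kernel sizes and dilations) follow the usual Kaldi x-vector recipe and are assumptions here, as is taking statistics pooling as the mean and standard deviation over time.

```python
# A minimal sketch of the baseline x-vector TDNN: four 512-unit frame-level
# layers, a 1500-unit layer, statistics pooling, and two 512-unit FC layers.
import torch
import torch.nn as nn

def tdnn_layer(c_in, c_out, kernel, dilation=1):
    return nn.Sequential(nn.Conv1d(c_in, c_out, kernel, dilation=dilation),
                         nn.ReLU(), nn.BatchNorm1d(c_out))

class XVector(nn.Module):
    def __init__(self, feat_dim=40, num_speakers=1211):
        super().__init__()
        self.frame_level = nn.Sequential(
            tdnn_layer(feat_dim, 512, 5),            # context [-2, +2]
            tdnn_layer(512, 512, 3, dilation=2),     # context {-2, 0, +2}
            tdnn_layer(512, 512, 3, dilation=3),     # context {-3, 0, +3}
            tdnn_layer(512, 512, 1),
            tdnn_layer(512, 1500, 1),
        )
        self.segment_level = nn.Sequential(
            nn.Linear(3000, 512), nn.ReLU(),         # the x-vector is taken here
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_speakers),            # softmax over speakers
        )

    def forward(self, x):                            # x: (batch, feat_dim, T)
        h = self.frame_level(x)                      # (batch, 1500, T')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # statistics pooling
        return self.segment_level(stats)             # speaker logits

logits = XVector()(torch.randn(2, 40, 300))          # -> (2, 1211)
```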

In this paper, the ReLU function, batch normalization, and dropout are used in each layer of the neural network to optimize training. The dropout ratio is selected randomly from 0.1, 0.15, and 0.2. Stochastic gradient descent is used as the optimizer, with a momentum of 0.5, a minibatch size of 128, an initial learning rate of 0.01, and a final learning rate of 0.001.
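As an illustrative sketch (not the actual Kaldi recipe), the optimization settings described above could be written in PyTorch as follows; the stand-in model, the number of epochs, and the exponential decay schedule are assumptions.

```python
# A minimal sketch of the optimizer set-up: SGD with momentum 0.5 and a
# learning rate decayed from 0.01 to 0.001 over training.
import torch
import torch.nn as nn

model = nn.Linear(512, 1211)                 # stand-in for the speaker network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
num_epochs = 10                              # assumed number of training epochs
gamma = (0.001 / 0.01) ** (1 / num_epochs)   # per-epoch decay factor
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
```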

In the experiment, we use the Kaldi speech recognition tool to extract the acoustic features and build the speaker recognition system.

3.4 Speaker recognition experiment based on attention

In this section, we compare different features as network input in the x-vector system. First, the same TDNN structure is used to compare the performance of the individual features. Second, different features are integrated for multi-feature input. The results are shown in Table 3. In all cases, average statistical pooling is used in the segment-level module, and the baseline x-vector speaker recognition system is based on the TDNN.

Table 3 Results of different acoustic features and network architecture on test set

As shown in Table 3, four different features are tested under the TDNN. MFCC performs best, with relative EER improvements of 67.1%, 35.3%, and 50.7% over LPCC, Fbank, and the spectrogram, respectively. We then retest the four features with the attention-based statistics layer. All four features improve in EER compared with the plain TDNN, with LPCC showing a relative improvement of 7.68%; this indicates that the attention-based statistics layer enhances the key-frame information more effectively.

3.5 Speaker recognition experiment with multi-task learning

In this section, we design two sets of experiments. First, we combine different features and compute their average score fusion; then we feed the same combinations into a single system without shared units. The experimental results are presented in Table 4.
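For clarity, average score fusion of two single-feature systems amounts to the simple operation below; the score arrays are hypothetical and assumed to cover the same trial list in the same order.

```python
# A minimal sketch of average score fusion for two systems.
import numpy as np

scores_sys_a = np.array([1.2, -0.4, 0.7])    # e.g. the MFCC-based system
scores_sys_b = np.array([0.9, -0.1, 0.5])    # e.g. the Fbank-based system
fused_scores = (scores_sys_a + scores_sys_b) / 2
```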

As shown in Table 4, different systems and feature combinations affect performance. The single-task systems (without the additional shared blocks) outperform all single-feature systems and their average score fusions; in particular, the integration of LPCC and Fbank shows a relative improvement of 21.5%.

Table 4 Test results of the second experimental scheme

Finally, since the two more robust single-feature systems, MFCC and Fbank, and their MFST system achieve reasonably good results, MTFI and SU-MTFI are tested below using MFCC and Fbank. The results are shown in Table 5.

Table 5 Results on the voxceleb1 test set

As shown in Table 5, on VoxCeleb1 the SU-MTFI system demonstrates the best performance when the same feature combination is used. With the integration of MFCC and Fbank, SU-MTFI is far superior to MTFI and MFST and propagates shared representations well, which verifies that the SU improves the robustness of multi-task learning. SU-MTFI yields a 19.7% relative improvement over MTFI.

4 Conclusion and future work

This paper presents a speaker recognition system based on multi-task learning and feature integration (MTFI). Its two key ideas are the integration of complementary features and the improved propagation of shared representations in multi-task learning through the proposed shared unit.

In the future, we will study additional features, such as CQCC. We also plan to further analyze the invariant characteristics of the voiceprint to improve the robustness of the system.