♦ Developing machine learning and deep learning algorithms ♦ Good research experience, with several publications, that allows me to understand complex machine learning algorithms and apply them to new applications. Supervisors: Prof. Sridha Sridharan
The aim of this work is to gain insights into how deep neural network (DNN) models should be trained for short-utterance evaluation conditions in an x-vector based speaker verification system. The study suggests that the speaker embedding can be extracted with reduced dimensions for short-utterance evaluation conditions. When the speaker embedding is extracted from a deeper layer that has a lower dimension, the x-vector system achieves a 14% relative improvement in EER over the baseline approach on NIST 2010 5sec-5sec truncated conditions. We surmise that, since short utterances carry less phonetic information, speaker-discriminative x-vectors can be extracted from a deeper layer of the DNN that captures less phonetic information. Another interesting finding is that the x-vector system achieves a 5% relative improvement on the NIST 2010 5sec-5sec evaluation condition when the back-end PLDA is trained using short-utterance development data. The results confirm the intuitive expectation that the duration of development utterances and the duration of evaluation utterances should be matched. Finally, for the duration-mismatch condition, we propose a variance normalization approach for PLDA training that provides a 4% relative improvement in EER over the baseline approach.
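The relative improvements quoted in these abstracts are ratios against a baseline error rate, not absolute percentage-point differences. A minimal sketch of the arithmetic follows; the function name and the EER values are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_eer: float, system_eer: float) -> float:
    """Relative EER reduction of a system over a baseline, in percent."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


# Hypothetical EERs (in percent), chosen only to illustrate the calculation:
# a drop from 10.0% to 8.6% EER is a 14% relative improvement,
# although the absolute difference is only 1.4 percentage points.
improvement = relative_improvement(10.0, 8.6)
print(f"{improvement:.1f}% relative improvement")
```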
This paper presents the LEAP System, developed for the Second DIHARD diarization Challenge. The evaluation data in the challenge is composed of multi-talker speech in restaurants, doctor-patient conversations, child language acquisition recordings in home environments and audio extracted from YouTube videos. The LEAP system is developed using two types of embeddings, one based on i-vector representations and the other based on x-vector representations. The initial diarization output, obtained using agglomerative hierarchical clustering (AHC) on the probabilistic linear discriminant analysis (PLDA) scores, is refined using the Variational-Bayes hidden Markov model (VB-HMM). We propose a modified VB-HMM with posterior scaling, which provides significant improvements in the final diarization error rate (DER). We also apply domain compensation to the i-vector features to reduce the mismatch between training and evaluation conditions. Using the proposed approaches, we obtain relative improvements in DER of about 7.1% for the best individual system over the DIHARD baseline system and about 13.7% for the final system combination on the evaluation set. An analysis of the proposed posterior scaling method shows that scaling results in improved discrimination among the HMM states in the VB-HMM.
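For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below only illustrates that definition; the function name and the durations are hypothetical, and challenge scoring tools may apply additional rules (for example, around overlap or collars) that are not modelled here.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER in percent, from error durations and total reference speech (seconds)."""
    if total_speech <= 0:
        raise ValueError("total reference speech time must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Hypothetical durations in seconds, for illustration only.
der = diarization_error_rate(missed=3.0, false_alarm=2.0,
                             confusion=5.0, total_speech=100.0)
print(f"DER = {der:.1f}%")
```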
National Undergraduate Research Symposium of National Science and Technology Commission (NASTEC), 2019
The purpose of this study is to find the most accurate model for predicting solar photovoltaic (PV) power generation and solar irradiance. The data was collected from the solar measuring station of the Faculty of Engineering, University of Jaffna. In this paper, a deep learning based univariate long short-term memory (LSTM) approach is introduced to predict solar irradiance. The univariate LSTM and auto-regressive integrated moving average (ARIMA) based time series approaches are compared. Both models are evaluated using root mean-square error (RMSE). This study suggests that the univariate LSTM approach outperforms the ARIMA approach.
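The comparison described above rests on RMSE, which penalises large forecast errors more heavily than small ones. A minimal sketch of the metric follows; the irradiance values and the two forecast series are invented for illustration and do not come from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical irradiance readings (W/m^2) and two hypothetical forecasts.
measured = [410.0, 455.0, 502.0, 530.0]
forecast_a = [400.0, 460.0, 495.0, 540.0]
forecast_b = [380.0, 470.0, 520.0, 500.0]

# The forecast with the lower RMSE tracks the measurements more closely.
print(rmse(measured, forecast_a), rmse(measured, forecast_b))
```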
International Conference on Solar Energy Materials, Solar Cells & Solar Energy Applications, 2018
A number of parameters, such as solar irradiance, temperature, wind speed, wind direction and soiling, influence solar energy harvesting. It is essential to develop a deeper understanding of the factors influencing solar energy production in a particular region and to have reliable models to forecast energy production. For this research study, Kilinochchi district was chosen as it has a lot of potential for solar PV and solar thermal compared to other districts. In this paper, we first analysed how the weather data and solar irradiance vary on a daily and yearly basis. Subsequently, the correlations between individual weather parameters, solar irradiance and silicon voltage were studied using Pearson correlation estimation. Based on the correlation studies, it was found that the solar parameters influence solar power generation. After that, temperature variations were modelled using ARIMA and the model was used to forecast the next hour's data. Similarly, the diffuse horizontal irradiance (DHI) and global horizontal irradiance (GHI) data can be forecast using ARIMA modelling, and the next hour's data can be predicted. Future study will include modelling the correlation between solar irradiance and temperature or humidity using support vector regression methods, and DHI and GHI will be forecast based on weather data. These prediction models are useful for power generation entities and households. The effects of soiling on PV modules, which vary with soil type, location and weather patterns, will also be considered.
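The Pearson correlation used in this kind of analysis measures the strength of a linear relationship between two variables, from -1 to +1. A minimal sketch follows; the variable names and the sample readings are hypothetical and are not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical hourly readings: temperature (deg C) and irradiance (W/m^2).
temperature = [26.0, 28.0, 30.0, 31.0, 29.0]
irradiance = [300.0, 480.0, 650.0, 700.0, 560.0]
print(round(pearson(temperature, irradiance), 3))
```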
IEEE International Conference on Information and Automation for Sustainability (ICIAfS), 2018
The need for solar irradiation forecasts at a specific location over long time horizons has become very important. In this paper, we study machine learning techniques for predicting solar irradiation at 10 min intervals, using data sets from the Faculty of Engineering, University of Jaffna measuring center in Kilinochchi district. The accuracies of prediction models such as ARIMA, Random Forest Regression, Neural Networks, Linear Regression and Support Vector Machine are compared. This study suggests that ARIMA outperforms the other approaches.
We investigate the use of deep neural networks (DNNs) for the speaker diarization task to improve performance under domain mismatched conditions. Three unsupervised domain adaptation techniques, namely inter-dataset variability compensation (IDVC), domain-invariant covariance normalization (DICN), and domain mismatch modeling (DMM), are applied to DNN based speaker embeddings to compensate for the mismatch in the embedding subspace. We present results of experiments conducted on the DIHARD data, which was released for the 2018 diarization challenge. Collected from a diverse set of domains, this data provides very challenging domain mismatched conditions for the diarization task. Our results provide insights into how the performance of our proposed system could be further improved.
14th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), 2013
A significant amount of speech data is required to develop a robust speaker verification system, but it is difficult to find enough development speech to match all expected conditions. In this paper we introduce a new approach to Gaussian probabilistic linear discriminant analysis (GPLDA) to estimate reliable model parameters as a linearly weighted model taking more input from the large volume of available telephone data and smaller proportional input from limited microphone data. In comparison to a traditional pooled training approach, where the GPLDA model is trained over both telephone and microphone speech, this linear-weighted GPLDA approach is shown to provide better EER and DCF performance in microphone and mixed conditions in both the NIST 2008 and NIST 2010 evaluation corpora. Based upon these results, we believe that linear-weighted GPLDA will provide a better approach than pooled GPLDA, allowing for the further improvement of GPLDA speaker verification in conditions with limited development data.
IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012
This paper introduces the Weighted Linear Discriminant Analysis (WLDA) technique, based upon the weighted pairwise Fisher criterion, for the purposes of improving i-vector speaker verification in the presence of high intersession variability. By taking advantage of the speaker discriminative information that is available in the distances between pairs of speakers clustered in the development i-vector space, the WLDA technique is shown to provide an improvement in speaker verification performance over traditional Linear Discriminant Analysis (LDA) approaches. A similar approach is also taken to extend the recently developed Source Normalised LDA (SNLDA) into Weighted SNLDA (WSNLDA) which, similarly, shows an improvement in speaker verification performance in both matched and mismatched enrolment/verification conditions. Based upon the results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset, we believe that both WLDA and WSNLDA are viable as replacement techniques to improve the performance of LDA and SNLDA-based i-vector speaker verification.
This paper analyses the probabilistic linear discriminant analysis (PLDA) speaker verification approach with limited short-utterance development data. Experimental studies have found that when speaker verification is evaluated on the 10sec-10sec condition, speech utterances of at least around 40sec are required for PLDA modelling. Subsequently, for cases where only limited session data is available, an utterance partitioning approach is introduced to increase the amount of session data, and partitioning-based PLDA speaker verification shows improvement over the baseline approach in limited session data conditions. The partitioning approach is also applied to a source-normalized weighted linear discriminant analysis (SN-WLDA)-projected GPLDA system, where it shows improvement over the full-length SN-WLDA-projected GPLDA system.
This paper analyses the probabilistic linear discriminant analysis (PLDA) speaker verification approach with limited development data. It investigates the use of the median as the central tendency of a speaker's i-vector representation, and the effectiveness of weighted discriminative techniques on the performance of state-of-the-art length-normalised Gaussian PLDA (GPLDA) speaker verification systems. The analysis shows that the median (using a median Fisher discriminator (MFD)) provides a better representation of a speaker when the number of representative i-vectors available during development is reduced, and that the pair-wise weighting approach in weighted LDA and weighted MFD provides further improvement in limited development conditions. Best performance is obtained using a weighted MFD approach, which shows over 10% improvement in EER over the baseline GPLDA system on mismatched and interview-interview conditions.
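The abstract above refers to length-normalised GPLDA. Length normalisation is commonly understood as scaling each i-vector to unit Euclidean norm before PLDA modelling. The sketch below assumes that reading; any centring or whitening that a full system may apply beforehand is omitted, and the function name and example vector are hypothetical.

```python
import math


def length_normalise(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [v / norm for v in vector]


# A toy 2-dimensional "i-vector"; real i-vectors have hundreds of dimensions.
normalised = length_normalise([3.0, 4.0])
print(normalised)
```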
12th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), 2011
Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real world applications often have access to only limited duration speech data. This paper explores how the recent technologies focused around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems, including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.
IEEE International Conference on Acoustics, Speech, and Signal Processing, 2015
Experimental studies have found that when state-of-the-art probabilistic linear discriminant analysis (PLDA) speaker verification systems are trained using out-domain data, speaker verification performance is significantly affected by the mismatch between development data and evaluation data. To overcome this problem, we propose a novel unsupervised inter-dataset variability (IDV) compensation approach to compensate for the dataset mismatch. The IDV-compensated PLDA system achieves over 10% relative improvement in EER over the out-domain PLDA system by effectively compensating for the mismatch between in-domain and out-domain data.
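Several of these abstracts report results as EER, the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it by sweeping a threshold over the observed scores and taking the point where the two rates are closest. This sweep convention, the function name and the scores are assumptions made for illustration; evaluation toolkits often interpolate between points instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER (as a fraction) from target and non-target trial scores."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        # A target trial is falsely rejected when it scores below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # A non-target trial is falsely accepted when it scores at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer


# Hypothetical verification scores (higher means "more likely the same speaker").
targets = [2.1, 1.7, 0.4, 2.9]
nontargets = [-1.2, 0.6, -0.3, -2.0]
print(equal_error_rate(targets, nontargets))
```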
In The Speaker and Language Recognition Workshop (Odyssey 2012), 2012
This paper investigates the effects of limited speech data in the context of speaker verification using a probabilistic linear discriminant analysis (PLDA) approach. Being able to reduce the length of required speech data is important to the development of automatic speaker verification systems in real world applications. When sufficient speech is available, previous research has shown that heavy-tailed PLDA (HTPLDA) modeling of speakers in the i-vector space provides state-of-the-art performance. However, the robustness of HTPLDA to limited speech resources in development, enrolment and verification is an important issue that has not yet been investigated. In this paper, we analyze speaker verification performance with regard to the duration of utterances used for speaker evaluation (enrolment and verification) and for score normalization and PLDA modeling during development. Two different approaches to total-variability representation are analyzed within the PLDA approach to show improved performance in short-utterance mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. The results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset suggest that the HTPLDA system can continue to achieve better performance than Gaussian PLDA (GPLDA) as evaluation utterance lengths are decreased. We also highlight the importance of matching durations for score normalization and PLDA modeling to the expected evaluation conditions. Finally, we found that a pooled total-variability approach to PLDA modeling can achieve better performance than the traditional concatenated total-variability approach for short utterances in mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development.
14th Australasian International Conference on Speech Science and Technology, 2012
This paper investigates the use of mel-frequency delta-phase (MFDP) features in comparison to, and in fusion with, traditional mel-frequency cepstral coefficient (MFCC) features within joint factor analysis (JFA) speaker verification. MFCC features, commonly used in speaker recognition systems, are derived purely from the magnitude spectrum, with the phase spectrum completely discarded. In this paper, we investigate whether features derived from the phase spectrum can provide additional speaker discriminant information to the traditional MFCC approach in a JFA based speaker verification system. Results are presented which provide a comparison of MFCC-only, MFDP-only and score fusion of the two approaches within a JFA speaker verification approach. Based upon the results presented using the NIST 2008 Speaker Recognition Evaluation (SRE) dataset, we believe that, while MFDP features alone cannot compete with MFCC features, MFDP can provide complementary information that results in improved speaker verification performance when both approaches are combined in score fusion, particularly in the case of shorter utterances.
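Score fusion combines the per-trial scores of two systems into a single score. The abstract does not state which fusion rule was used, so the sketch below assumes the simplest option, a fixed-weight linear combination; the function name, the weight and the scores are all hypothetical. In practice the weight would normally be tuned on development data.

```python
def fuse_scores(mfcc_scores, mfdp_scores, mfcc_weight=0.7):
    """Linearly fuse two aligned lists of per-trial scores (assumed rule)."""
    if len(mfcc_scores) != len(mfdp_scores):
        raise ValueError("both systems must score the same trials")
    if not 0.0 <= mfcc_weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [mfcc_weight * a + (1.0 - mfcc_weight) * b
            for a, b in zip(mfcc_scores, mfdp_scores)]


# Hypothetical scores for three trials from each system.
fused = fuse_scores([1.2, -0.5, 0.3], [0.8, -0.1, -0.4], mfcc_weight=0.7)
print(fused)
```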
16th Annual Conference of the International Speech Communication Association, Interspeech 2015, 2015
In this paper we introduce a novel domain-invariant covariance normalization (DICN) technique to relocate both in-domain and out-domain i-vectors into a third dataset-invariant space, providing an improvement for out-domain PLDA speaker verification with a very small number of unlabelled in-domain adaptation i-vectors. By capturing the dataset variance from a global mean using both development out-domain i-vectors and limited unlabelled in-domain i-vectors, we could obtain domain-invariant representations of PLDA training data. The DICN-compensated out-domain PLDA system is shown to perform as well as in-domain PLDA training with as few as 500 unlabelled in-domain i-vectors for NIST-2010 SRE and 2000 unlabelled in-domain i-vectors for NIST-2008 SRE, and considerable relative improvement over both out-domain and in-domain PLDA development if more are available.
The Speaker and Language Recognition Workshop (Odyssey 2012), 2012
This paper investigates the use of the dimensionality-reduction techniques weighted linear discriminant analysis (WLDA) and weighted median Fisher discriminant analysis (WMFD) before probabilistic linear discriminant analysis (PLDA) modeling, for the purpose of improving speaker verification performance in the presence of high inter-session variability. Recently it was shown that WLDA techniques can provide improvement over traditional linear discriminant analysis (LDA) for channel compensation in i-vector based speaker verification systems. We show in this paper that the speaker discriminative information that is available in the distances between pairs of speakers clustered in the development i-vector space can also be exploited in heavy-tailed PLDA modeling by using the weighted discriminant approaches prior to PLDA modeling. Based upon the results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset, we believe that WLDA and WMFD projections before PLDA modeling can provide an improved approach when compared to uncompensated PLDA modeling for i-vector based speaker verification systems.
14th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), 2013
A significant amount of speech is typically required for speaker verification system development and evaluation, especially in the presence of large intersession variability. This paper introduces a source and utterance-duration normalized linear discriminant analysis (SUN-LDA) approach to compensate for session variability in short-utterance i-vector speaker verification systems. Two variations of SUN-LDA are proposed where normalization techniques are used to capture source variation from both short and full-length development i-vectors, one based upon pooling (SUN-LDA-pooled) and the other on concatenation (SUN-LDA-concat) across the duration and source-dependent session variation. Both the SUN-LDA-pooled and SUN-LDA-concat techniques are shown to provide improvement over traditional LDA on NIST 08 truncated 10sec-10sec evaluation conditions, with the highest improvement obtained with the SUN-LDA-concat technique, which achieves a relative improvement of 8% in EER for mismatched conditions and over 3% for matched conditions over traditional LDA approaches.
17th Annual Conference of the International Speech Communication Association (ISCA), International Speech Communication Association (ISCA), 2016
This paper analyses short-utterance probabilistic linear discriminant analysis (PLDA) speaker verification with utterance partitioning and short utterance variance (SUV) modelling approaches. Experimental studies have found that, instead of using a single long utterance as enrolment data, partitioning the long enrolment utterance into multiple short utterances and using the average of the short-utterance i-vectors as enrolment data improves Gaussian PLDA (GPLDA) speaker verification. This is because short-utterance i-vectors have speaker, session and utterance variations, and the utterance-partitioning approach compensates for the utterance variation. Subsequently, SUV-PLDA is also studied with the utterance-partitioning approach, and the utterance-partitioning-based SUV-GPLDA system shows relative improvements of 9% and 16% in EER for the NIST 2008 and NIST 2010 truncated 10sec-10sec evaluation conditions, as the utterance-partitioning approach compensates for the utterance variation and the SUV modelling approach compensates for the mismatch between full-length development data and short-length evaluation data.
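The utterance-partitioning idea described above (split a long enrolment utterance into short segments, extract one embedding per segment, then average) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the helper names are hypothetical, equal-length partitions are assumed, and `toy_extractor` is only a stand-in for a real i-vector extractor.

```python
def partition(frames, num_parts):
    """Split a frame sequence into equal-length parts, dropping any remainder."""
    size = len(frames) // num_parts
    if num_parts < 1 or size < 1:
        raise ValueError("need at least one frame per partition")
    return [frames[i * size:(i + 1) * size] for i in range(num_parts)]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def partitioned_enrolment(frames, num_parts, extract):
    """Average the embeddings extracted from each partition of one utterance."""
    embeddings = [extract(segment) for segment in partition(frames, num_parts)]
    return mean_vector(embeddings)


# Stand-in extractor (hypothetical): a real system would map each segment's
# acoustic features to an i-vector; here we simply average the frames.
toy_extractor = mean_vector

frames = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
enrolment = partitioned_enrolment(frames, num_parts=2, extract=toy_extractor)
print(enrolment)
```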
16th Annual Conference of the International Speech Communication Association, Interspeech 2015, 2015
This paper analyzes the limitations upon the amount of in-domain (NIST SREs) data required for training a probabilistic linear discriminant analysis (PLDA) speaker verification system based on out-domain (Switchboard) total variability subspaces. By limiting the number of speakers, the number of sessions per speaker and the length of active speech per session available in the target domain for PLDA training, we investigated the relative effect of these three parameters on PLDA speaker verification performance in the NIST 2008 and NIST 2010 speaker recognition evaluation datasets. Experimental results indicate that while these parameters depend highly on each other, to beat out-domain PLDA training, more than 10 seconds of active speech should be available for at least 4 sessions/speaker for a minimum of 800 speakers. If further data is available, considerable improvement can be made over solely out-domain PLDA training.
This paper presents the QUT speaker recognition system, as a competing system in the Speakers In The Wild (SITW) speaker recognition challenge. Our proposed system achieved an overall ranking of second place in the main core-core condition evaluations of the SITW challenge. This system uses an i-vector/PLDA approach, with domain adaptation and a deep neural network (DNN) trained to provide feature statistics. The statistics are accumulated by using class posteriors from the DNN, in place of GMM component posteriors in a typical GMM-UBM i-vector/PLDA system. Once the statistics have been collected, the i-vector computation is carried out as in a GMM-UBM based system. We apply domain adaptation to the extracted i-vectors to ensure robustness against dataset variability, PLDA modelling is used to capture speaker and session variability in the i-vector space, and the processed i-vectors are compared using the batch likelihood ratio. The final scores are calibrated to obtain the calibrated likelihood scores, which are then used to carry out speaker recognition and evaluate the performance of the system. Finally, we explore the practical application of our system to the core-multi condition recordings of the SITW data and propose a technique for speaker recognition in recordings with multiple speakers.
The aim of this work is to gain insights into how the deep neu-ral network (DNN) models should be... more The aim of this work is to gain insights into how the deep neu-ral network (DNN) models should be trained for short utterance evaluation conditions in an x-vector based speaker verification system. The study suggests that the speaker embedding can be extracted with reduced dimensions for short utterance evaluation conditions. When the speaker embedding is extracted from deeper layer which has lower dimension, the x-vector system achieves 14% relative improvement over baseline approach on EER on NIST2010 5sec-5sec truncated conditions. We surmise that since short utterances have less phonetic information speaker discriminative x-vectors can be extracted from a deeper layer of the DNN which captures less phonetic information. Another interesting finding is that the x-vector system achieves 5% relative improvement on NIST2010 5sec-5sec evaluation condition when the back-end PLDA is trained using short utterance development data. The results confirms the intuitive expectation that duration of development utterances and the duration of evaluation utterances should be matched. Finally, for the duration mismatch condition, we propose a variance normal-ization approach for PLDA training that provides a 4% relative improvement on EER over baseline approach.
This paper presents the LEAP System, developed for the Second DIHARD diarization Challenge. The e... more This paper presents the LEAP System, developed for the Second DIHARD diarization Challenge. The evaluation data in the challenge is composed of multi-talker speech in restaurants , doctor-patient conversations, child language acquisition recordings in home environments and audio extracted YouTube videos. The LEAP system is developed using two types of em-beddings, one based on i-vector representations and the other one based on x-vector representations. The initial diariza-tion output obtained using agglomerative hierarchical clustering (AHC) done on the probabilistic linear discriminant analysis (PLDA) scores is refined using the Variational-Bayes hidden Markov model (VB-HMM) model. We propose a modified VB-HMM model with posterior scaling which provides significant improvements in the final diarization error rate (DER). We also use a domain compensation on the i-vector features to reduce the mis-match between training and evaluation conditions. Using the proposed approaches, we obtain relative improvements in DER of about 7.1% relative for the best individual system over the DIHARD baseline system and about 13.7% relative for the final system combination on evaluation set. An analysis performed using the proposed posterior scaling method shows that scaling results in improved discrimination among the HMM states in the VB-HMM.
National Undergraduate Research Symposium of National Science and Technology Commission (NASTEC), 2019
The purpose of this study is to come up with a most accurate model for predicting the Solar photo... more The purpose of this study is to come up with a most accurate model for predicting the Solar photovoltaic (PV) power generation and Solar irradiance. For this study, the data is collected from Faculty of Engineering, University of Jaffa solar measuring station. In this paper, deep learning based univariate long short-term memory (LSTM) approach is introduced to predict the Solar irradiance. A univariate LSTM and auto-regressive integrated moving average (ARIMA) based time series approaches are compared. Both models are evaluated using root mean-square error (RMSE). This study suggests that univariate LSTM approach performs well over ARIMA approach.
International Conference On. Solar Energy Materials, Solar Cells & Solar Energy Applications, 2018
A number of parameters such as solar irradiance, temperature, wind speed, wind direction and soil... more A number of parameters such as solar irradiance, temperature, wind speed, wind direction and soiling are influencing the solar energy harvesting. It is essential to develop deeper understanding of the factors influencing the solar energy production in a particular region and have reliable models to forecast energy production. For this research study, Killinochi district was chosen as it has a lot of potential for solar PV, and solar thermal compared to other districts. In this paper, initially it was analysed how the weather data and solar irradiance vary on a daily and yearly basis. Subsequently, the correlation between individual weather parameters, solar irradiance and silicon voltage are studied. Pearson correlation estimation was used for correlation studies. Based on correlation studies, it was found that the solar parameters are influencing the solar power generation. After that, temperature variations were modelled using ARIMA modelling and the model were used to forecast the next hour data. Similarly, the diffused horizontal irradiance (DHI) and global horizontal irradiance (GHI) data can be forecasted using ARIMA modelling, and the next hour data can be predicted. Future study will include modelling of correlation between solar irradiance and temperature or humidity using support vector regression methods; and DHI and GHI will be forecasted based on weather data. These prediction models are useful for power generation entities and households. The effects of soiling on PV modules which vary with soil type, location and weather patterns will also be considered.
IEEE International Conference on Information and Automation for Sustainability (ICIAfS), 2018
The need of solar irradiation forecast at a specific location over long time horizons has attaine... more The need of solar irradiation forecast at a specific location over long time horizons has attained massive importance. In this paper, we study the machine learning techniques to predict solar irradiation in 10 min intervals using data sets from Killinochchi district, Faculty of Engineering, University of Jaffna measuring center. The accuracies of the prediction models such as ARIMA, Random Forest Regression, Neural Networks, Linear Regression and Supportive Vector Machine is compared. This study suggests that ARIMA performs well over other approaches.
We investigate the use of deep neural networks (DNNs) for the speaker diarization task to improve... more We investigate the use of deep neural networks (DNNs) for the speaker diarization task to improve performance under domain mismatched conditions. Three unsupervised domain adaptation techniques, namely inter-dataset variability compensation (IDVC), domain-invariant covariance normalization (DICN), and domain mismatch modeling (DMM), are applied on DNN based speaker embeddings to compensate for the mismatch in the embedding subspace. We present results conducted on the DIHARD data, which was released for the 2018 diarization challenge. Collected from a diverse set of domains, this data provides very challenging domain mismatched conditions for the diarization task. Our results provide insights into how the performance of our proposed system could be further improved.
14th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), 2013
A significant amount of speech data is required to develop a robust speaker verification system, but it is difficult to find enough development speech to match all expected conditions. In this paper we introduce a new approach to Gaussian probabilistic linear discriminant analysis (GPLDA) to estimate reliable model parameters as a linearly weighted model taking more input from the large volume of available telephone data and smaller proportional input from limited microphone data. In comparison to a traditional pooled training approach, where the GPLDA model is trained over both telephone and microphone speech, this linear-weighted GPLDA approach is shown to provide better EER and DCF performance in microphone and mixed conditions in both the NIST 2008 and NIST 2010 evaluation corpora. Based upon these results, we believe that linear-weighted GPLDA will provide a better approach than pooled GPLDA, allowing for the further improvement of GPLDA speaker verification in conditions with limited development data.
IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012
This paper introduces the Weighted Linear Discriminant Analysis (WLDA) technique, based upon the weighted pairwise Fisher criterion, for the purposes of improving i-vector speaker verification in the presence of high intersession variability. By taking advantage of the speaker discriminative information that is available in the distances between pairs of speakers clustered in the development i-vector space, the WLDA technique is shown to provide an improvement in speaker verification performance over traditional Linear Discriminant Analysis (LDA) approaches. A similar approach is also taken to extend the recently developed Source Normalised LDA (SNLDA) into Weighted SNLDA (WSNLDA) which, similarly, shows an improvement in speaker verification performance in both matched and mismatched enrolment/verification conditions. Based upon the results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset, we believe that both WLDA and WSNLDA are viable as replacement techniques to improve the performance of LDA and SNLDA-based i-vector speaker verification.
This paper analyses the probabilistic linear discriminant analysis (PLDA) speaker verification approach with limited short utterance development data. Experimental studies have found that when speaker verification is evaluated on the 10sec-10sec condition, speech utterances of at least around 40sec are required for PLDA modelling. Subsequently, when limited session data is available, an utterance partitioning approach is introduced to increase the number of sessions, and partitioning-based PLDA speaker verification shows improvement over the baseline approach in limited session data conditions. The partitioning approach is also applied to the source-normalized weighted linear discriminant analysis (SN-WLDA)-projected GPLDA system, where it shows improvement over the full-length SN-WLDA-projected GPLDA system.
This paper analyses the probabilistic linear discriminant analysis (PLDA) speaker verification approach with limited development data. This paper investigates the use of the median as the central tendency of a speaker's i-vector representation, and the effectiveness of weighted discriminative techniques on the performance of state-of-the-art length-normalised Gaussian PLDA (GPLDA) speaker verification systems. The analysis within shows that the median (using a median Fisher discriminator (MFD)) provides a better representation of a speaker when the number of representative i-vectors available during development is reduced, and that further, usage of the pair-wise weighting approach in weighted LDA and weighted MFD provides further improvement in limited development conditions. Best performance is obtained using a weighted MFD approach, which shows over 10% improvement in EER over the baseline GPLDA system on mismatched and interview-interview conditions.
12th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), 2011
Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real world applications often have access to only limited duration speech data. This paper explores how the recent technologies focused around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.
IEEE International Conference on Acoustics, Speech, and Signal Processing, 2015
Experimental studies have found that when state-of-the-art probabilistic linear discriminant analysis (PLDA) speaker verification systems are trained using out-domain data, speaker verification performance is significantly affected by the mismatch between development data and evaluation data. To overcome this problem, we propose a novel unsupervised inter-dataset variability (IDV) compensation approach to compensate for the dataset mismatch. The IDV-compensated PLDA system achieves over 10% relative improvement in EER values over the out-domain PLDA system by effectively compensating for the mismatch between in-domain and out-domain data.
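Several abstracts on this page report results as a relative improvement in equal error rate (EER). The sketch below shows one simple way such figures could be computed from lists of target and non-target trial scores; it is an illustrative threshold sweep with hypothetical function names, not the scoring tooling used in the papers.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is >= the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

def relative_improvement(baseline_eer, new_eer):
    """Relative EER improvement over a baseline, in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

Under this definition, a system that lowers the EER from 5.0% to 4.5% shows a 10% relative improvement, even though the absolute reduction is only 0.5 percentage points.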
In The Speaker and Language Recognition Workshop (Odyssey 2012), 2012
This paper investigates the effects of limited speech data in the context of speaker verification using a probabilistic linear discriminant analysis (PLDA) approach. Being able to reduce the length of required speech data is important to the development of automatic speaker verification systems in real world applications. When sufficient speech is available, previous research has shown that heavy-tailed PLDA (HTPLDA) modeling of speakers in the i-vector space provides state-of-the-art performance; however, the robustness of HTPLDA to the limited speech resources in development, enrolment and verification is an important issue that has not yet been investigated. In this paper, we analyze the speaker verification performance with regards to the duration of utterances used for both speaker evaluation (enrolment and verification) and score normalization and PLDA modeling during development. Two different approaches to total-variability representation are analyzed within the PLDA approach to show improved performance in short-utterance mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. The results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset suggest that the HTPLDA system can continue to achieve better performance than Gaussian PLDA (GPLDA) as evaluation utterance lengths are decreased. We also highlight the importance of matching durations for score normalization and PLDA modeling to the expected evaluation conditions. Finally, we found that a pooled total-variability approach to PLDA modeling can achieve better performance than the traditional concatenated total-variability approach for short utterances in mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development.
14th Australasian International Conference on Speech Science and Technology, 2012
This paper investigates the use of mel-frequency delta-phase (MFDP) features in comparison to, and in fusion with, traditional mel-frequency cepstral coefficient (MFCC) features within joint factor analysis (JFA) speaker verification. MFCC features, commonly used in speaker recognition systems, are derived purely from the magnitude spectrum, with the phase spectrum completely discarded. In this paper, we investigate if features derived from the phase spectrum can provide additional speaker discriminant information to the traditional MFCC approach in a JFA based speaker verification system. Results are presented which provide a comparison of MFCC-only, MFDP-only and score fusion of the two approaches within a JFA speaker verification approach. Based upon the results presented using the NIST 2008 Speaker Recognition Evaluation (SRE) dataset, we believe that, while MFDP features alone cannot compete with MFCC features, MFDP can provide complementary information that results in improved speaker verification performance when both approaches are combined in score fusion, particularly in the case of shorter utterances.
16th Annual Conference of the International Speech Communication Association, Interspeech 2015, 2015
In this paper we introduce a novel domain-invariant covariance normalization (DICN) technique to relocate both in-domain and out-domain i-vectors into a third dataset-invariant space, providing an improvement for out-domain PLDA speaker verification with a very small number of unlabelled in-domain adaptation i-vectors. By capturing the dataset variance from a global mean using both development out-domain i-vectors and limited unlabelled in-domain i-vectors, we could obtain domain-invariant representations of PLDA training data. The DICN-compensated out-domain PLDA system is shown to perform as well as in-domain PLDA training with as few as 500 unlabelled in-domain i-vectors for NIST-2010 SRE and 2000 unlabelled in-domain i-vectors for NIST-2008 SRE, and considerable relative improvement over both out-domain and in-domain PLDA development if more are available.
The Speaker and Language Recognition Workshop (Odyssey 2012), 2012
This paper investigates the use of the dimensionality-reduction techniques weighted linear discriminant analysis (WLDA) and weighted median Fisher discriminant analysis (WMFD) before probabilistic linear discriminant analysis (PLDA) modeling for the purpose of improving speaker verification performance in the presence of high inter-session variability. Recently it was shown that WLDA techniques can provide improvement over traditional linear discriminant analysis (LDA) for channel compensation in i-vector based speaker verification systems. We show in this paper that the speaker discriminative information that is available in the distance between pairs of speakers clustered in the development i-vector space can also be exploited in heavy-tailed PLDA modeling by using the weighted discriminant approaches prior to PLDA modeling. Based upon the results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset, we believe that WLDA and WMFD projections before PLDA modeling can provide an improved approach when compared to uncompensated PLDA modeling for i-vector based speaker verification systems.
14th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), 2013
A significant amount of speech is typically required for speaker verification system development and evaluation, especially in the presence of large intersession variability. This paper introduces source and utterance-duration normalized linear discriminant analysis (SUN-LDA) approaches to compensate for session variability in short-utterance i-vector speaker verification systems. Two variations of SUN-LDA are proposed where normalization techniques are used to capture source variation from both short and full-length development i-vectors, one based upon pooling (SUN-LDA-pooled) and the other on concatenation (SUN-LDA-concat) across the duration and source-dependent session variation. Both the SUN-LDA-pooled and SUN-LDA-concat techniques are shown to provide improvement over traditional LDA on NIST 08 truncated 10sec-10sec evaluation conditions, with the highest improvement obtained with the SUN-LDA-concat technique achieving a relative improvement of 8% in EER for mismatched conditions and over 3% for matched conditions over traditional LDA approaches.
17th Annual Conference of the International Speech Communication Association (ISCA), International Speech Communication Association (ISCA), 2016
This paper analyses short utterance probabilistic linear discriminant analysis (PLDA) speaker verification with utterance partitioning and short utterance variance (SUV) modelling approaches. Experimental studies have found that, instead of using a single long utterance as enrolment data, partitioning the long enrolment utterance into multiple short utterances and using the average of the short utterance i-vectors as enrolment data improves Gaussian PLDA (GPLDA) speaker verification. This is because short utterance i-vectors have speaker, session and utterance variations, and the utterance-partitioning approach compensates for the utterance variation. Subsequently, SUV-PLDA is also studied with the utterance partitioning approach, and the utterance-partitioning-based SUV-GPLDA system shows relative improvements of 9% and 16% in EER for the NIST 2008 and NIST 2010 truncated 10sec-10sec evaluation conditions, as the utterance-partitioning approach compensates for the utterance variation and the SUV modelling approach compensates for the mismatch between full-length development data and short-length evaluation data.
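The enrolment procedure described in the abstract above (partition a long utterance, extract one vector per partition, then average) can be sketched roughly as follows. Everything here is an assumption made for illustration: `partition`, `averaged_embedding` and the stand-in `toy_extract` are hypothetical names, and a real system would use a trained i-vector extractor in place of the toy function.

```python
def partition(frames, segment_len):
    """Split a frame sequence into consecutive, non-overlapping segments of
    segment_len frames; an incomplete tail segment is dropped."""
    last_start = len(frames) - segment_len
    return [frames[i:i + segment_len]
            for i in range(0, last_start + 1, segment_len)]

def averaged_embedding(frames, segment_len, extract):
    """Average the per-segment vectors returned by `extract`
    (a placeholder for an i-vector extractor)."""
    segments = partition(frames, segment_len)
    if not segments:
        raise ValueError("utterance is shorter than one segment")
    vectors = [extract(seg) for seg in segments]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def toy_extract(segment):
    """Stand-in extractor: per-dimension mean of the frames. NOT an i-vector."""
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]

# Five hypothetical 2-dimensional feature frames, split into 2-frame segments.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
enrolment_vector = averaged_embedding(frames, 2, toy_extract)
```

Dropping the incomplete tail is one possible design choice; the abstract does not say how leftover frames are handled.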
16th Annual Conference of the International Speech Communication Association, Interspeech 2015, 2015
This paper analyzes the limitations upon the amount of in-domain (NIST SREs) data required for training a probabilistic linear discriminant analysis (PLDA) speaker verification system based on out-domain (Switchboard) total variability subspaces. By limiting the number of speakers, the number of sessions per speaker and the length of active speech per session available in the target domain for PLDA training, we investigated the relative effect of these three parameters on PLDA speaker verification performance in the NIST 2008 and NIST 2010 speaker recognition evaluation datasets. Experimental results indicate that while these parameters depend highly on each other, to beat out-domain PLDA training, more than 10 seconds of active speech should be available for at least 4 sessions/speaker for a minimum of 800 speakers. If further data is available, considerable improvement can be made over solely out-domain PLDA training.
This paper presents the QUT speaker recognition system, as a competing system in the Speakers In The Wild (SITW) speaker recognition challenge. Our proposed system achieved an overall ranking of second place in the main core-core condition evaluations of the SITW challenge. This system uses an i-vector/PLDA approach, with domain adaptation and a deep neural network (DNN) trained to provide feature statistics. The statistics are accumulated by using class posteriors from the DNN, in place of GMM component posteriors in a typical GMM-UBM i-vector/PLDA system. Once the statistics have been collected, the i-vector computation is carried out as in a GMM-UBM based system. We apply domain adaptation to the extracted i-vectors to ensure robustness against dataset variability; PLDA modelling is used to capture speaker and session variability in the i-vector space, and the processed i-vectors are compared using the batch likelihood ratio. The final scores are calibrated to obtain the calibrated likelihood scores, which are then used to carry out speaker recognition and evaluate the performance of the system. Finally, we explore the practical application of our system to the core-multi condition recordings of the SITW data and propose a technique for speaker recognition in recordings with multiple speakers.
In typical x-vector based speaker recognition systems, standard linear discriminant analysis (LDA) is used to transform the x-vector space with the aim of maximizing the between-speaker discriminant information while minimizing the within-speaker variability. For LDA, it is customary to use all the available speakers in the speaker recognition development dataset. In this study, we investigate if it would be more beneficial to estimate the between-speaker discriminant information and the within-speaker variability using the most confusing samples and the most distant samples (from the target speaker mean) respectively in the LDA based channel compensation. The between-speaker variance is estimated using a pairwise approach where the most confusing non-target speaker samples are found based on the Euclidean distance between the speaker mean and adjacent speakers' samples. The within-speaker variance is estimated using the mean of each speaker and the furthermost samples in the speaker sessions. Experimental results demonstrate that the proposed LDA approach for an x-vector based speaker recognition system achieves over 17% relative improvement on EER over standard LDA based x-vector speaker recognition systems on the NIST2010 corext-corext condition.
The performance of state-of-the-art i-vector speaker verification systems relies on a large amount of training data for probabilistic linear discriminant analysis (PLDA) modeling. During the evaluation, it is also crucial that the target condition data is matched well with the development data used for PLDA training. However, in many practical scenarios, these systems have to be developed, and trained, using data that is often outside the domain of the intended application, since the collection of a significant amount of in-domain data is often difficult. Experimental studies have found that PLDA speaker verification performance degrades significantly due to this development/evaluation mismatch. This paper introduces a domain-invariant linear discriminant analysis (DI-LDA) technique for out-domain PLDA speaker verification that compensates domain mismatch in the LDA subspace. We also propose a domain-invariant probabilistic linear discriminant analysis (DI-PLDA) technique for domain mismatch modeling in the PLDA subspace, using only a small amount of in-domain data. In addition, we propose the sequential and score-level combination of DI-LDA and DI-PLDA to further improve out-domain speaker verification performance. Experimental results show the proposed domain mismatch compensation techniques yield at least 27% and 14.5% improvement in equal error rate (EER) over a pooled PLDA system for telephone-telephone and interview-interview conditions, respectively. Finally, we show that the improvement over the baseline pooled system can be attained even when significantly reducing the number of in-domain speakers, down to 30 in most of the evaluation conditions.
In practical applications, speaker verification systems have to be developed and trained using data which is outside the domain of the intended application, as the collection of a significant amount of in-domain data could be difficult. Experimental studies have found that when a GPLDA system is trained using out-domain data, speaker verification performance is significantly affected by the mismatch between development data and evaluation data. This paper proposes several unsupervised inter-dataset variability compensation approaches for the purpose of improving the performance of GPLDA systems trained using out-domain data. We show that when GPLDA is trained using out-domain data, we can improve the performance by as much as 39% by using score normalisation with small amounts of in-domain data. Also, in situations where rich out-domain data and only limited in-domain data are available, a pooled-linear-weighted technique to estimate the GPLDA parameters shows 35% relative improvement in equal error rate (EER) on int–int conditions. We also propose a novel inter-dataset covariance normalization (IDCN) approach to overcome the in- and out-domain data mismatch problem. Our unsupervised IDCN-compensated GPLDA system shows 14% and 25% improvement in EER over out-domain GPLDA speaker verification on tel–tel and int–int training–testing conditions, respectively. We provide intuitive explanations as to why these inter-dataset variability compensation approaches provide improvements to speaker verification accuracy.
This paper studies the performance degradation of the Gaussian probabilistic linear discriminant analysis (GPLDA) speaker verification system when only short-utterance data is used for speaker verification system development. Subsequently, a number of techniques, including utterance partitioning and source-normalised weighted linear discriminant analysis (SN-WLDA) projections, are introduced to improve the speaker verification performance in such conditions. Experimental studies have found that when short utterance data is available for speaker verification development, the GPLDA system overall achieves best performance with a lower number of universal background model (UBM) components. This is a useful observation, as a lower number of UBM components significantly reduces the computational complexity of the speaker verification system. In limited session data conditions, we propose a simple utterance-partitioning technique, which when applied to the LDA-projected GPLDA system shows over 8% relative improvement on EER values over the baseline system on NIST 2008 truncated 10–10 s conditions. We conjecture that this improvement arises from the apparent increase in the number of sessions arising from our partitioning technique and that this helps to better model the GPLDA parameters. Further, partitioning SN-WLDA-projected GPLDA shows over 16% and 6% relative improvement on EER values over LDA-projected GPLDA systems respectively on NIST 2008 truncated 10–10 s interview-interview, and NIST 2010 truncated 10–10 s interview-interview and telephone-telephone conditions.
This paper investigates advanced channel compensation techniques for the purpose of improving i-vector speaker verification performance in the presence of high intersession variability using the NIST 2008 and 2010 SRE corpora. The performance of four channel compensation techniques has been investigated: (a) weighted maximum margin criterion (WMMC), (b) source-normalized WMMC (SN-WMMC), (c) weighted linear discriminant analysis (WLDA), and (d) source-normalized WLDA (SN-WLDA).
We show that, by extracting the discriminatory information between pairs of speakers as well as capturing the source variation information in the development i-vector space, the SN-WLDA based cosine similarity scoring (CSS) i-vector system provides over 20% improvement in EER for NIST 2008 interview and microphone verification and over 10% improvement in EER for NIST 2008 telephone verification, when compared to the SN-LDA based CSS i-vector system. Further, score-level fusion techniques are analyzed to combine the best channel compensation approaches, to provide over 8% improvement in DCF over the best single approach (SN-WLDA) for the NIST 2008 interview/telephone enrolment-verification condition. Finally, we demonstrate that the improvements found in the context of CSS also generalize to state-of-the-art GPLDA with up to 14% relative improvement in EER for NIST SRE 2010 interview and microphone verification and over 7% relative improvement in EER for NIST SRE 2010 telephone verification.
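For readers unfamiliar with cosine similarity scoring (CSS), the sketch below shows the basic scoring step on two vectors. It assumes the channel-compensation projection (for example SN-WLDA) has already been applied to both vectors; the function names and the threshold argument are hypothetical.

```python
import math

def cosine_score(enrol_vec, test_vec):
    """Cosine similarity between an enrolment vector and a test vector."""
    dot = sum(a * b for a, b in zip(enrol_vec, test_vec))
    norm_enrol = math.sqrt(sum(a * a for a in enrol_vec))
    norm_test = math.sqrt(sum(b * b for b in test_vec))
    return dot / (norm_enrol * norm_test)

def accept(enrol_vec, test_vec, threshold):
    """Accept the identity claim when the score reaches the threshold."""
    return cosine_score(enrol_vec, test_vec) >= threshold
```

Because the score depends only on the angle between the two vectors, it is unaffected by their magnitudes; the decision threshold would be tuned on development data.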
This paper proposes techniques to improve the performance of i-vector based speaker verification systems when only short utterances are available. Short-length utterance i-vectors vary with speaker, session variations, and the phonetic content of the utterance. Well established methods such as linear discriminant analysis (LDA), source-normalized LDA (SN-LDA) and within-class covariance normalisation (WCCN) exist for compensating the session variation but we have identified the variability introduced by phonetic content due to utterance variation as an additional source of degradation when short-duration utterances are used. To compensate for utterance variations in short i-vector speaker verification systems using cosine similarity scoring (CSS), we have introduced a short utterance variance normalization (SUVN) technique and a short utterance variance (SUV) modelling approach at the i-vector feature level. A combination of SUVN with LDA and SN-LDA is proposed to compensate the session and utterance variations and is shown to provide improvement in performance over the traditional approach of using LDA and/or SN-LDA followed by WCCN. An alternative approach is also introduced using probabilistic linear discriminant analysis (PLDA) approach to directly model the SUV. The combination of SUVN, LDA and SN-LDA followed by SUV PLDA modelling provides an improvement over the baseline PLDA approach. We also show that for this combination of techniques, the utterance variation information needs to be artificially added to full-length i-vectors for PLDA modelling.
We present details of the QUT submission to the First DIHARD challenge, which is focussed on speaker diarization on a diverse set of challenging domains. Our i-vector/GMM system achieves a diarization error rate (DER) of 33.15% on the track one evaluation data, that is, diarization using gold speech segmentation.
Conference papers by Ahilan Kanagasundaram