
Showing 1–43 of 43 results for author: Woodland, P C

Searching in archive cs.
  1. arXiv:2407.19984

    cs.CL

    Confidence Estimation for Automatic Detection of Depression and Alzheimer's Disease Based on Clinical Interviews

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: Speech-based automatic detection of Alzheimer's disease (AD) and depression has attracted increased attention. Confidence estimation is crucial for a trust-worthy automatic diagnostic system which informs the clinician about the confidence of model predictions and helps reduce the risk of misdiagnosis. This paper investigates confidence estimation for automatic detection of AD and depression based…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2406.04541

    cs.CL eess.AS

    Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: While the neural transducer is popular for online speech recognition, simultaneous speech translation (SST) requires both streaming and re-ordering capabilities. This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for SST, which naturally possesses these two properties. The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressiv…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024 Main Conference

  3. arXiv:2406.00522

    eess.AS cs.SD

    Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

    Authors: Keqi Deng, Guangzhi Sun, Philip C. Woodland

    Abstract: Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in…

    Submitted 1 June, 2024; originally announced June 2024.

  4. arXiv:2405.20064

    eess.AS cs.SD

    1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

    Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

    Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…

    Submitted 30 May, 2024; originally announced May 2024.

  5. arXiv:2402.12862

    cs.CL

    Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation

    Authors: Wen Wu, Bo Li, Chao Zhang, Chung-Cheng Chiu, Qiujia Li, Junwen Bai, Tara N. Sainath, Philip C. Woodland

    Abstract: The subjective perception of emotion leads to inconsistent labels from human annotators. Typically, utterances lacking majority-agreed labels are excluded when training an emotion classifier, which causes problems when encountering ambiguous emotional expressions during testing. This paper investigates three methods to handle ambiguous emotion. First, we show that incorporating utterances without m…

    Submitted 20 February, 2024; originally announced February 2024.

  6. Parameter Efficient Finetuning for Speech Emotion Recognition and Domain Adaptation

    Authors: Nineli Lashkarashvili, Wen Wu, Guangzhi Sun, Philip C. Woodland

    Abstract: Foundation models have shown superior performance for speech emotion recognition (SER). However, given the limited data in emotion corpora, finetuning all parameters of large pre-trained models for SER can be both resource-intensive and susceptible to overfitting. This paper investigates parameter-efficient finetuning (PEFT) for SER. Various PEFT adaptors are systematically studied for both classi…

    Submitted 18 February, 2024; originally announced February 2024.

    Journal ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 10986-10990

  7. arXiv:2312.09100

    eess.AS cs.SD

    FastInject: Injecting Unpaired Text Data into CTC-based ASR training

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the development of self-supervised learning. However, E2E ASR models trained on paired speech-text data often suffer from domain shifts from training to testing. To alleviate this issue, this paper proposes a flat-start joint train…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  8. arXiv:2311.07418

    cs.CL cs.SD eess.AS

    Speech-based Slot Filling using Large Language Models

    Authors: Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang, Milica Gašić, Philip C. Woodland

    Abstract: Recently, advancements in large language models (LLMs) have shown an unprecedented ability across various language tasks. This paper investigates the potential application of LLMs to slot filling with noisy ASR transcriptions, via both in-context learning and task-specific fine-tuning. Dedicated prompt designs and fine-tuning approaches are proposed to improve the robustness of LLMs for slot filli…

    Submitted 13 November, 2023; originally announced November 2023.

  9. arXiv:2310.04791

    eess.AS cs.LG cs.SD

    Conditional Diffusion Model for Target Speaker Extraction

    Authors: Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland

    Abstract: We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: 5 pages, 4 figures, submitted to ICASSP 2024

  10. arXiv:2310.00486

    cs.CL cs.HC cs.LG

    It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation

    Authors: Wen Wu, Wenlin Chen, Chao Zhang, Philip C. Woodland

    Abstract: Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment. Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations, which should be taken into account in modelling to better mimic the way people perceive and interact with the…

    Submitted 30 September, 2023; originally announced October 2023.

    Comments: Code available at: https://github.com/W-Wu/HAS_CNF

  11. arXiv:2308.13345

    eess.AS cs.CL cs.SD

    Decoupled Structure for Improved Adaptability of End-to-End Models

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable n…

    Submitted 25 August, 2023; originally announced August 2023.

  12. Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: Although automatic emotion recognition (AER) has recently drawn significant research interest, most current AER studies use manually segmented utterances, which are usually unavailable for dialogue systems. This paper proposes integrating AER with automatic speech recognition (ASR) and speaker diarisation (SD) in a jointly-trained system. Distinct output layers are built for four sub-tasks includi…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Interspeech 2023

  13. arXiv:2307.01764

    cs.CL

    Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data

    Authors: Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland

    Abstract: Manually annotating fine-grained slot-value labels for task-oriented dialogue (ToD) systems is an expensive and time-consuming endeavour. This motivates research into slot-filling methods that operate with limited amounts of labelled data. Moreover, the majority of current work on ToD is based solely on text as the input modality, neglecting the additional challenges of imperfect automatic speech…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: to submit to CS&L

  14. Estimating the Uncertainty in Emotion Attributes using Deep Evidential Regression

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: In automatic emotion recognition (AER), labels assigned by different human annotators to the same utterance are often inconsistent due to the inherent complexity of emotion and the subjectivity of perception. Though deterministic labels generated by averaging or voting are often used as the ground truth, this ignores the intrinsic uncertainty revealed by the inconsistent labels. This paper proposes…

    Submitted 11 June, 2023; originally announced June 2023.

    Comments: Accepted by ACL 2023

    Journal ref: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023

  15. arXiv:2306.01942

    cs.CL cs.SD eess.AS

    Can Contextual Biasing Remain Effective with Whisper and GPT-2?

    Authors: Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland

    Abstract: End-to-end automatic speech recognition (ASR) and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data. Despite the large amount of training data, infrequent content words that occur in a particular task may still exhibit poor ASR performance, with contextual biasing a possible remedy. This paper investigates the effectiveness of neural c…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: To appear in Interspeech 2023

  16. Self-supervised representations in speech-based depression detection

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). An analysis of SSL representations derived from different layers of pre-trained foundation models is first presented for SDD, which provides insight into suitable indicators for depression detection. Knowledge transfer is the…

    Submitted 6 July, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

  17. arXiv:2303.10917

    eess.AS cs.SD

    Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

    Authors: Xiaoyu Yang, Qiujia Li, Chao Zhang, Philip C. Woodland

    Abstract: Although large foundation models pre-trained by self-supervised learning have achieved state-of-the-art performance in many tasks including automatic speech recognition (ASR), knowledge distillation (KD) is often required in practice to transfer the knowledge learned by large teacher models into much smaller student models with affordable computation and memory costs. This paper proposes a novel t…

    Submitted 20 March, 2023; originally announced March 2023.

  18. arXiv:2302.08579

    eess.AS cs.SD

    Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of paired audio-transcript training data. However, it still suffers from domain shifts from training to testing, and domain adaptation is still challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replac…

    Submitted 14 March, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  19. Distribution-based Emotion Recognition in Conversation

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: Automatic emotion recognition in conversation (ERC) is crucial for emotion-aware conversational artificial intelligence. This paper proposes a distribution-based framework that formulates ERC as a sequence-to-sequence problem for emotion distribution estimation. The inherent ambiguity of emotions and the subjectivity of human perception lead to disagreements in emotion labels, which is handled nat…

    Submitted 9 November, 2022; originally announced November 2022.

    Comments: To appear in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT)

  20. arXiv:2211.02536

    cs.CL cs.AI cs.SD eess.AS

    Biased Self-supervised learning for ASR

    Authors: Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdelrahman Mohamed, Philip C. Woodland

    Abstract: Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Fur…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  21. arXiv:2210.16554

    cs.CL cs.SD eess.AS

    End-to-end Spoken Language Understanding with Tree-constrained Pointer Generator

    Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: End-to-end spoken language understanding (SLU) suffers from the long-tail word problem. This paper exploits contextual biasing, a technique to improve the speech recognition of rare words, in end-to-end SLU systems. Specifically, a tree-constrained pointer generator (TCPGen), a powerful and efficient biasing model component, is studied, which leverages a slot shortlist with corresponding entities…

    Submitted 14 March, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: 5 pages, to appear in ICASSP 2023

  22. arXiv:2210.13576

    cs.SD eess.AS

    Spectral Clustering-aware Learning of Embeddings for Speaker Diarisation

    Authors: Evonne P. C. Lee, Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: In speaker diarisation, speaker embedding extraction models often suffer from the mismatch between their training loss functions and the speaker clustering method. In this paper, we propose the method of spectral clustering-aware learning of embeddings (SCALE) to address the mismatch. Specifically, besides an angular prototypical (AP) loss, SCALE uses a novel affinity matrix loss which directly m…

    Submitted 14 March, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: To appear in ICASSP 2023, 5 pages

  23. arXiv:2207.03852

    eess.AS cs.SD

    Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

    Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland

    Abstract: Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. For speaker diarisation, the tasks of voice activity detection (VAD) and speaker classification…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  24. arXiv:2207.00857

    cs.SD cs.CL eess.AS

    Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition

    Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: Incorporating biasing words obtained as contextual knowledge is critical for many automatic speech recognition (ASR) applications. This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR. By encoding the biasing words in the prefix-tree with a tree-based GNN, lookahead for future wordpieces in end-…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022. arXiv admin note: text overlap with arXiv:2205.09058

  25. arXiv:2205.09058

    cs.CL cs.SD eess.AS

    Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

    Authors: Guangzhi Sun, Chao Zhang, Philip C Woodland

    Abstract: Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained using external contextual information. With only a small overhead in memory use and computation cost, TCPGen can structure thou…

    Submitted 23 May, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

    Comments: This work has been submitted to the IEEE Transactions on Audio, Speech, and Language Processing for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  26. Estimating the Uncertainty in Emotion Class Labels with Utterance-Specific Dirichlet Priors

    Authors: Wen Wu, Chao Zhang, Xixin Wu, Philip C. Woodland

    Abstract: Emotion recognition is a key attribute for artificial intelligence systems that need to naturally interact with humans. However, the task definition is still an open problem due to the inherent ambiguity of emotions. In this paper, a novel Bayesian training loss based on per-utterance Dirichlet prior distributions is proposed for verbal emotion recognition, which models the uncertainty in one-hot…

    Submitted 17 November, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

    Journal ref: IEEE Transactions on Affective Computing (Volume: 14, Issue: 4, 01 Oct.-Dec. 2023)

  27. arXiv:2110.03327

    eess.AS cs.LG

    Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition

    Authors: Qiujia Li, Yu Zhang, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland

    Abstract: As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems. Recent research has shown that model-based confidence estimators have a significant advantage over using the output softmax probabilities. If the input data to the speech recogniser is from mismatched acoustic and linguistic conditions,…

    Submitted 2 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Accepted as a conference paper at ICASSP 2022

  28. arXiv:2109.00627

    cs.CL cs.SD

    Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition

    Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an effi…

    Submitted 17 September, 2021; v1 submitted 1 September, 2021; originally announced September 2021.

    Comments: To appear in ASRU 2021

  29. arXiv:2108.07789

    cs.CL cs.SD eess.AS

    Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition

    Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland

    Abstract: Language models (LMs) pre-trained on massive amounts of text, in particular bidirectional encoder representations from Transformers (BERT), generative pre-training (GPT), and GPT-2, have become a key technology for many natural language processing tasks. In this paper, we present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR). Unlike unidirectiona…

    Submitted 1 October, 2021; v1 submitted 29 July, 2021; originally announced August 2021.

    Comments: To appear in ASRU 2021

  30. arXiv:2103.14152

    eess.AS cs.CL cs.LG cs.SD

    Residual Energy-Based Models for End-to-End Speech Recognition

    Authors: Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland

    Abstract: End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the m…

    Submitted 23 June, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

    Comments: To appear in Proc. Interspeech 2021

  31. arXiv:2103.07554

    cs.LG cs.CL cs.SD eess.AS

    A Distributed Optimisation Framework Combining Natural Gradient with Hessian-Free for Discriminative Sequence Training

    Authors: Adnan Haider, Chao Zhang, Florian L. Kreyssig, Philip C. Woodland

    Abstract: This paper presents a novel natural gradient and Hessian-free (NGHF) optimisation framework for neural network training that can operate efficiently in a distributed manner. It relies on the linear conjugate gradient (CG) algorithm to combine the natural gradient (NG) method with local curvature information from Hessian-free (HF) or other second-order methods. A solution to a numerical issue in CG…

    Submitted 12 March, 2021; originally announced March 2021.

  32. arXiv:2102.06474

    cs.CL cs.AI

    Transformer Language Models with LSTM-based Cross-utterance Information Representation

    Authors: G. Sun, C. Zhang, P. C. Woodland

    Abstract: The effective incorporation of cross-utterance information has the potential to improve language models (LMs) for automatic speech recognition (ASR). To extract more powerful and robust cross-utterance representations for the Transformer LM (TLM), this paper proposes the R-TLM which uses hidden states in a long short-term memory (LSTM) LM. To encode the cross-utterance information, the R-TLM incor…

    Submitted 12 February, 2021; originally announced February 2021.

  33. arXiv:2102.06467

    cs.SD cs.LG eess.AS eess.IV

    Content-Aware Speaker Embeddings for Speaker Diarisation

    Authors: G. Sun, D. Liu, C. Zhang, P. C. Woodland

    Abstract: Recent speaker diarisation systems often convert variable length speech segments into fixed-length vector representations for speaker clustering, which are known as speaker embeddings. In this paper, the content-aware speaker embeddings (CASE) approach is proposed, which extends the input of the speaker classifier to include not only acoustic features but also their corresponding speech content, v…

    Submitted 12 February, 2021; originally announced February 2021.

  34. arXiv:2010.14102

    cs.CL cs.LG cs.SD eess.AS

    Emotion recognition by fusing time synchronous and time asynchronous representations

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: In this paper, a novel two-branch neural network model structure is proposed for multimodal emotion recognition, which consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB). To capture correlations between each word and its acoustic realisation, the TSB combines speech and text modalities at each input window frame and then does pooling across time to form a single embed…

    Submitted 22 July, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6269-6273

  35. arXiv:2010.11428

    eess.AS cs.CL cs.LG

    Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition

    Authors: Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C. Woodland, Liangliang Cao, Trevor Strohman

    Abstract: For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based seq…

    Submitted 23 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  36. arXiv:2009.01008

    cs.CL cs.LG cs.SD eess.AS

    Cross-Utterance Language Models with Acoustic Error Sampling

    Authors: G. Sun, C. Zhang, P. C. Woodland

    Abstract: The effective exploitation of richer contextual information in language models (LMs) is a long-standing research problem for automatic speech recognition (ASR). A cross-utterance LM (CULM) is proposed in this paper, which augments the input to a standard long short-term memory (LSTM) LM with a context vector derived from past and future utterances using an extraction network. The extraction networ…

    Submitted 19 August, 2020; originally announced September 2020.

    Comments: 5 pages

  37. arXiv:2008.03756

    eess.AS cs.SD

    Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings

    Authors: Florian L. Kreyssig, Philip C. Woodland

    Abstract: In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by lev…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted to Interspeech 2020

  38. arXiv:1911.03970

    eess.AS cs.CL cs.LG cs.SD

    Improved Large-margin Softmax Loss for Speaker Diarisation

    Authors: Yassir Fathullah, Chao Zhang, Philip C. Woodland

    Abstract: Speaker diarisation systems nowadays use embeddings generated from speech segments in a bottleneck layer, which need to be discriminative for unseen speakers. It is well-known that large-margin training can improve the generalisation ability to unseen data, and its use in such open-set problems has been widespread. Therefore, this paper introduces a general approach to the large-margin softm…

    Submitted 6 July, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: ICASSP 2020

    Journal ref: ICASSP 2020, Barcelona, Spain, 2020, pp. 7104-7108

  39. arXiv:1910.09703

    eess.AS cs.CL cs.CV cs.LG cs.SD

    Discriminative Neural Clustering for Speaker Diarisation

    Authors: Qiujia Li, Florian L. Kreyssig, Chao Zhang, Philip C. Woodland

    Abstract: In this paper, we propose Discriminative Neural Clustering (DNC) that formulates data clustering with a maximum number of clusters as a supervised sequence-to-sequence learning problem. Compared to traditional unsupervised clustering algorithms, DNC learns clustering patterns from training data without requiring an explicit definition of a similarity measure. An implementation of DNC based on the…

    Submitted 23 November, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

    Comments: Accepted as a conference paper at the 8th IEEE Spoken Language Technology Workshop (SLT 2021)

  40. arXiv:1909.06614

    eess.AS cs.CL cs.LG cs.SD

    Integrating Source-channel and Attention-based Sequence-to-sequence Models for Speech Recognition

    Authors: Qiujia Li, Chao Zhang, Philip C. Woodland

    Abstract: This paper proposes a novel automatic speech recognition (ASR) framework called Integrated Source-Channel and Attention (ISCA) that combines the advantages of traditional systems based on the noisy source-channel model (SC) and end-to-end style systems using attention-based sequence-to-sequence models. The traditional SC system framework includes hidden Markov models and connectionist temporal cla…

    Submitted 1 October, 2019; v1 submitted 14 September, 2019; originally announced September 2019.

    Comments: To appear in Proc. ASRU 2019, December 14-18, 2019, Sentosa, Singapore

  41. arXiv:1810.01873

    cs.LG stat.ML

    Combining Natural Gradient with Hessian Free Methods for Sequence Training

    Authors: Adnan Haider, P. C. Woodland

    Abstract: This paper presents a new optimisation approach to train Deep Neural Networks (DNNs) with discriminative sequence criteria. At each iteration, the method combines information from the Natural Gradient (NG) direction with local curvature information of the error surface that enables better paths on the parameter manifold to be traversed. The method is derived using an alternative derivation of Tayl…

    Submitted 3 October, 2018; originally announced October 2018.

    Comments: in Proc. INTERSPEECH 2018, September 2-6, 2018, Hyderabad, India

  42. arXiv:1804.02204

    cs.CL cs.LG stat.ML

    Sequence Training of DNN Acoustic Models With Natural Gradient

    Authors: Adnan Haider, Philip C. Woodland

    Abstract: Deep Neural Network (DNN) acoustic models often use discriminative sequence training that optimises an objective function that better approximates the word error rate (WER) than frame-based training. Sequence training is normally implemented using Stochastic Gradient Descent (SGD) or Hessian Free (HF) training. This paper proposes an alternative batch style optimisation framework that employs a Na…

    Submitted 6 April, 2018; originally announced April 2018.

    Comments: In Proceedings of IEEE ASRU 2017

  43. arXiv:1610.00277

    cs.CL

    Very Deep Convolutional Neural Networks for Robust Speech Recognition

    Authors: Yanmin Qian, Philip C Woodland

    Abstract: This paper describes the extension and optimization of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the sizes of the filters, pooling operations and input feature maps are all modified: the filter and pooling sizes are reduced and dimensions of input feature maps a…

    Submitted 2 October, 2016; originally announced October 2016.

    Comments: accepted by SLT 2016