Showing 1–15 of 15 results for author: Sak, H

Searching in archive cs.
  1. arXiv:2408.02582  [pdf, other]

    cs.SD cs.AI eess.AS

    Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition

    Authors: Jaeyoung Kim, Han Lu, Soheil Khorram, Anshuman Tripathi, Qian Zhang, Hasim Sak

    Abstract: Modern automatic speech recognition (ASR) systems are typically trained on more than tens of thousands hours of speech data, which is one of the main factors for their great success. However, the distribution of such data is typically biased towards common accents or typical speech patterns. As a result, those systems often poorly perform on atypical accented speech. In this paper, we present acce…

    Submitted 5 August, 2024; originally announced August 2024.

  2. arXiv:2205.14054  [pdf, other]

    cs.LG

    Contrastive Siamese Network for Semi-supervised Speech Recognition

    Authors: Soheil Khorram, Jaeyoung Kim, Anshuman Tripathi, Han Lu, Qian Zhang, Hasim Sak

    Abstract: This paper introduces contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching outputs of two identical transformer encoders. It contains augmented and target branches which are trained by: (1) masking inputs and matching outputs with a cont…

    Submitted 27 May, 2022; originally announced May 2022.

  3. arXiv:2109.11641  [pdf, other]

    eess.AS cs.LG cs.SD

    Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection

    Authors: Wei Xia, Han Lu, Quan Wang, Anshuman Tripathi, Yiling Huang, Ignacio Lopez Moreno, Hasim Sak

    Abstract: In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces…

    Submitted 25 January, 2022; v1 submitted 23 September, 2021; originally announced September 2021.

  4. arXiv:2105.05005  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Reducing Streaming ASR Model Delay with Self Alignment

    Authors: Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak

    Abstract: Reducing prediction delay for streaming end-to-end ASR models with minimal performance regression is a challenging problem. Constrained alignment is a well-known existing approach that penalizes predicted word boundaries using external low-latency acoustic models. On the contrary, recently proposed FastEmit is a sequence-level delay regularization scheme encouraging vocabulary tokens over blanks w…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: submitted to INTERSPEECH 2021

  5. arXiv:2010.03192  [pdf, other]

    cs.SD cs.LG eess.AS

    Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition

    Authors: Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, Hasim Sak

    Abstract: In this paper we present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model. The model is composed of a stack of transformer layers for audio encoding with no lookahead or right context and an additional stack of transformer layers on top trained with variable right context. In inference time, the conte…

    Submitted 7 October, 2020; originally announced October 2020.

  6. arXiv:2002.11268  [pdf, other]

    eess.AS cs.CL cs.SD

    A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition

    Authors: Erik McDermott, Hasim Sak, Ehsan Variani

    Abstract: This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a m…

    Submitted 27 February, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Comments: 8 pages, 4 figures, presented at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)

  7. arXiv:2002.02562  [pdf, other]

    eess.AS cs.CL cs.SD

    Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

    Authors: Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar

    Abstract: In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution ove…

    Submitted 14 February, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

    Comments: This is the final version of the paper submitted to the ICASSP 2020 on Oct 21, 2019

  8. arXiv:1906.07093  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Adversarial Training for Multilingual Acoustic Modeling

    Authors: Ke Hu, Hasim Sak, Hank Liao

    Abstract: Multilingual training has been shown to improve acoustic modeling performance by sharing and transferring knowledge in modeling different languages. Knowledge sharing is usually achieved by using common lower-level layers for different languages in a deep neural network. Recently, the domain adversarial network was proposed to reduce domain mismatch of training data and learn domain-invariant feat…

    Submitted 17 June, 2019; originally announced June 2019.

  9. arXiv:1807.05162  [pdf, other]

    cs.CV cs.LG

    Large-Scale Visual Speech Recognition

    Authors: Brendan Shillingford, Yannis Assael, Matthew W. Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, Marie Mulville, Ben Coppin, Ben Laurie, Andrew Senior, Nando de Freitas

    Abstract: This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable v…

    Submitted 1 October, 2018; v1 submitted 13 July, 2018; originally announced July 2018.

  10. arXiv:1801.00841  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer

    Authors: Kanishka Rao, Haşim Sak, Rohit Prabhavalkar

    Abstract: We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data…

    Submitted 2 January, 2018; originally announced January 2018.

    Comments: In Proceedings of IEEE ASRU 2017

  11. arXiv:1711.07274  [pdf, ps, other]

    cs.CL cs.SD eess.AS stat.ML

    Speech recognition for medical conversations

    Authors: Chung-Cheng Chiu, Anshuman Tripathi, Katherine Chou, Chris Co, Navdeep Jaitly, Diana Jaunzeikare, Anjuli Kannan, Patrick Nguyen, Hasim Sak, Ananth Sankar, Justin Tansuwan, Nathan Wan, Yonghui Wu, Xuedong Zhang

    Abstract: In this work we explored building automatic speech recognition models for transcribing doctor patient conversation. We collected a large scale dataset of clinical conversations (14,000 hr), designed the task to represent the real world scenario, and explored several alignment approaches to iteratively improve data quality. We explored both CTC and LAS systems for building speech recognition model…

    Submitted 20 June, 2018; v1 submitted 20 November, 2017; originally announced November 2017.

    Comments: Interspeech 2018 camera ready

  12. arXiv:1610.09975  [pdf, other]

    cs.CL cs.LG cs.NE

    Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

    Authors: Hagen Soltau, Hank Liao, Hasim Sak

    Abstract: We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to allevia…

    Submitted 31 October, 2016; originally announced October 2016.

  13. arXiv:1603.03185  [pdf, other]

    cs.CL cs.LG cs.SD

    Personalized Speech recognition on mobile devices

    Authors: Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, Carolina Parada

    Abstract: We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its…

    Submitted 11 March, 2016; v1 submitted 10 March, 2016; originally announced March 2016.

  14. arXiv:1507.06947  [pdf, other]

    cs.CL cs.LG cs.NE stat.ML

    Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition

    Authors: Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays

    Abstract: We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initi…

    Submitted 24 July, 2015; originally announced July 2015.

    Comments: To be published in the INTERSPEECH 2015 proceedings

  15. arXiv:1402.1128  [pdf, other]

    cs.NE cs.CL cs.LG stat.ML

    Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

    Authors: Haşim Sak, Andrew Senior, Françoise Beaufays

    Abstract: Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting rec…

    Submitted 5 February, 2014; originally announced February 2014.