Showing 1–28 of 28 results for author: Khurana, S

Searching in archive cs. Search in all archives.
  1. arXiv:2404.02252  [pdf, other]

    cs.SD eess.AS

    SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers

    Authors: Junghyun Koo, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux

    Abstract: We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes. These simple logistic regression probes are trained on the output of each attention head in the transformer using a small dataset of audio examples both exhibiting and missing a specific musical trait (e.g., the presence/absence of dr…

    Submitted 2 April, 2024; originally announced April 2024.

  2. arXiv:2402.17907  [pdf, other]

    eess.AS cs.SD

    NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

    Authors: Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Head-related transfer functions (HRTFs) are important for immersive audio, and their spatial interpolation has been studied to upsample finite measurements. Recently, neural fields (NFs) which map from sound source direction to HRTF have gained attention. Existing NF-based methods focused on estimating the magnitude of the HRTF from a given sound source direction, and the magnitude is converted to…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  3. arXiv:2312.07513  [pdf, other]

    eess.AS cs.SD

    NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection

    Authors: Zexu Pan, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux

    Abstract: Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker speech signal, in which the attention is derived from the cortical activity. This activity is usually recorded using electroencephalography (EEG) devices. Though promising, current methods often have a high speaker confusion error, where the interfering speaker is extracted instead of t…

    Submitted 12 December, 2023; originally announced December 2023.

  4. arXiv:2312.01467  [pdf, other]

    cs.CG

    Online Dominating Set and Coloring for Geometric Intersection Graphs

    Authors: Minati De, Sambhav Khurana, Satyam Singh

    Abstract: We present online deterministic algorithms for minimum coloring and minimum dominating set problems in the context of geometric intersection graphs. We consider a graph parameter: the independent kissing number $ζ$, which is a number equal to `the size of the largest induced star in the graph $-1$'. For a graph with an independent kissing number at most $ζ$, we show that the famous greedy algorith…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: 20 pages, 7 figures and 1 table. arXiv admin note: text overlap with arXiv:2111.07812

  5. arXiv:2312.01059  [pdf, other]

    cs.RO

    Swarm-GPT: Combining Large Language Models with Safe Motion Planning for Robot Choreography Design

    Authors: Aoran Jiao, Tanmay P. Patel, Sanjmi Khurana, Anna-Mariya Korol, Lukas Brunke, Vivek K. Adajania, Utku Culha, Siqi Zhou, Angela P. Schoellig

    Abstract: This paper presents Swarm-GPT, a system that integrates large language models (LLMs) with safe swarm motion planning - offering an automated and novel approach to deployable drone swarm choreography. Swarm-GPT enables users to automatically generate synchronized drone performances through natural language instructions. With an emphasis on safety and creativity, Swarm-GPT addresses a critical gap i…

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: 10 pages, 9 figures

  6. arXiv:2311.01933  [pdf, other]

    cs.LG

    ForecastPFN: Synthetically-Trained Zero-Shot Forecasting

    Authors: Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha Naidu, Colin White

    Abstract: The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very few initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called…

    Submitted 3 November, 2023; originally announced November 2023.

    Journal ref: Thirty-seventh Conference on Neural Information Processing Systems, 2023

  7. arXiv:2310.19644  [pdf, other]

    eess.AS cs.MM

    Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

    Authors: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, Francois G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visual-grounded variant that incorporates the face recording of a target speaker a…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU 2023

  8. arXiv:2310.10604  [pdf, other]

    eess.AS cs.SD

    Generation or Replication: Auscultating Audio Latent Diffusion Models

    Authors: Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. In this work, we make an initial attempt at understanding the inner workings of audio latent diffusion models by investigating how their audio outputs compare with the training data, similar to how a…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  9. arXiv:2309.07478  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Direct Text to Speech Translation System using Acoustic Units

    Authors: Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent, Jarod Duret

    Abstract: This paper proposes a direct text to speech translation system using discrete acoustic units. This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language. Motivated by the success of acoustic units in previous works for direct speech to speech translation systems, we use the same pipeline to…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  10. Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

    Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass

    Abstract: In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sound…

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: Accepted at Interspeech 2023. Code at https://github.com/yuangongnd/whisper-at

    Journal ref: Proceedings of Interspeech 2023

  11. arXiv:2306.00789  [pdf, other]

    cs.CL cs.AI eess.AS eess.SP

    Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

    Authors: Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote, James Glass

    Abstract: Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAM…

    Submitted 25 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

  12. arXiv:2305.12606  [pdf, other]

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  13. arXiv:2211.07795  [pdf, other]

    eess.AS cs.AI cs.LG

    On Unsupervised Uncertainty-Driven Speech Pseudo-Label Filtering and Model Calibration

    Authors: Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, James Glass

    Abstract: Pseudo-label (PL) filtering forms a crucial part of Self-Training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) proceeds by first training a teacher model on source domain labeled data. Then, the teacher model is used to provide PLs for the unlabeled target domain data. Finally, we train a student on augmented labeled and pseudo-labeled data…

    Submitted 14 November, 2022; originally announced November 2022.

  14. arXiv:2205.08180  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embedding at the resolution of an acoustic frame (10-20ms), this work focuses on learning multimodal (speech-text) multilingual speech embedding at the resolution of a s…

    Submitted 17 May, 2022; originally announced May 2022.

  15. arXiv:2204.09904  [pdf, other]

    cs.HC cs.AI cs.CV cs.LG stat.ML

    Infographics Wizard: Flexible Infographics Authoring and Design Exploration

    Authors: Anjul Tyagi, Jian Zhao, Pushkar Patel, Swasti Khurana, Klaus Mueller

    Abstract: Infographics are an aesthetic visual representation of information following specific design principles of human perception. Designing infographics can be a tedious process for non-experts and time-consuming, even for professional designers. With the help of designers, we propose a semi-automated infographic framework for general structured and flow-based infographic design generation. For novice…

    Submitted 8 May, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Preprint of the EUROVIS 22 accepted paper. arXiv admin note: substantial text overlap with arXiv:2108.11914

    ACM Class: H.5.2; I.4.6; J.5

    Journal ref: Computer Graphics Forum, 2022, 41: 121-132

  16. arXiv:2203.06760  [pdf, other]

    cs.SD cs.AI eess.AS

    CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

    Authors: Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass

    Abstract: Audio classification is an active research area with a wide range of applications. Over the past decade, convolutional neural networks (CNNs) have been the de-facto standard building block for end-to-end audio classification models. Recently, neural networks based solely on self-attention mechanisms such as the Audio Spectrogram Transformer (AST) have been shown to outperform CNNs. In this paper,…

    Submitted 13 March, 2022; originally announced March 2022.

  17. Machine Learning: Algorithms, Models, and Applications

    Authors: Jaydip Sen, Sidra Mehtab, Rajdeep Sen, Abhishek Dutta, Pooja Kherwa, Saheel Ahmed, Pranay Berry, Sahil Khurana, Sonali Singh, David W. W Cadotte, David W. Anderson, Kalum J. Ost, Racheal S. Akinbo, Oladunni A. Daramola, Bongs Lainjo

    Abstract: Recent times are witnessing rapid development in machine learning algorithm systems, especially in reinforcement learning, natural language processing, computer and robot vision, image processing, speech, and emotional processing and understanding. In tune with the increasing importance and relevance of machine learning models, algorithms, and their applications, and with the emergence of more inn…

    Submitted 6 January, 2022; originally announced January 2022.

    Comments: Published by IntechOpen, London, UK, in Dec 2021. The book contains 6 chapters spanning 154 pages

  18. arXiv:2111.07812  [pdf, other]

    cs.CG

    Online Dominating Set and Independent Set

    Authors: Minati De, Sambhav Khurana, Satyam Singh

    Abstract: The problems of finding a minimum dominating set and a maximum independent set for graphs in the classical online setup are notorious due to their disastrous $Ω(n)$ lower bound on the competitive ratio, which holds even for interval graphs, where $n$ is the number of vertices. In this paper, inspired by the Newton number, we first introduce the independent kissing number $ζ$ of a graph. We prove that the well known online…

    Submitted 15 November, 2021; originally announced November 2021.

    Comments: 26 pages, 17 figures

  19. Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: We propose a simple and effective cross-lingual transfer learning method to adapt monolingual wav2vec-2.0 models for Automatic Speech Recognition (ASR) in resource-scarce languages. We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages. We improve its performance further via several iterations of Dropout Uncertainty-Driven Self-Training (DUST) by using a modera…

    Submitted 7 October, 2021; originally announced October 2021.

  20. arXiv:2108.11914  [pdf, other]

    cs.HC cs.CV

    User-Centric Semi-Automated Infographics Authoring and Recommendation

    Authors: Anjul Tyagi, Jian Zhao, Pushkar Patel, Swasti Khurana, Klaus Mueller

    Abstract: Designing infographics can be a tedious process for non-experts and time-consuming even for professional designers. Based on the literature and a formative study, we propose a flexible framework for automated and semi-automated infographics design. This framework captures the main design components in infographics and streamlines the generation workflow into three steps, allowing users to control…

    Submitted 27 August, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

  21. arXiv:2106.05933  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

    Authors: Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass

    Abstract: Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted prunin…

    Submitted 26 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

  22. arXiv:2011.13439  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training

    Authors: Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched. In this paper, we show that self-training (ST) combined with an uncertainty-based pseudo-label filtering approach can be effectively used for domain adaptation. We propose DUST, a dropout-based uncertainty-driven self-training technique which uses a…

    Submitted 16 February, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

    Comments: ICASSP 2021

  23. arXiv:2006.02814  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but…

    Submitted 5 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

  24. arXiv:2006.02547  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

    Authors: Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

    Abstract: Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs…

    Submitted 8 September, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

    Comments: Proceedings of Interspeech, 2020

  25. arXiv:2005.08520  [pdf, other]

    cs.LG cs.CL stat.ML

    Robust Training of Vector Quantized Bottleneck Models

    Authors: Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent

    Abstract: In this paper we demonstrate methods for reliable and efficient training of discrete representation using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representat…

    Submitted 18 May, 2020; originally announced May 2020.

    Comments: Published at IJCNN 2020

  26. arXiv:1909.12163  [pdf, other]

    cs.CL

    DARTS: Dialectal Arabic Transcription System

    Authors: Sameer Khurana, Ahmed Ali, James Glass

    Abstract: We present a speech-to-text transcription system, called DARTS, for the low-resource Egyptian Arabic dialect. We analyze the following: transfer learning from the high-resource broadcast domain to the low-resource dialectal domain, and semi-supervised learning where we use in-domain unlabeled audio data collected from YouTube. Key features of our system are: A deep neural network acoustic model that consists…

    Submitted 26 September, 2019; originally announced September 2019.

  27. arXiv:1609.05650  [pdf, other]

    cs.CL

    Multi-view Dimensionality Reduction for Dialect Identification of Arabic Broadcast Speech

    Authors: Sameer Khurana, Ahmed Ali, Steve Renals

    Abstract: In this work, we present a new Vector Space Model (VSM) of speech utterances for the task of spoken dialect identification (DID). Generally, DID systems are built using two sets of features that are extracted from speech utterances: acoustic and phonetic. The acoustic and phonetic features are used to form vector representations of speech utterances in an attempt to encode information about the spoken d…

    Submitted 19 September, 2016; originally announced September 2016.

  28. arXiv:1509.06928  [pdf, ps, other]

    cs.CL

    Automatic Dialect Detection in Arabic Broadcast Speech

    Authors: Ahmed Ali, Najim Dehak, Patrick Cardinal, Sameer Khurana, Sree Harsha Yella, James Glass, Peter Bell, Steve Renals

    Abstract: We investigate different approaches for dialect identification in Arabic broadcast speech, using phonetic and lexical features obtained from a speech recognition system, and acoustic features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/Engli…

    Submitted 10 August, 2016; v1 submitted 23 September, 2015; originally announced September 2015.