Search | arXiv e-print repository

Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction

Authors: Rithik Sachdev, Zhong-Qiu Wang, Chao-Han Huck Yang

Abstract: Building upon the strength of modern large language models (LLMs), generative error correction (GEC) has emerged as a promising paradigm that can elevate the performance of modern automatic speech recognition (ASR) systems. One representative approach is to leverage in-context learning to prompt LLMs so that a better hypothesis can be generated by the LLMs based on a carefully-designed prompt and an $N$-best list of hypotheses produced by ASR systems. However, it is not yet known whether the existing prompts are the most effective ones for the task of post-ASR error correction. In this context, this paper first explores alternative prompts to identify an initial set of effective prompts, and then proposes to employ an evolutionary prompt optimization algorithm to refine the initial prompts. Evaluation results on the CHiME-4 subset of Task 1 of the SLT 2024 GenSEC challenge show the effectiveness and potential of the proposed algorithms.

Submitted 23 July, 2024; originally announced July 2024.

Comments: in submission

arXiv:2407.15778 [pdf, other]

Violating Bell's inequality in gate-defined quantum dots

Authors: Paul Steinacker, Tuomo Tanttu, Wee Han Lim, Nard Dumoulin Stuyck, MengKe Feng, Santiago Serrano, Ensar Vahapoglu, Rocky Y. Su, Jonathan Y. Huang, Cameron Jones, Kohei M. Itoh, Fay E. Hudson, Christopher C. Escott, Andrea Morello, Andre Saraiva, Chih Hwan Yang, Andrew S. Dzurak, Arne Laucht

Abstract: The superior computational power promised by quantum computers utilises the fundamental quantum mechanical principle of entanglement. However, achieving entanglement and verifying that the generated state does not follow the principle of local causality has proven difficult for spin qubits in gate-defined quantum dots, as it requires simultaneously high concurrence values and readout fidelities to break the classical bound imposed by Bell's inequality. Here we employ advanced operational protocols for spin qubits in silicon, such as heralded initialization and calibration via gate set tomography (GST), to reduce all relevant errors and push the fidelities of the full 2-qubit gate set above 99%. We demonstrate a 97.17% Bell state fidelity without correcting for readout errors and violate Bell's inequality with a Bell signal of $S = 2.731$, close to the theoretical maximum of $2\sqrt{2}$. Our measurements exceed the classical limit even at elevated temperatures of 1.1 K or entanglement lifetimes of 100 μs.

Submitted 22 July, 2024; originally announced July 2024.

Comments: 19 pages, 5 main figures, 9 extended data figures

MSC Class: 81P68; 81-05
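
For readers unfamiliar with the Bell signal quoted in the abstract above, the following is a minimal sketch of the standard CHSH combination, evaluated for an ideal singlet state. It is not the authors' measurement protocol: the function names, the choice of analyzer angles, and the use of the textbook correlation $E(a, b) = -\cos(a - b)$ are illustrative assumptions. Only the value $S = 2.731$ is taken from the abstract.

```python
import math


def singlet_correlation(a: float, b: float) -> float:
    """Textbook quantum prediction E(a, b) = -cos(a - b) for an ideal singlet state."""
    return -math.cos(a - b)


def chsh_signal(a: float, a_prime: float, b: float, b_prime: float,
                corr=singlet_correlation) -> float:
    """CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return (corr(a, b) - corr(a, b_prime)
            + corr(a_prime, b) + corr(a_prime, b_prime))


CLASSICAL_BOUND = 2.0                     # any local hidden-variable model obeys |S| <= 2
TSIRELSON_BOUND = 2.0 * math.sqrt(2.0)    # quantum maximum, 2*sqrt(2)

# Analyzer angles that maximize |S| for the ideal singlet (illustrative choice).
s_ideal = abs(chsh_signal(0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4))

# Value reported in the abstract, compared against both bounds.
reported_s = 2.731
fraction_of_max = reported_s / TSIRELSON_BOUND

print(f"ideal |S|        = {s_ideal:.4f}")
print(f"reported S       = {reported_s}")
print(f"fraction of max  = {fraction_of_max:.4f}")
```

Under these assumptions the ideal state saturates the quantum maximum, and the reported signal sits above the classical bound of 2 at roughly 96.6% of $2\sqrt{2}$.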

arXiv:2407.15151 [pdf, other]

Spin Qubits with Scalable milli-kelvin CMOS Control

Authors: Samuel K. Bartee, Will Gilbert, Kun Zuo, Kushal Das, Tuomo Tanttu, Chih Hwan Yang, Nard Dumoulin Stuyck, Sebastian J. Pauka, Rocky Y. Su, Wee Han Lim, Santiago Serrano, Christopher C. Escott, Fay E. Hudson, Kohei M. Itoh, Arne Laucht, Andrew S. Dzurak, David J. Reilly

Abstract: A key virtue of spin qubits is their sub-micron footprint, enabling a single silicon chip to host the millions of qubits required to execute useful quantum algorithms with error correction. With each physical qubit needing multiple control lines, however, a fundamental barrier to scale is the extreme density of connections that bridge quantum devices to their external control and readout hardware. A promising solution is to co-locate the control system proximal to the qubit platform at milli-kelvin temperatures, wired up via miniaturized interconnects. Even so, heat and crosstalk from closely integrated control have the potential to degrade qubit performance, particularly for two-qubit entangling gates based on exchange coupling that are sensitive to electrical noise. Here, we benchmark silicon MOS-style electron spin qubits controlled via heterogeneously-integrated cryo-CMOS circuits with a low enough power density to enable scale-up. Demonstrating that cryo-CMOS can efficiently enable universal logic operations for spin qubits, we go on to show that milli-kelvin control has little impact on the performance of single- and two-qubit gates. Given the complexity of our milli-kelvin CMOS platform, with some 100-thousand transistors, these results open the prospect of scalable control based on the tight packaging of spin qubits with a chiplet style control architecture.

Submitted 21 July, 2024; originally announced July 2024.

arXiv:2407.06103 [pdf, other]

QTRL: Toward Practical Quantum Reinforcement Learning via Quantum-Train

Authors: Chen-Yu Liu, Chu-Hsuan Abraham Lin, Chao-Han Huck Yang, Kuan-Cheng Chen, Min-Hsiu Hsieh

Abstract: Quantum reinforcement learning utilizes quantum layers to process information within a machine learning model. However, both pure and hybrid quantum reinforcement learning face challenges such as data encoding and the use of quantum computers during the inference stage. We apply the Quantum-Train method to reinforcement learning tasks, called QTRL, training the classical policy network model using a quantum machine learning model with polylogarithmic parameter reduction. This QTRL approach eliminates the data encoding issues of conventional quantum machine learning and reduces the training parameters of the corresponding classical policy network. Most importantly, the training result of QTRL is a classical model, meaning the inference stage only requires a classical computer. This is extremely practical and cost-efficient for reinforcement learning tasks, where low-latency feedback from the policy model is essential.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 6 pages, 1 figure

arXiv:2406.13912 [pdf, other]

From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment

Authors: Yusuke Hirota, Ryo Hachiuma, Chao-Han Huck Yang, Yuta Nakashima

Abstract: Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of "gender bias" and "hallucination", showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2405.14161 [pdf, other]

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architectures with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR protects the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency, requiring less than one hour of unlabeled data, and seamless generality to alternative large speech models and speech translation tasks. We aim to open-source our code to the research community.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 23 pages, Preprint

arXiv:2405.06573 [pdf, other]

An Investigation of Incorporating Mamba for Speech Enhancement

Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric-oriented loss functions. SEMamba demonstrates promising results and attains a PESQ score of 3.55 on the VoiceBank-DEMAND dataset. When combined with the perceptual contrast stretching technique, the proposed SEMamba yields a new state-of-the-art PESQ score of 3.69.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.14716 [pdf, other]

Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on test input. Following the assumption that accurate inverse inference probability (likelihood) will result in accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-tasking and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.

Submitted 16 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 17 pages, 6 figures

arXiv:2402.06894 [pdf, other]

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

Abstract: Recent advances in large language models (LLMs) have advanced multilingual speech and machine translation through reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them suboptimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in the N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

Submitted 16 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

Comments: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

arXiv:2402.05457 [pdf, other]

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

Abstract: Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented into an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization that arises from relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.

Submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

arXiv:2402.01496 [pdf]

Constructing 100 MΩ and 1 GΩ Resistance Standards via Star-Mesh Transformations

Authors: Dean G. Jarrett, Albert F. Rigosi, Dominick S. Scaletta, Ngoc Thanh Mai Tran, Heather M. Hill, Alireza R. Panna, Cheng Hsueh Yang, Yanfei Yang, Randolph E. Elmquist, David B. Newell

Abstract: A recent mathematical framework for optimizing resistor networks to achieve values in the MΩ through GΩ levels was employed for two specific cases. Objectives here include proof of concept and identification of possible apparatus limitations for future experiments involving graphene-based quantum Hall array resistance standards. Using fractal-like, or recursive, features of the framework allows one to calculate and implement network designs with substantially lower-valued resistors. The cases of 100 MΩ and 1 GΩ demonstrate that, theoretically, one would not need more than 100 quantum Hall elements to achieve these high resistances.

Submitted 2 February, 2024; originally announced February 2024.
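
To illustrate why star-mesh transformations let low-valued resistors emulate a very high resistance, as the abstract above describes, here is a minimal sketch of the general star-to-mesh formula $R_{ij} = R_i R_j \sum_k (1/R_k)$. The function name and the resistor values are hypothetical and chosen only for the arithmetic. They are not the network designs or element counts from the paper.

```python
def star_to_mesh(star_resistances):
    """Convert a star network into its equivalent mesh.

    star_resistances[i] is the resistor joining terminal i to the hidden
    central node. The returned dict maps each terminal pair (i, j), i < j,
    to the mesh resistor R_ij = R_i * R_j * sum_k(1 / R_k).
    """
    conductance_sum = sum(1.0 / r for r in star_resistances)
    n = len(star_resistances)
    return {
        (i, j): star_resistances[i] * star_resistances[j] * conductance_sum
        for i in range(n)
        for j in range(i + 1, n)
    }


# Hypothetical T-network: two 1 MΩ series arms and a 10 kΩ shunt arm.
mesh = star_to_mesh([1e6, 1e6, 1e4])

# Mesh branch between the two outer terminals (indices 0 and 1).
r_between_outer_terminals = mesh[(0, 1)]
print(f"{r_between_outer_terminals / 1e6:.1f} MΩ")
```

With these assumed values the branch between the outer terminals is about 102 MΩ, even though no physical resistor in the star exceeds 1 MΩ. This is the same arithmetic as the three-terminal form $R_1 + R_2 + R_1 R_2 / R_3$.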

arXiv:2401.10447 [pdf, other]

Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke

Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public LibriSpeech dataset and of 3.67% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in 1-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling.

Submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.10446 [pdf, other]

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm the LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate even with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising.

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

arXiv:2312.15316 [pdf, other]

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning. We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.

Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024. Camera-ready version

arXiv:2312.14378 [pdf, other]

Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowledge transfer from attention matrices of models rooted in high resource modalities, text and images, to those in resource-constrained domains, speech and audio, employing a zero-shot paradigm. MAM reduces the relative Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70%, and relative classification error of an Audio Event Classification (AEC) model by 10.63%. In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning.

Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

arXiv:2311.12159 [pdf, other]

Conditional Modeling Based Automatic Video Summarization

Authors: Jia-Hong Huang, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hung Chen, Marcel Worring

Abstract: The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story. Video summarization methods mainly rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video. There are other non-visual factors, such as interestingness, representativeness, and storyline consistency that should also be considered for generating high-quality video summaries. Current methods do not adequately take into account these non-visual factors, resulting in suboptimal performance. In this work, a new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries. The method utilizes a conditional modeling perspective and introduces multiple meaningful random variables and joint distributions to characterize the key components of video summarization. Helper distributions are employed to improve the training of the model. A conditional attention module is designed to mitigate potential performance degradation in the presence of multi-modal input. The proposed video summarization method incorporates the above innovative design choices that aim to narrow the gap between human-generated and machine-generated video summaries. Extensive experiments show that the proposed approach outperforms existing methods and achieves state-of-the-art performance on commonly used video summarization datasets.

Submitted 20 November, 2023; originally announced November 2023.

Comments: This work has been submitted to the IEEE for possible publication. arXiv admin note: substantial text overlap with arXiv:2305.00455

arXiv:2311.09567 [pdf, other]

Entangling gates on degenerate spin qubits dressed by a global field

Authors: Ingvild Hansen, Amanda E. Seedhouse, Santiago Serrano, Andreas Nickl, MengKe Feng, Jonathan Y. Huang, Tuomo Tanttu, Nard Dumoulin Stuyck, Wee Han Lim, Fay E. Hudson, Kohei M. Itoh, Andre Saraiva, Arne Laucht, Andrew S. Dzurak, Chih Hwan Yang

Abstract: Coherently dressed spins have shown promising results as building blocks for future quantum computers owing to their resilience to environmental noise and their compatibility with global control fields. This mode of operation allows for more amenable qubit architecture requirements and simplifies signal routing on the chip. However, multi-qubit operations, such as qubit addressability and two-qubit gates, are yet to be demonstrated to establish global control in combination with dressed qubits as a viable path to universal quantum computing. Here we demonstrate simultaneous on-resonance driving of degenerate qubits using a global field while retaining addressability for qubits with equal Larmor frequencies. Furthermore, we implement SWAP oscillations during on-resonance driving, constituting the demonstration of driven two-qubit gates. Significantly, our findings highlight the fragility of entangling gates between superposition states and how dressing can increase the noise robustness. These results represent a crucial milestone towards global control operation with dressed qubits. They also open the door to interesting spin physics on degenerate spins.

Submitted 30 November, 2023; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2310.13013 [pdf, other]

Generative error correction for code-switching speech recognition using large language models

Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task owing to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpora. In this work, we propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem. Specifically, we first employ multiple well-trained ASR models for N-best hypotheses generation, with the aim of increasing the diverse and informative elements in the set of hypotheses. Next, we utilize the LLMs to learn the hypotheses-to-transcription (H2T) mapping by adding a trainable low-rank adapter. Such a generative error correction (GER) method directly predicts the accurate transcription according to its expert linguistic knowledge and N-best hypotheses, resulting in a paradigm shift from the traditional language model rescoring or error correction techniques. Experimental evidence demonstrates that GER significantly enhances CS-ASR accuracy, in terms of reduced mixed error rate (MER). Furthermore, LLMs show remarkable data efficiency for H2T learning, providing a potential solution to the data scarcity problem of CS-ASR in low-resource languages.

Submitted 17 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP2024

arXiv:2310.06434 [pdf, other]

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the existing ranking-based rescoring methods, our approach adeptly uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique, demonstrating a relative word error rate (WERR) improvement of 37.66% over the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.

Submitted 16 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023 as main paper. 10 pages. Revised math notations. GitHub: https://github.com/Srijith-rkr/Whispering-LLaMA

arXiv:2309.15701 [pdf, other]

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy that can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of error correction techniques based on LLMs with varying amounts of labeled hypotheses-transcription pairs, which yield a significant word error rate (WER) reduction.
Experimental evidence demonstrates the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, an LLM with a reasonable prompt and its generative capability can even correct those tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.

Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

arXiv:2309.15649 [pdf, other]

doi 10.1109/ASRU57964.2023.10389673

Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke

Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines causal instructions and demonstration to increase its context windows. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs.

Submitted 10 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version

Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

arXiv:2309.15463 [pdf, other]

Tomography of entangling two-qubit logic operations in exchange-coupled donor electron spin qubits

Authors: Holly G. Stemp, Serwan Asaad, Mark R. van Blankenstein, Arjen Vaartjes, Mark A. I. Johnson, Mateusz T. Mądzik, Amber J. A. Heskes, Hannes R. Firgau, Rocky Y. Su, Chih Hwan Yang, Arne Laucht, Corey I. Ostrove, Kenneth M. Rudinger, Kevin Young, Robin Blume-Kohout, Fay E. Hudson, Andrew S. Dzurak, Kohei M. Itoh, Alexander M. Jakob, Brett C. Johnson, David N. Jamieson, Andrea Morello

Abstract: Scalable quantum processors require high-fidelity universal quantum logic operations in a manufacturable physical platform. Donors in silicon provide atomic size, excellent quantum coherence and compatibility with standard semiconductor processing, but no entanglement between donor-bound electron spins has been demonstrated to date. Here we present the experimental demonstration and tomography of universal 1- and 2-qubit gates in a system of two weakly exchange-coupled electrons, bound to single phosphorus donors introduced in silicon by ion implantation. We surprisingly observe that the exchange interaction has no effect on the qubit coherence. We quantify the fidelity of the quantum operations using gate set tomography (GST), and we use the universal gate set to create entangled Bell states of the electron spins, with fidelity ~ 93%, and concurrence 0.91 +/- 0.08. These results form the necessary basis for scaling up donor-based quantum computers.

Submitted 2 March, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.15223 [pdf, other]

doi 10.1109/ASRU57964.2023.10389632

Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastow, Ivan Bulyko

Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limits their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets, with training times decreased by factors between 3.6 and 5.4.

Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

arXiv:2309.12542 [pdf, other]

Spatio-temporal correlations of noise in MOS spin qubits

Authors: Amanda E. Seedhouse, Nard Dumoulin Stuyck, Santiago Serrano, Tuomo Tanttu, Will Gilbert, Jonathan Yue Huang, Fay E. Hudson, Kohei M. Itoh, Arne Laucht, Wee Han Lim, Chih Hwan Yang, Andrew S. Dzurak, Andre Saraiva

Abstract: In quantum computing, characterising the full noise profile of qubits can aid the efforts towards increasing coherence times and fidelities by creating error mitigating techniques specific to the type of noise in the system, or by completely removing the sources of noise. Spin qubits in MOS quantum dots are exposed to noise originating from the complex glassy behaviour of two-level fluctuators, leading to non-trivial correlations between qubit properties both in space and time. With recent engineering progress, large amounts of data are being collected in typical spin qubit device experiments, and it is beneficial to explore data analysis options inspired by fields of research that are experienced in managing large data sets, such as astrophysics, finance and climate science. Here, we propose and demonstrate wavelet-based analysis techniques to decompose signals into both frequency and time components to gain a deeper insight into the sources of noise in our systems. We apply the analysis to a long feedback experiment performed on a state-of-the-art two-qubit system in a pair of SiMOS quantum dots. The observed correlations serve to identify common microscopic causes of noise, as well as to elucidate pathways for multi-qubit operation with a more scalable feedback system.

Submitted 24 September, 2023; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: updated reference

arXiv:2309.12541 [pdf, other]

doi 10.1063/5.0179958

Real-time feedback protocols for optimizing fault-tolerant two-qubit gate fidelities in a silicon spin system

Authors: Nard Dumoulin Stuyck, Amanda E. Seedhouse, Santiago Serrano, Tuomo Tanttu, Will Gilbert, Jonathan Yue Huang, Fay Hudson, Kohei M. Itoh, Arne Laucht, Wee Han Lim, Chih Hwan Yang, Andre Saraiva, Andrew S. Dzurak

Abstract: Recently, several groups have demonstrated two-qubit gate fidelities in semiconductor spin qubit systems above 99%. Achieving this regime of fault-tolerant compatible high fidelities is nontrivial and requires exquisite stability and precise control over the different qubit parameters over an extended period of time. This can be done by efficiently calibrating qubit control parameters against different sources of micro- and macroscopic noise. Here, we present several single- and two-qubit parameter feedback protocols, optimised for and implemented in state-of-the-art fast FPGA hardware. Furthermore, we use wavelet-based analysis on the collected feedback data to gain insight into the different sources of noise in the system. Scalable feedback is an outstanding challenge and the presented implementation and analysis gives insight into the benefits and drawbacks of qubit parameter feedback, as feedback related overhead increases. This work demonstrates a pathway towards robust qubit parameter feedback and systematic noise analysis, crucial for mitigation strategies towards systematic high-fidelity qubit operation compatible with quantum error correction protocols.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2309.07081 [pdf, other]

Can Whisper perform speech-based in-context learning?

Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

Abstract: This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs) with only a small number of labelled speech samples without gradient descent. Language-level adaptation experiments using Chinese dialects showed that when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved using Whisper models of any size on two dialects, which is on average 32.3%. A k-nearest-neighbours-based in-context example selection technique can be applied to further improve the efficiency of SICL, which can increase the average relative WER reduction to 36.4%. The findings are verified using speaker adaptation or continuous speech recognition tasks, and both achieved considerable relative WER reductions. Detailed quantitative analyses are also provided to shed light on SICL's adaptability to phonological variances and dialect-specific lexical nuances.

Submitted 19 March, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted by ICASSP 2024

arXiv:2309.01849 [pdf, other]

Impact of electrostatic crosstalk on spin qubits in dense CMOS quantum dot arrays

Authors: Jesus D. Cifuentes, Tuomo Tanttu, Paul Steinacker, Santiago Serrano, Ingvild Hansen, James P. Slack-Smith, Will Gilbert, Jonathan Y. Huang, Ensar Vahapoglu, Ross C. C. Leon, Nard Dumoulin Stuyck, Kohei Itoh, Nikolay Abrosimov, Hans-Joachim Pohl, Michael Thewalt, Arne Laucht, Chih Hwan Yang, Christopher C. Escott, Fay E. Hudson, Wee Han Lim, Rajib Rahman, Andrew S. Dzurak, Andre Saraiva

Abstract: Quantum processors based on integrated nanoscale silicon spin qubits are a promising platform for highly scalable quantum computation. Current CMOS spin qubit processors consist of dense gate arrays to define the quantum dots, making them susceptible to crosstalk from capacitive coupling between a dot and its neighbouring gates. Small but sizeable spin-orbit interactions can transfer this electrostatic crosstalk to the spin g-factors, creating a dependence of the Larmor frequency on the electric field created by gate electrodes positioned even tens of nanometers apart. By studying the Stark shift from tens of spin qubits measured in nine different CMOS devices, we developed a theoretical framework that explains how electric fields couple to the spin of the electrons in increasingly complex arrays, including those electric fluctuations that limit qubit dephasing times $T_2^*$. The results will aid in the design of robust strategies to scale CMOS quantum technology.

Submitted 4 September, 2023; originally announced September 2023.

Comments: 9 pages, 4 figures

arXiv:2308.02111 [pdf, other]

doi 10.1038/s41586-024-07160-2

High-fidelity operation and algorithmic initialisation of spin qubits above one kelvin

Authors: Jonathan Y. Huang, Rocky Y. Su, Wee Han Lim, MengKe Feng, Barnaby van Straaten, Brandon Severin, Will Gilbert, Nard Dumoulin Stuyck, Tuomo Tanttu, Santiago Serrano, Jesus D. Cifuentes, Ingvild Hansen, Amanda E. Seedhouse, Ensar Vahapoglu, Nikolay V. Abrosimov, Hans-Joachim Pohl, Michael L. W. Thewalt, Fay E. Hudson, Christopher C. Escott, Natalia Ares, Stephen D. Bartlett, Andrea Morello, Andre Saraiva, Arne Laucht, Andrew S. Dzurak, et al. (1 additional author not shown)

Abstract: The encoding of qubits in semiconductor spin carriers has been recognised as a promising approach to a commercial quantum computer that can be lithographically produced and integrated at scale. However, the operation of the large number of qubits required for advantageous quantum applications will produce a thermal load exceeding the available cooling power of cryostats at millikelvin temperatures. As the scale-up accelerates, it becomes imperative to establish fault-tolerant operation above 1 kelvin, where the cooling power is orders of magnitude higher. Here, we tune up and operate spin qubits in silicon above 1 kelvin, with fidelities in the range required for fault-tolerant operation at such temperatures. We design an algorithmic initialisation protocol to prepare a pure two-qubit state even when the thermal energy is substantially above the qubit energies, and incorporate radio-frequency readout to achieve fidelities up to 99.34 per cent for both readout and initialisation. Importantly, we demonstrate a single-qubit Clifford gate fidelity of 99.85 per cent, and a two-qubit gate fidelity of 98.92 per cent. These advances overcome the fundamental limitation that the thermal energy must be well below the qubit energies for high-fidelity operation to be possible, surmounting a major obstacle in the pathway to scalable and fault-tolerant quantum computation.

Submitted 18 August, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

Journal ref: Nature 627, 772-777 (2024)

arXiv:2307.12452 [pdf, other]

Characterizing non-Markovian Quantum Process by Fast Bayesian Tomography

Authors: R. Y. Su, J. Y. Huang, N. Dumoulin Stuyck, M. K. Feng, W. Gilbert, T. J. Evans, W. H. Lim, F. E. Hudson, K. W. Chan, W. Huang, Kohei M. Itoh, R. Harper, S. D. Bartlett, C. H. Yang, A. Laucht, A. Saraiva, T. Tanttu, A. S. Dzurak

Abstract: To push gate performance to levels beyond the thresholds for quantum error correction, it is important to characterize the error sources occurring on quantum gates. However, the characterization of non-Markovian error poses a challenge to current quantum process tomography techniques. Fast Bayesian Tomography (FBT) is a self-consistent gate set tomography protocol that can be bootstrapped from earlier characterization knowledge and be updated in real-time with arbitrary gate sequences. Here we demonstrate how FBT allows for the characterization of key non-Markovian error processes. We introduce two experimental protocols for FBT to diagnose the non-Markovian behavior of two-qubit systems on silicon quantum dots. To increase the efficiency and scalability of the experiment-analysis loop, we develop an online FBT software stack. To reduce experiment cost and analysis time, we also introduce a native readout method and warm boot strategy. Our results demonstrate that FBT is a useful tool for probing non-Markovian errors that can be detrimental to the ultimate realization of fault-tolerant operation on quantum computing.

Submitted 4 October, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

arXiv:2307.01947 [pdf, other]

Causal Video Summarizer for Video Exploration

Authors: Jia-Hong Huang, Chao-Han Huck Yang, Pin-Yu Chen, Andrew Brown, Marcel Worring

Abstract: Recently, video summarization has been proposed as a method to help video exploration. However, traditional video summarization models only generate a fixed video summary which is usually independent of user-specific needs and hence limits the effectiveness of video exploration. Multi-modal video summarization is one of the approaches utilized to address this issue. Multi-modal video summarization has a video input and a text-based query input. Hence, effective modeling of the interaction between a video input and text-based query is essential to multi-modal video summarization. In this work, a new causality-based method named Causal Video Summarizer (CVS) is proposed to effectively capture the interactive information between the video and query to tackle the task of multi-modal video summarization. The proposed method consists of a probabilistic encoder and a probabilistic decoder. Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective, with increases of +5.4% in accuracy and +4.92% in F1-score compared with the state-of-the-art method.

Submitted 4 July, 2023; originally announced July 2023.

Comments: This paper is accepted by IEEE International Conference on Multimedia and Expo (ICME), 2022

arXiv:2306.03741

Classical-to-Quantum Transfer Learning Facilitates Machine Learning with Variational Quantum Circuit

Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh, Hector Zenil, Jesper Tegner

Abstract: While Quantum Machine Learning (QML) is an exciting emerging area, the accuracy of the loss function still needs to be improved by the number of available qubits. Here, we reformulate the QML problem such that the approximation error (representation power) does not depend on the number of qubits. We prove that a classical-to-quantum transfer learning architecture using a Variational Quantum Circuit (VQC) improves the representation and generalization (estimation error) capabilities of the VQC model. We derive analytical bounds for the approximation and estimation error. We show that the architecture of classical-to-quantum transfer learning leverages pre-trained classical generative AI models, making it easier to find the optimal parameters for the VQC in the training stage. To validate our theoretical analysis, we perform experiments on single-dot and double-dot binary classification tasks for charge stability diagrams in semiconductor quantum dots, where the related empirical results support our theoretical findings. Our analytical and empirical results demonstrate the effectiveness of classical-to-quantum transfer learning architecture in realistic tasks. This sets the stage for accelerating QML applications beyond the current limits of available qubits.

Submitted 18 June, 2024; v1 submitted 17 May, 2023; originally announced June 2023.

Comments: The paper needs a major revision before it could be submitted to a new journal, and the authors agree that the latest version could not be open to public at the moment

arXiv:2306.01015 [pdf, other]

doi 10.21437/Interspeech.2023-1079

How to Estimate Model Transferability of Pre-Trained Speech Models?

Authors: Zih-Ching Chen, Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Shuo-Yiin Chang, Rohit Prabhavalkar, Hung-yi Lee, Tara N. Sainath

Abstract: In this work, we introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs) for fine-tuning target tasks. We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates using the extracted representations. Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers by making a temporal independent hypothesis. We evaluate some popular supervised speech models (e.g., Conformer RNN-Transducer) and self-supervised speech models (e.g., HuBERT) in cross-layer and cross-model settings using public data. Experimental results show a high Spearman's rank correlation and low $p$-value between our estimation framework and fine-tuning ground truth. Our proposed transferability framework requires less computational time and resources, making it a resource-saving and time-efficient approach for tuning speech foundation models.

Submitted 5 February, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted to Interspeech. Code is available at: https://github.com/virginiakm1988/LogME-CTC. Fixed a typo

arXiv:2306.00331 [pdf, other]

doi 10.21437/Interspeech.2023-1084

A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

Authors: Pin-Jui Ku, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

Abstract: We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF) domains. The 2-D S4 layer can be considered a particular convolutional layer with an infinite receptive field although it utilizes fewer parameters than a conventional convolutional layer. Evaluated on the VoiceBank-DEMAND data set, when compared with the conventional U-net model based on convolutional layers, the proposed TF-domain S4-based model is 78.6% smaller in size, yet it still achieves competitive results with a PESQ score of 3.15 with data augmentation. By increasing the model size, we can even reach a PESQ score of 3.18.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted to Interspeech 2023. Code will be released at https://github.com/Kuray107/S4ND-U-Net_speech_enhancement

arXiv:2305.16932 [pdf, other]

A Neural State-Space Model Approach to Efficient Speech Separation

Authors: Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng

Abstract: In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach can efficiently model input signals into a format of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique into speech separation tasks, we first decompose the input mixture into multi-scale representations with different resolutions. This mechanism enables S4M to learn globally coherent separation and reconstruction. The experimental results show that S4M performs comparably to other separation backbones in terms of SI-SDRi, while having a much lower model complexity with significantly fewer trainable parameters. In addition, our S4M-tiny model (1.8M parameters) even surpasses attention-based Sepformer (26.0M parameters) in noisy conditions with only 9.2 of multiply-accumulate operations (MACs).

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted by InterSpeech 2023

arXiv:2305.11360 [pdf, other]

doi 10.21437/Interspeech.2023-551

Differentially Private Adapters for Parameter Efficient Acoustic Modeling

Authors: Chun-Wei Ho, Chao-Han Huck Yang, Sabato Marco Siniscalchi

Abstract: In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-trained acoustic model and attain performance superior to DP-based stochastic gradient descent (DPSGD). Next, we insert residual adapters (RA) between layers of the frozen pre-trained acoustic model. The RAs reduce training cost and time significantly with a negligible performance drop. Evaluated on the open-access Multilingual Spoken Words (MLSW) dataset, our solution reduces the number of trainable parameters by 97.5% using the RAs with only a 4% performance drop with respect to fine-tuning the cross-lingual speech classifier while preserving DP guarantees.

Submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023. Code will be available at: https://github.com/Chun-wei-Ho/Private-Speech-Adapter. The authors would like to express their gratitude to Prof. Chin-Hui Lee from Georgia Tech for providing helpful insights and suggestions

arXiv:2305.11320 [pdf, other]

doi 10.21437/Interspeech.2023-1212

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

Authors: Li-Jen Yang, Chao-Han Huck Yang, Jen-Tzung Chien

Abstract: This paper presents a parameter-efficient learning (PEL) approach to develop a low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2% to 0.8% of original trainable parameters to achieve competitive performance in voice synthesis. Motivated by a theoretical foundation of optimal transport (OT), this study carries out PEL for TTS where an auxiliary unsupervised loss based on OT is introduced to maximize a difference between the pre-trained source domain and the (unseen) target domain, in addition to its supervised training loss. Further, we leverage this unsupervised loss refinement to boost system performance via either sliced Wasserstein distance or maximum mean discrepancy. The merit of this work is demonstrated by fulfilling PEL solutions based on residual adapter learning and model reprogramming when evaluating the Mandarin accent adaptation. Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning, and the auxiliary unsupervised loss improves model performance empirically.

Submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023

arXiv:2305.11244 [pdf, other]

doi 10.21437/Interspeech.2023-1407

A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model

Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

Abstract: In this work, we explore Parameter-Efficient-Learning (PEL) techniques to repurpose a General-Purpose-Speech (GSM) model for Arabic dialect identification (ADI). Specifically, we investigate different setups to incorporate trainable features into a multi-layer encoder-decoder GSM formulation under frozen pre-trained settings. Our architecture includes residual adapter and model reprogramming (input-prompting). We design a token-level label mapping to condition the GSM for ADI. This is challenging due to the high variation in vocabulary and pronunciation among the numerous regional dialects. We achieve new state-of-the-art accuracy on the ADI-17 dataset by vanilla fine-tuning. We further reduce the training budgets with the PEL method, which performs within 1.86% accuracy of fine-tuning using only 2.5% of (extra) network trainable parameters. Our study demonstrates how to identify Arabic dialects using a small dataset and limited computation with open source code and pre-trained models.

Submitted 3 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023, 5 pages. Code is available at: https://github.com/Srijith-rkr/KAUST-Whisper-Adapter under MIT license

arXiv:2305.00455 [pdf, other]

Causalainer: Causal Explainer for Automatic Video Summarization

Authors: Jia-Hong Huang, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hung Chen, Marcel Worring

Abstract: The goal of video summarization is to automatically shorten videos such that they convey the overall story without losing relevant information. In many application scenarios, improper video summarization can have a large impact. For example, in forensics, the quality of the generated video summary will affect an investigator's judgment, while in journalism it might yield undesired bias. Because of this, modeling explainability is a key concern. One of the best ways to address the explainability challenge is to uncover the causal relations that steer the process and lead to the result. Current machine learning-based video summarization algorithms learn optimal parameters but do not uncover causal relationships. Hence, they suffer from a relative lack of explainability. In this work, a Causal Explainer, dubbed Causalainer, is proposed to address this issue. Multiple meaningful random variables and their joint distributions are introduced to characterize the behaviors of key components in the problem of video summarization. In addition, helper distributions are introduced to enhance the effectiveness of model training. In visual-textual input scenarios, the extra input can decrease the model performance. A causal semantics extractor is designed to tackle this issue by effectively distilling the mutual information from the visual and textual inputs. Experimental results on commonly used benchmarks demonstrate that the proposed method achieves state-of-the-art performance while being more explainable.

Submitted 30 April, 2023; originally announced May 2023.

Comments: The paper has been accepted by the CVPR Workshop on New Frontiers in Visual Language Reasoning: Compositionality, Prompts, and Causality, 2023

arXiv:2303.14864 [pdf, other]

doi 10.1038/s41467-024-48557-x

Bounds to electron spin qubit variability for scalable CMOS architectures

Authors: Jesús D. Cifuentes, Tuomo Tanttu, Will Gilbert, Jonathan Y. Huang, Ensar Vahapoglu, Ross C. C. Leon, Santiago Serrano, Dennis Otter, Daniel Dunmore, Philip Y. Mai, Frédéric Schlattner, MengKe Feng, Kohei Itoh, Nikolay Abrosimov, Hans-Joachim Pohl, Michael Thewalt, Arne Laucht, Chih Hwan Yang, Christopher C. Escott, Wee Han Lim, Fay E. Hudson, Rajib Rahman, Andrew S. Dzurak, Andre Saraiva

Abstract: Spins of electrons in CMOS quantum dots combine exquisite quantum properties and scalable fabrication. In the age of quantum technology, however, the metrics that crowned Si/SiO$_2$ as the microelectronics standard need to be reassessed with respect to their impact upon qubit performance. We chart the spin qubit variability due to the unavoidable atomic-scale roughness of the Si/SiO$_2$ interface, compiling experiments in 12 devices, and developing theoretical tools to analyse these results. Atomistic tight binding and path integral Monte Carlo methods are adapted for describing fluctuations in devices with millions of atoms by directly analysing their wavefunctions and electron paths instead of their energy spectra. We correlate the effect of roughness with the variability in qubit position, deformation, valley splitting, valley phase, spin-orbit coupling and exchange coupling. These variabilities are found to be bounded and lie within the tolerances for scalable architectures for quantum computing as long as robust control methods are incorporated.

Submitted 5 July, 2024; v1 submitted 26 March, 2023; originally announced March 2023.

Comments: 20 pages, 8 figures

Journal ref: Nat Commun 15, 4299 (2024)

arXiv:2303.04090 [pdf, other]

Assessment of error variation in high-fidelity two-qubit gates in silicon

Authors: Tuomo Tanttu, Wee Han Lim, Jonathan Y. Huang, Nard Dumoulin Stuyck, Will Gilbert, Rocky Y. Su, MengKe Feng, Jesus D. Cifuentes, Amanda E. Seedhouse, Stefan K. Seritan, Corey I. Ostrove, Kenneth M. Rudinger, Ross C. C. Leon, Wister Huang, Christopher C. Escott, Kohei M. Itoh, Nikolay V. Abrosimov, Hans-Joachim Pohl, Michael L. W. Thewalt, Fay E. Hudson, Robin Blume-Kohout, Stephen D. Bartlett, Andrea Morello, Arne Laucht, Chih Hwan Yang , et al. (2 additional authors not shown)

Abstract: Achieving high-fidelity entangling operations between qubits consistently is essential for the performance of multi-qubit systems and is a crucial factor in achieving fault-tolerant quantum processors. Solid-state platforms are particularly exposed to errors due to materials-induced variability between qubits, which leads to performance inconsistencies. Here we study the errors in a spin qubit processor, tying them to their physical origins. We leverage this knowledge to demonstrate consistent and repeatable operation with above 99% fidelity of two-qubit gates in the technologically important silicon metal-oxide-semiconductor (SiMOS) quantum dot platform. We undertake a detailed study of these operations by analysing the physical errors and fidelities in multiple devices through numerous trials and extended periods to ensure that we capture the variation and the most common error types. Physical error sources include the slow nuclear and electrical noise on single qubits and contextual noise. The identification of the noise sources can be used to maintain performance within tolerance as well as inform future device fabrication. Furthermore, we investigate the impact of qubit design, feedback systems, and robust gates on implementing scalable, high-fidelity control strategies. These results are achieved by using three different characterization methods; we measure entangling gate fidelities ranging from 96.8% to 99.8%. Our analysis tools identify the causes of qubit degradation and offer ways to understand their physical mechanisms. These results highlight both the capabilities and challenges for the scaling up of silicon spin-based qubits into full-scale quantum processors.

Submitted 15 March, 2024; v1 submitted 7 March, 2023; originally announced March 2023.

arXiv:2303.01660 [pdf, other]

doi 10.1103/PhysRevA.108.012426

Accessing the Full Capabilities of Filter Functions: A Tool for Detailed Noise and Control Susceptibility Analysis

Authors: Ingvild Hansen, Amanda E. Seedhouse, Andre Saraiva, Andrew S. Dzurak, Chih Hwan Yang

Abstract: The filter function formalism from quantum control theory is typically used to determine the noise susceptibility of pulse sequences by looking at the overlap between the filter function of the sequence and the noise power spectral density. Importantly, the square modulus of the filter function is used for this method, hence directional and phase information is lost. In this work, we take advantage of the full filter function including directional and phase information. By decomposing the filter function with phase preservation before taking the modulus, we are able to consider the contributions to $x$-, $y$- and $z$-rotation separately. Continuously driven systems provide noise protection in the form of dynamical decoupling by cancelling low-frequency noise; however, generating control pulses synchronously with an arbitrary driving field is not trivial. Using the decomposed filter function, we look at the controllability of a system under arbitrary driving fields, as well as the noise susceptibility, and also relate the filter function to the geometric formalism.

Submitted 2 March, 2023; originally announced March 2023.

Journal ref: Phys. Rev. A 108, 012426 (2023)

arXiv:2301.07851 [pdf, other]

doi 10.1109/ICASSP49357.2023.10094903

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N. Sainath, Trevor Strohman

Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to achieve competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.

Submitted 18 January, 2023; originally announced January 2023.

Comments: Submitted to ICASSP 2023. The project was initiated in May 2022 during a research internship at Google Research

arXiv:2211.01317 [pdf, other]

Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming

Authors: Yun-Ning Hung, Chao-Han Huck Yang, Pin-Yu Chen, Alexander Lerch

Abstract: Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a pre-trained model from a source domain to a target domain by modifying the input of a frozen pre-trained model. In addition to the known, input-independent, reprogramming method, we propose an advanced reprogramming paradigm: Input-dependent NMR, to increase adaptability to complex input data such as musical audio. Experimental results suggest that a neural model pre-trained on large-scale datasets can successfully perform music genre classification by using this reprogramming method. The two proposed Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a small genre classification dataset.

Submitted 3 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted to IEEE ICASSP 2023. The implementation is available at https://github.com/biboamy/music-repro

arXiv:2211.01263 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095142

A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

Authors: Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Tara N. Sainath, Sabato Marco Siniscalchi, Chin-Hui Lee

Abstract: We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers. Experimental results on challenging spoken command recognition tasks for a few low-resource languages, such as Arabic, Georgian, Chuvash, and Lithuanian, show that the proposed QKL-based hybrid approach attains good improvements over existing classical and quantum solutions.

Submitted 2 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.01189 [pdf, other]

Inference and Denoise: Causal Inference-based Neural Speech Enhancement

Authors: Tsun-An Hsieh, Chao-Han Huck Yang, Pin-Yu Chen, Sabato Marco Siniscalchi, Yu Tsao

Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use the presence of noise as guidance for EM selection during training, and the noise detector selects the enhancement module according to the prediction of the presence of noise for each frame. Moreover, we derive an SE-specific average treatment effect to quantify the causal effect adequately. Experimental evidence demonstrates that CISE outperforms a non-causal mask-based SE approach in the studied settings and has better performance and efficiency than more complex SE models.

Submitted 2 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.00887 [pdf, other]

Certified Robustness of Quantum Classifiers against Adversarial Examples through Quantum Noise

Authors: Jhih-Cing Huang, Yu-Lin Tsai, Chao-Han Huck Yang, Cheng-Fang Su, Chia-Mu Yu, Pin-Yu Chen, Sy-Yen Kuo

Abstract: Recently, quantum classifiers have been found to be vulnerable to adversarial attacks, in which quantum classifiers are deceived by imperceptible noises, leading to misclassification. In this paper, we propose the first theoretical study demonstrating that adding quantum random rotation noise can improve robustness in quantum classifiers against adversarial attacks. We link the definition of differential privacy and show that the quantum classifier trained with the natural presence of additive noise is differentially private. Finally, we derive a certified robustness bound to enable quantum classifiers to defend against adversarial examples, supported by experimental results simulated with noises from IBM's 7-qubit device.

Submitted 28 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted to IEEE ICASSP 2023

arXiv:2210.06382 [pdf, other]

An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition

Authors: Chao-Han Huck Yang, Jun Qi, Sabato Marco Siniscalchi, Chin-Hui Lee

Abstract: We propose an ensemble learning framework with Poisson sub-sampling to effectively train a collection of teacher models to issue some differential privacy (DP) guarantee for training data. Through boosting under DP, a student model derived from the training data suffers little model degradation from the models trained with no privacy protection. Our proposed solution leverages two mechanisms, namely: (i) a privacy budget amplification via Poisson sub-sampling to train a target prediction model that requires less noise to achieve the same level of privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework that introduces DP-preserving noise at the output of the teacher models and transfers DP-preserving properties via noisy labels. Privacy-preserving student models are then trained with the noisy labels to learn the knowledge with DP-protection from the teacher model ensemble. Experimental evidence on spoken command recognition and continuous speech recognition of Mandarin speech shows that our proposed framework greatly outperforms existing DP-preserving algorithms in both speech processing tasks.

Submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to ISCA, ISCSLP 2022, Singapore. 5 Pages

arXiv:2210.05614 [pdf, other]

An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

Authors: Chao-Han Huck Yang, I-Fan Chen, Andreas Stolcke, Sabato Marco Siniscalchi, Chin-Hui Lee

Abstract: Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilities to improve ASR accuracy when dealing with the noise effects controlled by small values of $\varepsilon$. We extend PATE learning to work with dynamic patterns, namely speech utterances, and perform a first experimental demonstration that it prevents acoustic data leakage in ASR training. We evaluate three end-to-end deep models, including LAS, hybrid CTC/attention, and RNN transducer, on the open-source LibriSpeech and TIMIT corpora. PATE learning-enhanced ASR models outperform the benchmark DP-SGD mechanisms, especially under strict DP budgets, giving relative word error rate reductions between 26.2% and 27.5% for an RNN transducer model evaluated with LibriSpeech. We also introduce a DP-preserving ASR solution for pretraining on public speech corpora.

Submitted 13 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: 5 pages. Accepted to IEEE SLT 2022. A first version draft was finished in Aug 2021

arXiv:2208.14671 [pdf, other]

doi 10.1103/PhysRevA.108.022606

High Fidelity Control of a Nitrogen-Vacancy Spin Qubit at Room Temperature using the SMART Protocol

Authors: Hyma H. Vallabhapurapu, Ingvild Hansen, Chris Adambukulam, Rainer Stohr, Andrej Denisenko, Chih Hwan Yang, Arne Laucht

Abstract: A practical implementation of a quantum computer requires robust qubits that are protected against their noisy environment. Dynamical decoupling techniques have been successfully used in the past to offer protected high-fidelity gate operations in negatively-charged Nitrogen-Vacancy (NV-) centers in diamond, albeit under specific conditions with the intrinsic nitrogen nuclear spin initialised. In this work, we show how the SMART protocol, an extension of the dressed-qubit concept, can be implemented for continuous protection to offer Clifford gate fidelities compatible with fault-tolerant schemes, whilst prolonging the coherence time of a single NV- qubit at room temperature. We show an improvement in the average Clifford gate fidelity from $0.940\pm0.005$ for the bare qubit to $0.993\pm0.002$ for the SMART qubit, with the nitrogen nuclear spin in a random orientation. We further show a $\gtrsim$ 30 times improvement in the qubit coherence times compared to the bare qubit.

Submitted 9 September, 2022; v1 submitted 31 August, 2022; originally announced August 2022.

Comments: Minor changes. Updated figures, some text and added more references

arXiv:2208.04724 [pdf, other]

doi 10.1002/adma.202208557

Jellybean quantum dots in silicon for qubit coupling and on-chip quantum chemistry

Authors: Zeheng Wang, MengKe Feng, Santiago Serrano, William Gilbert, Ross C. C. Leon, Tuomo Tanttu, Philip Mai, Dylan Liang, Jonathan Y. Huang, Yue Su, Wee Han Lim, Fay E. Hudson, Christopher C. Escott, Andrea Morello, Chih Hwan Yang, Andrew S. Dzurak, Andre Saraiva, Arne Laucht

Abstract: The small size and excellent integrability of silicon metal-oxide-semiconductor (SiMOS) quantum dot spin qubits make them an attractive system for mass-manufacturable, scaled-up quantum processors. Furthermore, classical control electronics can be integrated on-chip, in-between the qubits, if an architecture with sparse arrays of qubits is chosen. In such an architecture qubits are either transported across the chip via shuttling, or coupled via mediating quantum systems over short-to-intermediate distances. This paper investigates the charge and spin characteristics of an elongated quantum dot -- a so-called jellybean quantum dot -- for the prospects of acting as a qubit-qubit coupler. Charge transport, charge sensing and magneto-spectroscopy measurements are performed on a SiMOS quantum dot device at mK temperature, and compared to Hartree-Fock multi-electron simulations. At low electron occupancies where disorder effects and strong electron-electron interaction dominate over the electrostatic confinement potential, the data reveals the formation of three coupled dots, akin to a tunable, artificial molecule. One dot is formed centrally under the gate and two are formed at the edges. At high electron occupancies, these dots merge into one large dot with well-defined spin states, verifying that jellybean dots have the potential to be used as qubit couplers in future quantum computing architectures.

Submitted 8 August, 2022; originally announced August 2022.

Showing 1–50 of 138 results for author: Yang, C H