RODA: Reverse Operation Based Data Augmentation for Solving Math Word Problems
Automatically solving math word problems is a critical task in the field of natural language processing. Recent models have reached their performance bottleneck and require more high-quality data for training. We propose a novel data augmentation method ...
Scalable and Efficient Neural Speech Coding: A Hybrid Design
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (...
Text Generation From Data With Dynamic Planning
Transcribing structural data into readable text (data-to-text) is a fundamental language generation task. One of its challenges is to plan the input records for text realization. Recent works tackle this problem with a static planner, which performs ...
Occlusion Effect Cancellation in Headphones and Hearing Devices—The Sister of Active Noise Cancellation
The perception of one’s own voice influences the acceptance of hearing devices, such as headphones, headsets or hearing aids. When these devices fully or partially occlude the ear canal, the wearer’s own voice sounds boomy or like talking in ...
Which Apple Keeps Which Doctor Away? Colorful Word Representations With Visual Oracles
Recent pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level context for modeling. Although the PrLMs generally provide more effective contextualized word representations ...
Multi-Source Domain Adaptation for Text-Independent Forensic Speaker Recognition
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model learned from large-scale data toward task-specific small-scale data scenarios. However, previous studies focus on single domain ...
Unsupervised Character Embedding Correction and Candidate Word Denoising
In this paper, we take Indonesian as the research object, and propose a multiple filter correction framework (MFCF). The main idea of MFCF is to remove noise from candidate words to increase the probability of correct words being selected. In MFCF, we use ...
Extractive Dialogue Summarization Without Annotation Based on Distantly Supervised Machine Reading Comprehension in Customer Service
Given a long dialogue, a dialogue summarization system aims to produce a shorter highlight that retains the important information in the original text. In customer service scenarios, the summaries of most dialogues between an agent and a user focus ...
Efficient Combinatorial Optimization for Word-Level Adversarial Textual Attack
Over the past few years, various word-level textual attack approaches have been proposed to reveal the vulnerability of deep neural networks used in natural language processing. Typically, these approaches involve an important optimization step to ...
Comparison of Feature Extraction Methods for Sound-Based Classification of Honey Bee Activity
Honey bees are one of the most important insects on the planet since they play a key role in the pollination services of both cultivated and spontaneous flora. Recent years have seen an increase in bee mortality which points out the necessity of intensive ...
Enhancing Segment-Based Speech Emotion Recognition by Iterative Self-Learning
Despite the widespread utilization of deep neural networks (DNNs) for speech emotion recognition (SER), they are severely restricted due to the paucity of labeled data for training. Recently, segment-based approaches for SER have been evolving, which ...
Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models
We investigate the problem of speaker independent acoustic-to-articulatory inversion (AAI) in noisy conditions within the deep neural network (DNN) framework. In contrast with recent results in the literature, we argue that a DNN vector-to-vector ...
Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models
Although Long Short-Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how best offline systems can be adapted to work with them under the streaming setup. After gaining considerable experience on this ...
End-to-End Neural Based Modification of Noisy Speech for Speech-in-Noise Intelligibility Improvement
Intelligibility of speech can be significantly reduced when it is presented in adverse near-end listening conditions, like background noise. Multiple approaches have been suggested to improve the perception of speech in such conditions. However, most of ...
VACE-WPE: Virtual Acoustic Channel Expansion Based on Neural Networks for Weighted Prediction Error-Based Speech Dereverberation
Speech dereverberation is an important issue for many real-world speech processing applications. Among the techniques developed, the weighted prediction error (WPE) algorithm has been widely adopted and advanced over the last decade, which blindly cancels ...
Phone-Level Prosody Modelling With GMM-Based MDN for Diverse and Controllable Speech Synthesis
Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling with phone-level prosody distribution has been investigated to generate different prosody patterns, the diversity of the generated speech ...
Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning
Previous works have shown that automatic speaker verification (ASV) is seriously vulnerable to malicious spoofing attacks, such as replay, synthetic speech, and recently emerged adversarial attacks. Great efforts have been dedicated to defending ASV ...
Multi-View Speech Emotion Recognition Via Collective Relation Construction
Automatic emotion recognition from speech plays a fundamental role towards advanced emotional intelligence in human-machine interaction systems. The discriminative knowledge from speech for effective emotion recognition may come from multiple physical ...
Learning Phone Recognition From Unpaired Audio and Phone Sequences Based on Generative Adversarial Network
ASR has been shown to achieve great performance recently. However, most ASR systems rely on massive paired data, which is not feasible for low-resource languages worldwide. This paper investigates how to learn directly from unpaired phone sequences and speech ...
Word-Region Alignment-Guided Multimodal Neural Machine Translation
We propose word-region alignment-guided multimodal neural machine translation (MNMT), a novel model for MNMT that links the semantic correlation between textual and visual modalities using word-region alignment (WRA). Existing studies on MNMT have mainly ...
Syntax-Aware Multi-Spans Generation for Reading Comprehension
This paper presents a novel method to generate answers for non-extraction machine reading comprehension (MRC) tasks whose answers cannot be simply extracted as one span from the given passages. Using a pointer network-style extractive decoder for such ...
DUMA: Reading Comprehension With Transposition Thinking
Multi-choice Machine Reading Comprehension (MRC) requires models to decide the correct answer from a set of answer options when given a passage and a question. Thus, in addition to a powerful Pre-trained Language Model (PrLM) as an encoder, multi-choice ...
Diverse Distractor Generation for Constructing High-Quality Multiple Choice Questions
The distractor generation task aims to generate incorrect options (i.e., distractors) for multiple choice questions from an article. Existing methods for this task often utilize a standard encoder-decoder framework. However, these methods often tend to ...
A Parametric Unconstrained Beamformer Based Binaural Noise Reduction for Assistive Hearing
For hearing-impaired listeners, it is required not only to enhance the target speech by suppressing ambient noises, but also to preserve the binaural cues of important directional sources, such that a complete spatial awareness of the acoustic scene is ...
Music Emotion Recognition: Intention of Composers-Performers Versus Perception of Musicians, Non-Musicians, and Listening Machines
This paper investigates to what extent state-of-the-art machine learning methods are effective in classifying emotions in the context of individual musical instruments, and how their performance compares with that of musically trained and untrained listeners. To ...
Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition
Cross-lingual speech adaptation aims to solve the problem of leveraging multiple rich-resource languages to build models for a low-resource target language. Since the low-resource language has limited training data, speech recognition models can easily ...
Integrating Prior Translation Knowledge Into Neural Machine Translation
Neural machine translation (NMT), which is an encoder-decoder joint neural language model with an attention mechanism, has achieved impressive results on various machine translation tasks in the past several years. However, the language model attribute of ...
Alleviating ASR Long-Tailed Problem by Decoupling the Learning of Representation and Classification
Recently, we have witnessed excellent improvement of end-to-end (E2E) automatic speech recognition (ASR). However, how to tackle the long-tailed data distribution problem while maintaining E2E ASR models' performance for high-frequency tokens is still ...
HPSG-Inspired Joint Neural Constituent and Dependency Parsing in O($n^3$) Time Complexity
Constituent and dependency parsing, the two classic forms of syntactic parsing, have been found to benefit from joint training and decoding under a uniform formalism, inspired by Head-driven Phrase Structure Grammar (HPSG). We thus refer to this joint ...
Use of Speaker Recognition Approaches for Learning and Evaluating Embedding Representations of Musical Instrument Sounds
Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic ...