-
SHANGUS: Deep Reinforcement Learning Meets Heuristic Optimization for Speedy Frontier-Based Exploration of Autonomous Vehicles in Unknown Spaces
Authors:
Seunghyeop Nam,
Tuan Anh Nguyen,
Eunmi Choi,
Dugki Min
Abstract:
This paper introduces SHANGUS, an advanced framework combining Deep Reinforcement Learning (DRL) with heuristic optimization to improve frontier-based exploration efficiency in unknown environments, particularly for intelligent vehicles in autonomous air services, search and rescue operations, and space exploration robotics. SHANGUS harnesses DRL's adaptability and heuristic prioritization, marked…
▽ More
This paper introduces SHANGUS, an advanced framework combining Deep Reinforcement Learning (DRL) with heuristic optimization to improve frontier-based exploration efficiency in unknown environments, particularly for intelligent vehicles in autonomous air services, search and rescue operations, and space exploration robotics. SHANGUS harnesses DRL's adaptability and heuristic prioritization, markedly enhancing exploration efficiency, reducing completion time, and minimizing travel distance. The strategy involves a frontier selection node to identify unexplored areas and a DRL navigation node using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for robust path planning and dynamic obstacle avoidance. Extensive experiments in ROS2 and Gazebo simulation environments show SHANGUS surpasses representative traditional methods like the Nearest Frontier (NF), Novel Frontier-Based Exploration Algorithm (CFE), and Goal-Driven Autonomous Exploration (GDAE) algorithms, especially in complex scenarios, excelling in completion time, travel distance, and exploration rate. This scalable solution is suitable for real-time autonomous navigation in fields such as industrial automation, autonomous driving, household robotics, and space exploration. Future research will integrate additional sensory inputs and refine heuristic functions to further boost SHANGUS's efficiency and robustness.
△ Less
Submitted 26 July, 2024;
originally announced July 2024.
-
Delay-Aware Robust Edge Network Hardening Under Decision-Dependent Uncertainty
Authors:
Jiaming Cheng,
Duong Thuy Anh Nguyen,
Ni Trieu,
Duong Tung Nguyen
Abstract:
Edge computing promises to offer low-latency and ubiquitous computation to numerous devices at the network edge. For delay-sensitive applications, link delays can have a direct impact on service quality. These delays can fluctuate drastically over time due to various factors such as network congestion, changing traffic conditions, cyberattacks, component failures, and natural disasters. Thus, it i…
▽ More
Edge computing promises to offer low-latency and ubiquitous computation to numerous devices at the network edge. For delay-sensitive applications, link delays can have a direct impact on service quality. These delays can fluctuate drastically over time due to various factors such as network congestion, changing traffic conditions, cyberattacks, component failures, and natural disasters. Thus, it is crucial to efficiently harden the edge network to mitigate link delay variation as well as ensure a stable and improved user experience. To this end, we propose a novel robust model for optimal edge network hardening, considering the link delay uncertainty. Departing from the existing literature that treats uncertainties as exogenous, our model incorporates an endogenous uncertainty set to properly capture the impact of hardening and workload allocation decisions on link delays. However, the endogenous set introduces additional complexity to the problem due to the interdependence between decisions and uncertainties. We present two efficient methods to transform the problem into a solvable form. Extensive numerical results are shown to demonstrate the effectiveness of the proposed approach.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Q-learning-based Opportunistic Communication for Real-time Mobile Air Quality Monitoring Systems
Authors:
Trung Thanh Nguyen,
Truong Thao Nguyen,
Dinh Tuan Anh Nguyen,
Thanh Hung Nguyen,
Phi Le Nguyen
Abstract:
We focus on real-time air quality monitoring systems that rely on devices installed on automobiles in this research. We investigate an opportunistic communication model in which devices can send the measured data directly to the air quality server through a 4G communication channel or via Wi-Fi to adjacent devices or the so-called Road Side Units deployed along the road. We aim to reduce 4G costs…
▽ More
We focus on real-time air quality monitoring systems that rely on devices installed on automobiles in this research. We investigate an opportunistic communication model in which devices can send the measured data directly to the air quality server through a 4G communication channel or via Wi-Fi to adjacent devices or the so-called Road Side Units deployed along the road. We aim to reduce 4G costs while assuring data latency, where the data latency is defined as the amount of time it takes for data to reach the server. We propose an offloading scheme that leverages Q-learning to accomplish the purpose. The experiment results show that our offloading method significantly cuts down around 40-50% of the 4G communication cost while keeping the latency of 99.5% packets smaller than the required threshold.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
AAM-VDT: Vehicle Digital Twin for Tele-Operations in Advanced Air Mobility
Authors:
Tuan Anh Nguyen,
Taeho Kwag,
Vinh Pham,
Viet Nghia Nguyen,
Jeongseok Hyun,
Minseok Jang,
Jae-Woo Lee
Abstract:
This study advanced tele-operations in Advanced Air Mobility (AAM) through the creation of a Vehicle Digital Twin (VDT) system for eVTOL aircraft, tailored to enhance remote control safety and efficiency, especially for Beyond Visual Line of Sight (BVLOS) operations. By synergizing digital twin technology with immersive Virtual Reality (VR) interfaces, we notably elevate situational awareness and…
▽ More
This study advanced tele-operations in Advanced Air Mobility (AAM) through the creation of a Vehicle Digital Twin (VDT) system for eVTOL aircraft, tailored to enhance remote control safety and efficiency, especially for Beyond Visual Line of Sight (BVLOS) operations. By synergizing digital twin technology with immersive Virtual Reality (VR) interfaces, we notably elevate situational awareness and control precision for remote operators. Our VDT framework integrates immersive tele-operation with a high-fidelity aerodynamic database, essential for authentically simulating flight dynamics and control tactics. At the heart of our methodology lies an eVTOL's high-fidelity digital replica, placed within a simulated reality that accurately reflects physical laws, enabling operators to manage the aircraft via a master-slave dynamic, substantially outperforming traditional 2D interfaces. The architecture of the designed system ensures seamless interaction between the operator, the digital twin, and the actual aircraft, facilitating exact, instantaneous feedback. Experimental assessments, involving propulsion data gathering, simulation database fidelity verification, and tele-operation testing, verify the system's capability in precise control command transmission and maintaining the digital-physical eVTOL synchronization. Our findings underscore the VDT system's potential in augmenting AAM efficiency and safety, paving the way for broader digital twin application in autonomous aerial vehicles.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Evidence, Definitions and Algorithms regarding the Existence of Cohesive-Convergence Groups in Neural Network Optimization
Authors:
Thien An L. Nguyen
Abstract:
Understanding the convergence process of neural networks is one of the most complex and crucial issues in the field of machine learning. Despite the close association of notable successes in this domain with the convergence of artificial neural networks, this concept remains predominantly theoretical. In reality, due to the non-convex nature of the optimization problems that artificial neural netw…
▽ More
Understanding the convergence process of neural networks is one of the most complex and crucial issues in the field of machine learning. Despite the close association of notable successes in this domain with the convergence of artificial neural networks, this concept remains predominantly theoretical. In reality, due to the non-convex nature of the optimization problems that artificial neural networks tackle, very few trained networks actually achieve convergence. To expand recent research efforts on artificial-neural-network convergence, this paper will discuss a different approach based on observations of cohesive-convergence groups emerging during the optimization process of an artificial neural network.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
SpiRit-LM: Interleaved Spoken and Written Language Model
Authors:
Tu Anh Nguyen,
Benjamin Muller,
Bokai Yu,
Marta R. Costa-jussa,
Maha Elbayad,
Sravya Popuri,
Paul-Ambroise Duquenne,
Robin Algayres,
Ruslan Mavlyutov,
Itai Gat,
Gabriel Synnaeve,
Juan Pino,
Benoit Sagot,
Emmanuel Dupoux
Abstract:
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated…
▽ More
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Two-Stage Distributionally Robust Edge Node Placement Under Endogenous Demand Uncertainty
Authors:
Jiaming Cheng,
Duong Thuy Anh Nguyen,
Duong Tung Nguyen
Abstract:
Edge computing (EC) promises to deliver low-latency and ubiquitous computation to numerous devices at the network edge. This paper aims to jointly optimize edge node (EN) placement and resource allocation for an EC platform, considering demand uncertainty. Diverging from existing approaches treating uncertainties as exogenous, we propose a novel two-stage decision-dependent distributionally robust…
▽ More
Edge computing (EC) promises to deliver low-latency and ubiquitous computation to numerous devices at the network edge. This paper aims to jointly optimize edge node (EN) placement and resource allocation for an EC platform, considering demand uncertainty. Diverging from existing approaches treating uncertainties as exogenous, we propose a novel two-stage decision-dependent distributionally robust optimization (DRO) framework to effectively capture the interdependence between EN placement decisions and uncertain demands. The first stage involves making EN placement decisions, while the second stage optimizes resource allocation after uncertainty revelation. We present an exact mixed-integer linear program reformulation for solving the underlying ``min-max-min" two-stage model. We further introduce a valid inequality method to enhance computational efficiency, especially for large-scale networks. Extensive numerical experiments demonstrate the benefits of considering endogenous uncertainties and the advantages of the proposed model and approach.
△ Less
Submitted 15 January, 2024;
originally announced January 2024.
-
Generative Spoken Language Model based on continuous word-sized audio tokens
Authors:
Robin Algayres,
Yossi Adi,
Tu Anh Nguyen,
Jade Copet,
Gabriel Synnaeve,
Benoit Sagot,
Emmanuel Dupoux
Abstract:
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can g…
▽ More
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
△ Less
Submitted 8 October, 2023;
originally announced October 2023.
-
Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
Authors:
Po-chun Hsu,
Ali Elkahky,
Wei-Ning Hsu,
Yossi Adi,
Tu Anh Nguyen,
Jade Copet,
Emmanuel Dupoux,
Hung-yi Lee,
Abdelrahman Mohamed
Abstract:
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…
▽ More
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.
△ Less
Submitted 4 June, 2024; v1 submitted 29 September, 2023;
originally announced September 2023.
-
Nash equilibrium seeking over digraphs with row-stochastic matrices and network-independent step-sizes
Authors:
Duong Thuy Anh Nguyen,
Mattia Bianchi,
Florian Dörfler,
Duong Tung Nguyen,
Angelia Nedić
Abstract:
In this paper, we address the challenge of Nash equilibrium (NE) seeking in non-cooperative convex games with partial-decision information. We propose a distributed algorithm, where each agent refines its strategy through projected-gradient steps and an averaging procedure. Each agent uses estimates of competitors' actions obtained solely from local neighbor interactions, in a directed communicati…
▽ More
In this paper, we address the challenge of Nash equilibrium (NE) seeking in non-cooperative convex games with partial-decision information. We propose a distributed algorithm, where each agent refines its strategy through projected-gradient steps and an averaging procedure. Each agent uses estimates of competitors' actions obtained solely from local neighbor interactions, in a directed communication network. Unlike previous approaches that rely on (strong) monotonicity assumptions, this work establishes the convergence towards a NE under a diagonal dominance property of the pseudo-gradient mapping, that can be checked locally by the agents. Further, this condition is physically interpretable and of relevance for many applications, as it suggests that an agent's objective function is primarily influenced by its individual strategic decisions, rather than by the actions of its competitors. In virtue of a novel block-infinity norm convergence argument, we provide explicit bounds for constant step-size that are independent of the communication structure, and can be computed in a totally decentralized way. Numerical simulations on an optical network's power control problem validate the algorithm's effectiveness.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
Authors:
Tu Anh Nguyen,
Wei-Ning Hsu,
Antony D'Avirro,
Bowen Shi,
Itai Gat,
Maryam Fazel-Zarani,
Tal Remez,
Jade Copet,
Gabriel Synnaeve,
Michael Hassid,
Felix Kreuk,
Yossi Adi,
Emmanuel Dupoux
Abstract:
Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthes…
▽ More
Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. All the dataset, evaluation metrics and baseline models are open source
△ Less
Submitted 10 August, 2023;
originally announced August 2023.
-
Neural Multigrid Memory For Computational Fluid Dynamics
Authors:
Duc Minh Nguyen,
Minh Chau Vu,
Tuan Anh Nguyen,
Tri Huynh,
Nguyen Tri Nguyen,
Truong Son Hy
Abstract:
Turbulent flow simulation plays a crucial role in various applications, including aircraft and ship design, industrial process optimization, and weather prediction. In this paper, we propose an advanced data-driven method for simulating turbulent flow, representing a significant improvement over existing approaches. Our methodology combines the strengths of Video Prediction Transformer (VPTR) (Ye…
▽ More
Turbulent flow simulation plays a crucial role in various applications, including aircraft and ship design, industrial process optimization, and weather prediction. In this paper, we propose an advanced data-driven method for simulating turbulent flow, representing a significant improvement over existing approaches. Our methodology combines the strengths of Video Prediction Transformer (VPTR) (Ye & Bilodeau, 2022) and Multigrid Architecture (MgConv, MgResnet) (Ke et al., 2017). VPTR excels in capturing complex spatiotemporal dependencies and handling large input data, making it a promising choice for turbulent flow prediction. Meanwhile, Multigrid Architecture utilizes multiple grids with different resolutions to capture the multiscale nature of turbulent flows, resulting in more accurate and efficient simulations. Through our experiments, we demonstrate the effectiveness of our proposed approach, named MGxTransformer, in accurately predicting velocity, temperature, and turbulence intensity for incompressible turbulent flows across various geometries and flow conditions. Our results exhibit superior accuracy compared to other baselines, while maintaining computational efficiency. Our implementation in PyTorch is available publicly at https://github.com/Combi2k2/MG-Turbulent-Flow
△ Less
Submitted 24 June, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Textually Pretrained Speech Language Models
Authors:
Michael Hassid,
Tal Remez,
Tu Anh Nguyen,
Itai Gat,
Alexis Conneau,
Felix Kreuk,
Jade Copet,
Alexandre Defossez,
Gabriel Synnaeve,
Emmanuel Dupoux,
Roy Schwartz,
Yossi Adi
Abstract:
Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de…
▽ More
Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .
△ Less
Submitted 30 January, 2024; v1 submitted 22 May, 2023;
originally announced May 2023.
-
CrowdCache: A Decentralized Game-Theoretic Framework for Mobile Edge Content Sharing
Authors:
Duong Thuy Anh Nguyen,
Jiaming Cheng,
Duong Tung Nguyen,
Angelia Nedich
Abstract:
Mobile edge computing (MEC) is a promising solution for enhancing the user experience, minimizing content delivery expenses, and reducing backhaul traffic. In this paper, we propose a novel privacy-preserving decentralized game-theoretic framework for resource crowdsourcing in MEC. Our framework models the interactions between a content provider (CP) and multiple mobile edge device users (MEDs) as…
▽ More
Mobile edge computing (MEC) is a promising solution for enhancing the user experience, minimizing content delivery expenses, and reducing backhaul traffic. In this paper, we propose a novel privacy-preserving decentralized game-theoretic framework for resource crowdsourcing in MEC. Our framework models the interactions between a content provider (CP) and multiple mobile edge device users (MEDs) as a non-cooperative game, in which MEDs offer idle storage resources for content caching in exchange for rewards. We introduce efficient decentralized gradient play algorithms for Nash equilibrium (NE) computation by exchanging local information among neighboring MEDs only, thus preventing attackers from learning users' private information. The key challenge in designing such algorithms is that communication among MEDs is not fixed and is facilitated by a sequence of undirected time-varying graphs. Our approach achieves linear convergence to the NE without imposing any assumptions on the values of parameters in the local objective functions, such as requiring strong monotonicity to be stronger than its dependence on other MEDs' actions, which is commonly required in existing literature when the graph is directed time-varying. Extensive simulations demonstrate the effectiveness of our approach in achieving efficient resource outsourcing decisions while preserving the privacy of the edge devices.
△ Less
Submitted 25 April, 2023;
originally announced April 2023.
-
Geometric Convergence of Distributed Heavy-Ball Nash Equilibrium Algorithm over Time-Varying Digraphs with Unconstrained Actions
Authors:
Duong Thuy Anh Nguyen,
Duong Tung Nguyen,
Angelia Nedich
Abstract:
This paper presents a new distributed algorithm that leverages heavy-ball momentum and a consensus-based gradient method to find a Nash equilibrium (NE) in a class of non-cooperative convex games with unconstrained action sets. In this approach, each agent in the game has access to its own smooth local cost function and can exchange information with its neighbors over a communication network. The…
▽ More
This paper presents a new distributed algorithm that leverages heavy-ball momentum and a consensus-based gradient method to find a Nash equilibrium (NE) in a class of non-cooperative convex games with unconstrained action sets. In this approach, each agent in the game has access to its own smooth local cost function and can exchange information with its neighbors over a communication network. The main novelty of our work is the incorporation of heavy-ball momentum in the context of non-cooperative games that operate on fully-decentralized, directed, and time-varying communication graphs, while also accommodating non-identical step-sizes and momentum parameters. Overcoming technical challenges arising from the dynamic and asymmetric nature of mixing matrices and the presence of an additional momentum term, we provide a rigorous proof of the geometric convergence to the NE. Moreover, we establish explicit bounds for the step-size values and momentum parameters based on the characteristics of the cost functions, mixing matrices, and graph connectivity structures. We perform numerical simulations on a Nash-Cournot game to demonstrate accelerated convergence of the proposed algorithm compared to that of the existing methods.
△ Less
Submitted 3 June, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.
-
A Bandit Approach to Online Pricing for Heterogeneous Edge Resource Allocation
Authors:
Jiaming Cheng,
Duong Thuy Anh Nguyen,
Lele Wang,
Duong Tung Nguyen,
Vijay K. Bhargava
Abstract:
Edge Computing (EC) offers a superior user experience by positioning cloud resources in close proximity to end users. The challenge of allocating edge resources efficiently while maximizing profit for the EC platform remains a sophisticated problem, especially with the added complexity of the online arrival of resource requests. To address this challenge, we propose to cast the problem as a multi-…
▽ More
Edge Computing (EC) offers a superior user experience by positioning cloud resources in close proximity to end users. The challenge of allocating edge resources efficiently while maximizing profit for the EC platform remains a sophisticated problem, especially with the added complexity of the online arrival of resource requests. To address this challenge, we propose to cast the problem as a multi-armed bandit problem and develop two novel online pricing mechanisms, the Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm and the Min-Max Optimal algorithm, for heterogeneous edge resource allocation. These mechanisms operate in real-time and do not require prior knowledge of demand distribution, which can be difficult to obtain in practice. The proposed posted pricing schemes allow users to select and pay for their preferred resources, with the platform dynamically adjusting resource prices based on observed historical data. Numerical results show the advantages of the proposed mechanisms compared to several benchmark schemes derived from traditional bandit algorithms, including the Epsilon-Greedy, basic UCB, and Thompson Sampling algorithms.
△ Less
Submitted 14 February, 2023;
originally announced February 2023.
-
DRG-Net: Interactive Joint Learning of Multi-lesion Segmentation and Classification for Diabetic Retinopathy Grading
Authors:
Hasan Md Tusfiqur,
Duy M. H. Nguyen,
Mai T. N. Truong,
Triet A. Nguyen,
Binh T. Nguyen,
Michael Barz,
Hans-Juergen Profitlich,
Ngoc T. T. Than,
Ngan Le,
Pengtao Xie,
Daniel Sonntag
Abstract:
Diabetic Retinopathy (DR) is a leading cause of vision loss in the world, and early DR detection is necessary to prevent vision loss and support an appropriate treatment. In this work, we leverage interactive machine learning and introduce a joint learning framework, termed DRG-Net, to effectively learn both disease grading and multi-lesion segmentation. Our DRG-Net consists of two modules: (i) DR…
▽ More
Diabetic Retinopathy (DR) is a leading cause of vision loss in the world, and early DR detection is necessary to prevent vision loss and support an appropriate treatment. In this work, we leverage interactive machine learning and introduce a joint learning framework, termed DRG-Net, to effectively learn both disease grading and multi-lesion segmentation. Our DRG-Net consists of two modules: (i) DRG-AI-System to classify DR Grading, localize lesion areas, and provide visual explanations; (ii) DRG-Expert-Interaction to receive feedback from user-expert and improve the DRG-AI-System. To deal with sparse data, we utilize transfer learning mechanisms to extract invariant feature representations by using Wasserstein distance and adversarial learning-based entropy minimization. Besides, we propose a novel attention strategy at both low- and high-level features to automatically select the most significant lesion information and provide explainable properties. In terms of human interaction, we further develop DRG-Net as a tool that enables expert users to correct the system's predictions, which may then be used to update the system as a whole. Moreover, thanks to the attention mechanism and loss functions constraint between lesion features and classification features, our approach can be robust given a certain level of noise in the feedback of users. We have benchmarked DRG-Net on the two largest DR datasets, i.e., IDRID and FGADR, and compared it to various state-of-the-art deep learning networks. In addition to outperforming other SOTA approaches, DRG-Net is effectively updated using user feedback, even in a weakly-supervised manner.
△ Less
Submitted 30 December, 2022;
originally announced December 2022.
-
A Solution for a Fundamental Problem of 3D Inference based on 2D Representations
Authors:
Thien An L. Nguyen
Abstract:
3D inference from monocular vision using neural networks is an important research area of computer vision. Applications of the research area are various with many proposed solutions and have shown remarkable performance. Although many efforts have been invested, there are still unanswered questions, some of which are fundamental. In this paper, I discuss a problem that I hope will come to be known…
▽ More
3D inference from monocular vision using neural networks is an important research area of computer vision. Applications of the research area are various with many proposed solutions and have shown remarkable performance. Although many efforts have been invested, there are still unanswered questions, some of which are fundamental. In this paper, I discuss a problem that I hope will come to be known as a generalization of the Blind Perspective-n-Point (Blind PnP) problem for object-driven 3D inference based on 2D representations. The vital difference between the fundamental problem and the Blind PnP problem is that 3D inference parameters in the fundamental problem are attached directly to 3D points and the camera concept will be represented through the sharing of the parameters of these points. By providing an explainable and robust gradient-decent solution based on 2D representations for an important special case of the problem, the paper opens up a new approach for using available information-based learning methods to solve problems related to 3D object pose estimation from 2D images.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Are word boundaries useful for unsupervised language learning?
Authors:
Tu Anh Nguyen,
Maureen de Seyssel,
Robin Algayres,
Patricia Roze,
Ewan Dunbar,
Emmanuel Dupoux
Abstract:
Word or word-fragment based Language Models (LM) are typically preferred over character-based ones in many downstream applications. This may not be surprising as words seem more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case…
▽ More
Word or word-fragment based Language Models (LM) are typically preferred over character-based ones in many downstream applications. This may not be surprising as words seem more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic and semantic levels using three speech-adapted black box NLP psycholinguistically-inpired benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2\% and 28\% in relative performance depending on the task. We show that gold boundaries can be replaced by automatically found ones obtained with an unsupervised segmentation algorithm, and that even modest segmentation performance gives a gain in performance on two of the three tasks compared to basic character/phone based models without boundary information.
△ Less
Submitted 6 October, 2022;
originally announced October 2022.
-
Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling
Authors:
Itai Gat,
Felix Kreuk,
Tu Anh Nguyen,
Ann Lee,
Jade Copet,
Gabriel Synnaeve,
Emmanuel Dupoux,
Yossi Adi
Abstract:
Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensi…
▽ More
Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the robustness of discrete input representations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn robust discrete speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.
△ Less
Submitted 29 May, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Improving Document Image Understanding with Reinforcement Finetuning
Authors:
Bao-Sinh Nguyen,
Dung Tien Le,
Hieu M. Vu,
Tuan Anh D. Nguyen,
Minh-Tien Nguyen,
Hung Le
Abstract:
Successful Artificial Intelligence systems often require numerous labeled data to extract information from document images. In this paper, we investigate the problem of improving the performance of Artificial Intelligence systems in understanding document images, especially in cases where training data is limited. We address the problem by proposing a novel finetuning method using reinforcement le…
▽ More
Successful Artificial Intelligence systems often require numerous labeled data to extract information from document images. In this paper, we investigate the problem of improving the performance of Artificial Intelligence systems in understanding document images, especially in cases where training data is limited. We address the problem by proposing a novel finetuning method using reinforcement learning. Our approach treats the Information Extraction model as a policy network and uses policy gradient training to update the model to maximize combined reward functions that complement the traditional cross-entropy losses. Our experiments on four datasets using labels and expert feedback demonstrate that our finetuning mechanism consistently improves the performance of a state-of-the-art information extractor, especially in the small training data regime.
△ Less
Submitted 26 September, 2022;
originally announced September 2022.
-
Enhancing Deep Learning-based 3-lead ECG Classification with Heartbeat Counting and Demographic Data Integration
Authors:
Khiem H. Le,
Hieu H. Pham,
Thao B. T. Nguyen,
Tu A. Nguyen,
Cuong D. Do
Abstract:
Nowadays, an increasing number of people are being diagnosed with cardiovascular diseases (CVDs), the leading cause of death globally. The gold standard for identifying these heart problems is via electrocardiogram (ECG). The standard 12-lead ECG is widely used in clinical practice and the majority of current research. However, using a lower number of leads can make ECG more pervasive as it can be…
▽ More
Nowadays, an increasing number of people are being diagnosed with cardiovascular diseases (CVDs), the leading cause of death globally. The gold standard for identifying these heart problems is via electrocardiogram (ECG). The standard 12-lead ECG is widely used in clinical practice and the majority of current research. However, using a lower number of leads can make ECG more pervasive as it can be integrated with portable or wearable devices. This article introduces two novel techniques to improve the performance of the current deep learning system for 3-lead ECG classification, making it comparable with models that are trained using standard 12-lead ECG. Specifically, we propose a multi-task learning scheme in the form of the number of heartbeats regression and an effective mechanism to integrate patient demographic data into the system. With these two advancements, we got classification performance in terms of F1 scores of 0.9796 and 0.8140 on two large-scale ECG datasets, i.e., Chapman and CPSC-2018, respectively, which surpassed current state-of-the-art ECG classification methods, even those trained on 12-lead data. To encourage further development, our source code is publicly available at https://github.com/lhkhiem28/LightX3ECG.
△ Less
Submitted 15 August, 2022;
originally announced August 2022.
-
LightX3ECG: A Lightweight and eXplainable Deep Learning System for 3-lead Electrocardiogram Classification
Authors:
Khiem H. Le,
Hieu H. Pham,
Thao BT. Nguyen,
Tu A. Nguyen,
Tien N. Thanh,
Cuong D. Do
Abstract:
Cardiovascular diseases (CVDs) are a group of heart and blood vessel disorders that is one of the most serious dangers to human health, and the number of such patients is still growing. Early and accurate detection plays a key role in successful treatment and intervention. Electrocardiogram (ECG) is the gold standard for identifying a variety of cardiovascular abnormalities. In clinical practices…
▽ More
Cardiovascular diseases (CVDs) are a group of heart and blood vessel disorders that is one of the most serious dangers to human health, and the number of such patients is still growing. Early and accurate detection plays a key role in successful treatment and intervention. Electrocardiogram (ECG) is the gold standard for identifying a variety of cardiovascular abnormalities. In clinical practices and most of the current research, standard 12-lead ECG is mainly used. However, using a lower number of leads can make ECG more prevalent as it can be conveniently recorded by portable or wearable devices. In this research, we develop a novel deep learning system to accurately identify multiple cardiovascular abnormalities by using only three ECG leads.
△ Less
Submitted 25 July, 2022;
originally announced July 2022.
-
STOP: A dataset for Spoken Task Oriented Semantic Parsing
Authors:
Paden Tomasello,
Akshat Shrivastava,
Daniel Lazar,
Po-Chun Hsu,
Duc Le,
Adithya Sagar,
Ali Elkahky,
Jade Copet,
Wei-Ning Hsu,
Yossi Adi,
Robin Algayres,
Tu Ahn Nguyen,
Emmanuel Dupoux,
Luke Zettlemoyer,
Abdelrahman Mohamed
Abstract:
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assi…
▽ More
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assistant systems on-device. However, the limited number of public audio datasets with semantic parse labels hinders the research progress in this area. In this paper, we release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available. Additionally, we define low-resource splits to establish a benchmark for improving SLU when limited labeled data is available. Furthermore, in addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems. Initial experimentation show end-to-end SLU models performing slightly worse than their cascaded counterparts, which we hope encourages future work in this direction.
△ Less
Submitted 18 October, 2022; v1 submitted 28 June, 2022;
originally announced July 2022.
-
Generative Spoken Dialogue Language Modeling
Authors:
Tu Anh Nguyen,
Eugene Kharitonov,
Jade Copet,
Yossi Adi,
Wei-Ning Hsu,
Ali Elkahky,
Paden Tomasello,
Robin Algayres,
Benoit Sagot,
Abdelrahman Mohamed,
Emmanuel Dupoux
Abstract:
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech,…
▽ More
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
△ Less
Submitted 22 November, 2022; v1 submitted 30 March, 2022;
originally announced March 2022.
-
Are discrete units necessary for Spoken Language Modeling?
Authors:
Tu Anh Nguyen,
Benoit Sagot,
Emmanuel Dupoux
Abstract:
Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the…
▽ More
Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performances. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1 - Speech Only).
△ Less
Submitted 22 August, 2022; v1 submitted 11 March, 2022;
originally announced March 2022.
-
textless-lib: a Library for Textless Spoken Language Processing
Authors:
Eugene Kharitonov,
Jade Copet,
Kushal Lakhotia,
Tu Anh Nguyen,
Paden Tomasello,
Ann Lee,
Ali Elkahky,
Wei-Ning Hsu,
Abdelrahman Mohamed,
Emmanuel Dupoux,
Yossi Adi
Abstract:
Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this research area. We describe the building blocks that the library provides and demonstrate its usability by discuss three differ…
▽ More
Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this research area. We describe the building blocks that the library provides and demonstrate its usability by discuss three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research the textless setting and will be handful not only for speech researchers but also for the NLP community at large. The code, documentation, and pre-trained models are available at https://github.com/facebookresearch/textlesslib/ .
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Distributed Nash Equilibrium Seeking over Time-Varying Directed Communication Networks
Authors:
Duong Thuy Anh Nguyen,
Duong Tung Nguyen,
Angelia Nedić
Abstract:
We study distributed algorithms for finding a Nash equilibrium (NE) in a class of non-cooperative convex games under partial information. Specifically, each agent has access only to its own smooth local cost function and can receive information from its neighbors in a time-varying directed communication network. To this end, we propose a distributed gradient play algorithm to compute a NE by utili…
▽ More
We study distributed algorithms for finding a Nash equilibrium (NE) in a class of non-cooperative convex games under partial information. Specifically, each agent has access only to its own smooth local cost function and can receive information from its neighbors in a time-varying directed communication network. To this end, we propose a distributed gradient play algorithm to compute a NE by utilizing local information exchange among the players. In this algorithm, every agent performs a gradient step to minimize its own cost function while sharing and retrieving information locally among its neighbors. The existing methods impose strong assumptions such as balancedness of the mixing matrices and global knowledge of the network communication structure, including Perron-Frobenius eigenvector of the adjacency matrix and other graph connectivity constants. In contrast, our approach relies only on a reasonable and widely-used assumption of row-stochasticity of the mixing matrices. We analyze the algorithm for time-varying directed graphs and prove its convergence to the NE, when the agents' cost functions are strongly convex and have Lipschitz continuous gradients. Numerical simulations are performed for a Nash-Cournot game to illustrate the efficacy of the proposed algorithm.
△ Less
Submitted 13 March, 2024; v1 submitted 6 January, 2022;
originally announced January 2022.
-
The Zero Resource Speech Challenge 2021: Spoken language modelling
Authors:
Ewan Dunbar,
Mathieu Bernard,
Nicolas Hamilakis,
Tu Anh Nguyen,
Maureen de Seyssel,
Patricia Rozé,
Morgane Rivière,
Eugene Kharitonov,
Emmanuel Dupoux
Abstract:
We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting on an encoder based on contrastive predictive coding (C…
▽ More
We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels. The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text. We provide a pipeline baseline system consisting on an encoder based on contrastive predictive coding (CPC), a quantizer ($k$-means) and a standard language model (BERT or LSTM). The metrics evaluate the learned representations at the acoustic (ABX discrimination), lexical (spot-the-word), syntactic (acceptability judgment) and semantic levels (similarity judgment). We present an overview of the eight submitted systems from four groups and discuss the main results.
△ Less
Submitted 9 August, 2021; v1 submitted 29 April, 2021;
originally announced April 2021.
-
The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling
Authors:
Tu Anh Nguyen,
Maureen de Seyssel,
Patricia Rozé,
Morgane Rivière,
Evgeny Kharitonov,
Alexei Baevski,
Ewan Dunbar,
Emmanuel Dupoux
Abstract:
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a com…
▽ More
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a composite baseline made of the concatenation of three unsupervised systems: self-supervised contrastive representation learning (CPC), clustering (k-means) and language modeling (LSTM or BERT). The language models learn on the basis of the pseudo-text derived from clustering the learned representations. This simple pipeline shows better than chance performance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech. It also yields worse performance compared to text-based 'topline' systems trained on the same data, delineating the space to be explored by more sophisticated end-to-end models.
△ Less
Submitted 1 December, 2020; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Deep Learning Approach for Receipt Recognition
Authors:
Anh Duc Le,
Dung Van Pham,
Tuan Anh Nguyen
Abstract:
Inspired by the recent successes of deep learning on Computer Vision and Natural Language Processing, we present a deep learning approach for recognizing scanned receipts. The recognition system has two main modules: text detection based on Connectionist Text Proposal Network and text recognition based on Attention-based Encoder-Decoder. We also proposed pre-processing to extract receipt area and…
▽ More
Inspired by the recent successes of deep learning on Computer Vision and Natural Language Processing, we present a deep learning approach for recognizing scanned receipts. The recognition system has two main modules: text detection based on Connectionist Text Proposal Network and text recognition based on Attention-based Encoder-Decoder. We also proposed pre-processing to extract receipt area and OCR verification to ignore handwriting. The experiments on the dataset of the Robust Reading Challenge on Scanned Receipts OCR and Information Extraction 2019 demonstrate that the accuracies were improved by integrating the pre-processing and the OCR verification. Our recognition system achieved 71.9% of the F1 score for detection and recognition task.
△ Less
Submitted 29 May, 2019;
originally announced May 2019.
-
Secure Biometric-based Remote Authentication Protocol using Chebyshev Polynomials and Fuzzy Extractor
Authors:
Thi Ai Thao Nguyen,
Tran Khanh Dang,
Quynh Chi Truong,
Dinh Thanh Nguyen
Abstract:
In this paper, we have proposed a multi factor biometric-based remote authentication protocol. Our proposal overcomes the vulnerabilities of some previous works. At the same time, the protocol also obtains a low false accept rate (FAR) and false reject rate (FRR).
In this paper, we have proposed a multi factor biometric-based remote authentication protocol. Our proposal overcomes the vulnerabilities of some previous works. At the same time, the protocol also obtains a low false accept rate (FAR) and false reject rate (FRR).
△ Less
Submitted 9 April, 2019;
originally announced April 2019.
-
A New Biometric Template Protection using Random Orthonormal Projection and Fuzzy Commitment
Authors:
Thi Ai Thao Nguyen,
Tran Khanh Dang,
Dinh Thanh Nguyen
Abstract:
Biometric template protection is one of most essential parts in putting a biometric-based authentication system into practice. There have been many researches proposing different solutions to secure biometric templates of users. They can be categorized into two approaches: feature transformation and biometric cryptosystem. However, no one single template protection approach can satisfy all the req…
▽ More
Biometric template protection is one of most essential parts in putting a biometric-based authentication system into practice. There have been many researches proposing different solutions to secure biometric templates of users. They can be categorized into two approaches: feature transformation and biometric cryptosystem. However, no one single template protection approach can satisfy all the requirements of a secure biometric-based authentication system. In this work, we will propose a novel hybrid biometric template protection which takes benefits of both approaches while preventing their limitations. The experiments demonstrate that the performance of the system can be maintained with the support of a new random orthonormal project technique, which reduces the computational complexity while preserving the accuracy. Meanwhile, the security of biometric templates is guaranteed by employing fuzzy commitment protocol.
△ Less
Submitted 30 March, 2019;
originally announced April 2019.