Search | arXiv e-print repository

LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Authors: Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

Abstract: Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis revea… ▽ More Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: Accepted for INTERSPEECH 2024

arXiv:2406.19135 [pdf, other]

DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Authors: Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han

Abstract: Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations to obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a… ▽ More Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations to obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations contain the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in terms of objective and subjective evaluation in English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, the comparison results for the general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.16994 [pdf, other]

Quantum Multi-Agent Reinforcement Learning for Cooperative Mobile Access in Space-Air-Ground Integrated Networks

Authors: Gyu Seon Kim, Yeryeong Cho, Jaehyun Chung, Soohyun Park, Soyi Jung, Zhu Han, Joongheon Kim

Abstract: Achieving global space-air-ground integrated network (SAGIN) access only with CubeSats presents significant challenges such as the access sustainability limitations in specific regions (e.g., polar regions) and the energy efficiency limitations in CubeSats. To tackle these problems, high-altitude long-endurance unmanned aerial vehicles (HALE-UAVs) can complement these CubeSat shortcomings for prov… ▽ More Achieving global space-air-ground integrated network (SAGIN) access only with CubeSats presents significant challenges such as the access sustainability limitations in specific regions (e.g., polar regions) and the energy efficiency limitations in CubeSats. To tackle these problems, high-altitude long-endurance unmanned aerial vehicles (HALE-UAVs) can complement these CubeSat shortcomings for providing cooperatively global access sustainability and energy efficiency. However, as the number of CubeSats and HALE-UAVs, increases, the scheduling dimension of each ground station (GS) increases. As a result, each GS can fall into the curse of dimensionality, and this challenge becomes one major hurdle for efficient global access. Therefore, this paper provides a quantum multi-agent reinforcement Learning (QMARL)-based method for scheduling between GSs and CubeSats/HALE-UAVs in order to improve global access availability and energy efficiency. The main reason why the QMARL-based scheduler can be beneficial is that the algorithm facilitates a logarithmic-scale reduction in scheduling action dimensions, which is one critical feature as the number of CubeSats and HALE-UAVs expands. Additionally, individual GSs have different traffic demands depending on their locations and characteristics, thus it is essential to provide differentiated access services. The superiority of the proposed scheduler is validated through data-intensive experiments in realistic CubeSat/HALE-UAV settings. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: 17 pages, 22 figures

arXiv:2406.14372 [pdf, ps, other]

Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growth

Authors: Yeongjun Jang, Joowon Lee, Seonhong Min, Hyesun Kwak, Junsoo Kim, Yongsoo Song

Abstract: In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a pol… ▽ More In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a polynomial having multiple error coefficients. Such errors accumulate under recursive homomorphic operations, and it has been studied that their effect can be suppressed by the closed-loop stability when dynamic controllers are encrypted using LWE based schemes. We show that this also holds for the proposed controller encrypted using a Ring-LWE based scheme. Specifically, only the constant terms of the error polynomials affect the control performance, and their effect can be arbitrarily bounded even when the noneffective terms diverge. Furthermore, a novel packing algorithm is applied, resulting in reduced computation time and enhanced memory efficiency. Simulation results demonstrate the effectiveness of the proposed method. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 12 pages, 3 figures

arXiv:2406.11427 [pdf, other]

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Authors: Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

Abstract: Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models f… ▽ More Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.09819 [pdf, other]

Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments

Authors: Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang

Abstract: Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with du… ▽ More Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with dual-path transformers for speech separation from two simultaneous talkers in realistic settings. However, the distributed nature makes it hard to fuse information across microphones efficiently. Therefore, we explore the efficacy of blindly clustering microphones around sources of interest prior to enhancement. Experimental results show that this deep cluster-informed approach significantly improves the system's capacity to cope with the inherent variability observed in ad-hoc distributed microphone environments. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2406.09286 [pdf, other]

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

Authors: Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

Abstract: This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the numbe… ▽ More This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the number of learnable parameters without degrading the output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference speed and reduces the model size by half while maintaining the output quality. The demo page is available at: https://cyongong.github.io/FlowAVSE.github.io/ △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: INTERSPEECH 2024

arXiv:2406.07103 [pdf, other]

MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

Authors: Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu

Abstract: In speaker verification systems, the utilization of short utterances presents a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw… ▽ More In speaker verification systems, the utilization of short utterances presents a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw waveforms. The MR-RawNet extracts time-frequency representations from raw waveforms via a multi-resolution feature extractor that optimally adjusts both temporal and spectral resolutions simultaneously. Furthermore, we apply a multi-resolution attention block that focuses on diverse and extensive temporal contexts, ensuring robustness against changes in utterance length. The experimental results, conducted on VoxCeleb1 dataset, demonstrate that the MR-RawNet exhibits superior performance in handling utterances of variable duration compared to other raw waveform-based systems. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 5 pages, accepted by Interspeech 2024

arXiv:2406.06786 [pdf, other]

BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification

Authors: June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung

Abstract: Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model u… ▽ More Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model using free-text descriptions derived from the sound samples' metadata which includes the gender and age of patients, type of recording devices, and recording location on the patient's body. Our method achieves state-of-the-art performance on the ICBHI dataset, surpassing the previous best result by a notable margin of 1.17%. This result validates the effectiveness of leveraging metadata and respiratory sound samples in enhancing RSC performance. Additionally, we investigate the model performance in the case where metadata is partially unavailable, which may occur in real-world clinical setting. △ Less

Submitted 14 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted INTERSPEECH 2024

arXiv:2406.02936 [pdf]

Radiomics-guided Multimodal Self-attention Network for Predicting Pathological Complete Response in Breast MRI

Authors: Jonghun Kim, Hyunjin Park

Abstract: Breast cancer is the most prevalent cancer among women and predicting pathologic complete response (pCR) after anti-cancer treatment is crucial for patient prognosis and treatment customization. Deep learning has shown promise in medical imaging diagnosis, particularly when utilizing multiple imaging modalities to enhance accuracy. This study presents a model that predicts pCR in breast cancer pat… ▽ More Breast cancer is the most prevalent cancer among women and predicting pathologic complete response (pCR) after anti-cancer treatment is crucial for patient prognosis and treatment customization. Deep learning has shown promise in medical imaging diagnosis, particularly when utilizing multiple imaging modalities to enhance accuracy. This study presents a model that predicts pCR in breast cancer patients using dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) and apparent diffusion coefficient (ADC) maps. Radiomics features are established hand-crafted features of the tumor region and thus could be useful in medical image analysis. Our approach extracts features from both DCE MRI and ADC using an encoder with a self-attention mechanism, leveraging radiomics to guide feature extraction from tumor-related regions. Our experimental results demonstrate the superior performance of our model in predicting pCR compared to other baseline methods. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: 5 pages, 5 figures, IEEE ISBI 2024 proceedings

arXiv:2406.00123 [pdf]

Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration

Authors: Mingyuan Meng, Dagan Feng, Lei Bi, Jinman Kim

Abstract: Deformable image registration is a fundamental step for medical image analysis. Recently, transformers have been used for registration and outperformed Convolutional Neural Networks (CNNs). Transformers can capture long-range dependence among image features, which have been shown beneficial for registration. However, due to the high computation/memory loads of self-attention, transformers are typi… ▽ More Deformable image registration is a fundamental step for medical image analysis. Recently, transformers have been used for registration and outperformed Convolutional Neural Networks (CNNs). Transformers can capture long-range dependence among image features, which have been shown beneficial for registration. However, due to the high computation/memory loads of self-attention, transformers are typically used at downsampled feature resolutions and cannot capture fine-grained long-range dependence at the full image resolution. This limits deformable registration as it necessitates precise dense correspondence between each image pixel. Multi-layer Perceptrons (MLPs) without self-attention are efficient in computation/memory usage, enabling the feasibility of capturing fine-grained long-range dependence at full resolution. Nevertheless, MLPs have not been extensively explored for image registration and are lacking the consideration of inductive bias crucial for medical registration tasks. In this study, we propose the first correlation-aware MLP-based registration network (CorrMLP) for deformable medical image registration. Our CorrMLP introduces a correlation-aware multi-window MLP block in a novel coarse-to-fine registration architecture, which captures fine-grained multi-range dependence to perform correlation-aware coarse-to-fine registration. Extensive experiments with seven public medical datasets show that our CorrMLP outperforms state-of-the-art deformable registration methods. △ Less

Submitted 12 June, 2024; v1 submitted 31 May, 2024; originally announced June 2024.

Comments: Accepted at CVPR2024 as Oral Presentation && Best Paper Candidate

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9645-9654

arXiv:2405.10272 [pdf, other]

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations… ▽ More The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which is capable of high-quality motion code generation in an efficient way. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: CVPR 2024

arXiv:2405.08614 [pdf, other]

FDD Massive MIMO: How to Optimally Combine UL Pilot and Limited DL CSI Feedback?

Authors: Jungyeon Kim, Jinseok Choi, Jeonghun Park, Ahmed Alkhateeb, Namyoon Lee

Abstract: In frequency-division duplexing (FDD) multiple-input multiple-output (MIMO) systems, obtaining accurate downlink channel state information (CSI) for precoding is vastly challenging due to the tremendous feedback overhead with the growing number of antennas. Utilizing uplink pilots for downlink CSI estimation is a promising approach that can eliminate CSI feedback. However, the downlink CSI estimat… ▽ More In frequency-division duplexing (FDD) multiple-input multiple-output (MIMO) systems, obtaining accurate downlink channel state information (CSI) for precoding is vastly challenging due to the tremendous feedback overhead with the growing number of antennas. Utilizing uplink pilots for downlink CSI estimation is a promising approach that can eliminate CSI feedback. However, the downlink CSI estimation accuracy diminishes significantly as the number of channel paths increases, resulting in reduced spectral efficiency. In this paper, we demonstrate that achieving downlink spectral efficiency comparable to perfect CSI is feasible by combining uplink CSI with limited downlink CSI feedback information. Our proposed downlink CSI feedback strategy transmits quantized phase information of downlink channel paths, deviating from conventional limited methods. We put forth a mean square error (MSE)-optimal downlink channel reconstruction method by jointly exploiting the uplink CSI and the limited downlink CSI. Armed with the MSE-optimal estimator, we derive the MSE as a function of the number of feedback bits for phase quantization. Subsequently, we present an optimal feedback bit allocation method for minimizing the MSE in the reconstructed channel through phase quantization. Utilizing a robust downlink precoding technique, we establish that the proposed downlink channel reconstruction method is sufficient for attaining a sum-spectral efficiency comparable to perfect CSI. △ Less

Submitted 14 May, 2024; originally announced May 2024.

Comments: 13 pages, 10 figures

arXiv:2405.08277 [pdf, other]

AI-driven, Model-Free Current Control: A Deep Symbolic Approach for Optimal Induction Machine Performance

Authors: Muhammad Usama, Yunkyung Hwang, Jaehong Kim

Abstract: This paper proposed a straightforward and efficient current control solution for induction machines employing deep symbolic regression (DSR). The proposed DSR-based control design offers a simple yet highly effective approach by creating an optimal control model through training and fitting, resulting in an analytical dynamic numerical expression that characterizes the data. Notably, this approach… ▽ More This paper proposed a straightforward and efficient current control solution for induction machines employing deep symbolic regression (DSR). The proposed DSR-based control design offers a simple yet highly effective approach by creating an optimal control model through training and fitting, resulting in an analytical dynamic numerical expression that characterizes the data. Notably, this approach not only produces an understandable model but also demonstrates the capacity to extrapolate and estimate data points outside its training dataset, showcasing its adaptability and resilience. In contrast to conventional state-of-the-art proportional-integral (PI) current controllers, which heavily rely on specific system models, the proposed DSR-based approach stands out for its model independence. Simulation and experimental tests validate its effectiveness, highlighting its superior extrapolation capabilities compared to conventional methods. These findings pave the way for the integration of deep learning methods in power conversion applications, promising improved performance and adaptability in the control of induction machines. The simulation and experimental test results are provided with a 3.7 kw induction machine to verify the efficacy of the proposed control solution. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: This work has been accepted for potential publication at the IEEE ECCE Asia 2024 International Power Electronics and Motion Control Conference. Please note that copyright may be transferred without prior notice

arXiv:2405.07255 [pdf, ps, other]

Deep Learning-aided Parametric Sparse Channel Estimation for Terahertz Massive MIMO Systems

Authors: Jinhong Kim, Yongjun Ahn, Seungnyun Kim, Byonghyo Shim

Abstract: Terahertz (THz) communications is considered as one of key solutions to support extremely high data demand in 6G. One main difficulty of the THz communication is the severe signal attenuation caused by the foliage loss, oxygen/atmospheric absorption, body and hand losses. To compensate for the severe path loss, multiple-input-multiple-output (MIMO) antenna array-based beamforming has been widely u… ▽ More Terahertz (THz) communications is considered as one of key solutions to support extremely high data demand in 6G. One main difficulty of the THz communication is the severe signal attenuation caused by the foliage loss, oxygen/atmospheric absorption, body and hand losses. To compensate for the severe path loss, multiple-input-multiple-output (MIMO) antenna array-based beamforming has been widely used. Since the beams should be aligned with the signal propagation path to achieve the maximum beamforming gain, acquisition of accurate channel knowledge, i.e., channel estimation, is of great importance. An aim of this paper is to propose a new type of deep learning (DL)-based parametric channel estimation technique. In our work, DL figures out the mapping function between the received pilot signal and the sparse channel parameters characterizing the spherical domain channel. By exploiting the long short-term memory (LSTM), we can efficiently extract the temporally correlated features of sparse channel parameters and thus make an accurate estimation with relatively small pilot overhead. From the numerical experiments, we show that the proposed scheme is effective in estimating the near-field THz MIMO channel in THz downlink environments. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.06284 [pdf, other]

Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention

Authors: Ju-Hyeon Nam, Nur Suriza Syazwany, Su Jung Kim, Sang-Chul Lee

Abstract: Generalizability in deep neural networks plays a pivotal role in medical image segmentation. However, deep learning-based medical image analyses tend to overlook the importance of frequency variance, which is critical element for achieving a model that is both modality-agnostic and domain-generalizable. Additionally, various models fail to account for the potential information loss that can arise… ▽ More Generalizability in deep neural networks plays a pivotal role in medical image segmentation. However, deep learning-based medical image analyses tend to overlook the importance of frequency variance, which is critical element for achieving a model that is both modality-agnostic and domain-generalizable. Additionally, various models fail to account for the potential information loss that can arise from multi-task learning under deep supervision, a factor that can impair the model representation ability. To address these challenges, we propose a Modality-agnostic Domain Generalizable Network (MADGNet) for medical image segmentation, which comprises two key components: a Multi-Frequency in Multi-Scale Attention (MFMSA) block and Ensemble Sub-Decoding Module (E-SDM). The MFMSA block refines the process of spatial feature extraction, particularly in capturing boundary features, by incorporating multi-frequency and multi-scale features, thereby offering informative cues for tissue outline and anatomical structures. Moreover, we propose E-SDM to mitigate information loss in multi-task learning with deep supervision, especially during substantial upsampling from low resolution. We evaluate the segmentation performance of MADGNet across six modalities and fifteen datasets. Through extensive experiments, we demonstrate that MADGNet consistently outperforms state-of-the-art models across various modalities, showcasing superior segmentation performance. This affirms MADGNet as a robust solution for medical image segmentation that excels in diverse imaging scenarios. Our MADGNet code is available in GitHub Link. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: Accepted in Computer Vision and Pattern Recognition (CVPR) 2024

arXiv:2405.03732 [pdf]

Accelerated MR Cholangiopancreatography with Deep Learning-based Reconstruction

Authors: Jinho Kim, Marcel Dominik Nickel, Florian Knoll

Abstract: This study accelerates MR cholangiopancreatography (MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and 0.55T. Thirty healthy volunteers underwent conventional two-fold MRCP scans at field strengths of 3T or 0.55T. We trained a variational network (VN) using retrospectively six-fold undersampled data obtained at 3T. We then evaluated our method against standard techniques su… ▽ More This study accelerates MR cholangiopancreatography (MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and 0.55T. Thirty healthy volunteers underwent conventional two-fold MRCP scans at field strengths of 3T or 0.55T. We trained a variational network (VN) using retrospectively six-fold undersampled data obtained at 3T. We then evaluated our method against standard techniques such as parallel imaging (PI) and compressed sensing (CS), focusing on peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. Furthermore, considering acquiring fully-sampled MRCP is impractical, we added a self-supervised DL reconstruction (SSDU) to the evaluating group. We also tested our method in a prospective accelerated scenario to reflect real-world clinical applications and evaluated its adaptability to MRCP at 0.55T. Our method demonstrated a remarkable reduction of average acquisition time from 599/542 to 255/180 seconds for MRCP at 3T/0.55T. In both retrospective and prospective undersampling scenarios, the PSNR and SSIM of VN were higher than those of PI, CS, and SSDU. At the same time, VN preserved the image quality of undersampled data, i.e., sharpness and the visibility of hepatobiliary ducts. In addition, VN also produced high quality reconstructions at 0.55T resulting in the highest PSNR and SSIM. In summary, VN trained for highly accelerated MRCP allows to reduce the acquisition time by a factor of 2.4/3.0 at 3T/0.55T while maintaining the image quality of the conventional acquisition. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: 20 pages, 6 figures, 2 tables

arXiv:2405.03684 [pdf, other]

All-in-One Deep Learning Framework for MR Image Reconstruction

Authors: Geunu Jeong, Hyeonsoo Kim, Joonyoung Yang, Kyungeun Jang, Jeewook Kim

Abstract: We introduce a novel, all-in-one deep learning framework for MR image reconstruction, enabling a single model to enhance image quality across multiple aspects of k-space sampling and to be effective across a wide range of clinical and technical scenarios. This DICOM-based algorithm serves as the core of SwiftMR (AIRS Medical, Seoul, Korea), which is FDA-cleared, CE-certified, and commercially avai… ▽ More We introduce a novel, all-in-one deep learning framework for MR image reconstruction, enabling a single model to enhance image quality across multiple aspects of k-space sampling and to be effective across a wide range of clinical and technical scenarios. This DICOM-based algorithm serves as the core of SwiftMR (AIRS Medical, Seoul, Korea), which is FDA-cleared, CE-certified, and commercially available. We first detail the comprehensive development process of the model, including data collection, training pair preparation, model architecture design, and DICOM inference. We then assess the model's capability to enhance image quality in a multi-dimensional manner, specifically across various aspects of k-space sampling. Subsequently, we evaluate several features of the multi-dimensional enhancement: the accuracy of tunable denoising, the effectiveness of super-resolution in each encoding direction, and the reduction of artifacts that become more prominent at lower spatial resolutions. Additionally, we assess its compatibility with various scan parameter sets and its generalizability across scanner vendors not seen during training. Finally, we present specific cases demonstrating the model's utility in reducing scan time across anatomical regions in conjunction with protocol optimization. The proposed model is compatible with a broad spectrum of scenarios, including various vendors, pulse sequences, scan parameters, and anatomical regions. Its DICOM-based operation particularly enhances its applicability for real-world applications. Given its demonstrated effectiveness and versatility, we expect its use to expand in the field of clinical MRI. △ Less

Submitted 26 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 22 pages, 9 figures; number of collected MR raw data corrected

arXiv:2405.02996 [pdf, other]

RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification

Authors: June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung

Abstract: Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, intuitively would share closer resemblance to lung sounds. This paper explores the efficacy of pretrain… ▽ More Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, intuitively would share closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and to bridge this gap, data augmentation is essential. However, the most widely used augmentation technique for audio and speech, SpecAugment, requires 2-dimensional spectrogram format and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that outperforms SpecAugment, but is also suitable for respiratory sound classification with waveform pretrained models. Experimental results show that our approach outperforms the SpecAugment, demonstrating a substantial improvement in the accuracy of minority disease classes, reaching up to 7.14%. △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted EMBC 2024

arXiv:2405.02066 [pdf, other]

WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights

Authors: Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, Sangpil Kim

Abstract: The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representat… ▽ More The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods. △ Less

Submitted 27 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2404.18516 [pdf, ps, other]

Downlink Pilots are Essential for Cell-Free Massive MIMO with Multi-Antenna Users

Authors: Eren Berk Kama, Junbeom Kim, Emil Björnson

Abstract: We consider a cell-free massive MIMO system with multiple antennas on the users and access points. In previous works, the downlink spectral efficiency (SE) has been evaluated using the hardening bound that requires no downlink pilots. This approach works well when having single-antenna users. In this paper, we show that much higher SEs can be achieved if downlink pilots are sent since the effectiv… ▽ More We consider a cell-free massive MIMO system with multiple antennas on the users and access points. In previous works, the downlink spectral efficiency (SE) has been evaluated using the hardening bound that requires no downlink pilots. This approach works well when having single-antenna users. In this paper, we show that much higher SEs can be achieved if downlink pilots are sent since the effective channel matrix does not harden when having multi-antenna users. We propose a pilot-based downlink estimation scheme and derive a new SE expression that utilizes zero-forcing combining. We show numerically how the number of users and user antennas affects the SE. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: \c{opyright} 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2404.16484 [pdf, other]

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024, AI for Streaming (AIS) Workshop

arXiv:2404.15333 [pdf, other]

EB-GAME: A Game-Changer in ECG Heartbeat Anomaly Detection

Authors: JuneYoung Park, Da Young Kim, Yunsoo Kim, Jisu Yoo, Tae Joon Kim

Abstract: Cardiologists use electrocardiograms (ECG) for the detection of arrhythmias. However, continuous monitoring of ECG signals to detect cardiac abnormal-ities requires significant time and human resources. As a result, several deep learning studies have been conducted in advance for the automatic detection of arrhythmia. These models show relatively high performance in supervised learning, but are no… ▽ More Cardiologists use electrocardiograms (ECG) for the detection of arrhythmias. However, continuous monitoring of ECG signals to detect cardiac abnormal-ities requires significant time and human resources. As a result, several deep learning studies have been conducted in advance for the automatic detection of arrhythmia. These models show relatively high performance in supervised learning, but are not applicable in cases with few training examples. This is because abnormal ECG data is scarce compared to normal data in most real-world clinical settings. Therefore, in this study, GAN-based anomaly detec-tion, i.e., unsupervised learning, was employed to address the issue of data imbalance. This paper focuses on detecting abnormal signals in electrocardi-ograms (ECGs) using only labels from normal signals as training data. In-spired by self-supervised vision transformers, which learn by dividing images into patches, and masked auto-encoders, known for their effectiveness in patch reconstruction and solving information redundancy, we introduce the ECG Heartbeat Anomaly Detection model, EB-GAME. EB-GAME was trained and validated on the MIT-BIH Arrhythmia Dataset, where it achieved state-of-the-art performance on this benchmark. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.10936 [pdf, other]

Beam Training in mmWave Vehicular Systems: Machine Learning for Decoupling Beam Selection

Authors: Ibrahim Kilinc, Ryan M. Dreifuerst, Junghoon Kim, Robert W. Heath Jr

Abstract: Codebook-based beam selection is one approach for configuring millimeter wave communication links. The overhead required to reconfigure the transmit and receive beam pair, though, increases in highly dynamic vehicular communication systems. Location information coupled with machine learning (ML) beam recommendation is one way to reduce the overhead of beam pair selection. In this paper, we develop… ▽ More Codebook-based beam selection is one approach for configuring millimeter wave communication links. The overhead required to reconfigure the transmit and receive beam pair, though, increases in highly dynamic vehicular communication systems. Location information coupled with machine learning (ML) beam recommendation is one way to reduce the overhead of beam pair selection. In this paper, we develop ML-based location-aided approaches to decouple the beam selection between the user equipment (UE) and the base station (BS). We quantify the performance gaps due to decoupling beam selection and also disaggregating the UE's location information from the BS. Our simulation results show that decoupling beam selection with available location information at the BS performs comparable to joint beam pair selection at the BS. Moreover, decoupled beam selection without location closely approaches the performance of beam pair selection at the BS when sufficient beam pairs are swept. △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: Submitted to IEEE BlackSeaCom 2024, 6 pages, 5 figures

arXiv:2404.07021 [pdf, other]

A 4x32Gb/s 1.8pJ/bit Collaborative Baud-Rate CDR with Background Eye-Climbing Algorithm and Low-Power Global Clock Distribution

Authors: Jihee Kim, Jia Park, Jiwon Shin, Hanseok Kim, Kahyun Kim, Haengbeom Shin, Ha-Jung Park, Woo-Seok Choi

Abstract: This paper presents design techniques for an energy-efficient multi-lane receiver (RX) with baud-rate clock and data recovery (CDR), which is essential for high-throughput low-latency communication in high-performance computing systems. The proposed low-power global clock distribution not only significantly reduces power consumption across multi-lane RXs but is capable of compensating for the freq… ▽ More This paper presents design techniques for an energy-efficient multi-lane receiver (RX) with baud-rate clock and data recovery (CDR), which is essential for high-throughput low-latency communication in high-performance computing systems. The proposed low-power global clock distribution not only significantly reduces power consumption across multi-lane RXs but is capable of compensating for the frequency offset without any phase interpolators. To this end, a fractional divider controlled by CDR is placed close to the global phase locked loop. Moreover, in order to address the sub-optimal lock point of conventional baud-rate phase detectors, the proposed CDR employs a background eye-climbing algorithm, which optimizes the sampling phase and maximizes the vertical eye margin (VEM). Fabricated in a 28nm CMOS process, the proposed 4x32Gb/s RX shows a low integrated fractional spur of -40.4dBc at a 2500ppm frequency offset. Furthermore, it improves bit-error-rate performance by increasing the VEM by 17%. The entire RX achieves the energy efficiency of 1.8pJ/bit with the aggregate data rate of 128Gb/s. △ Less

Submitted 22 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.05558 [pdf, other]

JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients

Authors: Woo Kyoung Han, Sunghoon Im, Jaedeok Kim, Kyong Hwan Jin

Abstract: We propose a practical approach to JPEG image decoding, utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate, inevitably resulting in quality degradation while encoding an image. We have designed a continuous cosine spectrum estimator to address the… ▽ More We propose a practical approach to JPEG image decoding, utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate, inevitably resulting in quality degradation while encoding an image. We have designed a continuous cosine spectrum estimator to address the quality degradation issue that restores the distorted spectrum. By leveraging local DCT formulations, our network has the privilege to exploit dequantization and upsampling simultaneously. Our proposed model enables decoding compressed images directly across different quality factors using a single pre-trained model without relying on a conventional JPEG decoder. As a result, our proposed network achieves state-of-the-art performance in flexible color image JPEG artifact removal tasks. Our source code is available at https://github.com/WooKyoungHan/JDEC. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.05119 [pdf, other]

A 0.65-pJ/bit 3.6-TB/s/mm I/O Interface with XTalk Minimizing Affine Signaling for Next-Generation HBM with High Interconnect Density

Authors: Hyunjun Park, Jiwon Shin, Hanseok Kim, Jihee Kim, Haengbeom Shin, Taehoon Kim, Jung-Hun Park, Woo-Seok Choi

Abstract: This paper presents an I/O interface with Xtalk Minimizing Affine Signaling (XMAS), which is designed to support high-speed data transmission in die-to-die communication over silicon interposers or similar high-density interconnects susceptible to crosstalk. The operating principles of XMAS are elucidated through rigorous analyses, and its advantages over existing signaling are validated through n… ▽ More This paper presents an I/O interface with Xtalk Minimizing Affine Signaling (XMAS), which is designed to support high-speed data transmission in die-to-die communication over silicon interposers or similar high-density interconnects susceptible to crosstalk. The operating principles of XMAS are elucidated through rigorous analyses, and its advantages over existing signaling are validated through numerical experiments. XMAS not only demonstrates exceptional crosstalk removing capabilities but also exhibits robustness against noise, especially simultaneous switching noise. Fabricated in a 28-nm CMOS process, the prototype XMAS transceiver achieves an edge density of 3.6TB/s/mm and an energy efficiency of 0.65pJ/b. Compared to the single-ended signaling, the crosstalk-induced peak-to-peak jitter of the received eye with XMAS is reduced by 75% at 10GS/s/pin data rate, and the horizontal eye opening extends to 0.2UI at a bit error rate < 10$^{-12}$. △ Less

Submitted 7 April, 2024; originally announced April 2024.

arXiv:2404.04727 [pdf, other]

A code-driven tutorial on encrypted control: From pioneering realizations to modern implementations

Authors: Nils Schlüter, Junsoo Kim, Moritz Schulze Darup

Abstract: The growing interconnectivity in control systems due to robust wireless communication and cloud usage paves the way for exciting new opportunities such as data-driven control and service-based decision-making. At the same time, connected systems are susceptible to cyberattacks and data leakages. Against this background, encrypted control aims to increase the security and safety of cyber-physical s… ▽ More The growing interconnectivity in control systems due to robust wireless communication and cloud usage paves the way for exciting new opportunities such as data-driven control and service-based decision-making. At the same time, connected systems are susceptible to cyberattacks and data leakages. Against this background, encrypted control aims to increase the security and safety of cyber-physical systems. A central goal is to ensure confidentiality of process data during networked controller evaluations, which is enabled by, e.g., homomorphic encryption. However, the integration of advanced cryptographic systems renders the design of encrypted controllers an interdisciplinary challenge. This code-driven tutorial paper aims to facilitate the access to encrypted control by providing exemplary realizations based on popular homomorphic cryptosystems. In particular, we discuss the encrypted implementation of state feedback and PI controllers using the Paillier, GSW, and CKKS cryptosystem. △ Less

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2404.02781 [pdf, other]

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Authors: Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

Abstract: With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complex… ▽ More With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performances. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: ICLR 2024

arXiv:2404.02574 [pdf, ps, other]

Learning with errors based dynamic encryption that discloses residue signal for anomaly detection

Authors: Yeongjun Jang, Joowon Lee, Junsoo Kim, Hyungbo Shim

Abstract: Anomaly detection is a protocol that detects integrity attacks on control systems by comparing the residue signal with a threshold. Implementing anomaly detection on encrypted control systems has been a challenge because it is hard to detect an anomaly from the encrypted residue signal without the secret key. In this paper, we propose a dynamic encryption scheme for a linear system that automatica… ▽ More Anomaly detection is a protocol that detects integrity attacks on control systems by comparing the residue signal with a threshold. Implementing anomaly detection on encrypted control systems has been a challenge because it is hard to detect an anomaly from the encrypted residue signal without the secret key. In this paper, we propose a dynamic encryption scheme for a linear system that automatically discloses the residue signal. The initial state and the input are encrypted based on the zero-dynamics of the system, so that the effect of encryption on the residue signal remains identically zero. The proposed scheme is shown to be secure in the sense that no other information than the residue signal is disclosed. Furthermore, we demonstrate a method of utilizing the disclosed residue signal to operate an observer-based controller over encrypted data for an infinite time horizon without re-encryption. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: 7 pages, 1 figure

arXiv:2404.01464 [pdf, other]

Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images

Authors: JungEun Kim, Hangyul Yoon, Geondo Park, Kyungsu Kim, Eunho Yang

Abstract: 4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Gi… ▽ More 4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Given these circumstances, not only is data acquisition challenging, but increasing the frame rate for each dataset also proves difficult. To address this challenge, this paper proposes a simple yet effective Unsupervised Volumetric Interpolation framework, UVI-Net. This framework facilitates temporal interpolation without the need for any intermediate frames, distinguishing it from the majority of other existing unsupervised methods. Experiments on benchmark datasets demonstrate significant improvements across diverse evaluation metrics compared to unsupervised and supervised baselines. Remarkably, our approach achieves this superior performance even when trained with a dataset as small as one, highlighting its exceptional robustness and efficiency in scenarios with sparse supervision. This positions UVI-Net as a compelling alternative for 4D medical imaging, particularly in settings where data availability is limited. The source code is available at https://github.com/jungeun122333/UVI-Net. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.17420 [pdf, other]

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Authors: Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localizati… ▽ More The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted at CVPR 2024

arXiv:2403.17327 [pdf, other]

Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer

Authors: Jeong-Yoon Kim, Seung-Ho Lee

Abstract: In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in spectrogram and transferring positional information between ViT through knowledge transfer. The proposed method has the following originality i) We use vertically segmented patches of log-Mel spectr… ▽ More In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in spectrogram and transferring positional information between ViT through knowledge transfer. The proposed method has the following originality i) We use vertically segmented patches of log-Mel spectrogram to analyze the correlation of frequencies over time. This type of patch allows us to correlate the most relevant frequencies for a particular emotion with the time they were uttered. ii) We propose the use of image coordinate encoding, an absolute positional encoding suitable for ViT. By normalizing the x, y coordinates of the image to -1 to 1 and concatenating them to the image, we can effectively provide valid absolute positional information for ViT. iii) Through feature map matching, the locality and location information of the teacher network is effectively transmitted to the student network. Teacher network is a ViT that contains locality of convolutional stem and absolute position information through image coordinate encoding, and student network is a structure that lacks positional encoding in the basic ViT structure. In feature map matching stage, we train through the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the proposed method, three emotion datasets (SAVEE, EmoDB, and CREMA-D) consisting of speech were converted into log-Mel spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms the state-of-the-art methods in terms of weighted accuracy while requiring significantly fewer floating point operations (FLOPs). Overall, the proposed method offers an promising solution for SER by providing improved efficiency and performance. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.12726 [pdf]

Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

Authors: Hang Song, Hyun Joon Kim, Mingxia Wan, Bo Wei, Takamaro Kikkawa, Jun-ichi Takada

Abstract: Measuring the complex permittivity of material is essential in many scenarios such as quality check and component analysis. Generally, measurement methods for characterizing the material are based on the usage of vector network analyzer, which is large and not easy for on-site measurement, especially in high frequency range such as millimeter wave (mmWave). In addition, some measurement methods re… ▽ More Measuring the complex permittivity of material is essential in many scenarios such as quality check and component analysis. Generally, measurement methods for characterizing the material are based on the usage of vector network analyzer, which is large and not easy for on-site measurement, especially in high frequency range such as millimeter wave (mmWave). In addition, some measurement methods require the destruction of samples, which is not suitable for non-destructive inspection. In this work, a small distance increment (SDI) method is proposed to non-destructively measure the complex permittivity of material. In SDI, the transmitter and receiver are formed as the monostatic radar, which is facing towards the material under test (MUT). During the measurement, the distance between radar and MUT changes with small increments and the signals are recorded at each position. A mathematical model is formulated to depict the relationship among the complex permittivity, distance increment, and measured signals. By fitting the model, the complex permittivity of MUT is estimated. To implement and evaluate the proposed SDI method, a commercial off-the-shelf mmWave radar is utilized and the measurement system is developed. Then, the evaluation was carried out on the acrylic plate. With the proposed method, the estimated complex permittivity of acrylic plate shows good agreement with the literature values, demonstrating the efficacy of SDI method for characterizing the complex permittivity of material. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12098 [pdf, other]

Deep Generative Design for Mass Production

Authors: Jihoon Kim, Yongmin Kwon, Namwoo Kang

Abstract: Generative Design (GD) has evolved as a transformative design approach, employing advanced algorithms and AI to create diverse and innovative solutions beyond traditional constraints. Despite its success, GD faces significant challenges regarding the manufacturability of complex designs, often necessitating extensive manual modifications due to limitations in standard manufacturing processes and t… ▽ More Generative Design (GD) has evolved as a transformative design approach, employing advanced algorithms and AI to create diverse and innovative solutions beyond traditional constraints. Despite its success, GD faces significant challenges regarding the manufacturability of complex designs, often necessitating extensive manual modifications due to limitations in standard manufacturing processes and the reliance on additive manufacturing, which is not ideal for mass production. Our research introduces an innovative framework addressing these manufacturability concerns by integrating constraints pertinent to die casting and injection molding into GD, through the utilization of 2D depth images. This method simplifies intricate 3D geometries into manufacturable profiles, removing unfeasible features such as non-manufacturable overhangs and allowing for the direct consideration of essential manufacturing aspects like thickness and rib design. Consequently, designs previously unsuitable for mass production are transformed into viable solutions. We further enhance this approach by adopting an advanced 2D generative model, which offer a more efficient alternative to traditional 3D shape generation methods. Our results substantiate the efficacy of this framework, demonstrating the production of innovative, and, importantly, manufacturable designs. This shift towards integrating practical manufacturing considerations into GD represents a pivotal advancement, transitioning from purely inspirational concepts to actionable, production-ready solutions. Our findings underscore usefulness and potential of GD for broader industry adoption, marking a significant step forward in aligning GD with the demands of manufacturing challenges. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.11094 [pdf, other]

Nonlinear Self-Interference Cancellation With Learnable Orthonormal Polynomials for Full-Duplex Wireless Systems

Authors: Hyowon Lee, Jungyeon Kim, Geon Choi, Ian P. Roberts, Jinseok Choi, Namyoon Lee

Abstract: Nonlinear self-interference cancellation (SIC) is essential for full-duplex communication systems, which can offer twice the spectral efficiency of traditional half-duplex systems. The challenge of nonlinear SIC is similar to the classic problem of system identification in adaptive filter theory, whose crux lies in identifying the optimal nonlinear basis functions for a nonlinear system. This beco… ▽ More Nonlinear self-interference cancellation (SIC) is essential for full-duplex communication systems, which can offer twice the spectral efficiency of traditional half-duplex systems. The challenge of nonlinear SIC is similar to the classic problem of system identification in adaptive filter theory, whose crux lies in identifying the optimal nonlinear basis functions for a nonlinear system. This becomes especially difficult when the system input has a non-stationary distribution. In this paper, we propose a novel algorithm for nonlinear digital SIC that adaptively constructs orthonormal polynomial basis functions according to the non-stationary moments of the transmit signal. By combining these basis functions with the least mean squares (LMS) algorithm, we introduce a new SIC technique, called as the adaptive orthonormal polynomial LMS (AOP-LMS) algorithm. To reduce computational complexity for practical systems, we augment our approach with a precomputed look-up table, which maps a given modulation and coding scheme to its corresponding basis functions. Numerical simulation indicates that our proposed method surpasses existing state-of-the-art SIC algorithms in terms of convergence speed and mean squared error when the transmit signal is non-stationary, such as with adaptive modulation and coding. Experimental evaluation with a wireless testbed confirms that our proposed approach outperforms existing digital SIC algorithms. △ Less

Submitted 17 March, 2024; originally announced March 2024.

Comments: 13 pages, total 16 figures

arXiv:2403.09612 [pdf, other]

Compute-first optical detection for noise-resilient visual perception

Authors: Jungmin Kim, Nanfang Yu, Zongfu Yu

Abstract: In the context of visual perception, the optical signal from a scene is transferred into the electronic domain by detectors in the form of image data, which are then processed for the extraction of visual information. In noisy and weak-signal environments such as thermal imaging for night vision applications, however, the performance of neural computing tasks faces a significant bottleneck due to… ▽ More In the context of visual perception, the optical signal from a scene is transferred into the electronic domain by detectors in the form of image data, which are then processed for the extraction of visual information. In noisy and weak-signal environments such as thermal imaging for night vision applications, however, the performance of neural computing tasks faces a significant bottleneck due to the inherent degradation of data quality upon noisy detection. Here, we propose a concept of optical signal processing before detection to address this issue. We demonstrate that spatially redistributing optical signals through a properly designed linear transformer can enhance the detection noise resilience of visual perception tasks, as benchmarked with the MNIST classification. Our idea is supported by a quantitative analysis detailing the relationship between signal concentration and noise robustness, as well as its practical implementation in an incoherent imaging system. This compute-first detection scheme can pave the way for advancing infrared machine vision technologies widely used for industrial and defense applications. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: Main 9 pages, 5 figures, Supplementary information 5 pages

arXiv:2403.08187 [pdf, other]

Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children

Authors: Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam

Abstract: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children wit… ▽ More This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children with SSDs is impractical. We fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as pronounced rather than as existing words. The model was fine-tuned with a speech dataset from 137 children with inadequate speech production pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the pronunciations of the words matched the human annotations with about 90% accuracy. While the model still requires improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation error diagnostic procedures in clinical fields. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: 12 pages, 2 figures

ACM Class: I.2.7

arXiv:2403.06397 [pdf, other]

DeepSafeMPC: Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning

Authors: Xuefeng Wang, Henglin Pu, Hyung Jun Kim, Husheng Li

Abstract: Safe Multi-agent reinforcement learning (safe MARL) has increasingly gained attention in recent years, emphasizing the need for agents to not only optimize the global return but also adhere to safety requirements through behavioral constraints. Some recent work has integrated control theory with multi-agent reinforcement learning to address the challenge of ensuring safety. However, there have bee… ▽ More Safe Multi-agent reinforcement learning (safe MARL) has increasingly gained attention in recent years, emphasizing the need for agents to not only optimize the global return but also adhere to safety requirements through behavioral constraints. Some recent work has integrated control theory with multi-agent reinforcement learning to address the challenge of ensuring safety. However, there have been only very limited applications of Model Predictive Control (MPC) methods in this domain, primarily due to the complex and implicit dynamics characteristic of multi-agent environments. To bridge this gap, we propose a novel method called Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning (DeepSafeMPC). The key insight of DeepSafeMPC is leveraging a entralized deep learning model to well predict environmental dynamics. Our method applies MARL principles to search for optimal solutions. Through the employment of MPC, the actions of agents can be restricted within safe states concurrently. We demonstrate the effectiveness of our approach using the Safe Multi-agent MuJoCo environment, showcasing significant advancements in addressing safety concerns in MARL. △ Less

Submitted 11 March, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

Comments: 8 pages, 5 figures

arXiv:2403.03642 [pdf, other]

Generative Active Learning with Variational Autoencoder for Radiology Data Generation in Veterinary Medicine

Authors: In-Gyu Lee, Jun-Young Oh, Hee-Jung Yu, Jae-Hwan Kim, Ki-Dong Eom, Ji-Hoon Jeong

Abstract: Recently, with increasing interest in pet healthcare, the demand for computer-aided diagnosis (CAD) systems in veterinary medicine has increased. The development of veterinary CAD has stagnated due to a lack of sufficient radiology data. To overcome the challenge, we propose a generative active learning framework based on a variational autoencoder. This approach aims to alleviate the scarcity of r… ▽ More Recently, with increasing interest in pet healthcare, the demand for computer-aided diagnosis (CAD) systems in veterinary medicine has increased. The development of veterinary CAD has stagnated due to a lack of sufficient radiology data. To overcome the challenge, we propose a generative active learning framework based on a variational autoencoder. This approach aims to alleviate the scarcity of reliable data for CAD systems in veterinary medicine. This study utilizes datasets comprising cardiomegaly radiograph data. After removing annotations and standardizing images, we employed a framework for data augmentation, which consists of a data generation phase and a query phase for filtering the generated data. The experimental results revealed that as the data generated through this framework was added to the training data of the generative model, the frechet inception distance consistently decreased from 84.14 to 50.75 on the radiograph. Subsequently, when the generated data were incorporated into the training of the classification model, the false positive of the confusion matrix also improved from 0.16 to 0.66 on the radiograph. The proposed framework has the potential to address the challenges of data scarcity in medical CAD, contributing to its advancement. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.01898 [pdf, other]

Revisiting Learning-based Video Motion Magnification for Real-time Processing

Authors: Hyunwoo Ha, Oh Hyun-Bin, Kim Jun-Seong, Kwon Byung-Ki, Kim Sung-Bin, Linh-Tam Tran, Ji-Yun Kim, Sung-Ho Bae, Tae-Hyun Oh

Abstract: Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. The deep learning-based prior work successfully demonstrates the modelling of the motion magnification problem with outstanding quality compared to conventional signal processing-based ones. However, it still lags behind real-time performance, which prevents it from being e… ▽ More Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. The deep learning-based prior work successfully demonstrates the modelling of the motion magnification problem with outstanding quality compared to conventional signal processing-based ones. However, it still lags behind real-time performance, which prevents it from being extended to various online applications. In this paper, we investigate an efficient deep learning-based motion magnification model that runs in real time for full-HD resolution videos. Due to the specified network design of the prior art, i.e. inhomogeneous architecture, the direct application of existing neural architecture search methods is complicated. Instead of automatic search, we carefully investigate the architecture module by module for its role and importance in the motion magnification task. Two key findings are 1) Reducing the spatial resolution of the latent motion representation in the decoder provides a good trade-off between computational efficiency and task quality, and 2) surprisingly, only a single linear layer and a single branch in the encoder are sufficient for the motion magnification task. Based on these findings, we introduce a real-time deep learning-based motion magnification model with4.2X fewer FLOPs and is 2.7X faster than the prior art while maintaining comparable quality. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: 19 pages

arXiv:2402.05706 [pdf, other]

Unified Speech-Text Pretraining for Spoken Dialog Modeling

Authors: Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Sungroh Yoon, Kang Min Yoo

Abstract: While recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with… ▽ More While recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with organic prosodic features relevant to the given input speech without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions. Our approach employs a multi-step speech-text inference scheme that leverages chain-of-reasoning capabilities exhibited by the underlying LLM. We also propose a generalized speech-text pretraining scheme that helps with capturing cross-modal semantics. Automatic and human evaluations show that the proposed approach is effective in generating natural-sounding spoken responses, outperforming both prior and cascaded baselines. Detailed comparative studies reveal that, despite the cascaded approach being stronger in individual components, the joint speech-text modeling improves robustness against recognition errors and speech quality. Demo is available at https://unifiedsdm.github.io. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.05350 [pdf, other]

Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model

Authors: Junghun Cha, Ali Haider, Seoyun Yang, Hoeyeong Jin, Subin Yang, A. F. M. Shahab Uddin, Jaehyoung Kim, Soo Ye Kim, Sung-Ho Bae

Abstract: A significant volume of analog information, i.e., documents and images, have been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such contents is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies… ▽ More A significant volume of analog information, i.e., documents and images, have been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such contents is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies has become an indispensable task for many products, it has not been systematically explored, and to the best of our knowledge, no public datasets are available. In this paper, we define this problem as Descanning and introduce a new high-quality and large-scale dataset named DESCAN-18K. It contains 18K pairs of original and scanned images collected in the wild containing multiple complex degradations. In order to eliminate such complex degradations, we propose a new image restoration model called DescanDiffusion consisting of a color encoder that corrects the global color degradation and a conditional denoising diffusion probabilistic model (DDPM) that removes local degradations. To further improve the generalization ability of DescanDiffusion, we also design a synthetic data generation scheme by reproducing prominent degradations in scanned images. We demonstrate that our DescanDiffusion outperforms other baselines including commercial restoration products, objectively and subjectively, via comprehensive experiments and analyses. △ Less

Submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted to AAAI 2024

arXiv:2402.01298 [pdf, other]

Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations

Authors: Jaeyeon Kim, Injune Hwang, Kyogu Lee

Abstract: We propose a framework to learn semantics from raw audio signals using two types of representations, encoding contextual and phonetic information respectively. Specifically, we introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions. For the language model, we adopt a dual-channel architecture to incorporate both types of representa… ▽ More We propose a framework to learn semantics from raw audio signals using two types of representations, encoding contextual and phonetic information respectively. Specifically, we introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions. For the language model, we adopt a dual-channel architecture to incorporate both types of representation. We also present new training objectives, masked context reconstruction and masked context prediction, that push models to learn semantics effectively. Experiments on the sSIMI metric of Zero Resource Speech Benchmark 2021 and Fluent Speech Command dataset show our framework learns semantics better than models trained with only one type of representation. △ Less

Submitted 2 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.18006 [pdf, other]

EEG-GPT: Exploring Capabilities of Large Language Models for EEG Classification and Interpretation

Authors: Jonathan W. Kim, Ahmed Alaa, Danilo Bernardo

Abstract: In conventional machine learning (ML) approaches applied to electroencephalography (EEG), this is often a limited focus, isolating specific brain activities occurring across disparate temporal scales (from transient spikes in milliseconds to seizures lasting minutes) and spatial scales (from localized high-frequency oscillations to global sleep activity). This siloed approach limits the developmen… ▽ More In conventional machine learning (ML) approaches applied to electroencephalography (EEG), this is often a limited focus, isolating specific brain activities occurring across disparate temporal scales (from transient spikes in milliseconds to seizures lasting minutes) and spatial scales (from localized high-frequency oscillations to global sleep activity). This siloed approach limits the development EEG ML models that exhibit multi-scale electrophysiological understanding and classification capabilities. Moreover, typical ML EEG approaches utilize black-box approaches, limiting their interpretability and trustworthiness in clinical contexts. Thus, we propose EEG-GPT, a unifying approach to EEG classification that leverages advances in large language models (LLM). EEG-GPT achieves excellent performance comparable to current state-of-the-art deep learning methods in classifying normal from abnormal EEG in a few-shot learning paradigm utilizing only 2% of training data. Furthermore, it offers the distinct advantages of providing intermediate reasoning steps and coordinating specialist EEG tools across multiple scales in its operation, offering transparent and interpretable step-by-step verification, thereby promoting trustworthiness in clinical contexts. △ Less

Submitted 3 February, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

arXiv:2401.17690 [pdf, other]

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Authors: Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo

Abstract: We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpa… ▽ More We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP . An online demo is available at https://huggingface.co/spaces/enclap-team/enclap . △ Less

Submitted 31 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.13836 [pdf, other]

doi 10.1016/j.conengprac.2024.105841

Machine learning for industrial sensing and control: A survey and practical perspective

Authors: Nathan P. Lawrence, Seshu Kumar Damarla, Jong Woo Kim, Aditya Tulsyan, Faraz Amjad, Kai Wang, Benoit Chachuat, Jong Min Lee, Biao Huang, R. Bhushan Gopaluni

Abstract: With the rise of deep learning, there has been renewed interest within the process industries to utilize data on large-scale nonlinear sensing and control problems. We identify key statistical and machine learning techniques that have seen practical success in the process industries. To do so, we start with hybrid modeling to provide a methodological framework underlying core application areas: so… ▽ More With the rise of deep learning, there has been renewed interest within the process industries to utilize data on large-scale nonlinear sensing and control problems. We identify key statistical and machine learning techniques that have seen practical success in the process industries. To do so, we start with hybrid modeling to provide a methodological framework underlying core application areas: soft sensing, process optimization, and control. Soft sensing contains a wealth of industrial applications of statistical and machine learning methods. We quantitatively identify research trends, allowing insight into the most successful techniques in practice. We consider two distinct flavors for data-driven optimization and control: hybrid modeling in conjunction with mathematical programming techniques and reinforcement learning. Throughout these application areas, we discuss their respective industrial requirements and challenges. A common challenge is the interpretability and efficiency of purely data-driven methods. This suggests a need to carefully balance deep learning techniques with domain knowledge. As a result, we highlight ways prior knowledge may be integrated into industrial machine learning applications. The treatment of methods, problems, and applications presented here is poised to inform and inspire practitioners and researchers to develop impactful data-driven sensing, optimization, and control solutions in the process industries. △ Less

Submitted 24 January, 2024; originally announced January 2024.

Comments: 48 pages

Journal ref: Control Engineering Practice 2024

arXiv:2401.10465 [pdf, other]

Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

Authors: Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda

Abstract: Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languag… ▽ More Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both of these issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed representations. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. Furthermore, we show that our data-driven lexicon-free method performs as good or even marginally better than the conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no prior language lexicon or phoneme set, i.e. no linguistic expertise. △ Less

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2401.10032 [pdf, other]

FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

Authors: Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

Abstract: The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated co… ▽ More The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated convolution that elevates frequency awareness, resulting in generating speech with accurate frequency information, and (3) We introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training time and 2.2 times faster inference speed compared to our baseline while reducing the model size by 0.6 times (only 1.78M parameters) without sacrificing the output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad. △ Less

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.08902 [pdf, other]

Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search

Authors: Matthew C. McCallum, Florian Henkel, Jaehun Kim, Samuel E. Sandberg, Matthew E. P. Davies

Abstract: Audio embeddings enable large scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., wrt. tempo, mood or genre). Previous works have proposed disentangled embedding spaces where subspaces rep… ▽ More Audio embeddings enable large scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., wrt. tempo, mood or genre). Previous works have proposed disentangled embedding spaces where subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case-study towards this goal. We propose tempo translation functions that allow for efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can be used as an efficient data augmentation strategy for both training of downstream tempo predictors, and improved nearest neighbor retrieval of properties largely independent of tempo. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

Showing 1–50 of 494 results for author: Kim, J