Showing 1–50 of 494 results for author: Kim, J

Searching in archive eess.
  1. arXiv:2407.04280  [pdf, other]

    cs.CL cs.SD eess.AS

    LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

    Authors: Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

    Abstract: Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis revea…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Accepted for INTERSPEECH 2024

  2. arXiv:2406.19135  [pdf, other]

    eess.AS cs.AI

    DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

    Authors: Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han

    Abstract: Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations to obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Preprint

  3. arXiv:2406.16994  [pdf, other]

    eess.SP cs.AI

    Quantum Multi-Agent Reinforcement Learning for Cooperative Mobile Access in Space-Air-Ground Integrated Networks

    Authors: Gyu Seon Kim, Yeryeong Cho, Jaehyun Chung, Soohyun Park, Soyi Jung, Zhu Han, Joongheon Kim

    Abstract: Achieving global space-air-ground integrated network (SAGIN) access only with CubeSats presents significant challenges such as the access sustainability limitations in specific regions (e.g., polar regions) and the energy efficiency limitations in CubeSats. To tackle these problems, high-altitude long-endurance unmanned aerial vehicles (HALE-UAVs) can complement these CubeSat shortcomings for prov…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 17 pages, 22 figures

  4. arXiv:2406.14372  [pdf, ps, other]

    eess.SY

    Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growth

    Authors: Yeongjun Jang, Joowon Lee, Seonhong Min, Hyesun Kwak, Junsoo Kim, Yongsoo Song

    Abstract: In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a pol…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: 12 pages, 3 figures

  5. arXiv:2406.11427  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

    Authors: Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

    Abstract: Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models f…

    Submitted 17 June, 2024; originally announced June 2024.

  6. arXiv:2406.09819  [pdf, other]

    eess.AS

    Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone Environments

    Authors: Jihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang

    Abstract: Ad-hoc distributed microphone environments, where microphone locations and numbers are unpredictable, present a challenge to traditional deep learning models, which typically require fixed architectures. To tailor deep learning models to accommodate arbitrary array configurations, the Transform-Average-Concatenate (TAC) layer was previously introduced. In this work, we integrate TAC layers with du…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  7. arXiv:2406.09286  [pdf, other]

    eess.AS cs.SD

    FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

    Authors: Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

    Abstract: This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the numbe…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024

  8. arXiv:2406.07103  [pdf, other]

    eess.AS cs.AI

    MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

    Authors: Seung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu

    Abstract: In speaker verification systems, the utilization of short utterances presents a persistent challenge, leading to performance degradation primarily due to insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted by Interspeech 2024

  9. arXiv:2406.06786  [pdf, other]

    cs.SD cs.AI eess.AS

    BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification

    Authors: June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung

    Abstract: Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model u…

    Submitted 14 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted INTERSPEECH 2024

  10. arXiv:2406.02936  [pdf]

    eess.IV cs.CV

    Radiomics-guided Multimodal Self-attention Network for Predicting Pathological Complete Response in Breast MRI

    Authors: Jonghun Kim, Hyunjin Park

    Abstract: Breast cancer is the most prevalent cancer among women and predicting pathologic complete response (pCR) after anti-cancer treatment is crucial for patient prognosis and treatment customization. Deep learning has shown promise in medical imaging diagnosis, particularly when utilizing multiple imaging modalities to enhance accuracy. This study presents a model that predicts pCR in breast cancer pat…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 5 pages, 5 figures, IEEE ISBI 2024 proceedings

  11. arXiv:2406.00123  [pdf]

    eess.IV cs.CV

    Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration

    Authors: Mingyuan Meng, Dagan Feng, Lei Bi, Jinman Kim

    Abstract: Deformable image registration is a fundamental step for medical image analysis. Recently, transformers have been used for registration and outperformed Convolutional Neural Networks (CNNs). Transformers can capture long-range dependence among image features, which have been shown beneficial for registration. However, due to the high computation/memory loads of self-attention, transformers are typi…

    Submitted 12 June, 2024; v1 submitted 31 May, 2024; originally announced June 2024.

    Comments: Accepted at CVPR2024 as Oral Presentation && Best Paper Candidate

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9645-9654

  12. arXiv:2405.10272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS eess.IV

    Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

    Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

    Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  13. arXiv:2405.08614  [pdf, other]

    eess.SP

    FDD Massive MIMO: How to Optimally Combine UL Pilot and Limited DL CSI Feedback?

    Authors: Jungyeon Kim, Jinseok Choi, Jeonghun Park, Ahmed Alkhateeb, Namyoon Lee

    Abstract: In frequency-division duplexing (FDD) multiple-input multiple-output (MIMO) systems, obtaining accurate downlink channel state information (CSI) for precoding is vastly challenging due to the tremendous feedback overhead with the growing number of antennas. Utilizing uplink pilots for downlink CSI estimation is a promising approach that can eliminate CSI feedback. However, the downlink CSI estimat…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 13 pages, 10 figures

  14. arXiv:2405.08277  [pdf, other]

    eess.SY

    AI-driven, Model-Free Current Control: A Deep Symbolic Approach for Optimal Induction Machine Performance

    Authors: Muhammad Usama, Yunkyung Hwang, Jaehong Kim

    Abstract: This paper proposed a straightforward and efficient current control solution for induction machines employing deep symbolic regression (DSR). The proposed DSR-based control design offers a simple yet highly effective approach by creating an optimal control model through training and fitting, resulting in an analytical dynamic numerical expression that characterizes the data. Notably, this approach…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: This work has been accepted for potential publication at the IEEE ECCE Asia 2024 International Power Electronics and Motion Control Conference. Please note that copyright may be transferred without prior notice

  15. arXiv:2405.07255  [pdf, ps, other]

    eess.SP

    Deep Learning-aided Parametric Sparse Channel Estimation for Terahertz Massive MIMO Systems

    Authors: Jinhong Kim, Yongjun Ahn, Seungnyun Kim, Byonghyo Shim

    Abstract: Terahertz (THz) communications is considered as one of key solutions to support extremely high data demand in 6G. One main difficulty of the THz communication is the severe signal attenuation caused by the foliage loss, oxygen/atmospheric absorption, body and hand losses. To compensate for the severe path loss, multiple-input-multiple-output (MIMO) antenna array-based beamforming has been widely u…

    Submitted 12 May, 2024; originally announced May 2024.

  16. arXiv:2405.06284  [pdf, other]

    eess.IV cs.CV cs.LG

    Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention

    Authors: Ju-Hyeon Nam, Nur Suriza Syazwany, Su Jung Kim, Sang-Chul Lee

    Abstract: Generalizability in deep neural networks plays a pivotal role in medical image segmentation. However, deep learning-based medical image analyses tend to overlook the importance of frequency variance, which is critical element for achieving a model that is both modality-agnostic and domain-generalizable. Additionally, various models fail to account for the potential information loss that can arise…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: Accepted in Computer Vision and Pattern Recognition (CVPR) 2024

  17. arXiv:2405.03732  [pdf]

    eess.IV cs.AI cs.CV cs.LG

    Accelerated MR Cholangiopancreatography with Deep Learning-based Reconstruction

    Authors: Jinho Kim, Marcel Dominik Nickel, Florian Knoll

    Abstract: This study accelerates MR cholangiopancreatography (MRCP) acquisitions using deep learning-based (DL) reconstruction at 3T and 0.55T. Thirty healthy volunteers underwent conventional two-fold MRCP scans at field strengths of 3T or 0.55T. We trained a variational network (VN) using retrospectively six-fold undersampled data obtained at 3T. We then evaluated our method against standard techniques su…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 20 pages, 6 figures, 2 tables

  18. arXiv:2405.03684  [pdf, other]

    eess.IV

    All-in-One Deep Learning Framework for MR Image Reconstruction

    Authors: Geunu Jeong, Hyeonsoo Kim, Joonyoung Yang, Kyungeun Jang, Jeewook Kim

    Abstract: We introduce a novel, all-in-one deep learning framework for MR image reconstruction, enabling a single model to enhance image quality across multiple aspects of k-space sampling and to be effective across a wide range of clinical and technical scenarios. This DICOM-based algorithm serves as the core of SwiftMR (AIRS Medical, Seoul, Korea), which is FDA-cleared, CE-certified, and commercially avai…

    Submitted 26 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: 22 pages, 9 figures; number of collected MR raw data corrected

  19. arXiv:2405.02996  [pdf, other]

    cs.SD cs.AI eess.AS

    RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification

    Authors: June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung

    Abstract: Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, intuitively would share closer resemblance to lung sounds. This paper explores the efficacy of pretrain…

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: Accepted EMBC 2024

  20. arXiv:2405.02066  [pdf, other]

    cs.CV eess.IV

    WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights

    Authors: Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, Sangpil Kim

    Abstract: The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representat…

    Submitted 27 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  21. arXiv:2404.18516  [pdf, ps, other]

    eess.SP cs.IT

    Downlink Pilots are Essential for Cell-Free Massive MIMO with Multi-Antenna Users

    Authors: Eren Berk Kama, Junbeom Kim, Emil Björnson

    Abstract: We consider a cell-free massive MIMO system with multiple antennas on the users and access points. In previous works, the downlink spectral efficiency (SE) has been evaluated using the hardening bound that requires no downlink pilots. This approach works well when having single-antenna users. In this paper, we show that much higher SEs can be achieved if downlink pilots are sent since the effectiv…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  22. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  23. arXiv:2404.15333  [pdf, other]

    eess.SP cs.LG

    EB-GAME: A Game-Changer in ECG Heartbeat Anomaly Detection

    Authors: JuneYoung Park, Da Young Kim, Yunsoo Kim, Jisu Yoo, Tae Joon Kim

    Abstract: Cardiologists use electrocardiograms (ECG) for the detection of arrhythmias. However, continuous monitoring of ECG signals to detect cardiac abnormalities requires significant time and human resources. As a result, several deep learning studies have been conducted in advance for the automatic detection of arrhythmia. These models show relatively high performance in supervised learning, but are no…

    Submitted 8 April, 2024; originally announced April 2024.

  24. arXiv:2404.10936  [pdf, other]

    eess.SP cs.LG

    Beam Training in mmWave Vehicular Systems: Machine Learning for Decoupling Beam Selection

    Authors: Ibrahim Kilinc, Ryan M. Dreifuerst, Junghoon Kim, Robert W. Heath Jr

    Abstract: Codebook-based beam selection is one approach for configuring millimeter wave communication links. The overhead required to reconfigure the transmit and receive beam pair, though, increases in highly dynamic vehicular communication systems. Location information coupled with machine learning (ML) beam recommendation is one way to reduce the overhead of beam pair selection. In this paper, we develop…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Submitted to IEEE BlackSeaCom 2024, 6 pages, 5 figures

  25. arXiv:2404.07021  [pdf, other]

    eess.SP

    A 4x32Gb/s 1.8pJ/bit Collaborative Baud-Rate CDR with Background Eye-Climbing Algorithm and Low-Power Global Clock Distribution

    Authors: Jihee Kim, Jia Park, Jiwon Shin, Hanseok Kim, Kahyun Kim, Haengbeom Shin, Ha-Jung Park, Woo-Seok Choi

    Abstract: This paper presents design techniques for an energy-efficient multi-lane receiver (RX) with baud-rate clock and data recovery (CDR), which is essential for high-throughput low-latency communication in high-performance computing systems. The proposed low-power global clock distribution not only significantly reduces power consumption across multi-lane RXs but is capable of compensating for the freq…

    Submitted 22 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

  26. arXiv:2404.05558  [pdf, other]

    eess.IV cs.CV

    JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients

    Authors: Woo Kyoung Han, Sunghoon Im, Jaedeok Kim, Kyong Hwan Jin

    Abstract: We propose a practical approach to JPEG image decoding, utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate, inevitably resulting in quality degradation while encoding an image. We have designed a continuous cosine spectrum estimator to address the…

    Submitted 2 April, 2024; originally announced April 2024.

  27. arXiv:2404.05119  [pdf, other]

    eess.SP

    A 0.65-pJ/bit 3.6-TB/s/mm I/O Interface with XTalk Minimizing Affine Signaling for Next-Generation HBM with High Interconnect Density

    Authors: Hyunjun Park, Jiwon Shin, Hanseok Kim, Jihee Kim, Haengbeom Shin, Taehoon Kim, Jung-Hun Park, Woo-Seok Choi

    Abstract: This paper presents an I/O interface with Xtalk Minimizing Affine Signaling (XMAS), which is designed to support high-speed data transmission in die-to-die communication over silicon interposers or similar high-density interconnects susceptible to crosstalk. The operating principles of XMAS are elucidated through rigorous analyses, and its advantages over existing signaling are validated through n…

    Submitted 7 April, 2024; originally announced April 2024.

  28. arXiv:2404.04727  [pdf, other]

    eess.SY

    A code-driven tutorial on encrypted control: From pioneering realizations to modern implementations

    Authors: Nils Schlüter, Junsoo Kim, Moritz Schulze Darup

    Abstract: The growing interconnectivity in control systems due to robust wireless communication and cloud usage paves the way for exciting new opportunities such as data-driven control and service-based decision-making. At the same time, connected systems are susceptible to cyberattacks and data leakages. Against this background, encrypted control aims to increase the security and safety of cyber-physical s…

    Submitted 6 April, 2024; originally announced April 2024.

  29. arXiv:2404.02781  [pdf, other]

    eess.AS cs.SD

    CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

    Authors: Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

    Abstract: With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complex…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: ICLR 2024

  30. arXiv:2404.02574  [pdf, ps, other]

    eess.SY

    Learning with errors based dynamic encryption that discloses residue signal for anomaly detection

    Authors: Yeongjun Jang, Joowon Lee, Junsoo Kim, Hyungbo Shim

    Abstract: Anomaly detection is a protocol that detects integrity attacks on control systems by comparing the residue signal with a threshold. Implementing anomaly detection on encrypted control systems has been a challenge because it is hard to detect an anomaly from the encrypted residue signal without the secret key. In this paper, we propose a dynamic encryption scheme for a linear system that automatica…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 7 pages, 1 figure

  31. arXiv:2404.01464  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images

    Authors: JungEun Kim, Hangyul Yoon, Geondo Park, Kyungsu Kim, Eunho Yang

    Abstract: 4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Gi…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  32. arXiv:2403.17420  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

    Authors: Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

    Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localizati…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024

  33. arXiv:2403.17327  [pdf, other]

    cs.SD cs.CV eess.AS

    Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer

    Authors: Jeong-Yoon Kim, Seung-Ho Lee

    Abstract: In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in spectrogram and transferring positional information between ViT through knowledge transfer. The proposed method has the following originality i) We use vertically segmented patches of log-Mel spectr…

    Submitted 25 March, 2024; originally announced March 2024.

  34. arXiv:2403.12726  [pdf]

    eess.SP

    Small Distance Increment Method for Measuring Complex Permittivity With mmWave Radar

    Authors: Hang Song, Hyun Joon Kim, Mingxia Wan, Bo Wei, Takamaro Kikkawa, Jun-ichi Takada

    Abstract: Measuring the complex permittivity of material is essential in many scenarios such as quality check and component analysis. Generally, measurement methods for characterizing the material are based on the usage of vector network analyzer, which is large and not easy for on-site measurement, especially in high frequency range such as millimeter wave (mmWave). In addition, some measurement methods re…

    Submitted 19 March, 2024; originally announced March 2024.

  35. arXiv:2403.12098  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Deep Generative Design for Mass Production

    Authors: Jihoon Kim, Yongmin Kwon, Namwoo Kang

    Abstract: Generative Design (GD) has evolved as a transformative design approach, employing advanced algorithms and AI to create diverse and innovative solutions beyond traditional constraints. Despite its success, GD faces significant challenges regarding the manufacturability of complex designs, often necessitating extensive manual modifications due to limitations in standard manufacturing processes and t…

    Submitted 15 March, 2024; originally announced March 2024.

  36. arXiv:2403.11094  [pdf, other]

    eess.SP

    Nonlinear Self-Interference Cancellation With Learnable Orthonormal Polynomials for Full-Duplex Wireless Systems

    Authors: Hyowon Lee, Jungyeon Kim, Geon Choi, Ian P. Roberts, Jinseok Choi, Namyoon Lee

    Abstract: Nonlinear self-interference cancellation (SIC) is essential for full-duplex communication systems, which can offer twice the spectral efficiency of traditional half-duplex systems. The challenge of nonlinear SIC is similar to the classic problem of system identification in adaptive filter theory, whose crux lies in identifying the optimal nonlinear basis functions for a nonlinear system. This beco…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: 13 pages, total 16 figures

  37. arXiv:2403.09612  [pdf, other]

    physics.optics cs.CV cs.LG eess.IV

    Compute-first optical detection for noise-resilient visual perception

    Authors: Jungmin Kim, Nanfang Yu, Zongfu Yu

    Abstract: In the context of visual perception, the optical signal from a scene is transferred into the electronic domain by detectors in the form of image data, which are then processed for the extraction of visual information. In noisy and weak-signal environments such as thermal imaging for night vision applications, however, the performance of neural computing tasks faces a significant bottleneck due to…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Main 9 pages, 5 figures, Supplementary information 5 pages

  38. arXiv:2403.08187  [pdf, other]

    cs.CL cs.SD eess.AS

    Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children

    Authors: Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam

    Abstract: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children wit…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 12 pages, 2 figures

    ACM Class: I.2.7

  39. arXiv:2403.06397  [pdf, other]

    cs.LG cs.AI eess.SY

    DeepSafeMPC: Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning

    Authors: Xuefeng Wang, Henglin Pu, Hyung Jun Kim, Husheng Li

    Abstract: Safe Multi-agent reinforcement learning (safe MARL) has increasingly gained attention in recent years, emphasizing the need for agents to not only optimize the global return but also adhere to safety requirements through behavioral constraints. Some recent work has integrated control theory with multi-agent reinforcement learning to address the challenge of ensuring safety. However, there have bee…

    Submitted 11 March, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

    Comments: 8 pages, 5 figures

  40. arXiv:2403.03642  [pdf, other]

    eess.IV cs.CV cs.LG

    Generative Active Learning with Variational Autoencoder for Radiology Data Generation in Veterinary Medicine

    Authors: In-Gyu Lee, Jun-Young Oh, Hee-Jung Yu, Jae-Hwan Kim, Ki-Dong Eom, Ji-Hoon Jeong

    Abstract: Recently, with increasing interest in pet healthcare, the demand for computer-aided diagnosis (CAD) systems in veterinary medicine has increased. The development of veterinary CAD has stagnated due to a lack of sufficient radiology data. To overcome the challenge, we propose a generative active learning framework based on a variational autoencoder. This approach aims to alleviate the scarcity of r…

    Submitted 6 March, 2024; originally announced March 2024.

  41. arXiv:2403.01898  [pdf, other]

    cs.CV eess.IV

    Revisiting Learning-based Video Motion Magnification for Real-time Processing

    Authors: Hyunwoo Ha, Oh Hyun-Bin, Kim Jun-Seong, Kwon Byung-Ki, Kim Sung-Bin, Linh-Tam Tran, Ji-Yun Kim, Sung-Ho Bae, Tae-Hyun Oh

    Abstract: Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. The deep learning-based prior work successfully demonstrates the modelling of the motion magnification problem with outstanding quality compared to conventional signal processing-based ones. However, it still lags behind real-time performance, which prevents it from being e…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 19 pages

  42. arXiv:2402.05706  [pdf, other]

    cs.CL cs.SD eess.AS

    Unified Speech-Text Pretraining for Spoken Dialog Modeling

    Authors: Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Sungroh Yoon, Kang Min Yoo

    Abstract: While recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with…

    Submitted 8 February, 2024; originally announced February 2024.

  43. arXiv:2402.05350  [pdf, other]

    cs.CV eess.IV

    Descanning: From Scanned to the Original Images with a Color Correction Diffusion Model

    Authors: Junghun Cha, Ali Haider, Seoyun Yang, Hoeyeong Jin, Subin Yang, A. F. M. Shahab Uddin, Jaehyoung Kim, Soo Ye Kim, Sung-Ho Bae

    Abstract: A significant volume of analog information, i.e., documents and images, has been digitized in the form of scanned copies for storing, sharing, and/or analyzing in the digital world. However, the quality of such contents is severely degraded by various distortions caused by printing, storing, and scanning processes in the physical world. Although restoring high-quality content from scanned copies…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: Accepted to AAAI 2024

  44. arXiv:2402.01298  [pdf, other]

    eess.AS cs.AI cs.SD

    Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations

    Authors: Jaeyeon Kim, Injune Hwang, Kyogu Lee

    Abstract: We propose a framework to learn semantics from raw audio signals using two types of representations, encoding contextual and phonetic information respectively. Specifically, we introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions. For the language model, we adopt a dual-channel architecture to incorporate both types of representa…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  45. arXiv:2401.18006  [pdf, other]

    q-bio.QM cs.LG eess.SP

    EEG-GPT: Exploring Capabilities of Large Language Models for EEG Classification and Interpretation

    Authors: Jonathan W. Kim, Ahmed Alaa, Danilo Bernardo

    Abstract: In conventional machine learning (ML) approaches applied to electroencephalography (EEG), there is often a limited focus, isolating specific brain activities occurring across disparate temporal scales (from transient spikes in milliseconds to seizures lasting minutes) and spatial scales (from localized high-frequency oscillations to global sleep activity). This siloed approach limits the developmen…

    Submitted 3 February, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

  46. arXiv:2401.17690  [pdf, other]

    eess.AS cs.AI cs.SD

    EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

    Authors: Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo

    Abstract: We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpa…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  47. Machine learning for industrial sensing and control: A survey and practical perspective

    Authors: Nathan P. Lawrence, Seshu Kumar Damarla, Jong Woo Kim, Aditya Tulsyan, Faraz Amjad, Kai Wang, Benoit Chachuat, Jong Min Lee, Biao Huang, R. Bhushan Gopaluni

    Abstract: With the rise of deep learning, there has been renewed interest within the process industries to utilize data on large-scale nonlinear sensing and control problems. We identify key statistical and machine learning techniques that have seen practical success in the process industries. To do so, we start with hybrid modeling to provide a methodological framework underlying core application areas: so…

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: 48 pages

    Journal ref: Control Engineering Practice 2024

  48. arXiv:2401.10465  [pdf, other]

    cs.CL cs.SD eess.AS

    Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

    Authors: Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda

    Abstract: Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languag…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  49. arXiv:2401.10032  [pdf, other]

    eess.AS cs.AI eess.SP

    FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

    Authors: Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

    Abstract: The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad to operate on a simple and concise feature space, (2) We design a frequency-aware dilated co…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  50. arXiv:2401.08902  [pdf, other]

    cs.SD cs.DL cs.IR cs.LG eess.AS

    Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search

    Authors: Matthew C. McCallum, Florian Henkel, Jaehun Kim, Samuel E. Sandberg, Matthew E. P. Davies

    Abstract: Audio embeddings enable large scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., wrt. tempo, mood or genre). Previous works have proposed disentangled embedding spaces where subspaces rep…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: Accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024