Search | arXiv e-print repository

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations… ▽ More The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which is capable of high-quality motion code generation in an efficient way. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: CVPR 2024

arXiv:2405.08247 [pdf, other]

Automated classification of multi-parametric body MRI series

Authors: Boah Kim, Tejas Sudharshan Mathai, Kimberly Helm, Ronald M. Summers

Abstract: Multi-parametric MRI (mpMRI) studies are widely available in clinical practice for the diagnosis of various diseases. As the volume of mpMRI exams increases yearly, there are concomitant inaccuracies that exist within the DICOM header fields of these exams. This precludes the use of the header information for the arrangement of the different series as part of the radiologist's hanging protocol, an… ▽ More Multi-parametric MRI (mpMRI) studies are widely available in clinical practice for the diagnosis of various diseases. As the volume of mpMRI exams increases yearly, there are concomitant inaccuracies that exist within the DICOM header fields of these exams. This precludes the use of the header information for the arrangement of the different series as part of the radiologist's hanging protocol, and clinician oversight is needed for correction. In this pilot work, we propose an automated framework to classify the type of 8 different series in mpMRI studies. We used 1,363 studies acquired by three Siemens scanners to train a DenseNet-121 model with 5-fold cross-validation. Then, we evaluated the performance of the DenseNet-121 ensemble on a held-out test set of 313 mpMRI studies. Our method achieved an average precision of 96.6%, sensitivity of 96.6%, specificity of 99.6%, and F1 score of 96.6% for the MRI series classification task. To the best of our knowledge, we are the first to develop a method to classify the series type in mpMRI studies acquired at the level of the chest, abdomen, and pelvis. Our method has the capability for robust automation of hanging protocols in modern radiology practice. △ Less

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.05944 [pdf, other]

MRISegmentator-Abdomen: A Fully Automated Multi-Organ and Structure Segmentation Tool for T1-weighted Abdominal MRI

Authors: Yan Zhuang, Tejas Sudharshan Mathai, Pritam Mukherjee, Brandon Khoury, Boah Kim, Benjamin Hou, Nusrat Rabbee, Abhinav Suri, Ronald M. Summers

Abstract: Background: Segmentation of organs and structures in abdominal MRI is useful for many clinical applications, such as disease diagnosis and radiotherapy. Current approaches have focused on delineating a limited set of abdominal structures (13 types). To date, there is no publicly available abdominal MRI dataset with voxel-level annotations of multiple organs and structures. Consequently, a segmenta… ▽ More Background: Segmentation of organs and structures in abdominal MRI is useful for many clinical applications, such as disease diagnosis and radiotherapy. Current approaches have focused on delineating a limited set of abdominal structures (13 types). To date, there is no publicly available abdominal MRI dataset with voxel-level annotations of multiple organs and structures. Consequently, a segmentation tool for multi-structure segmentation is also unavailable. Methods: We curated a T1-weighted abdominal MRI dataset consisting of 195 patients who underwent imaging at National Institutes of Health (NIH) Clinical Center. The dataset comprises of axial pre-contrast T1, arterial, venous, and delayed phases for each patient, thereby amounting to a total of 780 series (69,248 2D slices). Each series contains voxel-level annotations of 62 abdominal organs and structures. A 3D nnUNet model, dubbed as MRISegmentator-Abdomen (MRISegmentator in short), was trained on this dataset, and evaluation was conducted on an internal test set and two large external datasets: AMOS22 and Duke Liver. The predicted segmentations were compared against the ground-truth using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD). Findings: MRISegmentator achieved an average DSC of 0.861$\pm$0.170 and a NSD of 0.924$\pm$0.163 in the internal test set. On the AMOS22 dataset, MRISegmentator attained an average DSC of 0.829$\pm$0.133 and a NSD of 0.908$\pm$0.067. For the Duke Liver dataset, an average DSC of 0.933$\pm$0.015 and a NSD of 0.929$\pm$0.021 was obtained. Interpretation: The proposed MRISegmentator provides automatic, accurate, and robust segmentations of 62 organs and structures in T1-weighted abdominal MRI sequences. The tool has the potential to accelerate research on various clinical topics, such as abnormality detection, radiotherapy, disease classification among others. △ Less

Submitted 24 June, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: We made the segmentation model publicly available

arXiv:2405.05107 [pdf, other]

Leveraging AES Padding: dBs for Nothing and FEC for Free in IoT Systems

Authors: Jongchan Woo, Vipindev Adat Vasudevan, Benjamin D. Kim, Rafael G. L. D'Oliveira, Alejandro Cohen, Thomas Stahlbuhk, Ken R. Duffy, Muriel Médard

Abstract: The Internet of Things (IoT) represents a significant advancement in digital technology, with its rapidly growing network of interconnected devices. This expansion, however, brings forth critical challenges in data security and reliability, especially under the threat of increasing cyber vulnerabilities. Addressing the security concerns, the Advanced Encryption Standard (AES) is commonly employed… ▽ More The Internet of Things (IoT) represents a significant advancement in digital technology, with its rapidly growing network of interconnected devices. This expansion, however, brings forth critical challenges in data security and reliability, especially under the threat of increasing cyber vulnerabilities. Addressing the security concerns, the Advanced Encryption Standard (AES) is commonly employed for secure encryption in IoT systems. Our study explores an innovative use of AES, by repurposing AES padding bits for error correction and thus introducing a dual-functional method that seamlessly integrates error-correcting capabilities into the standard encryption process. The integration of the state-of-the-art Guessing Random Additive Noise Decoder (GRAND) in the receiver's architecture facilitates the joint decoding and decryption process. This strategic approach not only preserves the existing structure of the transmitter but also significantly enhances communication reliability in noisy environments, achieving a notable over 3 dB gain in Block Error Rate (BLER). Remarkably, this enhanced performance comes with a minimal power overhead at the receiver - less than 15% compared to the traditional decryption-only process, underscoring the efficiency of our hardware design for IoT applications. This paper discusses a comprehensive analysis of our approach, particularly in energy efficiency and system performance, presenting a novel and practical solution for reliable IoT communications. △ Less

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2402.08098 [pdf, other]

Automated Classification of Body MRI Sequence Type Using Convolutional Neural Networks

Authors: Kimberly Helm, Tejas Sudharshan Mathai, Boah Kim, Pritam Mukherjee, Jianfei Liu, Ronald M. Summers

Abstract: Multi-parametric MRI of the body is routinely acquired for the identification of abnormalities and diagnosis of diseases. However, a standard naming convention for the MRI protocols and associated sequences does not exist due to wide variations in imaging practice at institutions and myriad MRI scanners from various manufacturers being used for imaging. The intensity distributions of MRI sequences… ▽ More Multi-parametric MRI of the body is routinely acquired for the identification of abnormalities and diagnosis of diseases. However, a standard naming convention for the MRI protocols and associated sequences does not exist due to wide variations in imaging practice at institutions and myriad MRI scanners from various manufacturers being used for imaging. The intensity distributions of MRI sequences differ widely as a result, and there also exists information conflicts related to the sequence type in the DICOM headers. At present, clinician oversight is necessary to ensure that the correct sequence is being read and used for diagnosis. This poses a challenge when specific series need to be considered for building a cohort for a large clinical study or for developing AI algorithms. In order to reduce clinician oversight and ensure the validity of the DICOM headers, we propose an automated method to classify the 3D MRI sequence acquired at the levels of the chest, abdomen, and pelvis. In our pilot work, our 3D DenseNet-121 model achieved an F1 score of 99.5% at differentiating 5 common MRI sequences obtained by three Siemens scanners (Aera, Verio, Biograph mMR). To the best of our knowledge, we are the first to develop an automated method for the 3D classification of MRI sequences in the chest, abdomen, and pelvis, and our work has outperformed the previous state-of-the-art MRI series classifiers. △ Less

Submitted 12 February, 2024; originally announced February 2024.

Comments: Accepted at SPIE 2024

arXiv:2402.06846 [pdf, other]

System-level Analysis of Adversarial Attacks and Defenses on Intelligence in O-RAN based Cellular Networks

Authors: Azuka Chiejina, Brian Kim, Kaushik Chowhdury, Vijay K. Shah

Abstract: While the open architecture, open interfaces, and integration of intelligence within Open Radio Access Network technology hold the promise of transforming 5G and 6G networks, they also introduce cybersecurity vulnerabilities that hinder its widespread adoption. In this paper, we conduct a thorough system-level investigation of cyber threats, with a specific focus on machine learning (ML) intellige… ▽ More While the open architecture, open interfaces, and integration of intelligence within Open Radio Access Network technology hold the promise of transforming 5G and 6G networks, they also introduce cybersecurity vulnerabilities that hinder its widespread adoption. In this paper, we conduct a thorough system-level investigation of cyber threats, with a specific focus on machine learning (ML) intelligence components known as xApps within the O-RAN's near-real-time RAN Intelligent Controller (near-RT RIC) platform. Our study begins by developing a malicious xApp designed to execute adversarial attacks on two types of test data - spectrograms and key performance metrics (KPMs), stored in the RIC database within the near-RT RIC. To mitigate these threats, we utilize a distillation technique that involves training a teacher model at a high softmax temperature and transferring its knowledge to a student model trained at a lower softmax temperature, which is deployed as the robust ML model within xApp. We prototype an over-the-air LTE/5G O-RAN testbed to assess the impact of these attacks and the effectiveness of the distillation defense technique by leveraging an ML-based Interference Classification (InterClass) xApp as an example. We examine two versions of InterClass xApp under distinct scenarios, one based on Convolutional Neural Networks (CNNs) and another based on Deep Neural Networks (DNNs) using spectrograms and KPMs as input data respectively. Our findings reveal up to 100% and 96.3% degradation in the accuracy of both the CNN and DNN models respectively resulting in a significant decline in network performance under considered adversarial attacks. Under the strict latency constraints of the near-RT RIC closed control loop, our analysis shows that the distillation technique outperforms classical adversarial training by achieving an accuracy of up to 98.3% for mitigating such attacks. △ Less

Submitted 13 February, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: his paper has been accepted for publication in ACM WiSec 2024

arXiv:2402.00977 [pdf, other]

Enhanced fringe-to-phase framework using deep learning

Authors: Won-Hoe Kim, Bongjoong Kim, Hyung-Gun Chi, Jae-Sang Hyun

Abstract: In Fringe Projection Profilometry (FPP), achieving robust and accurate 3D reconstruction with a limited number of fringe patterns remains a challenge in structured light 3D imaging. Conventional methods require a set of fringe images, but using only one or two patterns complicates phase recovery and unwrapping. In this study, we introduce SFNet, a symmetric fusion network that transforms two fring… ▽ More In Fringe Projection Profilometry (FPP), achieving robust and accurate 3D reconstruction with a limited number of fringe patterns remains a challenge in structured light 3D imaging. Conventional methods require a set of fringe images, but using only one or two patterns complicates phase recovery and unwrapping. In this study, we introduce SFNet, a symmetric fusion network that transforms two fringe images into an absolute phase. To enhance output reliability, Our framework predicts refined phases by incorporating information from fringe images of a different frequency than those used as input. This allows us to achieve high accuracy with just two images. Comparative experiments and ablation studies validate the effectiveness of our proposed method. The dataset and code are publicly accessible on our project page https://wonhoe-kim.github.io/SFNet. △ Less

Submitted 1 February, 2024; originally announced February 2024.

Comments: 35 pages, 13 figures, 6 tables

arXiv:2401.13921 [pdf, other]

Intelli-Z: Toward Intelligible Zero-Shot TTS

Authors: Sunghee Jung, Won Jang, Jaesam Yoon, Bongwan Kim

Abstract: Although numerous recent studies have suggested new frameworks for zero-shot TTS using large-scale, real-world data, studies that focus on the intelligibility of zero-shot TTS are relatively scarce. Zero-shot TTS demands additional efforts to ensure clear pronunciation and speech quality due to its inherent requirement of replacing a core parameter (speaker embedding or acoustic prompt) with a new… ▽ More Although numerous recent studies have suggested new frameworks for zero-shot TTS using large-scale, real-world data, studies that focus on the intelligibility of zero-shot TTS are relatively scarce. Zero-shot TTS demands additional efforts to ensure clear pronunciation and speech quality due to its inherent requirement of replacing a core parameter (speaker embedding or acoustic prompt) with a new one at the inference stage. In this study, we propose a zero-shot TTS model focused on intelligibility, which we refer to as Intelli-Z. Intelli-Z learns speaker embeddings by using multi-speaker TTS as its teacher and is trained with a cycle-consistency loss to include mismatched text-speech pairs for training. Additionally, it selectively aggregates speaker embeddings along the temporal dimension to minimize the interference of the text content of reference speech at the inference stage. We substantiate the effectiveness of the proposed methods with an ablation study. The Mean Opinion Score (MOS) increases by 9% for unseen speakers when the first two methods are applied, and it further improves by 16% when selective temporal aggregation is applied. △ Less

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.12473 [pdf, other]

Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Authors: Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations… ▽ More We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2 and 3mix respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers. △ Less

Submitted 22 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, accepted by ICASSP 2024

arXiv:2312.14939 [pdf, other]

Large-scale Graph Representation Learning of Dynamic Brain Connectome with Transformers

Authors: Byung-Hoon Kim, Jungwon Choi, EungGu Yun, Kyungsang Kim, Xiang Li, Juho Lee

Abstract: Graph Transformers have recently been successful in various graph representation learning tasks, providing a number of advantages over message-passing Graph Neural Networks. Utilizing Graph Transformers for learning the representation of the brain functional connectivity network is also gaining interest. However, studies to date have underlooked the temporal dynamics of functional connectivity, wh… ▽ More Graph Transformers have recently been successful in various graph representation learning tasks, providing a number of advantages over message-passing Graph Neural Networks. Utilizing Graph Transformers for learning the representation of the brain functional connectivity network is also gaining interest. However, studies to date have underlooked the temporal dynamics of functional connectivity, which fluctuates over time. Here, we propose a method for learning the representation of dynamic functional connectivity with Graph Transformers. Specifically, we define the connectome embedding, which holds the position, structure, and time information of the functional connectivity graph, and use Transformers to learn its representation across time. We perform experiments with over 50,000 resting-state fMRI samples obtained from three datasets, which is the largest number of fMRI data used in studies by far. The experimental results show that our proposed method outperforms other competitive baselines in gender classification and age regression tasks based on the functional connectivity extracted from the fMRI data. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023 Temporal Graph Learning Workshop

arXiv:2312.07896 [pdf, other]

TRACTOR: Traffic Analysis and Classification Tool for Open RAN

Authors: Joshua Groen, Mauro Belgiovine, Utku Demir, Brian Kim, Kaushik Chowdhury

Abstract: 5G and beyond cellular networks promise remarkable advancements in bandwidth, latency, and connectivity. The emergence of Open Radio Access Network (O-RAN) represents a pivotal direction for the evolution of cellular networks, inherently supporting machine learning (ML) for network operation control. Within this framework, RAN Intelligence Controllers (RICs) from one provider can employ ML models… ▽ More 5G and beyond cellular networks promise remarkable advancements in bandwidth, latency, and connectivity. The emergence of Open Radio Access Network (O-RAN) represents a pivotal direction for the evolution of cellular networks, inherently supporting machine learning (ML) for network operation control. Within this framework, RAN Intelligence Controllers (RICs) from one provider can employ ML models developed by third-party vendors through the acquisition of key performance indicators (KPIs) from geographically distant base stations or user equipment (UE). Yet, the development of ML models hinges on the availability of realistic and robust datasets. In this study, we embark on a two-fold journey. First, we collect a comprehensive 5G dataset, harnessing real-world cell phones across diverse applications, locations, and mobility scenarios. Next, we replicate this traffic within a full-stack srsRAN-based O-RAN framework on Colosseum, the world's largest radio frequency (RF) emulator. This process yields a robust and O-RAN compliant KPI dataset mirroring real-world conditions. We illustrate how such a dataset can fuel the training of ML models and facilitate the deployment of xApps for traffic slice classification by introducing a CNN based classifier that achieves accuracy $>95\%$ offline and $92\%$ online. To accelerate research in this domain, we provide open-source access to our toolchain and supplementary utilities, empowering the broader research community to expedite the creation of realistic and O-RAN compliant datasets. △ Less

Submitted 12 December, 2023; originally announced December 2023.

Comments: 6 pages, 5 figures, 2 tables, submitted to ICC 2024

arXiv:2312.06453 [pdf, other]

Semantic Image Synthesis for Abdominal CT

Authors: Yan Zhuang, Benjamin Hou, Tejas Sudharshan Mathai, Pritam Mukherjee, Boah Kim, Ronald M. Summers

Abstract: As a new emerging and promising type of generative models, diffusion models have proven to outperform Generative Adversarial Networks (GANs) in multiple tasks, including image synthesis. In this work, we explore semantic image synthesis for abdominal CT using conditional diffusion models, which can be used for downstream applications such as data augmentation. We systematically evaluated the perfo… ▽ More As a new emerging and promising type of generative models, diffusion models have proven to outperform Generative Adversarial Networks (GANs) in multiple tasks, including image synthesis. In this work, we explore semantic image synthesis for abdominal CT using conditional diffusion models, which can be used for downstream applications such as data augmentation. We systematically evaluated the performance of three diffusion models, as well as to other state-of-the-art GAN-based approaches, and studied the different conditioning scenarios for the semantic mask. Experimental results demonstrated that diffusion models were able to synthesize abdominal CT images with better quality. Additionally, encoding the mask and the input separately is more effective than naïve concatenating. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: This paper has been accepted at Deep Generative Models workshop at MICCAI 2023

arXiv:2312.01994 [pdf, other]

A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data

Authors: Jungwon Choi, Seongho Keum, EungGu Yun, Byung-Hoon Kim, Juho Lee

Abstract: Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Network (GNN). Recent research on the application of GNN to FC suggests that exploiting the time-varying properties of the FC could signifi… ▽ More Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Network (GNN). Recent research on the application of GNN to FC suggests that exploiting the time-varying properties of the FC could significantly improve the accuracy and interpretability of the model prediction. However, the high cost of acquiring high-quality fMRI data and corresponding phenotypic labels poses a hurdle to their application in real-world settings, such that a model naïvely trained in a supervised fashion can suffer from insufficient performance or a lack of generalization on a small number of data. In addition, most Self-Supervised Learning (SSL) approaches for GNNs to date adopt a contrastive strategy, which tends to lose appropriate semantic information when the graph structure is perturbed or does not leverage both spatial and temporal information simultaneously. In light of these challenges, we propose a generative SSL approach that is tailored to effectively harness spatio-temporal information within dynamic FC. Our empirical results, experimented with large-scale (>50,000) fMRI datasets, demonstrate that our approach learns valuable representations and enables the construction of accurate and robust models when fine-tuned for downstream tasks. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023 Temporal Graph Learning Workshop

arXiv:2311.11745 [pdf, other]

ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis

Authors: Jungil Kong, Junmo Lee, Jeongmin Kim, Beomjeong Kim, Jihoon Park, Dohee Kong, Changheon Lee, Sangjin Kim

Abstract: In this work, we propose a novel method for modeling numerous speakers, which enables expressing the overall characteristics of speakers in detail like a trained multi-speaker model without additional training on the target speaker's dataset. Although various works with similar purposes have been actively studied, their performance has not yet reached that of trained multi-speaker models due to th… ▽ More In this work, we propose a novel method for modeling numerous speakers, which enables expressing the overall characteristics of speakers in detail like a trained multi-speaker model without additional training on the target speaker's dataset. Although various works with similar purposes have been actively studied, their performance has not yet reached that of trained multi-speaker models due to their fundamental limitations. To overcome previous limitations, we propose effective methods for feature learning and representing target speakers' speech characteristics by discretizing the features and conditioning them to a speech synthesis model. Our method obtained a significantly higher similarity mean opinion score (SMOS) in subjective similarity evaluation than seen speakers of a high-performance multi-speaker model, even with unseen speakers. The proposed method also outperforms a zero-shot method by significant margins. Furthermore, our method shows remarkable performance in generating new artificial speakers. In addition, we demonstrate that the encoded latent features are sufficiently informative to reconstruct an original speaker's speech completely. It implies that our method can be used as a general methodology to encode and reconstruct speakers' characteristics in various tasks. △ Less

Submitted 31 May, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: ICML 2024

arXiv:2309.06035 [pdf]

Non-reciprocal absorption and zero reflection in physically separated dual photonic resonators by traveling-wave-induced indirect coupling

Authors: Bojong Kim, Junyoung Kim, Hae-Chan Jeon, Sang-Koog Kim

Abstract: We experimentally explored novel behaviors of non-reciprocal absorption and almost zero reflection in a dual photon resonator system, which is physically separated and composed of two inverted split ring resonators (ISRRs) with varying inter-distances. We also found that an electromagnetically-induced-transparency (EIT)-like peak at a specific inter-distance of d = 18 mm through traveling waves fl… ▽ More We experimentally explored novel behaviors of non-reciprocal absorption and almost zero reflection in a dual photon resonator system, which is physically separated and composed of two inverted split ring resonators (ISRRs) with varying inter-distances. We also found that an electromagnetically-induced-transparency (EIT)-like peak at a specific inter-distance of d = 18 mm through traveling waves flowing along a shared microstrip line to which the dual ISRRs are dissipatively coupled. With the aid of CST-simulations and analytical modeling, we found that destructive and/or constructive interferences in traveling waves, indirectly coupled to each ISRR, result in a traveling-wave-induced transparency peak within a narrow window. Furthermore, we observed not only strong non-reciprocal responses of reflectivity and absorptivity at individual inter-distances exactly at the corresponding EIT-like peak positions, but also nearly zero reflection and almost perfect absorption for a specific case of d = 20 mm. Finally, the unidirectional absorptions with zero reflection at d = 20 mm are found to be ascribed to a non-Hermitian origin. This work not only provides a better understanding of traveling-wave-induced indirect coupling between two photonic resonators without magnetic coupling, but also suggests potential implications for the resulting non-reciprocal behaviors of absorption and reflection in microwave circuits and quantum information devices. △ Less

Submitted 12 September, 2023; originally announced September 2023.

arXiv:2309.03844 [pdf, other]

Experimental Study of Adversarial Attacks on ML-based xApps in O-RAN

Authors: Naveen Naik Sapavath, Brian Kim, Kaushik Chowdhury, Vijay K Shah

Abstract: Open Radio Access Network (O-RAN) is considered as a major step in the evolution of next-generation cellular networks given its support for open interfaces and utilization of artificial intelligence (AI) into the deployment, operation, and maintenance of RAN. However, due to the openness of the O-RAN architecture, such AI models are inherently vulnerable to various adversarial machine learning (ML… ▽ More Open Radio Access Network (O-RAN) is considered as a major step in the evolution of next-generation cellular networks given its support for open interfaces and utilization of artificial intelligence (AI) into the deployment, operation, and maintenance of RAN. However, due to the openness of the O-RAN architecture, such AI models are inherently vulnerable to various adversarial machine learning (ML) attacks, i.e., adversarial attacks which correspond to slight manipulation of the input to the ML model. In this work, we showcase the vulnerability of an example ML model used in O-RAN, and experimentally deploy it in the near-real time (near-RT) RAN intelligent controller (RIC). Our ML-based interference classifier xApp (extensible application in near-RT RIC) tries to classify the type of interference to mitigate the interference effect on the O-RAN system. We demonstrate the first-ever scenario of how such an xApp can be impacted through an adversarial attack by manipulating the data stored in a shared database inside the near-RT RIC. Through a rigorous performance analysis deployed on a laboratory O-RAN testbed, we evaluate the performance in terms of capacity and the prediction accuracy of the interference classifier xApp using both clean and perturbed data. We show that even small adversarial attacks can significantly decrease the accuracy of ML application in near-RT RIC, which can directly impact the performance of the entire O-RAN deployment. △ Less

Submitted 7 September, 2023; originally announced September 2023.

Comments: Accepted for Globecom 2023

arXiv:2309.00647 [pdf, other]

Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data

Authors: Seunghan Yang, Byeonggeun Kim, Kyuhong Shim, Simyung Chang

Abstract: Few-shot keyword spotting (FS-KWS) models usually require large-scale annotated datasets to generalize to unseen target keywords. However, existing KWS datasets are limited in scale and gathering keyword-like labeled data is costly undertaking. To mitigate this issue, we propose a framework that uses easily collectible, unlabeled reading speech data as an auxiliary source. Self-supervised learning… ▽ More Few-shot keyword spotting (FS-KWS) models usually require large-scale annotated datasets to generalize to unseen target keywords. However, existing KWS datasets are limited in scale and gathering keyword-like labeled data is costly undertaking. To mitigate this issue, we propose a framework that uses easily collectible, unlabeled reading speech data as an auxiliary source. Self-supervised learning has been widely adopted for learning representations from unlabeled data; however, it is known to be suitable for large models with enough capacity and is not practical for training a small footprint FS-KWS model. Instead, we automatically annotate and filter the data to construct a keyword-like dataset, LibriWord, enabling supervision on auxiliary data. We then adopt multi-task learning that helps the model to enhance the representation power from out-of-domain auxiliary data. Our method notably improves the performance over competitive methods in the FS-KWS benchmark. △ Less

Submitted 31 August, 2023; originally announced September 2023.

Comments: Interspeech 2023

arXiv:2308.10372 [pdf]

Developing a Machine Learning-Based Clinical Decision Support Tool for Uterine Tumor Imaging

Authors: Darryl E. Wright, Adriana V. Gregory, Deema Anaam, Sepideh Yadollahi, Sumana Ramanathan, Kafayat A. Oyemade, Reem Alsibai, Heather Holmes, Harrison Gottlich, Cherie-Akilah G. Browne, Sarah L. Cohen Rassier, Isabel Green, Elizabeth A. Stewart, Hiroaki Takahashi, Bohyun Kim, Shannon Laughlin-Tommaso, Timothy L. Kline

Abstract: Uterine leiomyosarcoma (LMS) is a rare but aggressive malignancy. On imaging, it is difficult to differentiate LMS from, for example, degenerated leiomyoma (LM), a prevalent but benign condition. We curated a data set of 115 axial T2-weighted MRI images from 110 patients (mean [range] age=45 [17-81] years) with UTs that included five different tumor types. These data were randomly split stratifyin… ▽ More Uterine leiomyosarcoma (LMS) is a rare but aggressive malignancy. On imaging, it is difficult to differentiate LMS from, for example, degenerated leiomyoma (LM), a prevalent but benign condition. We curated a data set of 115 axial T2-weighted MRI images from 110 patients (mean [range] age=45 [17-81] years) with UTs that included five different tumor types. These data were randomly split stratifying on tumor volume into training (n=85) and test sets (n=30). An independent second reader (reader 2) provided manual segmentations for all test set images. To automate segmentation, we applied nnU-Net and explored the effect of training set size on performance by randomly generating subsets with 25, 45, 65 and 85 training set images. We evaluated the ability of radiomic features to distinguish between types of UT individually and when combined through feature selection and machine learning. Using the entire training set the mean [95% CI] fibroid DSC was measured as 0.87 [0.59-1.00] and the agreement between the two readers was 0.89 [0.77-1.0] on the test set. When classifying degenerated LM from LMS we achieve a test set F1-score of 0.80. Classifying UTs based on radiomic features we identify classifiers achieving F1-scores of 0.53 [0.45, 0.61] and 0.80 [0.80, 0.80] on the test set for the benign versus malignant, and degenerated LM versus LMS tasks. We show that it is possible to develop an automated method for 3D segmentation of the uterus and UT that is close to human-level performance with fewer than 150 annotated images. For distinguishing UT types, while we train models that merit further investigation with additional data, reliable automatic differentiation of UTs remains a challenge. △ Less

Submitted 20 August, 2023; originally announced August 2023.

arXiv:2308.06957 [pdf, other]

CEmb-SAM: Segment Anything Model with Condition Embedding for Joint Learning from Heterogeneous Datasets

Authors: Dongik Shin, Beomsuk Kim, Seungjun Baek

Abstract: Automated segmentation of ultrasound images can assist medical experts with diagnostic and therapeutic procedures. Although using the common modality of ultrasound, one typically needs separate datasets in order to segment, for example, different anatomical structures or lesions with different levels of malignancy. In this paper, we consider the problem of jointly learning from heterogeneous datas… ▽ More Automated segmentation of ultrasound images can assist medical experts with diagnostic and therapeutic procedures. Although using the common modality of ultrasound, one typically needs separate datasets in order to segment, for example, different anatomical structures or lesions with different levels of malignancy. In this paper, we consider the problem of jointly learning from heterogeneous datasets so that the model can improve generalization abilities by leveraging the inherent variability among datasets. We merge the heterogeneous datasets into one dataset and refer to each component dataset as a subgroup. We propose to train a single segmentation model so that the model can adapt to each sub-group. For robust segmentation, we leverage recently proposed Segment Anything model (SAM) in order to incorporate sub-group information into the model. We propose SAM with Condition Embedding block (CEmb-SAM) which encodes sub-group conditions and combines them with image embeddings from SAM. The conditional embedding block effectively adapts SAM to each image sub-group by incorporating dataset properties through learnable parameters for normalization. Experiments show that CEmb-SAM outperforms the baseline methods on ultrasound image segmentation for peripheral nerves and breast cancer. The experiments highlight the effectiveness of Cemb-SAM in learning from heterogeneous datasets in medical image segmentation tasks. △ Less

Submitted 14 August, 2023; originally announced August 2023.

arXiv:2308.05063 [pdf, other]

CERMET: Coding for Energy Reduction with Multiple Encryption Techniques -- $It's\ easy\ being\ green$

Authors: Jongchan Woo, Vipindev Adat Vasudevan, Benjamin Kim, Alejandro Cohen, Rafael G. L. D'Oliveira, Thomas Stahlbuhk, Muriel Médard

Abstract: This paper presents CERMET, an energy-efficient hardware architecture designed for hardware-constrained cryptosystems. CERMET employs a base cryptosystem in conjunction with network coding to provide both information-theoretic and computational security while reducing energy consumption per bit. This paper introduces the hardware architecture for the system and explores various optimizations to en… ▽ More This paper presents CERMET, an energy-efficient hardware architecture designed for hardware-constrained cryptosystems. CERMET employs a base cryptosystem in conjunction with network coding to provide both information-theoretic and computational security while reducing energy consumption per bit. This paper introduces the hardware architecture for the system and explores various optimizations to enhance its performance. The universality of the approach is demonstrated by designing the architecture to accommodate both asymmetric and symmetric cryptosystems. The analysis reveals that the benefits of this proposed approach are multifold, reducing energy per bit and area without compromising security or throughput. The optimized hardware architectures can achieve below 1 pJ/bit operations for AES-256. Furthermore, for a public key cryptosystem based on Elliptic Curve Cryptography (ECC), a remarkable 14.6X reduction in energy per bit and a 9.3X reduction in area are observed, bringing it to less than 1 nJ/bit. △ Less

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2308.00193 [pdf, other]

C-DARL: Contrastive diffusion adversarial representation learning for label-free blood vessel segmentation

Authors: Boah Kim, Yujin Oh, Bradford J. Wood, Ronald M. Summers, Jong Chul Ye

Abstract: Blood vessel segmentation in medical imaging is one of the essential steps for vascular disease diagnosis and interventional planning in a broad spectrum of clinical scenarios in image-based medicine and interventional medicine. Unfortunately, manual annotation of the vessel masks is challenging and resource-intensive due to subtle branches and complex structures. To overcome this issue, this pape… ▽ More Blood vessel segmentation in medical imaging is one of the essential steps for vascular disease diagnosis and interventional planning in a broad spectrum of clinical scenarios in image-based medicine and interventional medicine. Unfortunately, manual annotation of the vessel masks is challenging and resource-intensive due to subtle branches and complex structures. To overcome this issue, this paper presents a self-supervised vessel segmentation method, dubbed the contrastive diffusion adversarial representation learning (C-DARL) model. Our model is composed of a diffusion module and a generation module that learns the distribution of multi-domain blood vessel data by generating synthetic vessel images from diffusion latent. Moreover, we employ contrastive learning through a mask-based contrastive loss so that the model can learn more realistic vessel representations. To validate the efficacy, C-DARL is trained using various vessel datasets, including coronary angiograms, abdominal digital subtraction angiograms, and retinal imaging. Experimental results confirm that our model achieves performance improvement over baseline methods with noise robustness, suggesting the effectiveness of C-DARL for vessel segmentation. △ Less

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2307.16430 [pdf, other]

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

Authors: Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim

Abstract: Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text… ▽ More Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach. △ Less

Submitted 31 July, 2023; originally announced July 2023.

Comments: Interspeech 2023

arXiv:2306.14385 [pdf, other]

Calibration of Wideband LFM Radars based on Sliding Window Algorithm

Authors: Hyung-Woo Kim, Jin-woo Kim, Jin-ha Kim, JaeYoung Choi, Sangpyo Hong, Byungkwan Kim

Abstract: This paper addresses the challenges of wideband signal beamforming in radar systems and proposes a new calibration method. Due to operating conditions, the frequency dependent characteristics of the system can be changed, and amplitude, phase, and time delay error can be generated. The proposed method is based on the concept of sliding window algorithm for linear frequency modulated (LFM) signals.… ▽ More This paper addresses the challenges of wideband signal beamforming in radar systems and proposes a new calibration method. Due to operating conditions, the frequency dependent characteristics of the system can be changed, and amplitude, phase, and time delay error can be generated. The proposed method is based on the concept of sliding window algorithm for linear frequency modulated (LFM) signals. To calibrate the frequency-dependent errors from transceiver and the time delay error from true time delay elements, the proposed method utilizes the characteristic of the LFM signal. The LFM signal changes its frequency linearly with time, and the frequency domain characteristics of the hardware are presented in time. Therefore, by applying matched filter to a part of the LFM signal, the frequency dependent characteristics can be monitored and calibrated. The proposed method is compared with the conventional matched filter based calibration results and verified by simulation results and beampatterns. Since the proposed method utilizes LFM signal as calibration tone, the proposed method can be applied to any beamforming systems, not limited to LFM radars. △ Less

Submitted 25 June, 2023; originally announced June 2023.

Comments: 11 pages

arXiv:2305.16699 [pdf, other]

Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

Authors: Seongyeon Park, Bohyung Kim, Tae-hyun Oh

Abstract: Recently, zero-shot TTS and VC methods have gained attention due to their practicality of being able to generate voices even unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance, while having useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models vary dramatically depending on how the… ▽ More Recently, zero-shot TTS and VC methods have gained attention due to their practicality of being able to generate voices even unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance, while having useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models vary dramatically depending on how the losses are balanced. This can be problematic, as it requires a burdensome procedure of tuning loss balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art performance. Furthermore, we show the robustness of our framework in various settings. We provide an explanation for the results in the discussion. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: Interspeech 2023

arXiv:2304.08707 [pdf, other]

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain. The model maintains an information highway to flow an over-complete input representation through multiple FSB-LSTM modules. Each FSB-LSTM module consists of a full-band bl… ▽ More We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture that integrates full- and sub-band (FSB) modeling, for single- and multi-channel speech enhancement in the short-time Fourier transform (STFT) domain. The model maintains an information highway to flow an over-complete input representation through multiple FSB-LSTM modules. Each FSB-LSTM module consists of a full-band block to model spectro-temporal patterns at all frequencies and a sub-band block to model patterns within each sub-band, where each of the two blocks takes a down-sampled representation as input and returns an up-sampled discriminative representation to be added to the block input via a residual connection. The model is designed to have a low algorithmic complexity, a small run-time buffer and a very low algorithmic latency, at the same time producing a strong enhancement performance on a noisy-reverberant speech enhancement task even if the hop size is as low as $2$ ms. △ Less

Submitted 17 April, 2023; originally announced April 2023.

Comments: in ICASSP 2023

arXiv:2304.00471 [pdf, other]

A Unified Compression Framework for Efficient Speech-Driven Talking-Face Generation

Authors: Bo-Kyeong Kim, Jaemin Kang, Daeun Seo, Hancheol Park, Shinkook Choi, Hyoung-Kyu Song, Hyungshin Kim, Sungsu Lim

Abstract: Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limi… ▽ More Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limit their efficient deployment. This study aims to develop a lightweight model for speech-driven talking-face synthesis. We build a compact generator by removing the residual blocks and reducing the channel width from Wav2Lip, a popular talking-face generator. We also present a knowledge distillation scheme to stably yet effectively train the small-capacity generator without adversarial learning. We reduce the number of parameters and MACs by 28$\times$ while retaining the performance of the original model. Moreover, to alleviate a severe performance drop when converting the whole generator to INT8 precision, we adopt a selective quantization method that uses FP16 for the quantization-sensitive layers and INT8 for the other layers. Using this mixed precision, we achieve up to a 19$\times$ speedup on edge GPUs without noticeably compromising the generation quality. △ Less

Submitted 28 April, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: MLSys Workshop on On-Device Intelligence, 2023; Demo: https://huggingface.co/spaces/nota-ai/compressed_wav2lip

arXiv:2303.16511 [pdf, other]

Joint unsupervised and supervised learning for context-aware language identification

Authors: Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim

Abstract: Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task only. However, we need additional text labels to train the model to recognize speech, and acquiring the text labels is a cost high. In order to overcome this probl… ▽ More Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task only. However, we need additional text labels to train the model to recognize speech, and acquiring the text labels is a cost high. In order to overcome this problem, we propose context-aware language identification using a combination of unsupervised and supervised learning without any text labels. The proposed method learns the context of speech through masked language modeling (MLM) loss and simultaneously trains to determine the language of the utterance with supervised learning loss. The proposed joint learning was found to reduce the error rate by 15.6% compared to the same structure model trained by supervised-only learning on a subset of the VoxLingua107 dataset consisting of sub-three-second utterances in 11 languages. △ Less

Submitted 14 April, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.15669 [pdf, other]

Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

Authors: Seongyeon Park, Myungseo Song, Bohyung Kim, Tae-Hyun Oh

Abstract: Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired… ▽ More Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, achieving outstanding performance compared to competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping △ Less

Submitted 27 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2303.09463 [pdf, other]

An Autonomous System for Head-to-Head Race: Design, Implementation and Analysis; Team KAIST at the Indy Autonomous Challenge

Authors: Chanyoung Jung, Andrea Finazzi, Hyunki Seong, Daegyu Lee, Seungwook Lee, Bosung Kim, Gyuri Gang, Seungil Han, David Hyunchul Shim

Abstract: While the majority of autonomous driving research has concentrated on everyday driving scenarios, further safety and performance improvements of autonomous vehicles require a focus on extreme driving conditions. In this context, autonomous racing is a new area of research that has been attracting considerable interest recently. Due to the fact that a vehicle is driven by its perception, planning,… ▽ More While the majority of autonomous driving research has concentrated on everyday driving scenarios, further safety and performance improvements of autonomous vehicles require a focus on extreme driving conditions. In this context, autonomous racing is a new area of research that has been attracting considerable interest recently. Due to the fact that a vehicle is driven by its perception, planning, and control limits during racing, numerous research and development issues arise. This paper provides a comprehensive overview of the autonomous racing system built by team KAIST for the Indy Autonomous Challenge (IAC). Our autonomy stack consists primarily of a multi-modal perception module, a high-speed overtaking planner, a resilient control stack, and a system status manager. We present the details of all components of our autonomy solution, including algorithms, implementation, and unit test results. In addition, this paper outlines the design principles and the results of a systematical analysis. Even though our design principles are derived from the unique application domain of autonomous racing, they can also be applied to a variety of safety-critical, high-cost-of-failure robotics applications. The proposed system was integrated into a full-scale autonomous race car (Dallara AV-21) and field-tested extensively. As a result, team KAIST was one of three teams who qualified and participated in the official IAC race events without any accidents. Our proposed autonomous system successfully completed all missions, including overtaking at speeds of around $220 km/h$ in the IAC@CES2022, the world's first autonomous 1:1 head-to-head race. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: 35 pages, 31 figures, 5 tables, Field Robotics (accepted)

arXiv:2302.14370 [pdf, other]

CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis

Authors: Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim

Abstract: While recent text-to-speech (TTS) systems have made remarkable strides toward human-level quality, the performance of cross-lingual TTS lags behind that of intra-lingual TTS. This gap is mainly rooted from the speaker-language entanglement problem in cross-lingual TTS. In this paper, we propose CrossSpeech which improves the quality of cross-lingual speech by effectively disentangling speaker and… ▽ More While recent text-to-speech (TTS) systems have made remarkable strides toward human-level quality, the performance of cross-lingual TTS lags behind that of intra-lingual TTS. This gap is mainly rooted from the speaker-language entanglement problem in cross-lingual TTS. In this paper, we propose CrossSpeech which improves the quality of cross-lingual speech by effectively disentangling speaker and language information in the level of acoustic feature space. Specifically, CrossSpeech decomposes the speech generation pipeline into the speaker-independent generator (SIG) and speaker-dependent generator (SDG). The SIG produces the speaker-independent acoustic representation which is not biased to specific speaker distributions. On the other hand, the SDG models speaker-dependent speech variation that characterizes speaker attributes. By handling each information separately, CrossSpeech can obtain disentangled speaker and language representations. From the experiments, we verify that CrossSpeech achieves significant improvements in cross-lingual TTS, especially in terms of speaker similarity to the target speaker. △ Less

Submitted 12 June, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted to ICASSP 2023

arXiv:2301.08078 [pdf, other]

Stable Contact Guaranteeing Motion/Force Control for an Aerial Manipulator on an Arbitrarily Tilted Surface

Authors: Jeonghyun Byun, Byeongjun Kim, Changhyeon Kim, Donggeon David Oh, H. Jin Kim

Abstract: This study aims to design a motion/force controller for an aerial manipulator which guarantees the tracking of time-varying motion/force trajectories as well as the stability during the transition between free and contact motions. To this end, we model the force exerted on the end-effector as the Kelvin-Voigt linear model and estimate its parameters by recursive least-squares estimator. Then, the… ▽ More This study aims to design a motion/force controller for an aerial manipulator which guarantees the tracking of time-varying motion/force trajectories as well as the stability during the transition between free and contact motions. To this end, we model the force exerted on the end-effector as the Kelvin-Voigt linear model and estimate its parameters by recursive least-squares estimator. Then, the gains of the disturbance-observer (DOB)-based motion/force controller are calculated based on the stability conditions considering both the model uncertainties in the dynamic equation and switching between the free and contact motions. To validate the proposed controller, we conducted the time-varying motion/force tracking experiments with different approach speeds and orientations of the surface. The results show that our controller enables the aerial manipulator to track the time-varying motion/force trajectories. △ Less

Submitted 19 January, 2023; originally announced January 2023.

Comments: to be presented in 2023 IEEE International Conference on Robotics and Automations (ICRA), London, United Kingdom, 2023

arXiv:2211.12433 [pdf, other]

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI)… ▽ More We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this paper is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. △ Less

Submitted 4 August, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: In IEEE/ACM Transactions on Audio, Speech, and Language Processing. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.html, and the code is available at https://github.com/espnet/espnet/pull/5395

arXiv:2211.00439 [pdf, other]

Metric Learning for User-defined Keyword Spotting

Authors: Jaemin Jung, Youkyum Kim, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Youngjoon Jang, Joon Son Chung

Abstract: The goal of this work is to detect new spoken terms defined by users. While most previous works address Keyword Spotting (KWS) as a closed-set classification problem, this limits their transferability to unseen terms. The ability to define custom keywords has advantages in terms of user experience. In this paper, we propose a metric learning-based training strategy for user-defined keyword spott… ▽ More The goal of this work is to detect new spoken terms defined by users. While most previous works address Keyword Spotting (KWS) as a closed-set classification problem, this limits their transferability to unseen terms. The ability to define custom keywords has advantages in terms of user experience. In this paper, we propose a metric learning-based training strategy for user-defined keyword spotting. In particular, we make the following contributions: (1) we construct a large-scale keyword dataset with an existing speech corpus and propose a filtering method to remove data that degrade model training; (2) we propose a metric learning-based two-stage training strategy, and demonstrate that the proposed method improves the performance on the user-defined keyword spotting task by enriching their representations; (3) to facilitate the fair comparison in the user-defined KWS field, we propose unified evaluation protocol and metrics. Our proposed system does not require an incremental training on the user-defined keywords, and outperforms previous works by a significant margin on the Google Speech Commands dataset using the proposed as well as the existing metrics. △ Less

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2209.14566 [pdf, other]

Diffusion Adversarial Representation Learning for Self-supervised Vessel Segmentation

Authors: Boah Kim, Yujin Oh, Jong Chul Ye

Abstract: Vessel segmentation in medical images is one of the important tasks in the diagnosis of vascular diseases and therapy planning. Although learning-based segmentation approaches have been extensively studied, a large amount of ground-truth labels are required in supervised methods and confusing background structures make neural networks hard to segment vessels in an unsupervised manner. To address t… ▽ More Vessel segmentation in medical images is one of the important tasks in the diagnosis of vascular diseases and therapy planning. Although learning-based segmentation approaches have been extensively studied, a large amount of ground-truth labels are required in supervised methods and confusing background structures make neural networks hard to segment vessels in an unsupervised manner. To address this, here we introduce a novel diffusion adversarial representation learning (DARL) model that leverages a denoising diffusion probabilistic model with adversarial learning, and apply it to vessel segmentation. In particular, for self-supervised vessel segmentation, DARL learns the background signal using a diffusion module, which lets a generation module effectively provide vessel representations. Also, by adversarial learning based on the proposed switchable spatially-adaptive denormalization, our model estimates synthetic fake vessel images as well as vessel segmentation masks, which further makes the model capture vessel-relevant semantic information. Once the proposed model is trained, the model generates segmentation masks in a single step and can be applied to general vascular structure segmentation of coronary angiography and retinal images. Experimental results on various datasets show that our method significantly outperforms existing unsupervised and self-supervised vessel segmentation methods. △ Less

Submitted 15 February, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: Accepted at ICLR 2023

arXiv:2209.06305 [pdf]

Ptychographic lens-less polarization microscopy

Authors: Jeongsoo Kim, Seungri Song, Bora Kim, Mirae Park, Seung Jae Oh, Daesuk Kim, Barry Cense, Yong-Min Huh, Joo Yong Lee, Chulmin Joo

Abstract: Birefringence, an inherent characteristic of optically anisotropic materials, is widely utilized in various imaging applications ranging from material characterizations to clinical diagnosis. Polarized light microscopy enables high-resolution, high-contrast imaging of optically anisotropic specimens, but it is associated with mechanical rotations of polarizer/analyzer and relatively complex optica… ▽ More Birefringence, an inherent characteristic of optically anisotropic materials, is widely utilized in various imaging applications ranging from material characterizations to clinical diagnosis. Polarized light microscopy enables high-resolution, high-contrast imaging of optically anisotropic specimens, but it is associated with mechanical rotations of polarizer/analyzer and relatively complex optical designs. Here, we present a novel form of polarization-sensitive microscopy capable of birefringence imaging of transparent objects without an optical lens and any moving parts. Our method exploits an optical mask-modulated polarization image sensor and single-input-state LED illumination design to obtain complex and birefringence images of the object via ptychographic phase retrieval. Using a camera with a pixel resolution of 3.45 um, the method achieves birefringence imaging with a half-pitch resolution of 2.46 um over a 59.74 mm^2 field-of-view, which corresponds to a space-bandwidth product of 9.9 megapixels. We demonstrate the high-resolution, large-area birefringence imaging capability of our method by presenting the birefringence images of various anisotropic objects, including a birefringent resolution target, liquid crystal polymer depolarizer, monosodium urate crystal, and excised mouse eye and heart tissues. △ Less

Submitted 13 September, 2022; originally announced September 2022.

Comments: 18 pages, 10 figures, author names corrected

arXiv:2209.03952 [pdf, other]

TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation

Authors: Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, Shinji Watanabe

Abstract: We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal inf… ▽ More We propose TF-GridNet, a novel multi-path deep neural network (DNN) operating in the time-frequency (T-F) domain, for monaural talker-independent speaker separation in anechoic conditions. The model stacks several multi-path blocks, each consisting of an intra-frame spectral module, a sub-band temporal module, and a full-band self-attention module, to leverage local and global spectro-temporal information for separation. The model is trained to perform complex spectral mapping, where the real and imaginary (RI) components of the input mixture are stacked as input features to predict target RI components. Besides using the scale-invariant signal-to-distortion ratio (SI-SDR) loss for model training, we include a novel loss term to encourage separated sources to add up to the input mixture. Without using dynamic mixing, we obtain 23.4 dB SI-SDR improvement (SI-SDRi) on the WSJ0-2mix dataset, outperforming the previous best by a large margin. △ Less

Submitted 15 March, 2023; v1 submitted 8 September, 2022; originally announced September 2022.

Comments: in IEEE ICASSP 2023

arXiv:2206.13909 [pdf, other]

QTI Submission to DCASE 2021: residual normalization for device-imbalanced acoustic scene classification with efficient design

Authors: Byeonggeun Kim, Seunghan Yang, Jangho Kim, Simyung Chang

Abstract: This technical report describes the details of our TASK1A submission of the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under the constraints of model complexity. This report introduces four methods to achieve the goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance nor… ▽ More This technical report describes the details of our TASK1A submission of the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under the constraints of model complexity. This report introduces four methods to achieve the goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one to multiple devices to augment training data. Finally, we utilize three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% in TAU Urban Acoustic Scenes 2020 Mobile, development dataset with 315k parameters, and average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters. We extend this work to [1]. △ Less

Submitted 25 October, 2022; v1 submitted 28 June, 2022; originally announced June 2022.

Comments: tech report; won 1st place in DCASE2021 challenge. arXiv admin note: substantial text overlap with arXiv:2111.06531

arXiv:2206.13708 [pdf, other]

Personalized Keyword Spotting through Multi-task Learning

Authors: Seunghan Yang, Byeonggeun Kim, Inseop Chung, Simyung Chang

Abstract: Keyword spotting (KWS) plays an essential role in enabling speech-based user interaction on smart devices, and conventional KWS (C-KWS) approaches have concentrated on detecting user-agnostic pre-defined keywords. However, in practice, most user interactions come from target users enrolled in the device which motivates to construct personalized keyword spotting. We design two personalized KWS task… ▽ More Keyword spotting (KWS) plays an essential role in enabling speech-based user interaction on smart devices, and conventional KWS (C-KWS) approaches have concentrated on detecting user-agnostic pre-defined keywords. However, in practice, most user interactions come from target users enrolled in the device which motivates to construct personalized keyword spotting. We design two personalized KWS tasks; (1) Target user Biased KWS (TB-KWS) and (2) Target user Only KWS (TO-KWS). To solve the tasks, we propose personalized keyword spotting through multi-task learning (PK-MTL) that consists of multi-task learning and task-adaptation. First, we introduce applying multi-task learning on keyword spotting and speaker verification to leverage user information to the keyword spotting system. Next, we design task-specific scoring functions to adapt to the personalized KWS tasks thoroughly. We evaluate our framework on conventional and personalized scenarios, and the results show that PK-MTL can dramatically reduce the false alarm rate, especially in various practical scenarios. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: Proceedings of INTERSPEECH 2022

arXiv:2206.13691 [pdf, other]

Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting

Authors: Byeonggeun Kim, Seunghan Yang, Inseop Chung, Simyung Chang

Abstract: Keyword spotting is the task of detecting a keyword in streaming audio. Conventional keyword spotting targets predefined keywords classification, but there is growing attention in few-shot (query-by-example) keyword spotting, e.g., N-way classification given M-shot support samples. Moreover, in real-world scenarios, there can be utterances from unexpected categories (open-set) which need to be rej… ▽ More Keyword spotting is the task of detecting a keyword in streaming audio. Conventional keyword spotting targets predefined keywords classification, but there is growing attention in few-shot (query-by-example) keyword spotting, e.g., N-way classification given M-shot support samples. Moreover, in real-world scenarios, there can be utterances from unexpected categories (open-set) which need to be rejected rather than classified as one of the N classes. Combining the two needs, we tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC. We propose episode-known dummy prototypes based on metric learning to detect an open-set better and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets). Our D-ProtoNets shows clear margins compared to recent few-shot open-set recognition (FSOSR) approaches in the suggested splitGSC. We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: Proceedings of INTERSPEECH 2022

arXiv:2206.13295 [pdf, other]

Diffusion Deformable Model for 4D Temporal Medical Image Generation

Authors: Boah Kim, Jong Chul Ye

Abstract: Temporal volume images with 3D+t (4D) information are often used in medical imaging to statistically analyze temporal dynamics or capture disease progression. Although deep-learning-based generative models for natural images have been extensively studied, approaches for temporal medical image generation such as 4D cardiac volume data are limited. In this work, we present a novel deep learning mode… ▽ More Temporal volume images with 3D+t (4D) information are often used in medical imaging to statistically analyze temporal dynamics or capture disease progression. Although deep-learning-based generative models for natural images have been extensively studied, approaches for temporal medical image generation such as 4D cardiac volume data are limited. In this work, we present a novel deep learning model that generates intermediate temporal volumes between source and target volumes. Specifically, we propose a diffusion deformable model (DDM) by adapting the denoising diffusion probabilistic model that has recently been widely investigated for realistic image generation. Our proposed DDM is composed of the diffusion and the deformation modules so that DDM can learn spatial deformation information between the source and target volumes and provide a latent code for generating intermediate frames along a geodesic path. Once our model is trained, the latent code estimated from the diffusion module is simply interpolated and fed into the deformation module, which enables DDM to generate temporal frames along the continuous trajectory while preserving the topology of the source image. We demonstrate the proposed method with the 4D cardiac MR image generation between the diastolic and systolic phases for each subject. Compared to the existing deformation methods, our DDM achieves high performance on temporal volume generation. △ Less

Submitted 27 June, 2022; originally announced June 2022.

Comments: Accepted for MICCAI 2022

arXiv:2206.12513 [pdf, other]

Domain Generalization with Relaxed Instance Frequency-wise Normalization for Multi-device Acoustic Scene Classification

Authors: Byeonggeun Kim, Seunghan Yang, Jangho Kim, Hyunsin Park, Juntae Lee, Simyung Chang

Abstract: While using two-dimensional convolutional neural networks (2D-CNNs) in image processing, it is possible to manipulate domain information using channel statistics, and instance normalization has been a promising way to get domain-invariant features. However, unlike image processing, we analyze that domain-relevant information in an audio feature is dominant in frequency statistics rather than chann… ▽ More While using two-dimensional convolutional neural networks (2D-CNNs) in image processing, it is possible to manipulate domain information using channel statistics, and instance normalization has been a promising way to get domain-invariant features. However, unlike image processing, we analyze that domain-relevant information in an audio feature is dominant in frequency statistics rather than channel statistics. Motivated by our analysis, we introduce Relaxed Instance Frequency-wise Normalization (RFN): a plug-and-play, explicit normalization module along the frequency axis which can eliminate instance-specific domain discrepancy in an audio feature while relaxing undesirable loss of useful discriminative information. Empirically, simply adding RFN to networks shows clear margins compared to previous domain generalization approaches on acoustic scene classification and yields improved robustness for multiple audio devices. Especially, the proposed RFN won the DCASE2021 challenge TASK1A, low-complexity acoustic scene classification with multiple devices, with a clear margin, and RFN is an extended work of our technical report. △ Less

Submitted 24 June, 2022; originally announced June 2022.

Comments: Proceedings of INTERSPEECH 2022

arXiv:2205.10405 [pdf, other]

Demo: A Transparent Antenna System for In-Building Networks

Authors: Sang-Hyun Park, Soo-Min Kim, Seonghoon Kim, HongIl Yoo, Byoungnam Kim, Chan-Byoung Chae

Abstract: For in-building networks, the potential of transparent antennas, which are used as windows of a building, is presented in this paper. In this scenario, a transparent window antenna communicates with outdoor devices or base stations, and the indoor repeaters act as relay stations of the transparent window antenna for indoor devices. At indoor, back lobe waves of the transparent window antenna are d… ▽ More For in-building networks, the potential of transparent antennas, which are used as windows of a building, is presented in this paper. In this scenario, a transparent window antenna communicates with outdoor devices or base stations, and the indoor repeaters act as relay stations of the transparent window antenna for indoor devices. At indoor, back lobe waves of the transparent window antenna are defined as interference to in-building networks. Hence, we analyze different SIR and SINR results according to the location of an indoor repeater through 3D ray tracing system-level simulation. Furthermore, a link-level simulation through a full-duplex software-defined radio platform with the fabricated transparent antenna is presented to examine the feasibility of the transparent antenna. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: 2 pages, 3 figures

arXiv:2205.05275 [pdf, other]

Strong Sign Controllability of Diffusively-Coupled Networks

Authors: Nam-Jin Park, Seong-Ho Kwon, Yoo-Bin Bae, Byeong-Yeon Kim, Kevin L. Moore, Hyo-Sung Ahn

Abstract: This paper presents several conditions to determine strong sign controllability for diffusively-coupled undirected networks. The strong sign controllability is determined by the sign patterns (positive, negative, zero) of the edges. We first provide the necessary and sufficient conditions for strong sign controllability of basic components, such as path, cycle, and tree. Next, we propose a merging… ▽ More This paper presents several conditions to determine strong sign controllability for diffusively-coupled undirected networks. The strong sign controllability is determined by the sign patterns (positive, negative, zero) of the edges. We first provide the necessary and sufficient conditions for strong sign controllability of basic components, such as path, cycle, and tree. Next, we propose a merging process to extend the basic componenets to a larger graph based on the conditions of the strong sign controllability. Furthermore, we develop an algorithm of polynomial complexity to find the minimum number of external input nodes while maintaining the strong sign controllability of a network. △ Less

Submitted 11 May, 2022; originally announced May 2022.

arXiv:2201.09458 [pdf]

Hybrid Adaptive Control for Series Elastic Actuator of Humanoid Robot

Authors: Anh Khoa Lanh Luu, Van Tu Duong, Huy Hung Nguyen, Sang Bong Kim, Tan Tien Nguyen

Abstract: Generally, humanoid robots usually suffer significant impact force when walking or running in a non-predefined environment that could easily damage the actuators due to high stiffness. In recent years, the usages of passive, compliant series elastic actuators (SEA) for driving humanoid's joints have proved the capability in many aspects so far. However, despite being widely applied in the biped ro… ▽ More Generally, humanoid robots usually suffer significant impact force when walking or running in a non-predefined environment that could easily damage the actuators due to high stiffness. In recent years, the usages of passive, compliant series elastic actuators (SEA) for driving humanoid's joints have proved the capability in many aspects so far. However, despite being widely applied in the biped robot research field, the stable control problem for a humanoid powered by the SEAs, especially in the walking process, is still a challenge. This paper proposes a model reference adaptive control (MRAC) combined with the backstepping algorithm to deal with the parameter uncertainties in a humanoid's lower limb driven by the SEA system. This is also an extension of our previous research (Lanh et al.,2021). Firstly, a dynamic model of SEA is obtained. Secondly, since there are unknown and uncertain parameters in the SEA model, a model reference adaptive controller (MRAC) is employed to guarantee the robust performance of the humanoid's lower limb. Finally, an experiment is carried out to evaluate the effectiveness of the proposed controller and the SEA mechanism. △ Less

Submitted 23 January, 2022; originally announced January 2022.

arXiv:2112.11414 [pdf, ps, other]

Covert Communications via Adversarial Machine Learning and Reconfigurable Intelligent Surfaces

Authors: Brian Kim, Tugba Erpek, Yalin E. Sagduyu, Sennur Ulukus

Abstract: By moving from massive antennas to antenna surfaces for software-defined wireless systems, the reconfigurable intelligent surfaces (RISs) rely on arrays of unit cells to control the scattering and reflection profiles of signals, mitigating the propagation loss and multipath attenuation, and thereby improving the coverage and spectral efficiency. In this paper, covert communication is considered in… ▽ More By moving from massive antennas to antenna surfaces for software-defined wireless systems, the reconfigurable intelligent surfaces (RISs) rely on arrays of unit cells to control the scattering and reflection profiles of signals, mitigating the propagation loss and multipath attenuation, and thereby improving the coverage and spectral efficiency. In this paper, covert communication is considered in the presence of the RIS. While there is an ongoing transmission boosted by the RIS, both the intended receiver and an eavesdropper individually try to detect this transmission using their own deep neural network (DNN) classifiers. The RIS interaction vector is designed by balancing two (potentially conflicting) objectives of focusing the transmitted signal to the receiver and keeping the transmitted signal away from the eavesdropper. To boost covert communications, adversarial perturbations are added to signals at the transmitter to fool the eavesdropper's classifier while keeping the effect on the receiver low. Results from different network topologies show that adversarial perturbation and RIS interaction vector can be jointly designed to effectively increase the signal detection accuracy at the receiver while reducing the detection accuracy at the eavesdropper to enable covert communications. △ Less

Submitted 21 December, 2021; originally announced December 2021.

arXiv:2112.05149 [pdf, other]

DiffuseMorph: Unsupervised Deformable Image Registration Using Diffusion Model

Authors: Boah Kim, Inhwa Han, Jong Chul Ye

Abstract: Deformable image registration is one of the fundamental tasks in medical imaging. Classical registration algorithms usually require a high computational cost for iterative optimizations. Although deep-learning-based methods have been developed for fast image registration, it is still challenging to obtain realistic continuous deformations from a moving image to a fixed image with less topological… ▽ More Deformable image registration is one of the fundamental tasks in medical imaging. Classical registration algorithms usually require a high computational cost for iterative optimizations. Although deep-learning-based methods have been developed for fast image registration, it is still challenging to obtain realistic continuous deformations from a moving image to a fixed image with less topological folding problem. To address this, here we present a novel diffusion-model-based image registration method, called DiffuseMorph. DiffuseMorph not only generates synthetic deformed images through reverse diffusion but also allows image registration by deformation fields. Specifically, the deformation fields are generated by the conditional score function of the deformation between the moving and fixed images, so that the registration can be performed from continuous deformation by simply scaling the latent feature of the score. Experimental results on 2D facial and 3D medical image registration tasks demonstrate that our method provides flexible deformations with topology preservation capability. △ Less

Submitted 29 September, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

arXiv:2112.04188 [pdf, other]

Beam Squint in Ultra-wideband mmWave Systems: RF Lens Array vs. Phase-Shifter-Based Array

Authors: Sang-Hyun Park, Byoungnam Kim, Dong Ku Kim, Linglong Dai, Kai-Kit Wong, Chan-Byoung Chae

Abstract: In this article, we discuss the potential of radio frequency (RF) lens for ultra-wideband millimeter-wave (mmWave) systems. In terms of the beam squint, we compare the proposed RF lens antenna with the phase shifter-based array for hybrid beamforming. To reduce the complexities for fully digital beamforming, researchers have come up with RF lens-based hybrid beamforming. The use of mmWave systems,… ▽ More In this article, we discuss the potential of radio frequency (RF) lens for ultra-wideband millimeter-wave (mmWave) systems. In terms of the beam squint, we compare the proposed RF lens antenna with the phase shifter-based array for hybrid beamforming. To reduce the complexities for fully digital beamforming, researchers have come up with RF lens-based hybrid beamforming. The use of mmWave systems, however, causes an increase in bandwidth, which gives rise to the beam squint phenomenon. We first find the causative factors for beam squint in the dielectric RF lens antenna. Based on the beamforming gain at each frequency, we verify that, in a specific situation, RF lens can be free of the beam squint effect. We use 3D electromagnetic analysis software to numerically interpret the beam squint of each antenna type. Based on the results, we present the degraded spectral efficiency by system-level simulations with 3D indoor ray tracing. Finally, to verify our analysis, we fabricate an actual RF lens antenna and demonstrate the real performance using a mmWave, NI PXIe, software-defined radio system. △ Less

Submitted 8 December, 2021; originally announced December 2021.

Comments: 8 pages, 4 figures, 2 tables

arXiv:2111.14362 [pdf, other]

Unsupervised Image Denoising with Frequency Domain Knowledge

Authors: Nahyun Kim, Donggon Jang, Sunhyeok Lee, Bomi Kim, Dae-Shik Kim

Abstract: Supervised learning-based methods yield robust denoising results, yet they are inherently limited by the need for large-scale clean/noisy paired datasets. The use of unsupervised denoisers, on the other hand, necessitates a more detailed understanding of the underlying image statistics. In particular, it is well known that apparent differences between clean and noisy images are most prominent on h… ▽ More Supervised learning-based methods yield robust denoising results, yet they are inherently limited by the need for large-scale clean/noisy paired datasets. The use of unsupervised denoisers, on the other hand, necessitates a more detailed understanding of the underlying image statistics. In particular, it is well known that apparent differences between clean and noisy images are most prominent on high-frequency bands, justifying the use of low-pass filters as part of conventional image preprocessing steps. However, most learning-based denoising methods utilize only one-sided information from the spatial domain without considering frequency domain information. To address this limitation, in this study we propose a frequency-sensitive unsupervised denoising method. To this end, a generative adversarial network (GAN) is used as a base structure. Subsequently, we include spectral discriminator and frequency reconstruction loss to transfer frequency knowledge into the generator. Results using natural and synthetic datasets indicate that our unsupervised learning method augmented with frequency information achieves state-of-the-art denoising performance, suggesting that frequency domain information could be a viable factor in improving the overall performance of unsupervised learning-based methods. △ Less

Submitted 29 November, 2021; originally announced November 2021.

Comments: Accepted to BMVC 2021

arXiv:2111.10067 [pdf]

Noise-resistant reconstruction algorithm based on the sinogram pattern

Authors: Byung Chun Kim, Hyunju Lee, Kyungtaek Jun

Abstract: We introduce a new CT image reconstruction algorithm that is less affected by various artifacts. The new reconstruction algorithm is a method of minimizing the difference between synchrotron X-ray tomography data and sinograms generated using Radon transform of CT images. The CT image is iteratively updated to reduce the difference from the sinogram of the data. This method can obtain clean CT ima… ▽ More We introduce a new CT image reconstruction algorithm that is less affected by various artifacts. The new reconstruction algorithm is a method of minimizing the difference between synchrotron X-ray tomography data and sinograms generated using Radon transform of CT images. The CT image is iteratively updated to reduce the difference from the sinogram of the data. This method can obtain clean CT images from the projection data, which can create ring artifacts or metal artifacts. Also, even if the sample size is larger than the CCD and/or the projection image does not satisfy the Beer-Lambert law, a clean CT image can be reconstructed. Our new reconstruction algorithm can also be applied to fan beam CT or cone beam CT △ Less

Submitted 19 November, 2021; originally announced November 2021.

Comments: 8 pages, 3 figures

arXiv:2111.06531 [pdf, other]

Domain Generalization on Efficient Acoustic Scene Classification using Residual Normalization

Authors: Byeonggeun Kim, Seunghan Yang, Jangho Kim, Simyung Chang

Abstract: It is a practical research topic how to deal with multi-device audio inputs by a single acoustic scene classification system with efficient design. In this work, we propose Residual Normalization, a novel feature normalization method that uses frequency-wise normalization % instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful informat… ▽ More It is a practical research topic how to deal with multi-device audio inputs by a single acoustic scene classification system with efficient design. In this work, we propose Residual Normalization, a novel feature normalization method that uses frequency-wise normalization % instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Moreover, we introduce an efficient architecture, BC-ResNet-ASC, a modified version of the baseline architecture with a limited receptive field. BC-ResNet-ASC outperforms the baseline architecture even though it contains the small number of parameters. Through three model compression schemes: pruning, quantization, and knowledge distillation, we can reduce model complexity further while mitigating the performance degradation. The proposed system achieves an average test accuracy of 76.3% in TAU Urban Acoustic Scenes 2020 Mobile, development dataset with 315k parameters, and average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters. The proposed method won the 1st place in DCASE 2021 challenge, TASK1A. △ Less

Submitted 11 November, 2021; originally announced November 2021.

Comments: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)

Showing 1–50 of 99 results for author: Kim, B