
Showing 1–22 of 22 results for author: Mun, S

Searching in archive eess.
  1. arXiv:2407.18505  [pdf, other]

    eess.AS

    VoxSim: A perceptual voice similarity dataset

    Authors: Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung

    Abstract: This paper introduces VoxSim, a dataset of perceptual voice similarity ratings. Recent efforts to automate the assessment of speech synthesis technologies have primarily focused on predicting mean opinion score of naturalness, leaving speaker voice similarity relatively unexplored due to a lack of extensive training data. To address this, we generate about 41k utterance pairs from the VoxCeleb dat…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: INTERSPEECH 2024. The dataset is available from https://mm.kaist.ac.kr/projects/voxsim/

  2. arXiv:2312.06065  [pdf, other]

    eess.AS cs.SD

    EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings

    Authors: Sung Hwan Mun, Min Hyun Han, Canyeong Moon, Nam Soo Kim

    Abstract: In recent years, there have been studies to further improve the end-to-end neural speaker diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel framework utilizing demultiplexed speaker embeddings. In this work, we focus on disentangling speaker-relevant information in the latent space and then transform each separated latent variable into its corresponding speech activity…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: Submitted to IEEE Signal Processing Letters

  3. arXiv:2310.03538  [pdf, other]

    eess.AS

    Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis

    Authors: Jae-Sung Bae, Joun Yeop Lee, Ji-Hyun Lee, Seongkyu Mun, Taehwa Kang, Hoon-Young Cho, Chanwoo Kim

    Abstract: Previous works in zero-shot text-to-speech (ZS-TTS) have attempted to enhance its systems by enlarging the training data through crowd-sourcing or augmenting existing speech data. However, the use of low-quality data has led to a decline in the overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose a latent filling (LF) method that adopts s…

    Submitted 22 January, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2305.19051  [pdf, other]

    eess.AS cs.AI cs.SD

    Towards single integrated spoofing-aware speaker verification embeddings

    Authors: Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

    Abstract: This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outpe…

    Submitted 1 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

  5. arXiv:2211.03078  [pdf, other]

    eess.AS cs.SD

    An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space

    Authors: Jihwan Lee, Jae-Sung Bae, Seongkyu Mun, Heejin Choi, Joun Yeop Lee, Hoon-Young Cho, Chanwoo Kim

    Abstract: With the recent developments in cross-lingual Text-to-Speech (TTS) systems, L2 (second-language, or foreign) accent problems arise. Moreover, running a subjective evaluation for such cross-lingual TTS systems is troublesome. The vowel space analysis, which is often utilized to explore various aspects of language including L2 accents, is a great alternative analysis tool. In this study, we apply th…

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  6. arXiv:2210.05979  [pdf, other]

    eess.AS cs.SD

    Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

    Authors: Byoung Jin Choi, Myeonghun Jeong, Minchan Kim, Sung Hwan Mun, Nam Soo Kim

    Abstract: Several recently proposed text-to-speech (TTS) models achieved to generate the speech samples with the human-level quality in the single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice with a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), is still a very challenging task. The main c…

    Submitted 22 November, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: APSIPA 2022

  7. arXiv:2210.02732  [pdf, other]

    eess.AS

    Fully Unsupervised Training of Few-shot Keyword Spotting

    Authors: Dongjune Lee, Minchan Kim, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

    Abstract: For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing massive target keywords has known to be essential to generalize to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive data collection with labeling, in this paper, we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric le…

    Submitted 6 October, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: Accepted by IEEE SLT 2022

  8. arXiv:2208.08012  [pdf, other]

    eess.AS cs.SD

    Disentangled Speaker Representation Learning via Mutual Information Minimization

    Authors: Sung Hwan Mun, Min Hyun Han, Minchan Kim, Dongjune Lee, Nam Soo Kim

    Abstract: Domain mismatch problem caused by speaker-unrelated feature has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework to unravel speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To achieve our goal of minimizing MI between speaker-related and speaker-unrelated features, we adopt a contrastive lo…

    Submitted 12 October, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: Accepted by APSIPA ASC 2022. Camera-ready. 8 pages, 4 figures, and 1 table

  9. arXiv:2204.01271  [pdf, other]

    eess.AS cs.LG cs.SD

    Into-TTS : Intonation Template Based Prosody Control System

    Authors: Jihwan Lee, Joun Yeop Lee, Heejin Choi, Seongkyu Mun, Sangjun Park, Jae-Sung Bae, Chanwoo Kim

    Abstract: Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervi…

    Submitted 6 November, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2204.01005  [pdf, other]

    eess.AS cs.AI

    Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification

    Authors: Sung Hwan Mun, Jee-weon Jung, Min Hyun Han, Nam Soo Kim

    Abstract: The majority of recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention mechanisms. Convolutional layers of these models typically have a fixed kernel size, e.g., 3 or 5. In this study, we further contribute to this line of research utilising a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional lay…

    Submitted 12 October, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: Accepted by IEEE SLT 2022. 7 pages, 4 figures, 1 table. Code is available at https://github.com/msh9184/ska-tdnn.git

  11. arXiv:2112.08929  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification

    Authors: Sung Hwan Mun, Min Hyun Han, Dongjune Lee, Jihwan Kim, Nam Soo Kim

    Abstract: In this paper, we propose self-supervised speaker representation learning strategies, which comprise of a bootstrap equilibrium speaker representation learning in the front-end and an uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In t…

    Submitted 24 December, 2021; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted by IEEE Access

  12. arXiv:2105.01254  [pdf, other]

    cs.SD cs.LG eess.AS

    Streaming end-to-end speech recognition with jointly trained neural feature enhancement

    Authors: Chanwoo Kim, Abhinav Garg, Dhananjaya Gowda, Seongkyu Mun, Changwoo Han

    Abstract: In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples,…

    Submitted 3 May, 2021; originally announced May 2021.

    Comments: Accepted to ICASSP 2021

  13. arXiv:2010.11433  [pdf, other]

    eess.AS cs.SD

    Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning

    Authors: Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

    Abstract: In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Also, to preserve speaker discriminability, a contrastive similarity loss function is used together. Experimental results showed that the pr…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure, 4 tables

  14. arXiv:2010.11408  [pdf, ps, other]

    eess.AS cs.SD

    Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020

    Authors: Sung Hwan Mun, Woo Hyun Kang, Min Hyun Han, Nam Soo Kim

    Abstract: This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: Accepted in INTERSPEECH 2020

  15. Disentangled speaker and nuisance attribute embedding for robust speaker verification

    Authors: Woo Hyun Kang, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

    Abstract: Over the recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as in most of the classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples with different conditions (e.g., recording devices, emotional states)…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Accepted in IEEE Access

  16. arXiv:2007.05191  [pdf, other]

    cs.SD eess.AS

    Overcoming label noise in audio event detection using sequential labeling

    Authors: Jae-Bin Kim, Seongkyu Mun, Myungwoo Oh, Soyeon Choe, Yong-Hyeok Lee, Hyung-Min Park

    Abstract: This paper addresses the noisy label issue in audio event detection (AED) by refining strong labels as sequential labels with inaccurate timestamps removed. In AED, strong labels contain the occurrence of a specific event and its timestamps corresponding to the start and end of the event in an audio clip. The timestamps depend on subjectivity of each annotator, and their label noise is inevitable.…

    Submitted 10 July, 2020; originally announced July 2020.

  17. arXiv:2005.08776  [pdf, other]

    eess.AS cs.SD

    Metric Learning for Keyword Spotting

    Authors: Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, Joon Son Chung

    Abstract: The goal of this work is to train effective representations for keyword spotting via metric learning. Most existing works address keyword spotting as a closed-set classification problem, where both target and non-target keywords are predefined. Therefore, prevailing classifier-based keyword spotting systems perform poorly on non-target sounds which are unseen during the training stage, causing hig…

    Submitted 18 May, 2020; originally announced May 2020.

  18. In defence of metric learning for speaker recognition

    Authors: Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, Icksang Han

    Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper…

    Submitted 24 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: The code can be found at https://github.com/clovaai/voxceleb_trainer

  19. arXiv:1911.02411  [pdf, other]

    cs.SD eess.AS

    The sound of my voice: speaker representation loss for target voice separation

    Authors: Seongkyu Mun, Soyeon Choe, Jaesung Huh, Joon Son Chung

    Abstract: Content and style representations have been widely studied in the field of style transfer. In this paper, we propose a new loss function using speaker content representation for audio source separation, and we call it speaker representation loss. The objective is to extract the target speaker voice from the noisy input and also remove it from the residual components. Compared to the conventional s…

    Submitted 27 February, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

    Comments: To appear in ICASSP 2020. The first two authors contributed equally to this work

  20. arXiv:1910.11238  [pdf, other]

    cs.SD cs.LG eess.AS

    Delving into VoxCeleb: environment invariant speaker recognition

    Authors: Joon Son Chung, Jaesung Huh, Seongkyu Mun

    Abstract: Research in speaker recognition has recently seen significant progress due to the application of neural network models and the availability of new large-scale datasets. There has been a plethora of work in search for more powerful architectures or loss functions suitable for the task, but these works do not consider what information is learnt by the models, apart from being able to predict the giv…

    Submitted 3 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

  21. arXiv:1812.01731  [pdf, other]

    cs.SD eess.AS

    Domain Mismatch Robust Acoustic Scene Classification using Channel Information Conversion

    Authors: Seongkyu Mun, Suwon Shon

    Abstract: In a recent acoustic scene classification (ASC) research field, training and test device channel mismatch have become an issue for the real world implementation. To address the issue, this paper proposes a channel domain conversion using factorized hierarchical variational autoencoder. Proposed method adapts both the source and target domain to a pre-defined specific domain. Unlike the conventiona…

    Submitted 4 December, 2018; originally announced December 2018.

  22. arXiv:1807.04970  [pdf]

    cs.SD eess.AS

    Analysis Acoustic Features for Acoustic Scene Classification and Score fusion of multi-classification systems applied to DCASE 2016 challenge

    Authors: Sangwook Park, Seongkyu Mun, Younglo Lee, David K. Han, Hanseok Ko

    Abstract: This paper describes an acoustic scene classification method which achieved the 4th ranking result in the IEEE AASP challenge of Detection and Classification of Acoustic Scenes and Events 2016. In order to accomplish the ensuing task, several methods are explored in three aspects: feature extraction, feature transformation, and score fusion for final decision. In the part of feature extraction, se…

    Submitted 13 July, 2018; originally announced July 2018.

    Comments: This article is related to a technical report for a challenge named Detection and Classification of Acoustic Scenes and Events 2016

    Journal ref: Park, S., Mun, S., Lee, Y., and Ko, H. (2016). Score fusion of classification systems for acoustic scene classification. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)