Search | arXiv e-print repository

ManWav: The First Manchu ASR Model

Authors: Jean Seo, Minha Kang, Sungjoo Byun, Sangah Lee

Abstract: This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR… ▽ More This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. The results of the first Manchu ASR is promising, especially when trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and 0.13 drop in WER compared to the same base model fine-tuned with original data. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: ACL2024/Field Matters

arXiv:2403.16447 [pdf, ps, other]

A Study on How Attention Scores in the BERT Model are Aware of Lexical Categories in Syntactic and Semantic Tasks on the GLUE Benchmark

Authors: Dongjun Jang, Sungjoo Byun, Hyopil Shin

Abstract: This study examines whether the attention scores between tokens in the BERT model significantly vary based on lexical categories during the fine-tuning process for downstream tasks. Drawing inspiration from the notion that in human language processing, syntactic and semantic information is parsed differently, we categorize tokens in sentences according to their lexical categories and focus on chan… ▽ More This study examines whether the attention scores between tokens in the BERT model significantly vary based on lexical categories during the fine-tuning process for downstream tasks. Drawing inspiration from the notion that in human language processing, syntactic and semantic information is parsed differently, we categorize tokens in sentences according to their lexical categories and focus on changes in attention scores among these categories. Our hypothesis posits that in downstream tasks that prioritize semantic information, attention scores centered on content words are enhanced, while in cases emphasizing syntactic information, attention scores centered on function words are intensified. Through experimentation conducted on six tasks from the GLUE benchmark dataset, we substantiate our hypothesis regarding the fine-tuning process. Furthermore, our additional investigations reveal the presence of BERT layers that consistently assign more bias to specific lexical categories, irrespective of the task, highlighting the existence of task-agnostic lexical category preferences. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.16444 [pdf, other]

KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models

Authors: Dongjun Jang, Sungjoo Byun, Hyemi Jo, Hyopil Shin

Abstract: Instruction Tuning on Large Language Models is an essential process for model to function well and achieve high performance in specific tasks. Accordingly, in mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on using the output of ChatGPT or translating datasets… ▽ More Instruction Tuning on Large Language Models is an essential process for model to function well and achieve high performance in specific tasks. Accordingly, in mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on using the output of ChatGPT or translating datasets built in English. In this paper, We introduce \textit{KIT-19} as an instruction dataset for the development of LLM in Korean. \textit{KIT-19} is a dataset created in an instruction format, comprising 19 existing open-source datasets for Korean NLP tasks. In this paper, we train a Korean Pretrained LLM using \textit{KIT-19} to demonstrate its effectiveness. The experimental results show that the model trained on \textit{KIT-19} significantly outperforms existing Korean LLMs. Based on the its quality and empirical results, this paper proposes that \textit{KIT-19} has the potential to make a substantial contribution to the future improvement of Korean LLMs' performance. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.16158 [pdf, other]

Korean Bio-Medical Corpus (KBMC) for Medical Named Entity Recognition

Authors: Sungjoo Byun, Jiseung Hong, Sumin Park, Dongjun Jang, Jean Seo, Minseok Kim, Chaeyoung Oh, Hyopil Shin

Abstract: Named Entity Recognition (NER) plays a pivotal role in medical Natural Language Processing (NLP). Yet, there has not been an open-source medical NER dataset specifically for the Korean language. To address this, we utilized ChatGPT to assist in constructing the KBMC (Korean Bio-Medical Corpus), which we are now presenting to the public. With the KBMC dataset, we noticed an impressive 20% increase… ▽ More Named Entity Recognition (NER) plays a pivotal role in medical Natural Language Processing (NLP). Yet, there has not been an open-source medical NER dataset specifically for the Korean language. To address this, we utilized ChatGPT to assist in constructing the KBMC (Korean Bio-Medical Corpus), which we are now presenting to the public. With the KBMC dataset, we noticed an impressive 20% increase in medical NER performance compared to models trained on general Korean NER datasets. This research underscores the significant benefits and importance of using specialized tools and datasets, like ChatGPT, to enhance language processing in specialized fields such as healthcare. △ Less

Submitted 24 March, 2024; originally announced March 2024.

Journal ref: LREC-COLING 2024

arXiv:2402.15046 [pdf, other]

CARBD-Ko: A Contextually Annotated Review Benchmark Dataset for Aspect-Level Sentiment Classification in Korean

Authors: Dongjun Jang, Jean Seo, Sungjoo Byun, Taekyoung Kim, Minseok Kim, Hyopil Shin

Abstract: This paper explores the challenges posed by aspect-based sentiment classification (ABSC) within pretrained language models (PLMs), with a particular focus on contextualization and hallucination issues. In order to tackle these challenges, we introduce CARBD-Ko (a Contextually Annotated Review Benchmark Dataset for Aspect-Based Sentiment Classification in Korean), a benchmark dataset that incorpora… ▽ More This paper explores the challenges posed by aspect-based sentiment classification (ABSC) within pretrained language models (PLMs), with a particular focus on contextualization and hallucination issues. In order to tackle these challenges, we introduce CARBD-Ko (a Contextually Annotated Review Benchmark Dataset for Aspect-Based Sentiment Classification in Korean), a benchmark dataset that incorporates aspects and dual-tagged polarities to distinguish between aspect-specific and aspect-agnostic sentiment classification. The dataset consists of sentences annotated with specific aspects, aspect polarity, aspect-agnostic polarity, and the intensity of aspects. To address the issue of dual-tagged aspect polarities, we propose a novel approach employing a Siamese Network. Our experimental findings highlight the inherent difficulties in accurately predicting dual-polarities and underscore the significance of contextualized sentiment analysis models. The CARBD-Ko dataset serves as a valuable resource for future research endeavors in aspect-level sentiment classification. △ Less

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.05448 [pdf, other]

Minecraft-ify: Minecraft Style Image Generation with Text-guided Image Editing for In-Game Application

Authors: Bumsoo Kim, Sanghyun Byun, Yonghoon Jung, Wonseop Shin, Sareer UI Amin, Sanghyun Seo

Abstract: In this paper, we first present the character texture generation system \textit{Minecraft-ify}, specified to Minecraft video game toward in-game application. Ours can generate face-focused image for texture mapping tailored to 3D virtual character having cube manifold. While existing projects or works only generate texture, proposed system can inverse the user-provided real image, or generate aver… ▽ More In this paper, we first present the character texture generation system \textit{Minecraft-ify}, specified to Minecraft video game toward in-game application. Ours can generate face-focused image for texture mapping tailored to 3D virtual character having cube manifold. While existing projects or works only generate texture, proposed system can inverse the user-provided real image, or generate average/random appearance from learned distribution. Moreover, it can be manipulated with text-guidance using StyleGAN and StyleCLIP. These features provide a more extended user experience with enlarged freedom as a user-friendly AI-tool. Project page can be found at https://gh-bumsookim.github.io/Minecraft-ify/ △ Less

Submitted 3 March, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: 2 pages, 2 figures. Accepted as Spotlight to NeurIPS 2023 Workshop on Machine Learning for Creativity and Design

arXiv:2311.18215 [pdf, other]

Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models

Authors: Sungjoo Byun, Dongjun Jang, Hyemi Jo, Hyopil Shin

Abstract: Caution: this paper may include material that could be offensive or distressing. The advent of Large Language Models (LLMs) necessitates the development of training approaches that mitigate the generation of unethical language and aptly manage toxic user queries. Given the challenges related to human labor and the scarcity of data, we present KoTox, comprising 39K unethical instruction-output pa… ▽ More Caution: this paper may include material that could be offensive or distressing. The advent of Large Language Models (LLMs) necessitates the development of training approaches that mitigate the generation of unethical language and aptly manage toxic user queries. Given the challenges related to human labor and the scarcity of data, we present KoTox, comprising 39K unethical instruction-output pairs. This collection of automatically generated toxic instructions refines the training of LLMs and establishes a foundational framework for improving LLMs' ethical awareness and response to various toxic inputs, promoting more secure and responsible interactions in Natural Language Processing (NLP) applications. △ Less

Submitted 29 November, 2023; originally announced November 2023.

Comments: NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

arXiv:2311.17492 [pdf, other]

Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

Authors: Jean Seo, Sungjoo Byun, Minha Kang, Sangah Lee

Abstract: The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang(… ▽ More The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang(a historical book) and a Manchu-Korean dictionary. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings, trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score. △ Less

Submitted 12 January, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: emnlp2023/mrl2023

arXiv:2311.13784 [pdf, other]

DaG LLM ver 1.0: Pioneering Instruction-Tuned Language Modeling for Korean NLP

Authors: Dongjun Jang, Sangah Lee, Sungjoo Byun, Jinwoong Kim, Jean Seo, Minseok Kim, Soyeon Kim, Chaeyoung Oh, Jaeyoon Kim, Hyemi Jo, Hyopil Shin

Abstract: This paper presents the DaG LLM (David and Goliath Large Language Model), a language model specialized for Korean and fine-tuned through Instruction Tuning across 41 tasks within 13 distinct categories. This paper presents the DaG LLM (David and Goliath Large Language Model), a language model specialized for Korean and fine-tuned through Instruction Tuning across 41 tasks within 13 distinct categories. △ Less

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2310.02588 [pdf, other]

ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer

Authors: Seok-Yong Byun, Wonju Lee

Abstract: This paper presents a novel approach to address the challenges of understanding the prediction process and debugging prediction errors in Vision Transformers (ViT), which have demonstrated superior performance in various computer vision tasks such as image classification and object detection. While several visual explainability techniques, such as CAM, Grad-CAM, Score-CAM, and Recipro-CAM, have be… ▽ More This paper presents a novel approach to address the challenges of understanding the prediction process and debugging prediction errors in Vision Transformers (ViT), which have demonstrated superior performance in various computer vision tasks such as image classification and object detection. While several visual explainability techniques, such as CAM, Grad-CAM, Score-CAM, and Recipro-CAM, have been extensively researched for Convolutional Neural Networks (CNNs), limited research has been conducted on ViT. Current state-of-the-art solutions for ViT rely on class agnostic Attention-Rollout and Relevance techniques. In this work, we propose a new gradient-free visual explanation method for ViT, called ViT-ReciproCAM, which does not require attention matrix and gradient information. ViT-ReciproCAM utilizes token masking and generated new layer outputs from the target layer's input to exploit the correlation between activated tokens and network predictions for target classes. Our proposed method outperforms the state-of-the-art Relevance method in the Average Drop-Coherence-Complexity (ADCC) metric by $4.58\%$ to $5.80\%$ and generates more localized saliency maps. Our experiments demonstrate the effectiveness of ViT-ReciproCAM and showcase its potential for understanding and debugging ViT models. Our proposed method provides an efficient and easy-to-implement alternative for generating visual explanations, without requiring attention and gradient information, which can be beneficial for various applications in the field of computer vision. △ Less

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2209.14074 [pdf, other]

Recipro-CAM: Fast gradient-free visual explanations for convolutional neural networks

Authors: Seok-Yong Byun, Wonju Lee

Abstract: The Convolutional Neural Network (CNN) is a widely used deep learning architecture for computer vision. However, its black box nature makes it difficult to interpret the behavior of the model. To mitigate this issue, AI practitioners have explored explainable AI methods like Class Activation Map (CAM) and Grad-CAM. Although these methods have shown promise, they are limited by architectural constr… ▽ More The Convolutional Neural Network (CNN) is a widely used deep learning architecture for computer vision. However, its black box nature makes it difficult to interpret the behavior of the model. To mitigate this issue, AI practitioners have explored explainable AI methods like Class Activation Map (CAM) and Grad-CAM. Although these methods have shown promise, they are limited by architectural constraints or the burden of gradient computing. To overcome this issue, Score-CAM and Ablation-CAM have been proposed as gradient-free methods, but they have longer execution times compared to CAM or Grad-CAM based methods, making them unsuitable for real-world solution though they resolved gradient related issues and enabled inference mode XAI. To address this challenge, we propose a fast gradient-free Reciprocal CAM (Recipro-CAM) method. Our approach involves spatially masking the extracted feature maps to exploit the correlation between activation maps and network predictions for target classes. Our proposed method has yielded promising results, outperforming current state-of-the-art method in the Average Drop-Coherence-Complexity (ADCC) metric by $1.78 \%$ to $3.72 \%$, excluding VGG-16 backbone. Moreover, Recipro-CAM generates saliency maps at a similar rate to Grad-CAM and is approximately $148$ times faster than Score-CAM. The source code for Recipro-CAM is available in our data analysis framework. △ Less

Submitted 12 March, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

arXiv:2107.00191 [pdf, other]

Unsupervised Model Drift Estimation with Batch Normalization Statistics for Dataset Shift Detection and Model Selection

Authors: Wonju Lee, Seok-Yong Byun, Jooeun Kim, Minje Park, Kirill Chechil

Abstract: While many real-world data streams imply that they change frequently in a nonstationary way, most of deep learning methods optimize neural networks on training data, and this leads to severe performance degradation when dataset shift happens. However, it is less possible to annotate or inspect newly streamed data by humans, and thus it is desired to measure model drift at inference time in an unsu… ▽ More While many real-world data streams imply that they change frequently in a nonstationary way, most of deep learning methods optimize neural networks on training data, and this leads to severe performance degradation when dataset shift happens. However, it is less possible to annotate or inspect newly streamed data by humans, and thus it is desired to measure model drift at inference time in an unsupervised manner. In this paper, we propose a novel method of model drift estimation by exploiting statistics of batch normalization layer on unlabeled test data. To remedy possible sampling error of streamed input data, we adopt low-rank approximation to each representational layer. We show the effectiveness of our method not only on dataset shift detection but also on model selection when there are multiple candidate models among model zoo or training trajectories in an unsupervised way. We further demonstrate the consistency of our method by comparing model drift scores between different network architectures. △ Less

Submitted 30 June, 2021; originally announced July 2021.

Comments: 11 pages, 5 figures, 2 tables

arXiv:1904.10788 [pdf, other]

doi 10.1109/ICASSP.2019.8683483

Speech Emotion Recognition Using Multi-hop Attention Mechanism

Authors: Seunghyun Yoon, Seokhyun Byun, Subhadeep Dey, Kyomin Jung

Abstract: In this paper, we are interested in exploiting textual and acoustic data of an utterance for the speech emotion classification task. The baseline approach models the information from audio and text independently using two deep neural networks (DNNs). The outputs from both the DNNs are then fused for classification. As opposed to using knowledge from both the modalities separately, we propose a fra… ▽ More In this paper, we are interested in exploiting textual and acoustic data of an utterance for the speech emotion classification task. The baseline approach models the information from audio and text independently using two deep neural networks (DNNs). The outputs from both the DNNs are then fused for classification. As opposed to using knowledge from both the modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data. The proposed framework uses two bi-directional long short-term memory (BLSTM) for obtaining hidden representations of the utterance. Furthermore, we propose an attention mechanism, referred to as the multi-hop, which is trained to automatically infer the correlation between the modalities. The multi-hop attention first computes the relevant segments of the textual data corresponding to the audio signal. The relevant textual data is then applied to attend parts of the audio signal. To evaluate the performance of the proposed system, experiments are performed in the IEMOCAP dataset. Experimental results show that the proposed technique outperforms the state-of-the-art system by 6.5% relative improvement in terms of weighted accuracy. △ Less

Submitted 9 May, 2019; v1 submitted 23 April, 2019; originally announced April 2019.

Comments: 5 pages, Accepted as a conference paper at ICASSP 2019 (oral presentation)

arXiv:1810.04635 [pdf, other]

Multimodal Speech Emotion Recognition Using Audio and Text

Authors: Seunghyun Yoon, Seokhyun Byun, Kyomin Jung

Abstract: Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content,… ▽ More Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%. △ Less

Submitted 10 October, 2018; originally announced October 2018.

Comments: 7 pages, Accepted as a conference paper at IEEE SLT 2018

arXiv:cs/0609098 [pdf]

Reducing the Makespan in Hierarchical Reliable Multicast Tree

Authors: Sang-Seon Byun, Chuck Yoo

Abstract: In hierarchical reliable multicast environment, makespan is the time that is required to fully and successfully transmit a packet from the sender to all receivers. Low makespan is vital for achieving high throughput with a TCP-like window based sending scheme. In hierarchical reliable multicast methods, the number of repair servers and their locations influence the makespan. In this paper we pro… ▽ More In hierarchical reliable multicast environment, makespan is the time that is required to fully and successfully transmit a packet from the sender to all receivers. Low makespan is vital for achieving high throughput with a TCP-like window based sending scheme. In hierarchical reliable multicast methods, the number of repair servers and their locations influence the makespan. In this paper we propose a new method to decide the locations of repair servers that can reduce the makespan in hierarchical reliable multicast networks. Our method has a formulation based on mixed integer programming to analyze the makespan minimization problem. A notable aspect of the formulation is that heterogeneous links and packet losses are taken into account in the formulation. Three different heuristics are presented to find the locations of repair servers in reasonable time in the formulation. Through simulations, three heuristics are carefully analyzed and compared on networks with different sizes. We also evaluate our proposals on PGM (Pragmatic General Multicast) reliable multicast protocol using ns-2 simulation. The results show that the our best heuristic is close to the lower bound by a factor of 2.3 in terms of makespan and by a factor of 5.5 in terms of the number of repair servers. △ Less

Submitted 17 September, 2006; originally announced September 2006.

Comments: 26 pages, 9 figures, submitted to IEEE Transactions on Information Theory

Showing 1–15 of 15 results for author: Byun, S