-
Scalable, reproducible, and cost-effective processing of large-scale medical imaging datasets
Authors:
Michael E. Kim,
Karthik Ramadass,
Chenyu Gao,
Praitayini Kanakaraj,
Nancy R. Newlin,
Gaurav Rudravaram,
Kurt G. Schilling,
Blake E. Dewey,
Derek Archer,
Timothy J. Hohman,
Zhiyuan Li,
Shunxing Bao,
Bennett A. Landman,
Nazirah Mohd Khairi
Abstract:
Curating, processing, and combining large-scale medical imaging datasets from national studies is a non-trivial task due to the intense computation and data throughput required, variability of acquired data, and associated financial overhead. Existing platforms or tools for large-scale data curation, processing, and storage have difficulty achieving a viable cost-to-scale ratio of computation spee…
▽ More
Curating, processing, and combining large-scale medical imaging datasets from national studies is a non-trivial task due to the intense computation and data throughput required, variability of acquired data, and associated financial overhead. Existing platforms or tools for large-scale data curation, processing, and storage have difficulty achieving a viable cost-to-scale ratio of computation speed for research purposes, either being too slow or too expensive. Additionally, management and consistency of processing large data in a team-driven manner is a non-trivial task. We design a BIDS-compliant method for an efficient and robust data processing pipeline of large-scale diffusion-weighted and T1-weighted MRI data compatible with low-cost, high-efficiency computing systems. Our method accomplishes automated querying of data available for processing and process running in a consistent and reproducible manner that has long-term stability, while using heterogenous low-cost computational resources and storage systems for efficient processing and data transfer. We demonstrate how our organizational structure permits efficiency in a semi-automated data processing pipeline and show how our method is comparable in processing time to cloud-based computation while being almost 20 times more cost-effective. Our design allows for fast data throughput speeds and low latency to reduce the time for data transfer between storage servers and computation servers, achieving an average of 0.60 Gb/s compared to 0.33 Gb/s for using cloud-based processing methods. The design of our workflow engine permits quick process running while maintaining flexibility to adapt to newly acquired data.
△ Less
Submitted 26 August, 2024;
originally announced August 2024.
-
Unlocking Intrinsic Fairness in Stable Diffusion
Authors:
Eunji Kim,
Siwon Kim,
Rahim Entezari,
Sungroh Yoon
Abstract:
Recent text-to-image models like Stable Diffusion produce photo-realistic images but often show demographic biases. Previous debiasing methods focused on training-based approaches, failing to explore the root causes of bias and overlooking Stable Diffusion's potential for unbiased image generation. In this paper, we demonstrate that Stable Diffusion inherently possesses fairness, which can be unlo…
▽ More
Recent text-to-image models like Stable Diffusion produce photo-realistic images but often show demographic biases. Previous debiasing methods focused on training-based approaches, failing to explore the root causes of bias and overlooking Stable Diffusion's potential for unbiased image generation. In this paper, we demonstrate that Stable Diffusion inherently possesses fairness, which can be unlocked to achieve debiased outputs. Through carefully designed experiments, we identify the excessive bonding between text prompts and the diffusion process as a key source of bias. To address this, we propose a novel approach that perturbs text conditions to unleash Stable Diffusion's intrinsic fairness. Our method effectively mitigates bias without additional tuning, while preserving image-text alignment and image quality.
△ Less
Submitted 22 August, 2024;
originally announced August 2024.
-
DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation
Authors:
Jisoo Kim,
Jungbin Cho,
Joonho Park,
Soonmin Hwang,
Da Eun Kim,
Geon Kim,
Youngjae Yu
Abstract:
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hinde…
▽ More
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, A talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Source code will be made publicly available soon.
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
EXAONE 3.0 7.8B Instruction Tuned Language Model
Authors:
LG AI Research,
:,
Soyoung An,
Kyunghoon Bae,
Eunbi Choi,
Stanley Jungkyu Choi,
Yemuk Choi,
Seokhee Hong,
Yeonjung Hong,
Junwon Hwang,
Hyojin Jeon,
Gerrard Jeongwon Jo,
Hyunjik Jo,
Jiyeon Jung,
Yountae Jung,
Euisoon Kim,
Hyosang Kim,
Joonkee Kim,
Seonghwan Kim,
Soyeon Kim,
Sunkyoung Kim,
Yireun Kim,
Youchul Kim,
Edward Hwayoung Lee,
Haeju Lee
, et al. (14 additional authors not shown)
Abstract:
We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly compet…
▽ More
We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art open models of similar size. Our comparative analysis shows that EXAONE 3.0 excels particularly in Korean, while achieving compelling performance across general tasks and complex reasoning. With its strong real-world effectiveness and bilingual proficiency, we hope that EXAONE keeps contributing to advancements in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct
△ Less
Submitted 13 August, 2024; v1 submitted 7 August, 2024;
originally announced August 2024.
-
HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation
Authors:
Wencan Cheng,
Eunji Kim,
Jong Hwan Ko
Abstract:
The extraction of keypoint positions from input hand frames, known as 3D hand pose estimation, is crucial for various human-computer interaction applications. However, current approaches often struggle with the dynamic nature of self-occlusion of hands and intra-occlusion with interacting objects. To address this challenge, this paper proposes the Denoising Adaptive Graph Transformer, HandDAGT, fo…
▽ More
The extraction of keypoint positions from input hand frames, known as 3D hand pose estimation, is crucial for various human-computer interaction applications. However, current approaches often struggle with the dynamic nature of self-occlusion of hands and intra-occlusion with interacting objects. To address this challenge, this paper proposes the Denoising Adaptive Graph Transformer, HandDAGT, for hand pose estimation. The proposed HandDAGT leverages a transformer structure to thoroughly explore effective geometric features from input patches. Additionally, it incorporates a novel attention mechanism to adaptively weigh the contribution of kinematic correspondence and local geometric features for the estimation of specific keypoints. This attribute enables the model to adaptively employ kinematic and local information based on the occlusion situation, enhancing its robustness and accuracy. Furthermore, we introduce a novel denoising training strategy aimed at improving the model's robust performance in the face of occlusion challenges. Experimental results show that the proposed model significantly outperforms the existing methods on four challenging hand pose benchmark datasets. Codes and pre-trained models are publicly available at https://github.com/cwc1260/HandDAGT.
△ Less
Submitted 30 July, 2024;
originally announced July 2024.
-
Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping
Authors:
Minseong Park,
Suhan Woo,
Euntai Kim
Abstract:
Learning efficient representations of local features is a key challenge in feature volume-based 3D neural mapping, especially in large-scale environments. In this paper, we introduce Decomposition-based Neural Mapping (DNMap), a storage-efficient large-scale 3D mapping method that employs a discrete representation based on a decomposition strategy. This decomposition strategy aims to efficiently c…
▽ More
Learning efficient representations of local features is a key challenge in feature volume-based 3D neural mapping, especially in large-scale environments. In this paper, we introduce Decomposition-based Neural Mapping (DNMap), a storage-efficient large-scale 3D mapping method that employs a discrete representation based on a decomposition strategy. This decomposition strategy aims to efficiently capture repetitive and representative patterns of shapes by decomposing each discrete embedding into component vectors that are shared across the embedding space. Our DNMap optimizes a set of component vectors, rather than entire discrete embeddings, and learns composition rather than indexing the discrete embeddings. Furthermore, to complement the mapping quality, we additionally learn low-resolution continuous embeddings that require tiny storage space. By combining these representations with a shallow neural network and an efficient octree-based feature volume, our DNMap successfully approximates signed distance functions and compresses the feature volume while preserving mapping quality. Our source code is available at https://github.com/minseong-p/dnmap.
△ Less
Submitted 22 July, 2024;
originally announced July 2024.
-
AI-Assisted SQL Authoring at Industry Scale
Authors:
Chandra Maddila,
Negar Ghorbani,
Kosay Jabre,
Vijayaraghavan Murali,
Edwin Kim,
Parth Thakkar,
Nikolay Pavlovich Laptev,
Olivia Harman,
Diana Hsu,
Rui Abreu,
Peter C. Rigby
Abstract:
SqlCompose brings generative AI into the data analytics domain. SQL is declarative, has formal table schemas, and is often written in a non-linear manner. We address each of these challenges and develop a set of models that shows the importance of each problem. We first develop an internal SQL benchmark to perform offline tests at Meta. We evaluate how well the Public Llama model performs. We atta…
▽ More
SqlCompose brings generative AI into the data analytics domain. SQL is declarative, has formal table schemas, and is often written in a non-linear manner. We address each of these challenges and develop a set of models that shows the importance of each problem. We first develop an internal SQL benchmark to perform offline tests at Meta. We evaluate how well the Public Llama model performs. We attain a BLEU score of 53% and 24% for single- and multi-line predictions, respectively. This performance is consistent with prior works on imperative languages. We then fine-tune Llama on our internal data and database schemas. SqlComposeSA substantially outperforms Llama by 16 percentage points on BLEU score. SQL is often written with multiple sub queries and in a non-sequential manner. We develop SqlComposeFIM which is aware of the context before and after the line(s) that need to be completed. This fill-in-the-middle model outperform SqlComposeFIM by 35 percentage points. We also measure how often the models get the correct table names, and SqlComposeFIM is able to do this 75% of the time. Aside from our scientific research, we also roll out SqlComposeFIM at Meta. SqlCompose is used on a weekly basis by over 10k users including data scientists and software engineers, less than 1% of users have disabled SqlCompose. We use the feedback from users to improve SqlCompose. Interesting positive themes include completing tedious or repetitive SQL clauses, suggesting boilerplate coding, and help in eliminate the need to remember difficult SQL syntax. The most significant negative themes was table and column name hallucinations, which has been reduced with the release of SqlComposeFIM. The SqlCompose models consistently outperform public and internal LLMs, despite being smaller (7 bn and 13 bn), which provides early indications that smaller specialist models can outperform larger general purpose models.
△ Less
Submitted 19 July, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Retrieval-Enhanced Machine Learning: Synthesis and Opportunities
Authors:
To Eun Kim,
Alireza Salemi,
Andrew Drozdov,
Fernando Diaz,
Hamed Zamani
Abstract:
In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine…
▽ More
In the field of language modeling, models augmented with retrieval components have emerged as a promising solution to address several challenges faced in the natural language processing (NLP) field, including knowledge grounding, interpretability, and scalability. Despite the primary focus on NLP, we posit that the paradigm of retrieval-enhancement can be extended to a broader spectrum of machine learning (ML) such as computer vision, time series prediction, and computational biology. Therefore, this work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations which is missing from the current literature. Also, we found that while a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between the seminal IR research and contemporary REML studies by investigating each component that comprises the REML framework. Ultimately, the goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Unsupervised Anomaly Detection Using Diffusion Trend Analysis
Authors:
Eunwoo Kim,
Un Yang,
Cheol Lae Roh,
Stefano Ermon
Abstract:
Conventional anomaly detection techniques based on reconstruction via denoising diffusion model are widely used due to their ability to identify anomaly locations and shapes with high performance. However, there is a limitation in determining appropriate noise parameters that can degrade anomalies while preserving normal characteristics. Also, due to the volatility of the diffusion model, normal r…
▽ More
Conventional anomaly detection techniques based on reconstruction via denoising diffusion model are widely used due to their ability to identify anomaly locations and shapes with high performance. However, there is a limitation in determining appropriate noise parameters that can degrade anomalies while preserving normal characteristics. Also, due to the volatility of the diffusion model, normal regions can fluctuate considerably during reconstruction, resulting in false detection. In this paper, we propose a method to detect anomalies by analysis of reconstruction trend depending on the degree of degradation, effectively solving the both problems of existing methods. The proposed method is validated on an open dataset for industrial anomaly detection, improving the performance of existing methods on a number of evaluation criteria. With the ease of combination with existing anomaly detection methods, it provides a tradeoff between computational cost and performance, allowing it high application potential in manufacturing industry.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data
Authors:
Ximing Wen,
Rosina O. Weber,
Anik Sen,
Darryl Hannan,
Steven C. Nesbit,
Vincent Chan,
Alberto Goffi,
Michael Morris,
John C. Hunninghake,
Nicholas E. Villalobos,
Edward Kim,
Christopher J. MacLellan
Abstract:
Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and interpreting ultrasound scans right at the patient's bedside. However, the expertise needed to interpret these images is considerable and may not always be present in emergency situations. This reality makes algorithms such as machine learning classifiers extremely valuable to augment human decisions. POCUS devices are b…
▽ More
Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and interpreting ultrasound scans right at the patient's bedside. However, the expertise needed to interpret these images is considerable and may not always be present in emergency situations. This reality makes algorithms such as machine learning classifiers extremely valuable to augment human decisions. POCUS devices are becoming available at a reasonable cost in the size of a mobile phone. The challenge of turning POCUS devices into life-saving tools is that interpretation of ultrasound images requires specialist training and experience. Unfortunately, the difficulty to obtain positive training images represents an important obstacle to building efficient and accurate classifiers. Hence, the problem we try to investigate is how to explore strategies to increase accuracy of classifiers trained with scarce data. We hypothesize that training with a few data instances may not suffice for classifiers to generalize causing them to overfit. Our approach uses an Explainable AI-Augmented approach to help the algorithm learn more from less and potentially help the classifier better generalize.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
Authors:
Sion Yoon,
Tae Eun Kim,
Yoo Jung Oh
Abstract:
The dynamics of human-AI communication have been reshaped by language models such as ChatGPT. However, extant research has primarily focused on dyadic communication, leaving much to be explored regarding the dynamics of human-AI communication in group settings. The availability of multiple language model chatbots presents a unique opportunity for scholars to better understand the interaction betwe…
▽ More
The dynamics of human-AI communication have been reshaped by language models such as ChatGPT. However, extant research has primarily focused on dyadic communication, leaving much to be explored regarding the dynamics of human-AI communication in group settings. The availability of multiple language model chatbots presents a unique opportunity for scholars to better understand the interaction between humans and multiple chatbots. This study examines the impact of multi-chatbot communication in a specific persuasion setting: promoting charitable donations. We developed an online environment that enables multi-chatbot communication and conducted a pilot experiment utilizing two GPT-based chatbots, Save the Children and UNICEF chatbots, to promote charitable donations. In this study, we present our development process of the multi-chatbot interface and present preliminary findings from a pilot experiment. Analysis of qualitative and quantitative feedback are presented, and limitations are addressed.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
SWANN: Shuffling Weights in Crossbar Arrays for Enhanced DNN Accuracy in Deeply Scaled Technologies
Authors:
Jeffry Victor,
Dong Eun Kim,
Chunguang Wang,
Kaushik Roy,
Sumeet Gupta
Abstract:
Deep neural network (DNN) accelerators employing crossbar arrays capable of in-memory computing (IMC) are highly promising for neural computing platforms. However, in deeply scaled technologies, interconnect resistance severely impairs IMC robustness, leading to a drop in the system accuracy. To address this problem, we propose SWANN - a technique based on shuffling weights in crossbar arrays whic…
▽ More
Deep neural network (DNN) accelerators employing crossbar arrays capable of in-memory computing (IMC) are highly promising for neural computing platforms. However, in deeply scaled technologies, interconnect resistance severely impairs IMC robustness, leading to a drop in the system accuracy. To address this problem, we propose SWANN - a technique based on shuffling weights in crossbar arrays which alleviates the detrimental effect of wire resistance on IMC. For 8T-SRAM-based 128x128 crossbar arrays in 7nm technology, SWANN enhances the accuracy from 47.78% to 83.5% for ResNet-20/CIFAR-10. We also show that SWANN can be used synergistically with Partial-Word-LineActivation, further boosting the accuracy. Moreover, we evaluate the implications of SWANN for compact ferroelectric-transistorbased crossbar arrays. SWANN incurs minimal hardware overhead, with less than a 1% increase in energy consumption. Additionally, the latency and area overheads of SWANN are ~1% and ~16%, respectively when 1 ADC is utilized per crossbar array.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Enhancing Single-Slice Segmentation with 3D-to-2D Unpaired Scan Distillation
Authors:
Xin Yu,
Qi Yang,
Han Liu,
Ho Hin Lee,
Yucheng Tang,
Lucas W. Remedios,
Michael E. Kim,
Rendong Zhang,
Shunxing Bao,
Yuankai Huo,
Ann Zenobia Moore,
Luigi Ferrucci,
Bennett A. Landman
Abstract:
2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmenta…
▽ More
2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmentation results. In this work, we propose a novel 3D-to-2D distillation framework, leveraging pre-trained 3D models to enhance 2D single-slice segmentation. Specifically, we extract the prediction distribution centroid from the 3D representations, to guide the 2D student by learning intra- and inter-class correlation. Unlike traditional knowledge distillation methods that require the same data input, our approach employs unpaired 3D CT scans with any contrast to guide the 2D student model. Experiments conducted on 707 subjects from the single-slice Baltimore Longitudinal Study of Aging (BLSA) dataset demonstrate that state-of-the-art 2D multi-organ segmentation methods can benefit from the 3D teacher model, achieving enhanced performance in single-slice multi-organ segmentation. Notably, our approach demonstrates considerable efficacy in low-data regimes, outperforming the model trained with all available training subjects even when utilizing only 200 training subjects. Thus, this work underscores the potential to alleviate manual annotation burdens.
△ Less
Submitted 12 July, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Fast Global Localization on Neural Radiance Field
Authors:
Mangyu Kong,
Seongwon Lee,
Jaewon Lee,
Euntai Kim
Abstract:
Neural Radiance Fields (NeRF) presented a novel way to represent scenes, allowing for high-quality 3D reconstruction from 2D images. Following its remarkable achievements, global localization within NeRF maps is an essential task for enabling a wide range of applications. Recently, Loc-NeRF demonstrated a localization approach that combines traditional Monte Carlo Localization with NeRF, showing p…
▽ More
Neural Radiance Fields (NeRF) presented a novel way to represent scenes, allowing for high-quality 3D reconstruction from 2D images. Following its remarkable achievements, global localization within NeRF maps is an essential task for enabling a wide range of applications. Recently, Loc-NeRF demonstrated a localization approach that combines traditional Monte Carlo Localization with NeRF, showing promising results for using NeRF as an environment map. However, despite its advancements, Loc-NeRF encounters the challenge of a time-intensive ray rendering process, which can be a significant limitation in practical applications. To address this issue, we introduce Fast Loc-NeRF, which leverages a coarse-to-fine approach to enable more efficient and accurate NeRF map-based global localization. Specifically, Fast Loc-NeRF matches rendered pixels and observed images on a multi-resolution from low to high resolution. As a result, it speeds up the costly particle update process while maintaining precise localization results. Additionally, to reject the abnormal particles, we propose particle rejection weighting, which estimates the uncertainty of particles by exploiting NeRF's characteristics and considers them in the particle weighting process. Our Fast Loc-NeRF sets new state-of-the-art localization performances on several benchmarks, convincing its accuracy and efficiency.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Zero-Shot Scene Change Detection
Authors:
Kyusik Cho,
Dong Yeop Kim,
Euntai Kim
Abstract:
We present a novel, training-free approach to scene change detection. Our method leverages tracking models, which inherently perform change detection between consecutive frames of video by identifying common objects and detecting new or missing objects. Specifically, our method takes advantage of the change detection effect of the tracking model by inputting reference and query images instead of c…
▽ More
We present a novel, training-free approach to scene change detection. Our method leverages tracking models, which inherently perform change detection between consecutive frames of video by identifying common objects and detecting new or missing objects. Specifically, our method takes advantage of the change detection effect of the tracking model by inputting reference and query images instead of consecutive frames. Furthermore, we focus on the content gap and style gap between two input images in change detection, and address both issues by proposing adaptive content threshold and style bridging layers, respectively. Finally, we extend our approach to video to exploit rich temporal information, enhancing scene change detection performance. We compare our approach and baseline through various experiments. While existing train-based baseline tend to specialize only in the trained domain, our method shows consistent performance across various domains, proving the competitiveness of our approach.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Authors:
Junho Myung,
Nayeon Lee,
Yi Zhou,
Jiho Jin,
Rifki Afina Putri,
Dimosthenis Antypas,
Hsuvas Borkakoty,
Eunsu Kim,
Carla Perez-Almendros,
Abinew Ali Ayele,
VĂctor GutiĂ©rrez-Basulto,
YazmĂn Ibåñez-GarcĂa,
Hwaran Lee,
Shamsuddeen Hassan Muhammad,
Kiwoong Park,
Anar Sabuhi Rzayev,
Nina White,
Seid Muhie Yimam,
Mohammad Taher Pilehvar,
Nedjma Ousidhoum,
Jose Camacho-Collados,
Alice Oh
Abstract:
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food…
▽ More
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation
Authors:
Eungbeom Kim,
Hantae Kim,
Kyogu Lee
Abstract:
Transformer encoder with connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR displays a problem of disagreement between teacher-student models in frame-level alignment which ultimately hinders it from improving the student model's performance. In order to resolve this problem, this paper introduce…
▽ More
Transformer encoder with connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR displays a problem of disagreement between teacher-student models in frame-level alignment which ultimately hinders it from improving the student model's performance. In order to resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during the training time. In contrast to the conventional method using separate teacher and student models, this study introduces a simple and effective method sharing encoder layers and applying the sub-model as the student model. Overall, our approach is effective in improving both the resource efficiency as well as performance. We also conducted an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024
Authors:
Jinwoo Ahn,
Junhyeok Park,
Min-Jun Kim,
Kang-Hyeon Kim,
So-Yeong Sohn,
Yun-Ji Lee,
Du-Seong Chang,
Yu-Jung Heo,
Eun-Sol Kim
Abstract:
In this paper, the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge is presented. Beyond conventional visual question-answering problems, the SMART-101 challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we suggest two m…
▽ More
In this paper, the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge is presented. Beyond conventional visual question-answering problems, the SMART-101 challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we suggest two main ideas. First, to utilize the reasoning ability of a large-scale language model (LLM), the given visual cues (images) are grounded in the text modality. For this purpose, we generate highly detailed text captions that describe the context of the image and use these captions as input for the LLM. Second, due to the nature of puzzle images, which often contain various geometric visual patterns, we utilize an object detection algorithm to ensure these patterns are not overlooked in the captioning process. We employed the SAM algorithm, which can detect various-size objects, to capture the visual features of these geometric patterns and used this information as input for the LLM. Under the puzzle split configuration, we achieved an option selection accuracy Oacc of 29.5 on the test set and a weighted option selection accuracy (WOSA) of 27.1 on the challenge set.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Information Geometry of Evolution of Neural Network Parameters While Training
Authors:
Abhiram Anand Thiruthummal,
Eun-jin Kim,
Sergiy Shelyag
Abstract:
Artificial neural networks (ANNs) are powerful tools capable of approximating any arbitrary mathematical function, but their interpretability remains limited, rendering them as black box models. To address this issue, numerous methods have been proposed to enhance the explainability and interpretability of ANNs. In this study, we introduce the application of information geometric framework to inve…
▽ More
Artificial neural networks (ANNs) are powerful tools capable of approximating any arbitrary mathematical function, but their interpretability remains limited, rendering them as black box models. To address this issue, numerous methods have been proposed to enhance the explainability and interpretability of ANNs. In this study, we introduce the application of information geometric framework to investigate phase transition-like behavior during the training of ANNs and relate these transitions to overfitting in certain models.
The evolution of ANNs during training is studied by looking at the probability distribution of its parameters. Information geometry utilizing the principles of differential geometry, offers a unique perspective on probability and statistics by considering probability density functions as points on a Riemannian manifold. We create this manifold using a metric based on Fisher information to define a distance and a velocity. By parameterizing this distance and velocity with training steps, we study how the ANN evolves as training progresses. Utilizing standard datasets like MNIST, FMNIST and CIFAR-10, we observe a transition in the motion on the manifold while training the ANN and this transition is identified with over-fitting in the ANN models considered. The information geometric transitions observed is shown to be mathematically similar to the phase transitions in physics. Preliminary results showing finite-size scaling behavior is also provided. This work contributes to the development of robust tools for improving the explainability and interpretability of ANNs, aiding in our understanding of the variability of the parameters these complex models exhibit during training.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Extremization to Fine Tune Physics Informed Neural Networks for Solving Boundary Value Problems
Authors:
Abhiram Anand Thiruthummal,
Sergiy Shelyag,
Eun-jin Kim
Abstract:
We propose a novel method for fast and accurate training of physics-informed neural networks (PINNs) to find solutions to boundary value problems (BVPs) and initial boundary value problems (IBVPs). By combining the methods of training deep neural networks (DNNs) and Extreme Learning Machines (ELMs), we develop a model which has the expressivity of DNNs with the fine-tuning ability of ELMs. We show…
▽ More
We propose a novel method for fast and accurate training of physics-informed neural networks (PINNs) to find solutions to boundary value problems (BVPs) and initial boundary value problems (IBVPs). By combining the methods of training deep neural networks (DNNs) and Extreme Learning Machines (ELMs), we develop a model which has the expressivity of DNNs with the fine-tuning ability of ELMs. We showcase the superiority of our proposed method by solving several BVPs and IBVPs which include linear and non-linear ordinary differential equations (ODEs), partial differential equations (PDEs) and coupled PDEs. The examples we consider include a stiff coupled ODE system where traditional numerical methods fail, a 3+1D non-linear PDE, Kovasznay flow and Taylor-Green vortex solutions to incompressible Navier-Stokes equations and pure advection solution of 1+1 D compressible Euler equation.
The Theory of Functional Connections (TFC) is used to exactly impose initial and boundary conditions (IBCs) of (I)BVPs on PINNs. We propose a modification to the TFC framework named Reduced TFC and show a significant improvement in the training and inference time of PINNs compared to IBCs imposed using TFC. Furthermore, Reduced TFC is shown to be able to generalize to more complex boundary geometries which is not possible with TFC. We also introduce a method of applying boundary conditions at infinity for BVPs and numerically solve the pure advection in 1+1 D Euler equations using these boundary conditions.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Fields, Bridges, and Foundations: How Researchers Browse Citation Network Visualizations
Authors:
Kiroong Choe,
Eunhye Kim,
Sangwon Park,
Jinwook Seo
Abstract:
Visualizing citation relations with network structures is widely used, but the visual complexity can make it challenging for individual researchers to navigate through them. We collected data from 18 researchers using an interface that we designed using network simplification methods and analyzed how users browsed and identified important papers. Our analysis reveals six major patterns used for id…
▽ More
Visualizing citation relations with network structures is widely used, but the visual complexity can make it challenging for individual researchers to navigate through them. We collected data from 18 researchers using an interface that we designed using network simplification methods and analyzed how users browsed and identified important papers. Our analysis reveals six major patterns used for identifying papers of interest, which can be categorized into three key components: Fields, Bridges, and Foundations, each viewed from two distinct perspectives: layout-oriented and connection-oriented. The connection-oriented approach was found to be more effective for selecting relevant papers, but the layout-oriented method was adopted more often, even though it led to unexpected results and user frustration. Our findings emphasize the importance of integrating these components and the necessity to balance visual layouts with meaningful connections to enhance the effectiveness of citation networks in academic browsing systems.
△ Less
Submitted 12 May, 2024;
originally announced May 2024.
-
DCNN: Dual Cross-current Neural Networks Realized Using An Interactive Deep Learning Discriminator for Fine-grained Objects
Authors:
Da Fu,
Mingfei Rong,
Eun-Hu Kim,
Hao Huang,
Witold Pedrycz
Abstract:
Accurate classification of fine-grained images remains a challenge in backbones based on convolutional operations or self-attention mechanisms. This study proposes novel dual-current neural networks (DCNN), which combine the advantages of convolutional operations and self-attention mechanisms to improve the accuracy of fine-grained image classification. The main novel design features for construct…
▽ More
Accurate classification of fine-grained images remains a challenge in backbones based on convolutional operations or self-attention mechanisms. This study proposes novel dual-current neural networks (DCNN), which combine the advantages of convolutional operations and self-attention mechanisms to improve the accuracy of fine-grained image classification. The main novel design features for constructing a weakly supervised learning backbone model DCNN include (a) extracting heterogeneous data, (b) keeping the feature map resolution unchanged, (c) expanding the receptive field, and (d) fusing global representations and local features. Experimental results demonstrated that using DCNN as the backbone network for classifying certain fine-grained benchmark datasets achieved performance advantage improvements of 13.5--19.5% and 2.2--12.9%, respectively, compared to other advanced convolution or attention-based fine-grained backbones.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Deep Learning-based Prediction of Breast Cancer Tumor and Immune Phenotypes from Histopathology
Authors:
Tiago Gonçalves,
Dagoberto Pulido-Arias,
Julian Willett,
Katharina V. Hoebel,
Mason Cleveland,
Syed Rakin Ahmed,
Elizabeth Gerstner,
Jayashree Kalpathy-Cramer,
Jaime S. Cardoso,
Christopher P. Bridge,
Albert E. Kim
Abstract:
The interactions between tumor cells and the tumor microenvironment (TME) dictate therapeutic efficacy of radiation and many systemic therapies in breast cancer. However, to date, there is not a widely available method to reproducibly measure tumor and immune phenotypes for each patient's tumor. Given this unmet clinical need, we applied multiple instance learning (MIL) algorithms to assess activi…
▽ More
The interactions between tumor cells and the tumor microenvironment (TME) dictate therapeutic efficacy of radiation and many systemic therapies in breast cancer. However, to date, there is not a widely available method to reproducibly measure tumor and immune phenotypes for each patient's tumor. Given this unmet clinical need, we applied multiple instance learning (MIL) algorithms to assess activity of ten biologically relevant pathways from the hematoxylin and eosin (H&E) slide of primary breast tumors. We employed different feature extraction approaches and state-of-the-art model architectures. Using binary classification, our models attained area under the receiver operating characteristic (AUROC) scores above 0.70 for nearly all gene expression pathways and on some cases, exceeded 0.80. Attention maps suggest that our trained models recognize biologically relevant spatial patterns of cell sub-populations from H&E. These efforts represent a first step towards developing computational H&E biomarkers that reflect facets of the TME and hold promise for augmenting precision oncology.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Pegasus-v1 Technical Report
Authors:
Raehyuk Jung,
Hyojun Go,
Jaehyuk Yi,
Jiho Jang,
Daniel Kim,
Jay Suh,
Aiden Lee,
Cooper Han,
Jae Lee,
Jeff Kim,
Jin-Young Kim,
Junwan Kim,
Kyle Park,
Lucas Lee,
Mars Ha,
Minjoon Seo,
Abraham Jo,
Ed Park,
Hassan Kianinejad,
SJ Kim,
Tony Moon,
Wade Jeong,
Andrei Popescu,
Esther Kim,
EK Yoon
, et al. (19 additional authors not shown)
Abstract:
This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, to offer nuanced video content comprehension across various lengths. This technical report overviews Pegasus-1's archi…
▽ More
This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, to offer nuanced video content comprehension across various lengths. This technical report overviews Pegasus-1's architecture, training strategies, and its performance in benchmarks on video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1 , demonstrating its capabilities as well as its limitations, in order to provide readers a balanced view of its current state and its future direction.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
General Item Representation Learning for Cold-start Content Recommendations
Authors:
Jooeun Kim,
Jinri Kim,
Kwangeun Yeo,
Eungi Kim,
Kyoung-Woon On,
Jonghwan Mun,
Joonseok Lee
Abstract:
Cold-start item recommendation is a long-standing challenge in recommendation systems. A common remedy is to use a content-based approach, but rich information from raw contents in various forms has not been fully utilized. In this paper, we propose a domain/data-agnostic item representation learning framework for cold-start recommendations, naturally equipped with multimodal alignment among vario…
▽ More
Cold-start item recommendation is a long-standing challenge in recommendation systems. A common remedy is to use a content-based approach, but rich information from raw contents in various forms has not been fully utilized. In this paper, we propose a domain/data-agnostic item representation learning framework for cold-start recommendations, naturally equipped with multimodal alignment among various features by adopting a Transformer-based architecture. Our proposed model is end-to-end trainable completely free from classification labels, not just costly to collect but suboptimal for recommendation-purpose representation learning. From extensive experiments on real-world movie and news recommendation benchmarks, we verify that our approach better preserves fine-grained user taste than state-of-the-art baselines, universally applicable to multiple domains at large scale.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
Towards quantum computing for clinical trial design and optimization: A perspective on new opportunities and challenges
Authors:
Hakan Doga,
M. Emre Sahin,
Joao Bettencourt-Silva,
Anh Pham,
Eunyoung Kim,
Alan Andress,
Sudhir Saxena,
Aritra Bose,
Laxmi Parida,
Jan Lukas Robertus,
Hideaki Kawaguchi,
Radwa Soliman,
Daniel Blankenberg
Abstract:
Clinical trials are pivotal in the drug discovery process to determine the safety and efficacy of a drug candidate. The high failure rates of these trials are attributed to deficiencies in clinical model development and protocol design. Improvements in the clinical drug design process could therefore yield significant benefits for all stakeholders involved. This paper examines the current challeng…
▽ More
Clinical trials are pivotal in the drug discovery process to determine the safety and efficacy of a drug candidate. The high failure rates of these trials are attributed to deficiencies in clinical model development and protocol design. Improvements in the clinical drug design process could therefore yield significant benefits for all stakeholders involved. This paper examines the current challenges faced in clinical trial design and optimization, reviews established classical computational approaches, and introduces quantum algorithms aimed at enhancing these processes. Specifically, the focus is on three critical aspects: clinical trial simulations, site selection, and cohort identification. This study aims to provide a comprehensive framework that leverages quantum computing to innovate and refine the efficiency and effectiveness of clinical trials.
△ Less
Submitted 19 April, 2024;
originally announced April 2024.
-
E3: Ensemble of Expert Embedders for Adapting Synthetic Image Detectors to New Generators Using Limited Data
Authors:
Aref Azizpour,
Tai D. Nguyen,
Manil Shrestha,
Kaidi Xu,
Edward Kim,
Matthew C. Stamm
Abstract:
As generative AI progresses rapidly, new synthetic image generators continue to emerge at a swift pace. Traditional detection methods face two main challenges in adapting to these generators: the forensic traces of synthetic images from new techniques can vastly differ from those learned during training, and access to data for these new generators is often limited. To address these issues, we intr…
▽ More
As generative AI progresses rapidly, new synthetic image generators continue to emerge at a swift pace. Traditional detection methods face two main challenges in adapting to these generators: the forensic traces of synthetic images from new techniques can vastly differ from those learned during training, and access to data for these new generators is often limited. To address these issues, we introduce the Ensemble of Expert Embedders (E3), a novel continual learning framework for updating synthetic image detectors. E3 enables the accurate detection of images from newly emerged generators using minimal training data. Our approach does this by first employing transfer learning to develop a suite of expert embedders, each specializing in the forensic traces of a specific generator. Then, all embeddings are jointly analyzed by an Expert Knowledge Fusion Network to produce accurate and reliable detection decisions. Our experiments demonstrate that E3 outperforms existing continual learning methods, including those developed specifically for synthetic image detection.
△ Less
Submitted 16 April, 2024; v1 submitted 12 April, 2024;
originally announced April 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…
▽ More
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
△ Less
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Software-Defined Cryptography: A Design Feature of Cryptographic Agility
Authors:
Jihoon Cho,
Changhoon Lee,
Eunkyung Kim,
Jieun Lee,
Beumjin Cho
Abstract:
Given the widespread use of cryptography in Enterprise IT, migration to post-quantum cryptography (PQC) is not drop-in replacement at all. Cryptographic agility, or crypto-agility, is a design feature that enables seamless updates to new cryptographic algorithms and standards without the need to modify or replace the surrounding infrastructure. This paper introduces a notion of software-defined cr…
▽ More
Given the widespread use of cryptography in Enterprise IT, migration to post-quantum cryptography (PQC) is not drop-in replacement at all. Cryptographic agility, or crypto-agility, is a design feature that enables seamless updates to new cryptographic algorithms and standards without the need to modify or replace the surrounding infrastructure. This paper introduces a notion of software-defined cryptography as the desired design feature for crypto-agility, emphasizing the role of software in providing centralized governance for cryptography and automated enforcement of cryptographic policies, such as migration to PQC.
△ Less
Submitted 1 September, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Statistical Analysis by Semiparametric Additive Regression and LSTM-FCN Based Hierarchical Classification for Computer Vision Quantification of Parkinsonian Bradykinesia
Authors:
Youngseo Cho,
In Hee Kwak,
Dohyeon Kim,
Jinhee Na,
Hanjoo Sung,
Jeongjae Lee,
Young Eun Kim,
Hyeo-il Ma
Abstract:
Bradykinesia, characterized by involuntary slowing or decrement of movement, is a fundamental symptom of Parkinson's Disease (PD) and is vital for its clinical diagnosis. Despite various methodologies explored to quantify bradykinesia, computer vision-based approaches have shown promising results. However, these methods often fall short in adequately addressing key bradykinesia characteristics in…
▽ More
Bradykinesia, characterized by involuntary slowing or decrement of movement, is a fundamental symptom of Parkinson's Disease (PD) and is vital for its clinical diagnosis. Despite various methodologies explored to quantify bradykinesia, computer vision-based approaches have shown promising results. However, these methods often fall short in adequately addressing key bradykinesia characteristics in repetitive limb movements: "occasional arrest" and "decrement in amplitude."
This research advances vision-based quantification of bradykinesia by introducing nuanced numerical analysis to capture decrement in amplitudes and employing a simple deep learning technique, LSTM-FCN, for precise classification of occasional arrests. Our approach structures the classification process hierarchically, tailoring it to the unique dynamics of bradykinesia in PD.
Statistical analysis of the extracted features, including those representing arrest and fatigue, has demonstrated their statistical significance in most cases. This finding underscores the importance of considering "occasional arrest" and "decrement in amplitude" in bradykinesia quantification of limb movement. Our enhanced diagnostic tool has been rigorously tested on an extensive dataset comprising 1396 motion videos from 310 PD patients, achieving an accuracy of 80.3%. The results confirm the robustness and reliability of our method.
△ Less
Submitted 31 March, 2024;
originally announced April 2024.
-
Aligning Large Language Models for Enhancing Psychiatric Interviews through Symptom Delineation and Summarization
Authors:
Jae-hee So,
Joonhwan Chang,
Eunji Kim,
Junho Na,
JiYeon Choi,
Jy-yong Sohn,
Byung-Hoon Kim,
Sang Hui Chu
Abstract:
Recent advancements in Large Language Models (LLMs) have accelerated their usage in various domains. Given the fact that psychiatric interviews are goal-oriented and structured dialogues between the professional interviewer and the interviewee, it is one of the most underexplored areas where LLMs can contribute substantial value. Here, we explore the use of LLMs for enhancing psychiatric interview…
▽ More
Recent advancements in Large Language Models (LLMs) have accelerated their usage in various domains. Given the fact that psychiatric interviews are goal-oriented and structured dialogues between the professional interviewer and the interviewee, it is one of the most underexplored areas where LLMs can contribute substantial value. Here, we explore the use of LLMs for enhancing psychiatric interviews, by analyzing counseling data from North Korean defectors with traumatic events and mental health issues. Specifically, we investigate whether LLMs can (1) delineate the part of the conversation that suggests psychiatric symptoms and name the symptoms, and (2) summarize stressors and symptoms, based on the interview dialogue transcript. Here, the transcript data was labeled by mental health experts for training and evaluation of LLMs. Our experimental results show that appropriately prompted LLMs can achieve high performance on both the symptom delineation task and the summarization task. This research contributes to the nascent field of applying LLMs to psychiatric interview and demonstrates their potential effectiveness in aiding mental health practitioners.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
Authors:
Eunsu Kim,
Juyoung Suk,
Philhoon Oh,
Haneul Yoo,
James Thorne,
Alice Oh
Abstract:
Despite the rapid development of large language models (LLMs) for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from the English counterparts through translation, they often overlook the different cultural contexts. For the few benchmark datasets…
▽ More
Despite the rapid development of large language models (LLMs) for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from the English counterparts through translation, they often overlook the different cultural contexts. For the few benchmark datasets that are sourced from Korean data capturing cultural knowledge, only narrow tasks such as bias and hate speech detection are offered. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide fine-grained annotation of which cultural and linguistic knowledge is required to answer the question correctly. Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performances across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale comprehensive Korean-centric analysis of LLMs' proficiency in Korean culture and language.
△ Less
Submitted 4 July, 2024; v1 submitted 10 March, 2024;
originally announced March 2024.
-
Quantum Many-Body Physics Calculations with Large Language Models
Authors:
Haining Pan,
Nayantara Mudur,
Will Taranto,
Maria Tikhanovskaya,
Subhashini Venugopalan,
Yasaman Bahri,
Michael P. Brenner,
Eun-Ah Kim
Abstract:
Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics: the Hartree-Fock metho…
▽ More
Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method, requiring an analytic multi-step calculation deriving approximate Hamiltonian and corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it can correctly derive the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in 2 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, the requisite skill for doing these calculations is at the graduate level in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases. The strong performance is the first step for developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.
△ Less
Submitted 22 August, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Interpretable Models for Detecting and Monitoring Elevated Intracranial Pressure
Authors:
Darryl Hannan,
Steven C. Nesbit,
Ximing Wen,
Glen Smith,
Qiao Zhang,
Alberto Goffi,
Vincent Chan,
Michael J. Morris,
John C. Hunninghake,
Nicholas E. Villalobos,
Edward Kim,
Rosina O. Weber,
Christopher J. MacLellan
Abstract:
Detecting elevated intracranial pressure (ICP) is crucial in diagnosing and managing various neurological conditions. These fluctuations in pressure are transmitted to the optic nerve sheath (ONS), resulting in changes to its diameter, which can then be detected using ultrasound imaging devices. However, interpreting sonographic images of the ONS can be challenging. In this work, we propose two sy…
▽ More
Detecting elevated intracranial pressure (ICP) is crucial in diagnosing and managing various neurological conditions. These fluctuations in pressure are transmitted to the optic nerve sheath (ONS), resulting in changes to its diameter, which can then be detected using ultrasound imaging devices. However, interpreting sonographic images of the ONS can be challenging. In this work, we propose two systems that actively monitor the ONS diameter throughout an ultrasound video and make a final prediction as to whether ICP is elevated. To construct our systems, we leverage subject matter expert (SME) guidance, structuring our processing pipeline according to their collection procedure, while also prioritizing interpretability and computational efficiency. We conduct a number of experiments, demonstrating that our proposed systems are able to outperform various baselines. One of our SMEs then manually validates our top system's performance, lending further credibility to our approach while demonstrating its potential utility in a clinical setting.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
Multi-FAct: Assessing Multilingual LLMs' Multi-Regional Knowledge using FActScore
Authors:
Sheikh Shafayat,
Eunsu Kim,
Juhyun Oh,
Alice Oh
Abstract:
Large Language Models (LLMs) are prone to factuality hallucination, generating text that contradicts established knowledge. While extensive research has addressed this in English, little is known about multilingual LLMs. This paper systematically evaluates multilingual LLMs' factual accuracy across languages and geographic regions. We introduce a novel pipeline for multilingual factuality evaluati…
▽ More
Large Language Models (LLMs) are prone to factuality hallucination, generating text that contradicts established knowledge. While extensive research has addressed this in English, little is known about multilingual LLMs. This paper systematically evaluates multilingual LLMs' factual accuracy across languages and geographic regions. We introduce a novel pipeline for multilingual factuality evaluation, adapting FActScore(Min et al., 2023) for diverse languages. Our analysis across nine languages reveals that English consistently outperforms others in factual accuracy and quantity of generated facts. Furthermore, multilingual models demonstrate a bias towards factual information from Western continents. These findings highlight the need for improved multilingual factuality assessment and underscore geographical biases in LLMs' fact generation.
△ Less
Submitted 1 March, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports
Authors:
Felix J. Dorfner,
Liv JĂŒrgensen,
Leonhard Donle,
Fares Al Mohamad,
Tobias R. Bodenmann,
Mason C. Cleveland,
Felix Busch,
Lisa C. Adams,
James Sato,
Thomas Schultz,
Albert E. Kim,
Jameson Merkow,
Keno K. Bressem,
Christopher P. Bridge
Abstract:
Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models.
Materials and Methods: Two different…
▽ More
Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models.
Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B and CheXbert and CheXpert-labeler in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques.
Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively.
Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy preserving alternative to GPT-4 for the task of radiology report classification.
△ Less
Submitted 19 February, 2024;
originally announced February 2024.
-
Event-to-Video Conversion for Overhead Object Detection
Authors:
Darryl Hannan,
Ragib Arnab,
Gavin Parpart,
Garrett T. Kenyon,
Edward Kim,
Yijing Watkins
Abstract:
Collecting overhead imagery using an event camera is desirable due to the energy efficiency of the image sensor compared to standard cameras. However, event cameras complicate downstream image processing, especially for complex tasks such as object detection. In this paper, we investigate the viability of event streams for overhead object detection. We demonstrate that across a number of standard…
▽ More
Collecting overhead imagery using an event camera is desirable due to the energy efficiency of the image sensor compared to standard cameras. However, event cameras complicate downstream image processing, especially for complex tasks such as object detection. In this paper, we investigate the viability of event streams for overhead object detection. We demonstrate that across a number of standard modeling approaches, there is a significant gap in performance between dense event representations and corresponding RGB frames. We establish that this gap is, in part, due to a lack of overlap between the event representations and the pre-training data used to initialize the weights of the object detectors. Then, we apply event-to-video conversion models that convert event streams into gray-scale video to close this gap. We demonstrate that this approach results in a large performance increase, outperforming even event-specific object detection techniques on our overhead target task. These results suggest that better alignment between event representations and existing large pre-trained models may result in greater short-term performance gains compared to end-to-end event-specific architectural improvements.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate
Authors:
Juhyun Oh,
Eunsu Kim,
Inha Cha,
Alice Oh
Abstract:
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared…
▽ More
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of "the Generative AI Paradox" (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
Nevermind: Instruction Override and Moderation in Large Language Models
Authors:
Edward Kim
Abstract:
Given the impressive capabilities of recent Large Language Models (LLMs), we investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations, e.g. overrides. These include the ability of the model to override the knowledge within the weights of the model, the ability to override (or moderate) extr…
▽ More
Given the impressive capabilities of recent Large Language Models (LLMs), we investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations, e.g. overrides. These include the ability of the model to override the knowledge within the weights of the model, the ability to override (or moderate) extracted knowledge in the prompt, and lastly the ability to perform a full jailbreak. Experimentation performed suggest several key findings to improve instruction following - larger models perform the best in following instructions that override internal and contextual instructions, and are obedient, even to a fault. When scaling to longer contexts via rope scaling, a significant buffer needs to be maintained from the edge of the perplexity cliff in order to maintain instruction following capabilities. Finally, we observe improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines. Thus, we postulate the most effective approach for safe, trustworthy AI should be dealt external to the LLM itself.
△ Less
Submitted 5 February, 2024;
originally announced February 2024.
-
Super-resolution multi-contrast unbiased eye atlases with deep probabilistic refinement
Authors:
Ho Hin Lee,
Adam M. Saunders,
Michael E. Kim,
Samuel W. Remedios,
Lucas W. Remedios,
Yucheng Tang,
Qi Yang,
Xin Yu,
Shunxing Bao,
Chloe Cho,
Louise A. Mawn,
Tonia S. Rex,
Kevin L. Schey,
Blake E. Dewey,
Jeffrey M. Spraggins,
Jerry L. Prince,
Yuankai Huo,
Bennett A. Landman
Abstract:
Purpose: Eye morphology varies significantly across the population, especially for the orbit and optic nerve. These variations limit the feasibility and robustness of generalizing population-wise features of eye organs to an unbiased spatial reference.
Approach: To tackle these limitations, we propose a process for creating high-resolution unbiased eye atlases. First, to restore spatial details…
▽ More
Purpose: Eye morphology varies significantly across the population, especially for the orbit and optic nerve. These variations limit the feasibility and robustness of generalizing population-wise features of eye organs to an unbiased spatial reference.
Approach: To tackle these limitations, we propose a process for creating high-resolution unbiased eye atlases. First, to restore spatial details from scans with a low through-plane resolution compared to a high in-plane resolution, we apply a deep learning-based super-resolution algorithm. Then, we generate an initial unbiased reference with an iterative metric-based registration using a small portion of subject scans. We register the remaining scans to this template and refine the template using an unsupervised deep probabilistic approach that generates a more expansive deformation field to enhance the organ boundary alignment. We demonstrate this framework using magnetic resonance images across four different tissue contrasts, generating four atlases in separate spatial alignments.
Results: For each tissue contrast, we find a significant improvement using the Wilcoxon signed-rank test in the average Dice score across four labeled regions compared to a standard registration framework consisting of rigid, affine, and deformable transformations. These results highlight the effective alignment of eye organs and boundaries using our proposed process.
Conclusions: By combining super-resolution preprocessing and deep probabilistic models, we address the challenge of generating an eye atlas to serve as a standardized reference across a largely variable population.
△ Less
Submitted 14 June, 2024; v1 submitted 5 January, 2024;
originally announced January 2024.
-
Synergistic Formulaic Alpha Generation for Quantitative Trading based on Reinforcement Learning
Authors:
Hong-Gi Shin,
Sukhyun Jeong,
Eui-Yeon Kim,
Sungho Hong,
Young-Jin Cho,
Yong-Hoon Choi
Abstract:
Mining of formulaic alpha factors refers to the process of discovering and developing specific factors or indicators (referred to as alpha factors) for quantitative trading in stock market. To efficiently discover alpha factors in vast search space, reinforcement learning (RL) is commonly employed. This paper proposes a method to enhance existing alpha factor mining approaches by expanding a searc…
▽ More
Mining of formulaic alpha factors refers to the process of discovering and developing specific factors or indicators (referred to as alpha factors) for quantitative trading in stock market. To efficiently discover alpha factors in vast search space, reinforcement learning (RL) is commonly employed. This paper proposes a method to enhance existing alpha factor mining approaches by expanding a search space and utilizing pretrained formulaic alpha set as initial seed values to generate synergistic formulaic alpha. We employ information coefficient (IC) and rank information coefficient (Rank IC) as performance evaluation metrics for the model. Using CSI300 market data, we conducted real investment simulations and observed significant performance improvement compared to existing techniques.
△ Less
Submitted 7 July, 2024; v1 submitted 5 January, 2024;
originally announced January 2024.
-
VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting
Authors:
Seunggu Kang,
WonJun Moon,
Euiyeon Kim,
Jae-Pil Heo
Abstract:
Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, there remains a challenge of vulnerability to error propagation of the sequentially designed two-stage process. In this work, an one-stage baseline,…
▽ More
Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, there remains a challenge of vulnerability to error propagation of the sequentially designed two-stage process. In this work, an one-stage baseline, Visual-Language Baseline (VLBase), exploring the implicit association of the semantic-patch embeddings of CLIP is proposed. Subsequently, the extension of VLBase to Visual-language Counter (VLCounter) is achieved by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map to be appropriate for the counting task. Lastly, the layer-wisely encoded features are transferred to the decoder through Segment-aware Skip Connection (SaSC) to keep the generalization capability for unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, the benefits of the end-to-end framework, VLCounter, are demonstrated.
△ Less
Submitted 30 December, 2023; v1 submitted 27 December, 2023;
originally announced December 2023.
-
GRIL-Calib: Targetless Ground Robot IMU-LiDAR Extrinsic Calibration Method using Ground Plane Motion Constraints
Authors:
TaeYoung Kim,
Gyuhyeon Pak,
Euntai Kim
Abstract:
Targetless IMU-LiDAR extrinsic calibration methods are gaining significant attention as the importance of the IMU-LiDAR fusion system increases. Notably, existing calibration methods derive calibration parameters under the assumption that the methods require full motion in all axes. When IMU and LiDAR are mounted on a ground robot the motion of which is restricted to planar motion, existing calibr…
▽ More
Targetless IMU-LiDAR extrinsic calibration methods are gaining significant attention as the importance of the IMU-LiDAR fusion system increases. Notably, existing calibration methods derive calibration parameters under the assumption that the methods require full motion in all axes. When IMU and LiDAR are mounted on a ground robot the motion of which is restricted to planar motion, existing calibration methods are likely to exhibit degraded performance. To address this issue, we present GRIL-Calib: a novel targetless Ground Robot IMU-LiDAR Calibration method. Our proposed method leverages ground information to compensate for the lack of unrestricted full motion. First, we propose LiDAR Odometry (LO) using ground plane residuals to enhance calibration accuracy. Second, we propose the Ground Plane Motion (GPM) constraint and incorporate it into the optimization for calibration, enabling the determination of full 6-DoF extrinsic parameters, including theoretically unobservable direction. Finally, unlike baseline methods, we formulate the calibration not as sequential two optimizations but as a single optimization (SO) problem, solving all calibration parameters simultaneously and improving accuracy. We validate our GRIL-Calib by applying it to various real-world datasets and comparing its performance with that of existing state-of-the-art methods in terms of accuracy and robustness. Our code is available at https://github.com/Taeyoung96/GRIL-Calib.
△ Less
Submitted 24 May, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Materials Expert-Artificial Intelligence for Materials Discovery
Authors:
Yanjun Liu,
Milena Jovanovic,
Krishnanand Mallayya,
Wesley J. Maddox,
Andrew Gordon Wilson,
Sebastian Klemenz,
Leslie M. Schoop,
Eun-Ah Kim
Abstract:
The advent of material databases provides an unprecedented opportunity to uncover predictive descriptors for emergent material properties from vast data space. However, common reliance on high-throughput ab initio data necessarily inherits limitations of such data: mismatch with experiments. On the other hand, experimental decisions are often guided by an expert's intuition honed from experiences…
▽ More
The advent of material databases provides an unprecedented opportunity to uncover predictive descriptors for emergent material properties from vast data space. However, common reliance on high-throughput ab initio data necessarily inherits limitations of such data: mismatch with experiments. On the other hand, experimental decisions are often guided by an expert's intuition honed from experiences that are rarely articulated. We propose using machine learning to "bottle" such operational intuition into quantifiable descriptors using expertly curated measurement-based data. We introduce "Materials Expert-Artificial Intelligence" (ME-AI) to encapsulate and articulate this human intuition. As a first step towards such a program, we focus on the topological semimetal (TSM) among square-net materials as the property inspired by the expert-identified descriptor based on structural information: the tolerance factor. We start by curating a dataset encompassing 12 primary features of 879 square-net materials, using experimental data whenever possible. We then use Dirichlet-based Gaussian process regression using a specialized kernel to reveal composite descriptors for square-net topological semimetals. The ME-AI learned descriptors independently reproduce expert intuition and expand upon it. Specifically, new descriptors point to hypervalency as a critical chemical feature predicting TSM within square-net compounds. Our success with a carefully defined problem points to the "machine bottling human insight" approach as promising for machine learning-aided material discovery.
△ Less
Submitted 5 December, 2023;
originally announced December 2023.
-
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation
Authors:
Meriem Boubdir,
Edward Kim,
Beyza Ermis,
Sara Hooker,
Marzieh Fadaee
Abstract:
In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fund…
▽ More
In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should adhere to: reliability and transitivity. We conduct extensive evaluation of Elo behaviour, illustrating that individual Elo computations exhibit volatility and delving into the impact of varying the Elo rating system's hyperparameters. We show that these axioms are not always satisfied raising questions about the reliability of current comparative evaluations of LLMs. If the current use of Elo scores is intended to substitute the costly head-to-head comparison of LLMs, it is crucial to ensure the ranking is as robust as possible. Guided by the axioms, our findings offer concrete guidelines for enhancing the reliability of LLM evaluation methods, suggesting a need for reassessment of existing comparative approaches.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
Pseudo-label Correction for Instance-dependent Noise Using Teacher-student Framework
Authors:
Eugene Kim
Abstract:
The high capacity of deep learning models to learn complex patterns poses a significant challenge when confronted with label noise. The inability to differentiate clean and noisy labels ultimately results in poor generalization. We approach this problem by reassigning the label for each image using a new teacher-student based framework termed P-LC (pseudo-label correction). Traditional teacher-stu…
▽ More
The high capacity of deep learning models to learn complex patterns poses a significant challenge when confronted with label noise. The inability to differentiate clean and noisy labels ultimately results in poor generalization. We approach this problem by reassigning the label for each image using a new teacher-student based framework termed P-LC (pseudo-label correction). Traditional teacher-student networks are composed of teacher and student classifiers for knowledge distillation. In our novel approach, we reconfigure the teacher network into a triple encoder, leveraging the triplet loss to establish a pseudo-label correction system. As the student generates pseudo labels for a set of given images, the teacher learns to choose between the initially assigned labels and the pseudo labels. Experiments on MNIST, Fashion-MNIST, and SVHN demonstrate P-LC's superior performance over existing state-of-the-art methods across all noise levels, most notably in high noise. In addition, we introduce a noise level estimation to help assess model performance and inform the need for additional data cleaning procedures.
△ Less
Submitted 23 November, 2023;
originally announced November 2023.
-
Predicting Age from White Matter Diffusivity with Residual Learning
Authors:
Chenyu Gao,
Michael E. Kim,
Ho Hin Lee,
Qi Yang,
Nazirah Mohd Khairi,
Praitayini Kanakaraj,
Nancy R. Newlin,
Derek B. Archer,
Angela L. Jefferson,
Warren D. Taylor,
Brian D. Boyd,
Lori L. Beason-Held,
Susan M. Resnick,
The BIOCARD Study Team,
Yuankai Huo,
Katherine D. Van Schaik,
Kurt G. Schilling,
Daniel Moyer,
Ivana IĆĄgum,
Bennett A. Landman
Abstract:
Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis,…
▽ More
Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis, diffusion tensor imaging (DTI) has proven effective in identifying age-related microstructural changes within the brain white matter, thereby presenting itself as a promising additional modality for brain age prediction. Although early studies have sought to harness DTI's advantages for age estimation, there is no evidence that the success of this prediction is owed to the unique microstructural and diffusivity features that DTI provides, rather than the macrostructural features that are also available in DTI data. Therefore, we seek to develop white-matter-specific age estimation to capture deviations from normal white matter aging. Specifically, we deliberately disregard the macrostructural information when predicting age from DTI scalar images, using two distinct methods. The first method relies on extracting only microstructural features from regions of interest. The second applies 3D residual neural networks (ResNets) to learn features directly from the images, which are non-linearly registered and warped to a template to minimize macrostructural variations. When tested on unseen data, the first method yields mean absolute error (MAE) of 6.11 years for cognitively normal participants and MAE of 6.62 years for cognitively impaired participants, while the second method achieves MAE of 4.69 years for cognitively normal participants and MAE of 4.96 years for cognitively impaired participants. We find that the ResNet model captures subtler, non-macrostructural features for brain age prediction.
△ Less
Submitted 21 January, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Authors:
Meriem Boubdir,
Edward Kim,
Beyza Ermis,
Marzieh Fadaee,
Sara Hooker
Abstract:
Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizi…
▽ More
Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizing data instances which most effectively distinguish between models?" We evaluate several metric-based methods and find that these metrics enhance the efficiency of human evaluations by minimizing the number of required annotations, thus saving time and cost, while ensuring a robust performance evaluation. We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54% compared to a random sample when focusing on the top-20 percentile of prioritized instances. This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.
△ Less
Submitted 22 October, 2023;
originally announced October 2023.
-
Reliable Test-Time Adaptation via Agreement-on-the-Line
Authors:
Eungyeup Kim,
Mingjie Sun,
Aditi Raghunathan,
Zico Kolter
Abstract:
Test-time adaptation (TTA) methods aim to improve robustness to distribution shifts by adapting models using unlabeled data from the shifted test distribution. However, there remain unresolved challenges that undermine the reliability of TTA, which include difficulties in evaluating TTA performance, miscalibration after TTA, and unreliable hyperparameter tuning for adaptation. In this work, we mak…
▽ More
Test-time adaptation (TTA) methods aim to improve robustness to distribution shifts by adapting models using unlabeled data from the shifted test distribution. However, there remain unresolved challenges that undermine the reliability of TTA, which include difficulties in evaluating TTA performance, miscalibration after TTA, and unreliable hyperparameter tuning for adaptation. In this work, we make a notable and surprising observation that TTAed models strongly show the agreement-on-the-line phenomenon (Baek et al., 2022) across a wide range of distribution shifts. We find such linear trends occur consistently in a wide range of models adapted with various hyperparameters, and persist in distributions where the phenomenon fails to hold in vanilla models (i.e., before adaptation). We leverage these observations to make TTA methods more reliable in three perspectives: (i) estimating OOD accuracy (without labeled data) to determine when TTA helps and when it hurts, (ii) calibrating TTAed models without label information, and (iii) reliably determining hyperparameters for TTA without any labeled validation data. Through extensive experiments, we demonstrate that various TTA methods can be precisely evaluated, both in terms of their improvements and degradations. Moreover, our proposed methods on unsupervised calibration and hyperparameters tuning for TTA achieve results close to the ones assuming access to ground-truth labels, in terms of both OOD accuracy and calibration error.
△ Less
Submitted 7 October, 2023;
originally announced October 2023.
-
Clustering-based Image-Text Graph Matching for Domain Generalization
Authors:
Nokyung Park,
Daewon Chae,
Jeongyong Shim,
Sangpil Kim,
Eun-Sol Kim,
Jinkyu Kim
Abstract:
Learning domain-invariant visual representations is important to train a model that can generalize well to unseen target task domains. Recent works demonstrate that text descriptions contain high-level class-discriminative information and such auxiliary semantic cues can be used as effective pivot embedding for domain generalization problem. However, they use pivot embedding in global manner (i.e.…
▽ More
Learning domain-invariant visual representations is important to train a model that can generalize well to unseen target task domains. Recent works demonstrate that text descriptions contain high-level class-discriminative information and such auxiliary semantic cues can be used as effective pivot embedding for domain generalization problem. However, they use pivot embedding in global manner (i.e., aligning an image embedding with sentence-level text embedding), not fully utilizing the semantic cues of given text description. In this work, we advocate for the use of local alignment between image regions and corresponding textual descriptions. To this end, we first represent image and text inputs with graphs. We subsequently cluster nodes in those graphs and match the graph-based image node features into textual graphs. This matching process is conducted globally and locally, tightly aligning visual and textual semantic sub-structures. We experiment with large-scale public datasets, such as CUB-DG and DomainBed, and our model achieves matched or better state-of-the-art performance on these datasets. Our code will be publicly available upon publication.
△ Less
Submitted 15 April, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.