Search | arXiv e-print repository

LawLuo: A Chinese Law Firm Co-run by LLM Agents

Authors: Jingyun Sun, Chengxiao Dai, Zhongze Luo, Yangbo Chang, Yang Li

Abstract: Large Language Models (LLMs) demonstrate substantial potential in delivering legal consultation services to users without a legal background, attributed to their superior text comprehension and generation capabilities. Nonetheless, existing Chinese legal LLMs limit interaction to a single model-user dialogue, unlike the collaborative consultations typical of law firms, where multiple staff members… ▽ More Large Language Models (LLMs) demonstrate substantial potential in delivering legal consultation services to users without a legal background, attributed to their superior text comprehension and generation capabilities. Nonetheless, existing Chinese legal LLMs limit interaction to a single model-user dialogue, unlike the collaborative consultations typical of law firms, where multiple staff members contribute to a single consultation. This limitation prevents an authentic consultation experience. Additionally, extant Chinese legal LLMs suffer from critical limitations: (1) insufficient control over the quality of instruction fine-tuning data; (2) increased model hallucination resulting from users' ambiguous queries; and (3) a reduction in the model's ability to follow instructions over multiple dialogue turns. In response to these challenges, we propose a novel legal dialogue framework that leverages the collaborative capabilities of multiple LLM agents, termed LawLuo. This framework encompasses four agents: a receptionist, a lawyer, a secretary, and a boss, each responsible for different functionalities, collaboratively providing a comprehensive legal consultation to users. Additionally, we constructed two high-quality legal dialogue datasets, KINLED and MURLED, and fine-tuned ChatGLM-3-6b using these datasets. We propose a legal query clarification algorithm called ToLC. Experimental results demonstrate that LawLuo outperforms baseline LLMs, including GPT-4, across three dimensions: lawyer-like language style, the usefulness of legal advice, and the accuracy of legal knowledge. Our code and datasets are available at https://github.com/NEFUJing/LawLuo. △ Less

Submitted 23 July, 2024; originally announced July 2024.

Comments: 11 pages, 13 figures, 2 tables

ACM Class: I.2.1

arXiv:2407.16095 [pdf, other]

Robotically adjustable kinematics in a wrist-driven orthosis eases grasping across tasks

Authors: Erin Y. Chang, Andrew I. W. McPherson, Hannah S. Stuart

Abstract: Without finger function, people with C5-7 spinal cord injury (SCI) regularly utilize wrist extension to passively close the fingers and thumb together for grasping. Wearable assistive grasping devices often focus on this familiar wrist-driven technique to provide additional support and amplify grasp force. Despite recent research advances in modernizing these tools, people with SCI often abandon s… ▽ More Without finger function, people with C5-7 spinal cord injury (SCI) regularly utilize wrist extension to passively close the fingers and thumb together for grasping. Wearable assistive grasping devices often focus on this familiar wrist-driven technique to provide additional support and amplify grasp force. Despite recent research advances in modernizing these tools, people with SCI often abandon such wearable assistive devices in the long term. We suspect that the wrist constraints imposed by such devices generate undesirable reach and grasp kinematics. Here we show that using continuous robotic motor assistance to give users more adaptability in their wrist posture prior to wrist-driven grasping reduces task difficulty and perceived exertion. Our results demonstrate that more free wrist mobility allows users to select comfortable and natural postures depending on task needs, which improves the versatility of the assistive grasping device for easier use across different hand poses in the arm's workspace. This behavior holds the potential to improve ease of use and desirability of future device designs through new modes of combining both body-power and robotic automation. △ Less

Submitted 22 July, 2024; originally announced July 2024.

Comments: 6 pages, 8 figures. Presented at the 2024 International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)

arXiv:2407.14136 [pdf, other]

6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry

Authors: Sungho Chun, Ju Yong Chang

Abstract: This study addresses the nuanced challenge of estimating head translations within the context of six-degrees-of-freedom (6DoF) head pose estimation, placing emphasis on this aspect over the more commonly studied head rotations. Identifying a gap in existing methodologies, we recognized the underutilized potential synergy between facial geometry and head translation. To bridge this gap, we propose… ▽ More This study addresses the nuanced challenge of estimating head translations within the context of six-degrees-of-freedom (6DoF) head pose estimation, placing emphasis on this aspect over the more commonly studied head rotations. Identifying a gap in existing methodologies, we recognized the underutilized potential synergy between facial geometry and head translation. To bridge this gap, we propose a novel approach called the head Translation, Rotation, and face Geometry network (TRG), which stands out for its explicit bidirectional interaction structure. This structure has been carefully designed to leverage the complementary relationship between face geometry and head translation, marking a significant advancement in the field of head pose estimation. Our contributions also include the development of a strategy for estimating bounding box correction parameters and a technique for aligning landmarks to image. Both of these innovations demonstrate superior performance in 6DoF head pose estimation tasks. Extensive experiments conducted on ARKitFace and BIWI datasets confirm that the proposed method outperforms current state-of-the-art techniques. Codes are released at https://github.com/asw91666/TRG-Release. △ Less

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.12068 [pdf, other]

Learning on Graphs with Large Language Models(LLMs): A Deep Dive into Model Robustness

Authors: Kai Guo, Zewen Liu, Zhikai Chen, Hongzhi Wen, Wei Jin, Jiliang Tang, Yi Chang

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks. Recently, several LLMs-based pipelines have been developed to enhance learning on graphs with text attributes, showcasing promising performance. However, graphs are well-known to be susceptible to adversarial attacks and it remains unclear whether LLMs exhibit robustness in learn… ▽ More Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks. Recently, several LLMs-based pipelines have been developed to enhance learning on graphs with text attributes, showcasing promising performance. However, graphs are well-known to be susceptible to adversarial attacks and it remains unclear whether LLMs exhibit robustness in learning on graphs. To address this gap, our work aims to explore the potential of LLMs in the context of adversarial attacks on graphs. Specifically, we investigate the robustness against graph structural and textual perturbations in terms of two dimensions: LLMs-as-Enhancers and LLMs-as-Predictors. Through extensive experiments, we find that, compared to shallow models, both LLMs-as-Enhancers and LLMs-as-Predictors offer superior robustness against structural and textual attacks.Based on these findings, we carried out additional analyses to investigate the underlying causes. Furthermore, we have made our benchmark library openly available to facilitate quick and fair evaluations, and to encourage ongoing innovative research in this field. △ Less

Submitted 28 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.10051 [pdf, ps, other]

The weight distributions of some linear codes derived from Kloosterman sums

Authors: Mengzhen Zhao, Yanxun Chang

Abstract: Linear codes with few weights have applications in data storage systems, secret sharing schemes, and authentication codes. In this paper, some kinds of p-ary linear codes with few weights are constructed by use of the given de ning set, where p is a prime. Their weight distributions are determined based on Kloosterman sums over nite elds. In addition, some linear codes we given is minimal. Linear codes with few weights have applications in data storage systems, secret sharing schemes, and authentication codes. In this paper, some kinds of p-ary linear codes with few weights are constructed by use of the given de ning set, where p is a prime. Their weight distributions are determined based on Kloosterman sums over nite elds. In addition, some linear codes we given is minimal. △ Less

Submitted 13 July, 2024; originally announced July 2024.

Comments: 11 pages

MSC Class: 94A24; 94B05

arXiv:2407.08377 [pdf, other]

Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework

Authors: Shengqi Xu, Run Sun, Yi Chang, Shuning Cao, Xueyao Xiao, Luxin Yan

Abstract: Long-range imaging inevitably suffers from atmospheric turbulence with severe geometric distortions due to random refraction of light. The further the distance, the more severe the disturbance. Despite existing research has achieved great progress in tackling short-range turbulence, there is less attention paid to long-range turbulence with significant distortions. To address this dilemma and adva… ▽ More Long-range imaging inevitably suffers from atmospheric turbulence with severe geometric distortions due to random refraction of light. The further the distance, the more severe the disturbance. Despite existing research has achieved great progress in tackling short-range turbulence, there is less attention paid to long-range turbulence with significant distortions. To address this dilemma and advance the field, we construct a large-scale real long-range atmospheric turbulence dataset (RLR-AT), including 1500 turbulence sequences spanning distances from 1 Km to 13 Km. The advantages of RLR-AT compared to existing ones: turbulence with longer-distances and higher-diversity, scenes with greater-variety and larger-scale. Moreover, most existing work adopts either registration-based or decomposition-based methods to address distortions through one-step mitigation. However, they fail to effectively handle long-range turbulence due to its significant pixel displacements. In this work, we propose a coarse-to-fine framework to handle severe distortions, which cooperates dynamic turbulence and static background priors (CDSP). On the one hand, we discover the pixel motion statistical prior of turbulence, and propose a frequency-aware reference frame for better large-scale distortion registration, greatly reducing the burden of refinement. On the other hand, we take advantage of the static prior of background, and propose a subspace-based low-rank tensor refinement model to eliminate the misalignments inevitably left by registration while well preserving details. The dynamic and static priors complement to each other, facilitating us to progressively mitigate long-range turbulence with severe distortions. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on different datasets. △ Less

Submitted 17 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

Comments: This paper is accepted by ECCV 2024

arXiv:2407.05364 [pdf, other]

PTaRL: Prototype-based Tabular Representation Learning via Space Calibration

Authors: Hangting Ye, Wei Fan, Xiaozhuang Song, Shun Zheng, He Zhao, Dandan Guo, Yi Chang

Abstract: Tabular data have been playing a mostly important role in diverse real-world fields, such as healthcare, engineering, finance, etc. With the recent success of deep learning, many tabular machine learning (ML) methods based on deep networks (e.g., Transformer, ResNet) have achieved competitive performance on tabular benchmarks. However, existing deep tabular ML methods suffer from the representatio… ▽ More Tabular data have been playing a mostly important role in diverse real-world fields, such as healthcare, engineering, finance, etc. With the recent success of deep learning, many tabular machine learning (ML) methods based on deep networks (e.g., Transformer, ResNet) have achieved competitive performance on tabular benchmarks. However, existing deep tabular ML methods suffer from the representation entanglement and localization, which largely hinders their prediction performance and leads to performance inconsistency on tabular tasks. To overcome these problems, we explore a novel direction of applying prototype learning for tabular ML and propose a prototype-based tabular representation learning framework, PTaRL, for tabular prediction tasks. The core idea of PTaRL is to construct prototype-based projection space (P-Space) and learn the disentangled representation around global data prototypes. Specifically, PTaRL mainly involves two stages: (i) Prototype Generation, that constructs global prototypes as the basis vectors of P-Space for representation, and (ii) Prototype Projection, that projects the data samples into P-Space and keeps the core global data information via Optimal Transport. Then, to further acquire the disentangled representations, we constrain PTaRL with two strategies: (i) to diversify the coordinates towards global prototypes of different representations within P-Space, we bring up a diversification constraint for representation calibration; (ii) to avoid prototype entanglement in P-Space, we introduce a matrix orthogonalization constraint to ensure the independence of global prototypes. Finally, we conduct extensive experiments in PTaRL coupled with state-of-the-art deep tabular ML models on various tabular benchmarks and the results have shown our consistent superiority. △ Less

Submitted 15 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: Accepted by ICLR 2024

arXiv:2407.03925 [pdf, other]

Reduced-Order Neural Operators: Learning Lagrangian Dynamics on Highly Sparse Graphs

Authors: Hrishikesh Viswanath, Yue Chang, Julius Berner, Peter Yichen Chen, Aniket Bera

Abstract: We present a neural operator architecture to simulate Lagrangian dynamics, such as fluid flow, granular flows, and elastoplasticity. Traditional numerical methods, such as the finite element method (FEM), suffer from long run times and large memory consumption. On the other hand, approaches based on graph neural networks are faster but still suffer from long computation times on dense graphs, whic… ▽ More We present a neural operator architecture to simulate Lagrangian dynamics, such as fluid flow, granular flows, and elastoplasticity. Traditional numerical methods, such as the finite element method (FEM), suffer from long run times and large memory consumption. On the other hand, approaches based on graph neural networks are faster but still suffer from long computation times on dense graphs, which are often required for high-fidelity simulations. Our model, GIOROM or Graph Interaction Operator for Reduced-Order Modeling, learns temporal dynamics within a reduced-order setting, capturing spatial features from a highly sparse graph representation of the input and generalizing to arbitrary spatial locations during inference. The model is geometry-aware and discretization-agnostic and can generalize to different initial conditions, velocities, and geometries after training. We show that point clouds of the order of 100,000 points can be inferred from sparse graphs with $\sim$1000 points, with negligible change in computation time. We empirically evaluate our model on elastic solids, Newtonian fluids, Non-Newtonian fluids, Drucker-Prager granular flows, and von Mises elastoplasticity. On these benchmarks, our approach results in a 25$\times$ speedup compared to other neural network-based physics simulators while delivering high-fidelity predictions of complex physical systems and showing better performance on most benchmarks. The code and the demos are provided at https://github.com/HrishikeshVish/GIOROM. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.00383 [pdf, other]

FANFOLD: Graph Normalizing Flows-driven Asymmetric Network for Unsupervised Graph-Level Anomaly Detection

Authors: Rui Cao, Shijie Xue, Jindong Li, Qi Wang, Yi Chang

Abstract: Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application. In recent studies, knowledge distillation-based methods have been widely used in unsupervised anomaly detection to improve model efficiency and generalization. However, the inherent symmetry between the source (teacher) and target (student) networks typically results in consistent… ▽ More Unsupervised graph-level anomaly detection (UGAD) has attracted increasing interest due to its widespread application. In recent studies, knowledge distillation-based methods have been widely used in unsupervised anomaly detection to improve model efficiency and generalization. However, the inherent symmetry between the source (teacher) and target (student) networks typically results in consistent outputs across both architectures, making it difficult to distinguish abnormal graphs from normal graphs. Also, existing methods mainly rely on graph features to distinguish anomalies, which may be unstable with complex and diverse data and fail to capture the essence that differentiates normal graphs from abnormal ones. In this work, we propose a Graph Normalizing Flows-driven Asymmetric Network For Unsupervised Graph-Level Anomaly Detection (FANFOLD in short). We introduce normalizing flows to unsupervised graph-level anomaly detection due to their successful application and superior quality in learning the underlying distribution of samples. Specifically, we adopt the knowledge distillation technique and apply normalizing flows on the source network, achieving the asymmetric network. In the training stage, FANFOLD transforms the original distribution of normal graphs to a standard normal distribution. During inference, FANFOLD computes the anomaly score using the source-target loss to discriminate between normal and anomalous graphs. We conduct extensive experiments on 15 datasets of different fields with 9 baseline methods to validate the superiority of FANFOLD. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2407.00118 [pdf, other]

From Efficient Multimodal Models to World Models: A Survey

Authors: Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang

Abstract: Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of… ▽ More Multimodal Large Models (MLMs) are becoming a significant research focus, combining powerful large language models with multimodal learning to perform complex tasks across different data modalities. This review explores the latest developments and challenges in MLMs, emphasizing their potential in achieving artificial general intelligence and as a pathway to world models. We provide an overview of key techniques such as Multimodal Chain of Thought (M-COT), Multimodal Instruction Tuning (M-IT), and Multimodal In-Context Learning (M-ICL). Additionally, we discuss both the fundamental and specific technologies of multimodal models, highlighting their applications, input/output modalities, and design characteristics. Despite significant advancements, the development of a unified multimodal model remains elusive. We discuss the integration of 3D generation and embodied intelligence to enhance world simulation capabilities and propose incorporating external rule systems for improved reasoning and decision-making. Finally, we outline future research directions to address these challenges and advance the field. △ Less

Submitted 27 June, 2024; originally announced July 2024.

arXiv:2406.19228 [pdf, other]

Tools Fail: Detecting Silent Errors in Faulty Tools

Authors: Jimin Sun, So Yeon Min, Yingshan Chang, Yonatan Bisk

Abstract: Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, a… ▽ More Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 18 pages, 12 figures

arXiv:2406.17386 [pdf, other]

Double Momentum Method for Lower-Level Constrained Bilevel Optimization

Authors: Wanli Shi, Yi Chang, Bin Gu

Abstract: Bilevel optimization (BO) has recently gained prominence in many machine learning applications due to its ability to capture the nested structure inherent in these problems. Recently, many hypergradient methods have been proposed as effective solutions for solving large-scale problems. However, current hypergradient methods for the lower-level constrained bilevel optimization (LCBO) problems need… ▽ More Bilevel optimization (BO) has recently gained prominence in many machine learning applications due to its ability to capture the nested structure inherent in these problems. Recently, many hypergradient methods have been proposed as effective solutions for solving large-scale problems. However, current hypergradient methods for the lower-level constrained bilevel optimization (LCBO) problems need very restrictive assumptions, namely, where optimality conditions satisfy the differentiability and invertibility conditions and lack a solid analysis of the convergence rate. What's worse, existing methods require either double-loop updates, which are sometimes less efficient. To solve this problem, in this paper, we propose a new hypergradient of LCBO leveraging the theory of nonsmooth implicit function theorem instead of using the restrive assumptions. In addition, we propose a \textit{single-loop single-timescale} algorithm based on the double-momentum method and adaptive step size method and prove it can return a $(δ, ε)$-stationary point with $\tilde{\mathcal{O}}(d_2^2ε^{-4})$ iterations. Experiments on two applications demonstrate the effectiveness of our proposed method. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: 27pages, 9 figures

arXiv:2406.15848 [pdf, other]

Quality-guided Skin Tone Enhancement for Portrait Photography

Authors: Shiqi Gao, Huiyu Duan, Xinyue Li, Kang Fu, Yicong Peng, Qihang Xu, Yuanyuan Chang, Jia Wang, Xiongkuo Min, Guangtao Zhai

Abstract: In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods just learn a mapping from one distribution to another based on one dataset, lacking the ability to adjust images continuously and controllably. It is important to enable the learning-based enhancement models to adjust an image contin… ▽ More In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods just learn a mapping from one distribution to another based on one dataset, lacking the ability to adjust images continuously and controllably. It is important to enable the learning-based enhancement models to adjust an image continuously, since in many cases we may want to get a slighter or stronger enhancement effect rather than one fixed adjusted result. In this paper, we propose a quality-guided image enhancement paradigm that enables image enhancement models to learn the distribution of images with various quality ratings. By learning this distribution, image enhancement models can associate image features with their corresponding perceptual qualities, which can be used to adjust images continuously according to different quality scores. To validate the effectiveness of our proposed method, a subjective quality assessment experiment is first conducted, focusing on skin tone adjustment in portrait photography. Guided by the subjective quality ratings obtained from this experiment, our method can adjust the skin tone corresponding to different quality requirements. Furthermore, an experiment conducted on 10 natural raw images corroborates the effectiveness of our model in situations with fewer subjects and fewer shots, and also demonstrates its general applicability to natural images. Our project page is https://github.com/IntMeGroup/quality-guided-enhancement . △ Less

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.15119 [pdf, other]

Speech Emotion Recognition under Resource Constraints with Data Distillation

Authors: Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller

Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment… ▽ More Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14517 [pdf, other]

PostMark: A Robust Blackbox Watermark for Large Language Models

Authors: Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Wieting, Mohit Iyyer

Abstract: The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. I… ▽ More The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: preprint; 18 pages, 5 figures

arXiv:2406.13153 [pdf, other]

SwinStyleformer is a favorable choice for image inversion

Authors: Jiawei Mao, Guangyi Zhao, Xuesong Yin, Yuanqi Chang

Abstract: This paper proposes the first pure Transformer structure inversion network called SwinStyleformer, which can compensate for the shortcomings of the CNNs inversion framework by handling long-range dependencies and learning the global structure of objects. Experiments found that the inversion network with the Transformer backbone could not successfully invert the image. The above phenomena arise fro… ▽ More This paper proposes the first pure Transformer structure inversion network called SwinStyleformer, which can compensate for the shortcomings of the CNNs inversion framework by handling long-range dependencies and learning the global structure of objects. Experiments found that the inversion network with the Transformer backbone could not successfully invert the image. The above phenomena arise from the differences between CNNs and Transformers, such as the self-attention weights favoring image structure ignoring image details compared to convolution, the lack of multi-scale properties of Transformer, and the distribution differences between the latent code extracted by the Transformer and the StyleGAN style vector. To address these differences, we employ the Swin Transformer with a smaller window size as the backbone of the SwinStyleformer to enhance the local detail of the inversion image. Meanwhile, we design a Transformer block based on learnable queries. Compared to the self-attention transformer block, the Transformer block based on learnable queries provides greater adaptability and flexibility, enabling the model to update the attention weights according to specific tasks. Thus, the inversion focus is not limited to the image structure. To further introduce multi-scale properties, we design multi-scale connections in the extraction of feature maps. Multi-scale connections allow the model to gain a comprehensive understanding of the image to avoid loss of detail due to global modeling. Moreover, we propose an inversion discriminator and distribution alignment loss to minimize the distribution differences. Based on the above designs, our SwinStyleformer successfully solves the Transformer's inversion failure issue and demonstrates SOTA performance in image inversion and several related vision tasks. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12587 [pdf, other]

Restorer: Solving Multiple Image Restoration Tasks with One Set of Parameters

Authors: Jiawei Mao, Xuesong Yin, Yuanqi Chang

Abstract: Although there are many excellent solutions in image restoration, the fact that they are specifically designed for a single image restoration task may prevent them from being state-of-the-art (SOTA) in other types of image restoration tasks. While some approaches require considering multiple image restoration tasks, they are still not sufficient for the requirements of the real world and may suffe… ▽ More Although there are many excellent solutions in image restoration, the fact that they are specifically designed for a single image restoration task may prevent them from being state-of-the-art (SOTA) in other types of image restoration tasks. While some approaches require considering multiple image restoration tasks, they are still not sufficient for the requirements of the real world and may suffer from the task confusion issue. In this work, we focus on designing a unified and effective solution for multiple image restoration tasks including deraining, desnowing, defogging, deblurring, denoising, and low-light enhancement. Based on the above purpose, we propose a Transformer network Restorer with U-Net architecture. In order to effectively deal with degraded information in multiple image restoration tasks, we need a more comprehensive attention mechanism. Thus, we design all-axis attention (AAA) through stereo embedding and 3D convolution, which can simultaneously model the long-range dependencies in both spatial and channel dimensions, capturing potential correlations among all axis. Moreover, we propose a Restorer based on textual prompts. Compared to previous methods that employ learnable queries, textual prompts bring explicit task priors to solve the task confusion issue arising from learnable queries and introduce interactivity. Based on these designs, Restorer demonstrates SOTA or comparable performance in multiple image restoration tasks compared to universal image restoration frameworks and methods specifically designed for these individual tasks. Meanwhile, Restorer is faster during inference. The above results along with the real-world test results show that Restorer has the potential to serve as a backbone for multiple real-world image restoration tasks. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12585 [pdf, other]

Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling

Authors: Yao-Ching Yu, Chun-Chih Kuo, Ziqi Ye, Yu-Cheng Chang, Yueh-Se Li

Abstract: Ensembling multiple models has always been an effective approach to push the limits of existing performance and is widely used in classification tasks by simply averaging the classification probability vectors from multiple classifiers to achieve better accuracy. However, in the thriving open-source Large Language Model (LLM) community, ensembling methods are rare and typically limited to ensembli… ▽ More Ensembling multiple models has always been an effective approach to push the limits of existing performance and is widely used in classification tasks by simply averaging the classification probability vectors from multiple classifiers to achieve better accuracy. However, in the thriving open-source Large Language Model (LLM) community, ensembling methods are rare and typically limited to ensembling the full-text outputs of LLMs, such as selecting the best output using a ranker, which leads to underutilization of token-level probability information. In this paper, we treat the Generation of each token by LLMs as a Classification (GaC) for ensembling. This approach fully exploits the probability information at each generation step and better prevents LLMs from producing early incorrect tokens that lead to snowballing errors. In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics and reasoning, and observe that our method breaks the existing community performance ceiling. Furthermore, we observed that most of the tokens in the answer are simple and do not affect the correctness of the final answer. Therefore, we also experimented with ensembling only key tokens, and the results showed better performance with lower latency across benchmarks. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.10272 [pdf, other]

Connected Speech-Based Cognitive Assessment in Chinese and English

Authors: Saturnino Luz, Sofia De La Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi, Ya-Ning Chang, Chia-Ju Chou, Yi-Chien Liu

Abstract: We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment as well as individuals with normal cognition. These data have been carefully matched by age… ▽ More We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment as well as individuals with normal cognition. These data have been carefully matched by age and sex by propensity score analysis to ensure balance and representativity in model training. The prediction tasks encompass mild cognitive impairment diagnosis and cognitive test score prediction. This framework was designed to encourage the development of approaches to speech-based cognitive assessment which generalise across languages. We illustrate it by presenting baseline prediction models that employ language-agnostic and comparable features for diagnosis and cognitive test score prediction. The models achieved unweighted average recall was 59.2% in diagnosis, and root mean squared error of 2.89 in score prediction. △ Less

Submitted 18 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: To appear in Proceedings of Interspeech 2024

ACM Class: J.3; I.5.4

arXiv:2406.08855 [pdf, other]

Trajectory Planning for Autonomous Driving in Unstructured Scenarios Based on Graph Neural Network and Numerical Optimization

Authors: Sumin Zhang, Kuo Li, Rui He, Zhiwei Meng, Yupeng Chang, Xiaosong Jin, Ri Bai

Abstract: In unstructured environments, obstacles are diverse and lack lane markings, making trajectory planning for intelligent vehicles a challenging task. Traditional trajectory planning methods typically involve multiple stages, including path planning, speed planning, and trajectory optimization. These methods require the manual design of numerous parameters for each stage, resulting in significant wor… ▽ More In unstructured environments, obstacles are diverse and lack lane markings, making trajectory planning for intelligent vehicles a challenging task. Traditional trajectory planning methods typically involve multiple stages, including path planning, speed planning, and trajectory optimization. These methods require the manual design of numerous parameters for each stage, resulting in significant workload and computational burden. While end-to-end trajectory planning methods are simple and efficient, they often fail to ensure that the trajectory meets vehicle dynamics and obstacle avoidance constraints in unstructured scenarios. Therefore, this paper proposes a novel trajectory planning method based on Graph Neural Networks (GNN) and numerical optimization. The proposed method consists of two stages: (1) initial trajectory prediction using the GNN, (2) trajectory optimization using numerical optimization. First, the graph neural network processes the environment information and predicts a rough trajectory, replacing traditional path and speed planning. This predicted trajectory serves as the initial solution for the numerical optimization stage, which optimizes the trajectory to ensure compliance with vehicle dynamics and obstacle avoidance constraints. We conducted simulation experiments to validate the feasibility of the proposed algorithm and compared it with other mainstream planning algorithms. The results demonstrate that the proposed method simplifies the trajectory planning process and significantly improves planning efficiency. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.05191 [pdf, other]

DiffusionPID: Interpreting Diffusion via Partial Information Decomposition

Authors: Shaurya Dewan, Rushikesh Zawar, Prakanshul Saxena, Yingshan Chang, Andrew Luo, Yonatan Bisk

Abstract: Text-to-image diffusion models have made significant progress in generating naturalistic images from textual inputs, and demonstrate the capacity to learn and represent complex visual-semantic relationships. While these diffusion models have achieved remarkable success, the underlying mechanisms driving their performance are not yet fully accounted for, with many unanswered questions surrounding w… ▽ More Text-to-image diffusion models have made significant progress in generating naturalistic images from textual inputs, and demonstrate the capacity to learn and represent complex visual-semantic relationships. While these diffusion models have achieved remarkable success, the underlying mechanisms driving their performance are not yet fully accounted for, with many unanswered questions surrounding what they learn, how they represent visual-semantic relationships, and why they sometimes fail to generalize. Our work presents Diffusion Partial Information Decomposition (DiffusionPID), a novel technique that applies information-theoretic principles to decompose the input text prompt into its elementary components, enabling a detailed examination of how individual tokens and their interactions shape the generated image. We introduce a formal approach to analyze the uniqueness, redundancy, and synergy terms by applying PID to the denoising model at both the image and pixel level. This approach enables us to characterize how individual tokens and their interactions affect the model output. We first present a fine-grained analysis of characteristics utilized by the model to uniquely localize specific concepts, we then apply our approach in bias analysis and show it can recover gender and ethnicity biases. Finally, we use our method to visually characterize word ambiguity and similarity from the model's perspective and illustrate the efficacy of our method for prompt intervention. Our results show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models. △ Less

Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04368 [pdf, other]

SocialNLP Fake-EmoReact 2021 Challenge Overview: Predicting Fake Tweets from Their Replies and GIFs

Authors: Chien-Kun Huang, Yi-Ting Chang, Lun-Wei Ku, Cheng-Te Li, Hong-Han Shuai

Abstract: This paper provides an overview of the Fake-EmoReact 2021 Challenge, held at the 9th SocialNLP Workshop, in conjunction with NAACL 2021. The challenge requires predicting the authenticity of tweets using reply context and augmented GIF categories from EmotionGIF dataset. We offer the Fake-EmoReact dataset with more than 453k as the experimental materials, where every tweet is labeled with authenti… ▽ More This paper provides an overview of the Fake-EmoReact 2021 Challenge, held at the 9th SocialNLP Workshop, in conjunction with NAACL 2021. The challenge requires predicting the authenticity of tweets using reply context and augmented GIF categories from EmotionGIF dataset. We offer the Fake-EmoReact dataset with more than 453k as the experimental materials, where every tweet is labeled with authenticity. Twenty-four teams registered to participate in this challenge, and 5 submitted their results successfully in the evaluation phase. The best team achieves 93.9 on Fake-EmoReact 2021 dataset using F1 score. In addition, we show the definition of share task, data collection, and the teams' performance that joined this challenge and their approaches. △ Less

Submitted 31 May, 2024; originally announced June 2024.

arXiv:2406.03102 [pdf, other]

DEER: A Delay-Resilient Framework for Reinforcement Learning with Variable Delays

Authors: Bo Xia, Yilun Kong, Yongzhe Chang, Bo Yuan, Zhiheng Li, Xueqian Wang, Bin Liang

Abstract: Classic reinforcement learning (RL) frequently confronts challenges in tasks involving delays, which cause a mismatch between received observations and subsequent actions, thereby deviating from the Markov assumption. Existing methods usually tackle this issue with end-to-end solutions using state augmentation. However, these black-box approaches often involve incomprehensible processes and redund… ▽ More Classic reinforcement learning (RL) frequently confronts challenges in tasks involving delays, which cause a mismatch between received observations and subsequent actions, thereby deviating from the Markov assumption. Existing methods usually tackle this issue with end-to-end solutions using state augmentation. However, these black-box approaches often involve incomprehensible processes and redundant information in the information states, causing instability and potentially undermining the overall performance. To alleviate the delay challenges in RL, we propose $\textbf{DEER (Delay-resilient Encoder-Enhanced RL)}$, a framework designed to effectively enhance the interpretability and address the random delay issues. DEER employs a pretrained encoder to map delayed states, along with their variable-length past action sequences resulting from different delays, into hidden states, which is trained on delay-free environment datasets. In a variety of delayed scenarios, the trained encoder can seamlessly integrate with standard RL algorithms without requiring additional modifications and enhance the delay-solving capability by simply adapting the input dimension of the original algorithms. We evaluate DEER through extensive experiments on Gym and Mujoco environments. The results confirm that DEER is superior to state-of-the-art RL algorithms in both constant and random delay settings. △ Less

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.01033 [pdf]

Generalized Jersey Number Recognition Using Multi-task Learning With Orientation-guided Weight Refinement

Authors: Yung-Hui Lin, Yu-Wen Chang, Huang-Chia Shih, Takahiro Ogawa

Abstract: Jersey number recognition (JNR) has always been an important task in sports analytics. Improving recognition accuracy remains an ongoing challenge because images are subject to blurring, occlusion, deformity, and low resolution. Recent research has addressed these problems using number localization and optical character recognition. Some approaches apply player identification schemes to image sequ… ▽ More Jersey number recognition (JNR) has always been an important task in sports analytics. Improving recognition accuracy remains an ongoing challenge because images are subject to blurring, occlusion, deformity, and low resolution. Recent research has addressed these problems using number localization and optical character recognition. Some approaches apply player identification schemes to image sequences, ignoring the impact of human body rotation angles on jersey digit identification. Accurately predicting the number of jersey digits by using a multi-task scheme to recognize each individual digit enables more robust results. Based on the above considerations, this paper proposes a multi-task learning method called the angle-digit refine scheme (ADRS), which combines human body orientation angles and digit number clues to recognize athletic jersey numbers. Based on our experimental results, our approach increases inference information, significantly improving prediction accuracy. Compared to state-of-the-art methods, which can only handle a single type of sport, the proposed method produces a more diverse and practical JNR application. The incorporation of diverse types of team sports such as soccer, football, basketball, volleyball, and baseball into our dataset contributes greatly to generalized JNR in sports analytics. Our accuracy achieves 64.07% on Top-1 and 89.97% on Top-2, with corresponding F1 scores of 67.46% and 90.64%, respectively. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 10 pages, 6 figures, 5 tables

arXiv:2405.20404 [pdf, other]

XPrompt:Explaining Large Language Model's Generation via Joint Prompt Attribution

Authors: Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, Lu Lin

Abstract: Large Language Models (LLMs) have demonstrated impressive performances in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of elucidating and explaining the causality between input and output pairs. Existing works for providing prompt-specific explanation often confine model output to b… ▽ More Large Language Models (LLMs) have demonstrated impressive performances in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of elucidating and explaining the causality between input and output pairs. Existing works for providing prompt-specific explanation often confine model output to be classification or next-word prediction. Few initial attempts aiming to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on joint prompt attribution, XPrompt, which aims to explain how a few prompt texts collaboratively influences the LLM's complete generation. Particularly, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the casual input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both faithfulness and efficiency of our framework. △ Less

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.20131 [pdf, other]

Language Models Need Inductive Biases to Count Inductively

Authors: Yingshan Chang, Yonatan Bisk

Abstract: Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of countin… ▽ More Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigating length generalization does occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automata. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures ranging from RNNs, Transformers, State-Space Models and RWKV. We present carefully-designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature. △ Less

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19718 [pdf, other]

LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising

Authors: Yuxing Duan, Shihan Peng, Lin Zhu, Wei Zhang, Yi Chang, Sheng Zhong, Luxin Yan

Abstract: Event camera has significant advantages in capturing dynamic scene information while being prone to noise interference, particularly in challenging conditions like low threshold and low illumination. However, most existing research focuses on gentle situations, hindering event camera applications in realistic complex scenarios. To tackle this limitation and advance the field, we construct a new pa… ▽ More Event camera has significant advantages in capturing dynamic scene information while being prone to noise interference, particularly in challenging conditions like low threshold and low illumination. However, most existing research focuses on gentle situations, hindering event camera applications in realistic complex scenarios. To tackle this limitation and advance the field, we construct a new paired real-world event denoising dataset (LED), including 3K sequences with 18K seconds of high-resolution (1200*680) event streams and showing three notable distinctions compared to others: diverse noise levels and scenes, larger-scale with high-resolution, and high-quality GT. Specifically, it contains stepped parameters and varying illumination with diverse scenarios. Moreover, based on the property of noise events inconsistency and signal events consistency, we propose a novel effective denoising framework(DED) using homogeneous dual events to generate the GT with better separating noise from the raw. Furthermore, we design a bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with dynamic thresholds to realize accurate denoising. The experimental results demonstrate that the remarkable performance of the proposed approach on different datasets.The dataset and code are at https://github.com/Yee-Sing/led. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: Accepted by CVPR 2024

arXiv:2405.18816 [pdf, other]

Flow Priors for Linear Inverse Problems via Iterative Corrupted Trajectory Matching

Authors: Yasi Zhang, Peiyu Yu, Yaxuan Zhu, Yingshan Chang, Feng Gao, Ying Nian Wu, Oscar Leong

Abstract: Generative models based on flow matching have attracted significant attention for their simplicity and superior performance in high-resolution image synthesis. By leveraging the instantaneous change-of-variables formula, one can directly compute image likelihoods from a learned flow, making them enticing candidates as priors for downstream tasks such as inverse problems. In particular, a natural a… ▽ More Generative models based on flow matching have attracted significant attention for their simplicity and superior performance in high-resolution image synthesis. By leveraging the instantaneous change-of-variables formula, one can directly compute image likelihoods from a learned flow, making them enticing candidates as priors for downstream tasks such as inverse problems. In particular, a natural approach would be to incorporate such image probabilities in a maximum-a-posteriori (MAP) estimation problem. A major obstacle, however, lies in the slow computation of the log-likelihood, as it requires backpropagating through an ODE solver, which can be prohibitively slow for high-dimensional problems. In this work, we propose an iterative algorithm to approximate the MAP estimator efficiently to solve a variety of linear inverse problems. Our algorithm is mathematically justified by the observation that the MAP objective can be approximated by a sum of $N$ ``local MAP'' objectives, where $N$ is the number of function evaluations. By leveraging Tweedie's formula, we show that we can perform gradient steps to sequentially optimize these objectives. We validate our approach for various linear inverse problems, such as super-resolution, deblurring, inpainting, and compressed sensing, and demonstrate that we can outperform other methods based on flow matching. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.16755 [pdf, other]

CHESS: Contextual Harnessing for Efficient SQL Synthesis

Authors: Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, Amin Saberi

Abstract: Utilizing large language models (LLMs) for transforming natural language questions into SQL queries (text-to-SQL) is a promising yet challenging approach, particularly when applied to real-world databases with complex and extensive schemas. In particular, effectively incorporating data catalogs and database values for SQL generation remains an obstacle, leading to suboptimal solutions. We address… ▽ More Utilizing large language models (LLMs) for transforming natural language questions into SQL queries (text-to-SQL) is a promising yet challenging approach, particularly when applied to real-world databases with complex and extensive schemas. In particular, effectively incorporating data catalogs and database values for SQL generation remains an obstacle, leading to suboptimal solutions. We address this problem by proposing a new pipeline that effectively retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient SQL queries. To increase retrieval precision, our pipeline introduces a hierarchical retrieval method leveraging model-generated keywords, locality-sensitive hashing indexing, and vector databases. Additionally, we have developed an adaptive schema pruning technique that adjusts based on the complexity of the problem and the model's context size. Our approach generalizes to both frontier proprietary models like GPT-4 and open-source models such as Llama-3-70B. Through a series of ablation studies, we demonstrate the effectiveness of each component of our pipeline and its impact on the end-to-end performance. Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset. △ Less

Submitted 27 June, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.15808 [pdf, other]

Ensuring Ground Truth Accuracy in Healthcare with the EVINCE framework

Authors: Edward Y. Chang

Abstract: Misdiagnosis is a significant issue in healthcare, leading to harmful consequences for patients. The propagation of mislabeled data through machine learning models into clinical practice is unacceptable. This paper proposes EVINCE, a system designed to 1) improve diagnosis accuracy and 2) rectify misdiagnoses and minimize training data errors. EVINCE stands for Entropy Variation through Informatio… ▽ More Misdiagnosis is a significant issue in healthcare, leading to harmful consequences for patients. The propagation of mislabeled data through machine learning models into clinical practice is unacceptable. This paper proposes EVINCE, a system designed to 1) improve diagnosis accuracy and 2) rectify misdiagnoses and minimize training data errors. EVINCE stands for Entropy Variation through Information Duality with Equal Competence, leveraging this novel theory to optimize the diagnostic process using multiple Large Language Models (LLMs) in a structured debate framework. Our empirical study verifies EVINCE to be effective in achieving its design goals. △ Less

Submitted 28 May, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

Comments: 23 pages, 4 tables, 4 figures

ACM Class: I.2.7

arXiv:2405.14573 [pdf, other]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Authors: Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva

Abstract: Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactiv… ▽ More Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. Reward signals are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate AndroidWorld's benefits and mode of operation, we introduce a new computer control agent, M3A. M3A can complete 30.6% of the AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-domain agents. Finally, we conduct a robustness analysis by testing M3A against a range of task variations on a representative subset of tasks, demonstrating that variations in task parameters can significantly alter a task's complexity and, consequently, an agent's performance, highlighting the importance of testing agents under diverse conditions. AndroidWorld and the experiments in this paper are available at https://github.com/google-research/android_world. △ Less

Submitted 10 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.13368 [pdf, other]

Static Deep Q-learning for Green Downlink C-RAN

Authors: Yuchao Chang, Hongli Wang, Wen Chen, Yonghui Li, Naofal Al-Dhahir

Abstract: Power saving is a main pillar in the operation of wireless communication systems. In this paper, we investigate cloud radio access network (C-RAN) capability to reduce power consumption based on the user equipment (UE) requirement. Aiming to save the long-term C-RAN energy consumption, an optimization problem is formulated to manage the downlink power without degrading the UE requirement by design… ▽ More Power saving is a main pillar in the operation of wireless communication systems. In this paper, we investigate cloud radio access network (C-RAN) capability to reduce power consumption based on the user equipment (UE) requirement. Aiming to save the long-term C-RAN energy consumption, an optimization problem is formulated to manage the downlink power without degrading the UE requirement by designing the power offset parameter. Considering stochastic traffic arrivals at UEs, we first formulate the problem as a Markov decision process (MDP) and then set up a dual objective optimization problem in terms of the downlink throughput and power. To solve this optimization problem, we develop a novel static deep Q-learning (SDQL) algorithm to maximize the downlink throughput and minimize the downlink power. In our proposed algorithm, we design multi-Q-tables to simultaneously optimize power reductions of activated RRHs by assigning one Q-table for each UE. To maximize the accumulative reward in terms of the downlink throughput loss and power reduction, our proposed algorithm performs power reductions of activated RRHs through continuous environmental interactions. Simulation results1 show that our proposed algorithm enjoys a superior average power reduction compared to the activation and sleep schemes, and enjoys a low computational complexity. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.12954 [pdf, other]

A Method on Searching Better Activation Functions

Authors: Haoyuan Sun, Zihao Wu, Bo Xia, Pu Chang, Zibin Dong, Yifu Yuan, Yongzhe Chang, Xueqian Wang

Abstract: The success of artificial neural networks (ANNs) hinges greatly on the judicious selection of an activation function, introducing non-linearity into network and enabling them to model sophisticated relationships in data. However, the search of activation functions has largely relied on empirical knowledge in the past, lacking theoretical guidance, which has hindered the identification of more effe… ▽ More The success of artificial neural networks (ANNs) hinges greatly on the judicious selection of an activation function, introducing non-linearity into network and enabling them to model sophisticated relationships in data. However, the search of activation functions has largely relied on empirical knowledge in the past, lacking theoretical guidance, which has hindered the identification of more effective activation functions. In this work, we offer a proper solution to such issue. Firstly, we theoretically demonstrate the existence of the worst activation function with boundary conditions (WAFBC) from the perspective of information entropy. Furthermore, inspired by the Taylor expansion form of information entropy functional, we propose the Entropy-based Activation Function Optimization (EAFO) methodology. EAFO methodology presents a novel perspective for designing static activation functions in deep neural networks and the potential of dynamically optimizing activation during iterative training. Utilizing EAFO methodology, we derive a novel activation function from ReLU, known as Correction Regularized ReLU (CRReLU). Experiments conducted with vision transformer and its variants on CIFAR-10, CIFAR-100 and ImageNet-1K datasets demonstrate the superiority of CRReLU over existing corrections of ReLU. Extensive empirical studies on task of large language model (LLM) fine-tuning, CRReLU exhibits superior performance compared to GELU, suggesting its broader potential for practical applications. △ Less

Submitted 22 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

Comments: 16 pages,3 figures

arXiv:2405.10516 [pdf, other]

Language Models can Evaluate Themselves via Probability Discrepancy

Authors: Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou

Abstract: In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. T… ▽ More In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. This approach obviates the necessity for an additional evaluation model or the dependence on external, proprietary models like GPT-4 for judgment. It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. A higher discrepancy for a given query between two LLMs indicates a relatively weaker capability. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4, spanning a range of scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and benchmarks for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of varying magnitudes. △ Less

Submitted 8 July, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: ACL 2024 Findings

arXiv:2405.07076 [pdf, other]

Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models

Authors: Edward Y. Chang

Abstract: This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling… ▽ More This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Our innovative approaches include mapping emotions and behaviors using self-supervised learning techniques, refining these guardrails through adversarial reviews, and systematically adjusting outputs to ensure ethical alignment. This framework establishes a robust foundation for AI systems to operate with ethical integrity and cultural sensitivity, paving the way for more responsible and context-aware AI interactions. △ Less

Submitted 13 May, 2024; v1 submitted 11 May, 2024; originally announced May 2024.

Comments: 29 pages, 10 tables, 6 figures

ACM Class: I.2.7

arXiv:2405.05132 [pdf, other]

Low-Distortion Clustering in Bounded Growth Graphs

Authors: Yi-Jun Chang, Varsha Dani, Thomas P. Hayes

Abstract: The well-known clustering algorithm of Miller, Peng, and Xu (SPAA 2013) is useful for many applications, including low-diameter decomposition and low-energy distributed algorithms. One nice property of their clustering, shown in previous work by Chang, Dani, Hayes, and Pettie (PODC 2020), is that distances in the cluster graph are rescaled versions of distances in the original graph, up to an… ▽ More The well-known clustering algorithm of Miller, Peng, and Xu (SPAA 2013) is useful for many applications, including low-diameter decomposition and low-energy distributed algorithms. One nice property of their clustering, shown in previous work by Chang, Dani, Hayes, and Pettie (PODC 2020), is that distances in the cluster graph are rescaled versions of distances in the original graph, up to an $O(\log n)$ distortion factor and rounding issues. Minimizing this distortion factor is important for efficiency in computing the clustering, as well as in other applications. We prove that there exist graphs for which an $Ω((\log n)^{1/3})$ distortion factor is necessary for any clustering. We also consider a class of nice graphs which we call uniformly bounded independence graphs. These include, for example, paths, lattice graphs, and "dense" unit disk graphs. For these graphs, we prove that clusterings of distortion $O(1)$ always exist, and moreover, we give new efficient distributed algorithms to construct them. This clustering is based on Voronoi cells centered at the vertices of a maximal independent set in a suitable power graph. Applications include low-energy simulation of distributed algorithms in the LOCAL, CONGEST, and RADIO-CONGEST models and efficient approximate solutions to distributed combinatorial optimization problems. We also investigate related lower bounds. △ Less

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.03908 [pdf, other]

Deterministic Expander Routing: Faster and More Versatile

Authors: Yi-Jun Chang, Shang-En Huang, Hsin-Hao Su

Abstract: We consider the expander routing problem formulated by Ghaffari, Kuhn, and Su (PODC 2017), where the goal is to route all the tokens to their destinations given that each vertex is the source and the destination of at most $°(v)$ tokens. They developed $\textit{randomized algorithms}$ that solve this problem in $\text{poly}(φ^{-1}) \cdot 2^{O(\sqrt{\log n \log \log n})}$ rounds in the… ▽ More We consider the expander routing problem formulated by Ghaffari, Kuhn, and Su (PODC 2017), where the goal is to route all the tokens to their destinations given that each vertex is the source and the destination of at most $°(v)$ tokens. They developed $\textit{randomized algorithms}$ that solve this problem in $\text{poly}(φ^{-1}) \cdot 2^{O(\sqrt{\log n \log \log n})}$ rounds in the $\textsf{CONGEST}$ model, where $φ$ is the conductance of the graph. Later, Ghaffari and Li (DISC 2018) gave an improved algorithm. However, both algorithms are randomized, which means that all the resulting applications are also randomized. Recently, Chang and Saranurak (FOCS 2020) gave a deterministic algorithm that solves an expander routing instance in $2^{O(\log^{2/3} n \cdot \log^{1/3} \log n)}$ rounds. The deterministic algorithm is less efficient and does not allow preprocessing/query tradeoffs, which precludes the de-randomization of algorithms that require this feature, such as the $k$-clique enumeration algorithm in general graphs. The main contribution of our work is a new deterministic expander routing algorithm that not only matches the randomized bound of [GKS 2017] but also allows preprocessing/query tradeoffs. Our algorithm solves a single instance of routing query in $2^{{O}(\sqrt{\log n \cdot \log \log n})}$ rounds. Our algorithm achieves the following preprocessing and query tradeoffs: For $0 < ε< 1$, we can answer every routing query in $\log^{O(1/ε)} n$ rounds at the cost of a $(n^{O(ε)} + \log^{O(1/ε)} n)$-round preprocessing procedure. Combining this with the approach of Censor-Hillel, Leitersdorf, and Vulakh (PODC 2022), we obtain a near-optimal $\tilde{O}(n^{1-2/k})$-round deterministic algorithm for $k$-clique enumeration in general graphs, improving the previous state-of-the-art $n^{1-2/k+o(1)}$. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: Accepted to PODC 2024

arXiv:2405.03371 [pdf, other]

Explainable Fake News Detection With Large Language Model via Defense Among Competing Wisdom

Authors: Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, Yi Chang

Abstract: Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes to classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from debunking delayed and low efficiency. Recent studies simply assume that the justification is equivalen… ▽ More Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes to classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from debunking delayed and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinions expressed in the wisdom of crowds. However, the opinions typically contain some inaccurate or biased information since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded and even competing narratives, in this paper, we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module to split the wisdom of crowds into two competing parties and respectively detect salient evidences. To gain concise insights from evidences, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards two possible veracities. Finally, we propose a defense-based inference module to determine veracity via modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications. △ Less

Submitted 20 June, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 12 pages, WWW'2024

arXiv:2405.02814 [pdf, other]

NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli

Authors: Xu Wang, Cheng Li, Yi Chang, Jindong Wang, Yuan Wu

Abstract: Large Language Models (LLMs) have become integral to a wide spectrum of applications, ranging from traditional computing tasks to advanced artificial intelligence (AI) applications. This widespread adoption has spurred extensive research into LLMs across various disciplines, including the social sciences. Notably, studies have revealed that LLMs possess emotional intelligence, which can be further… ▽ More Large Language Models (LLMs) have become integral to a wide spectrum of applications, ranging from traditional computing tasks to advanced artificial intelligence (AI) applications. This widespread adoption has spurred extensive research into LLMs across various disciplines, including the social sciences. Notably, studies have revealed that LLMs possess emotional intelligence, which can be further developed through positive emotional stimuli. This discovery raises an intriguing question: can negative emotions similarly influence LLMs, potentially enhancing their performance? In response to this question, we introduce NegativePrompt, a novel approach underpinned by psychological principles, involving ten specifically designed negative emotional stimuli. We embark on rigorous experimental evaluations of five LLMs including Flan-T5-Large, Vicuna, Llama 2, ChatGPT, and GPT-4, across a set of 45 tasks. The results are revealing: NegativePrompt markedly enhances the performance of LLMs, evidenced by relative improvements of 12.89% in Instruction Induction tasks and 46.25% in BIG-Bench tasks. Moreover, we conduct attention visualization experiments to decipher the underlying mechanisms of NegativePrompt's influence. Our research contributes significantly to the understanding of LLMs and emotion interaction, demonstrating the practical efficacy of NegativePrompt as an emotion-driven method and offering novel insights for the enhancement of LLMs in real-world applications. The code is available at https://github.com/wangxu0820/NegativePrompt. △ Less

Submitted 12 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

Comments: This paper has been accepted by IJCAI 2024

arXiv:2405.02695 [pdf, ps, other]

Improved All-Pairs Approximate Shortest Paths in Congested Clique

Authors: Hong Duc Bui, Shashwat Chandra, Yi-Jun Chang, Michal Dory, Dean Leitersdorf

Abstract: In this paper, we present new algorithms for approximating All-Pairs Shortest Paths (APSP) in the Congested Clique model. We present randomized algorithms for weighted undirected graphs. Our first contribution is an $O(1)$-approximate APSP algorithm taking just $O(\log \log \log n)$ rounds. Prior to our work, the fastest algorithms that give an $O(1)$-approximation for APSP take… ▽ More In this paper, we present new algorithms for approximating All-Pairs Shortest Paths (APSP) in the Congested Clique model. We present randomized algorithms for weighted undirected graphs. Our first contribution is an $O(1)$-approximate APSP algorithm taking just $O(\log \log \log n)$ rounds. Prior to our work, the fastest algorithms that give an $O(1)$-approximation for APSP take $\operatorname{poly}(\log{n})$ rounds in weighted undirected graphs, and $\operatorname{poly}(\log \log n)$ rounds in unweighted undirected graphs. If we terminate the execution of the algorithm early, we obtain an $O(t)$-round algorithm that yields an $O \big( (\log n)^{1/2^t} \big) $ distance approximation for a parameter $t$. The trade-off between $t$ and the approximation quality provides flexibility for different scenarios, allowing the algorithm to adapt to specific requirements. In particular, we can get an $O \big( (\log n)^{1/2^t} \big) $-approximation for any constant $t$ in $O(1)$-rounds. Such result was previously known only for the special case that $t=0$. A key ingredient in our algorithm is a lemma that allows to improve an $O(a)$-approximation for APSP to an $O(\sqrt{a})$-approximation for APSP in $O(1)$ rounds. To prove the lemma, we develop several new tools, including $O(1)$-round algorithms for computing the $k$ closest nodes, a certain type of hopset, and skeleton graphs. △ Less

Submitted 4 May, 2024; originally announced May 2024.

arXiv:2405.02359 [pdf, other]

doi 10.1007/978-3-031-43412-9_11

CVTGAD: Simplified Transformer with Cross-View Attention for Unsupervised Graph-level Anomaly Detection

Authors: Jindong Li, Qianli Xing, Qi Wang, Yi Chang

Abstract: Unsupervised graph-level anomaly detection (UGAD) has received remarkable performance in various critical disciplines, such as chemistry analysis and bioinformatics. Existing UGAD paradigms often adopt data augmentation techniques to construct multiple views, and then employ different strategies to obtain representations from different views for jointly conducting UGAD. However, most previous work… ▽ More Unsupervised graph-level anomaly detection (UGAD) has received remarkable performance in various critical disciplines, such as chemistry analysis and bioinformatics. Existing UGAD paradigms often adopt data augmentation techniques to construct multiple views, and then employ different strategies to obtain representations from different views for jointly conducting UGAD. However, most previous works only considered the relationship between nodes/graphs from a limited receptive field, resulting in some key structure patterns and feature information being neglected. In addition, most existing methods consider different views separately in a parallel manner, which is not able to explore the inter-relationship across different views directly. Thus, a method with a larger receptive field that can explore the inter-relationship across different views directly is in need. In this paper, we propose a novel Simplified Transformer with Cross-View Attention for Unsupervised Graph-level Anomaly Detection, namely, CVTGAD. To increase the receptive field, we construct a simplified transformer-based module, exploiting the relationship between nodes/graphs from both intra-graph and inter-graph perspectives. Furthermore, we design a cross-view attention mechanism to directly exploit the view co-occurrence between different views, bridging the inter-view gap at node level and graph level. To the best of our knowledge, this is the first work to apply transformer and cross attention to UGAD, which realizes graph neural network and transformer working collaboratively. Extensive experiments on 15 real-world datasets of 3 fields demonstrate the superiority of CVTGAD on the UGAD task. The code is available at \url{https://github.com/jindongli-Ai/CVTGAD}. △ Less

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2405.02288 [pdf, other]

Prospective Role of Foundation Models in Advancing Autonomous Vehicles

Authors: Jianhua Wu, Bingzhao Gao, Jincheng Gao, Jianhao Yu, Hongqing Chu, Qiankun Yu, Xun Gong, Yi Chang, H. Eric Tseng, Hong Chen, Jie Chen

Abstract: With the development of artificial intelligence and breakthroughs in deep learning, large-scale Foundation Models (FMs), such as GPT, Sora, etc., have achieved remarkable results in many fields including natural language processing and computer vision. The application of FMs in autonomous driving holds considerable promise. For example, they can contribute to enhancing scene understanding and reas… ▽ More With the development of artificial intelligence and breakthroughs in deep learning, large-scale Foundation Models (FMs), such as GPT, Sora, etc., have achieved remarkable results in many fields including natural language processing and computer vision. The application of FMs in autonomous driving holds considerable promise. For example, they can contribute to enhancing scene understanding and reasoning. By pre-training on rich linguistic and visual data, FMs can understand and interpret various elements in a driving scene, and provide cognitive reasoning to give linguistic and action instructions for driving decisions and planning. Furthermore, FMs can augment data based on the understanding of driving scenarios to provide feasible scenes of those rare occurrences in the long tail distribution that are unlikely to be encountered during routine driving and data collection. The enhancement can subsequently lead to improvement in the accuracy and reliability of autonomous driving systems. Another testament to the potential of FMs' applications lies in World Models, exemplified by the DREAMER series, which showcases the ability to comprehend physical laws and dynamics. Learning from massive data under the paradigm of self-supervised learning, World Model can generate unseen yet plausible driving environments, facilitating the enhancement in the prediction of road users' behaviors and the off-line training of driving strategies. In this paper, we synthesize the applications and future trends of FMs in autonomous driving. By utilizing the powerful capabilities of FMs, we strive to tackle the potential issues stemming from the long-tail distribution in autonomous driving, consequently advancing overall safety in this domain. △ Less

Submitted 17 May, 2024; v1 submitted 8 December, 2023; originally announced May 2024.

Comments: 45 pages,8 figures

arXiv:2404.13696 [pdf, other]

Clio: Real-time Task-Driven Open-Set 3D Scene Graphs

Authors: Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, Lukas Schmid, Luca Carlone

Abstract: Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping. While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations. This leaves us… ▽ More Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping. While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations. This leaves us with a fundamental question: what is the right granularity for the objects (and, more generally, for the semantic concepts) the robot has to include in its map representation? While related work implicitly chooses a level of granularity by tuning thresholds for object detection, we argue that such a choice is intrinsically task-dependent. The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks. We show that this problem can be naturally formulated using the Information Bottleneck (IB), an established information-theoretic framework. The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative IB approach, that is able to cluster 3D primitives in the environment into task-relevant objects and regions and executes incrementally. The third contribution is to integrate our task-driven clustering algorithm into a real-time pipeline, named Clio, that constructs a hierarchical 3D scene graph of the environment online using only onboard compute, as the robot explores it. Our final contribution is an extensive experimental campaign showing that Clio not only allows real-time construction of compact open-set 3D scene graphs, but also improves the accuracy of task execution by limiting the map to relevant semantic concepts. △ Less

Submitted 29 April, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.13071 [pdf, other]

Modeling Emotions and Ethics with Large Language Models

Authors: Edward Y. Chang

Abstract: This paper explores the integration of human-like emotions and ethical considerations into Large Language Models (LLMs). We first model eight fundamental human emotions, presented as opposing pairs, and employ collaborative LLMs to reinterpret and express these emotions across a spectrum of intensity. Our focus extends to embedding a latent ethical dimension within LLMs, guided by a novel self-sup… ▽ More This paper explores the integration of human-like emotions and ethical considerations into Large Language Models (LLMs). We first model eight fundamental human emotions, presented as opposing pairs, and employ collaborative LLMs to reinterpret and express these emotions across a spectrum of intensity. Our focus extends to embedding a latent ethical dimension within LLMs, guided by a novel self-supervised learning algorithm with human feedback (SSHF). This approach enables LLMs to perform self-evaluations and adjustments concerning ethical guidelines, enhancing their capability to generate content that is not only emotionally resonant but also ethically aligned. The methodologies and case studies presented herein illustrate the potential of LLMs to transcend mere text and image generation, venturing into the realms of empathetic interaction and principled decision-making, thereby setting a new precedent in the development of emotionally aware and ethically conscious AI systems. △ Less

Submitted 25 June, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Comments: 8 pages, 4 figures, 3 tables

ACM Class: I.2.0

Journal ref: IEEE MIPR 2024

arXiv:2404.12930 [pdf, other]

Fast Broadcast in Highly Connected Networks

Authors: Shashwat Chandra, Yi-Jun Chang, Michal Dory, Mohsen Ghaffari, Dean Leitersdorf

Abstract: We revisit the classic broadcast problem, wherein we have $k$ messages, each composed of $O(\log{n})$ bits, distributed arbitrarily across a network. The objective is to broadcast these messages to all nodes in the network. In the distributed CONGEST model, a textbook algorithm solves this problem in $O(D+k)$ rounds, where $D$ is the diameter of the graph. While the $O(D)$ term in the round comple… ▽ More We revisit the classic broadcast problem, wherein we have $k$ messages, each composed of $O(\log{n})$ bits, distributed arbitrarily across a network. The objective is to broadcast these messages to all nodes in the network. In the distributed CONGEST model, a textbook algorithm solves this problem in $O(D+k)$ rounds, where $D$ is the diameter of the graph. While the $O(D)$ term in the round complexity is unavoidable$\unicode{x2014}$given that $Ω(D)$ rounds are necessary to solve broadcast in any graph$\unicode{x2014}$it remains unclear whether the $O(k)$ term is needed in all graphs. In cases where the minimum cut size is one, simply transmitting messages from one side of the cut to the other would require $Ω(k)$ rounds. However, if the size of the minimum cut is larger, it may be possible to develop faster algorithms. This motivates the exploration of the broadcast problem in networks with high edge connectivity. In this work, we present a simple randomized distributed algorithm for performing $k$-message broadcast in $O(((n+k)/λ)\log n)$ rounds in any $n$-node simple graph with edge connectivity $λ$. When $k = Ω(n)$, our algorithm is universally optimal, up to an $O(\log n)$ factor, as its complexity nearly matches an information-theoretic $Ω(k/λ)$ lower bound that applies to all graphs, even when the network topology is known to the algorithm. The setting $k = Ω(n)$ is particularly interesting because several fundamental problems can be reduced to broadcasting $Ω(n)$ messages. Our broadcast algorithm finds several applications in distributed computing, enabling $O(1)$-approximation for all distances and $(1+ε)$-approximation for all cut sizes in $\tilde{O}(n/λ)$ rounds. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.12445 [pdf]

Adaptive Catalyst Discovery Using Multicriteria Bayesian Optimization with Representation Learning

Authors: Jie Chen, Pengfei Ou, Yuxin Chang, Hengrui Zhang, Xiao-Yan Li, Edward H. Sargent, Wei Chen

Abstract: High-performance catalysts are crucial for sustainable energy conversion and human health. However, the discovery of catalysts faces challenges due to the absence of efficient approaches to navigating vast and high-dimensional structure and composition spaces. In this study, we propose a high-throughput computational catalyst screening approach integrating density functional theory (DFT) and Bayes… ▽ More High-performance catalysts are crucial for sustainable energy conversion and human health. However, the discovery of catalysts faces challenges due to the absence of efficient approaches to navigating vast and high-dimensional structure and composition spaces. In this study, we propose a high-throughput computational catalyst screening approach integrating density functional theory (DFT) and Bayesian Optimization (BO). Within the BO framework, we propose an uncertainty-aware atomistic machine learning model, UPNet, which enables automated representation learning directly from high-dimensional catalyst structures and achieves principled uncertainty quantification. Utilizing a constrained expected improvement acquisition function, our BO framework simultaneously considers multiple evaluation criteria. Using the proposed methods, we explore catalyst discovery for the CO2 reduction reaction. The results demonstrate that our approach achieves high prediction accuracy, facilitates interpretable feature extraction, and enables multicriteria design optimization, leading to significant reduction of computing power and time (10x reduction of required DFT calculations) in high-performance catalyst discovery. △ Less

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.11884 [pdf, other]

Seeing Motion at Nighttime with an Event Camera

Authors: Haoyue Liu, Shihan Peng, Lin Zhu, Yi Chang, Hanyu Zhou, Luxin Yan

Abstract: We focus on a very challenging task: imaging at nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they would inevitably face a dilemma between the long exposure time of nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (… ▽ More We focus on a very challenging task: imaging at nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they would inevitably face a dilemma between the long exposure time of nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (120dB), offering an alternative solution. In this work, we present a novel nighttime dynamic imaging method with an event camera. Specifically, we discover that the event at nighttime exhibits temporal trailing characteristics and spatial non-stationary distribution. Consequently, we propose a nighttime event reconstruction network (NER-Net) which mainly includes a learnable event timestamps calibration module (LETC) to align the temporal trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover, we construct a paired real low-light event dataset (RLED) through a co-axial imaging system, including 64,200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project are available at: https://github.com/Liu-haoyue/NER-Net. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.01598 [pdf, other]

Extremum-Seeking Action Selection for Accelerating Policy Optimization

Authors: Ya-Chien Chang, Sicun Gao

Abstract: Reinforcement learning for control over continuous spaces typically uses high-entropy stochastic policies, such as Gaussian distributions, for local exploration and estimating policy gradient to optimize performance. Many robotic control problems deal with complex unstable dynamics, where applying actions that are off the feasible control manifolds can quickly lead to undesirable divergence. In su… ▽ More Reinforcement learning for control over continuous spaces typically uses high-entropy stochastic policies, such as Gaussian distributions, for local exploration and estimating policy gradient to optimize performance. Many robotic control problems deal with complex unstable dynamics, where applying actions that are off the feasible control manifolds can quickly lead to undesirable divergence. In such cases, most samples taken from the ambient action space generate low-value trajectories that hardly contribute to policy improvement, resulting in slow or failed learning. We propose to improve action selection in this model-free RL setting by introducing additional adaptive control steps based on Extremum-Seeking Control (ESC). On each action sampled from stochastic policies, we apply sinusoidal perturbations and query for estimated Q-values as the response signal. Based on ESC, we then dynamically improve the sampled actions to be closer to nearby optima before applying them to the environment. Our methods can be easily added in standard policy optimization to improve learning efficiency, which we demonstrate in various control learning environments. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.01261 [pdf, other]

FABLES: Evaluating faithfulness and content selection in book-length summarization

Authors: Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study miti… ▽ More While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: Claude-3-Opus significantly outperforms all closed-source LLMs, while the open-source Mixtral is on par with GPT-3.5-Turbo. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: preprint - 39 pages

arXiv:2403.16394 [pdf, other]

Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

Authors: Yingshan Chang, Yasi Zhang, Zhiyuan Fang, Yingnian Wu, Yonatan Bisk, Feng Gao

Abstract: The literature on text-to-image generation is plagued by issues of faithfully composing entities with relations. But there lacks a formal understanding of how entity-relation compositions can be effectively learned. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure is not well-defined, leading to an arms race for larger quantities of data in the hope that g… ▽ More The literature on text-to-image generation is plagued by issues of faithfully composing entities with relations. But there lacks a formal understanding of how entity-relation compositions can be effectively learned. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure is not well-defined, leading to an arms race for larger quantities of data in the hope that generalization emerges out of large-scale pretraining. We hypothesize that the underlying phenomenological coverage has not been proportionally scaled up, leading to a skew of the presented phenomenon which harms generalization. We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning, and show that generalization failures of text-to-image generation are a direct result of incomplete or unbalanced phenomenological coverage. We first perform experiments in a synthetic domain and demonstrate that systematically controlled metrics are strongly predictive of generalization performance. Then we move to natural images and show that simple distribution perturbations in light of our theories boost generalization without enlarging the absolute data size. This work informs an important direction towards quality-enhancing the data diversity or balance orthogonal to scaling up the absolute size. Our discussions point out important open questions on 1) Evaluation of generated entity-relation compositions, and 2) Better models for reasoning with abstract relations. △ Less

Submitted 24 March, 2024; originally announced March 2024.

Showing 1–50 of 480 results for author: Chang, Y