Search | arXiv e-print repository

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Authors: Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb

Abstract: Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoi… ▽ More Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers. △ Less

Submitted 6 September, 2024; originally announced September 2024.

arXiv:2409.03643 [pdf, other]

CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

Authors: Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He

Abstract: Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly… ▽ More Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly sensitive to the distribution of training data, thereby causing the unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring the evaluation objectivity by designing a image-level rather than LaTex-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. Their results demonstrate that the CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations. △ Less

Submitted 5 September, 2024; originally announced September 2024.

Comments: Project Website: https://github.com/opendatalab/UniMERNet/tree/main/cdm

arXiv:2409.01787 [pdf, other]

LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection

Authors: Yifeng Wang, Zhouhong Gu, Siwei Zhang, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao

Abstract: Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, presenting LLMs for explainable fake news detection remains two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leavin… ▽ More Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, presenting LLMs for explainable fake news detection remains two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leaving them unable to understand the complex news-faking process. Secondly, utilizing LLMs for this task would generate both correct and incorrect explanations, which necessitates abundant labor in the loop. In this paper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms to enable an LLM to become Generator and Detector and for realistic fake news generation and detection. Our results demonstrate LLM-GAN's effectiveness in both prediction performance and explanation quality. We further showcase the integration of LLM-GAN to a cloud-native AI platform to provide better fake news detection service in the cloud. △ Less

Submitted 3 September, 2024; originally announced September 2024.

arXiv:2408.12209 [pdf, ps, other]

Zeroth-Order Stochastic Mirror Descent Algorithms for Minimax Excess Risk Optimization

Authors: Zhihao Gu, Zi Xu

Abstract: The minimax excess risk optimization (MERO) problem is a new variation of the traditional distributionally robust optimization (DRO) problem, which achieves uniformly low regret across all test distributions under suitable conditions. In this paper, we propose a zeroth-order stochastic mirror descent (ZO-SMD) algorithm available for both smooth and non-smooth MERO to estimate the minimal risk of e… ▽ More The minimax excess risk optimization (MERO) problem is a new variation of the traditional distributionally robust optimization (DRO) problem, which achieves uniformly low regret across all test distributions under suitable conditions. In this paper, we propose a zeroth-order stochastic mirror descent (ZO-SMD) algorithm available for both smooth and non-smooth MERO to estimate the minimal risk of each distrbution, and finally solve MERO as (non-)smooth stochastic convex-concave (linear) minimax optimization problems. The proposed algorithm is proved to converge at optimal convergence rates of $\mathcal{O}\left(1/\sqrt{t}\right)$ on the estimate of $R_i^*$ and $\mathcal{O}\left(1/\sqrt{t}\right)$ on the optimization error of both smooth and non-smooth MERO. Numerical results show the efficiency of the proposed algorithm. △ Less

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.01349 [pdf, other]

doi 10.1145/3664647.3680860

PC$^2$: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval

Authors: Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi

Abstract: In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces Pseudo-Classification based Pseudo-Capt… ▽ More In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy: firstly, it establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC$^2$'s pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assistant the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC$^2$ showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at https://github.com/alipay/PC2-NoiseofWeb. △ Less

Submitted 2 August, 2024; originally announced August 2024.

Comments: Accepted by ACM MM 2024

arXiv:2408.00598 [pdf, other]

HOT: An Efficient Halpern Accelerating Algorithm for Optimal Transport Problems

Authors: Guojun Zhang, Zhexuan Gu, Yancheng Yuan, Defeng Sun

Abstract: This paper proposes an efficient HOT algorithm for solving the optimal transport (OT) problems with finite supports. We particularly focus on an efficient implementation of the HOT algorithm for the case where the supports are in $\mathbb{R}^2$ with ground distances calculated by $L_2^2$-norm. Specifically, we design a Halpern accelerating algorithm to solve the equivalent reduced model of the dis… ▽ More This paper proposes an efficient HOT algorithm for solving the optimal transport (OT) problems with finite supports. We particularly focus on an efficient implementation of the HOT algorithm for the case where the supports are in $\mathbb{R}^2$ with ground distances calculated by $L_2^2$-norm. Specifically, we design a Halpern accelerating algorithm to solve the equivalent reduced model of the discrete OT problem. Moreover, we derive a novel procedure to solve the involved linear systems in the HOT algorithm in linear time complexity. Consequently, we can obtain an $\varepsilon$-approximate solution to the optimal transport problem with $M$ supports in $O(M^{1.5}/\varepsilon)$ flops, which significantly improves the best-known computational complexity. We further propose an efficient procedure to recover an optimal transport plan for the original OT problem based on a solution to the reduced model, thereby overcoming the limitations of the reduced OT model in applications that require the transport map. We implement the HOT algorithm in PyTorch and extensive numerical results show the superior performance of the HOT algorithm compared to existing state-of-the-art algorithms for solving the OT problems. △ Less

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.20804 [pdf, other]

The incompressible Navier-Stokes limit from the lattice BGK Boltzmann equation

Authors: Zhongyang Gu, Xin Hu, Pritpal Matharu, Bartosz Protas, Makiko Sasada, Tsuyoshi Yoneda

Abstract: In this paper, we prove that a local weak solution to the $d$-dimensional incompressible Navier-Stokes equations ($d \geq 2$) can be constructed by taking the hydrodynamic limit of a velocity-discretized Boltzmann equation with a simplified BGK collision operator. Moreover, in the case when the dimension is $d=2,3$, we characterize the combinations of finitely many particle velocities and probabil… ▽ More In this paper, we prove that a local weak solution to the $d$-dimensional incompressible Navier-Stokes equations ($d \geq 2$) can be constructed by taking the hydrodynamic limit of a velocity-discretized Boltzmann equation with a simplified BGK collision operator. Moreover, in the case when the dimension is $d=2,3$, we characterize the combinations of finitely many particle velocities and probabilities that lead to the incompressible Navier-Stokes equations in the hydrodynamic limit. Numerical computations conducted in 2D provide information about the rate with which this hydrodynamic limit is achieved when the Knudsen number tends to zero. △ Less

Submitted 30 July, 2024; originally announced July 2024.

Comments: 49 pages, 3 figures

MSC Class: 35Q30; 76D05; 76P05; 76M28

arXiv:2407.20679 [pdf, other]

Online Prediction-Assisted Safe Reinforcement Learning for Electric Vehicle Charging Station Recommendation in Dynamically Coupled Transportation-Power Systems

Authors: Qionghua Liao, Guilong Li, Jiajie Yu, Ziyuan Gu, Wei Ma

Abstract: With the proliferation of electric vehicles (EVs), the transportation network and power grid become increasingly interdependent and coupled via charging stations. The concomitant growth in charging demand has posed challenges for both networks, highlighting the importance of charging coordination. Existing literature largely overlooks the interactions between power grid security and traffic effici… ▽ More With the proliferation of electric vehicles (EVs), the transportation network and power grid become increasingly interdependent and coupled via charging stations. The concomitant growth in charging demand has posed challenges for both networks, highlighting the importance of charging coordination. Existing literature largely overlooks the interactions between power grid security and traffic efficiency. In view of this, we study the en-route charging station (CS) recommendation problem for EVs in dynamically coupled transportation-power systems. The system-level objective is to maximize the overall traffic efficiency while ensuring the safety of the power grid. This problem is for the first time formulated as a constrained Markov decision process (CMDP), and an online prediction-assisted safe reinforcement learning (OP-SRL) method is proposed to learn the optimal and secure policy by extending the PPO method. To be specific, we mainly address two challenges. First, the constrained optimization problem is converted into an equivalent unconstrained optimization problem by applying the Lagrangian method. Second, to account for the uncertain long-time delay between performing CS recommendation and commencing charging, we put forward an online sequence-to-sequence (Seq2Seq) predictor for state augmentation to guide the agent in making forward-thinking decisions. Finally, we conduct comprehensive experimental studies based on the Nguyen-Dupuis network and a large-scale real-world road network, coupled with IEEE 33-bus and IEEE 69-bus distribution systems, respectively. Results demonstrate that the proposed method outperforms baselines in terms of road network efficiency, power grid safety, and EV user satisfaction. The case study on the real-world network also illustrates the applicability in the practical context. △ Less

Submitted 30 July, 2024; originally announced July 2024.

Comments: 33 pages, 31 figures

arXiv:2407.17768 [pdf, ps, other]

$G$-BSDEs with mean constraints in time-dependent intervals

Authors: Zihao Gu, Hui Zhao

Abstract: In this paper, we study a collection of mean-reflected backward stochastic differential equations driven by $G$-Brownian motions ($G$-BSDEs), where $G$-expectations are constrained in some time-dependent intervals. To establish well-posedness results, we firstly construct a backward Skorokhod problem with sublinear expectation, and then apply that in the study of doubly mean-reflected $G$-BSDEs in… ▽ More In this paper, we study a collection of mean-reflected backward stochastic differential equations driven by $G$-Brownian motions ($G$-BSDEs), where $G$-expectations are constrained in some time-dependent intervals. To establish well-posedness results, we firstly construct a backward Skorokhod problem with sublinear expectation, and then apply that in the study of doubly mean-reflected $G$-BSDEs involving Lipschitz and quadratic generators under bounded and unbounded terminal conditions. Also we utilize fixed-point argumentations and $θ$-methods while solving these equations. Finally, we extend the results to multi-dimensional doubly mean-reflected $G$-BSDEs with diagonal generators. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.15835 [pdf, other]

dMel: Speech Tokenization made Simple

Authors: He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly

Abstract: Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic tokens, pot… ▽ More Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic tokens, potentially losing acoustic information, or model acoustic tokens, risking the loss of semantic information. Having multiple token types also complicates the architecture and requires additional pretraining. Here we show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel), that performs better than other existing speech tokenization methods. Using a transformer decoder-only architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR), speech synthesis (TTS). Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text. △ Less

Submitted 22 July, 2024; originally announced July 2024.

Comments: under review

arXiv:2407.12005 [pdf, other]

VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It

Authors: Xiaoxuan Zhu, Zhouhong Gu, Sihang Jiang, Zhixu Li, Hongwei Feng, Yanghua Xiao

Abstract: Online courses have significantly lowered the barrier to accessing education, yet the varying content quality of these videos poses challenges. In this work, we focus on the task of automatically evaluating the quality of video course content. We have constructed a dataset with a substantial collection of video courses and teaching materials. We propose three evaluation principles and design a new… ▽ More Online courses have significantly lowered the barrier to accessing education, yet the varying content quality of these videos poses challenges. In this work, we focus on the task of automatically evaluating the quality of video course content. We have constructed a dataset with a substantial collection of video courses and teaching materials. We propose three evaluation principles and design a new evaluation framework, \textit{VCEval}, based on these principles. The task is modeled as a multiple-choice question-answering task, with a language model serving as the evaluator. Our method effectively distinguishes video courses of different content quality and produces a range of interpretable results. △ Less

Submitted 15 June, 2024; originally announced July 2024.

arXiv:2407.05909 [pdf, other]

Multi-clue Consistency Learning to Bridge Gaps Between General and Oriented Object in Semi-supervised Detection

Authors: Chenxu Wang, Chunyan Xu, Ziqi Gu, Zhen Cui

Abstract: While existing semi-supervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when s… ▽ More While existing semi-supervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when selecting positive labels from labeled data. 2) Assignment inconsistency: balancing the precision and localization quality of oriented pseudo-boxes poses greater challenges which introduces more noise when selecting positive labels from unlabeled data. 3) Confidence inconsistency: there exists more mismatch between the predicted classification and localization qualities when considering oriented objects, affecting the selection of pseudo-labels. Therefore, we propose a Multi-clue Consistency Learning (MCL) framework to bridge gaps between general and oriented objects in semi-supervised detection. Specifically, considering various shapes of rotated objects, the Gaussian Center Assignment is specially designed to select the pixel-level positive labels from labeled data. We then introduce the Scale-aware Label Assignment to select pixel-level pseudo-labels instead of unreliable pseudo-boxes, which is a divide-and-rule strategy suited for objects with various scales. The Consistent Confidence Soft Label is adopted to further boost the detector by maintaining the alignment of the predicted results. Comprehensive experiments on DOTA-v1.5 and DOTA-v1.0 benchmarks demonstrate that our proposed MCL can achieve state-of-the-art performance in the semi-supervised oriented object detection task. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.04895 [pdf, ps, other]

Retractive spaces and Bousfield-Kan completions

Authors: Zeshen Gu, John E. Harper

Abstract: In this short paper we apply some recent techniques developed by Schonsheck, and subsequently Carr-Harper, in the context of operadic algebras in spectra -- on convergence of Bousfield-Kan completions and comparisons with convergence of the Taylor tower of the identity functor in Goodwillie's functor calculus -- to the setting of retractive spaces: this arises when working with spaces centered awa… ▽ More In this short paper we apply some recent techniques developed by Schonsheck, and subsequently Carr-Harper, in the context of operadic algebras in spectra -- on convergence of Bousfield-Kan completions and comparisons with convergence of the Taylor tower of the identity functor in Goodwillie's functor calculus -- to the setting of retractive spaces: this arises when working with spaces centered away from the one-point space. Interestingly, in the retractive spaces context, the comparison results are stronger in terms of convergence outside of functor calculus' notion of "radius of (strong) convergence" for analytic functors. In particular, we give a new proof (and generalization to retractive spaces) of the Arone-Kankaanrinta result for convergence of the Taylor tower of the identity functor to various Bousfield-Kan completions; it's notable that no use is made of Snaith splittings -- rather, we make extensive use of the kinds of homotopical estimates that appear in earlier work of Dundas and Dundas-Goodwillie-McCarthy. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.04845 [pdf, other]

Poster: Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks

Authors: Ruikun Wang, Jiawei Zhang, Qiaolun Zhang, Bojun Zhang, Zhiqun Gu, Aryanaz Attarpour, Yuefeng Ji, Massimo Tornatore

Abstract: Many emerging Artificial Intelligence (AI) applications require on-demand provisioning of large-scale computing, which can only be enabled by leveraging distributed computing services interconnected through networking. To address such increasing demand for networking to serve AI tasks, we investigate new scheduling strategies to improve communication efficiency and test them on a programmable test… ▽ More Many emerging Artificial Intelligence (AI) applications require on-demand provisioning of large-scale computing, which can only be enabled by leveraging distributed computing services interconnected through networking. To address such increasing demand for networking to serve AI tasks, we investigate new scheduling strategies to improve communication efficiency and test them on a programmable testbed. We also show relevant challenges and research directions. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.03942 [pdf, other]

Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data

Authors: Zihui Gu, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Cheng-Zhong Xu, Ju Fan

Abstract: Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction following remains a challenge due to complexity and diversity of real-world user instructions. While existing evaluation methods focus on general skills, they suff… ▽ More Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction following remains a challenge due to complexity and diversity of real-world user instructions. While existing evaluation methods focus on general skills, they suffer from two main shortcomings, i.e., lack of fine-grained task-level evaluation and reliance on singular instruction expression. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset that has two main advantages: (1) DINGO is based on a manual annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions, generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Journal ref: AAAI 2024

arXiv:2407.02730 [pdf, other]

MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Authors: Zishan Gu, Changchang Yin, Fenglin Liu, Ping Zhang

Abstract: Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which inspires a large amount of studies for LVLMs fine-tuning and training. Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmar… ▽ More Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which inspires a large amount of studies for LVLMs fine-tuning and training. Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs. MedVH comprises five tasks to evaluate hallucinations in LVLMs within the medical context, which includes tasks for comprehensive understanding of textual and visual input, as well as long textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2406.14250 [pdf, other]

E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

Authors: Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu

Abstract: Online GUI navigation on mobile devices has driven a lot of attention recent years since it contributes to many real-world applications. With the rapid development of large language models (LLM), multimodal large language models (MLLM) have tremendous potential on this task. However, existing MLLMs need high quality data to improve its abilities of making the correct navigation decisions according… ▽ More Online GUI navigation on mobile devices has driven a lot of attention recent years since it contributes to many real-world applications. With the rapid development of large language models (LLM), multimodal large language models (MLLM) have tremendous potential on this task. However, existing MLLMs need high quality data to improve its abilities of making the correct navigation decisions according to the human user inputs. In this paper, we developed a novel and highly valuable dataset, named \textbf{E-ANT}, as the first Chinese GUI navigation dataset that contains real human behaviour and high quality screenshots with annotations, containing nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and show their experiments results with sufficient ablations. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities. △ Less

Submitted 1 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

Comments: 9 pages, 5 figures, Under review

arXiv:2406.13726 [pdf, other]

Global Solutions to Master Equations for Continuous Time Heterogeneous Agent Macroeconomic Models

Authors: Zhouzhou Gu, Mathieu Laurière, Sebastian Merkel, Jonathan Payne

Abstract: We propose and compare new global solution algorithms for continuous time heterogeneous agent economies with aggregate shocks. First, we approximate the agent distribution so that equilibrium in the economy can be characterized by a high, but finite, dimensional non-linear partial differential equation. We consider different approximations: discretizing the number of agents, discretizing the agent… ▽ More We propose and compare new global solution algorithms for continuous time heterogeneous agent economies with aggregate shocks. First, we approximate the agent distribution so that equilibrium in the economy can be characterized by a high, but finite, dimensional non-linear partial differential equation. We consider different approximations: discretizing the number of agents, discretizing the agent state variables, and projecting the distribution onto a finite set of basis functions. Second, we represent the value function using a neural network and train it to solve the differential equation using deep learning tools. We refer to the solution as an Economic Model Informed Neural Network (EMINN). The main advantage of this technique is that it allows us to find global solutions to high dimensional, non-linear problems. We demonstrate our algorithm by solving important models in the macroeconomics and spatial literatures (e.g. Krusell and Smith (1998), Khan and Thomas (2007), Bilal (2023)). △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12641 [pdf, other]

DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?

Authors: Zhouhong Gu, Lin Zhang, Xiaoxuan Zhu, Jiangjie Chen, Wenhao Huang, Yikai Zhang, Shusen Wang, Zheyu Ye, Yan Gao, Hongwei Feng, Yanghua Xiao

Abstract: Detecting evidence within the context is a key step in the process of reasoning task. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice… ▽ More Detecting evidence within the context is a key step in the process of reasoning task. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice questions, with an average of 994 tokens per question. Each question contains an average of 4.55 pieces of implicit evidence, and solving the problem typically requires 7.62 logical jumps to find the correct answer. To enhance the performance of LLMs in evidence detection, this paper proposes Detective Reasoning Prompt and Finetune. Experiments demonstrate that the existing LLMs' abilities to detect evidence in long contexts are far inferior to humans. However, the Detective Reasoning Prompt effectively enhances the capability of powerful LLMs in evidence detection, while the Finetuning method shows significant effects in enhancing the performance of weaker LLMs. Moreover, when the abilities of LLMs in evidence detection are improved, their final reasoning performance is also enhanced accordingly. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11931 [pdf, other]

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Authors: DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen , et al. (15 additional authors not shown)

Abstract: We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathe… ▽ More We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11374 [pdf, other]

Pseudogap with Fermi arcs and Fermi pockets in half-filled twisted transition metal dichalcogenides

Authors: Yong-Yue Zong, Zhao-Long Gu, Jian-Xin Li

Abstract: Twisted transition metal dichalcogenides are a new platform for realizing strongly correlated physics with high tunability. Recent transport experiments have reported the realization of a Mott insulator, its bandwidth-driven evolution to a metal, and the strange metal behavior in proximity to the transition via the tuning of a displacement field in twisted $\mathrm{WSe_2}$($\mathrm{tWSe_2}$) fixed… ▽ More Twisted transition metal dichalcogenides are a new platform for realizing strongly correlated physics with high tunability. Recent transport experiments have reported the realization of a Mott insulator, its bandwidth-driven evolution to a metal, and the strange metal behavior in proximity to the transition via the tuning of a displacement field in twisted $\mathrm{WSe_2}$($\mathrm{tWSe_2}$) fixed at half filling. However, the nature of the correlated states and the related Mott physics involved in the whole process remain to be determined. Here, we unveil theoretically the evolution of the ground state of the half-filled $\boldsymbol{\mathrm{moir\acute{e}}}$ Hubbard model as applied to $\mathrm{tWSe_2}$, transiting from a pseudogap state with Fermi arcs to a 120$^\circ$ Ne$\boldsymbol{\acute{\mathrm{e}}}$l ordered Mott insulator, then to another pseudogap state with Fermi pockets, and eventually to a Fermi liquid via a Lifshitz transition. The pseudogap phases are definitively identified by the vanishing of quasiparticle weights over an extended region of the free Fermi surface, with the remaining parts forming disconnected Fermi arcs or pockets that contain well-defined quasiparticles. The loss of quasiparticle weights in the pseudogap phases indicates the breakdown of Landau's Fermi liquid theory. We demonstrate that the emergence of the Fermi arcs/pockets results from the electronic structure reconstruction driven by strong electron correlations, marked by the existence of Luttinger surface enclosed by the zeros of single-particle Green's function. This work reveals the fundamental aspects of the Mottness in $\boldsymbol{\mathrm{moir\acute{e}}}$ systems and will stimulate the advance of the direct probes of the Fermi surface and quasiparticles beyond transports, such as angle-resolved photoemission spectroscopy (ARPES) and scanning tunnelling microscopy (STM) experiments. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 8+11 pages, 4+8 figures

arXiv:2406.10621 [pdf, other]

StrucText-Eval: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding

Authors: Zhouhong Gu, Haoning Ye, Zeyang Zhou, Hongwei Feng, Yanghua Xiao

Abstract: Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose evaluation data generation method for assessing LLM's ability in understanding the structure-rich text, which generates… ▽ More Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose evaluation data generation method for assessing LLM's ability in understanding the structure-rich text, which generates structured data of controllable complexity based on manually crafted question templates and generation rules. Building on this generation method, we introduce StrucText-Eval, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StrucText-Eval-Hard, which includes 3,016 questions designed to further examine the gap between LLMs and human performance. Results indicate that the best-performing LLM currently achieve an accuracy of 65.0\% on StrucText-Eval-Hard, while human accuracy reaches up to 95.7\%. Moreover, while fine-tuning using StrucText-Eval can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types. The benchmark and generation codes are open sourced in https://github.com/MikeGu721/StrucText-Eval △ Less

Submitted 30 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

arXiv:2405.19707 [pdf, other]

DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Authors: Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, Huaxiong Li

Abstract: Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale… ▽ More Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors' performance: the cross-generator video classification task assesses the generalizability of trained detectors on generators; the degraded video classification task evaluates the robustness of detectors to handle videos that have degraded in quality during dissemination. Moreover, we introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance the detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be aviliable at \url{https://github.com/chenhaoxing/DeMamba}. △ Less

Submitted 22 August, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.15216 [pdf, other]

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Authors: Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

Abstract: Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a… ▽ More Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a $\textit{scaled}$ error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts meanwhile achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several $\textit{key ingredients}$: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $\textit{test-clean}$ and 3.3% WER on $\textit{test-other}$ on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs, and greatly surpassing the performance of conventional LM based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: under review

arXiv:2405.12247 [pdf, other]

Focus on Low-Resolution Information: Multi-Granular Information-Lossless Model for Low-Resolution Human Pose Estimation

Authors: Zejun Gu, Zhong-Qiu Zhao, Hao Shen, Zhao Zhang

Abstract: In real-world applications of human pose estimation, low-resolution input images are frequently encountered when the performance of the image acquisition equipment is limited or the shooting distance is too far. However, existing state-of-the-art models for human pose estimation perform poorly on low-resolution images. One key reason is the presence of downsampling layers in these models, e.g., st… ▽ More In real-world applications of human pose estimation, low-resolution input images are frequently encountered when the performance of the image acquisition equipment is limited or the shooting distance is too far. However, existing state-of-the-art models for human pose estimation perform poorly on low-resolution images. One key reason is the presence of downsampling layers in these models, e.g., strided convolutions and pooling layers. It further reduces the already insufficient image information. Another key reason is that the body skeleton and human kinematic information are not fully utilized. In this work, we propose a Multi-Granular Information-Lossless (MGIL) model to replace the downsampling layers to address the above issues. Specifically, MGIL employs a Fine-grained Lossless Information Extraction (FLIE) module, which can prevent the loss of local information. Furthermore, we design a Coarse-grained Information Interaction (CII) module to adequately leverage human body structural information. To efficiently fuse cross-granular information and thoroughly exploit the relationships among keypoints, we further introduce a Multi-Granular Adaptive Fusion (MGAF) mechanism. The mechanism assigns weights to features of different granularities based on the content of the image. The model is effective, flexible, and universal. We show its potential in various vision tasks with comprehensive experiments. It outperforms the SOTA methods by 7.7 mAP on COCO and performs well with different input resolutions, different backbones, and different vision tasks. The code is provided in supplementary material. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 8 pages, 5 figures, conference

arXiv:2405.11640 [pdf, other]

Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning

Authors: Zishan Gu, Fenglin Liu, Changchang Yin, Ping Zhang

Abstract: The adoption of large language models (LLMs) in healthcare has attracted significant research interest. However, their performance in healthcare remains under-investigated and potentially limited, due to i) they lack rich domain-specific knowledge and medical reasoning skills; and ii) most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To this… ▽ More The adoption of large language models (LLMs) in healthcare has attracted significant research interest. However, their performance in healthcare remains under-investigated and potentially limited, due to i) they lack rich domain-specific knowledge and medical reasoning skills; and ii) most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To this end, we propose a multimodal medical collaborative reasoning framework \textbf{MultiMedRes}, which incorporates a learner agent to proactively gain essential information from domain-specific expert models, to solve medical multimodal reasoning problems. Our method includes three steps: i) \textbf{Inquire}: The learner agent first decomposes given complex medical reasoning problems into multiple domain-specific sub-problems; ii) \textbf{Interact}: The agent then interacts with domain-specific expert models by repeating the ``ask-answer'' process to progressively obtain different domain-specific knowledge; iii) \textbf{Integrate}: The agent finally integrates all the acquired domain-specific knowledge to accurately address the medical reasoning problem. We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments demonstrate that our zero-shot prediction achieves state-of-the-art performance, and even outperforms the fully supervised methods. Besides, our approach can be incorporated into various LLMs and multimodal LLMs to significantly boost their performance. △ Less

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.11448 [pdf, other]

Cross-Domain Knowledge Distillation for Low-Resolution Human Pose Estimation

Authors: Zejun Gu, Zhong-Qiu Zhao, Henghui Ding, Hao Shen, Zhao Zhang, De-Shuang Huang

Abstract: In practical applications of human pose estimation, low-resolution inputs frequently occur, and existing state-of-the-art models perform poorly with low-resolution images. This work focuses on boosting the performance of low-resolution models by distilling knowledge from a high-resolution model. However, we face the challenge of feature size mismatch and class number mismatch when applying knowled… ▽ More In practical applications of human pose estimation, low-resolution inputs frequently occur, and existing state-of-the-art models perform poorly with low-resolution images. This work focuses on boosting the performance of low-resolution models by distilling knowledge from a high-resolution model. However, we face the challenge of feature size mismatch and class number mismatch when applying knowledge distillation to networks with different input resolutions. To address this issue, we propose a novel cross-domain knowledge distillation (CDKD) framework. In this framework, we construct a scale-adaptive projector ensemble (SAPE) module to spatially align feature maps between models of varying input resolutions. It adopts a projector ensemble to map low-resolution features into multiple common spaces and adaptively merges them based on multi-scale information to match high-resolution features. Additionally, we construct a cross-class alignment (CCA) module to solve the problem of the mismatch of class numbers. By combining an easy-to-hard training (ETHT) strategy, the CCA module further enhances the distillation performance. The effectiveness and efficiency of our approach are demonstrated by extensive experiments on two common benchmark datasets: MPII and COCO. The code is made available in supplementary material. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 11 pages, 5 figures

arXiv:2405.10691 [pdf, other]

LoCI-DiffCom: Longitudinal Consistency-Informed Diffusion Model for 3D Infant Brain Image Completion

Authors: Zihao Zhu, Tianli Tao, Yitian Tao, Haowen Deng, Xinyi Cai, Gaofeng Wu, Kaidong Wang, Haifeng Tang, Lixuan Zhu, Zhuoyang Gu, Jiawei Huang, Dinggang Shen, Han Zhang

Abstract: The infant brain undergoes rapid development in the first few years after birth.Compared to cross-sectional studies, longitudinal studies can depict the trajectories of infants brain development with higher accuracy, statistical power and flexibility.However, the collection of infant longitudinal magnetic resonance (MR) data suffers a notorious dropout problem, resulting in incomplete datasets wit… ▽ More The infant brain undergoes rapid development in the first few years after birth.Compared to cross-sectional studies, longitudinal studies can depict the trajectories of infants brain development with higher accuracy, statistical power and flexibility.However, the collection of infant longitudinal magnetic resonance (MR) data suffers a notorious dropout problem, resulting in incomplete datasets with missing time points. This limitation significantly impedes subsequent neuroscience and clinical modeling. Yet, existing deep generative models are facing difficulties in missing brain image completion, due to sparse data and the nonlinear, dramatic contrast/geometric variations in the developing brain. We propose LoCI-DiffCom, a novel Longitudinal Consistency-Informed Diffusion model for infant brain image Completion,which integrates the images from preceding and subsequent time points to guide a diffusion model for generating high-fidelity missing data. Our designed LoCI module can work on highly sparse sequences, relying solely on data from two temporal points. Despite wide separation and diversity between age time points, our approach can extract individualized developmental features while ensuring context-aware consistency. Our experiments on a large infant brain MR dataset demonstrate its effectiveness with consistent performance on missing infant brain MR completion even in big gap scenarios, aiding in better delineation of early developmental trajectories. △ Less

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.10316 [pdf, other]

Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

Authors: Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao

Abstract: Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual IC… ▽ More Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: Project page: https://analogist2d.github.io

arXiv:2405.08447 [pdf, other]

AI-Resilient Interfaces

Authors: Elena L. Glassman, Ziwei Gu, Jonathan K. Kummerfeld

Abstract: AI is powerful, but it can make choices that result in objective errors, contextually inappropriate outputs, and disliked options. We need AI-resilient interfaces that help people be resilient to the AI choices that are not right, or not right for them. To support this goal, interfaces need to help users notice and have the context to appropriately judge those AI choices. Existing human-AI interac… ▽ More AI is powerful, but it can make choices that result in objective errors, contextually inappropriate outputs, and disliked options. We need AI-resilient interfaces that help people be resilient to the AI choices that are not right, or not right for them. To support this goal, interfaces need to help users notice and have the context to appropriately judge those AI choices. Existing human-AI interaction guidelines recommend efficient user dismissal, modification, or otherwise efficient recovery from AI choices that a user does not like. However, in order to recover from AI choices, the user must notice them first. This can be difficult. For example, when generating summaries of long documents, a system's exclusion of a detail that is critically important to the user is hard for the user to notice. That detail can be hiding in a wall of text in the original document, and the existence of a summary may tempt the user not to read the original document as carefully. Once noticed, judging AI choices well can also be challenging. The interface may provide very little information that contextualizes the choices, and the user may fall back on assumptions when deciding whether to dismiss, modify, or otherwise recover from an AI choice. Building on prior work, this paper defines key aspects of AI-resilient interfaces, illustrated with examples. Designing interfaces for increased AI-resilience of users will improve AI safety, usability, and utility. This is especially critical where AI-powered systems are used for context- and preference-dominated open-ended AI-assisted tasks, like ideating, summarizing, searching, sensemaking, and the reading and writing of text or code. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.04434 [pdf, other]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding , et al. (132 additional authors not shown)

Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference… ▽ More We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. △ Less

Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.01882 [pdf, other]

Millimeter Wave Radar-based Human Activity Recognition for Healthcare Monitoring Robot

Authors: Zhanzhong Gu, Xiangjian He, Gengfa Fang, Chengpei Xu, Feng Xia, Wenjing Jia

Abstract: Healthcare monitoring is crucial, especially for the daily care of elderly individuals living alone. It can detect dangerous occurrences, such as falls, and provide timely alerts to save lives. Non-invasive millimeter wave (mmWave) radar-based healthcare monitoring systems using advanced human activity recognition (HAR) models have recently gained significant attention. However, they encounter cha… ▽ More Healthcare monitoring is crucial, especially for the daily care of elderly individuals living alone. It can detect dangerous occurrences, such as falls, and provide timely alerts to save lives. Non-invasive millimeter wave (mmWave) radar-based healthcare monitoring systems using advanced human activity recognition (HAR) models have recently gained significant attention. However, they encounter challenges in handling sparse point clouds, achieving real-time continuous classification, and coping with limited monitoring ranges when statically mounted. To overcome these limitations, we propose RobHAR, a movable robot-mounted mmWave radar system with lightweight deep neural networks for real-time monitoring of human activities. Specifically, we first propose a sparse point cloud-based global embedding to learn the features of point clouds using the light-PointNet (LPN) backbone. Then, we learn the temporal pattern with a bidirectional lightweight LSTM model (BiLiLSTM). In addition, we implement a transition optimization strategy, integrating the Hidden Markov Model (HMM) with Connectionist Temporal Classification (CTC) to improve the accuracy and robustness of the continuous HAR. Our experiments on three datasets indicate that our method significantly outperforms the previous studies in both discrete and continuous HAR tasks. Finally, we deploy our system on a movable robot-mounted edge computing platform, achieving flexible healthcare monitoring in real-world scenarios. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.00797 [pdf, other]

ADM: Accelerated Diffusion Model via Estimated Priors for Robust Motion Prediction under Uncertainties

Authors: Jiahui Li, Tianle Shen, Zekai Gu, Jiawei Sun, Chengran Yuan, Yuhang Han, Shuo Sun, Marcelo H. Ang Jr

Abstract: Motion prediction is a challenging problem in autonomous driving as it demands the system to comprehend stochastic dynamics and the multi-modal nature of real-world agent interactions. Diffusion models have recently risen to prominence, and have proven particularly effective in pedestrian motion prediction tasks. However, the significant time consumption and sensitivity to noise have limited the r… ▽ More Motion prediction is a challenging problem in autonomous driving as it demands the system to comprehend stochastic dynamics and the multi-modal nature of real-world agent interactions. Diffusion models have recently risen to prominence, and have proven particularly effective in pedestrian motion prediction tasks. However, the significant time consumption and sensitivity to noise have limited the real-time predictive capability of diffusion models. In response to these impediments, we propose a novel diffusion-based, acceleratable framework that adeptly predicts future trajectories of agents with enhanced resistance to noise. The core idea of our model is to learn a coarse-grained prior distribution of trajectory, which can skip a large number of denoise steps. This advancement not only boosts sampling efficiency but also maintains the fidelity of prediction accuracy. Our method meets the rigorous real-time operational standards essential for autonomous vehicles, enabling prompt trajectory generation that is vital for secure and efficient navigation. Through extensive experiments, our method speeds up the inference time to 136ms compared to standard diffusion model, and achieves significant improvement in multi-agent motion prediction on the Argoverse 1 motion forecasting dataset. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: 7 pages, 4 figures

arXiv:2404.16770 [pdf, other]

Pseudogap phase as fluctuating pair density wave

Authors: Zheng-Yuan Yue, Zheng-Tao Xu, Shuo Yang, Zheng-Cheng Gu

Abstract: The physical nature of pseudogap phase is one of the most important and intriguing problems towards understanding the key mechanism of high temperature superconductivity in cuprates. Theoretically, the square-lattice $t$-$J$ model is widely believed to be the simplest toy model that captures the essential physics of cuprate superconductors. We employ the Grassmann tensor product state approach to… ▽ More The physical nature of pseudogap phase is one of the most important and intriguing problems towards understanding the key mechanism of high temperature superconductivity in cuprates. Theoretically, the square-lattice $t$-$J$ model is widely believed to be the simplest toy model that captures the essential physics of cuprate superconductors. We employ the Grassmann tensor product state approach to investigate uniform states in the underdoped ($δ\lesssim 0.1$) region. In addition to the previously known uniform $d$-wave state, we discover a strongly fluctuating pair density wave (PDW) state with wave vector $Q = (π, π)$. This fluctuating PDW state weakly breaks the $C_4$ rotational symmetry of the square lattice and has a lower or comparable energy to the $d$-wave state (depending on doping and the $t/J$ ratio), making it a promising candidate state for describing the pseudogap phase. △ Less

Submitted 15 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: 10 pages, 13 figures, references added

arXiv:2404.15254 [pdf, other]

UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition

Authors: Bin Wang, Zhuangcheng Gu, Guang Liang, Chao Xu, Bo Zhang, Botian Shi, Conghui He

Abstract: The paper introduces the UniMER dataset, marking the first study on Mathematical Expression Recognition (MER) targeting complex real-world scenarios. The UniMER dataset includes a large-scale training set, UniMER-1M, which offers unprecedented scale and diversity with one million training instances to train high-quality, robust models. Additionally, UniMER features a meticulously designed, diverse… ▽ More The paper introduces the UniMER dataset, marking the first study on Mathematical Expression Recognition (MER) targeting complex real-world scenarios. The UniMER dataset includes a large-scale training set, UniMER-1M, which offers unprecedented scale and diversity with one million training instances to train high-quality, robust models. Additionally, UniMER features a meticulously designed, diverse test set, UniMER-Test, which covers a variety of formula distributions found in real-world scenarios, providing a more comprehensive and fair evaluation. To better utilize the UniMER dataset, the paper proposes a Universal Mathematical Expression Recognition Network (UniMERNet), tailored to the characteristics of formula recognition. UniMERNet consists of a carefully designed encoder that incorporates detail-aware and local context features, and an optimized decoder for accelerated performance. Extensive experiments conducted using the UniMER-1M dataset and UniMERNet demonstrate that training on the large-scale UniMER-1M dataset can produce a more generalizable formula recognition model, significantly outperforming all previous datasets. Furthermore, the introduction of UniMERNet enhances the model's performance in formula recognition, achieving higher accuracy and speeds. All data, models, and code are available at https://github.com/opendatalab/UniMERNet. △ Less

Submitted 5 September, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

Comments: Project Website: https://github.com/opendatalab/UniMERNet

arXiv:2404.13671 [pdf, other]

FiLo: Zero-Shot Anomaly Detection by Fine-Grained Description and High-Quality Localization

Authors: Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Hao Li, Ming Tang, Jinqiao Wang

Abstract: Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image… ▽ More Zero-shot anomaly detection (ZSAD) methods entail detecting anomalies directly without access to any known normal or abnormal samples within the target item categories. Existing approaches typically rely on the robust generalization capabilities of multimodal pretrained models, computing similarities between manually crafted textual features representing "normal" or "abnormal" semantics and image features to detect anomalies and localize anomalous patches. However, the generic descriptions of "abnormal" often fail to precisely match diverse types of anomalies across different object categories. Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). FG-Des introduces fine-grained anomaly descriptions for each category using Large Language Models (LLMs) and employs adaptively learned textual templates to enhance the accuracy and interpretability of anomaly detection. HQ-Loc, utilizing Grounding DINO for preliminary localization, position-enhanced text prompts, and Multi-scale Multi-shape Cross-modal Interaction (MMCI) module, facilitates more accurate localization of anomalies of different sizes and shapes. Experimental results on datasets like MVTec and VisA demonstrate that FiLo significantly improves the performance of ZSAD in both detection and localization, achieving state-of-the-art performance with an image-level AUC of 83.9% and a pixel-level AUC of 95.9% on the VisA dataset. Code is available at https://github.com/CASIA-IVA-Lab/FiLo. △ Less

Submitted 25 July, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

Comments: Accepted by ACM MM 2024

arXiv:2404.09872 [pdf, other]

Conditional Prototype Rectification Prompt Learning

Authors: Haoxing Chen, Yaohui Li, Zizheng Huang, Yan Hong, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Huijia Zhu, Weiqiang Wang

Abstract: Pre-trained large-scale vision-language models (VLMs) have acquired profound understanding of general visual concepts. Recent advancements in efficient transfer learning (ETL) have shown remarkable success in fine-tuning VLMs within the scenario of limited data, introducing only a few parameters to harness task-specific insights from VLMs. Despite significant progress, current leading ETL methods… ▽ More Pre-trained large-scale vision-language models (VLMs) have acquired profound understanding of general visual concepts. Recent advancements in efficient transfer learning (ETL) have shown remarkable success in fine-tuning VLMs within the scenario of limited data, introducing only a few parameters to harness task-specific insights from VLMs. Despite significant progress, current leading ETL methods tend to overfit the narrow distributions of base classes seen during training and encounter two primary challenges: (i) only utilizing uni-modal information to modeling task-specific knowledge; and (ii) using costly and time-consuming methods to supplement knowledge. To address these issues, we propose a Conditional Prototype Rectification Prompt Learning (CPR) method to correct the bias of base examples and augment limited data in an effective way. Specifically, we alleviate overfitting on base classes from two aspects. First, each input image acquires knowledge from both textual and visual prototypes, and then generates sample-conditional text tokens. Second, we extract utilizable knowledge from unlabeled data to further refine the prototypes. These two strategies mitigate biases stemming from base classes, yielding a more effective classifier. Extensive experiments on 11 benchmark datasets show that our CPR achieves state-of-the-art performance on both few-shot classification and base-to-new generalization tasks. Our code is avaliable at \url{https://github.com/chenhaoxing/CPR}. △ Less

Submitted 20 August, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.07598 [pdf, other]

Electro-optically Modulated Nonlinear Metasurfaces

Authors: Zhengqing He, Lun Qu, Wei Wu, Jikun Liu, Jingfei You, Weiye Liu, Lu Bai, Chunyan Jin, Chenxiong Wang, Zhidong Gu, Wei Cai, Mengxin Ren, Jingjun Xu

Abstract: Tunable nonlinearity facilitates the creation of reconfigurable nonlinear metasurfaces, enabling innovative applications in signal processing, light switching, and sensing. This paper presents a novel approach to electrically modulate SHG from a lithium niobate (LN) metasurface, exploiting the electro-optical (EO) effect. By fabricating a nanohole array metasurface on a thin LN film and applying a… ▽ More Tunable nonlinearity facilitates the creation of reconfigurable nonlinear metasurfaces, enabling innovative applications in signal processing, light switching, and sensing. This paper presents a novel approach to electrically modulate SHG from a lithium niobate (LN) metasurface, exploiting the electro-optical (EO) effect. By fabricating a nanohole array metasurface on a thin LN film and applying an electric field, we demonstrate the alteration of the material's refractive index, resulting in resonance shifts and modulation of SHG intensity at specific wavelengths. Our findings provide valuable insights for the development of electrically tunable nonlinear light sources, quantum optics, dynamic nonlinear holography, and nonlinear information processing. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 4 pages, 4 figures

arXiv:2404.05685 [pdf, other]

Global phase diagram of doped quantum spin liquid on the Kagome lattice

Authors: Zheng-Tao Xu, Zheng-Cheng Gu, Shuo Yang

Abstract: It has long been believed that doped quantum spin liquids (QSLs) can give rise to fascinating quantum phases, including the possibility of high-temperature superconductivity (SC) as proposed by P. W. Anderson's resonating valence bond (RVB) scenario. The Kagome lattice $t$-$J$ model is known to exhibit spin liquid behavior at half-filling, making it an ideal system for studying the properties of d… ▽ More It has long been believed that doped quantum spin liquids (QSLs) can give rise to fascinating quantum phases, including the possibility of high-temperature superconductivity (SC) as proposed by P. W. Anderson's resonating valence bond (RVB) scenario. The Kagome lattice $t$-$J$ model is known to exhibit spin liquid behavior at half-filling, making it an ideal system for studying the properties of doped QSL. In this study, we employ the fermionic projected entangled simplex state (PESS) method to investigate the ground state properties of the Kagome lattice $t$-$J$ model with $t/J = 3.0$. Our results reveal a phase transition from charge density wave (CDW) states to uniform states around a critical doping level $δ_c \approx 0.27$. Within the CDW phase, we observe different types of Wigner crystal (WC) formulated by doped holes that are energetically favored. As we enter the uniform phase, a non-Fermi liquid (NFL) state emerges within the doping range $0.27 < δ< 0.32$, characterized by an exponential decay of all correlation functions. With further hole doping, we discover the appearance of a pair density wave (PDW) state within a narrow doping region $0.32 < δ< 1/3$. We also discuss the potential experimental implications of our findings. △ Less

Submitted 8 April, 2024; originally announced April 2024.

Comments: 11 pages, 17 figures

arXiv:2403.16062 [pdf]

Holography inspired self-controlled reconfigurable intelligent surface

Authors: Jieao Zhu, Ze Gu, Qian Ma, Linglong Dai, Tie Jun Cui

Abstract: Among various promising candidate technologies for the sixth-generation (6G) wireless communications, recent advances in microwave metasurfaces have sparked a new research area of reconfigurable intelligent surfaces (RISs). By controllably reprogramming the wireless propagation channel, RISs are envisioned to achieve low-cost wireless capacity boosting, coverage extension, and enhanced energy effi… ▽ More Among various promising candidate technologies for the sixth-generation (6G) wireless communications, recent advances in microwave metasurfaces have sparked a new research area of reconfigurable intelligent surfaces (RISs). By controllably reprogramming the wireless propagation channel, RISs are envisioned to achieve low-cost wireless capacity boosting, coverage extension, and enhanced energy efficiency. To reprogram the channel, each meta-atom on RIS needs an external control signal, which is usually generated by base station (BS). However, BS-controlled RISs require complicated control cables, which hamper their massive deployments. Here, we eliminate the need for BS control by proposing a self-controlled RIS (SC-RIS), which is inspired by the optical holography principle. Different from the existing BS-controlled RISs, each meta-atom of SC-RIS is integrated with an additional power detector for holographic recording. By applying the classical Fourier-transform processing to the measured hologram, SC-RIS is capable of retrieving the user's channel state information required for beamforming, thus enabling autonomous RIS beamforming without control cables. Owing to this WiFi-like plug-and-play capability without the BS control, SC-RISs are expected to enable easy and massive deployments in the future 6G systems. △ Less

Submitted 24 March, 2024; originally announced March 2024.

Comments: Traditional BS-controlled RISs suffer from complicated control cables. To "cut" the control cables, we propose a self-controlled RIS by leveraging the holographic interference principle, thus realizing autonomous RIS beamforming

arXiv:2403.15993 [pdf, other]

Robust-Locomotion-by-Logic: Perturbation-Resilient Bipedal Locomotion via Signal Temporal Logic Guided Model Predictive Control

Authors: Zhaoyuan Gu, Yuntian Zhao, Yipu Chen, Rongming Guo, Jennifer K. Leestma, Gregory S. Sawicki, Ye Zhao

Abstract: This study introduces a robust planning framework that utilizes a model predictive control (MPC) approach, enhanced by incorporating signal temporal logic (STL) specifications. This marks the first-ever study to apply STL-guided trajectory optimization for bipedal locomotion, specifically designed to handle both translational and orientational perturbations. Existing recovery strategies often stru… ▽ More This study introduces a robust planning framework that utilizes a model predictive control (MPC) approach, enhanced by incorporating signal temporal logic (STL) specifications. This marks the first-ever study to apply STL-guided trajectory optimization for bipedal locomotion, specifically designed to handle both translational and orientational perturbations. Existing recovery strategies often struggle with reasoning complex task logic and evaluating locomotion robustness systematically, making them susceptible to failures caused by inappropriate recovery strategies or lack of robustness. To address these issues, we design an analytical robustness metric for bipedal locomotion and quantify this metric using STL specifications, which guide the generation of recovery trajectories to achieve maximum locomotion robustness. To enable safe and computational-efficient crossed-leg maneuver, we design data-driven self-leg-collision constraints that are $1000$ times faster than the traditional inverse-kinematics-based approach. Our framework outperforms a state-of-the-art locomotion controller, a standard MPC without STL, and a linear-temporal-logic-based planner in a high-fidelity dynamic simulation, especially in scenarios involving crossed-leg maneuvers. Additionally, the Cassie bipedal robot achieves robust performance under horizontal and orientational perturbations such as those observed in ship motions. These environments are validated in simulations and deployed on hardware. Furthermore, our proposed method demonstrates versatility on stepping stones and terrain-agnostic features on inclined terrains. △ Less

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2403.15718 [pdf, other]

On a dryout point for a stationary incompressible thermal fluid with phase transition in a pipe

Authors: Yoshikazu Giga, Zhongyang Gu

Abstract: A dryout point is recognized as the position where the phase transition from liquid to vapor occurs. In the one-dimensional case, by solving the stationary incompressible Navier-Stokes-Fourier equations with phase transition, we derive a necessary and sufficient condition for a dryout point to exist when the temperature at the liquid-vapor interface is given. In addition, we show by considering th… ▽ More A dryout point is recognized as the position where the phase transition from liquid to vapor occurs. In the one-dimensional case, by solving the stationary incompressible Navier-Stokes-Fourier equations with phase transition, we derive a necessary and sufficient condition for a dryout point to exist when the temperature at the liquid-vapor interface is given. In addition, we show by considering thermodynamics that the temperature at the dryout point and the density of the vapor phase can be determined by given density and sufficiently small injected mass flux of the liquid phase. △ Less

Submitted 23 March, 2024; originally announced March 2024.

Comments: 20 pages, 12 figures

MSC Class: 35Q79; 76D05; 80A22

arXiv:2403.13433 [pdf, other]

AgentGroupChat: An Interactive Group Chat Simulacra For Better Eliciting Emergent Behavior

Authors: Zhouhong Gu, Xiaoxuan Zhu, Haoran Guo, Lin Zhang, Yin Cai, Hao Shen, Jiangjie Chen, Zheyu Ye, Yifei Dai, Yan Gao, Yao Hu, Hongwei Feng, Yanghua Xiao

Abstract: Language significantly influences the formation and evolution of Human emergent behavior, which is crucial in understanding collective intelligence within human societies. Considering that the study of how language affects human behavior needs to put it into the dynamic scenarios in which it is used, we introduce AgentGroupChat in this paper, a simulation that delves into the complex role of langu… ▽ More Language significantly influences the formation and evolution of Human emergent behavior, which is crucial in understanding collective intelligence within human societies. Considering that the study of how language affects human behavior needs to put it into the dynamic scenarios in which it is used, we introduce AgentGroupChat in this paper, a simulation that delves into the complex role of language in shaping collective behavior through interactive debate scenarios. Central to this simulation are characters engaging in dynamic conversation interactions. To enable simulation, we introduce the Verbal Strategist Agent, utilizing large language models to enhance interaction strategies by incorporating elements of persona and action. We set four narrative scenarios based on AgentGroupChat to demonstrate the simulation's capacity to mimic complex language use in group dynamics. Evaluations focus on aligning agent behaviors with human expectations and the emergence of collective behaviors within the simulation. Results reveal that emergent behaviors materialize from a confluence of factors: a conducive environment for extensive information exchange, characters with diverse traits, high linguistic comprehension, and strategic adaptability. During discussions on ``the impact of AI on humanity'' in AgentGroupChat simulation, philosophers commonly agreed that ``AI could enhance societal welfare with judicious limitations'' and even come to a conclusion that ``the essence of true intelligence encompasses understanding the necessity to constrain self abilities''. Additionally, in the competitive domain of casting for primary roles in films in AgentGroupChat, certain actors were ready to reduce their remuneration or accept lesser roles, motivated by their deep-seated desire to contribute to the project. △ Less

Submitted 4 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.12580 [pdf, other]

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

Authors: Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jianning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, Lizhuang Ma

Abstract: Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However, the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand, most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec, and the differences of methods cannot be well… ▽ More Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However, the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand, most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec, and the differences of methods cannot be well distinguished, leading to a significant gap between public datasets and actual application scenarios. On the other hand, the research on various new practical anomaly detection settings is limited by the scale of the dataset, posing a risk of overfitting in evaluation results. Therefore, we propose a large-scale, Real-world, and multi-view Industrial Anomaly Detection dataset, named Real-IAD, which contains 150K high-resolution images of 30 different objects, an order of magnitude larger than existing datasets. It has a larger range of defect area and ratio proportions, making it more challenging than previous datasets. To make the dataset closer to real application scenarios, we adopted a multi-view shooting method and proposed sample-level evaluation metrics. In addition, beyond the general unsupervised anomaly detection setting, we propose a new setting for Fully Unsupervised Industrial Anomaly Detection (FUIAD) based on the observation that the yield rate in industrial production is usually greater than 60%, which has more practical application value. Finally, we report the results of popular IAD methods on the Real-IAD dataset, providing a highly challenging benchmark to promote the development of the IAD field. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: It is accepted by CVPR2024

arXiv:2403.07825 [pdf, other]

The Missing Piece in Model Editing: A Deep Dive into the Hidden Damage Brought By Model Editing

Authors: Jianchen Wang, Zhouhong Gu, Xiaoxuan Zhu, Lin Zhang, Haoning Ye, Zhuozhi Xiong, Hongwei Feng, Yanghua Xiao

Abstract: Large Language Models have revolutionized numerous tasks with their remarkable efficacy. However, editing these models, crucial for rectifying outdated or erroneous information, often leads to a complex issue known as the ripple effect in the hidden space. While difficult to detect, this effect can significantly impede the efficacy of model editing tasks and deteriorate model performance. This pap… ▽ More Large Language Models have revolutionized numerous tasks with their remarkable efficacy. However, editing these models, crucial for rectifying outdated or erroneous information, often leads to a complex issue known as the ripple effect in the hidden space. While difficult to detect, this effect can significantly impede the efficacy of model editing tasks and deteriorate model performance. This paper addresses this scientific challenge by proposing a novel evaluation methodology, Graphical Impact Evaluation(GIE), which quantitatively evaluates the adaptations of the model and the subsequent impact of editing. Furthermore, we introduce the Selective Impact Revision(SIR), a model editing method designed to mitigate this ripple effect. Our comprehensive evaluations reveal that the ripple effect in the hidden space is a significant issue in all current model editing methods. However, our proposed methods, GIE and SIR, effectively identify and alleviate this issue, contributing to the advancement of LLM editing techniques. △ Less

Submitted 2 July, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.05644 [pdf, other]

TSSS: A Novel Triangulated Spherical Spline Smoothing for Surface-based Data

Authors: Zhiling Gu, Shan Yu, Guannan Wang, Ming-Jun Lai, Li Wang

Abstract: Surface-based data is commonly observed in diverse practical applications spanning various fields. In this paper, we introduce a novel nonparametric method to discover the underlying signals from data distributed on complex surface-based domains. Our approach involves a penalized spline estimator defined on a triangulation of surface patches, which enables effective signal extraction and recovery.… ▽ More Surface-based data is commonly observed in diverse practical applications spanning various fields. In this paper, we introduce a novel nonparametric method to discover the underlying signals from data distributed on complex surface-based domains. Our approach involves a penalized spline estimator defined on a triangulation of surface patches, which enables effective signal extraction and recovery. The proposed method offers several advantages over existing methods, including superior handling of "leakage" or "boundary effects" over complex domains, enhanced computational efficiency, and potential applications in analyzing sparse and irregularly distributed data on complex objects. We provide rigorous theoretical guarantees for the proposed method, including convergence rates of the estimator in both the $L_2$ and supremum norms, as well as the asymptotic normality of the estimator. We also demonstrate that the convergence rates achieved by our estimation method are optimal within the framework of nonparametric estimation. Furthermore, we introduce a bootstrap method to quantify the uncertainty associated with the proposed estimators accurately. The superior performance of the proposed method is demonstrated through simulation experiments and data applications on cortical surface functional magnetic resonance imaging data and oceanic near-surface atmospheric data. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: 56 pages, 16 figures

MSC Class: 62G05; 62G08

arXiv:2403.04652 [pdf, other]

Yi: Open Foundation Models by 01.AI

Authors: 01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie , et al. (7 additional authors not shown)

Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU,… ▽ More We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual representations to the semantic space of the language model. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.01704 [pdf]

Giant second harmonic generation in supertwisted WS2 spirals grown in step edge particle induced non-Euclidean surfaces

Authors: Tong Tong, Ruijie Chen, Yuxuan Ke, Qian Wang, Xinchao Wang, Qinjun Sun, Jie Chen, Zhiyuan Gu, Ying Yu, Hongyan Wei, Yuying Hao, Xiaopeng Fan, Qing Zhang

Abstract: In moiré crystals resulting from the stacking of twisted two-dimensional (2D) layered materials, a subtle adjustment in the twist angle surprisingly gives rise to a wide range of correlated optical and electrical properties. Herein, we report the synthesis of supertwisted WS2 spirals and the observation of giant second harmonic generation (SHG) in these spirals. Supertwisted WS2 spirals featuring… ▽ More In moiré crystals resulting from the stacking of twisted two-dimensional (2D) layered materials, a subtle adjustment in the twist angle surprisingly gives rise to a wide range of correlated optical and electrical properties. Herein, we report the synthesis of supertwisted WS2 spirals and the observation of giant second harmonic generation (SHG) in these spirals. Supertwisted WS2 spirals featuring different twist angles are synthesized on a Euclidean or step-edge particle-induced non-Euclidean surface using a carefully designed water-assisted chemical vapor deposition. We observed an oscillatory dependence of SHG intensity on layer number, attributed to atomically phase-matched nonlinear dipoles within layers of supertwisted spiral crystals where inversion symmetry is restored. Through an investigation into the twist angle evolution of SHG intensity, we discovered that the stacking model between layers plays a crucial role in determining the nonlinearity, and the SHG signals in supertwisted spirals exhibit enhancements by a factor of 2 to 136 when compared with the SHG of the single-layer structure. These findings provide an efficient method for the rational growth of 2D twisted structures and the implementation of twist angle adjustable endowing them great potential for exploring strong coupling correlation physics and applications in the field of twistronics. △ Less

Submitted 19 July, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 26 pages, 4 figures

arXiv:2402.19270 [pdf, other]

Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching

Authors: Rui Gong, Weide Liu, Zaiwang Gu, Xulei Yang, Jun Cheng

Abstract: Geometric knowledge has been shown to be beneficial for the stereo matching task. However, prior attempts to integrate geometric insights into stereo matching algorithms have largely focused on geometric knowledge from single images while crucial cross-view factors such as occlusion and matching uniqueness have been overlooked. To address this gap, we propose a novel Intra-view and Cross-view Geom… ▽ More Geometric knowledge has been shown to be beneficial for the stereo matching task. However, prior attempts to integrate geometric insights into stereo matching algorithms have largely focused on geometric knowledge from single images while crucial cross-view factors such as occlusion and matching uniqueness have been overlooked. To address this gap, we propose a novel Intra-view and Cross-view Geometric knowledge learning Network (ICGNet), specifically crafted to assimilate both intra-view and cross-view geometric knowledge. ICGNet harnesses the power of interest points to serve as a channel for intra-view geometric understanding. Simultaneously, it employs the correspondences among these points to capture cross-view geometric relationships. This dual incorporation empowers the proposed ICGNet to leverage both intra-view and cross-view geometric knowledge in its learning process, substantially improving its ability to estimate disparities. Our extensive experiments demonstrate the superiority of the ICGNet over contemporary leading models. △ Less

Submitted 6 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR2024

arXiv:2402.18986 [pdf, other]

Always be Pre-Training: Representation Learning for Network Intrusion Detection with GNNs

Authors: Zhengyao Gu, Diego Troy Lopez, Lilas Alrahis, Ozgur Sinanoglu

Abstract: Graph neural network-based network intrusion detection systems have recently demonstrated state-of-the-art performance on benchmark datasets. Nevertheless, these methods suffer from a reliance on target encoding for data pre-processing, limiting widespread adoption due to the associated need for annotated labels--a cost-prohibitive requirement. In this work, we propose a solution involving in-cont… ▽ More Graph neural network-based network intrusion detection systems have recently demonstrated state-of-the-art performance on benchmark datasets. Nevertheless, these methods suffer from a reliance on target encoding for data pre-processing, limiting widespread adoption due to the associated need for annotated labels--a cost-prohibitive requirement. In this work, we propose a solution involving in-context pre-training and the utilization of dense representations for categorical features to jointly overcome the label-dependency limitation. Our approach exhibits remarkable data efficiency, achieving over 98% of the performance of the supervised state-of-the-art with less than 4% labeled data on the NF-UQ-NIDS-V2 dataset. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: Will appear in the 2024 International Symposium on Quality Electronic Design (ISQED'24)

Showing 1–50 of 407 results for author: Gu, Z