Showing 1–50 of 4,225 results for author: Li, H

Searching in archive cs.
  1. arXiv:2409.03512  [pdf, other]

    cs.CY cs.CL

    From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

    Authors: Jifan Yu, Zheyuan Zhang, Daniel Zhang-li, Shangqing Tu, Zhanxin Hao, Rui Miao Li, Haoxuan Li, Yuanchun Wang, Hanming Li, Linlu Gong, Jie Cao, Jiayin Lin, Jinchang Zhou, Fei Qin, Haohua Wang, Jianxiao Jiang, Lijun Deng, Yisi Zhan, Chaojun Xiao, Xusheng Dai, Xuan Yan, Nianyi Lin, Nan Zhang, Ruixin Ni, Yang Dang, et al. (8 additional authors not shown)

    Abstract: Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integ…

    Submitted 5 September, 2024; originally announced September 2024.

  2. arXiv:2409.03501  [pdf, other]

    cs.CV

    Towards Data-Centric Face Anti-Spoofing: Improving Cross-domain Generalization via Physics-based Data Synthesis

    Authors: Rizhao Cai, Cecelia Soh, Zitong Yu, Haoliang Li, Wenhan Yang, Alex Kot

    Abstract: Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is large…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted by International Journal of Computer Vision (IJCV) in Sept 2024

  3. arXiv:2409.03358  [pdf, other]

    cs.CV cs.LG cs.RO

    MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice

    Authors: Friedhelm Hamann, Hanxiong Li, Paul Mieske, Lars Lewejohann, Guillermo Gallego

    Abstract: Enabled by large annotated datasets, tracking and segmentation of objects in videos has made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotate…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: 18 pages, 5 figures, ECCV Workshops

  4. arXiv:2409.03354  [pdf, other]

    cs.CV

    Few-Shot Continual Learning for Activity Recognition in Classroom Surveillance Images

    Authors: Yilei Qian, Kanglei Geng, Kailong Chen, Shaoxu Cheng, Linfeng Xu, Hongliang Li, Fanman Meng, Qingbo Wu

    Abstract: The application of activity recognition in the "AI + Education" field is gaining increasing attention. However, current work mainly focuses on the recognition of activities in manually captured videos and a limited number of activity types, with little attention given to recognizing activities in surveillance images from real classrooms. In real classroom settings, normal teaching activities such…

    Submitted 5 September, 2024; originally announced September 2024.

  5. arXiv:2409.03251  [pdf, other]

    cs.HC cs.LG eess.SY

    Dual-TSST: A Dual-Branch Temporal-Spectral-Spatial Transformer Model for EEG Decoding

    Authors: Hongqi Li, Haodong Zhang, Yitong Chen

    Abstract: The decoding of electroencephalography (EEG) signals allows access to user intentions conveniently, which plays an important role in the fields of human-machine interaction. To effectively extract sufficient characteristics of the multichannel EEG, a novel decoding architecture network with a dual-branch temporal-spectral-spatial transformer (Dual-TSST) is proposed in this study. Specifically, by…

    Submitted 5 September, 2024; originally announced September 2024.

  6. arXiv:2409.02489  [pdf, other]

    cs.SD cs.AI eess.AS

    NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

    Authors: Dashanka De Silva, Siqi Cai, Saurav Pahuja, Tanja Schultz, Haizhou Li

    Abstract: In the study of auditory attention, it has been revealed that there exists a robust correlation between attended speech and elicited neural responses, measurable through electroencephalography (EEG). Therefore, it is possible to use the attention information available within EEG signals to guide the extraction of the target speaker in a cocktail party computationally. In this paper, we present a n…

    Submitted 4 September, 2024; originally announced September 2024.

  7. arXiv:2409.02189  [pdf, other]

    cs.LG

    Collaboratively Learning Federated Models from Noisy Decentralized Data

    Authors: Haoyuan Li, Mathias Funk, Nezihe Merve Gürel, Aaqib Saeed

    Abstract: Federated learning (FL) has emerged as a prominent method for collaboratively training machine learning models using local data from edge devices, all while keeping data decentralized. However, accounting for the quality of data contributed by local clients remains a critical challenge in FL, as local data are often susceptible to corruption by various forms of noise and perturbations, which compr…

    Submitted 3 September, 2024; originally announced September 2024.

  8. arXiv:2409.01867  [pdf, other]

    cs.HC

    ASD-Chat: An Innovative Dialogue Intervention System for Children with Autism based on LLM and VB-MAPP

    Authors: Chengyun Deng, Shuzhong Lai, Chi Zhou, Mengyi Bao, Jingwen Yan, Haifeng Li, Lin Yao, Yueming Wang

    Abstract: Early diagnosis and professional intervention can help children with autism spectrum disorder (ASD) return to normal life. However, the scarcity and imbalance of professional medical resources currently prevent many autistic children from receiving the necessary diagnosis and intervention. Therefore, numerous paradigms have been proposed that use computer technology to assist or independently cond…

    Submitted 3 September, 2024; originally announced September 2024.

  9. arXiv:2409.01806  [pdf, other]

    cs.AI cs.CL cs.LG

    LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning

    Authors: Haoming Li, Zhaoliang Chen, Jonathan Zhang, Fei Liu

    Abstract: Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions ne…

    Submitted 3 September, 2024; originally announced September 2024.

  10. arXiv:2409.01658  [pdf, other]

    cs.CL

    From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning

    Authors: Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yonggang Zhang, Wenxiao Wan, Xu Shen, Jieping Ye

    Abstract: Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, while it typically leads…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted by ICML 2024

  11. arXiv:2409.01347  [pdf, other]

    cs.CV

    Target-Driven Distillation: Consistency Distillation with Target Timestep Selection and Decoupled Guidance

    Authors: Cunzheng Wang, Ziyuan Guo, Yuxuan Duan, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu

    Abstract: Consistency distillation methods have demonstrated significant success in accelerating generative tasks of diffusion models. However, since previous consistency distillation methods use simple and straightforward strategies in selecting target timesteps, they usually struggle with blurs and detail losses in generated images. To address these limitations, we introduce Target-Driven Distillation (TD…

    Submitted 2 September, 2024; originally announced September 2024.

  12. arXiv:2409.01327  [pdf, other]

    cs.CV

    SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation

    Authors: Yang Zhang, Rui Zhang, Xuecheng Nie, Haochen Li, Jikun Chen, Yifan Hao, Xin Zhang, Luoqi Liu, Ling Li

    Abstract: Recent text-to-image models have achieved remarkable success in generating high-quality images. However, when tasked with multi-concept generation which creates images containing multiple characters or objects, existing methods often suffer from attribute confusion, resulting in severe text-image inconsistency. We found that attribute confusion occurs when a certain region of the latent features a…

    Submitted 2 September, 2024; originally announced September 2024.

  13. arXiv:2409.01212  [pdf, other]

    cs.CV

    MobileIQA: Exploiting Mobile-level Diverse Opinion Network For No-Reference Image Quality Assessment Using Knowledge Distillation

    Authors: Zewen Chen, Sunhan Xu, Yun Zeng, Haochen Guo, Jian Guo, Shuai Liu, Juan Wang, Bing Li, Weiming Hu, Dehua Liu, Hesong Li

    Abstract: With the rising demand for high-resolution (HR) images, No-Reference Image Quality Assessment (NR-IQA) gains more attention, as it can evaluate image quality in real-time on mobile devices and enhance user experience. However, existing NR-IQA methods often resize or crop the HR images into small resolution, which leads to a loss of important details. And most of them are of high computational comp…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted by ECCV Workshop 2024

  14. arXiv:2409.01113  [pdf, other]

    cs.CV

    KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

    Authors: Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, Shuangping Huang

    Abstract: We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings. Despite recent advancements in data-driven techniques, accurately mapping between audio signals and 3D facial meshes remains challenging. Direct regression of the entire sequence often leads to over-smoothed results due to the ill-posed nature of the problem. To this end, we propose a p…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted by ECCV 2024

  15. arXiv:2409.01037  [pdf, other]

    cs.CL

    NYK-MS: A Well-annotated Multi-modal Metaphor and Sarcasm Understanding Benchmark on Cartoon-Caption Dataset

    Authors: Ke Chang, Hao Li, Junzhao Zhang, Yunfang Wu

    Abstract: Metaphor and sarcasm are common figurative expressions in people's communication, especially on the Internet or the memes popular among teenagers. We create a new benchmark named NYK-MS (NewYorKer for Metaphor and Sarcasm), which contains 1,583 samples for metaphor understanding tasks and 1,578 samples for sarcasm understanding tasks. These tasks include whether it contains metaphor/sarcasm, which…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: 13 pages, 6 figures

  16. arXiv:2409.00968  [pdf, other]

    math.OC cs.AI cs.LG

    Solving Integrated Process Planning and Scheduling Problem via Graph Neural Network Based Deep Reinforcement Learning

    Authors: Hongpei Li, Han Zhang, Ziyan He, Yunkai Jia, Bo Jiang, Xiang Huang, Dongdong Ge

    Abstract: The Integrated Process Planning and Scheduling (IPPS) problem combines process route planning and shop scheduling to achieve high efficiency in manufacturing and maximize resource utilization, which is crucial for modern manufacturing systems. Traditional methods using Mixed Integer Linear Programming (MILP) and heuristic algorithms can not well balance solution quality and speed when solving IPPS…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: 24 pages, 13 figures

  17. arXiv:2409.00346  [pdf, other]

    cs.CV

    SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

    Authors: Fuchen Zheng, Xuhang Chen, Weihuang Liu, Haolun Li, Yingtie Lei, Jiahui He, Chi-Man Pun, Shounjun Zhou

    Abstract: In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture th…

    Submitted 31 August, 2024; originally announced September 2024.

    Comments: Accepted by BIBM 2024

  18. arXiv:2409.00317  [pdf, other]

    cs.CV

    FBD-SV-2024: Flying Bird Object Detection Dataset in Surveillance Video

    Authors: Zi-Wei Sun, Ze-Xi Hua, Heng-Chao Li, Zhi-Peng Qi, Xiang Li, Yan Li, Jin-Chi Zhang

    Abstract: A Flying Bird Dataset for Surveillance Videos (FBD-SV-2024) is introduced and tailored for the development and performance evaluation of flying bird detection algorithms in surveillance videos. This dataset comprises 483 video clips, amounting to 28,694 frames in total. Among them, 23,833 frames contain 28,366 instances of flying birds. The proposed dataset of flying birds in surveillance videos i…

    Submitted 30 August, 2024; originally announced September 2024.

  19. arXiv:2409.00005  [pdf, other]

    cs.IT cs.AI

    Csi-LLM: A Novel Downlink Channel Prediction Method Aligned with LLM Pre-Training

    Authors: Shilong Fan, Zhenyu Liu, Xinyu Gu, Haozhen Li

    Abstract: Downlink channel temporal prediction is a critical technology in massive multiple-input multiple-output (MIMO) systems. However, existing methods that rely on fixed-step historical sequences significantly limit the accuracy, practicality, and scalability of channel prediction. Recent advances have shown that large language models (LLMs) exhibit strong pattern recognition and reasoning abilities ov…

    Submitted 15 August, 2024; originally announced September 2024.

  20. arXiv:2408.16986  [pdf, other]

    cs.CV

    AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

    Authors: Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

    Abstract: Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visu…

    Submitted 29 August, 2024; originally announced August 2024.

  21. arXiv:2408.16564  [pdf, other]

    cs.MM cs.SD eess.AS

    Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

    Authors: Qianhui Liu, Jiadong Wang, Yang Wang, Xin Yang, Gang Pan, Haizhou Li

    Abstract: Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multi…

    Submitted 29 August, 2024; originally announced August 2024.

  22. arXiv:2408.16540  [pdf, other]

    cs.CV

    GRPose: Learning Graph Relations for Human Image Generation with Pose Priors

    Authors: Xiangchen Yin, Donglin Di, Lei Fan, Hao Li, Chen Wei, Xiaofei Gou, Yang Song, Xiao Sun, Xun Yang

    Abstract: Recent methods using diffusion models have made significant progress in human image generation with various additional controls such as pose priors. However, existing approaches still struggle to generate high-quality images with consistent pose alignment, resulting in unsatisfactory outputs. In this paper, we propose a framework delving into the graph relations of pose priors to provide control i…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: The code will be released at https://github.com/XiangchenYin/GRPose

  23. arXiv:2408.16340  [pdf, other]

    eess.IV cs.CV

    Learned Image Transmission with Hierarchical Variational Autoencoder

    Authors: Guangyi Zhang, Hanlei Li, Yunlong Cai, Qiyu Hu, Guanding Yu, Runmin Zhang

    Abstract: In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly m…

    Submitted 3 September, 2024; v1 submitted 29 August, 2024; originally announced August 2024.

  24. arXiv:2408.16254  [pdf, other]

    cs.CV

    EvLight++: Low-Light Video Enhancement with an Event Camera: A Large-Scale Real-World Dataset, Novel Method, and More

    Authors: Kanghao Chen, Guoqiang Liang, Hangyu Li, Yunfan Lu, Lin Wang

    Abstract: Event cameras offer significant advantages for low-light video enhancement, primarily due to their high dynamic range. Current research, however, is severely limited by the absence of large-scale, real-world, and spatio-temporally aligned event-video datasets. To address this, we introduce a large-scale dataset with over 30,000 pairs of frames and events captured under varying illumination. This d…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: Journal extension based on EvLight (arXiv:2404.00834)

  25. arXiv:2408.15980  [pdf, other]

    cs.RO cs.AI

    In-Context Imitation Learning via Next-Token Prediction

    Authors: Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, Ken Goldberg

    Abstract: We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor traj…

    Submitted 28 August, 2024; originally announced August 2024.

  26. arXiv:2408.15881  [pdf, other]

    cs.CV

    LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

    Authors: Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang

    Abstract: We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, s…

    Submitted 28 August, 2024; originally announced August 2024.

  27. arXiv:2408.15876  [pdf, other]

    cs.CV

    Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

    Authors: Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu

    Abstract: In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spa…

    Submitted 28 August, 2024; originally announced August 2024.

  28. arXiv:2408.15045  [pdf, other]

    cs.CV

    DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

    Authors: Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin

    Abstract: Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content. With the rapid evolution of large language models (LLMs), they have been widely leveraged for TDU due to their remarkable versatility and generalization. In this paper, we introduce DocLayLLM, an efficient and effective multi-modal extension of LLMs specifically designed f…

    Submitted 28 August, 2024; v1 submitted 27 August, 2024; originally announced August 2024.

  29. arXiv:2408.14975  [pdf, other]

    cs.CV

    MegActor-$Σ$: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

    Authors: Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan, Jin Wang

    Abstract: Diffusion models have demonstrated superior performance in the field of portrait animation. However, current approaches relied on either visual or audio modality to control character movements, failing to exploit the potential of mixed-modal control. This challenge arises from the difficulty in balancing the weak control strength of audio modality and the strong control strength of visual modality…

    Submitted 27 August, 2024; originally announced August 2024.

  30. arXiv:2408.14950  [pdf, other]

    cs.CV cs.AI

    NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework

    Authors: Shuangchen Zhao, Changde Du, Hui Li, Huiguang He

    Abstract: Deep Neural Networks (DNNs) have demonstrated exceptional recognition capabilities in traditional computer vision (CV) tasks. However, existing CV models often suffer a significant decrease in accuracy when confronted with out-of-distribution (OOD) data. In contrast to these DNN models, humans can maintain a consistently low error rate when facing OOD scenes, partly attributed to the rich prior cog…

    Submitted 27 August, 2024; originally announced August 2024.

  31. arXiv:2408.14507  [pdf, other]

    cs.DB cs.AI

    Cost-Aware Uncertainty Reduction in Schema Matching with GPT-4: The Prompt-Matcher Framework

    Authors: Longyu Feng, Huahang Li, Chen Jason Zhang

    Abstract: Schema matching is the process of identifying correspondences between the elements of two given schemata, essential for database management systems, data integration, and data warehousing. The inherent uncertainty of current schema matching algorithms leads to the generation of a set of candidate matches. Storing these results necessitates the use of databases and systems capable of handling proba…

    Submitted 24 August, 2024; originally announced August 2024.

  32. arXiv:2408.14244  [pdf, other]

    cs.CV

    Cascaded Temporal Updating Network for Efficient Video Super-Resolution

    Authors: Hao Li, Jiangxin Dong, Jinshan Pan

    Abstract: Existing video super-resolution (VSR) methods generally adopt a recurrent propagation network to extract spatio-temporal information from the entire video sequences, exhibiting impressive performance. However, the key components in recurrent-based VSR networks significantly impact model efficiency, e.g., the alignment module occupies a substantial portion of model parameters, while the bidirection…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Project website: https://github.com/House-Leo/CTUN

  33. arXiv:2408.13852  [pdf, other]

    cs.CV

    LaneTCA: Enhancing Video Lane Detection with Temporal Context Aggregation

    Authors: Keyi Zhou, Li Li, Wengang Zhou, Yonghui Wang, Hao Feng, Houqiang Li

    Abstract: In video lane detection, there are rich temporal contexts among successive frames, which is under-explored in existing lane detectors. In this work, we propose LaneTCA to bridge the individual video frames and explore how to effectively aggregate the temporal context. Technically, we develop an accumulative attention module and an adjacent attention module to abstract the long-term and short-term…

    Submitted 25 August, 2024; originally announced August 2024.

  34. arXiv:2408.13836  [pdf, other]

    cs.CV cs.AI

    PropSAM: A Propagation-Based Model for Segmenting Any 3D Objects in Multi-Modal Medical Images

    Authors: Zifan Chen, Xinyu Nan, Jiazheng Li, Jie Zhao, Haifeng Li, Zilin Lin, Haoshen Li, Heyun Chen, Yiting Liu, Bin Dong, Li Zhang, Lei Tang

    Abstract: Volumetric segmentation is crucial for medical imaging but is often constrained by labor-intensive manual annotations and the need for scenario-specific model training. Furthermore, existing general segmentation models are inefficient due to their design and inferential approaches. Addressing this clinical demand, we introduce PropSAM, a propagation-based segmentation model that optimizes the use…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: 26 figures, 6 figures

  35. arXiv:2408.13832  [pdf, other]

    eess.IV cs.CV

    A Low-dose CT Reconstruction Network Based on TV-regularized OSEM Algorithm

    Authors: Ran An, Yinghui Zhang, Xi Chen, Lemeng Li, Ke Chen, Hongwei Li

    Abstract: Low-dose computed tomography (LDCT) offers significant advantages in reducing the potential harm to human bodies. However, reducing the X-ray dose in CT scanning often leads to severe noise and artifacts in the reconstructed images, which might adversely affect diagnosis. By utilizing the expectation maximization (EM) algorithm, statistical priors could be combined with artificial priors to improv…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: 11 pages, 8 figures

    ACM Class: I.4.5

  36. arXiv:2408.13745  [pdf, other]

    cs.CL cs.AI cs.PL

    DOCE: Finding the Sweet Spot for Execution-Based Code Generation

    Authors: Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins

    Abstract: Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution, a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decodi…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: 10 pages (32 including appendix), 5 figures, 25 tables. arXiv admin note: text overlap with arXiv:2304.05128 by other authors

  37. arXiv:2408.13674  [pdf, other]

    cs.CV

    GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

    Authors: Keqiang Sun, Amin Jourabloo, Riddhish Bhalodia, Moustafa Meshry, Yu Rong, Zhengyu Yang, Thu Nguyen-Phuoc, Christian Haene, Jiu Xu, Sam Johnson, Hongsheng Li, Sofien Bouaziz

    Abstract: Photo-realistic and controllable 3D avatars are crucial for various applications such as virtual and mixed reality (VR/MR), telepresence, gaming, and film production. Traditional methods for avatar creation often involve time-consuming scanning and reconstruction processes for each avatar, which limits their scalability. Furthermore, these methods do not offer the flexibility to sample new identit…

    Submitted 24 August, 2024; originally announced August 2024.

  38. arXiv:2408.13463  [pdf, other]

    cs.CV

    HabitAction: A Video Dataset for Human Habitual Behavior Recognition

    Authors: Hongwu Li, Zhenliang Zhang, Wei Wang

    Abstract: Human Action Recognition (HAR) is a very crucial task in computer vision. It helps to carry out a series of downstream tasks, like understanding human behaviors. Due to the complexity of human behaviors, many highly valuable behaviors are not yet encompassed within the available datasets for HAR, e.g., human habitual behaviors (HHBs). HHBs hold significant importance for analyzing a person's perso…

    Submitted 24 August, 2024; originally announced August 2024.

  39. arXiv:2408.13195  [pdf, other]

    cs.AR cs.LG

    NAS-Cap: Deep-Learning Driven 3-D Capacitance Extraction with Neural Architecture Search and Data Augmentation

    Authors: Haoyuan Li, Dingcheng Yang, Chunyan Pei, Wenjian Yu

    Abstract: More accurate capacitance extraction is demanded for designing integrated circuits under advanced process technology. The pattern matching approach and the field solver for capacitance extraction have the drawbacks of inaccuracy and large computational cost, respectively. Recent work \cite{yang2023cnn} proposes a grid-based data representation and a convolutional neural network (CNN) based capacit…

    Submitted 23 August, 2024; originally announced August 2024.

  40. arXiv:2408.12830  [pdf, other]

    cs.LG stat.ML

    SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning

    Authors: Wang Luo, Haoran Li, Zicheng Zhang, Congying Han, Jiayu Lv, Tiande Guo

    Abstract: Model-based Offline Reinforcement Learning trains policies based on offline datasets and model dynamics, without direct real-world environment interactions. However, this method is inherently challenged by distribution shift. Previous approaches have primarily focused on tackling this issue directly leveraging off-policy mechanisms and heuristic uncertainty in model dynamics, but they resulted in…

    Submitted 23 August, 2024; originally announced August 2024.

  41. arXiv:2408.12791  [pdf, other]

    cs.CV

    Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture

    Authors: Chenqi Kong, Anwei Luo, Peijun Bao, Haoliang Li, Renjie Wan, Zengwei Zheng, Anderson Rocha, Alex C. Kot

    Abstract: Open-set face forgery detection poses significant security threats and presents substantial challenges for existing detection models. These detectors primarily have two limitations: they cannot generalize across unknown forgery domains and inefficiently adapt to new data. To address these issues, we introduce an approach that is both general and parameter-efficient for face forgery detection. It b…

    Submitted 22 August, 2024; originally announced August 2024.

  42. arXiv:2408.12733  [pdf, other]

    cs.AI cs.CL cs.DB cs.LG

    SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

    Authors: Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik

    Abstract: Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by…

    Submitted 22 August, 2024; originally announced August 2024.

  43. arXiv:2408.12609  [pdf, ps, other]

    cs.RO cs.AI

    Enhanced Prediction of Multi-Agent Trajectories via Control Inference and State-Space Dynamics

    Authors: Yu Zhang, Yongxiang Zou, Haoyu Zhang, Zeyu Liu, Houcheng Li, Long Cheng

    Abstract: In the field of autonomous systems, accurately predicting the trajectories of nearby vehicles and pedestrians is crucial for ensuring both safety and operational efficiency. This paper introduces a novel methodology for trajectory forecasting based on state-space dynamic system modeling, which endows agents with models that have tangible physical implications. To enhance the precision of state est…

    Submitted 8 August, 2024; originally announced August 2024.

  44. arXiv:2408.12352  [pdf, other]

    cs.CV

    GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

    Authors: Shiyue Zhang, Zheng Chong, Xujie Zhang, Hanhui Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

    Abstract: General text-to-image models bring revolutionary innovation to the fields of arts, design, and media. However, when applied to garment generation, even the state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. Addressing this, we propose GarmentAligner, a text-to-garment diffus…

    Submitted 23 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  45. arXiv:2408.12249  [pdf, other]

    cs.CL cs.AI cs.LG

    LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

    Authors: Aishik Nagar, Viktor Schlegel, Thanh-Tung Nguyen, Hao Li, Yuping Wu, Kuluhan Binici, Stefan Winkler

    Abstract: Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in th…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: 11 pages

  46. arXiv:2408.12245  [pdf, other]

    cs.CV

    Scalable Autoregressive Image Generation with Mamba

    Authors: Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li

    Abstract: We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Un…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: 9 pages, 8 figures

  47. arXiv:2408.12152  [pdf, other]

    cs.IR

    Behavior Pattern Mining-based Multi-Behavior Recommendation

    Authors: Haojie Li, Zhiyong Cheng, Xu Yu, Jinhuan Liu, Guanfeng Liu, Junwei Du

    Abstract: Multi-behavior recommendation systems enhance effectiveness by leveraging auxiliary behaviors (such as page views and favorites) to address the limitations of traditional models that depend solely on sparse target behaviors like purchases. Existing approaches to multi-behavior recommendations typically follow one of two strategies: some derive initial node representations from individual behavior…

    Submitted 22 August, 2024; originally announced August 2024.

  48. arXiv:2408.11878  [pdf, other]

    cs.CL cs.CE q-fin.CP

    Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

    Authors: Qianqian Xie, Dong Li, Mengxi Xiao, Zihao Jiang, Ruoyu Xiang, Xiao Zhang, Zhengyu Chen, Yueru He, Weiguang Han, Yuzhe Yang, Shunian Chen, Yifei Zhang, Lihang Shen, Daniel Kim, Zhiwei Liu, Zheheng Luo, Yangyang Yu, Yupeng Cao, Zhiyang Deng, Zhiyuan Yao, Haohang Li, Duanyu Feng, Yongfu Dai, VijayaSai Somasundaram, Peng Lu, et al. (14 additional authors not shown)

    Abstract: Large language models (LLMs) have advanced financial applications, yet they often lack sufficient financial knowledge and struggle with tasks involving multi-modal inputs like tables and time series data. To address these limitations, we introduce Open-FinLLMs, a series of Financial LLMs. We begin with FinLLaMA, pre-trained on a 52 billion token financial corpus, incorporating text, table…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 33 pages, 13 figures

  49. arXiv:2408.11795  [pdf, other]

    cs.CV

    EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

    Authors: Feipeng Ma, Yizhou Zhou, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

    Abstract: In the realm of multimodal research, numerous studies leverage substantial image-text pairs to conduct modal alignment learning, transforming Large Language Models (LLMs) into Multimodal LLMs and excelling in a variety of visual-language tasks. The prevailing methodologies primarily fall into two categories: self-attention-based and cross-attention-based methods. While self-attention-based methods…

    Submitted 21 August, 2024; originally announced August 2024.

  50. arXiv:2408.11744  [pdf]

    cs.AI cs.CV

    JieHua Paintings Style Feature Extracting Model using Stable Diffusion with ControlNet

    Authors: Yujia Gu, Haofeng Li, Xinyu Fang, Zihan Peng, Yinan Peng

    Abstract: This study proposes a novel approach to extract stylistic features of Jiehua: the utilization of the Fine-tuned Stable Diffusion Model with ControlNet (FSDMC) to refine depiction techniques from artists' Jiehua. The training data for FSDMC is based on the open-source Jiehua artist's work collected from the Internet, which were subsequently manually constructed in the format of (Original Image, Cann…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: accepted by ICCSMT 2024