
Showing 1–50 of 1,211 results for author: Chen, Q

Searching in archive cs.
  1. arXiv:2409.18014  [pdf, other]

    cs.AI

    Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles

    Authors: Lewei He, Tianyu Shi, Pengran Huang, Bingzhi Chen, Qianglong Chen, Jiahui Pan

    Abstract: Long-context processing with large language models (LLMs) remains challenging because of implementation complexity, training efficiency and data sparsity. To address this issue, a new paradigm named Online Long-context Processing (OLP) is proposed for processing a document of unlimited length, which typically occurs in the information reception and organization of diverse streaming media…

    Submitted 26 September, 2024; originally announced September 2024.

  2. arXiv:2409.16537  [pdf]

    cs.LG

    A QoE-Aware Split Inference Accelerating Algorithm for NOMA-based Edge Intelligence

    Authors: Xin Yuan, Ning Li, Quan Chen, Wenchao Xu, Zhaoxin Zhang, Song Guo

    Abstract: Although AI has been widely used and has significantly changed our lives, deploying large AI models directly on resource-limited edge devices is not appropriate. Thus, model split inference is proposed to improve the performance of edge intelligence, in which the AI model is divided into different sub-models and the resource-intensive sub-model is offloaded to the edge server wirelessly for reducin…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: 16 pages, 19 figures. arXiv admin note: substantial text overlap with arXiv:2312.15850

  3. arXiv:2409.15710  [pdf, other]

    cs.RO cs.AI eess.SY

    Autotuning Bipedal Locomotion MPC with GRFM-Net for Efficient Sim-to-Real Transfer

    Authors: Qianzhong Chen, Junheng Li, Sheng Cheng, Naira Hovakimyan, Quan Nguyen

    Abstract: Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuni…

    Submitted 23 September, 2024; originally announced September 2024.

  4. arXiv:2409.15644  [pdf, other]

    cs.HC

    PolicyCraft: Supporting Collaborative and Participatory Policy Design through Case-Grounded Deliberation

    Authors: Tzu-Sheng Kuo, Quan Ze Chen, Amy X. Zhang, Jane Hsieh, Haiyi Zhu, Kenneth Holstein

    Abstract: Community and organizational policies are typically designed in a top-down, centralized fashion, with limited input from impacted stakeholders. This can result in policies that are misaligned with community needs or perceived as illegitimate. How can we support more collaborative, participatory approaches to policy design? In this paper, we present PolicyCraft, a system that structures collaborati…

    Submitted 23 September, 2024; originally announced September 2024.

  5. arXiv:2409.15584  [pdf, other]

    cs.CV cs.AI

    FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality

    Authors: Junyuan Ding, Ziteng Wang, Chang Gao, Min Liu, Qinyu Chen

    Abstract: Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: 8 pages, 5 figures

  6. arXiv:2409.15087  [pdf]

    eess.IV cs.CV cs.LG

    Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation, and Continual Learning

    Authors: Qingyu Chen, Tiarnan D L Keenan, Elvira Agron, Alexis Allot, Emily Guan, Bryant Duong, Amr Elsawy, Benjamin Hou, Cancan Xue, Sanjeeb Bhandari, Geoffrey Broadhead, Chantal Cousineau-Krieger, Ellen Davis, William G Gensheimer, David Grasic, Seema Gupta, Luis Haddock, Eleni Konstantinou, Tania Lamba, Michele Maiberger, Dimosthenis Mantopoulos, Mitul C Mehta, Ayman G Nahri, Mutaz AL-Nawaflh, Arnold Oshinsky, et al. (13 additional authors not shown)

    Abstract: Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diag…

    Submitted 23 September, 2024; originally announced September 2024.

  7. arXiv:2409.14874  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images

    Authors: Ahjol Senbi, Tianyu Huang, Fei Lyu, Qing Li, Yuhui Tao, Wei Shao, Qiang Chen, Chengyan Wang, Shuo Wang, Tao Zhou, Yizhe Zhang

    Abstract: We explore the feasibility and potential of building a ground-truth-free evaluation model to assess the quality of segmentations generated by the Segment Anything Model (SAM) and its variants in medical imaging. This evaluation model estimates segmentation quality scores by analyzing the coherence and consistency between the input images and their corresponding segmentation predictions. Based on p…

    Submitted 24 September, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: 17 pages, 15 figures

  8. arXiv:2409.14655  [pdf, other]

    cs.DC cs.CR cs.LG

    Federated Graph Learning with Adaptive Importance-based Sampling

    Authors: Anran Li, Yuanyuan Chen, Chao Ren, Wenhan Wang, Ming Hu, Tianlin Li, Han Yu, Qingyu Chen

    Abstract: For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the explosively increasing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ig…

    Submitted 22 September, 2024; originally announced September 2024.

  9. arXiv:2409.14113  [pdf, other]

    eess.IV cs.CV

    Accelerated Multi-Contrast MRI Reconstruction via Frequency and Spatial Mutual Learning

    Authors: Qi Chen, Xiaohan Xing, Zhen Chen, Zhiwei Xiong

    Abstract: To accelerate Magnetic Resonance (MR) imaging procedures, Multi-Contrast MR Reconstruction (MCMR) has become a prevalent trend that utilizes an easily obtainable modality as an auxiliary to support high-quality reconstruction of the target modality with under-sampled k-space measurements. The exploration of global dependency and complementary information across different modalities is essential fo…

    Submitted 21 September, 2024; originally announced September 2024.

    Comments: Accepted as a poster by Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024

  10. arXiv:2409.13902  [pdf]

    cs.CL cs.AI

    Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

    Authors: Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D. L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen

    Abstract: Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that ret…

    Submitted 20 September, 2024; originally announced September 2024.

  11. arXiv:2409.13540  [pdf, other]

    cs.CV

    FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

    Authors: Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the sim…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: 7 pages, 5 figures, 2 tables

  12. arXiv:2409.11653  [pdf, other]

    cs.LG cs.CV

    Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

    Authors: Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu

    Abstract: Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The…

    Submitted 17 September, 2024; originally announced September 2024.

    Comments: Under Review

  13. arXiv:2409.11240  [pdf, other]

    cs.LG cs.DC

    Federated Learning with Integrated Sensing, Communication, and Computation: Frameworks and Performance Analysis

    Authors: Yipeng Liang, Qimei Chen, Hao Jiang

    Abstract: With the emergence of integrated sensing, communication, and computation (ISCC) in the upcoming 6G era, federated learning with ISCC (FL-ISCC), integrating sample collection, local training, and parameter exchange and aggregation, has garnered increasing interest for enhancing training efficiency. Currently, FL-ISCC primarily includes two algorithms: FedAVG-ISCC and FedSGD-ISCC. However, the theor…

    Submitted 17 September, 2024; originally announced September 2024.

    Comments: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

  14. arXiv:2409.11147  [pdf, other]

    cs.CL

    Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning

    Authors: Yukang Lin, Bingchen Zhong, Shuoran Jiang, Joanna Siebert, Qingcai Chen

    Abstract: Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate e…

    Submitted 17 September, 2024; originally announced September 2024.

  15. arXiv:2409.10516  [pdf, other]

    cs.LG cs.CL

    RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

    Authors: Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu

    Abstract: Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation…

    Submitted 18 September, 2024; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: 16 pages

  16. arXiv:2409.10372  [pdf, other]

    cs.AI cs.CL cs.CY cs.GT

    Instigating Cooperation among LLM Agents Using Adaptive Information Modulation

    Authors: Qiliang Chen, Sepehr Ilami, Nunzio Lore, Babak Heydari

    Abstract: This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (…

    Submitted 19 September, 2024; v1 submitted 16 September, 2024; originally announced September 2024.

  17. Revisiting Physical-World Adversarial Attack on Traffic Sign Recognition: A Commercial Systems Perspective

    Authors: Ningfei Wang, Shaoyuan Xie, Takami Sato, Yunpeng Luo, Kaidi Xu, Qi Alfred Chen

    Abstract: Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating…

    Submitted 15 September, 2024; originally announced September 2024.

    Comments: Accepted by NDSS 2025

  18. arXiv:2409.09745  [pdf, other]

    cs.LG math.OC

    The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

    Authors: Haihan Zhang, Yuanshi Liu, Qianwen Chen, Cong Fang

    Abstract: Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show that it generalizes well under suitable high-dimensional settings. However, a fundamental question -- for what kinds of high-dimensional learning problems SGD and its accelerated varian…

    Submitted 15 September, 2024; originally announced September 2024.

    Comments: 46 pages

  19. arXiv:2409.09564  [pdf, other]

    cs.CV cs.AI

    TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

    Authors: Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen

    Abstract: Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA)…

    Submitted 20 September, 2024; v1 submitted 14 September, 2024; originally announced September 2024.

  20. arXiv:2409.08861  [pdf, other]

    cs.LG math.OC stat.ML

    Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control

    Authors: Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, Ricky T. Q. Chen

    Abstract: Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless no…

    Submitted 13 September, 2024; originally announced September 2024.

  21. arXiv:2409.08687  [pdf, other]

    cs.RO cs.LG

    xTED: Cross-Domain Policy Adaptation via Diffusion-Based Trajectory Editing

    Authors: Haoyi Niu, Qimao Chen, Tenglong Liu, Jianxiong Li, Guyue Zhou, Yi Zhang, Jianming Hu, Xianyuan Zhan

    Abstract: Reusing pre-collected data from different domains is an attractive solution in decision-making tasks where the accessible data is insufficient in the target domain but relatively abundant in other related domains. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, which requires learning domain/task-specific mod…

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: xTED offers a novel, generic, flexible, simple and effective paradigm that casts cross-domain policy adaptation as a data pre-processing problem

  22. arXiv:2409.08622  [pdf, other]

    cs.HC

    Policy Prototyping for LLMs: Pluralistic Alignment via Interactive and Collaborative Policymaking

    Authors: K. J. Kevin Feng, Inyoung Cheong, Quan Ze Chen, Amy X. Zhang

    Abstract: Emerging efforts in AI alignment seek to broaden participation in shaping model behavior by eliciting and integrating collective input into a policy for model finetuning. While pluralistic, these processes are often linear and do not allow participating stakeholders to confirm whether potential outcomes of their contributions are indeed consistent with their intentions. Design prototyping has long…

    Submitted 13 September, 2024; originally announced September 2024.

  23. arXiv:2409.08552  [pdf, other]

    eess.AS cs.SD

    Unified Audio Event Detection

    Authors: Yidi Jiang, Ruijie Tao, Wen Huang, Qian Chen, Wen Wang

    Abstract: Sound Event Detection (SED) detects regions of sound events, while Speaker Diarization (SD) segments speech conversations attributed to individual speakers. In SED, all speaker segments are classified as a single speech event, while in SD, non-speech sounds are treated merely as background noise. Thus, both tasks provide only partial analysis in complex audio scenarios involving both speech conver…

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: submitted to ICASSP 2025

  24. arXiv:2409.08083  [pdf, other]

    cs.CV

    SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality

    Authors: Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang

    Abstract: Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework SimMAT to study an open problem: the transferability from vi…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: Github link: https://github.com/mt-cly/SimMAT

  25. arXiv:2409.08042  [pdf, other]

    cs.CV cs.GR

    Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis

    Authors: Qian Chen, Shihao Shu, Xiangzhi Bai

    Abstract: Novel-view synthesis based on visible light has been extensively studied. In comparison to visible light imaging, thermal infrared imaging offers the advantage of all-weather imaging and strong penetration, providing increased possibilities for reconstruction in nighttime and adverse weather scenarios. However, thermal infrared imaging is influenced by physical characteristics such as atmospheric…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: 17 pages, 4 figures, 3 tables

    ACM Class: I.3.3; I.4.5

    Journal ref: ECCV2024

  26. arXiv:2409.06129  [pdf, other]

    cs.CV cs.GR cs.LG

    DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement

    Authors: Qimin Chen, Zhiqin Chen, Vladimir G. Kim, Noam Aigerman, Hao Zhang, Siddhartha Chaudhuri

    Abstract: We present a 3D modeling method which enables end-users to refine or detailize 3D shapes using machine learning, expanding the capabilities of AI-assisted 3D content creation. Given a coarse voxel shape (e.g., one produced with a simple box extrusion tool or via generative modeling), a user can directly "paint" desired target styles representing compelling geometric details, from input exemplar sh…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: ECCV 2024 (poster). Code: https://qiminchen.github.io/decollage/

  27. arXiv:2409.06035  [pdf, other]

    eess.IV cs.CV

    Analyzing Tumors by Synthesis

    Authors: Qi Chen, Yuxiang Lai, Xiaoxi Chen, Qixin Hu, Alan Yuille, Zongwei Zhou

    Abstract: Computer-aided tumor detection has shown great potential in enhancing the interpretation of over 80 million CT scans performed annually in the United States. However, challenges arise due to the rarity of CT scans with tumors, especially early-stage tumors. Developing AI with real tumor data faces issues of scarcity, annotation difficulty, and low prevalence. Tumor synthesis addresses these challe…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: Accepted as a chapter in the Springer Book: "Generative Machine Learning Models in Medical Image Computing."

  28. arXiv:2409.05249  [pdf, other]

    cs.CR cs.DB cs.NI

    NetDPSyn: Synthesizing Network Traces under Differential Privacy

    Authors: Danyu Sun, Joann Qiongna Chen, Chen Gong, Tianhao Wang, Zhou Li

    Abstract: As the utilization of network traces for network measurement research becomes increasingly prevalent, concerns regarding privacy leakage from network traces have garnered the public's attention. To safeguard network traces, researchers have proposed trace synthesis that retains the essential properties of the raw data. However, previous works also show that synthesis traces with generative…

    Submitted 8 September, 2024; originally announced September 2024.

    Comments: IMC 2024

  29. arXiv:2409.03247  [pdf, other]

    cs.HC

    End User Authoring of Personalized Content Classifiers: Comparing Example Labeling, Rule Writing, and LLM Prompting

    Authors: Leijie Wang, Kathryn Yurechko, Pranati Dani, Quan Ze Chen, Amy X. Zhang

    Abstract: Existing tools for laypeople to create personal classifiers often assume a motivated user working uninterrupted in a single, lengthy session. However, users tend to engage with social media casually, with many short sessions on an ongoing, daily basis. To make creating personal classifiers for content curation easier for such users, tools should support rapid initialization and iterative refinemen…

    Submitted 5 September, 2024; originally announced September 2024.

  30. arXiv:2409.03020  [pdf, ps, other]

    cs.DS

    Online Scheduling via Gradient Descent for Weighted Flow Time Minimization

    Authors: Qingyun Chen, Sungjin Im, Aditya Petety

    Abstract: In this paper, we explore how a natural generalization of Shortest Remaining Processing Time (SRPT) can be a powerful meta-algorithm for online scheduling. The meta-algorithm processes jobs to maximally reduce the objective of the corresponding offline scheduling problem of the remaining jobs: minimizing the total weighted completion time of them (the residual optimum). We show that it achi…

    Submitted 4 September, 2024; originally announced September 2024.

  31. arXiv:2409.02919  [pdf, other]

    cs.CV

    HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

    Authors: Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo

    Abstract: The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt for the generation of multiple scales provides insufficient efficacy. In response, we propos…

    Submitted 9 September, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: https://liuxinyv.github.io/HiPrompt/

  32. arXiv:2409.02834  [pdf, other]

    cs.CL

    CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

    Authors: Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He

    Abstract: Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate t…

    Submitted 6 September, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

  33. arXiv:2409.01893  [pdf, other]

    cs.CL cs.AI

    What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

    Authors: Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, Dahua Lin

    Abstract: Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Work in progress

  34. arXiv:2409.01055  [pdf, other]

    cs.CV

    Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

    Authors: Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, Wei Liu

    Abstract: This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called Follow-Your-Canvas. It builds upon two core designs. F…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Github: https://github.com/mayuelala/FollowYourCanvas Page: https://follow-your-canvas.github.io/

  35. arXiv:2408.17347  [pdf, other]

    cs.CV

    LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation

    Authors: Shuyi Ouyang, Jinyang Zhang, Xiangye Lin, Xilai Wang, Qingqing Chen, Yen-Wei Chen, Lanfen Lin

    Abstract: Conventional medical image segmentation methods have been found inadequate in helping physicians identify specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions…

    Submitted 2 September, 2024; v1 submitted 30 August, 2024; originally announced August 2024.

    Comments: 14 pages, 5 figures

    ACM Class: I.4.6

  36. arXiv:2408.17051  [pdf, other]

    cs.SI

    Service-Oriented AoI Modeling and Analysis for Non-Terrestrial Networks

    Authors: Zheng Guo, Qian Chen, Weixiao Meng

    Abstract: To achieve truly seamless global intelligent connectivity, non-terrestrial networks (NTN) mainly composed of low earth orbit (LEO) satellites and drones are recognized as important components of the future 6G network architecture. Meanwhile, the rapid advancement of the Internet of Things (IoT) has led to the proliferation of numerous applications with stringent requirements for timely information…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: 6 pages, 5 figures

  37. arXiv:2408.16990  [pdf, other]

    cs.MM

    Video to Music Moment Retrieval

    Authors: Zijie Xin, Minquan Wang, Ye Ma, Bo Wang, Quan Chen, Peng Jiang, Xirong Li

    Abstract: Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music has to be cut to a shorter moment, there is a cle…

    Submitted 29 August, 2024; originally announced August 2024.

  38. arXiv:2408.16532  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD eess.SP

    WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

    Authors: Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao

    Abstract: Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domai…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: Work in progress. arXiv admin note: text overlap with arXiv:2402.12208

  39. ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

    Authors: Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

    Abstract: Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distract…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: Accepted by ACM MM 2024

    ACM Class: I.2

  40. arXiv:2408.16219  [pdf, other]

    cs.CV

    Training-free Video Temporal Grounding using Large-scale Pre-trained Models

    Authors: Minghang Zheng, Xinhao Cai, Qingchao Chen, Yuxin Peng, Yang Liu

    Abstract: Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs, but they exhibit poor generalization capability under cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  41. arXiv:2408.15428  [pdf, other]

    cs.CV

    HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles

    Authors: Deyuan Qu, Qi Chen, Yongqi Zhu, Yihao Zhu, Sergei S. Avedisov, Song Fu, Qing Yang

    Abstract: In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting the entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024 Workshop

  42. arXiv:2408.15270  [pdf, other]

    cs.CV cs.GR cs.LG cs.RO

    SkillMimic: Learning Reusable Basketball Skills from Demonstrations

    Authors: Yinhuai Wang, Qihan Zhao, Runyi Yu, Ailing Zeng, Jing Lin, Zhengyi Luo, Hok Wai Tsui, Jiwen Yu, Xiu Li, Qifeng Chen, Jian Zhang, Lei Zhang, Ping Tan

    Abstract: Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-drive…

    Submitted 12 August, 2024; originally announced August 2024.

  43. arXiv:2408.14700  [pdf, other]

    cs.AI

    Artificial Intelligence in Landscape Architecture: A Survey

    Authors: Yue Xing, Wensheng Gan, Qidi Chen

    Abstract: The development history of landscape architecture (LA) reflects the human pursuit of environmental beautification and ecological balance. With the advancement of artificial intelligence (AI) technologies that simulate and extend human intelligence, immense opportunities have been provided for LA, offering scientific and technological support throughout the entire workflow. In this article, we comp…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Preprint. 3 figures, 2 tables

  44. arXiv:2408.14492  [pdf, other]

    cs.LG

    Evolvable Psychology Informed Neural Network for Memory Behavior Modeling

    Authors: Xiaoxuan Shen, Zhihai Hu, Qirong Chen, Shengyingjie Liu, Ruxia Liang, Jianwen Sun

    Abstract: Memory behavior modeling is a core issue in cognitive psychology and education. Classical psychological theories typically use memory equations to describe memory behavior, which exhibits insufficient accuracy and controversy, while data-driven memory modeling methods often require large amounts of training data and lack interpretability. Knowledge-informed neural network models have shown excelle…

    Submitted 22 August, 2024; originally announced August 2024.

  45. arXiv:2408.14469  [pdf, other]

    cs.CV

    Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

    Authors: Qirui Chen, Shangzhe Di, Weidi Xie

    Abstract: This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a l…

    Submitted 26 August, 2024; originally announced August 2024.

  46. arXiv:2408.13226  [pdf, other]

    cs.CV

    D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching

    Authors: Jingyu Liu, Minquan Wang, Ye Ma, Bo Wang, Aozhu Chen, Quan Chen, Peng Jiang, Xirong Li

    Abstract: Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing the user engaging experience. Previous studies ab…

    Submitted 23 August, 2024; originally announced August 2024.

    Comments: 9 pages, 4 figures

  47. arXiv:2408.12128  [pdf, other]

    cs.AI cs.CV

    Diffusion-Based Visual Art Creation: A Survey and New Perspectives

    Authors: Bingyuan Wang, Qifeng Chen, Zeyu Wang

    Abstract: The integration of generative AI in visual art has revolutionized not only how visual content is created but also how AI interacts with and reflects the underlying domain knowledge. This survey explores the emerging realm of diffusion-based visual art creation, examining its development from both artistic and technical perspectives. We structure the survey into three phases, data feature and frame…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: 35 pages, 9 figures

  48. arXiv:2408.12102  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

    Authors: Luyao Cheng, Hui Wang, Siqi Zheng, Yafeng Chen, Rongjie Huang, Qinglin Zhang, Qian Chen, Xihao Li

    Abstract: Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogenous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals…

    Submitted 21 August, 2024; originally announced August 2024.

  49. arXiv:2408.10919  [pdf, other]

    cs.CV cs.AI cs.LG eess.SP

    CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

    Authors: Zijian Zhao, Tingwei Chen, Zhijie Cai, Xiaoyang Li, Hang Li, Qimei Chen, Guangxu Zhu

    Abstract: In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fa…

    Submitted 20 August, 2024; v1 submitted 20 August, 2024; originally announced August 2024.

  50. arXiv:2408.10679  [pdf, other]

    cs.CV

    DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba

    Authors: Shuning Xu, Xina Liu, Binbin Song, Xiangyu Chen, Qiubo Chen, Jiantao Zhou

    Abstract: Moire patterns arise when two similar repetitive patterns interfere, a phenomenon frequently observed during the capture of images or videos on screens. The color, shape, and location of moire patterns may differ across video frames, posing a challenge in learning information from adjacent frames and preserving temporal consistency. Previous video demoireing methods heavily rely on well-designed a…

    Submitted 20 August, 2024; originally announced August 2024.