Showing 1–50 of 796 results for author: Zhu, L

Searching in archive cs.
  1. arXiv:2409.03605  [pdf, other]

    cs.CV cs.MM

    SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

    Authors: Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu

    Abstract: Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as intermediate r…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: 10 pages, 7 figures, 3 tables

  2. arXiv:2409.00614  [pdf, other]

    cs.CL cs.AI

    DAMe: Personalized Federated Social Event Detection with Dual Aggregation Mechanism

    Authors: Xiaoyan Yu, Yifan Wei, Pu Li, Shuaishuai Zhou, Hao Peng, Li Sun, Liehuang Zhu, Philip S. Yu

    Abstract: Training social event detection models through federated learning (FedSED) aims to improve participants' performance on the task. However, existing federated learning paradigms are inadequate for achieving FedSED's objective and exhibit limitations in handling the inherent heterogeneity in social data. This paper proposes a personalized federated learning framework with a dual aggregation mechanis…

    Submitted 1 September, 2024; originally announced September 2024.

    Comments: CIKM 2024

  3. arXiv:2408.15516  [pdf, other]

    cs.NI

    Predicting Parameter Change's Effect on Cellular Network Time Series

    Authors: Mingjie Li, Yongqian Sun, Xiaolei Hua, Renkai Yu, Xinwen Fan, Lin Zhu, Junlan Feng, Dan Pei

    Abstract: The cellular network provides convenient network access for ever-growing mobile phones. During the continuous optimization, operators can adjust cell parameters to enhance the Quality of Service (QoS) flexibly. A precise prediction of the parameter change's effect can help operators make proper parameter adjustments. This work focuses on predicting cell status (like the workload and QoS) after adj…

    Submitted 28 August, 2024; originally announced August 2024.

  4. arXiv:2408.12825  [pdf, other]

    cs.CV

    MergeUp-augmented Semi-Weakly Supervised Learning for WSI Classification

    Authors: Mingxi Ouyang, Yuqiu Fu, Renao Yan, ShanShan Shi, Xitong Ling, Lianghui Zhu, Yonghong He, Tian Guan

    Abstract: Recent advancements in computational pathology and artificial intelligence have significantly improved whole slide image (WSI) classification. However, the gigapixel resolution of WSIs and the scarcity of manual annotations present substantial challenges. Multiple instance learning (MIL) is a promising weakly supervised learning approach for WSI classification. Recent research revealed employing…

    Submitted 23 August, 2024; originally announced August 2024.

  5. arXiv:2408.12353  [pdf, other]

    stat.ML cs.LG math.ST

    Distributed quasi-Newton robust estimation under differential privacy

    Authors: Chuhan Wang, Lixing Zhu, Xuehu Zhu

    Abstract: For distributed computing with Byzantine machines under Privacy Protection (PP) constraints, this paper develops a robust PP distributed quasi-Newton estimation, which only requires the node machines to transmit five vectors to the central processor with high asymptotic relative efficiency. Compared with the gradient descent strategy which requires more rounds of transmission and the Newton iterat…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: 38 pages, 6 figures

  6. arXiv:2408.12316  [pdf, other]

    cs.CV

    Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement

    Authors: Lingyu Zhu, Wenhan Yang, Baoliang Chen, Hanwei Zhu, Zhangkai Ni, Qi Mao, Shiqi Wang

    Abstract: Obtaining pairs of low/normal-light videos, with motions, is more challenging than still images, which raises technical issues and poses the technical route of unpaired learning as a critical role. This paper makes endeavors in the direction of learning for low-light video enhancement without using paired ground truth. Compared to low-light image enhancement, enhancing low-light videos is more dif…

    Submitted 22 August, 2024; originally announced August 2024.

  7. arXiv:2408.12247  [pdf, other]

    cs.AI

    Enhanced Fine-Tuning of Lightweight Domain-Specific Q&A Model Based on Large Language Models

    Authors: Shenglin Zhang, Pengtian Zhu, Minghua Ma, Jiagang Wang, Yongqian Sun, Dongwen Li, Jingyu Wang, Qianying Guo, Xiaolei Hua, Lin Zhu, Dan Pei

    Abstract: Large language models (LLMs) excel at general question-answering (Q&A) but often fall short in specialized domains due to a lack of domain-specific knowledge. Commercial companies face the dual challenges of privacy protection and resource constraints when involving LLMs for fine-tuning. This paper proposes a novel framework, Self-Evolution, designed to address these issues by leveraging lightweigh…

    Submitted 22 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

  8. arXiv:2408.11820  [pdf, other]

    cs.CY cs.AI

    Responsible AI Question Bank: A Comprehensive Tool for AI Risk Assessment

    Authors: Sung Une Lee, Harsha Perera, Yue Liu, Boming Xia, Qinghua Lu, Liming Zhu

    Abstract: The rapid growth of Artificial Intelligence (AI) has underscored the urgent need for responsible AI practices. Despite increasing interest, a comprehensive AI risk assessment toolkit remains lacking. This study introduces our Responsible AI (RAI) Question Bank, a comprehensive framework and tool designed to support diverse AI initiatives. By integrating AI ethics principles such as fairness, trans…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: 30 pages, 6 tables, 14 figures

  9. Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

    Authors: Haipeng Zhou, Honqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu

    Abstract: Video Shadow Detection (VSD) aims to detect the shadow masks with frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidanc…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: ACM MM2024

  10. arXiv:2408.11296  [pdf, other]

    cs.SE cs.CL

    RePair: Automated Program Repair with Process-based Feedback

    Authors: Yuze Zhao, Zhenya Huang, Yixiao Ma, Rui Li, Kai Zhang, Hao Jiang, Qi Liu, Linbo Zhu, Yu Su

    Abstract: The gap between the trepidation of program reliability and the expense of repairs underscores the indispensability of Automated Program Repair (APR). APR is instrumental in transforming vulnerable programs into more robust ones, bolstering program reliability while simultaneously diminishing the financial burden of manual repairs. Commercial-scale language models (LM) have taken APR to unprecedent…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 15 pages, 13 figures

    Journal ref: ACL 2024 Findings

  11. arXiv:2408.10488  [pdf, other]

    cs.CV cs.AI cs.CL cs.NE

    Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

    Authors: Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang

    Abstract: Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Unlike traditional SLT based on visible light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams h…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: First Large-scale and High-Definition Benchmark Dataset for Event-based Sign Language Translation

  12. arXiv:2408.10487  [pdf, other]

    cs.CV cs.AI

    MambaEVT: Event Stream based Visual Object Tracking using State Space Model

    Authors: Xiao Wang, Chao Wang, Shiao Wang, Xixi Wang, Zhicheng Zhao, Lin Zhu, Bo Jiang

    Abstract: Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object locali…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: In Peer Review

  13. arXiv:2408.10188  [pdf, other]

    cs.CV cs.CL

    LongVILA: Scaling Long-Context Visual Language Models for Long Videos

    Authors: Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

    Abstract: Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long su…

    Submitted 21 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: Code and models are available at https://github.com/NVlabs/VILA/blob/main/LongVILA.md

  14. arXiv:2408.10154  [pdf, other]

    cs.CV cs.RO

    LoopSplat: Loop Closure by Registering 3D Gaussian Splats

    Authors: Liyuan Zhu, Yue Li, Erik Sandström, Shengyu Huang, Konrad Schindler, Iro Armeni

    Abstract: Simultaneous Localization and Mapping (SLAM) based on 3D Gaussian Splats (3DGS) has recently shown promise towards more accurate, dense 3D scene maps. However, existing 3DGS-based methods fail to address the global consistency of the scene via loop closure and/or global bundle adjustment. To this end, we propose LoopSplat, which takes RGB-D images as input and performs dense mapping with 3DGS subm…

    Submitted 19 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: Project page: https://loopsplat.github.io/

  15. arXiv:2408.10039  [pdf, other]

    cs.AI

    MSDiagnosis: An EMR-based Dataset for Clinical Multi-Step Diagnosis

    Authors: Ruihui Hou, Shencheng Chen, Yongqi Fan, Lifeng Zhu, Jing Sun, Jingping Liu, Tong Ruan

    Abstract: Clinical diagnosis is critical in medical practice, typically requiring a continuous and evolving process that includes primary diagnosis, differential diagnosis, and final diagnosis. However, most existing clinical diagnostic tasks are single-step processes, which does not align with the complex multi-step diagnostic procedures found in real-world clinical settings. In this paper, we propose a mu…

    Submitted 29 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

  16. arXiv:2408.09764  [pdf, other]

    cs.CV cs.AI cs.NE

    Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

    Authors: Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian

    Abstract: Human Action Recognition (HAR) stands as a pivotal research domain in both computer vision and artificial intelligence, with RGB cameras dominating as the preferred tool for investigation and innovation in this field. However, in real-world applications, RGB cameras encounter numerous challenges, including light conditions, fast motion, and privacy concerns. Consequently, bio-inspired event camera…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: In Peer Review

  17. arXiv:2408.09667  [pdf, other]

    cs.CL

    BLADE: Benchmarking Language Model Agents for Data-Driven Science

    Authors: Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff

    Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-dri…

    Submitted 20 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

  18. arXiv:2408.09410  [pdf, other]

    cs.AI

    $\mathbb{BEHR}$NOULLI: A Binary EHR Data-Oriented Medication Recommendation System

    Authors: Xihao Piao, Pei Gao, Zheng Chen, Lingwei Zhu, Yasuko Matsubara, Yasushi Sakurai

    Abstract: The medical community believes binary medical event outcomes in EHR data contain sufficient information for making a sensible recommendation. However, there are two challenges to effectively utilizing such data: (1) modeling the relationship between massive 0,1 event outcomes is difficult, even with expert knowledge; (2) in practice, learning can be stalled by the binary values since the equally i…

    Submitted 18 August, 2024; originally announced August 2024.

    MSC Class: 68T01

  19. Language-Driven Interactive Shadow Detection

    Authors: Hongqiu Wang, Wei Wang, Haipeng Zhou, Huihui Xu, Shaozhi Wu, Lei Zhu

    Abstract: Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrar…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: ACM MM 2024

  20. arXiv:2408.07245  [pdf, other]

    cs.LG

    q-exponential family for policy optimization

    Authors: Lingwei Zhu, Haseeb Shah, Han Wang, Martha White

    Abstract: Policy optimization methods benefit from a simple and tractable policy functional, usually the Gaussian for continuous action spaces. In this paper, we consider a broader policy family that remains tractable: the $q$-exponential family. This family of policies is flexible, allowing the specification of both heavy-tailed policies ($q>1$) and light-tailed policies ($q<1$). This paper examines the in…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: 27 pages, 12 pages main text, 15 pages appendix

  21. arXiv:2408.04579  [pdf, other]

    cs.CV

    SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More

    Authors: Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chunan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, Ying Zang

    Abstract: The advent of large models, also known as foundation models, has significantly transformed the AI research landscape, with models like Segment Anything (SAM) achieving notable success in diverse image segmentation scenarios. Despite its advancements, SAM encountered limitations in handling some complex low-level segmentation tasks like camouflaged object and medical imaging. In response, in 2023,…

    Submitted 10 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: text overlap with arXiv:2304.09148

  22. arXiv:2408.02920  [pdf, other]

    cs.SE cs.AI

    A Taxonomy of Architecture Options for Foundation Model-based Agents: Analysis and Decision Model

    Authors: Jingwen Zhou, Qinghua Lu, Jieshan Chen, Liming Zhu, Xiwei Xu, Zhenchang Xing, Stefan Harrer

    Abstract: The rapid advancement of AI technology has led to widespread applications of agent systems across various domains. However, the need for detailed architecture design poses significant challenges in designing and operating these systems. This paper introduces a taxonomy focused on the architectures of foundation-model-based agents, addressing critical aspects such as functional capabilities and non…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Under review

  23. arXiv:2408.02205  [pdf, other]

    cs.SE cs.AI

    Towards AI-Safety-by-Design: A Taxonomy of Runtime Guardrails in Foundation Model based Systems

    Authors: Md Shamsujjoha, Qinghua Lu, Dehai Zhao, Liming Zhu

    Abstract: The rapid advancement and widespread deployment of foundation model (FM) based systems have revolutionized numerous applications across various domains. However, the fast-growing capabilities and autonomy have also raised significant concerns about responsible AI and AI safety. Recently, there has been increasing attention toward implementing guardrails to ensure the runtime behavior of FM-based…

    Submitted 4 August, 2024; originally announced August 2024.

    Comments: 15 Pages

  24. arXiv:2408.01840  [pdf, other]

    cs.CV

    E$^3$NeRF: Efficient Event-Enhanced Neural Radiance Fields from Blurry Images

    Authors: Yunshan Qi, Jia Li, Yifan Zhao, Yu Zhang, Lin Zhu

    Abstract: Neural Radiance Fields (NeRF) achieve impressive rendering performance by learning volumetric 3D representation from several images of different views. However, it is difficult to reconstruct a sharp NeRF from blurry input as it often occurs in the wild. To solve this problem, we propose a novel Efficient Event-Enhanced NeRF (E$^3$NeRF) by utilizing the combination of RGB images and event streams.…

    Submitted 3 August, 2024; originally announced August 2024.

  25. arXiv:2408.01732  [pdf, other]

    cs.CV cs.AI

    Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

    Authors: Jintao Tan, Xize Cheng, Lingyu Xiong, Lei Zhu, Xiandong Li, Xianjia Wu, Kai Gong, Minglei Li, Yi Cai

    Abstract: Audio-driven talking head generation is a significant and challenging task applicable to various fields such as virtual avatars, film production, and online conferences. However, the existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip s…

    Submitted 3 August, 2024; originally announced August 2024.

  26. arXiv:2408.00965  [pdf, other]

    cs.AI

    Integrating ESG and AI: A Comprehensive Responsible AI Assessment Framework

    Authors: Sung Une Lee, Harsha Perera, Yue Liu, Boming Xia, Qinghua Lu, Liming Zhu, Jessica Cairns, Moana Nottage

    Abstract: Artificial Intelligence (AI) is a widely developed and adopted technology across entire industry sectors. Integrating environmental, social, and governance (ESG) considerations with AI investments is crucial for ensuring ethical and sustainable technological advancement. Particularly from an investor perspective, this integration not only mitigates risks but also enhances long-term value creation…

    Submitted 5 August, 2024; v1 submitted 1 August, 2024; originally announced August 2024.

    Comments: 23 pages, 8 tables, 10 figures

  27. DeliLaw: A Chinese Legal Counselling System Based on a Large Language Model

    Authors: Nan Xie, Yuelin Bai, Hengyuan Gao, Feiteng Fang, Qixuan Zhao, Zhijian Li, Ziqiang Xue, Liang Zhu, Shiwen Ni, Min Yang

    Abstract: Traditional legal retrieval systems designed to retrieve legal documents, statutes, precedents, and other legal information are unable to give satisfactory answers due to lack of semantic understanding of specific questions. Large Language Models (LLMs) have achieved excellent results in a variety of natural language processing tasks, which inspired us to train an LLM in the legal domain to he…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: CIKM 2024, 5 pages with 3 figures

  28. RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining

    Authors: Hongtao Wu, Yijun Yang, Huihui Xu, Weiming Wang, Jinni Zhou, Lei Zhu

    Abstract: The outdoor vision systems are frequently contaminated by rain streaks and raindrops, which significantly degenerate the performance of visual tasks and multimedia applications. The nature of videos exhibits redundant temporal cues for rain removal with higher stability. Traditional video deraining methods heavily rely on optical flow estimation and kernel-based manners, which have a limited recep…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: ACM Multimedia 2024

  29. arXiv:2407.21735  [pdf, other]

    cs.CV

    Unifying Event-based Flow, Stereo and Depth Estimation via Feature Similarity Matching

    Authors: Pengjie Zhang, Lin Zhu, Lizhi Wang, Hua Huang

    Abstract: As an emerging vision sensor, the event camera has gained popularity in various vision tasks such as optical flow estimation, stereo matching, and depth estimation due to its high-speed, sparse, and asynchronous event streams. Unlike traditional approaches that use specialized architectures for each specific task, we propose a unified framework, EventMatch, that reformulates these tasks as an even…

    Submitted 31 July, 2024; originally announced July 2024.

  30. arXiv:2407.20199  [pdf, other]

    stat.ML cs.LG

    Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

    Authors: Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

    Abstract: Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neura…

    Submitted 29 July, 2024; originally announced July 2024.

  31. arXiv:2407.19918  [pdf, other]

    cs.CV

    FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

    Authors: Yu Lu, Yuanzhi Liang, Linchao Zhu, Yi Yang

    Abstract: Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation tasks requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model (e.g. pre-tr…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Project page: https://yulu.net.cn/freelong

  32. arXiv:2407.19496  [pdf, ps, other]

    math.OC cs.RO

    Small-Gain Theorem Based Distributed Prescribed-Time Convex Optimization For Networked Euler-Lagrange Systems

    Authors: Gewei Zuo, Mengmou Li, Lijun Zhu

    Abstract: In this paper, we address the distributed prescribed-time convex optimization (DPTCO) for a class of networked Euler-Lagrange systems under undirected connected graphs. By utilizing position-dependent measured gradient value of local objective function and local information interactions among neighboring agents, a set of auxiliary systems is constructed to cooperatively seek the optimal solution.…

    Submitted 28 July, 2024; originally announced July 2024.

    Comments: 13 pages, 4 figures

  33. arXiv:2407.19139  [pdf, other]

    cs.CV

    Multi-Expert Adaptive Selection: Task-Balancing for All-in-One Image Restoration

    Authors: Xiaoyan Yu, Shen Zhou, Huafeng Li, Liehuang Zhu

    Abstract: The use of a single image restoration framework to achieve multi-task image restoration has garnered significant attention from researchers. However, several practical challenges remain, including meeting the specific and simultaneous demands of different tasks, balancing relationships between tasks, and effectively utilizing task correlations in model design. To address these challenges, this pap…

    Submitted 26 July, 2024; originally announced July 2024.

  34. arXiv:2407.18908  [pdf, other]

    cs.LG cs.CL cs.CV

    Wolf: Captioning Everything with a World Summarization Framework

    Authors: Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone

    Abstract: We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhan…

    Submitted 26 July, 2024; originally announced July 2024.

  35. arXiv:2407.18035  [pdf, other]

    cs.CV cs.AI cs.CL

    RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

    Authors: Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu

    Abstract: Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited r…

    Submitted 25 July, 2024; originally announced July 2024.

  36. arXiv:2407.17734  [pdf, other]

    cs.AI cs.CL cs.CV

    Cost-effective Instruction Learning for Pathology Vision and Language Analysis

    Authors: Kaitao Chen, Mianxin Liu, Fang Yan, Lei Ma, Xiaoming Shi, Lilong Wang, Xiaosong Wang, Lifeng Zhu, Zhe Wang, Mu Zhou, Shaoting Zhang

    Abstract: The advent of vision-language models fosters the interactive conversations between AI-enabled models and humans. Yet applying these models into clinics must deal with daunting challenges around large-scale training data, financial, and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology named as CLOVER. CLOVER only trains a lightwei…

    Submitted 24 July, 2024; originally announced July 2024.

  37. arXiv:2407.17691  [pdf, other]

    cs.NI eess.SY

    System-Level Simulation Framework for NB-IoT: Key Features and Performance Evaluation

    Authors: Shutao Zhang, Wenkun Wen, Peiran Wu, Hongqing Huang, Liya Zhu, Yijia Guo, Tingting Yang, Minghua Xia

    Abstract: Narrowband Internet of Things (NB-IoT) is a technology specifically designated by the 3rd Generation Partnership Project (3GPP) to meet the explosive demand for massive machine-type communications (mMTC), and it is evolving to RedCap. Industrial companies have increasingly adopted NB-IoT as the solution for mMTC due to its lightweight design and comprehensive technical specifications released by 3…

    Submitted 13 August, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

  38. arXiv:2407.17453  [pdf, other]

    cs.CV

    $VILA^2$: VILA Augmented VILA

    Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin

    Abstract: Visual language models (VLMs) have rapidly progressed, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance rapidly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either directly crawls more raw data from the Internet that does not have a guarantee of data quality or distills…

    Submitted 24 July, 2024; originally announced July 2024.

  39. arXiv:2407.15031  [pdf]

    cs.NI

    Schedulability Analysis in Time-Sensitive Networking: A Systematic Literature Review

    Authors: Zitong Wang, Feng Luo, Yunpeng Li, Haotian Gan, Lei Zhu

    Abstract: Time-Sensitive Networking (TSN) is a set of standards that provide low-latency, high-reliability guarantees for the transmission of traffic in networks, and it is becoming an accepted solution for complex time-critical systems such as those in industrial automation and the automotive. In time-critical systems, it is essential to verify the timing predictability of the system, and the application o…

    Submitted 20 July, 2024; originally announced July 2024.

  40. arXiv:2407.14900  [pdf, other]

    cs.CV

    AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

    Authors: Yunlong Lin, Tian Ye, Sixiang Chen, Zhenqi Fu, Yingying Wang, Wenhao Chai, Zhaohu Xing, Lei Zhu, Xinghao Ding

    Abstract: Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes even unavailable, and 2) accurately modeling complex degradations presents…

    Submitted 23 July, 2024; v1 submitted 20 July, 2024; originally announced July 2024.

    Comments: 21 pages, 9 figures

  41. arXiv:2407.10990  [pdf]

    cs.CL cs.AI

    MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

    Authors: Mianxin Liu, Jinru Ding, Jie Xu, Weiguo Hu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, Pengfei Liu, Xiaofan Zhang, Shanshan Wang, Kang Li, Haofen Wang, Tong Ruan, Xuanjing Huang, Xin Sun, Shaoting Zhang

    Abstract: Ensuring the general efficacy and goodness for human beings from medical large language models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce "MedBench", a comprehensive, standardized, and reliable benchmarking system for Chinese med…

    Submitted 23 June, 2024; originally announced July 2024.

    Comments: 25 pages, 4 figures

  42. arXiv:2407.10986  [pdf, other]

    eess.SP cs.NI

    Integrating Base Station with Intelligent Surface for 6G Wireless Networks: Architectures, Design Issues, and Future Directions

    Authors: Yuwei Huang, Lipeng Zhu, Rui Zhang

    Abstract: Intelligent surface (IS) is envisioned as a promising technology for the sixth-generation (6G) wireless networks, which can effectively reconfigure the wireless propagation environment via dynamically controllable signal reflection/transmission. In particular, integrating passive intelligent surface (IS) into the base station (BS) is a novel solution to enhance the wireless network throughput and…

    Submitted 21 June, 2024; originally announced July 2024.

    Comments: submitted to IEEE magazine. 5 figures, 1 table

  43. arXiv:2407.10636  [pdf, other]

    cs.CV

    Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction

    Authors: Lin Zhu, Yunlong Zheng, Yijun Zhang, Xiao Wang, Lizhi Wang, Hua Huang

    Abstract: Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities. However, current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene, resulting in over-smoothing and blurry artifacts. Addressing…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  44. arXiv:2407.10439  [pdf, other]

    cs.CV

    PolyRoom: Room-aware Transformer for Floorplan Reconstruction

    Authors: Yuzhou Liu, Lingjie Zhu, Xiaodong Ma, Hanqiao Ye, Xiang Gao, Xianwei Zheng, Shuhan Shen

    Abstract: Reconstructing geometry and topology structures from raw unstructured data has always been an important research topic in indoor mapping research. In this paper, we aim to reconstruct the floorplan with a vectorized representation from point clouds. Despite significant advancements achieved in recent years, current methods still encounter several challenges, such as missing corners or edges, inacc…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  45. arXiv:2407.05703  [pdf, other]

    cs.CV

    LGRNet: Local-Global Reciprocal Network for Uterine Fibroid Segmentation in Ultrasound Videos

    Authors: Huihui Xu, Yijun Yang, Angelica I Aviles-Rivero, Guang Yang, Jing Qin, Lei Zhu

    Abstract: Regular screening and early discovery of uterine fibroid are crucial for preventing potential malignant transformations and ensuring timely, life-saving interventions. To this end, we collect and annotate the first ultrasound video dataset with 100 videos for uterine fibroid segmentation (UFUV). We also present Local-Global Reciprocal Network (LGRNet) to efficiently and effectively propagate the l…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: MICCAI2024 Early Accept

  46. arXiv:2407.05657  [pdf, other]

    cs.CV

    DMSD-CDFSAR: Distillation from Mixed-Source Domain for Cross-Domain Few-shot Action Recognition

    Authors: Fei Guo, YiKang Wang, Han Qi, Li Zhu, Jing Sun

    Abstract: Few-shot action recognition is an emerging field in computer vision, primarily focused on meta-learning within the same domain. However, challenges arise in real-world scenario deployment, as gathering extensive labeled data within a specific domain is laborious and time-intensive. Thus, attention shifts towards cross-domain few-shot action recognition, requiring the model to generalize across dom…

    Submitted 8 July, 2024; originally announced July 2024.

  47. arXiv:2407.04404  [pdf]

    cs.AR

    Fixed and Movable Antenna Technology for 6G Integrated Sensing and Communication

    Authors: Yong Zeng, Zhenjun Dong, Huizhi Wang, Lipeng Zhu, Ziyao Hong, Qingji Jiang, Dongming Wang, Shi Jin, Rui Zhang

    Abstract: By deploying antenna arrays at the transmitter/receiver to provide additional spatial-domain degrees of freedom (DoFs), multi-antenna technology greatly improves the reliability and efficiency of wireless communication. Meanwhile, the application of multi-antenna technology in the radar field has achieved spatial angle resolution and improved sensing DoF, thus significantly enhancing wireless sens…

    Submitted 16 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: in Chinese language

  48. arXiv:2407.04284  [pdf, other]

    cs.MM

    TSC-PCAC: Voxel Transformer and Sparse Convolution Based Point Cloud Attribute Compression for 3D Broadcasting

    Authors: Zixi Guo, Yun Zhang, Linwei Zhu, Hanli Wang, Gangyi Jiang

    Abstract: Point cloud has been the mainstream representation for advanced 3D applications, such as virtual reality and augmented reality. However, the massive data amounts of point clouds is one of the most challenging issues for transmission and storage. In this paper, we propose an end-to-end voxel Transformer and Sparse Convolution based Point Cloud Attribute Compression (TSC-PCAC) for 3D broadcasting. F…

    Submitted 26 August, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  49. arXiv:2407.02158  [pdf, other]

    cs.CV

    UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks

    Authors: Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, Lei Zhu

    Abstract: Ultra-high-resolution image generation poses great challenges, such as increased semantic planning complexity and detail synthesis difficulties, alongside substantial training resource demands. We present UltraPixel, a novel architecture utilizing cascade diffusion models to generate high-quality images at multiple resolutions (e.g., 1K to 6K) within a single model, while maintaining comp…

    Submitted 4 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Project page https://jingjingrenabc.github.io/ultrapixel

  50. arXiv:2407.01530  [pdf, other]

    eess.IV cs.CV

    xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

    Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Tao Xu, Deyi Ji, Yan Wang, Ying Zang, Zejian Li

    Abstract: Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (…

    Submitted 2 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.