Showing 1–50 of 1,259 results for author: Zhang, K

Searching in archive cs. Search in all archives.
  1. arXiv:2409.03142  [pdf, other]

    cs.LG stat.ML

    Causal Temporal Representation Learning with Nonstationary Sparse Transition

    Authors: Xiangchen Song, Zijian Li, Guangyi Chen, Yujia Zheng, Yewen Fan, Xinshuai Dong, Kun Zhang

    Abstract: Causal Temporal Representation Learning (Ctrl) methods aim to identify the temporal causal dynamics of complex nonstationary temporal sequences. Despite the success of existing Ctrl methods, they require either directly observing the domain variables or assuming a Markov prior on them. Such requirements limit the application of these methods in real-world scenarios when we do not have such prior k…

    Submitted 4 September, 2024; originally announced September 2024.

  2. arXiv:2409.02813  [pdf, other]

    cs.CL cs.CV

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig

    Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-o…

    Submitted 4 September, 2024; originally announced September 2024.

  3. arXiv:2409.02555  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation

    Authors: Kangkai Zhang, Shiming Ge, Ruixin Shi, Dan Zeng

    Abstract: Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situatio…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

  4. arXiv:2409.02069  [pdf, other]

    cs.AI cs.HC

    A Deployed Online Reinforcement Learning Algorithm In An Oral Health Clinical Trial

    Authors: Anna L. Trella, Kelly W. Zhang, Hinal Jajal, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi-Velez, Susan A. Murphy

    Abstract: Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twice-daily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, a mHealth i…

    Submitted 3 September, 2024; originally announced September 2024.

  5. arXiv:2409.01726  [pdf, other]

    cs.CV

    Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

    Authors: Qi Zhang, Kaiyi Zhang, Antoni B. Chan, Hui Huang

    Abstract: Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods usually estimate the crowd density maps on the ground plane first, and then obtain the crowd locations. However, the performance of existing methods is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate the weakness of density map s…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: ECCV 2024

  6. arXiv:2409.01447  [pdf, other]

    cs.LG cs.GT

    Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

    Authors: Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, Adam Wierman

    Abstract: In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorpor…

    Submitted 4 September, 2024; v1 submitted 2 September, 2024; originally announced September 2024.

    Comments: A preliminary version [arXiv:2303.03100] of this paper, with a subset of the results that are presented here, was presented at NeurIPS 2023

  7. arXiv:2408.17209  [pdf, other]

    cs.DB

    Updateable Data-Driven Cardinality Estimator with Bounded Q-error

    Authors: Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang

    Abstract: Modern Cardinality Estimators struggle with data updates. This research tackles this challenge within single-table. We introduce ICE, an Index-based Cardinality Estimator, the first data-driven estimator that enables instant, tuple-leveled updates. ICE has learned two key lessons from the multidimensional index and applied them to solve cardinality estimation in dynamic scenarios: (1) Index poss…

    Submitted 30 August, 2024; originally announced August 2024.

  8. arXiv:2408.17186  [pdf, other]

    cs.HC cs.AI eess.SY

    "Benefit Game: Alien Seaweed Swarms" -- Real-time Gamification of Digital Seaweed Ecology

    Authors: Dan-Lu Fei, Zi-Wei Wu, Kang Zhang

    Abstract: "Benefit Game: Alien Seaweed Swarms" combines artificial life art and interactive game with installation to explore the impact of human activity on fragile seaweed ecosystems. The project aims to promote ecological consciousness by creating a balance in digital seaweed ecologies. Inspired by the real species "Laminaria saccharina", the author employs Procedural Content Generation via Machine Learn…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Paper accepted at ISEA 24, The 29th International Symposium on Electronic Art, Brisbane, Australia, 21-29 June 2024

  9. arXiv:2408.15209  [pdf, other]

    cs.MM

    Sec2Sec Co-attention for Video-Based Apparent Affective Prediction

    Authors: Mingwei Sun, Kunpeng Zhang

    Abstract: Video-based apparent affect detection plays a crucial role in video understanding, as it encompasses various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: 5 pages, 3 figures

  10. T3M: Text Guided 3D Human Motion Synthesis from Speech

    Authors: Wenshuo Peng, Kaipeng Zhang, Sai Qian Zhang

    Abstract: Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed \texti…

    Submitted 23 August, 2024; originally announced August 2024.

    Comments: 10 pages, 4 figures

  11. Towards Deconfounded Image-Text Matching with Causal Inference

    Authors: Wenhui Li, Xinqi Su, Dan Song, Lanjun Wang, Kun Zhang, An-An Liu

    Abstract: Prior image-text matching methods have shown remarkable performance on many benchmark datasets, but most of them overlook the bias in the dataset, which exists in intra-modal and inter-modal, and tend to learn the spurious correlations that extremely degrade the generalization ability of the model. Furthermore, these methods often incorporate biased external knowledge from large-scale datasets as…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: ACM MM

    Journal ref: 2023/10/26, Proceedings of the 31st ACM International Conference on Multimedia, 6264-6273

  12. arXiv:2408.11296  [pdf, other]

    cs.SE cs.CL

    RePair: Automated Program Repair with Process-based Feedback

    Authors: Yuze Zhao, Zhenya Huang, Yixiao Ma, Rui Li, Kai Zhang, Hao Jiang, Qi Liu, Linbo Zhu, Yu Su

    Abstract: The gap between the trepidation of program reliability and the expense of repairs underscores the indispensability of Automated Program Repair (APR). APR is instrumental in transforming vulnerable programs into more robust ones, bolstering program reliability while simultaneously diminishing the financial burden of manual repairs. Commercial-scale language models (LM) have taken APR to unprecedent…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 15 pages, 13 figures

    Journal ref: ACL 2024 Findings

  13. arXiv:2408.10899  [pdf, other]

    cs.RO

    All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents

    Authors: Zhiqiang Wang, Hao Zheng, Yunshuang Nie, Wenjun Xu, Qingwei Wang, Hua Ye, Zhe Li, Kaidong Zhang, Xuewen Cheng, Wanxi Dong, Chang Cai, Liang Lin, Feng Zheng, Xiaodan Liang

    Abstract: Embodied AI is transforming how AI systems interact with the physical world, yet existing datasets are inadequate for developing versatile, general-purpose agents. These limitations include a lack of standardized formats, insufficient data diversity, and inadequate data volume. To address these issues, we introduce ARIO (All Robots In One), a new data standard that enhances existing datasets by of…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Project website: https://imaei.github.io/project_pages/ario/

  14. arXiv:2408.10666  [pdf, other]

    cs.IR

    Accelerating the Surrogate Retraining for Poisoning Attacks against Recommender Systems

    Authors: Yunfan Wu, Qi Cao, Shuchang Tao, Kaike Zhang, Fei Sun, Huawei Shen

    Abstract: Recent studies have demonstrated the vulnerability of recommender systems to data poisoning attacks, where adversaries inject carefully crafted fake user interactions into the training data of recommenders to promote target items. Current attack methods involve iteratively retraining a surrogate recommender on the poisoned data with the latest fake users to optimize the attack. However, this repet…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Accepted by RecSys 2024

  15. arXiv:2408.10609  [pdf, other]

    cs.LG q-bio.GN stat.ML

    PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

    Authors: Yan Wu, Esther Wershof, Sebastian M Schmon, Marcel Nassar, Błażej Osiński, Ridvan Eksi, Kun Zhang, Thore Graepel

    Abstract: We present a comprehensive framework for predicting the effects of perturbations in single cells, designed to standardize benchmarking in this rapidly evolving field. Our framework, PerturBench, includes a user-friendly platform, diverse datasets, metrics for fair model comparison, and detailed performance analysis. Extensive evaluations of published and baseline models reveal limitations like mod…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 9 pages plus 19 pages supplementary material. Code is available at https://github.com/altoslabs/perturbench

  16. arXiv:2408.10588  [pdf, other]

    cs.CV cs.GR

    DEGAS: Detailed Expressions on Full-Body Gaussian Avatars

    Authors: Zhijing Shao, Duotun Wang, Qing-Yao Tian, Yao-Dong Yang, Hengyu Meng, Zeyu Cai, Bo Dong, Yu Zhang, Kang Zhang, Zeyu Wang

    Abstract: Although neural rendering has made significant advancements in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, our method lea…

    Submitted 20 August, 2024; originally announced August 2024.

  17. arXiv:2408.10469  [pdf, other]

    cs.CV cs.IR

    LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

    Authors: Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li

    Abstract: Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the dis-appearance and re-appearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on vide…

    Submitted 20 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: text overlap with arXiv:2406.03668

  18. arXiv:2408.10353  [pdf, other]

    cs.LG stat.ML

    On the Identifiability of Sparse ICA without Assuming Non-Gaussianity

    Authors: Ignavier Ng, Yujia Zheng, Xinshuai Dong, Kun Zhang

    Abstract: Independent component analysis (ICA) is a fundamental statistical tool used to reveal hidden generative processes from observed data. However, traditional ICA approaches struggle with the rotational invariance inherent in Gaussian distributions, often necessitating the assumption of non-Gaussianity in the underlying sources. This may limit their applicability in broader contexts. To accommodate Ga…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: NeurIPS 2023

  19. arXiv:2408.09937  [pdf, other]

    quant-ph cs.LG

    The curse of random quantum data

    Authors: Kaining Zhang, Junyu Liu, Liu Liu, Liang Jiang, Min-Hsiu Hsieh, Dacheng Tao

    Abstract: Quantum machine learning, which involves running machine learning algorithms on quantum devices, may be one of the most significant flagship applications for these devices. Unlike its classical counterparts, the role of data in quantum machine learning has not been fully understood. In this work, we quantify the performances of quantum machine learning in the landscape of quantum data. Provided th…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 40 pages, 8 figures

  20. arXiv:2408.09698  [pdf, other]

    cs.IR cs.AI

    Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

    Authors: Yuyang Ye, Zhi Zheng, Yishan Shen, Tianshu Wang, Hengruo Zhang, Peijun Zhu, Runlong Yu, Kai Zhang, Hui Xiong

    Abstract: Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies have focused on converting user behavior logs into textual prompts and leveraging techniques such as prompt tuning to enable LLMs for recommendation tasks. Meanwhile, research interest has recently grown in multimodal recommendation systems tha…

    Submitted 20 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

  21. arXiv:2408.09650  [pdf, other]

    cs.CV cs.AI cs.MM eess.IV

    ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement

    Authors: Eashan Adhikarla, Kai Zhang, John Nicholson, Brian D. Davison

    Abstract: Low-light image enhancement remains a challenging task in computer vision, with existing state-of-the-art models often limited by hardware constraints and computational inefficiencies, particularly in handling high-resolution images. Recent foundation models, such as transformers and diffusion models, despite their efficacy in various domains, are limited in use on edge devices due to their comput…

    Submitted 18 August, 2024; originally announced August 2024.

    Journal ref: Efficient Systems for Foundation Models II, International Conference on Machine Learning (ICML) 2024

  22. arXiv:2408.08933  [pdf, other]

    cs.IR cs.AI cs.DB

    RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search

    Authors: Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, X. Sean Wang

    Abstract: Approximate Nearest Neighbor Search (ANNS) is a fundamental and critical component in many applications, including recommendation systems and large language model-based applications. With the advancement of multimodal neural models, which transform data from different modalities into a shared high-dimensional space as feature vectors, cross-modal ANNS aims to use the data vector from one modality…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: to be published in PVLDB

  23. arXiv:2408.08926  [pdf, other]

    cs.CR cs.AI cs.CL cs.CY cs.LG

    Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

    Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W. Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, et al. (2 additional authors not shown)

    Abstract: Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and other researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetrat…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: 86 pages, 7 figures

  24. arXiv:2408.08736  [pdf, other]

    cs.CV

    Task-Aware Dynamic Transformer for Efficient Arbitrary-Scale Image Super-Resolution

    Authors: Tianyi Xu, Yiji Zhou, Xiaotao Hu, Kai Zhang, Anran Zhang, Xingye Qiu, Jun Xu

    Abstract: Arbitrary-scale super-resolution (ASSR) aims to learn a single model for image super-resolution at arbitrary magnifying scales. Existing ASSR networks typically comprise an off-the-shelf scale-agnostic feature extractor and an arbitrary scale upsampler. These feature extractors often use fixed network architectures to address different ASSR inference tasks, each of which is characterized by an inp…

    Submitted 25 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

    Comments: ECAI 2024

  25. arXiv:2408.08342  [pdf, other]

    cs.GR cs.CV

    CT4D: Consistent Text-to-4D Generation with Animatable Meshes

    Authors: Ce Chen, Shaoli Huang, Xuelin Chen, Guangyi Chen, Xiaoguang Han, Kun Zhang, Mingming Gong

    Abstract: Text-to-4D generation has recently been demonstrated viable by integrating a 2D image diffusion model with a video diffusion model. However, existing models tend to produce results with inconsistent motions and geometric structures over time. To this end, we present a novel framework, coined CT4D, which directly operates on animatable meshes for generating consistent 4D content from arbitrary user…

    Submitted 15 August, 2024; originally announced August 2024.

  26. "I Try to Represent Myself as I Am": Self-Presentation Preferences of People with Invisible Disabilities through Embodied Social VR Avatars

    Authors: Ria J. Gualano, Lucy Jiang, Kexin Zhang, Tanisha Shende, Andrea Stevenson Won, Shiri Azenkot

    Abstract: With the increasing adoption of social virtual reality (VR), it is critical to design inclusive avatars. While researchers have investigated how and why blind and d/Deaf people wish to disclose their disabilities in VR, little is known about the preferences of many others with invisible disabilities (e.g., ADHD, dyslexia, chronic conditions). We filled this gap by interviewing 15 participants, eac…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: To appear at ASSETS 2024

  27. arXiv:2408.08146  [pdf, other]

    cs.CL

    KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning

    Authors: Kaiqi Zhang, Jing Zhao, Rui Chen

    Abstract: Large Language Models (LLMs) exhibit high inference latency due to their autoregressive decoding nature. While the draft head in speculative decoding mitigates this issue, its full potential remains unexplored. In this paper, we introduce KOALA (K-layer Optimized Adversarial Learning Architecture), an orthogonal approach to the draft head. By transforming the conventional single-layer draft head i…

    Submitted 15 August, 2024; originally announced August 2024.

  28. arXiv:2408.07340  [pdf, other]

    cs.LG cs.AI

    Towards Few-shot Self-explaining Graph Neural Networks

    Authors: Jingyu Peng, Qi Liu, Linan Yue, Zaixi Zhang, Kai Zhang, Yunhao Sha

    Abstract: Recent advancements in Graph Neural Networks (GNNs) have spurred an upsurge of research dedicated to enhancing the explainability of GNNs, particularly in critical domains such as medicine. A promising approach is the self-explaining method, which outputs explanations along with predictions. However, existing self-explaining models require a large amount of training data, rendering them unavailabl…

    Submitted 14 August, 2024; originally announced August 2024.

  29. arXiv:2408.07278  [pdf, other]

    cs.IR cs.AI cs.CV

    Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction

    Authors: Wenhao Li, Jie Zhou, Chuan Luo, Chao Tang, Kun Zhang, Shixiong Zhao

    Abstract: In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices,…

    Submitted 18 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 10 pages, 6 figures, accepted by Recsys 2024

    MSC Class: 68T09 ACM Class: I.2.0

  30. arXiv:2408.07176  [pdf, other]

    cs.NE

    Surrogate-Assisted Search with Competitive Knowledge Transfer for Expensive Optimization

    Authors: Xiaoming Xue, Yao Hu, Liang Feng, Kai Zhang, Linqi Song, Kay Chen Tan

    Abstract: Expensive optimization problems (EOPs) have attracted increasing research attention over the decades due to their ubiquity in a variety of practical applications. Despite many sophisticated surrogate-assisted evolutionary algorithms (SAEAs) that have been developed for solving such problems, most of them lack the ability to transfer knowledge from previously-solved tasks and always start their sea…

    Submitted 20 August, 2024; v1 submitted 13 August, 2024; originally announced August 2024.

    Comments: 22 pages, 14 figures

  31. arXiv:2408.07060  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

    Authors: Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong

    Abstract: Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agent…

    Submitted 13 August, 2024; originally announced August 2024.

  32. arXiv:2408.07009  [pdf, other]

    cs.CV

    Imagen 3

    Authors: Imagen-Team-Google: Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Peter Igwe, Christos Kaplanis, Siavash Khodadadeh, et al. (227 additional authors not shown)

    Abstract: We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.

    Submitted 13 August, 2024; originally announced August 2024.

  33. arXiv:2408.06878  [pdf, other]

    cs.CV cs.GR

    PBIR-NIE: Glossy Object Capture under Non-Distant Lighting

    Authors: Guangyan Cai, Fujun Luan, Miloš Hašan, Kai Zhang, Sai Bi, Zexiang Xu, Iliyan Georgiev, Shuang Zhao

    Abstract: Glossy objects present a significant challenge for 3D reconstruction from multi-view input images under natural lighting. In this paper, we introduce PBIR-NIE, an inverse rendering framework designed to holistically capture the geometry, material attributes, and surrounding illumination of such objects. We propose a novel parallax-aware non-distant environment map as a lightweight and efficient li…

    Submitted 13 August, 2024; originally announced August 2024.

  34. arXiv:2408.06286  [pdf, other]

    cs.CV

    Mipmap-GS: Let Gaussians Deform with Scale-specific Mipmap for Anti-aliasing Rendering

    Authors: Jiameng Li, Yue Shi, Jiezhang Cao, Bingbing Ni, Wenjun Zhang, Kai Zhang, Luc Van Gool

    Abstract: 3D Gaussian Splatting (3DGS) has attracted great attention in novel view synthesis because of its superior rendering efficiency and high fidelity. However, the trained Gaussians suffer from severe zooming degradation due to non-adjustable representation derived from single-scale training. Though some methods attempt to tackle this problem via post-processing techniques such as selective rendering…

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: 9 pages

  35. arXiv:2408.06141  [pdf, ps, other]

    cs.FL

    [Draft] High-order observers and high-order state-estimation-based properties of discrete-event systems

    Authors: Kuize Zhang, Xiaoguang Han, Alessandro Giua, Carla Seatzu

    Abstract: State-estimation-based properties are central properties in discrete-event systems modeled by labeled finite-state automata studied over the past 3 decades. Most existing results are based on a single agent who knows the structure of a system and can observe a subset of events and estimate the system's state based on the system's structure and the agent's observation to the system. The main tool u…

    Submitted 13 August, 2024; v1 submitted 12 August, 2024; originally announced August 2024.

    Comments: 32 pages, 38 figures

  36. arXiv:2408.06087  [pdf, other]

    cs.CL cs.AI cs.LG

    Building Decision Making Models Through Language Model Regime

    Authors: Yu Zhang, Haoxiang Liu, Feijun Jiang, Weihua Luo, Kaifu Zhang

    Abstract: We propose a novel approach for decision making problems leveraging the generalization capabilities of large language models (LLMs). Traditional methods such as expert systems, planning algorithms, and reinforcement learning often exhibit limited generalization, typically requiring the training of new models for each unique task. In contrast, LLMs demonstrate remarkable success in generalizing acr…

    Submitted 12 August, 2024; originally announced August 2024.

  37. arXiv:2408.06079  [pdf, other]

    cs.CV

    Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment

    Authors: Kejia Zhang, Juanjuan Weng, Zhiming Luo, Shaozi Li

    Abstract: Despite the significant advances that deep neural networks (DNNs) have achieved in various visual tasks, they still exhibit vulnerability to adversarial examples, leading to serious security concerns. Recent adversarial training techniques have utilized inverse adversarial attacks to generate high-confidence examples, aiming to align the distributions of adversarial examples with the high-confiden…

    Submitted 12 August, 2024; originally announced August 2024.

  38. arXiv:2408.06047  [pdf, other]

    cs.CV

    BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training

    Authors: Xuanpu Zhang, Dan Song, Pengxin Zhan, Qingguo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Anan Liu

    Abstract: Image-based virtual try-on is an increasingly popular and important task to generate realistic try-on images of specific person. Existing methods always employ an accurate mask to remove the original garment in the source image, thus achieving realistic synthesized images in simple and conventional try-on scenarios based on powerful diffusion model. Therefore, acquiring suitable mask is vital to t…

    Submitted 12 August, 2024; originally announced August 2024.

  39. arXiv:2408.05926  [pdf, other]

    cs.AI cs.LG cs.MM

    BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

    Authors: Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Kang Zhang, Yu-Jung Heo, Du-Seong Chang, Chang D. Yoo

    Abstract: Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both based on the dialogue context. Due to the lack of a large-scale dataset specifically for this task and the benefits of leveraging powerful pre-trained models, previous work relies on the text modality as an intermediary step for both the image…

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: ECCV 2024

  40. arXiv:2408.05788  [pdf, other]

    cs.LG cs.AI stat.ML

    Continual Learning of Nonlinear Independent Representations

    Authors: Boyang Sun, Ignavier Ng, Guangyi Chen, Yifan Shen, Qirong Ho, Kun Zhang

    Abstract: Identifying the causal relations between interested variables plays a pivotal role in representation learning as it provides deep insights into the dataset. Identifiability, as the central theme of this approach, normally hinges on leveraging data from multiple distributions (intervention, distribution shift, time series, etc.). Despite the exciting development in this field, a practical but often…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 9 pages, 5 Figures

  41. arXiv:2408.05694  [pdf, other]

    cs.CR

    ICSFuzz: Collision Detector Bug Discovery in Autonomous Driving Simulators

    Authors: Weiwei Fu, Heqing Huang, Yifan Zhang, Ke Zhang, Jin Huang, Wei-Bin Lee, Jianping Wang

    Abstract: With the increasing adoption of autonomous vehicles, ensuring the reliability of autonomous driving systems (ADSs) deployed on autonomous vehicles has become a significant concern. Driving simulators have emerged as crucial platforms for testing autonomous driving systems, offering realistic, dynamic, and configurable environments. However, existing simulation-based ADS testers have largely overlo…

    Submitted 11 August, 2024; originally announced August 2024.

  42. arXiv:2408.05428  [pdf, other]

    cs.LG stat.ME stat.ML

    Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression

    Authors: Anpeng Wu, Kun Kuang, Ruoxuan Xiong, Xiangwei Chen, Zexu Sun, Fei Wu, Kun Zhang

    Abstract: In causal inference, encouragement designs (EDs) are widely used to analyze causal effects, when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements ac…

    Submitted 10 August, 2024; originally announced August 2024.

  43. arXiv:2408.05411  [pdf, other]

    cs.CV

    How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

    Authors: Yuxin Zhu, Huiyu Duan, Kaiwei Zhang, Yucheng Zhu, Xilei Zhu, Long Teng, Xiongkuo Min, Guangtao Zhai

    Abstract: Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and…

    Submitted 9 August, 2024; originally announced August 2024.

  44. arXiv:2408.05112  [pdf, other]

    cs.LG cs.AI eess.IV

    Semantic Successive Refinement: A Generative AI-aided Semantic Communication Framework

    Authors: Kexin Zhang, Lixin Li, Wensheng Lin, Yuna Yan, Rui Li, Wenchi Cheng, Zhu Han

    Abstract: Semantic Communication (SC) is an emerging technology aiming to surpass the Shannon limit. Traditional SC strategies often minimize signal distortion between the original and reconstructed data, neglecting perceptual quality, especially in low Signal-to-Noise Ratio (SNR) environments. To address this issue, we introduce a novel Generative AI Semantic Communication (GSC) system for single-user scen…

    Submitted 31 July, 2024; originally announced August 2024.

  45. arXiv:2408.04336  [pdf, other]

    cs.AI

    KnowPC: Knowledge-Driven Programmatic Reinforcement Learning for Zero-shot Coordination

    Authors: Yin Gu, Qi Liu, Zhi Li, Kai Zhang

    Abstract: Zero-shot coordination (ZSC) remains a major challenge in the cooperative AI field, which aims to learn an agent to cooperate with an unseen partner in training environments or even novel environments. In recent years, a popular ZSC solution paradigm has been deep reinforcement learning (DRL) combined with advanced self-play or population-based methods to enhance the neural policy's ability to han…

    Submitted 8 August, 2024; originally announced August 2024.

  46. arXiv:2408.04235  [pdf, other]

    cs.CV

    LLDif: Diffusion Models for Low-light Emotion Recognition

    Authors: Zhifeng Wang, Kaihao Zhang, Ramesh Sankaranarayana

    Abstract: This paper introduces LLDif, a novel diffusion-based facial expression recognition (FER) framework tailored for extremely low-light (LL) environments. Images captured under such conditions often suffer from low brightness and significantly reduced contrast, presenting challenges to conventional methods. These challenges include poor image quality that can significantly reduce the accuracy of emoti…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Accepted by ICPR2024

  47. arXiv:2408.03360  [pdf, other]

    cs.LG cs.AI

    Prioritize Alignment in Dataset Distillation

    Authors: Zekai Li, Ziyao Guo, Wangbo Zhao, Tianle Zhang, Zhi-Qi Cheng, Samir Khaki, Kaipeng Zhang, Ahmad Sajedi, Konstantinos N Plataniotis, Kai Wang, Yang You

    Abstract: Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset. Consequently, the quality of extracted and embedded information determines the quality of the d…

    Submitted 13 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

    Comments: 18 pages, 9 figures

  48. arXiv:2408.03326  [pdf, other]

    cs.CV cs.AI cs.CL

    LLaVA-OneVision: Easy Visual Task Transfer

    Authors: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

    Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-i…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Project Homepage: https://llava-vl.github.io/blog/2024-08-05-llava-onevision/

  49. arXiv:2408.03286  [pdf, other]

    cs.CV

    Biomedical SAM 2: Segment Anything in Biomedical Images and Videos

    Authors: Zhiling Yan, Weixiang Sun, Rong Zhou, Zhengqing Yuan, Kai Zhang, Yiwei Li, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun

    Abstract: Medical image segmentation and video object segmentation are essential for diagnosing and analyzing diseases by identifying and measuring biological structures. Recent advances in natural domain have been driven by foundation models like the Segment Anything Model 2 (SAM-2). To explore the performance of SAM-2 in biomedical applications, we designed three evaluation pipelines for single-frame 2D i…

    Submitted 17 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

  50. arXiv:2408.03149  [pdf, other]

    cs.CV cs.CL

    Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization

    Authors: Yanghai Zhang, Ye Liu, Shiwei Wu, Kai Zhang, Xukai Liu, Qi Liu, Enhong Chen

    Abstract: The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images. The inherent heterogeneity of content within multimodal inputs and outputs presents a significant challenge to the execution of MSMO. Traditional approaches typically adopt a holistic pers…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: In ACL-Findings 2024