
Showing 1–50 of 129 results for author: Jiao, C

Searching in archive cs.
  1. arXiv:2408.05503  [pdf, other]

    cs.CV cs.AI

    Disentangled Noisy Correspondence Learning

    Authors: Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang

    Abstract: Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predic…

    Submitted 10 August, 2024; originally announced August 2024.

  2. arXiv:2407.12729  [pdf, other]

    cs.DC

    FlexFL: Heterogeneous Federated Learning via APoZ-Guided Flexible Pruning in Uncertain Scenarios

    Authors: Zekai Chen, Chentao Jia, Ming Hu, Xiaofei Xie, Anran Li, Mingsong Chen

    Abstract: Along with the increasing popularity of Deep Learning (DL) techniques, more and more Artificial Intelligence of Things (AIoT) systems are adopting federated learning (FL) to enable privacy-aware collaborative learning among AIoT devices. However, due to the inherent data and device heterogeneity issues, existing FL-based AIoT systems suffer from the model selection problem. Although various hetero…

    Submitted 17 July, 2024; originally announced July 2024.

  3. arXiv:2407.12317  [pdf, other]

    cs.CV

    Out of Length Text Recognition with Sub-String Matching

    Authors: Yongkun Du, Zhineng Chen, Caiyan Jia, Xieping Gao, Yu-Gang Jiang

    Abstract: Scene Text Recognition (STR) methods have demonstrated robust performance in word-level text recognition. However, in real applications the text image is sometimes long due to detected with multiple horizontal words. It triggers the requirement to build long text recognition models from readily available short (i.e., word-level) text datasets, which has been less studied previously. In this paper,…

    Submitted 13 August, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Preprint, 16 pages

  4. arXiv:2407.12259  [pdf, other]

    cs.CL

    In-Context Probing Approximates Influence Function for Data Valuation

    Authors: Cathy Jiao, Gary Gao, Chenyan Xiong

    Abstract: Data valuation quantifies the value of training data, and is used for data attribution (i.e., determining the contribution of training data towards model predictions), and data selection; both of which are important for curating high-quality datasets to train large language models. In our paper, we show that data valuation through in-context probing (i.e., prompting a LLM) approximates influence f…

    Submitted 16 July, 2024; originally announced July 2024.

  5. arXiv:2407.03856  [pdf, other]

    cs.LG

    Q-Adapter: Training Your LLM Adapter as a Residual Q-Function

    Authors: Yi-Chen Li, Fuxiang Zhang, Wenjie Qiu, Lei Yuan, Chengxing Jia, Zongzhang Zhang, Yang Yu

    Abstract: We consider the problem of adapting Large Language Models (LLMs) pre-trained with Reinforcement Learning from Human Feedback (RLHF) to downstream preference data. Naive approaches to achieve this could be supervised fine-tuning on preferred responses or reinforcement learning with a learned reward model. However, the LLM runs the risk of forgetting its initial knowledge as the fine-tuning progress…

    Submitted 4 July, 2024; originally announced July 2024.

  6. arXiv:2407.03162  [pdf, other]

    cs.RO cs.CV cs.LG

    Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning

    Authors: Runyu Ding, Yuzhe Qin, Jiyue Zhu, Chengzhe Jia, Shiqi Yang, Ruihan Yang, Xiaojuan Qi, Xiaolong Wang

    Abstract: Teleoperation is a crucial tool for collecting human demonstrations, but controlling robots with bimanual dexterous hands remains a challenge. Existing teleoperation systems struggle to handle the complexity of coordinating two hands for intricate manipulations. We introduce Bunny-VisionPro, a real-time bimanual dexterous teleoperation system that leverages a VR headset. Unlike previous vision-bas…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: project page: https://dingry.github.io/projects/bunny_visionpro.html

  7. arXiv:2406.19571  [pdf, other]

    cs.SI cs.CY

    Reranking Social Media Feeds: A Practical Guide for Field Experiments

    Authors: Tiziano Piccardi, Martin Saveski, Chenyan Jia, Jeffrey Hancock, Jeanne L. Tsai, Michael S. Bernstein

    Abstract: Social media plays a central role in shaping public opinion and behavior, yet performing experiments on these platforms and, in particular, on feed algorithms is becoming increasingly challenging. This article offers practical recommendations to researchers developing and deploying field experiments focused on real-time re-ranking of social media feeds. This article is organized around two contrib…

    Submitted 27 June, 2024; originally announced June 2024.

  8. arXiv:2406.17876  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

    Authors: Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello

    Abstract: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorpo…

    Submitted 25 June, 2024; originally announced June 2024.

    Journal ref: Conference on Computer Vision and Pattern Recognition (CVPR 2022) - Embodied AI Workshop

  9. arXiv:2406.10907  [pdf, other]

    cs.CV

    SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection

    Authors: Lin Liu, Ziying Song, Qiming Xia, Feiyang Jia, Caiyan Jia, Lei Yang, Hongyu Pan

    Abstract: LiDAR-based sparse 3D object detection plays a crucial role in autonomous driving applications due to its computational efficiency advantages. Existing methods either use the features of a single central voxel as an object proxy, or treat an aggregated cluster of foreground points as an object proxy. However, the former lacks the ability to aggregate contextual information, resulting in insufficie…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2401.02702

  10. arXiv:2406.02959  [pdf, other]

    cs.CL cs.LG

    Adversarial Moment-Matching Distillation of Large Language Models

    Authors: Chen Jia

    Abstract: Knowledge distillation (KD) has been shown to be highly effective in guiding a student model with a larger teacher model and achieving practical benefits in improving the computational and memory efficiency for large language models (LLMs). State-of-the-art KD methods for LLMs mostly rely on minimizing explicit distribution distance between teacher and student probability predictions. Instead of o…

    Submitted 5 June, 2024; originally announced June 2024.

  11. arXiv:2405.17039  [pdf, other]

    cs.CL cs.LG

    BWArea Model: Learning World Model, Inverse Dynamics, and Policy for Controllable Language Generation

    Authors: Chengxing Jia, Pengyuan Wang, Ziniu Li, Yi-Chen Li, Zhilong Zhang, Nan Tang, Yang Yu

    Abstract: Large language models (LLMs) have catalyzed a paradigm shift in natural language processing, yet their limited controllability poses a significant challenge for downstream applications. We aim to address this by drawing inspiration from the neural mechanisms of the human brain, specifically Broca's and Wernicke's areas, which are crucial for language generation and comprehension, respectively. In…

    Submitted 27 May, 2024; originally announced May 2024.

  12. arXiv:2405.17031  [pdf, other]

    cs.LG

    Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning

    Authors: Haoxin Lin, Yu-Yan Xu, Yihao Sun, Zhilong Zhang, Yi-Chen Li, Chengxing Jia, Junyin Ye, Jiaji Zhang, Yang Yu

    Abstract: Model-based methods in reinforcement learning offer a promising approach to enhance data efficiency by facilitating policy exploration within a dynamics model. However, accurately predicting sequential steps in the dynamics model remains a challenge due to the bootstrapping prediction, which attributes the next state to the prediction of the current state. This leads to accumulated errors during m…

    Submitted 27 May, 2024; originally announced May 2024.

  13. arXiv:2405.16873  [pdf, other]

    cs.CV

    ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

    Authors: Ziying Song, Feiyang Jia, Hongyu Pan, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lin Liu, Yang Ji, Lei Yang, Li Wang

    Abstract: In the field of 3D object detection tasks, fusing heterogeneous features from LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is a widely adopted paradigm. However, existing methods are often compromised by imprecise sensor calibration, resulting in feature misalignment in LiDAR-camera BEV fusion. Moreover, such inaccuracies result in errors in depth estimation for the…

    Submitted 5 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  14. arXiv:2404.06395  [pdf, other]

    cs.CL cs.LG

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Authors: Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun

    Abstract: The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce…

    Submitted 3 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: revise according to peer review

  15. arXiv:2403.17477  [pdf, other]

    cs.CV cs.HC

    DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images

    Authors: Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai Bâce, Zhiming Hu, Andreas Bulling

    Abstract: We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images based on a conditional score-based denoising diffusion model. Generating human gaze on 360° images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual hum…

    Submitted 26 March, 2024; originally announced March 2024.

  16. arXiv:2403.12170  [pdf, other]

    cs.RO

    Sim2Real Manipulation on Unknown Objects with Tactile-based Reinforcement Learning

    Authors: Entong Su, Chengzhe Jia, Yuzhe Qin, Wenxuan Zhou, Annabella Macaluso, Binghao Huang, Xiaolong Wang

    Abstract: Using tactile sensors for manipulation remains one of the most challenging problems in robotics. At the heart of these challenges is generalization: How can we train a tactile-based policy that can manipulate unseen and diverse objects? In this paper, we propose to perform Reinforcement Learning with only visual tactile sensing inputs on diverse objects in a physical simulator. By training with di…

    Submitted 18 March, 2024; originally announced March 2024.

  17. arXiv:2403.11848  [pdf, other]

    cs.CV

    GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

    Authors: Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, Li Wang

    Abstract: Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to the inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between…

    Submitted 2 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

  18. arXiv:2403.07261  [pdf, other]

    cs.LG cs.AI

    Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation

    Authors: Chengxing Jia, Fuxiang Zhang, Yi-Chen Li, Chen-Xiao Gao, Xu-Hui Liu, Lei Yuan, Zongzhang Zhang, Yang Yu

    Abstract: Offline meta-reinforcement learning (OMRL) proficiently allows an agent to tackle novel tasks while solely relying on a static dataset. For precise and efficient task identification, existing OMRL research suggests learning separate task representations that be incorporated with policy input, thus forming a context-based meta-policy. A major approach to train task representations is to adopt contr…

    Submitted 11 March, 2024; originally announced March 2024.

  19. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1110 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 8 August, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  20. arXiv:2403.00277  [pdf, other]

    cs.CL

    Gender Bias in Large Language Models across Multiple Languages

    Authors: Jinman Zhao, Yitian Ding, Chen Jia, Yining Wang, Zifan Qian

    Abstract: With the growing deployment of large language models (LLMs) across various applications, assessing the influence of gender biases embedded in LLMs becomes crucial. The topic of gender bias within the realm of natural language processing (NLP) has gained considerable focus, particularly in the context of English. Nonetheless, the investigation of gender bias in languages other than English is still…

    Submitted 29 February, 2024; originally announced March 2024.

    Comments: 20 pages, 27 tables, 7 figures, submitted to ACL2024

  21. arXiv:2402.14760  [pdf, other]

    cs.LG cs.CL

    Generalizing Reward Modeling for Out-of-Distribution Preference Learning

    Authors: Chen Jia

    Abstract: Preference learning (PL) with large language models (LLMs) aims to align the LLMs' generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging. Thus, out-o…

    Submitted 8 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: 31 pages

  22. arXiv:2402.11317  [pdf, other]

    cs.LG cs.AI

    Debiased Offline Representation Learning for Fast Online Adaptation in Non-stationary Dynamics

    Authors: Xinyu Zhang, Wenjie Qiu, Yi-Chen Li, Lei Yuan, Chengxing Jia, Zongzhang Zhang, Yang Yu

    Abstract: Developing policies that can adjust to non-stationary environments is essential for real-world reinforcement learning applications. However, learning such adaptable policies in offline settings, with only a limited set of pre-collected trajectories, presents significant challenges. A key difficulty arises because the limited offline data makes it hard for the context encoder to differentiate betwe…

    Submitted 17 February, 2024; originally announced February 2024.

  23. arXiv:2402.08397  [pdf, other]

    cs.CV

    A Neural-network Enhanced Video Coding Framework beyond ECM

    Authors: Yanchen Zhao, Wenxuan He, Chuanmin Jia, Qizhe Wang, Junru Li, Yue Li, Chaoyi Lin, Kai Zhang, Li Zhang, Siwei Ma

    Abstract: In this paper, a hybrid video compression framework is proposed that serves as a demonstrative showcase of deep learning-based approaches extending beyond the confines of traditional coding methodologies. The proposed hybrid framework is founded upon the Enhanced Compression Model (ECM), which is a further enhancement of the Versatile Video Coding (VVC) standard. We have augmented the latest ECM r…

    Submitted 21 February, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

  24. arXiv:2402.03719  [pdf, other]

    cs.CL cs.AI

    Empowering Language Models with Active Inquiry for Deeper Understanding

    Authors: Jing-Cheng Pang, Heng-Bo Fan, Pengyuan Wang, Jia-Hao Xiao, Nan Tang, Si-Hang Yang, Chengxing Jia, Sheng-Jun Huang, Yang Yu

    Abstract: The rise of large language models (LLMs) has revolutionized the way that we interact with artificial intelligence systems through natural language. However, LLMs often misinterpret user queries because of their uncertain intention, leading to less helpful responses. In natural human interactions, clarification is sought through targeted questioning to uncover obscure information. Thus, in this pap…

    Submitted 6 February, 2024; originally announced February 2024.

  25. arXiv:2401.17851  [pdf, other]

    cs.CV

    Instruction-Guided Scene Text Recognition

    Authors: Yongkun Du, Zhineng Chen, Yuchen Su, Caiyan Jia, Yu-Gang Jiang

    Abstract: Multi-modal models show appealing performance in visual recognition tasks recently, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models are either inefficient or cannot be trivially upgraded to scene text recognition (STR) due to the composition difference between natural and text images. We propose a novel instruction-guided scen…

    Submitted 1 July, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

  26. arXiv:2401.12533  [pdf, other]

    cs.LG cs.AI

    Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

    Authors: Longkun Guo, Chaoqi Jia, Kewen Liao, Zhigang Lu, Minhui Xue

    Abstract: Center-based clustering has attracted significant research interest from both theory and practice. In many practical applications, input data often contain background knowledge that can be used to improve clustering results. In this work, we build on widely adopted $k$-center clustering and model its input background knowledge as must-link (ML) and cannot-link (CL) constraint sets. However, most c…

    Submitted 14 May, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

  27. arXiv:2401.06542  [pdf, other]

    cs.CV

    Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook

    Authors: Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Guoxin Zhang, Lei Yang, Li Wang, Caiyan Jia

    Abstract: In the realm of modern autonomous driving, the perception system is indispensable for accurately assessing the state of the surrounding environment, thereby enabling informed prediction and planning. The key step to this system is related to 3D object detection that utilizes vehicle-mounted sensors such as LiDAR and cameras to identify the size, the category, and the location of nearby objects. De…

    Submitted 15 August, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

  28. arXiv:2401.03907  [pdf, other]

    cs.CV

    RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

    Authors: Ziying Song, Guoxing Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Caiyan Jia, Feiyang Jia, Li Wang

    Abstract: Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for im…

    Submitted 23 April, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

  29. arXiv:2401.02982  [pdf, other]

    cs.CL cs.AI

    FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models

    Authors: Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Jie Zhou, Aimin Zhou, Man Lan, Qingquan Wu, Chong Yang

    Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of financial data analysis, particularly focusing on data-driven thinking, remain uncertain. To bridge this gap, we introduce FinDABench, a comprehensive benchmark designed to evaluate the financial data analysis capabili…

    Submitted 14 June, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

  30. VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework for Multi-Modal 3D Object Detection

    Authors: Ziying Song, Guoxin Zhang, Jun Xie, Lin Liu, Caiyan Jia, Shaoqing Xu, Zhepeng Wang

    Abstract: LiDAR-camera fusion can enhance the performance of 3D object detection by utilizing complementary information between depth-aware LiDAR points and semantically rich images. Existing voxel-based methods face significant challenges when fusing sparse voxel features with dense image features in a one-to-one manner, resulting in the loss of the advantages of images, including semantic and continuity i…

    Submitted 5 January, 2024; originally announced January 2024.

    Journal ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 61, 2023, pp. 1-12

  31. arXiv:2401.01759  [pdf, other]

    cs.SI cs.CL cs.CV cs.MM

    VGA: Vision and Graph Fused Attention Network for Rumor Detection

    Authors: Lin Bai, Caiyan Jia, Ziying Song, Chaoqun Cui

    Abstract: With the development of social media, rumors have been spread broadly on social media platforms, causing great harm to society. Beside textual information, many rumors also use manipulated images or conceal textual information within images to deceive people and avoid being detected, making multimodal rumor detection be a critical problem. The majority of multimodal rumor detection methods mainly…

    Submitted 3 January, 2024; originally announced January 2024.

  32. arXiv:2312.16478  [pdf, other]

    cs.LG

    Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

    Authors: Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang

    Abstract: Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, it inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep ne…

    Submitted 27 December, 2023; originally announced December 2023.

  33. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  34. arXiv:2312.02428  [pdf, other]

    cs.CV cs.IR

    FreestyleRet: Retrieving Images from Style-Diversified Queries

    Authors: Hao Li, Curise Jia, Peng Jin, Zesen Cheng, Kehan Li, Jialu Sui, Chang Liu, Li Yuan

    Abstract: Image Retrieval aims to retrieve corresponding images based on a given query. In application scenarios, users intend to express their retrieval intent through various query styles. However, current retrieval tasks predominantly focus on text-query retrieval exploration, leading to limited retrieval query options and potential ambiguity or bias in user intention. In this paper, we propose the Style…

    Submitted 8 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: 16 pages, 7 figures

  35. arXiv:2312.02226  [pdf, other]

    cs.CV

    Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

    Authors: Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han, Mengmeng Wang, Guang Dai, Sizhe Dang, Jingdong Wang

    Abstract: Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal in…

    Submitted 3 December, 2023; originally announced December 2023.

  36. AdaptiveFL: Adaptive Heterogeneous Federated Learning for Resource-Constrained AIoT Systems

    Authors: Chentao Jia, Ming Hu, Zekai Chen, Yanxin Yang, Xiaofei Xie, Yang Liu, Mingsong Chen

    Abstract: Although Federated Learning (FL) is promising to enable collaborative learning among Artificial Intelligence of Things (AIoT) devices, it suffers from the problem of low classification performance due to various heterogeneity factors (e.g., computing capacity, memory size) of devices and uncertain operating environments. To address these issues, this paper introduces an effective FL approach named…

    Submitted 9 April, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

    Comments: This paper has been accepted by DAC2024

  37. arXiv:2311.01686  [pdf, other]

    cs.CV cs.LG

    Disentangled Representation Learning with Transmitted Information Bottleneck

    Authors: Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Jihong Wang, Xiaojun Chang, Jingdong Wang

    Abstract: Encoding only the task-related information from the raw data, i.e., disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to performance drop;…

    Submitted 14 August, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

  38. arXiv:2310.08261  [pdf, other]

    cs.CV

    GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection

    Authors: Ziying Song, Haiyue Wei, Lin Bai, Lei Yang, Caiyan Jia

    Abstract: LiDAR and cameras are complementary sensors for 3D object detection in autonomous driving. However, it is challenging to explore the unnatural interaction between point clouds and images, and the critical factor is how to conduct feature alignment of heterogeneous modalities. Currently, many methods achieve feature alignment by projection calibration only, without considering the problem of coordi…

    Submitted 12 October, 2023; originally announced October 2023.

  39. arXiv:2310.04704  [pdf, other]

    cs.LG cs.AI

    EdgeFD: An Edge-Friendly Drift-Aware Fault Diagnosis System for Industrial IoT

    Authors: Chen Jiao, Mao Fengjian, Lv Zuohong, Tang Jianhua

    Abstract: Recent transfer learning (TL) approaches in industrial intelligent fault diagnosis (FD) mostly follow the "pre-train and fine-tuning" paradigm to address data drift, which emerges from variable working conditions. However, we find that this approach is prone to the phenomenon known as catastrophic forgetting. Furthermore, performing frequent models fine-tuning on the resource-constrained edge node…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: 2023 IEEE The 23rd International Conference on Communication Technology

  40. arXiv:2309.11125  [pdf, other]

    cs.CV

    PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

    Authors: Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Jingdong Wang

    Abstract: Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., pedestrian detection and Re-IDentification (ReID). Despite significant progress, current methods face two primary challenges: 1) the pedestrian candidates learned within detectors are suboptimal for the ReID task. 2) the potential for collaboration between tw…

    Submitted 13 March, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

  41. arXiv:2309.07589  [pdf, other]

    cs.MM eess.IV

    MPAI-EEV: Standardization Efforts of Artificial Intelligence based End-to-End Video Coding

    Authors: Chuanmin Jia, Feng Ye, Fanke Dong, Kai Lin, Leonardo Chiariglione, Siwei Ma, Huifang Sun, Wen Gao

    Abstract: The rapid advancement of artificial intelligence (AI) technology has led to the prioritization of standardizing the processing, coding, and transmission of video using neural networks. To address this priority area, the Moving Picture, Audio, and Data Coding by Artificial Intelligence (MPAI) group is developing a suite of standards called MPAI-EEV for "end-to-end optimized neural video coding." Th…

    Submitted 14 September, 2023; originally announced September 2023.

  42. arXiv:2308.12508  [pdf, other]

    eess.IV cs.CV cs.GR

    FFEINR: Flow Feature-Enhanced Implicit Neural Representation for Spatio-temporal Super-Resolution

    Authors: Chenyue Jiao, Chongke Bi, Lu Yang

    Abstract: Large-scale numerical simulations are capable of generating data up to terabytes or even petabytes. As a promising method of data reduction, super-resolution (SR) has been widely studied in the scientific visualization community. However, most SR methods are based on deep convolutional neural networks (CNNs) or generative adversarial networks (GANs), and the scale factor needs to be determined before…

    Submitted 26 August, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

    Comments: This paper has been accepted and published by ChinaVis 2023 (2023.7.21-24)

  43. arXiv:2308.10156  [pdf, other]

    cs.CV cs.AI

    SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

    Authors: Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang

    Abstract: Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for co…

    Submitted 13 March, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

    Comments: Accepted to AAAI 2024

    Journal ref: 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 2024

  44. arXiv:2308.04961  [pdf]

    cs.SI cs.LG

    CasCIFF: A Cross-Domain Information Fusion Framework Tailored for Cascade Prediction in Social Networks

    Authors: Hongjun Zhu, Shun Yuan, Xin Liu, Kuo Chen, Chaolong Jia, Ying Qian

    Abstract: Existing approaches for information cascade prediction fall into three main categories: feature-driven methods, point process-based methods, and deep learning-based methods. Among them, deep learning-based methods, characterized by their superior learning and representation capabilities, mitigate the shortcomings inherent in the other methods. However, current deep learning methods still face sever…

    Submitted 9 August, 2023; originally announced August 2023.

  45. arXiv:2307.16289  [pdf]

    cs.CV cs.AI

    Implementing Edge Based Object Detection For Microplastic Debris

    Authors: Amardeep Singh, Prof. Charles Jia, Prof. Donald Kirk

    Abstract: Plastic has embedded itself as an indispensable part of our day-to-day activities, becoming a source of problems due to its non-biodegradable nature and cheap production. With these problems comes the challenge of mitigating and responding to the aftereffects of disposal, or the lack of proper disposal, which leads to waste concentrating in locations and disturbing ecosystems for both plant…

    Submitted 30 July, 2023; originally announced July 2023.

  46. arXiv:2307.13912  [pdf, other]

    cs.HC cs.AI

    Embedding Democratic Values into Social Media AIs via Societal Objective Functions

    Authors: Chenyan Jia, Michelle S. Lam, Minh Chau Mai, Jeff Hancock, Michael S. Bernstein

    Abstract: Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method with application to the…

    Submitted 14 February, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: This paper has been accepted to CSCW 2024 and will be published in Proc. ACM Hum.-Comput. Interact. 8, CSCW1, Article 163 (April 2024)

    Journal ref: Proceedings of the ACM: Human-Computer Interaction, 8, CSCW1, Article 163 (2024)

  47. arXiv:2307.12270  [pdf, other]

    cs.CV

    Context Perception Parallel Decoder for Scene Text Recognition

    Authors: Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, Yu-Gang Jiang

    Abstract: Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Autoregressive (AR)-based models implement the recognition in a character-by-character manner, showing superiority in accuracy but with slow inference speed. Alternatively, parallel decoding (PD)-based models infer all characters in a single decoding pass, offering faster inference speed but gener…

    Submitted 9 October, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

  48. arXiv:2307.09704  [pdf]

    cs.DL

    How are exclusively data journals indexed in major scholarly databases? An examination of the Web of Science, Scopus, Dimensions, and OpenAlex

    Authors: Chenyue Jiao, Kai Li, Zhichao Fang

    Abstract: As part of the data-driven paradigm and open science movement, the data paper is becoming a popular way for researchers to publish their research data, based on academic norms that cross knowledge domains. Data journals have also been created to host this new academic genre. The growing number of data papers and journals has made them an important large-scale data source for understanding how rese…

    Submitted 18 July, 2023; originally announced July 2023.

  49. arXiv:2306.14108  [pdf, other]

    cs.CV eess.IV

    SpikeCodec: An End-to-end Learned Compression Framework for Spiking Camera

    Authors: Kexiang Feng, Chuanmin Jia, Siwei Ma, Wen Gao

    Abstract: Recently, the bio-inspired spike camera with continuous motion recording capability has attracted tremendous attention due to its ultra-high temporal resolution imaging. This imaging feature results in a huge data storage and transmission burden compared to that of traditional cameras, raising a severe challenge and an urgent need for compression of spike-camera-captured content. Exi…

    Submitted 24 June, 2023; originally announced June 2023.

    Comments: 13 pages, 11 figures and 5 tables

  50. arXiv:2305.17770  [pdf, ps, other]

    cs.CV

    Point Cloud Completion Guided by Prior Knowledge via Causal Inference

    Authors: Songxue Gao, Chuanqi Jiao, Ruidong Chen, Weijie Wang, Weizhi Nie

    Abstract: Point cloud completion aims to recover raw point clouds captured by scanners from partial observations caused by occlusion and limited view angles. This makes it hard to recover details because the global feature is unlikely to capture the full details of all missing parts. In this paper, we propose a novel approach to the point cloud completion task, called Point-PC, which uses a memory network to ret…

    Submitted 15 December, 2023; v1 submitted 28 May, 2023; originally announced May 2023.