- research-article, November 2024
AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data
SA '24: SIGGRAPH Asia 2024 Technical Communications, Article No.: 23, Pages 1–5, https://doi.org/10.1145/3681758.3698013
This paper introduces an effective method for computation-efficient personalized style video generation without requiring access to any personalized video data. It reduces the necessary generation time of similarly sized video diffusion models from 25 ...
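As context for the few-step generation this entry describes, below is a minimal sketch of multistep consistency sampling, the generic family of samplers that consistency-distilled models such as AnimateLCM build on. The `consistency_fn` interface and the noise schedule are illustrative assumptions, not the paper's actual sampler.

```python
import torch

# `consistency_fn(x, sigma)` is a hypothetical consistency model: it maps a
# noisy sample at noise level `sigma` directly to an estimate of the clean sample.
def few_step_sample(consistency_fn, shape, sigmas=(80.0, 24.0, 5.0, 0.5)):
    x = torch.randn(shape) * sigmas[0]          # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0 = consistency_fn(x, sigma)           # one-jump clean estimate
        if i + 1 < len(sigmas):
            # Re-noise the estimate down to the next (lower) noise level.
            x = x0 + sigmas[i + 1] * torch.randn(shape)
        else:
            x = x0
    return x
```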
- Article, November 2024
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Abstract: In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their ...
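To illustrate what a "frequency perspective" can mean in practice, the sketch below low-pass filters a feature map in the Fourier domain, the kind of operation used to suppress the repetitive high-frequency patterns mentioned above. This is a simplified stand-in, not FouriScale's exact operator; the `cutoff` parameter is an assumption.

```python
import torch

def lowpass(feat: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    # feat: (B, C, H, W). Keep only the central `cutoff` fraction of frequencies.
    B, C, H, W = feat.shape
    f = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    mask = torch.zeros(H, W, device=feat.device)
    h, w = int(H * cutoff / 2), int(W * cutoff / 2)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0
    f = f * mask                                   # zero out high frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real
```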
- Article, November 2024
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
Abstract: Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-...
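A generic sketch of the idea of supervising the final output image with a reward (not DRTune's exact algorithm): intermediate sampling steps are detached so the reward gradient flows only through the last denoising step, keeping memory bounded. `model.denoise_step` and `reward_fn` are hypothetical placeholders.

```python
import torch

def reward_tuning_step(model, reward_fn, opt, x_T, timesteps):
    # `model.denoise_step(x, t)` is a hypothetical differentiable sampler step.
    x = x_T
    for i, t in enumerate(timesteps):
        if i < len(timesteps) - 1:
            with torch.no_grad():          # block gradients through early steps
                x = model.denoise_step(x, t)
        else:
            x = model.denoise_step(x, t)   # gradient flows through final step
    loss = -reward_fn(x).mean()            # ascend the reward on the output image
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```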
- Article, November 2024
BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation Using RGB Frames and Events
- Yijin Li,
- Yichen Shen,
- Zhaoyang Huang,
- Shuo Chen,
- Weikang Bian,
- Xiaoyu Shi,
- Fu-Yun Wang,
- Keqiang Sun,
- Hujun Bao,
- Zhaopeng Cui,
- Guofeng Zhang,
- Hongsheng Li
Abstract: Recent advances in event-based vision suggest that event cameras complement traditional cameras by providing continuous observation free of frame-rate limitations and with high dynamic range, properties well suited to correspondence tasks such as optical flow ...
- Article, November 2024
GiT: Towards Generalist Vision Transformer Through Universal Language Interface
- Haiyang Wang,
- Hao Tang,
- Li Jiang,
- Shaoshuai Shi,
- Muhammad Ferjad Naeem,
- Hongsheng Li,
- Bernt Schiele,
- Liwei Wang
Abstract: This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture (e.g., GPT) widely used in ...
- Article, October 2024
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models
- Ziyi Lin,
- Dongyang Liu,
- Renrui Zhang,
- Peng Gao,
- Longtian Qiu,
- Han Xiao,
- Han Qiu,
- Wenqi Shao,
- Keqin Chen,
- Jiaming Han,
- Siyuan Huang,
- Yichi Zhang,
- Xuming He,
- Yu Qiao,
- Hongsheng Li
Abstract: We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, visual embeddings and image scales. First, for stronger vision-language alignment, we unfreeze the ...
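One ingredient named above, weight mixing, can be illustrated by linearly interpolating the parameters of two fine-tuned checkpoints of the same architecture. The checkpoint names and mixing ratio below are illustrative assumptions, not SPHINX's actual recipe.

```python
import torch

def mix_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    # Element-wise interpolation of two checkpoints of the same architecture.
    assert sd_a.keys() == sd_b.keys()
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage: blend checkpoints fine-tuned on different data domains.
# mixed = mix_state_dicts(torch.load("ckpt_domain_a.pt"),
#                         torch.load("ckpt_domain_b.pt"), alpha=0.5)
# model.load_state_dict(mixed)
```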
- Article, October 2024
Delving Deep into Engagement Prediction of Short Videos
Abstract: Understanding and modeling the popularity of User Generated Content (UGC) short videos on social media platforms presents a critical challenge with broad implications for content creators and recommendation systems. This study delves deep into the ...
- Article, October 2024
DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition
- Qi Wang,
- Zhou Xu,
- Yuming Lin,
- Jingtao Ye,
- Hongsheng Li,
- Guangming Zhu,
- Syed Afaq Ali Shah,
- Mohammed Bennamoun,
- Liang Zhang
Abstract: Neuromorphic sensors, specifically event cameras, revolutionize visual data acquisition by capturing pixel intensity changes with exceptional dynamic range, minimal latency, and energy efficiency, setting them apart from conventional frame-based ...
- Article, October 2024
MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- Renrui Zhang,
- Dongzhi Jiang,
- Yichi Zhang,
- Haokun Lin,
- Ziyu Guo,
- Pengshuo Qiu,
- Aojun Zhou,
- Pan Lu,
- Kai-Wei Chang,
- Yu Qiao,
- Peng Gao,
- Hongsheng Li
Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted unparalleled attention. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to ...
- Article, October 2024
Any2Point: Empowering Any-Modality Large Models for Efficient 3D Understanding
- Yiwen Tang,
- Ray Zhang,
- Jiaming Liu,
- Zoey Guo,
- Bin Zhao,
- Zhigang Wang,
- Peng Gao,
- Hongsheng Li,
- Dong Wang,
- Xuelong Li
Abstract: Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D ...
- Article, October 2024
Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks
Abstract: In this paper, we investigate how to conduct transfer learning to adapt Stable Diffusion to downstream visual dense prediction tasks such as semantic segmentation and depth estimation. We focus on fine-tuning the Stable Diffusion model, which has ...
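A hedged sketch of one common transfer recipe consistent with the setup above (not necessarily the paper's): run the denoising UNet once at a fixed timestep, read out intermediate features, and train a lightweight per-pixel head on top. `unet_features_fn` is a placeholder for feature extraction, e.g., via forward hooks.

```python
import torch
import torch.nn as nn

class DiffusionDenseHead(nn.Module):
    # Lightweight head mapping diffusion-backbone features to per-pixel output
    # (e.g., segmentation logits or a depth map).
    def __init__(self, feat_ch: int, out_ch: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, out_ch, 1),
        )

    def forward(self, feats: torch.Tensor, out_hw) -> torch.Tensor:
        return nn.functional.interpolate(
            self.head(feats), size=out_hw, mode="bilinear", align_corners=False
        )

def dense_predict(unet_features_fn, latents, text_emb, head, out_hw):
    # `unet_features_fn` is a placeholder: one denoising pass at a fixed
    # timestep that returns intermediate UNet features (e.g., via hooks).
    feats = unet_features_fn(latents, text_emb, timestep=50)
    return head(feats, out_hw)
```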
- Article, September 2024
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
Abstract: We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for ...
- Article, September 2024
Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation
Abstract: Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. ...
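The "input-specific adaptation" in the title suggests test-time tuning on the input clip itself; the sketch below shows that general pattern with placeholder model methods (`trainable_parameters`, `denoising_loss`, `generate_masked`) and is not the paper's exact procedure.

```python
import torch

def adapt_then_outpaint(model, video, outside_mask, steps=100, lr=1e-4):
    # `model` is a hypothetical video diffusion model exposing a denoising
    # training loss and a masked-generation entry point.
    opt = torch.optim.AdamW(model.trainable_parameters(), lr=lr)
    for _ in range(steps):
        loss = model.denoising_loss(video)   # fit the input clip itself
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Then generate content only in the region outside the original viewport.
    return model.generate_masked(video, outside_mask)
```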
- research-article, September 2024
FeatAug-DETR: Enriching One-to-Many Matching for DETRs With Feature Augmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Volume 46, Issue 9, Pages 6402–6415, https://doi.org/10.1109/TPAMI.2024.3381961
One-to-one matching is a crucial design in DETR-like object detection frameworks. It enables DETR to perform end-to-end detection. However, it also suffers from a lack of positive-sample supervision and slow convergence. Several recent ...
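Schematically, one-to-many supervision via feature augmentation can be realized by decoding both the original and a flipped feature map, matching each view one-to-one against correspondingly flipped ground truth, so every object gets more than one positive query. The sketch below uses placeholder `decoder`, `matcher`, and `criterion` callables and is a high-level illustration, not the paper's implementation.

```python
import torch

def feataug_loss(decoder, memory, gt_boxes, gt_labels, matcher, criterion):
    # `memory`: encoder feature map (B, C, H, W); `gt_boxes`: (N, 4) boxes in
    # normalized (cx, cy, w, h) format. All callables are placeholders.
    losses = []
    for flip in (False, True):
        feats = torch.flip(memory, dims=[-1]) if flip else memory
        boxes = gt_boxes.clone()
        if flip:
            boxes[:, 0] = 1.0 - boxes[:, 0]        # mirror centers horizontally
        preds = decoder(feats)                     # standard one-to-one queries
        match = matcher(preds, boxes, gt_labels)   # Hungarian assignment
        losses.append(criterion(preds, boxes, gt_labels, match))
    # Each ground-truth object now receives one positive query per view.
    return sum(losses)
```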
- research-article, July 2024
SPP: sparsity-preserved parameter-efficient fine-tuning for large language models
ICML'24: Proceedings of the 41st International Conference on Machine Learning, Article No.: 1351, Pages 33254–33269
Large Language Models (LLMs) have become pivotal in advancing the field of artificial intelligence, yet their immense sizes pose significant challenges for both fine-tuning and deployment. Current post-training pruning methods, while reducing the sizes ...
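To make "sparsity-preserved" fine-tuning concrete, the sketch below fine-tunes a pruned linear layer while re-applying a fixed binary mask so pruned weights never revive. This is a generic illustration of the setting, not SPP's exact parameterization; the magnitude-pruning step is an assumption.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024, bias=False)

# Fixed mask from a (hypothetical) post-training pruning step: keep only the
# larger-magnitude half of the weights.
with torch.no_grad():
    threshold = layer.weight.abs().median()
    mask = (layer.weight.abs() >= threshold).float()
    layer.weight.mul_(mask)

opt = torch.optim.AdamW(layer.parameters(), lr=1e-4)

def train_step(x, target):
    loss = nn.functional.mse_loss(layer(x), target)
    opt.zero_grad()
    loss.backward()
    layer.weight.grad.mul_(mask)      # pruned weights receive no gradient
    opt.step()
    with torch.no_grad():
        layer.weight.mul_(mask)       # defensively keep pruned entries at zero
    return loss.item()
```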
- research-article, July 2024
SPHINX-X: scaling data and parameters for a family of multi-modal large language models
- Dongyang Liu,
- Renrui Zhang,
- Longtian Qiu,
- Siyuan Huang,
- Weifeng Lin,
- Shitian Zhao,
- Shijie Geng,
- Ziyi Lin,
- Peng Jin,
- Kaipeng Zhang,
- Wenqi Shao,
- Chao Xu,
- Conghui He,
- Junjun He,
- Hao Shao,
- Pan Lu,
- Yu Qiao,
- Hongsheng Li,
- Peng Gao
ICML'24: Proceedings of the 41st International Conference on Machine Learning, Article No.: 1314, Pages 32400–32420
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded ...
- research-article, July 2024
Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling
- Xiaoyu Shi,
- Zhaoyang Huang,
- Fu-Yun Wang,
- Weikang Bian,
- Dasong Li,
- Yi Zhang,
- Manyuan Zhang,
- Ka Chun Cheung,
- Simon See,
- Hongwei Qin,
- Jifeng Dai,
- Hongsheng Li
SIGGRAPH '24: ACM SIGGRAPH 2024 Conference Papers, Article No.: 111, Pages 1–11, https://doi.org/10.1145/3641519.3657497
We introduce Motion-I2V, a novel framework for consistent and controllable text-guided image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages ...
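The two-stage factorization named in the abstract can be sketched as: first predict motion for the clip from the reference image and prompt, then render frames conditioned on that motion. Both models below are hypothetical placeholders, and the actual conditioning in the paper may differ.

```python
def two_stage_i2v(image, prompt, motion_model, render_model, num_frames=16):
    # Stage 1 (hypothetical interface): predict per-frame motion fields for
    # the whole clip from the single reference image and the text prompt.
    motion_fields = motion_model(image, prompt, num_frames=num_frames)
    # Stage 2: synthesize frames guided by the predicted motion, keeping
    # appearance anchored to the reference image.
    return render_model(image, prompt, motion_fields)
```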