- research-article, August 2024, JUST ACCEPTED
Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Just Accepted, https://doi.org/10.1145/3687475
In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image ...
- research-article, May 2024
Parameter and computation efficient transfer learning for vision-language pre-trained models
NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, Article No.: 1786, Pages 41034–41050
With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only ...
- research-article, May 2024
Cheap and quick: efficient vision-language instruction tuning for large language models
NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, Article No.: 1288, Pages 29615–29627
Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing ...
- research-article, October 2023
Semi-Supervised Panoptic Narrative Grounding
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 7164–7174, https://doi.org/10.1145/3581783.3612259
Despite considerable progress, the advancement of Panoptic Narrative Grounding (PNG) remains hindered by costly annotations. In this paper, we introduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG) learning scheme, capitalizing on a ...
- research-article, October 2023
PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 4666–4677, https://doi.org/10.1145/3581783.3612067
Synthesizing vivid human portraits is a research hot spot in image generation with a wide scope of applications. In addition to fidelity, generation controllability is another key factor that has long plagued its development. To address this issue, ...
- research-article, October 2023
Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 4157–4168, https://doi.org/10.1145/3581783.3611768
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between these ...
- research-article, October 2023
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
- Haowei Wang,
- Jiji Tang,
- Jiayi Ji,
- Xiaoshuai Sun,
- Rongsheng Zhang,
- Yiwei Ma,
- Minda Zhao,
- Lincheng Li,
- Zeng Zhao,
- Tangjie Lv,
- Rongrong Ji
MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 3403–3414, https://doi.org/10.1145/3581783.3611767
In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D ...
- research-article, September 2023
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
IEEE Transactions on Multimedia (TOM), Volume 26, Pages 3689–3700, https://doi.org/10.1109/TMM.2023.3314153
One-stage Referring Expression Comprehension (REC) is a task that requires accurate alignment between text descriptions and visual content. In recent years, numerous efforts have been devoted to cross-modal learning for REC, while the influence of other ...
- research-article, August 2023
Towards Language-Guided Visual Recognition via Dynamic Convolutions
International Journal of Computer Vision (IJCV), Volume 132, Issue 1, Pages 1–19, https://doi.org/10.1007/s11263-023-01871-1
In this paper, we are committed to establishing a unified and end-to-end multi-modal network via exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-guided ...
- research-article, June 2023
Towards local visual modeling for image captioning
Highlights:
  - Local visual modeling with grid features for image captioning.
  - Locality-...
In this paper, we study the local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (...
- research-article, February 2023
End-to-end zero-shot HOI detection via vision and language knowledge distillation
AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Article No.: 316, Pages 2839–2846, https://doi.org/10.1609/aaai.v37i3.25385
Most existing Human-Object Interaction (HOI) Detection methods rely heavily on full annotations with predefined HOI categories, which is limited in diversity and costly to scale further. We aim at advancing zero-shot HOI detection to detect both seen and ...
- research-article, February 2023
Towards real-time panoptic narrative grounding by an end-to-end grounding network
AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Article No.: 281, Pages 2528–2536, https://doi.org/10.1609/aaai.v37i2.25350
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the target regions of an image corresponding to the text description. Existing approaches for PNG are mainly based on a two-stage paradigm, which is ...
- research-article, January 2023
Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning
- Jiayi Ji,
- Xiaoyang Huang,
- Xiaoshuai Sun,
- Yiyi Zhou,
- Gen Luo,
- Liujuan Cao,
- Jianzhuang Liu,
- Ling Shao,
- Rongrong Ji
IEEE Transactions on Multimedia (TOM), Volume 25, Pages 3962–3974, https://doi.org/10.1109/TMM.2022.3169061
Self-attention (SA) based networks have achieved great success in image captioning, constantly dominating the leaderboards of online benchmarks. However, existing SA networks still suffer from distance insensitivity and low-rank bottleneck. In this paper, ...
- research-article, January 2023
Knowing What it is: Semantic-Enhanced Dual Attention Transformer
IEEE Transactions on Multimedia (TOM), Volume 25, Pages 3723–3736, https://doi.org/10.1109/TMM.2022.3164787
Attention has become an indispensable component of the models of various multimedia tasks like Image Captioning (IC) and Visual Question Answering (VQA). However, most existing attention modules are designed for capturing ...
- research-article, January 2023
Fast Monocular Depth Estimation via Side Prediction Aggregation with Continuous Spatial Refinement
IEEE Transactions on Multimedia (TOM), Volume 25, Pages 1204–1216, https://doi.org/10.1109/TMM.2021.3140001
Recent works have validated the benefit of integrating spatial information into deep networks to improve pixel-level prediction tasks such as monocular depth estimation. However, how to efficiently and robustly integrate spatial cues retains as an open ...
- research-article, April 2024
Make sharpness-aware minimization stronger: a sparsified perturbation approach
NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, Article No.: 2244, Pages 30950–30962
Deep neural networks often suffer from poor generalization caused by complex and non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Minimization (SAM), which smooths the loss landscape via minimizing the maximized change of ...
- Article, October 2022
SeqTR: A Simple Yet Universal Network for Visual Grounding
- Chaoyang Zhu,
- Yiyi Zhou,
- Yunhang Shen,
- Gen Luo,
- Xingjia Pan,
- Mingbao Lin,
- Chao Chen,
- Liujuan Cao,
- Xiaoshuai Sun,
- Rongrong Ji
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often ...
- Article, October 2022
An Information Theoretic Approach for Attention-Driven Face Forgery Detection
Recently, Deepfake arises as a powerful tool to fool the existing real-world face detection systems, which has received wide attention in both academia and society. Most existing forgery face detection methods use heuristic clues to build a binary ...
- research-article, October 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
MM '22: Proceedings of the 30th ACM International Conference on Multimedia, Pages 638–647, https://doi.org/10.1145/3503161.3547910
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained ...