Author: Sun, Xiaoshuai : Search

research-article

JUST ACCEPTED

Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Just Accepted, https://doi.org/10.1145/3687475

In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image ...

research-article

Parameter and computation efficient transfer learning for vision-language pre-trained models

NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, Article No.: 1786, Pages 41034–41050

With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only ...

research-article

Cheap and quick: efficient vision-language instruction tuning for large language models

NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, Article No.: 1288, Pages 29615–29627

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing ...

research-article

Semi-Supervised Panoptic Narrative Grounding

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 7164–7174, https://doi.org/10.1145/3581783.3612259

Despite considerable progress, the advancement of Panoptic Narrative Grounding (PNG) remains hindered by costly annotations. In this paper, we introduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG) learning scheme, capitalizing on a ...

research-article

PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 4666–4677, https://doi.org/10.1145/3581783.3612067

Synthesizing vivid human portraits is a research hot spot in image generation with a wide scope of applications. In addition to fidelity, generation controllability is another key factor that has long plagued its development. To address this issue, ...

research-article

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 4157–4168, https://doi.org/10.1145/3581783.3611768

Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between these ...

research-article

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, Pages 3403–3414, https://doi.org/10.1145/3581783.3611767

In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D ...

research-article

A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension

IEEE Transactions on Multimedia (TOM), Volume 26, Pages 3689–3700, https://doi.org/10.1109/TMM.2023.3314153

One-stage Referring Expression Comprehension (REC) is a task that requires accurate alignment between text descriptions and visual content. In recent years, numerous efforts have been devoted to cross-modal learning for REC, while the influence of other ...

research-article

Towards Language-Guided Visual Recognition via Dynamic Convolutions

International Journal of Computer Vision (IJCV), Volume 132, Issue 1, Pages 1–19, https://doi.org/10.1007/s11263-023-01871-1

Abstract

In this paper, we are committed to establishing a unified and end-to-end multi-modal network via exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-guided ...

research-article

Towards local visual modeling for image captioning

Pattern Recognition (PATT), Volume 138, Issue C, https://doi.org/10.1016/j.patcog.2023.109420

Highlights

Local visual modeling with grid features for image captioning.
Locality-...

Abstract

In this paper, we study the local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (...

research-article

End-to-end zero-shot HOI detection via vision and language knowledge distillation

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Article No.: 316, Pages 2839–2846, https://doi.org/10.1609/aaai.v37i3.25385

Most existing Human-Object Interaction (HOI) Detection methods rely heavily on full annotations with predefined HOI categories, which is limited in diversity and costly to scale further. We aim at advancing zero-shot HOI detection to detect both seen and ...

research-article

Towards real-time panoptic narrative grounding by an end-to-end grounding network

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Article No.: 281, Pages 2528–2536, https://doi.org/10.1609/aaai.v37i2.25350

Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the target regions of an image corresponding to the text description. Existing approaches for PNG are mainly based on a two-stage paradigm, which is ...

research-article

Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning

IEEE Transactions on Multimedia (TOM), Volume 25, Pages 3962–3974, https://doi.org/10.1109/TMM.2022.3169061

Self-attention (SA) based networks have achieved great success in image captioning, constantly dominating the leaderboards of online benchmarks. However, existing SA networks still suffer from distance insensitivity and low-rank bottleneck. In this paper, ...

research-article

Knowing What it is: Semantic-Enhanced Dual Attention Transformer

IEEE Transactions on Multimedia (TOM), Volume 25, Pages 3723–3736, https://doi.org/10.1109/TMM.2022.3164787

Attention has become an indispensable component of the models of various multimedia tasks like Image Captioning (IC) and Visual Question Answering (VQA). However, most existing attention modules are designed for capturing ...

research-article

Fast Monocular Depth Estimation via Side Prediction Aggregation with Continuous Spatial Refinement

IEEE Transactions on Multimedia (TOM), Volume 25, Pages 1204–1216, https://doi.org/10.1109/TMM.2021.3140001

Recent works have validated the benefit of integrating spatial information into deep networks to improve pixel-level prediction tasks such as monocular depth estimation. However, how to efficiently and robustly integrate spatial cues remains an open ...

research-article

Make sharpness-aware minimization stronger: a sparsified perturbation approach

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, Article No.: 2244, Pages 30950–30962

Deep neural networks often suffer from poor generalization caused by complex and non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Minimization (SAM), which smooths the loss landscape via minimizing the maximized change of ...

Article

SeqTR: A Simple Yet Universal Network for Visual Grounding

Computer Vision – ECCV 2022, Pages 598–615, https://doi.org/10.1007/978-3-031-19833-5_35

Abstract

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often ...

Article

An Information Theoretic Approach for Attention-Driven Face Forgery Detection

Computer Vision – ECCV 2022, Pages 111–127, https://doi.org/10.1007/978-3-031-19781-9_7

Abstract

Recently, Deepfake arises as a powerful tool to fool the existing real-world face detection systems, which has received wide attention in both academia and society. Most existing forgery face detection methods use heuristic clues to build a binary ...

Article

PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation

Computer Vision – ECCV 2022, Pages 643–660, https://doi.org/10.1007/978-3-031-19781-9_37

Abstract

Pixel synthesis is a promising research paradigm for image generation, which can well exploit pixel-wise prior knowledge for generation. However, existing methods still suffer from excessive memory footprint and computation overhead. In this paper,...

research-article

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

MM '22: Proceedings of the 30th ACM International Conference on Multimedia, Pages 638–647, https://doi.org/10.1145/3503161.3547910

Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained ...
