X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Yuan, Zhihao; Yan, Xu; Liao, Yinghong; Guo, Yao; Li, Guanbin; Li, Zhen; Cui, Shuguang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2203.00843 (cs)

[Submitted on 2 Mar 2022 (v1), last revised 6 Apr 2022 (this version, v3)]

Abstract: 3D dense captioning aims to describe individual objects in 3D scenes using natural language, where the scenes are usually represented as RGB-D scans or point clouds. However, by exploiting only single-modal information, e.g., point clouds, previous approaches fail to produce faithful descriptions. Although aggregating 2D features into point clouds may be beneficial, it introduces an extra computational burden, especially at inference time. In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, which effectively boosts the performance of single-modal 3D captioning through knowledge distillation in a teacher-student framework. In practice, during the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints. Owing to the well-designed cross-modal feature fusion module and the feature alignment in the training phase, X-Trans2Cap readily acquires the rich appearance information embedded in 2D images. Thus, a more faithful caption can be generated using only point clouds during inference. Qualitative and quantitative results confirm that X-Trans2Cap outperforms the previous state of the art by a large margin, i.e., about +21 and about +16 absolute CIDEr score on the ScanRefer and Nr3D datasets, respectively.
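
The teacher-student training described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the feature consistency constraint, so the mean squared error used here is an assumption, as are the names `feature_consistency_loss`, `student_objective`, and the `weight` parameter; none of them come from the paper or its code.

```python
from typing import Sequence

Vector = Sequence[float]


def feature_consistency_loss(student: Sequence[Vector],
                             teacher: Sequence[Vector]) -> float:
    """Mean squared error between per-object student and teacher features.

    Assumption: MSE stands in for the paper's consistency constraint,
    whose exact form is not given in the abstract.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must cover the same objects")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


def student_objective(caption_loss: float,
                      student: Sequence[Vector],
                      teacher: Sequence[Vector],
                      weight: float = 1.0) -> float:
    """Hypothetical training objective: captioning loss plus weighted consistency term."""
    return caption_loss + weight * feature_consistency_loss(student, teacher)


# Toy features for two objects. The teacher (2D + 3D input) is used only in training;
# the student (point clouds only) is the network kept for inference.
teacher_feats = [[1.0, 0.0], [0.5, 0.5]]
student_feats = [[0.0, 0.0], [0.5, 0.5]]
loss = feature_consistency_loss(student_feats, teacher_feats)
total = student_objective(2.0, student_feats, teacher_feats, weight=0.5)
```

Because the consistency term involves the teacher only as a training target, dropping the teacher at inference leaves the student's input requirements unchanged, which is the property the abstract emphasizes.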

Comments:	To appear in CVPR 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2203.00843 [cs.CV]
	(or arXiv:2203.00843v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.00843

Submission history

From: Xu Yan
[v1] Wed, 2 Mar 2022 03:35:37 UTC (4,298 KB)
[v2] Fri, 18 Mar 2022 12:45:21 UTC (4,420 KB)
[v3] Wed, 6 Apr 2022 11:55:04 UTC (4,276 KB)
