Self-supervised Cross-view Representation Reconstruction for Change Captioning

Tu, Yunbin; Li, Liang; Su, Li; Zha, Zheng-Jun; Yan, Chenggang; Huang, Qingming

arXiv:2309.16283 (cs)

[Submitted on 28 Sep 2023]

Abstract: Change captioning aims to describe the difference between a pair of similar images. Its key challenge is learning a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing the cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning module to improve the quality of the caption. This module reversely models a "hallucination" representation from the caption and the "before" representation. By pushing it closer to the "after" representation, we force the caption to be informative about the difference in a self-supervised manner. Extensive experiments show that our method achieves state-of-the-art results on four datasets. The code is available at this https URL.
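
As an illustration of the first two steps named in the abstract (token-wise matching followed by cross-view contrastive alignment), the following is a minimal NumPy sketch. It is not the authors' released code: the function names, the max-over-tokens scoring rule, the head split, the symmetric InfoNCE-style loss, and the temperature value are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np

def token_wise_similarity(before, after, num_heads=4):
    """Hypothetical multi-head token-wise matching score.

    before: (n, d) token features of the "before" image.
    after:  (m, d) token features of the "after" image.
    Returns a scalar; higher means the two views match better.
    """
    n, d = before.shape
    m, _ = after.shape
    assert d % num_heads == 0, "feature dim must be divisible by num_heads"
    hd = d // num_heads
    # Split the channel dimension into heads: (heads, tokens, head_dim).
    b = before.reshape(n, num_heads, hd).transpose(1, 0, 2)
    a = after.reshape(m, num_heads, hd).transpose(1, 0, 2)
    # L2-normalise so that dot products are cosine similarities.
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    sim = b @ a.transpose(0, 2, 1)          # (heads, n, m)
    # Each token keeps its best match in the other view (assumed rule).
    b2a = sim.max(axis=2).mean(axis=1)      # (heads,)
    a2b = sim.max(axis=1).mean(axis=1)      # (heads,)
    return float(0.5 * (b2a + a2b).mean())  # average over heads

def contrastive_alignment_loss(before_batch, after_batch,
                               temperature=0.1, num_heads=4):
    """Symmetric InfoNCE-style loss (assumed form): the i-th "before"
    image should match the i-th "after" image better than any other."""
    size = len(before_batch)
    logits = np.array([
        [token_wise_similarity(before_batch[i], after_batch[j], num_heads)
         for j in range(size)]
        for i in range(size)
    ]) / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Under these assumptions, minimizing the loss pulls the two views of the same scene together and pushes views of different scenes apart, which is one common way to realise the "maximizing cross-view contrastive alignment" idea described above.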

Comments:	Accepted by ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2309.16283 [cs.CV]
	(or arXiv:2309.16283v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.16283

Submission history

From: Yunbin Tu
[v1] Thu, 28 Sep 2023 09:28:50 UTC (3,067 KB)
