DOI: 10.1145/3503161.3548009
Research Article

Relative Alignment Network for Source-Free Multimodal Video Domain Adaptation

Published: 10 October 2022

Abstract

Video domain adaptation aims to transfer knowledge from labeled source videos to unlabeled target videos. Existing video domain adaptation methods require full access to the source videos to reduce the domain gap between the source and target videos, which is impractical in real scenarios where the source videos are unavailable due to transmission-efficiency or privacy concerns. To address this problem, we propose to solve a source-free domain adaptation task for videos, where only a pre-trained source model and unlabeled target videos are available for learning a multimodal video classification model. Existing source-free domain adaptation methods cannot be directly applied to this task, since videos suffer from domain discrepancy along both the multimodal and temporal aspects, which makes adaptation difficult especially when the source data are unavailable. In this paper, we propose a Multimodal and Temporal Relative Alignment Network (MTRAN) to deal with these challenges. To explicitly imitate the domain shifts contained in the multimodal information and the temporal dynamics of the source and target videos, we divide the target videos into two splits according to the self-entropy values of their classification results: low-entropy videos are deemed source-like, while high-entropy videos are deemed target-like. We then adopt a self-entropy-guided MixUp strategy to generate instance-level synthetic samples and hypothetical samples from the source-like and target-like videos, and push each synthetic sample to be similar to its corresponding hypothetical sample, which is slightly closer to the source-like videos than the synthetic sample, via multimodal and temporal relative alignment schemes. We evaluate the proposed model on four public video datasets, and the results show that our model outperforms existing state-of-the-art methods.
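
The two sample-construction steps outlined in the abstract, the self-entropy-based split of target videos and the entropy-guided MixUp that produces synthetic and hypothetical samples, can be illustrated compactly. The following PyTorch-style sketch is a minimal, hypothetical rendering of those steps; the function names, the threshold tau, and the mixing coefficients are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch (PyTorch) of the self-entropy split and the
# entropy-guided MixUp described in the abstract. Names, the threshold `tau`,
# and the mixing coefficients are illustrative assumptions.
import torch
import torch.nn.functional as F


def self_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample entropy of the softmax prediction (low = confident, source-like)."""
    probs = F.softmax(logits, dim=1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)


@torch.no_grad()
def split_by_entropy(features: torch.Tensor, logits: torch.Tensor, tau: float = 0.5):
    """Split unlabeled target samples into source-like (low-entropy) and
    target-like (high-entropy) subsets using the source model's predictions."""
    ent = self_entropy(logits)
    ent = (ent - ent.min()) / (ent.max() - ent.min() + 1e-8)  # normalize to [0, 1]
    return features[ent <= tau], features[ent > tau]


def entropy_guided_mixup(source_like: torch.Tensor, target_like: torch.Tensor,
                         lam: float = 0.5, delta: float = 0.1):
    """MixUp source-like and target-like features: the 'synthetic' sample mixes
    them with weight `lam`; the 'hypothetical' sample uses a slightly larger
    source-like weight (`lam + delta`), so it sits a bit closer to the
    source-like side."""
    n = min(len(source_like), len(target_like))
    s, t = source_like[:n], target_like[:n]
    synthetic = lam * s + (1.0 - lam) * t
    hypothetical = (lam + delta) * s + (1.0 - lam - delta) * t
    return synthetic, hypothetical
```

In the full model, each synthetic sample would then be pulled toward its (slightly more source-like) hypothetical counterpart through the multimodal and temporal relative alignment losses; the sketch covers only the sample construction.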

Supplementary Material

MP4 File (MM22-fp1086.mp4)
Video presentation



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. relative alignment
    2. source-free
    3. video domain adaptation

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months)98
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2024) Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey. ACM Computing Surveys 56(12), 1-36. https://doi.org/10.1145/3679010. Online publication date: 22-Jul-2024.
    • (2024) A Survey of Trustworthy Representation Learning Across Domains. ACM Transactions on Knowledge Discovery from Data 18(7), 1-53. https://doi.org/10.1145/3657301. Online publication date: 19-Jun-2024.
    • (2024) A Comprehensive Survey on Source-Free Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5743-5762. https://doi.org/10.1109/TPAMI.2024.3370978. Online publication date: Aug-2024.
    • (2024) Domain adaptation via Wasserstein distance and discrepancy metric for chest X-ray image classification. Scientific Reports 14(1). https://doi.org/10.1038/s41598-024-53311-w. Online publication date: 1-Feb-2024.
    • (2024) Source-free unsupervised domain adaptation. Neural Networks 174. https://doi.org/10.1016/j.neunet.2024.106230. Online publication date: 1-Jun-2024.
    • (2024) Source-Free Unsupervised Domain Adaptation. Neurocomputing 564. https://doi.org/10.1016/j.neucom.2023.126921. Online publication date: 1-Feb-2024.
    • (2024) A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts. International Journal of Computer Vision 133(1), 31-64. https://doi.org/10.1007/s11263-024-02181-w. Online publication date: 18-Jul-2024.
    • (2023) Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning. Proceedings of the 31st ACM International Conference on Multimedia, 3807-3816. https://doi.org/10.1145/3581783.3612314. Online publication date: 26-Oct-2023.
