DOI: 10.1145/3503161.3548009
Research Article

Relative Alignment Network for Source-Free Multimodal Video Domain Adaptation

Published: 10 October 2022

Abstract

Video domain adaptation aims to transfer knowledge from labeled source videos to unlabeled target videos. Existing video domain adaptation methods require full access to the source videos to reduce the domain gap between the source and target videos, which is impractical in real scenarios where the source videos are unavailable due to transmission-efficiency or privacy concerns. To address this problem, we propose to solve a source-free domain adaptation task for videos, where only a pre-trained source model and unlabeled target videos are available for learning a multimodal video classification model. Existing source-free domain adaptation methods cannot be directly applied to this task, since videos suffer from domain discrepancy along both the multimodal and temporal aspects, which makes adaptation difficult especially when the source data are unavailable. In this paper, we propose a Multimodal and Temporal Relative Alignment Network (MTRAN) to deal with these challenges. To explicitly imitate the domain shifts contained in the multimodal information and the temporal dynamics of the source and target videos, we divide the target videos into two splits according to the self-entropy values of their classification results: low-entropy videos are deemed source-like, while high-entropy videos are deemed target-like. We then adopt a self-entropy-guided MixUp strategy to generate instance-level synthetic samples and hypothetical samples from the source-like and target-like videos, and push each synthetic sample to be similar to its corresponding hypothetical sample, which is slightly closer to the source-like videos than the synthetic sample, via multimodal and temporal relative alignment schemes. We evaluate the proposed model on four public video datasets, and the results show that our model outperforms existing state-of-the-art methods.
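
The two sample-construction steps outlined in the abstract, the self-entropy-based split of target videos and the entropy-guided MixUp that produces synthetic and hypothetical samples, can be illustrated compactly. The following PyTorch-style sketch is a minimal, hypothetical rendering of those steps; the function names, the threshold tau, and the mixing coefficients are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch (PyTorch) of the self-entropy split and the
# entropy-guided MixUp described in the abstract. Names, the threshold `tau`,
# and the mixing coefficients are illustrative assumptions.
import torch
import torch.nn.functional as F


def self_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample entropy of the softmax prediction (low = confident, source-like)."""
    probs = F.softmax(logits, dim=1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)


@torch.no_grad()
def split_by_entropy(features: torch.Tensor, logits: torch.Tensor, tau: float = 0.5):
    """Split unlabeled target samples into source-like (low-entropy) and
    target-like (high-entropy) subsets using the source model's predictions."""
    ent = self_entropy(logits)
    ent = (ent - ent.min()) / (ent.max() - ent.min() + 1e-8)  # normalize to [0, 1]
    return features[ent <= tau], features[ent > tau]


def entropy_guided_mixup(source_like: torch.Tensor, target_like: torch.Tensor,
                         lam: float = 0.5, delta: float = 0.1):
    """MixUp source-like and target-like features: the 'synthetic' sample mixes
    them with weight `lam`; the 'hypothetical' sample uses a slightly larger
    source-like weight (`lam + delta`), so it sits a bit closer to the
    source-like side."""
    n = min(len(source_like), len(target_like))
    s, t = source_like[:n], target_like[:n]
    synthetic = lam * s + (1.0 - lam) * t
    hypothetical = (lam + delta) * s + (1.0 - lam - delta) * t
    return synthetic, hypothetical
```

In the full model, each synthetic sample would then be pulled toward its (slightly more source-like) hypothetical counterpart through the multimodal and temporal relative alignment losses; the sketch covers only the sample construction.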

Supplementary Material

MP4 File (MM22-fp1086.mp4)
Video presentation



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. relative alignment
    2. source-free
    3. video domain adaptation

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months)98
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2024) Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey. ACM Computing Surveys 56(12), 1-36. https://doi.org/10.1145/3679010. Online publication date: 22-Jul-2024.
    • (2024) A Survey of Trustworthy Representation Learning Across Domains. ACM Transactions on Knowledge Discovery from Data 18(7), 1-53. https://doi.org/10.1145/3657301. Online publication date: 19-Jun-2024.
    • (2024) A Comprehensive Survey on Source-Free Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5743-5762. https://doi.org/10.1109/TPAMI.2024.3370978. Online publication date: Aug-2024.
    • (2024) Domain adaptation via Wasserstein distance and discrepancy metric for chest X-ray image classification. Scientific Reports 14(1). https://doi.org/10.1038/s41598-024-53311-w. Online publication date: 1-Feb-2024.
    • (2024) Source-free unsupervised domain adaptation. Neural Networks 174. https://doi.org/10.1016/j.neunet.2024.106230. Online publication date: 1-Jun-2024.
    • (2024) Source-Free Unsupervised Domain Adaptation. Neurocomputing 564. https://doi.org/10.1016/j.neucom.2023.126921. Online publication date: 1-Feb-2024.
    • (2024) A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts. International Journal of Computer Vision 133(1), 31-64. https://doi.org/10.1007/s11263-024-02181-w. Online publication date: 18-Jul-2024.
    • (2023) Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning. Proceedings of the 31st ACM International Conference on Multimedia, 3807-3816. https://doi.org/10.1145/3581783.3612314. Online publication date: 26-Oct-2023.
