DOI: 10.1145/3581783.3611854
research-article

Open-Scenario Domain Adaptive Object Detection in Autonomous Driving

Published: 27 October 2023

Abstract

Existing domain adaptive object detection (DAOD) algorithms have proven effective at discriminating and localizing objects across scenarios. However, they typically assume a single source domain and a single target domain, an assumption that does not reflect the more complex data distributions encountered in practice. To address this, we propose a novel Open-Scenario Domain Adaptive Object Detection (OSDA) approach that leverages multiple source and target domains for more practical and effective adaptation. To better distinguish foreground from background, an aspect under-explored in previous studies, we are the first to increase the granularity of the background category, building a foundation model through contrastive vision-language pre-training in an open-scenario setting. Introducing this pre-training yields clear performance gains, validating the model's ability to detect objects across domains. To further fine-tune the model for domain-specific object detection, we propose a hierarchical feature alignment strategy that learns a better common feature space across the various source and target domains. For multiple source domains, we introduce a cross-reconstruction framework to learn more domain-invariant features. The proposed method alleviates knowledge forgetting without any additional computational cost. Extensive experiments across different scenarios demonstrate the effectiveness of the proposed model.
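The abstract does not specify the pre-training objective. For reference only, the sketch below implements the symmetric image-text contrastive (InfoNCE) loss popularized by CLIP, on which contrastive vision-language pre-training is commonly built. It is not the authors' implementation: the function names, the plain-list embedding representation, and the default temperature of 0.07 are assumptions chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(logits: List[float], idx: int) -> float:
    """Numerically stable log-softmax of logits, evaluated at position idx."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[idx] - log_sum_exp


def clip_contrastive_loss(image_emb: List[Vector], text_emb: List[Vector],
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over N matched (image, text) pairs.

    Pair i is the positive for row i (image -> text) and column i
    (text -> image); every other pair in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    if not image_emb or len(image_emb) != len(text_emb):
        raise ValueError("need equally sized, non-empty batches of embeddings")
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = cosine(image_i, text_j) / temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image -> text: each row is an N-way classification over the captions.
    loss_i2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text -> image: each column is an N-way classification over the images.
    loss_t2i = -sum(log_softmax_at([logits[i][j] for i in range(n)], j)
                    for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Pairs whose embeddings agree drive the loss toward zero, while mismatched pairs are heavily penalized; when all embeddings are identical the loss equals log N. How the authors adapt such an objective to region-level features and a finer-grained background category is not described in the abstract.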

Supplemental Material

MP4 File
Presentation video - short version


Cited By

  • (2024) Uni-YOLO: Vision-Language Model-Guided YOLO for Robust and Fast Universal Detection in the Open World. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 1991-2000. DOI: 10.1145/3664647.3681212. Online publication date: 28 October 2024.


    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. autonomous driving
    2. domain adaptation
    3. object detection

    Qualifiers

    • Research-article


    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (last 12 months): 97
    • Downloads (last 6 weeks): 9
    Reflects downloads up to 01 Feb 2025
