
Black-box Attack against Self-supervised Video Object Segmentation Models with Contrastive Loss

Published: 18 October 2023
Abstract

    Deep learning models are known to be susceptible to adversarial attacks, in which carefully crafted perturbations to input images deceive a model into making erroneous decisions. This threat underscores the need to examine the security of deep-learning-based object segmentation algorithms. However, research on adversarial attacks has centered primarily on static images, leaving a dearth of studies targeting Video Object Segmentation (VOS) models. Since most self-supervised VOS models rely on affinity matrices to learn feature representations of video sequences and establish robust pixel correspondence, we investigate the impact of adversarial attacks on self-supervised VOS models and propose a black-box attack method incorporating contrastive loss. The method induces segmentation errors through perturbations in the feature space combined with a pixel-level loss function. Unlike conventional gradient-based attack techniques, we adopt an iterative black-box strategy that applies contrastive loss across the current frame, any two consecutive frames, and multiple frames. Extensive experiments on the DAVIS 2016 and DAVIS 2017 datasets with three self-supervised VOS models and one unsupervised VOS model demonstrate the attack's efficiency: the J&F metric declines by up to 50.08% after the attack.
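    The pipeline the abstract describes — iteratively perturbing a frame so that its features drift away from the clean frame's features, using only model queries — can be illustrated with a deliberately simplified sketch. Everything below is an assumption for illustration: the linear "model", the single-frame cosine-based contrastive term, and the finite-difference gradient estimator stand in for the paper's actual VOS encoders, multi-frame contrastive loss, and query strategy.

    ```python
    import numpy as np

    def features(x, W):
        # Stand-in "model": a fixed linear embedding with L2 normalisation.
        f = W @ x
        return f / (np.linalg.norm(f) + 1e-8)

    def estimate_grad(loss_fn, x, rng, sigma=1e-3, n_samples=30):
        # Black-box gradient estimate via two-sided finite differences along
        # random Gaussian directions: the model is only queried, never differentiated.
        g = np.zeros_like(x)
        for _ in range(n_samples):
            u = rng.standard_normal(x.shape)
            g += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) / (2 * sigma) * u
        return g / n_samples

    def blackbox_attack(x, loss_fn, eps=0.05, alpha=0.01, steps=10, seed=0):
        # Iterative sign-gradient ascent on the loss, projected back onto an
        # L-infinity ball of radius eps around the clean frame after each step.
        rng = np.random.default_rng(seed)
        x_adv = x.copy()
        for _ in range(steps):
            g = estimate_grad(loss_fn, x_adv, rng)
            x_adv = x_adv + alpha * np.sign(g)
            x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within the budget
            x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
        return x_adv

    # Toy setup: a 16-pixel "frame" and an 8-dimensional feature embedding.
    rng = np.random.default_rng(1)
    x_clean = rng.random(16)
    W = rng.standard_normal((8, 16))
    f_clean = features(x_clean, W)

    # Contrastive-style objective: push the adversarial frame's features away
    # from the clean frame's features (maximise negative cosine similarity).
    loss = lambda z: -float(f_clean @ features(z, W))

    x_adv = blackbox_attack(x_clean, loss)
    ```

    In a real VOS setting the loss would additionally contrast features across consecutive frames, and eps/alpha would be tuned so the perturbation stays imperceptible while still corrupting the affinity-based pixel correspondence.
    
    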


    Cited By

    • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 229–239. DOI: 10.1145/3626772.3657727. Online publication date: 10-Jul-2024.


      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
      February 2024, 548 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3613570
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 October 2023
      Online AM: 25 August 2023
      Accepted: 20 August 2023
      Revised: 07 August 2023
      Received: 23 August 2022
      Published in TOMM Volume 20, Issue 2


      Author Tags

      1. Black-box adversarial attack
      2. self-supervised video object segmentation
      3. contrastive loss
      4. feature loss
      5. pixel-level loss

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Xuzhou Key Research and Development Program


      Article Metrics

      • Downloads (Last 12 months): 227
      • Downloads (Last 6 weeks): 12
      Reflects downloads up to 27 Jul 2024
