
Black-box Attack against Self-supervised Video Object Segmentation Models with Contrastive Loss

Published: 18 October 2023
Abstract

    Deep learning models are known to be susceptible to adversarial attacks, in which carefully crafted perturbations to input images deceive a model into making erroneous decisions. This threat underscores the need to examine the security of deep-learning-based object segmentation algorithms. However, research on adversarial attacks has centered primarily on static images, leaving a dearth of studies targeting Video Object Segmentation (VOS) models. Since most self-supervised VOS models rely on affinity matrices to learn feature representations of video sequences and establish robust pixel correspondence, we investigate the impact of adversarial attacks on self-supervised VOS models and propose a black-box attack method incorporating contrastive loss. The method induces segmentation errors through perturbations in the feature space combined with a pixel-level loss function. Unlike conventional gradient-based attack techniques, we adopt an iterative black-box strategy that applies contrastive loss across the current frame, any two consecutive frames, and multiple frames. Extensive experiments on the DAVIS 2016 and DAVIS 2017 datasets with three self-supervised VOS models and one unsupervised VOS model demonstrate the attack's efficiency: the J&F metric declines by up to 50.08% after the attack.
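    The pipeline the abstract describes — iteratively perturbing a frame so that its features drift away from the clean frame's features, using only model queries — can be illustrated with a deliberately simplified sketch. Everything below is an assumption for illustration: the linear "model", the single-frame cosine-based contrastive term, and the finite-difference gradient estimator stand in for the paper's actual VOS encoders, multi-frame contrastive loss, and query strategy.

    ```python
    import numpy as np

    def features(x, W):
        # Stand-in "model": a fixed linear embedding with L2 normalisation.
        f = W @ x
        return f / (np.linalg.norm(f) + 1e-8)

    def estimate_grad(loss_fn, x, rng, sigma=1e-3, n_samples=30):
        # Black-box gradient estimate via two-sided finite differences along
        # random Gaussian directions: the model is only queried, never differentiated.
        g = np.zeros_like(x)
        for _ in range(n_samples):
            u = rng.standard_normal(x.shape)
            g += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) / (2 * sigma) * u
        return g / n_samples

    def blackbox_attack(x, loss_fn, eps=0.05, alpha=0.01, steps=10, seed=0):
        # Iterative sign-gradient ascent on the loss, projected back onto an
        # L-infinity ball of radius eps around the clean frame after each step.
        rng = np.random.default_rng(seed)
        x_adv = x.copy()
        for _ in range(steps):
            g = estimate_grad(loss_fn, x_adv, rng)
            x_adv = x_adv + alpha * np.sign(g)
            x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within the budget
            x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
        return x_adv

    # Toy setup: a 16-pixel "frame" and an 8-dimensional feature embedding.
    rng = np.random.default_rng(1)
    x_clean = rng.random(16)
    W = rng.standard_normal((8, 16))
    f_clean = features(x_clean, W)

    # Contrastive-style objective: push the adversarial frame's features away
    # from the clean frame's features (maximise negative cosine similarity).
    loss = lambda z: -float(f_clean @ features(z, W))

    x_adv = blackbox_attack(x_clean, loss)
    ```

    In a real VOS setting the loss would additionally contrast features across consecutive frames, and eps/alpha would be tuned so the perturbation stays imperceptible while still corrupting the affinity-based pixel correspondence.
    
    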


    Cited By

    • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 229–239. DOI: 10.1145/3626772.3657727. Online publication date: 10-Jul-2024.


      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
      February 2024, 548 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3613570
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 October 2023
      Online AM: 25 August 2023
      Accepted: 20 August 2023
      Revised: 07 August 2023
      Received: 23 August 2022
      Published in TOMM Volume 20, Issue 2


      Author Tags

      1. Black-box adversarial attack
      2. self-supervised video object segmentation
      3. contrastive loss
      4. feature loss
      5. pixel-level loss

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Xuzhou Key Research and Development Program


      Article Metrics

      • Downloads (Last 12 months): 227
      • Downloads (Last 6 weeks): 12
      Reflects downloads up to 27 Jul 2024
