
PADVG: A Simple Baseline of Active Protection for Audio-Driven Video Generation

Published: 08 March 2024

Abstract

Over the past few years, deep generative models have evolved rapidly, enabling the synthesis of highly realistic content while also raising security concerns about illegal misuse. Active protection for generative models has therefore been proposed recently, aiming to generate samples that carry hidden messages for future identification while preserving the original generation quality. However, existing active protection methods are designed specifically for generative adversarial networks (GANs) and are restricted to unconditional image generation. We observe that they achieve limited identification performance and visual quality on audio-driven video generation, which is conditioned on target audio and a source input and must maintain consistent context, e.g., identity and movement, across frame sequences. To address this issue, we introduce a simple yet effective active Protection framework for Audio-Driven Video Generation, named PADVG. Specifically, we present a novel frame-shared embedding module in which the messages to hide are first transformed into frame-shared message coefficients. These coefficients are then assembled with the intermediate feature maps of the video generator at multiple feature levels to produce the embedded video frames. In addition, PADVG employs two visual consistency losses: (i) an intra-frame loss that keeps frames visually consistent across different hidden messages; and (ii) an inter-frame loss that preserves visual consistency across different video frames. Moreover, we propose an auxiliary denoising training strategy that perturbs the assembled features with learnable pixel-level noise, improving identification performance while enhancing robustness against real-world disturbances. Extensive experiments demonstrate that PADVG can effectively identify videos generated by audio-driven models while achieving high visual quality.



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
June 2024, 715 pages
EISSN: 1551-6865
DOI: 10.1145/3613638
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 March 2024
Online AM: 16 January 2024
Accepted: 10 December 2023
Revised: 22 November 2023
Received: 05 April 2023
Published in TOMM Volume 20, Issue 6


Author Tags

  1. Active protection
  2. audio-driven video generative models

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • National NSF of China

