Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Zhang, Yimeng; Chen, Xin; Jia, Jinghan; Zhang, Yihua; Fan, Chongyu; Liu, Jiancheng; Hong, Mingyi; Ding, Ke; Liu, Sijia

arXiv:2405.15234 (cs)

[Submitted on 24 May 2024 (v1), last revised 9 Oct 2024 (this version, v3)]

Abstract: Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the potential generation of harmful content and copyright violations. Machine unlearning techniques, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can cause unlearned DMs to regenerate undesired images containing the concepts (such as nudity) that were meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in a robust unlearning framework referred to as AdvUnlearn. Achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn. Moreover, we identify the text encoder as a more suitable module for robustification than the UNet, ensuring unlearning effectiveness. The resulting text encoder can also serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments to demonstrate the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn also achieves a balanced trade-off with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Code is available at: this https URL
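
One way to read the approach described in the abstract is as a min-max problem. The following is a schematic under stated assumptions, not the paper's exact objective. All symbols are introduced here for illustration only: \theta denotes the parameters of the module being fine-tuned, c_e the prompt for the concept to erase, \mathcal{A}(c_e) a hypothetical set of allowed adversarial variants of that prompt, \ell_u an unlearning loss, \ell_r a utility-retaining loss on the retain set \mathcal{C}_{retain}, and \gamma a regularization weight.

```latex
% Schematic only: symbols and the exact form of each loss are assumptions,
% not taken from the paper.
\min_{\theta} \;
  \Big[ \max_{c' \in \mathcal{A}(c_e)} \ell_{\mathrm{u}}(\theta,\, c') \Big]
  \;+\; \gamma \, \ell_{\mathrm{r}}(\theta,\, \mathcal{C}_{\mathrm{retain}})
```

In this reading, the inner maximization plays the role of the adversarial prompt attack, the outer minimization is the unlearning update, and the \gamma term corresponds to the utility-retaining regularization on the retain set mentioned in the abstract.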

Comments:	Accepted by NeurIPS'24. Codes are available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Cite as:	arXiv:2405.15234 [cs.CV]
	(or arXiv:2405.15234v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.15234

Submission history

From: Yimeng Zhang [view email]
[v1] Fri, 24 May 2024 05:47:23 UTC (11,198 KB)
[v2] Fri, 14 Jun 2024 21:50:22 UTC (11,198 KB)
[v3] Wed, 9 Oct 2024 16:12:40 UTC (11,225 KB)
