DOI: 10.1145/3600211.3604690

Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models

Published: 29 August 2023

Abstract

A growing ecosystem of large, open-source foundation models has reduced the labeled data and technical expertise necessary to apply machine learning to many new problems. Yet foundation models pose a clear dual-use risk, indiscriminately reducing the costs of building both harmful and beneficial machine learning systems. Policy tools such as restricted model access and export controls are the primary methods currently used to mitigate such dual-use risks. In this work, we review potential safe-release strategies and argue that both policymakers and AI researchers would benefit from fundamentally new technologies enabling more precise control over the downstream usage of open-source foundation models. We propose one such approach: the task blocking paradigm, in which foundation models are trained with an additional mechanism to impede adaptation to harmful tasks without sacrificing performance on desirable tasks. We call the resulting models self-destructing models, inspired by mechanisms that prevent adversaries from using tools for harmful purposes. We present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning, which we call meta-learned adversarial censoring (MLAC). In a small-scale experiment, we show MLAC can largely prevent a BERT-style model from being re-purposed to perform gender identification without harming the model’s ability to perform profession classification.
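
To make the training procedure described above concrete, here is a minimal, hypothetical PyTorch sketch of the task-blocking objective: an inner loop simulates an adversary fitting a fresh linear probe on the blocked task with a few differentiable gradient steps, and the outer loop updates the model to preserve the desired task while raising the adapted adversary's loss. All names (`encoder`, `desired_head`, the batch variables, and the hyperparameters) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the task-blocking objective behind MLAC
# (meta-learned adversarial censoring); NOT the paper's reference
# implementation. `encoder` maps inputs to feature vectors [B, D];
# `desired_head` is a classifier for the task we want to keep.
import torch
import torch.nn.functional as F


def mlac_step(encoder, desired_head, desired_batch, blocked_batch,
              n_blocked, optimizer, inner_steps=3, inner_lr=0.1, lam=1.0):
    x_d, y_d = desired_batch   # desired task, e.g. profession classification
    x_b, y_b = blocked_batch   # blocked task, e.g. gender identification

    # (1) Ordinary supervised loss on the desired task.
    desired_loss = F.cross_entropy(desired_head(encoder(x_d)), y_d)

    # (2) Simulate an adversary: fit a fresh linear probe on the blocked
    # task with a few SGD steps, keeping the graph (create_graph=True) so
    # the outer update can differentiate through the adaptation.
    feats = encoder(x_b)
    w = torch.zeros(feats.shape[1], n_blocked,
                    device=feats.device, requires_grad=True)
    b = torch.zeros(n_blocked, device=feats.device, requires_grad=True)
    for _ in range(inner_steps):
        adv_loss = F.cross_entropy(feats @ w + b, y_b)
        gw, gb = torch.autograd.grad(adv_loss, (w, b), create_graph=True)
        w, b = w - inner_lr * gw, b - inner_lr * gb

    # (3) Outer objective: keep the desired-task loss low while raising
    # the adapted adversary's post-fine-tuning loss on the blocked task.
    post_adapt_loss = F.cross_entropy(feats @ w + b, y_b)
    optimizer.zero_grad()
    (desired_loss - lam * post_adapt_loss).backward()
    optimizer.step()
    return desired_loss.item(), post_adapt_loss.item()
```

Maximizing the adversary's post-adaptation loss without a cap can destabilize training; this sketch is intended only to show how meta-learning (differentiating through the simulated fine-tuning) and adversarial learning combine in the task-blocking paradigm.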


Cited By

  • (2024) Mapping the individual, social and biospheric impacts of Foundation Models. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 776–796. https://doi.org/10.1145/3630106.3658939. Online publication date: 3-Jun-2024.
  • (2024) PIGEON: Predicting Image Geolocations. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12893–12902. https://doi.org/10.1109/CVPR52733.2024.01225. Online publication date: 16-Jun-2024.
  • (2024) Recent Developments on Accountability and Explainability for Complex Reasoning Tasks. Accountable and Explainable Methods for Complex Reasoning over Text, 191–199. https://doi.org/10.1007/978-3-031-51518-7_9. Online publication date: 6-Apr-2024.


Published In

AIES '23: Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society
August 2023
1026 pages
ISBN: 9798400702310
DOI: 10.1145/3600211
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Badges

  • Honorable Mention

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

AIES '23: AAAI/ACM Conference on AI, Ethics, and Society
August 8 - 10, 2023
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 61 of 162 submissions, 38%


Article Metrics

  • Downloads (last 12 months): 268
  • Downloads (last 6 weeks): 10
Reflects downloads up to 03 Oct 2024

