DOI: 10.1145/3630106.3659005
Research article · Open access

Rethinking open source generative AI: open-washing and the EU AI Act

Published: 05 June 2024

Abstract

The past year has seen a steep rise in generative AI systems that claim to be open. But how open are they really? The question of what counts as open source in generative AI is poised to take on particular importance in light of the upcoming EU AI Act that regulates open source systems differently, creating an urgent need for practical openness assessment. Here we use an evidence-based framework that distinguishes 14 dimensions of openness, from training datasets to scientific and technical documentation and from licensing to access methods. Surveying over 45 generative AI systems (both text and text-to-image), we find that while the term open source is widely used, many models are ‘open weight’ at best and many providers seek to evade scientific, legal and regulatory scrutiny by withholding information on training and fine-tuning data. We argue that openness in generative AI is necessarily composite (consisting of multiple elements) and gradient (coming in degrees), and point out the risk of relying on single features like access or licensing to declare models open or not. Evidence-based openness assessment can help foster a generative AI landscape in which models can be effectively regulated, model providers can be held accountable, scientists can scrutinise generative AI, and end users can make informed decisions.
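The abstract's claim that openness is composite (multiple elements) and gradient (coming in degrees) can be illustrated with a minimal sketch. This is not the authors' assessment tooling; the dimension names and three-level rating scale below are hypothetical examples loosely following the dimensions the abstract names (training datasets, documentation, licensing, access), while the actual framework distinguishes 14 dimensions.

```python
# Sketch of composite, gradient openness scoring: each dimension is
# rated open/partial/closed, and the overall score averages them
# rather than collapsing to a binary open/closed label.
RATING = {"open": 1.0, "partial": 0.5, "closed": 0.0}

def openness_score(assessment: dict[str, str]) -> float:
    """Average per-dimension ratings into a composite score in [0, 1]."""
    return sum(RATING[v] for v in assessment.values()) / len(assessment)

# Hypothetical 'open weight' model: weights released, training data withheld.
example = {
    "training_data": "closed",
    "model_weights": "open",
    "scientific_documentation": "partial",
    "license": "open",
    "api_access": "partial",
}
print(round(openness_score(example), 2))  # → 0.6
```

The point of the gradient is visible here: relying on a single feature (say, the license being "open") would declare this model open, while the composite score shows it is only partially so.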


Cited By

  • (2024) The AI community building the future? A quantitative analysis of development activity on Hugging Face Hub. Journal of Computational Social Science. https://doi.org/10.1007/s42001-024-00300-8. Online publication date: 24 June 2024.

Published In

FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
June 2024
2580 pages
ISBN: 9798400704505
DOI: 10.1145/3630106
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Technology assessment
  2. large language models
  3. text generators
  4. text-to-image generators

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

FAccT '24

Article Metrics

  • Downloads (Last 12 months)9,971
  • Downloads (Last 6 weeks)2,423
Reflects downloads up to 01 Sep 2024
