DOI: 10.1145/3643991.3644907
Research article | Open access

PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software

Published: 02 July 2024

Abstract

The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse.
This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset's comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model's training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing trends in PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities involving PTMs, their downstream usage, and cross-cutting questions.
Our artifact is available at https://github.com/PurdueDualityLab/PeaTMOSS-Artifact. Our dataset is available at https://transfer.rcac.purdue.edu/file-manager?origin_id=ff978999-16c2-4b50-ac7a-947ffdc3eb1d&origin_path=%2F.
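
For readers who want to explore the released data, the following sketches illustrate two natural uses of PeaTMOSS. Both are minimal sketches under stated assumptions, not the authors' actual tooling.

First, querying the PTM-to-repository mappings. The file name PeaTMOSS.db and the table and column names (model, model_to_repository) are hypothetical; consult the artifact repository above for the published schema.

```python
# Minimal sketch, assuming the dataset ships as a SQLite database named
# PeaTMOSS.db with hypothetical tables "model" and "model_to_repository"
# linking PTMs to the GitHub repositories that reuse them.
import sqlite3

conn = sqlite3.connect("PeaTMOSS.db")
query = """
SELECT m.name, COUNT(DISTINCT mr.repository_id) AS dependent_repos
FROM model AS m
JOIN model_to_repository AS mr ON mr.model_id = m.id
GROUP BY m.name
ORDER BY dependent_repos DESC
LIMIT 10;
"""
for name, dependents in conn.execute(query):
    print(f"{name}: reused by {dependents} GitHub repositories")
conn.close()
```

Second, the prompt-based metadata extraction described in the abstract could look roughly like the snippet below. The prompt wording, the gpt-4o-mini model choice, and the use of the openai client are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch of extracting structured metadata from a model card with an
# LLM; the prompt text and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_metadata(model_card_text: str) -> str:
    prompt = (
        "From the following model card, extract the training datasets, the "
        "number of parameters, and the reported evaluation metrics. "
        "Answer as JSON with keys: datasets, parameters, metrics.\n\n"
        + model_card_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```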


Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024, 788 pages
ISBN: 9798400705878
DOI: 10.1145/3643991
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. datasets
  2. machine learning
  3. deep neural networks
  4. model zoos
  5. package registries
  6. open-source
  7. empirical software engineering

Qualifiers

  • Research-article

Funding Sources

  • NSF (National Science Foundation)
  • Faculty Research Participation Program at Argonne National Laboratory
  • NSERC (Natural Sciences and Engineering Research Council of Canada)
  • U.S. DOE Office of Science-Advanced Scientific Computing Research Program

Conference

MSR '24


Cited By

  • (2024) Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2087-2098. https://doi.org/10.1145/3691620.3695271. Online publication date: 27-Oct-2024.
  • (2024) What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 13-24. https://doi.org/10.1145/3674805.3686665. Online publication date: 24-Oct-2024.
  • (2024) Understanding the Challenges of Data Management in the AI Application Development. 2024 IEEE International Conference on Joint Cloud Computing (JCC), 68-75. https://doi.org/10.1109/JCC62314.2024.00018. Online publication date: 15-Jul-2024.
  • (2024) Challenges and practices of deep learning model reengineering: A case study on computer vision. Empirical Software Engineering 29(6). https://doi.org/10.1007/s10664-024-10521-0. Online publication date: 20-Aug-2024.
