Research article · Open access
DOI: 10.1145/3611643.3613083

Getting pwn’d by AI: Penetration Testing with Large Language Models

Published: 30 November 2023

Abstract

The field of software security testing, and more specifically penetration testing, requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential use of large language models, such as GPT-3.5, to augment penetration testers with AI sparring partners. We explore two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed feedback loop between LLM-generated low-level actions and a vulnerable virtual machine (connected through SSH), allowing the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors, which were then automatically executed within the virtual machine. We discuss promising initial results, detail avenues for improvement, and close by deliberating on the ethics of AI sparring partners.
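The closed feedback loop described in the abstract can be pictured as a short driver script. The following Python sketch is an illustration under assumptions, not the authors' implementation: it pairs paramiko (SSH) with the OpenAI chat API as the LLM backend, and the host address, credentials, model, prompt wording, round limit, and success check are all hypothetical placeholders.

```python
# Minimal sketch of the closed feedback loop from the abstract: the LLM
# proposes one shell command per round, the command runs on the vulnerable
# VM over SSH, and its output is fed back into the next prompt.
# Host, credentials, model, prompt wording, and stop condition are
# hypothetical placeholders, not the authors' actual implementation.
import paramiko
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("192.168.56.101", username="lowpriv", password="trainingpw")  # lab VM only

history = []  # (command, output) pairs shown back to the model

for _ in range(10):  # bound the number of feedback rounds
    transcript = "\n".join(f"$ {cmd}\n{out}" for cmd, out in history)
    prompt = (
        "You are a low-privilege user on a Linux machine. Based on the "
        "session so far, suggest exactly one shell command to help "
        "escalate privileges. Reply with the command only.\n\n" + transcript
    )
    resp = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    command = resp.choices[0].message.content.strip()

    _, stdout, stderr = ssh.exec_command(command, timeout=30)
    output = stdout.read().decode(errors="replace") + stderr.read().decode(errors="replace")
    history.append((command, output))
    print(f"$ {command}\n{output}")

    if "uid=0(root)" in output:  # crude success check, e.g. output of `id`
        break

ssh.close()
```

Because the target in the paper is a deliberately vulnerable training VM, executing model-suggested commands automatically is acceptable there; against any real system, a pattern like this would need human review of every command before execution.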




Information

Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023, 2215 pages
ISBN: 9798400703270
DOI: 10.1145/3611643
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 30 November 2023


    Author Tags

    1. large language models
    2. penetration testing
    3. security testing

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '23

    Acceptance Rates

Overall Acceptance Rate: 112 of 543 submissions, 21%


Cited By

• (2024) Challenges and Limitations of Using LLMs in Software Security. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 439–464. DOI: 10.4018/979-8-3693-9311-6.ch012. Online publication date: 18-Oct-2024.
• (2024) Comparative Analysis of LLMs vs. Traditional Methods in Vulnerability Detection. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 335–374. DOI: 10.4018/979-8-3693-9311-6.ch009. Online publication date: 18-Oct-2024.
• (2024) Integration of LLMs With Traditional Security Tools. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 295–334. DOI: 10.4018/979-8-3693-9311-6.ch008. Online publication date: 18-Oct-2024.
• (2024) Ethical Considerations in the Use of LLMs for Vulnerability Detection. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 263–294. DOI: 10.4018/979-8-3693-9311-6.ch007. Online publication date: 18-Oct-2024.
• (2024) Performance Evaluation of LLM-Based Security Systems. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 131–166. DOI: 10.4018/979-8-3693-9311-6.ch005. Online publication date: 18-Oct-2024.
• (2024) Techniques and Approaches for Leveraging LLMs in Security Analysis. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 75–104. DOI: 10.4018/979-8-3693-9311-6.ch003. Online publication date: 18-Oct-2024.
• (2024) Foundations of Large Language Models in Software Vulnerability Detection. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 41–74. DOI: 10.4018/979-8-3693-9311-6.ch002. Online publication date: 18-Oct-2024.
• (2024) Harnessing the Power of Large Language Models for Cybersecurity. In Application of Large Language Models (LLMs) for Software Vulnerability Detection, 1–40. DOI: 10.4018/979-8-3693-9311-6.ch001. Online publication date: 18-Oct-2024.
• (2024) CIPHER: Cybersecurity Intelligent Penetration-Testing Helper for Ethical Researcher. Sensors, 24(21), 6878. DOI: 10.3390/s24216878. Online publication date: 26-Oct-2024.
• (2024) Digital Sentinels and Antagonists: The Dual Nature of Chatbots in Cybersecurity. Information, 15(8), 443. DOI: 10.3390/info15080443. Online publication date: 29-Jul-2024.
