research-article

Using LLM Embeddings with Similarity Search for Botnet TLS Certificate Detection

Authors: Kumar Shashwat, Xinming Ou

AISec '24: Proceedings of the 2024 Workshop on Artificial Intelligence and Security

Pages 173 - 183

https://doi.org/10.1145/3689932.3694766

Published: 22 November 2024

Abstract

Modern botnets use TLS encryption to mask communications with their command-and-control (C&C) servers. The TLS certificates that botnets use may exhibit subtle characteristics that facilitate detection. In this paper, we investigate whether text features from TLS certificates can be represented by LLM text embeddings, from both open-source models and third-party vendors, in a projected vector space, with the goal of building a classifier that detects botnet certificates. Our method extracts informative features from each certificate and generates vector representations of them, creating a projected space that can be queried with test certificates via similarity search. Using a balanced dataset consisting of the publicly available SSLBL botnet certificates and TLS certificates used by popular websites, our evaluations show that C-BERT, an open-source model, is the preferred choice within our proposed system over a vendor solution. C-BERT achieves a competitive F1 score of 0.994 on unseen test data and 97.9% accuracy on data gathered several months after the initial projected embedding space was created, and it maintains performance in a simulated zero-day evaluation against four C&C groups, with an average F1 score of 0.946. Further evaluation on a random sample of 150,000 real-world certificates, collected from a full internet scan between January 2024 and May 2024, flags 13 potential botnet certificates, one of which VirusTotal confirmed to be malicious. To compare with the scenario in which no such tool exists, we randomly selected 1,300 of these 150,000 certificates and ran them through VirusTotal; none were confirmed to be malicious. Finding one confirmed certificate thus required checking 13 flagged certificates, whereas checking 1,300 random ones (100 times as many) found none, which translates to a 100-fold reduction in the effort needed to identify botnet certificates in the wild.
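The query step described in the abstract (embed certificate text, store the labelled vectors, then classify a test certificate by similarity search) can be illustrated with a minimal sketch. This is not the authors' implementation. The `embed` function is a hypothetical stand-in that hashes character trigrams and takes the place of an LLM embedding model such as C-BERT. The certificate strings and their labels are invented for illustration, and the k-nearest-neighbour majority vote is an assumed decision rule, since the abstract does not specify one.

```python
import hashlib
import numpy as np

DIM = 256  # toy embedding size (assumption); real LLM embeddings differ


def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an LLM embedding: hashed character trigrams."""
    vec = np.zeros(DIM)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec  # unit length for cosine similarity


def build_index(labelled_texts):
    """Embed (text, label) pairs into a matrix of vectors and a label array."""
    vectors = np.stack([embed(text) for text, _ in labelled_texts])
    labels = np.array([label for _, label in labelled_texts])
    return vectors, labels


def classify(text, vectors, labels, k=3):
    """Majority vote among the k indexed vectors most similar to the query."""
    sims = vectors @ embed(text)      # dot product of unit vectors = cosine
    top = np.argsort(-sims)[:k]       # indices of the k highest similarities
    values, counts = np.unique(labels[top], return_counts=True)
    return str(values[np.argmax(counts)])


# Invented certificate subject strings with made-up labels (illustration only).
TOY_CERTS = [
    ("CN=www.example.com, O=Example Inc, L=San Francisco, C=US", "benign"),
    ("CN=shop.example.org, O=Example Retail Ltd, L=London, C=GB", "benign"),
    ("CN=localhost, O=Internet Widgits Pty Ltd, ST=Some-State, C=AU", "botnet"),
    ("CN=asdkjqwe.example.net, O=Default Company Ltd, L=Default City, C=XX", "botnet"),
]

vectors, labels = build_index(TOY_CERTS)
```

Calling `classify(subject_text, vectors, labels, k=1)` returns the label of the single nearest indexed certificate. At the scale the abstract mentions (150,000 certificates), the brute-force dot product would typically be replaced by an approximate nearest-neighbour index.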



Published In

AISec '24: Proceedings of the 2024 Workshop on Artificial Intelligence and Security

November 2024

225 pages

ISBN: 9798400712289

DOI: 10.1145/3689932

Program Chairs: Maura Pintor (University of Cagliari), Xinyun Chen (Google DeepMind), Matthew Jagielski (Google DeepMind)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

CCS '24: ACM SIGSAC Conference on Computer and Communications Security

Sponsor: SIGSAC

October 14 - 18, 2024

Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 94 of 231 submissions, 41%
