Nebula: Self-Attention for Dynamic Malware Analysis

Authors: Dmitrijs Trizna, Battista Biggio, Fabio Roli

IEEE Transactions on Information Forensics and Security, Volume 19

Pages 6155 - 6167

https://doi.org/10.1109/TIFS.2024.3409083

Published: 06 June 2024

Abstract

Dynamic analysis enables the detection of Windows malware by executing programs in a controlled environment and logging their actions. Previous work has proposed training machine learning models, i.e., convolutional and long short-term memory networks, on homogeneous input features such as runtime APIs to either detect or classify malware, neglecting other relevant information from heterogeneous data such as network and file operations. To overcome these issues, we introduce Nebula, a versatile, self-attention Transformer-based neural architecture that generalizes across different behavioral representations and formats, combining diverse information from dynamic log reports. Nebula is composed of several components needed to tokenize, filter, normalize, and encode data to feed the Transformer architecture. We first perform a comprehensive ablation study to evaluate the impact of these components on the performance of the whole system, highlighting which components can be used as-is and which must be enriched with specific domain knowledge. We then perform extensive experiments on both malware detection and classification tasks, using three datasets acquired from different dynamic analysis platforms, and show that, on average, Nebula outperforms state-of-the-art models at low false positive rates, with a peak improvement of 12%. Moreover, we show that self-supervised pre-training matches the performance of fully supervised models with only 20% of the training data, and we inspect the output of Nebula through explainable AI techniques, pinpointing how attention focuses on specific tokens correlated with the malicious activities of malware families. To foster reproducibility, we open-source our findings and models at https://github.com/dtrizna/nebula.
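The abstract reports results "at low false positive rates". As a minimal sketch of what such a metric could look like, the Python snippet below computes the best detection rate (true positive rate) achievable while keeping the false positive rate under a chosen budget. The function name `tpr_at_fpr`, the toy scores, and the label convention (1 = malicious, 0 = benign) are assumptions made for illustration; they are not taken from the paper or from the Nebula repository.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate achievable with false positive rate <= max_fpr.

    Hypothetical helper: scores are model outputs (higher = more suspicious),
    labels use 1 for malicious and 0 for benign samples.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples we may flag without exceeding the budget.
    # int() truncates, so rounding errors can only make the result stricter.
    allowed = int(max_fpr * len(benign))

    # A sample is flagged when its score is strictly above the threshold.
    # Using the (allowed+1)-th highest benign score as the threshold means
    # at most `allowed` benign samples are flagged.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")

    detected = sum(1 for s in malicious if s > threshold)
    return detected / len(malicious)


# Toy data, invented for this example only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.25))  # 0.75
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0))   # 0.5
```

On the toy data, allowing one of the four benign samples to be flagged yields a detection rate of 0.75, while allowing none yields 0.5. This illustrates why comparisons at a fixed, low false positive rate can separate models that look similar on aggregate metrics.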



ISSN:1556-6013


1556-6021 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Press
