Abstract
In this chapter, we review sequence-based machine learning methods used for malware detection and classification. We start with the data types extracted from code: static features and dynamic traces of program execution. We then review recent research that applies machine learning to opcode and API call sequences, call graphs, system calls, registry changes, and information flow traces, as well as hybrid and raw data, to detect and classify malware. With a focus on metamorphic malware, we discuss Hidden Markov Models (HMMs) and Long Short-Term Memory (LSTM) networks. For each, we describe the input formats, such as one-hot encoding and vector embeddings, the model architecture, the training process, and the output formats. Finally, we discuss commercial and open-source tools used to extract data from software.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Andreopoulos, W.B. (2021). Malware Detection with Sequence-Based Machine Learning and Deep Learning. In: Stamp, M., Alazab, M., Shalaginov, A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_2
DOI: https://doi.org/10.1007/978-3-030-62582-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62581-8
Online ISBN: 978-3-030-62582-5
eBook Packages: Computer Science, Computer Science (R0)