Malware Detection on Highly Imbalanced Data through Sequence Modeling

Authors:

Rajvardhan Oak,

Harshvardhan Takawale,

Idan Amit

AISec'19: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security

Pages 37 - 48

https://doi.org/10.1145/3338501.3357374

Published: 11 November 2019

Abstract

We explore Android malware detection based on dynamic analysis of application activity sequences using deep learning techniques. We show that analyzing a sequence of activities is informative for detecting malware, but that analyzing longer sequences does not necessarily lead to a more accurate model. In real-world settings, malware is rare compared with harmless applications. Our dataset has more than 180,000 samples, two-thirds of which are malware, and is significantly larger than the datasets used in previous studies. To mimic real-world conditions, we randomly sample a small portion of the malware samples. Using the state-of-the-art model BERT, we show that it is possible to achieve the desired malware detection performance on an extremely imbalanced dataset. Our BERT-based model achieves an F1 score of 0.919 when just 0.5% of the examples are malware, which significantly outperforms current state-of-the-art approaches. The results validate the effectiveness of the proposed method on highly imbalanced datasets.
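The subsampling step and the choice of F1 as the metric can be illustrated with a small sketch. It is not the authors' pipeline: the function names `subsample_malware` and `f1_score`, the placeholder sample identifiers, and the 60,000 / 120,000 benign-to-malware split (a rough reading of "more than 180,000 samples, two-thirds of which are malware") are all assumptions made for illustration. The sketch keeps every benign sample, draws just enough malware to reach a target fraction, and then shows why accuracy alone is misleading at a 0.5% malware rate.

```python
import random


def subsample_malware(benign, malware, malware_fraction, seed=0):
    """Keep all benign samples and draw enough malware so that malware
    makes up roughly `malware_fraction` of the combined set."""
    if not 0 < malware_fraction < 1:
        raise ValueError("malware_fraction must be in (0, 1)")
    # Solve m / (m + b) = f for m, where b = len(benign).
    n_malware = round(malware_fraction * len(benign) / (1 - malware_fraction))
    n_malware = min(n_malware, len(malware))
    rng = random.Random(seed)
    return benign, rng.sample(malware, n_malware)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (malware) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-ins for real samples (assumed sizes, not the paper's counts).
benign = [f"benign_{i}" for i in range(60_000)]
malware = [f"malware_{i}" for i in range(120_000)]

kept_benign, kept_malware = subsample_malware(benign, malware, malware_fraction=0.005)
ratio = len(kept_malware) / (len(kept_benign) + len(kept_malware))

# A trivial classifier that labels everything benign (0) looks excellent on
# accuracy but scores zero F1 on the malware class (1).
y_true = [0] * len(kept_benign) + [1] * len(kept_malware)
y_pred = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
trivial_f1 = f1_score(y_true, y_pred)

print(f"malware fraction: {ratio:.4f}")
print(f"all-benign accuracy: {accuracy:.4f}, F1: {trivial_f1:.4f}")
```

With these assumed sizes, the all-benign baseline is correct on more than 99% of samples yet detects no malware, which is why an F1 score on the malware class is the more informative number in this setting.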

Index Terms

Malware Detection on Highly Imbalanced Data through Sequence Modeling
1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation
  2. Systems security
    1. Operating systems security
      1. Mobile platform security

Published In

AISec'19: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security

November 2019

123 pages

ISBN:9781450368339

DOI:10.1145/3338501

General Chairs: Lorenzo Cavallaro (King's College London), Johannes Kinder (Bundeswehr University Munich)

Program Chairs: Sadia Afroz (UC Berkeley), Battista Biggio (University of Cagliari / Pluribus One), Nicholas Carlini (Google Brain), Yuval Elovici (Ben-Gurion University), Asaf Shabtai (Ben-Gurion University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CCS '19

Sponsor:

SIGSAC

CCS '19: 2019 ACM SIGSAC Conference on Computer and Communications Security

November 15, 2019

London, United Kingdom

Acceptance Rates

Overall Acceptance Rate 94 of 231 submissions, 41%


Bibliometrics

Total Citations: 48
Total Downloads: 3,773
Downloads (Last 12 months): 754
Downloads (Last 6 weeks): 66

Reflects downloads up to 12 Nov 2024

Cited By

Padmini A, Yogeshwari M (2024). A Survey on Exploring the Relationship Between Music and Mental Health Using Machine Learning Analysis. Cross-Industry AI Applications, pp. 304-318. Online publication date: 31-May-2024.
https://doi.org/10.4018/979-8-3693-5951-8.ch019
Singhal S, Kotagiri A, Samayamantri L, Rajest S (2024). Interpretable Machine Learning Models for Human Action and Emotion Deciphering. Advancing Intelligent Networks Through Distributed Optimization, pp. 449-468. Online publication date: 30-Aug-2024.
https://doi.org/10.4018/979-8-3693-3739-4.ch023
Godbole M, Bammidi T, Kumar Vadlamudi A (2024). A Visualization Approach for Analyzing Decision-Making in Human-Robot Interactions. Explainable AI Applications for Human Behavior Analysis, pp. 290-307. Online publication date: 1-Mar-2024.
https://doi.org/10.4018/979-8-3693-1355-8.ch018
Deva Kirubai J, Priscila S (2024). Artificial Neural Network-Based Efficient Cyber Hacking Detection System Using Deep Learning Approaches. Explainable AI Applications for Human Behavior Analysis, pp. 205-223. Online publication date: 1-Mar-2024.
https://doi.org/10.4018/979-8-3693-1355-8.ch013
Hema L, Kumar Dwibedi R, Regilan S, K. D, Vinay Kumar K, Satish Kumar S (2024). Analysis of Cyber Attack on Processor Architecture Through Exploiting Vulnerabilities. Explainable AI Applications for Human Behavior Analysis, pp. 189-204. Online publication date: 1-Mar-2024.
https://doi.org/10.4018/979-8-3693-1355-8.ch012
E. C, Palaniswamy K (2024). Comparative Analysis of Belief Propagation and Layered Decoding Algorithms for LDPC Codes. Explainable AI Applications for Human Behavior Analysis, pp. 64-86. Online publication date: 1-Mar-2024.
https://doi.org/10.4018/979-8-3693-1355-8.ch005
Settibathini V, Virmani A, Kuppam M, S. N, Manikandan S, C. E (2024). Shedding Light on Dataset Influence for More Transparent Machine Learning. Explainable AI Applications for Human Behavior Analysis, pp. 33-48. Online publication date: 1-Mar-2024.
https://doi.org/10.4018/979-8-3693-1355-8.ch003
Coscia A, Iannacone A, Maci A, Stamerra A (2024). SINNER: A Reward-Sensitive Algorithm for Imbalanced Malware Classification Using Neural Networks with Experience Replay. Information 15:8 (425). Online publication date: 23-Jul-2024.
https://doi.org/10.3390/info15080425
Eren M, Barron R, Bhattarai M, Wanna S, Solovyev N, Rasmussen K, Alexandrov B, Nicholas C (2024). Catch'em all: Classification of Rare, Prominent, and Novel Malware Families. 2024 12th International Symposium on Digital Forensics and Security (ISDFS), pp. 1-6. Online publication date: 29-Apr-2024.
https://doi.org/10.1109/ISDFS60797.2024.10527250
Liu Z (2024). A Review of Advancements and Applications of Pre-Trained Language Models in Cybersecurity. 2024 12th International Symposium on Digital Forensics and Security (ISDFS), pp. 1-10. Online publication date: 29-Apr-2024.
https://doi.org/10.1109/ISDFS60797.2024.10527236
