DOI: 10.1145/3338501.3357374
research-article
Open access

Malware Detection on Highly Imbalanced Data through Sequence Modeling

Published: 11 November 2019

Abstract

We explore the task of Android malware detection based on dynamic analysis of application activity sequences using deep learning techniques. We show that analyzing a sequence of activities is informative for detecting malware, but that analyzing longer sequences does not necessarily lead to a more accurate model. In real-world scenarios, the number of malware samples is low compared to that of harmless applications. Our dataset has more than 180,000 samples, two-thirds of which are malware. This dataset is significantly larger than the datasets used in previous studies. Since malware is over-represented in our dataset, we mimic real-world cases by randomly sampling a small portion of the malware samples. Using the state-of-the-art model BERT, we show that it is possible to achieve the desired malware detection performance with an extremely unbalanced dataset. We find that our BERT-based model achieves an F1 score of 0.919 with just 0.5% of the examples being malware, which significantly outperforms current state-of-the-art approaches. The results validate the effectiveness of our proposed method in dealing with highly imbalanced datasets.
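
The imbalanced setting described in the abstract can be illustrated with a minimal sketch. It assumes that every benign sample is kept and that malware is randomly subsampled until it makes up a target fraction of the data. The function names, sample counts, and seed below are hypothetical and are not taken from the paper, whose exact sampling procedure is not stated on this page. The sketch also computes the F1 score by hand to show why it is a more telling metric than accuracy when only 0.5% of the examples are malware.

```python
import random


def subsample_to_ratio(benign, malware, malware_ratio, seed=0):
    """Keep all benign samples and randomly keep just enough malware
    so that malware makes up roughly `malware_ratio` of the result.

    Hypothetical helper for illustration; not the paper's procedure.
    """
    if not 0.0 < malware_ratio < 1.0:
        raise ValueError("malware_ratio must be strictly between 0 and 1")
    # Solve m / (m + len(benign)) = malware_ratio for m.
    target = round(malware_ratio * len(benign) / (1.0 - malware_ratio))
    # Never ask for more malware than exists, and keep at least one sample.
    target = max(1, min(target, len(malware)))
    rng = random.Random(seed)
    return list(benign), rng.sample(list(malware), target)


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not the paper's data): malware is the majority
# class before subsampling, mirroring the situation the abstract describes.
benign = [("benign", i) for i in range(1990)]
malware = [("malware", i) for i in range(4000)]

kept_benign, kept_malware = subsample_to_ratio(benign, malware, 0.005)
ratio = len(kept_malware) / (len(kept_benign) + len(kept_malware))

# A trivial classifier that labels everything benign (0) looks accurate
# on such data but detects no malware, which F1 exposes.
y_true = [0] * len(kept_benign) + [1] * len(kept_malware)
y_all_benign = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_all_benign)) / len(y_true)
print(f"malware ratio: {ratio:.4f}")
print(f"all-benign accuracy: {accuracy:.4f}")
print(f"all-benign F1: {f1_score(y_true, y_all_benign):.4f}")
```

Under these assumed counts, predicting "benign" for every sample yields very high accuracy while its F1 score is zero, which is why F1 is the natural metric to report for a 0.5% malware rate.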

Published In

AISec'19: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security
November 2019
123 pages
ISBN: 9781450368339
DOI: 10.1145/3338501
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. imbalanced data
  2. malware detection
  3. sequence modeling

Qualifiers

  • Research-article

Conference

CCS '19

Acceptance Rates

Overall Acceptance Rate 94 of 231 submissions, 41%

Article Metrics

  • Downloads (last 12 months): 754
  • Downloads (last 6 weeks): 66
Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) A Survey on Exploring the Relationship Between Music and Mental Health Using Machine Learning Analysis. Cross-Industry AI Applications. DOI: 10.4018/979-8-3693-5951-8.ch019, pp. 304-318. Online publication date: 31-May-2024
  • (2024) Interpretable Machine Learning Models for Human Action and Emotion Deciphering. Advancing Intelligent Networks Through Distributed Optimization. DOI: 10.4018/979-8-3693-3739-4.ch023, pp. 449-468. Online publication date: 30-Aug-2024
  • (2024) A Visualization Approach for Analyzing Decision-Making in Human-Robot Interactions. Explainable AI Applications for Human Behavior Analysis. DOI: 10.4018/979-8-3693-1355-8.ch018, pp. 290-307. Online publication date: 1-Mar-2024
  • (2024) Artificial Neural Network-Based Efficient Cyber Hacking Detection System Using Deep Learning Approaches. Explainable AI Applications for Human Behavior Analysis. DOI: 10.4018/979-8-3693-1355-8.ch013, pp. 205-223. Online publication date: 1-Mar-2024
  • (2024) Analysis of Cyber Attack on Processor Architecture Through Exploiting Vulnerabilities. Explainable AI Applications for Human Behavior Analysis. DOI: 10.4018/979-8-3693-1355-8.ch012, pp. 189-204. Online publication date: 1-Mar-2024
  • (2024) Comparative Analysis of Belief Propagation and Layered Decoding Algorithms for LDPC Codes. Explainable AI Applications for Human Behavior Analysis. DOI: 10.4018/979-8-3693-1355-8.ch005, pp. 64-86. Online publication date: 1-Mar-2024
  • (2024) Shedding Light on Dataset Influence for More Transparent Machine Learning. Explainable AI Applications for Human Behavior Analysis. DOI: 10.4018/979-8-3693-1355-8.ch003, pp. 33-48. Online publication date: 1-Mar-2024
  • (2024) SINNER: A Reward-Sensitive Algorithm for Imbalanced Malware Classification Using Neural Networks with Experience Replay. Information 15:8, article 425. DOI: 10.3390/info15080425. Online publication date: 23-Jul-2024
  • (2024) Catch'em all: Classification of Rare, Prominent, and Novel Malware Families. 2024 12th International Symposium on Digital Forensics and Security (ISDFS). DOI: 10.1109/ISDFS60797.2024.10527250, pp. 1-6. Online publication date: 29-Apr-2024
  • (2024) A Review of Advancements and Applications of Pre-Trained Language Models in Cybersecurity. 2024 12th International Symposium on Digital Forensics and Security (ISDFS). DOI: 10.1109/ISDFS60797.2024.10527236, pp. 1-10. Online publication date: 29-Apr-2024
