DOI: 10.1145/3468264.3473925
research-article

A comprehensive study on learning-based PE malware family classification methods

Published: 18 August 2021

Abstract

Driven by high profits, Portable Executable (PE) malware has been consistently evolving in both volume and sophistication. PE malware family classification has gained great attention, and a large number of approaches have been proposed. With the rapid development of machine learning techniques and the strong results they have achieved on various tasks, machine learning algorithms have also gained popularity in the PE malware family classification task. The three mainstream learning-based approaches, categorized by the input format they take, are image-based, binary-based, and disassembly-based. Although a large number of approaches have been published, there are no consistent comparisons of those approaches, especially from the perspective of practical industry adoption. Moreover, there is no comparison under concept drift, which is a fact of the malware classification task because malware evolves quickly. In this work, we conduct a thorough empirical study of learning-based PE malware classification approaches on four different datasets under consistent experimental settings. Based on the experimental results and an interview with our industry partners, we find that (1) no individual class of methods significantly outperforms the others; (2) all classes of methods show performance degradation under concept drift (the F1-score drops by 32.23% on average); and (3) prediction time and high memory consumption hinder existing approaches from being adopted in industry.
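
As a rough illustration of finding (2), the sketch below shows one way a concept-drift degradation could be quantified: compute a macro-averaged F1-score on a test set from the same period as the training data, compute it again on a later test set, and take the difference. The function names, family labels, and predictions are hypothetical toy values, not data or code from the study, and the abstract does not state which F1 averaging scheme the authors used, so macro averaging is an assumption here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1_drop(pre_drift, post_drift):
    """Difference in macro F1 between a pre-drift and a post-drift test set.

    Each argument is a (y_true, y_pred) pair.
    """
    return macro_f1(*pre_drift) - macro_f1(*post_drift)


# Hypothetical toy labels: the same model evaluated on older and newer samples.
pre = (["family_a", "family_a", "family_b", "family_b"],
       ["family_a", "family_a", "family_b", "family_b"])
post = (["family_a", "family_a", "family_b", "family_b"],
        ["family_a", "family_b", "family_b", "family_b"])

drop = f1_drop(pre, post)
print(f"macro F1 drop under drift: {drop:.4f}")
```

A positive value means the model scores worse on the later samples; averaging this difference over several models and datasets gives an aggregate figure of the kind the abstract reports.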


    Published In

    ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    August 2021
    1690 pages
    ISBN:9781450385626
    DOI:10.1145/3468264

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Concept Drift
    2. Deep Learning
    3. Malware Classification

    Funding Sources

    • The NSFC-General Technology Basic Research Joint Funds under Grant
    • The NSFC Youth Funds under Grant
    • The State Key Laboratory of Communication Content Cognition fund
    • The National Key Research and Development Program of China

    Conference

    ESEC/FSE '21

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%

    Cited By

    • (2025) FCG-MFD: Benchmark function call graph-based dataset for malware family detection. Journal of Network and Computer Applications, 233 (104050). https://doi.org/10.1016/j.jnca.2024.104050. Online publication date: Jan-2025
    • (2024) MalSensor: Fast and Robust Windows Malware Classification. ACM Transactions on Software Engineering and Methodology, 34:1 (1-28). https://doi.org/10.1145/3688833. Online publication date: 24-Aug-2024
    • (2024) CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking. Proceedings of the ACM on Software Engineering, 1:FSE (562-585). https://doi.org/10.1145/3643752. Online publication date: 12-Jul-2024
    • (2024) A Post-training Framework for Improving the Performance of Deep Learning Models via Model Transformation. ACM Transactions on Software Engineering and Methodology, 33:3 (1-41). https://doi.org/10.1145/3630011. Online publication date: 15-Mar-2024
    • (2024) Android Malware Family Clustering Based on Multiple Features. IEEE Transactions on Reliability, 73:2 (1202-1215). https://doi.org/10.1109/TR.2023.3332090. Online publication date: Jun-2024
    • (2024) MIGAN. Expert Systems with Applications: An International Journal, 241:C. https://doi.org/10.1016/j.eswa.2023.122678. Online publication date: 25-Jun-2024
    • (2024) Neural Network Innovations in Image-Based Malware Classification: A Comparative Study. Advanced Information Networking and Applications (252-265). https://doi.org/10.1007/978-3-031-57916-5_22. Online publication date: 9-Apr-2024
    • (2023) Imbalanced Malware Family Classification Using Multimodal Fusion and Weight Self-Learning. IEEE Transactions on Intelligent Transportation Systems, 24:7 (7642-7652). https://doi.org/10.1109/TITS.2022.3208891. Online publication date: 1-Jul-2023
    • (2023) An Empirical Framework for Malware Prediction Using Multi-Layer Perceptron. 2023 OITS International Conference on Information Technology (OCIT) (485-490). https://doi.org/10.1109/OCIT59427.2023.10430935. Online publication date: 13-Dec-2023
    • (2023) A Cost-Efficient Threat Intelligence Platform Powered by Crowdsourced OSINT. 2023 IEEE International Conference on Cyber Security and Resilience (CSR) (48-53). https://doi.org/10.1109/CSR57506.2023.10225008. Online publication date: 31-Jul-2023
