Malicious Package Detection using Metadata Information

Authors: Michael Bewong, Arash Mahboubi, Md Rafiqul Islam, Md Zahid Islam, Muhammad Ejaz Ahmed, Gowri Sankar Ramachandran, and Muhammad Ali Babar

WWW '24: Proceedings of the ACM on Web Conference 2024

May 2024

Pages 1779 - 1789

https://doi.org/10.1145/3589334.3645543

Published: 13 May 2024

Abstract

Protecting software supply chains from malicious packages is paramount in the evolving landscape of software development. In a software supply chain attack, attackers inject harmful code into commonly used packages or libraries in a software repository. For instance, JavaScript uses the Node Package Manager (NPM) and Python uses the Python Package Index (PyPI) as their respective package repositories. NPM has suffered incidents such as the event-stream attack, in which a malicious dependency was introduced into a popular NPM package, potentially impacting a wide range of projects. As the integration of third-party packages becomes increasingly ubiquitous in modern software development, accelerating the creation and deployment of applications, the need for a robust detection mechanism has become critical. At the same time, the sheer volume of new packages released daily makes identifying malicious packages a significant challenge. To address this issue, we introduce MeMPtec, a metadata-based malicious package detection model. The model extracts a set of features from package metadata. Each extracted feature is classified as either easy-to-manipulate (ETM) or difficult-to-manipulate (DTM) based on monotonicity and restricted control properties. By utilising these metadata features, we not only improve the effectiveness of detecting malicious packages but also demonstrate the model's resistance to adversarial attacks compared with existing state-of-the-art approaches. Our experiments indicate a significant reduction in both false positives (up to 97.56%) and false negatives (up to 91.86%).
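The ETM/DTM split described in the abstract can be illustrated with a minimal sketch. This is not the MeMPtec implementation: the metadata fields (`created`, `description`, `repository`), the feature names, and the rule that a feature counts as DTM when it has at least one of the two properties are all assumptions made for illustration, since the abstract does not list the actual features or the exact classification rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Feature:
    """One metadata-derived feature (all names here are hypothetical)."""
    name: str
    value: float
    monotonic: bool           # value can only move in one direction over time
    restricted_control: bool  # value is set by the registry, not the publisher

    @property
    def kind(self) -> str:
        # Assumption: either property is enough to make a feature
        # difficult-to-manipulate; the paper's exact rule may differ.
        if self.monotonic or self.restricted_control:
            return "DTM"
        return "ETM"


def extract_features(meta: dict, now: datetime) -> List[Feature]:
    """Turn a package metadata record into tagged features.

    `meta` is a hypothetical record with an ISO 8601 `created` timestamp
    and optional `description` and `repository` fields.
    """
    created = datetime.fromisoformat(meta["created"])
    age_days = (now - created).days
    return [
        # Free-text fields are fully under the publisher's control.
        Feature("description_length", float(len(meta.get("description", ""))), False, False),
        Feature("has_repository", float(bool(meta.get("repository"))), False, False),
        # Age only grows, and the creation time is recorded by the registry.
        Feature("package_age_days", float(age_days), True, True),
    ]


if __name__ == "__main__":
    example = {"created": "2023-01-01T00:00:00+00:00", "description": "demo"}
    reference_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for feature in extract_features(example, reference_time):
        print(feature.name, feature.value, feature.kind)
```

Tagging each feature at extraction time keeps the two groups separable downstream, so a classifier could be trained or evaluated on DTM features alone when robustness to manipulated metadata matters.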



Index Terms

1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation
  2. Software and application security
    1. Software security engineering


Published In

WWW '24: Proceedings of the ACM on Web Conference 2024

May 2024

4826 pages

ISBN:9798400701719

DOI:10.1145/3589334

General Chairs: Tat-Seng Chua (National University of Singapore), Chong-Wah Ngo (Singapore Management University)

Proceedings Chair: Roy Ka-Wei Lee (Singapore University of Technology and Design)

Program Chairs: Ravi Kumar (Google), Hady W. Lauw (Singapore Management University)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Data Availability

Supplemental video https://dl.acm.org/doi/10.1145/3589334.3645543#rfp1343.mp4

Funding Sources

Cyber Security Research Centre Limited

Conference

WWW '24

Sponsor:

SIGWEB

WWW '24: The ACM Web Conference 2024

May 13 - 17, 2024

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
