DOI: 10.1145/3589334.3645543
research-article
Open access

Malicious Package Detection using Metadata Information

Published: 13 May 2024

    Abstract

    Protecting software supply chains from malicious packages is paramount in the evolving landscape of software development. In a software supply chain attack, attackers inject harmful code into commonly used packages or libraries in a software repository. For instance, JavaScript uses Node Package Manager (NPM), and Python uses the Python Package Index (PyPI) as their respective package repositories. NPM has seen incidents such as the event-stream attack, in which malicious code was introduced into a popular NPM package, potentially impacting a wide range of projects. As the integration of third-party packages becomes increasingly ubiquitous in modern software development, accelerating the creation and deployment of applications, the need for a robust detection mechanism has become critical. At the same time, the sheer volume of new packages released daily makes identifying malicious packages a significant challenge. To address this issue, we introduce a metadata-based malicious package detection model, MeMPtec. This model extracts a set of features from package metadata. The extracted features are classified as either easy-to-manipulate (ETM) or difficult-to-manipulate (DTM) based on monotonicity and restricted control properties. By utilising these metadata features, we not only improve the effectiveness of detecting malicious packages but also demonstrate the model's resistance to adversarial attacks in comparison with existing state-of-the-art approaches. Our experiments indicate a significant reduction in both false positives (up to 97.56%) and false negatives (up to 91.86%).
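
    The ETM/DTM split described in the abstract can be illustrated with a minimal sketch. This is not MeMPtec's implementation. The metadata keys, the feature names, the per-feature property flags, and the rule that a feature is DTM only when both properties hold are all assumptions made for illustration, and the comments on what "monotonic" and "restricted control" mean are a loose reading of the abstract rather than the paper's formal definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Feature:
    name: str
    value: float
    monotonic: bool           # assumed meaning: value only moves in one direction over time
    restricted_control: bool  # assumed meaning: the publisher cannot set the value freely

    @property
    def category(self) -> str:
        # Assumption: a feature is DTM only if both properties hold, otherwise ETM.
        return "DTM" if self.monotonic and self.restricted_control else "ETM"


def extract_features(meta: dict, now: datetime) -> list[Feature]:
    """Derive a few illustrative features from a hypothetical metadata dict."""
    created = datetime.fromisoformat(meta["created"])
    age_days = (now - created).days
    return [
        # Free-text fields can be rewritten at will by whoever publishes the package.
        Feature("description_length", float(len(meta.get("description", ""))), False, False),
        Feature("has_repository_url", float(bool(meta.get("repository"))), False, False),
        # Age only grows and cannot be set by the publisher.
        Feature("package_age_days", float(age_days), True, True),
        # The version count only grows, but a publisher can inflate it quickly.
        Feature("version_count", float(len(meta.get("versions", []))), True, False),
    ]


if __name__ == "__main__":
    example = {
        "created": "2024-01-01T00:00:00+00:00",
        "description": "demo",
        "repository": "",
        "versions": ["1.0.0", "1.0.1"],
    }
    for feature in extract_features(example, datetime(2024, 1, 31, tzinfo=timezone.utc)):
        print(f"{feature.name}: {feature.value} ({feature.category})")
```

    A downstream classifier could then be trained on the DTM subset alone, or on both subsets, to compare how much an adversary gains by editing the ETM fields. That comparison is the kind of robustness check the abstract alludes to, though the details here are hypothetical.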



    Published In

    WWW '24: Proceedings of the ACM on Web Conference 2024
    May 2024
    4826 pages
    ISBN: 9798400701719
    DOI: 10.1145/3589334
    This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. adversarial attacks
    2. feature extractions
    3. malicious detection
    4. npm metadata
    5. software supply chain

    Qualifiers

    • Research-article

    Funding Sources

    • Cyber Security Research Centre Limited

    Conference

    WWW '24
    WWW '24: The ACM Web Conference 2024
    May 13 - 17, 2024
    Singapore, Singapore

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
