Research article · SAC '24 Conference Proceedings
DOI: 10.1145/3605098.3636046

Understanding the Process of Data Labeling in Cybersecurity

Published: 21 May 2024

Abstract

Many domains now leverage the benefits of Machine Learning (ML), which promises solutions that can autonomously learn to solve complex tasks by training on data. Unfortunately, in cyberthreat detection, high-quality data is hard to come by. Moreover, for some applications of ML, such data must be labeled by human operators. Many works assume that labeling in cyberthreat detection is tough, challenging, or costly, and propose solutions to address this hurdle. Yet, we found no work that specifically addresses the process of labeling from the viewpoint of ML security practitioners. This is a problem: to date, it is still mostly unknown how labeling is done in practice---thereby preventing one from pinpointing "what is needed" in the real world.
In this paper, we take the first step to build a bridge between academic research and security practice in the context of data labeling. First, we reach out to five subject matter experts and carry out open interviews to identify pain points in their labeling routines. Then, by using our findings as a scaffold, we conduct a user study with 13 practitioners from large security companies and ask detailed questions on subjects such as active learning, costs of labeling, and revision of labels. Finally, we perform proof-of-concept experiments addressing labeling-related aspects in cyberthreat detection that are sometimes overlooked in research. Altogether, our contributions and recommendations serve as a stepping stone to future endeavors aimed at improving the quality and robustness of ML-driven security systems. We release our resources.
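
For readers unfamiliar with active learning, one of the labeling-cost-reduction techniques covered in the user study, the sketch below shows a generic pool-based active-learning loop with uncertainty sampling. It is a minimal, purely illustrative Python example: the synthetic data, the logistic-regression model, and the labeling budget are arbitrary assumptions, and it does not reproduce the authors' experimental setup.

# Minimal sketch of pool-based active learning with uncertainty sampling.
# Purely illustrative: synthetic data and an off-the-shelf scikit-learn model
# stand in for real security telemetry and a production detector.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Imbalanced synthetic "benign vs. malicious" data (assumed, not from the paper).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Seed the labeled set with a few examples of each class; the rest is "unlabeled".
labeled_idx = (list(rng.choice(np.where(y_pool == 0)[0], size=10, replace=False))
               + list(rng.choice(np.where(y_pool == 1)[0], size=10, replace=False)))
unlabeled_idx = [i for i in range(len(X_pool)) if i not in set(labeled_idx)]

BUDGET_PER_ROUND = 10   # labels an analyst is assumed to provide per round
N_ROUNDS = 10

model = LogisticRegression(max_iter=1000)
for round_ in range(N_ROUNDS):
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])

    # Uncertainty sampling: query the samples whose predicted probability of
    # being malicious is closest to 0.5 (the least confident predictions).
    proba = model.predict_proba(X_pool[unlabeled_idx])[:, 1]
    query = np.argsort(np.abs(proba - 0.5))[:BUDGET_PER_ROUND]

    # "Ask the analyst" for labels; here we simply reveal the ground truth.
    newly_labeled = [unlabeled_idx[i] for i in query]
    labeled_idx.extend(newly_labeled)
    unlabeled_idx = [i for i in unlabeled_idx if i not in set(newly_labeled)]

    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"round {round_:2d}: {len(labeled_idx):4d} labels, test accuracy {acc:.3f}")

In a real workflow, the synthetic pool would be replaced by actual telemetry and the "reveal the ground truth" step by an analyst's verdict, which is precisely where the labeling effort discussed in the paper arises.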


Cited By

  • (2025) Adaptable, incremental, and explainable network intrusion detection systems for internet of things. Engineering Applications of Artificial Intelligence, vol. 144, article 110143. DOI: 10.1016/j.engappai.2025.110143. Online publication date: March 2025.


      Published In

      SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing
      April 2024, 1898 pages
      ISBN: 9798400702433
      DOI: 10.1145/3605098

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 21 May 2024
      Author Tags

      1. labeling
      2. ML
      3. practitioners
      4. user study
      5. cyberthreat detection

      Qualifiers

      • Research-article

      Funding Sources

      • Hilti Corporation

      Conference

      SAC '24

      Acceptance Rates

      Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
