Research article · SAC '24 Conference Proceedings
DOI: 10.1145/3605098.3636046

Understanding the Process of Data Labeling in Cybersecurity

Published: 21 May 2024

Abstract

Many domains now leverage the benefits of Machine Learning (ML), which promises solutions that can autonomously learn to solve complex tasks by training on data. Unfortunately, in cyberthreat detection, high-quality data is hard to come by. Moreover, for some applications of ML, such data must be labeled by human operators. Many works assume that labeling in cyberthreat detection is tough, challenging, or costly, and propose solutions to address this hurdle. Yet, we found no work that specifically addresses the process of labeling from the viewpoint of ML security practitioners. This is a problem: to date, it is still mostly unknown how labeling is done in practice---thereby preventing one from pinpointing "what is needed" in the real world.
In this paper, we take the first step to build a bridge between academic research and security practice in the context of data labeling. First, we reach out to five subject matter experts and carry out open interviews to identify pain points in their labeling routines. Then, by using our findings as a scaffold, we conduct a user study with 13 practitioners from large security companies and ask detailed questions on subjects such as active learning, costs of labeling, and revision of labels. Finally, we perform proof-of-concept experiments addressing labeling-related aspects in cyberthreat detection that are sometimes overlooked in research. Altogether, our contributions and recommendations serve as a stepping stone to future endeavors aimed at improving the quality and robustness of ML-driven security systems. We release our resources.
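
For readers unfamiliar with active learning, one of the labeling-cost-reduction techniques covered in the user study, the sketch below shows a generic pool-based active-learning loop with uncertainty sampling. It is a minimal, purely illustrative Python example: the synthetic data, the logistic-regression model, and the labeling budget are arbitrary assumptions, and it does not reproduce the authors' experimental setup.

# Minimal sketch of pool-based active learning with uncertainty sampling.
# Purely illustrative: synthetic data and an off-the-shelf scikit-learn model
# stand in for real security telemetry and a production detector.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Imbalanced synthetic "benign vs. malicious" data (assumed, not from the paper).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Seed the labeled set with a few examples of each class; the rest is "unlabeled".
labeled_idx = (list(rng.choice(np.where(y_pool == 0)[0], size=10, replace=False))
               + list(rng.choice(np.where(y_pool == 1)[0], size=10, replace=False)))
unlabeled_idx = [i for i in range(len(X_pool)) if i not in set(labeled_idx)]

BUDGET_PER_ROUND = 10   # labels an analyst is assumed to provide per round
N_ROUNDS = 10

model = LogisticRegression(max_iter=1000)
for round_ in range(N_ROUNDS):
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])

    # Uncertainty sampling: query the samples whose predicted probability of
    # being malicious is closest to 0.5 (the least confident predictions).
    proba = model.predict_proba(X_pool[unlabeled_idx])[:, 1]
    query = np.argsort(np.abs(proba - 0.5))[:BUDGET_PER_ROUND]

    # "Ask the analyst" for labels; here we simply reveal the ground truth.
    newly_labeled = [unlabeled_idx[i] for i in query]
    labeled_idx.extend(newly_labeled)
    unlabeled_idx = [i for i in unlabeled_idx if i not in set(newly_labeled)]

    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"round {round_:2d}: {len(labeled_idx):4d} labels, test accuracy {acc:.3f}")

In a real workflow, the synthetic pool would be replaced by actual telemetry and the "reveal the ground truth" step by an analyst's verdict, which is precisely where the labeling effort discussed in the paper arises.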


Cited By

  • (2025) Adaptable, incremental, and explainable network intrusion detection systems for internet of things. Engineering Applications of Artificial Intelligence, vol. 144, article 110143. DOI: 10.1016/j.engappai.2025.110143. Online publication date: March 2025.


      Published In

      SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing
      April 2024, 1898 pages
      ISBN: 9798400702433
      DOI: 10.1145/3605098

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 21 May 2024
      Author Tags

      1. labeling
      2. ML
      3. practitioners
      4. user study
      5. cyberthreat detection

      Qualifiers

      • Research-article

      Funding Sources

      • Hilti Corporation

      Conference

      SAC '24

      Acceptance Rates

      Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
