DOI: 10.1145/3474369.3486867
Research Article

A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels

Published: 15 November 2021

Abstract

In some problem spaces, the high cost of obtaining ground truth labels necessitates the use of lower-quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference labels. We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality. Creating an AGTR requires domain knowledge, and malware family classification is a task with robust domain-knowledge approaches that support the construction of an AGTR. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool, diagnosing overfitting in prior testing and evaluating changes whose impact could not be meaningfully quantified under previous data.
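The intuition behind the metric bounds can be sketched in a few lines of code. The sketch below is an illustrative simplification, not the paper's exact formulation: it assumes an AGTR is a partition in which any two samples sharing a group are guaranteed to belong to the same true class. Under that assumption, pairwise precision computed against the AGTR can never overestimate the true pairwise precision, since every same-AGTR-group pair counted as correct really is correct. The function name and data layout are hypothetical choices for this example.

```python
from itertools import combinations

def pairwise_precision_lower_bound(predicted, agtr):
    """Lower-bound the pairwise precision of a clustering using an
    approximate ground truth refinement (AGTR).

    `predicted` and `agtr` each map a sample id to a cluster/group id.
    Every pair of samples sharing an AGTR group is assumed to share a
    true class, so counting only those pairs as correct yields a value
    that is at most the true pairwise precision.
    """
    same_cluster = 0  # pairs placed in the same predicted cluster
    same_agtr = 0     # ...that also share an AGTR group (provably correct)
    for a, b in combinations(predicted, 2):
        if predicted[a] == predicted[b]:
            same_cluster += 1
            if agtr[a] == agtr[b]:
                same_agtr += 1
    # With no intra-cluster pairs, precision is vacuously perfect.
    return same_agtr / same_cluster if same_cluster else 1.0

# Toy example: three samples clustered together; the AGTR can only
# verify that s1 and s2 are related, so one of three pairs is certain.
pred = {"s1": 0, "s2": 0, "s3": 0}
refinement = {"s1": "g1", "s2": "g1", "s3": "g2"}
print(pairwise_precision_lower_bound(pred, refinement))  # prints 0.3333333333333333
```

The true precision may be higher (s3 might genuinely belong with s1 and s2), but it cannot be lower, which is exactly what makes the quantity usable as a bound when reference labels are unavailable.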

Supplementary Material

MP4 File (AISec21-12129.mp4)
In some problem spaces, lower-quality reference datasets must be used because obtaining ground truth labels is too expensive or time-consuming. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). An AGTR allows bounds on the precision, recall, and accuracy of a multi-class classifier or clustering algorithm to be computed, and it supports other evaluation functionality. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool to diagnose overfitting in prior testing and evaluate changes whose impact could not be meaningfully quantified under previous data.


Published In

AISec '21: Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security
November 2021, 210 pages
ISBN: 9781450386579
DOI: 10.1145/3474369

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. classifier evaluation
  2. label quality
  3. malware

Qualifiers

  • Research-article

Conference

CCS '21

Acceptance Rates

Overall Acceptance Rate 94 of 231 submissions, 41%


Cited By

  • Validation of the practicability of logical assessment formula for evaluations with inaccurate ground-truth labels: An application study on tumour segmentation for breast cancer. Computing and Artificial Intelligence 2(2):1443, 27 Sep 2024. DOI: 10.59400/cai.v2i2.1443
  • Understanding the Process of Data Labeling in Cybersecurity. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 1596-1605, 21 May 2024. DOI: 10.1145/3605098.3636046
  • Logical assessment formula and its principles for evaluations with inaccurate ground-truth labels. Knowledge and Information Systems 66(4):2561-2573, 6 Jan 2024. DOI: 10.1007/s10115-023-02047-6
  • "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice. 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 339-364, Feb 2023. DOI: 10.1109/SaTML54575.2023.00031
  • SoK: Pragmatic Assessment of Machine Learning for Network Intrusion Detection. 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), pp. 592-614, Jul 2023. DOI: 10.1109/EuroSP57164.2023.00042
  • Evaluation of the Limit of Detection in Network Dataset Quality Assessment with PerQoDA. Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 170-185, 31 Jan 2023. DOI: 10.1007/978-3-031-23633-4_13
  • The Cross-Evaluation of Machine Learning-Based Network Intrusion Detection Systems. IEEE Transactions on Network and Service Management 19(4):5152-5169, Dec 2022. DOI: 10.1109/TNSM.2022.3157344
