DOI: 10.1145/3474369.3486867
Research Article

A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels

Published: 15 November 2021

Abstract

In some problem spaces, the high cost of obtaining ground truth labels necessitates the use of lower-quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference labels. We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality. Creating an AGTR requires domain knowledge, and malware family classification is a task with robust domain-knowledge approaches that support the construction of an AGTR. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool, diagnosing overfitting in prior testing and evaluating changes whose impact could not be meaningfully quantified under previous data.
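The intuition behind the metric bounds can be sketched in a few lines of code. The sketch below is an illustrative simplification, not the paper's exact formulation: it assumes an AGTR is a partition in which any two samples sharing a group are guaranteed to belong to the same true class. Under that assumption, pairwise precision computed against the AGTR can never overestimate the true pairwise precision, since every same-AGTR-group pair counted as correct really is correct. The function name and data layout are hypothetical choices for this example.

```python
from itertools import combinations

def pairwise_precision_lower_bound(predicted, agtr):
    """Lower-bound the pairwise precision of a clustering using an
    approximate ground truth refinement (AGTR).

    `predicted` and `agtr` each map a sample id to a cluster/group id.
    Every pair of samples sharing an AGTR group is assumed to share a
    true class, so counting only those pairs as correct yields a value
    that is at most the true pairwise precision.
    """
    same_cluster = 0  # pairs placed in the same predicted cluster
    same_agtr = 0     # ...that also share an AGTR group (provably correct)
    for a, b in combinations(predicted, 2):
        if predicted[a] == predicted[b]:
            same_cluster += 1
            if agtr[a] == agtr[b]:
                same_agtr += 1
    # With no intra-cluster pairs, precision is vacuously perfect.
    return same_agtr / same_cluster if same_cluster else 1.0

# Toy example: three samples clustered together; the AGTR can only
# verify that s1 and s2 are related, so one of three pairs is certain.
pred = {"s1": 0, "s2": 0, "s3": 0}
refinement = {"s1": "g1", "s2": "g1", "s3": "g2"}
print(pairwise_precision_lower_bound(pred, refinement))  # prints 0.3333333333333333
```

The true precision may be higher (s3 might genuinely belong with s1 and s2), but it cannot be lower, which is exactly what makes the quantity usable as a bound when reference labels are unavailable.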

Supplementary Material

MP4 File (AISec21-12129.mp4)
In some problem spaces, lower-quality reference datasets must be used because obtaining ground truth labels is too expensive or time-consuming. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). An AGTR allows bounds on the precision, recall, and accuracy of a multi-class classifier or clustering algorithm to be computed, and it supports other evaluation functionality. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool to diagnose overfitting in prior testing and evaluate changes whose impact could not be meaningfully quantified under previous data.


Published In

AISec '21: Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security
November 2021, 210 pages
ISBN: 9781450386579
DOI: 10.1145/3474369

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. classifier evaluation
  2. label quality
  3. malware

Qualifiers

  • Research-article

Conference

CCS '21

Acceptance Rates

Overall Acceptance Rate 94 of 231 submissions, 41%


Cited By

  • Validation of the practicability of logical assessment formula for evaluations with inaccurate ground-truth labels: An application study on tumour segmentation for breast cancer. Computing and Artificial Intelligence 2(2):1443, 27 Sep 2024. DOI: 10.59400/cai.v2i2.1443
  • Understanding the Process of Data Labeling in Cybersecurity. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 1596-1605, 21 May 2024. DOI: 10.1145/3605098.3636046
  • Logical assessment formula and its principles for evaluations with inaccurate ground-truth labels. Knowledge and Information Systems 66(4):2561-2573, 6 Jan 2024. DOI: 10.1007/s10115-023-02047-6
  • "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice. 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 339-364, Feb 2023. DOI: 10.1109/SaTML54575.2023.00031
  • SoK: Pragmatic Assessment of Machine Learning for Network Intrusion Detection. 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), pp. 592-614, Jul 2023. DOI: 10.1109/EuroSP57164.2023.00042
  • Evaluation of the Limit of Detection in Network Dataset Quality Assessment with PerQoDA. Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 170-185, 31 Jan 2023. DOI: 10.1007/978-3-031-23633-4_13
  • The Cross-Evaluation of Machine Learning-Based Network Intrusion Detection Systems. IEEE Transactions on Network and Service Management 19(4):5152-5169, Dec 2022. DOI: 10.1109/TNSM.2022.3157344
