How not to do ML: Showing the Negative Impact of Improper CVE Feature Selection in a Live Exploit Prediction Model • NorthSec 2025

Level: Medium

May 16 02:30 PM EDT

Talks will be streamed on YouTube and Twitch for free.

Machine learning has been used extensively for the prediction of cyber security threats for a number of years. More specifically, building predictive models for the exploitation of security vulnerabilities and the publication of vulnerability exploits is essential in anticipating threats in the cyber security landscape.

Many published approaches train ML models using publicly available data, be it online discussions or vulnerability details available through the publication of CVEs. Unfortunately, many challenges arise when encoding this data to predict exploitation. More importantly, many of these challenges do not affect the model's performance on historical data, but instead result in poor performance when the model is used live in a real environment.

In this talk, we will demonstrate our implementation and deployment of several of these methods. We show that these models underperform in a live environment compared with their historical evaluation. Vulnerability and threat information evolves over time, and is often not available on the day of a vulnerability's publication. We identify four incorrect ways to encode and evaluate features for the prediction of exploits that cause the model to mispredict exploits when used in a day-to-day live system.
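To make the timing issue concrete, here is a minimal Python sketch of point-in-time feature encoding. It is not the speaker's implementation, and it does not claim to reproduce any of the four encoding mistakes covered in the talk. The `FeatureEvent` class, the `features_as_of` helper, the feature names, and the CVE identifier are all hypothetical, and the timeline is made up for illustration. The idea is only that each feature value carries the date it became known, so a training row can be built from what a live system would have seen on a given day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FeatureEvent:
    """One piece of information about a CVE and the day it became known."""
    cve_id: str
    name: str          # hypothetical feature name, e.g. "cvss_score"
    value: float
    observed_on: date  # first day this value was publicly available


def features_as_of(events, cve_id, as_of):
    """Return the latest value of each feature known on or before `as_of`."""
    latest = {}
    # Walk events in chronological order so later values overwrite earlier ones.
    for ev in sorted(events, key=lambda e: e.observed_on):
        if ev.cve_id == cve_id and ev.observed_on <= as_of:
            latest[ev.name] = ev.value
    return latest


# Made-up timeline for a placeholder CVE identifier.
events = [
    FeatureEvent("CVE-0000-0001", "reference_count", 1.0, date(2024, 1, 10)),
    FeatureEvent("CVE-0000-0001", "cvss_score", 9.8, date(2024, 1, 25)),
    FeatureEvent("CVE-0000-0001", "reference_count", 7.0, date(2024, 3, 1)),
]

publication_day = date(2024, 1, 10)

# What a live model could see on the day of publication.
live_view = features_as_of(events, "CVE-0000-0001", publication_day)

# What a later snapshot of the same record contains.
historical_view = features_as_of(events, "CVE-0000-0001", date(2024, 6, 1))

print(live_view)
print(historical_view)
```

In this toy timeline the later snapshot contains a score and a higher reference count that did not exist on publication day. A model trained on the later snapshot but queried with the publication-day view would see inputs unlike its training data, which is one plausible way a historical evaluation can look better than live performance.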

Ultimately, we show how a model with lower performance on its historical evaluation can better predict the publication of exploits in a live setting when its features are encoded correctly.

François Labrèche Senior Data Scientist, Sophos

François Labrèche is a Senior Data Scientist at Sophos, where he applies machine learning approaches to research problems related to security alerts and vulnerabilities. He focuses on using machine learning to improve the prioritization of alerts and vulnerabilities, in the context of XDR and vulnerability management. He explores the use of OSINT sources and the dark web in assessing the importance of newly published vulnerabilities.

He has a Ph.D. from École Polytechnique de Montréal, and has published research papers on the topics of spam detection, malware analysis, threat research and machine learning applied to cybersecurity. He has presented at ACSAC 2024, CAMLIS 2022, BSides Montreal 2021, University College London and École Polytechnique de Montréal, and has published papers in conferences such as the ACM Conference on Computer and Communications Security (CCS).