TRIAGE: Characterizing and auditing training data for improved regression

Seedat, Nabeel; Crabbé, Jonathan; Qian, Zhaozhi; van der Schaar, Mihaela

arXiv:2310.18970 (cs)

[Submitted on 29 Oct 2023]

Abstract: Data quality is crucial for robust machine learning algorithms, and the recent interest in data-centric AI emphasizes the importance of training data characterization. However, current data characterization methods focus mainly on classification settings, while regression settings remain understudied. To address this, we introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score. We operationalize the score to analyze individual samples' training dynamics and characterize samples as under-, over-, or well-estimated by the model. We show that TRIAGE's characterization is consistent and highlight its utility for improving performance via data sculpting/filtering in multiple regression settings. Additionally, beyond the sample level, we show that TRIAGE enables new approaches to dataset selection and feature acquisition. Overall, TRIAGE highlights the value unlocked by data characterization in real-world regression applications.
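To make the scoring idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified split-conformal predictive distribution built from held-out calibration residuals, and scores a sample by where its true label falls in that distribution. The function names (`cpd_score`, `characterize`) and the thresholds 0.25 and 0.75 are hypothetical choices for illustration only; the abstract does not specify them.

```python
import numpy as np


def cpd_score(cal_residuals, y_pred, y_true):
    """Approximate P(Y <= y_true | x) from calibration residuals.

    Assumption: residuals are y - prediction on a held-out calibration set,
    and the predictive distribution is the prediction shifted by them.
    """
    cal_residuals = np.asarray(cal_residuals, dtype=float)
    n = cal_residuals.size
    # Fraction of calibration residuals not exceeding this sample's residual,
    # with the usual (n + 1) conformal denominator.
    return float(np.sum(cal_residuals <= (y_true - y_pred))) / (n + 1)


def characterize(score, low=0.25, high=0.75):
    """Map a score to a label; thresholds are hypothetical."""
    if score < low:
        # True label sits in the lower tail: the model predicts too high.
        return "over-estimated"
    if score > high:
        # True label sits in the upper tail: the model predicts too low.
        return "under-estimated"
    return "well-estimated"


# Toy usage: one sample scored at three hypothetical training checkpoints.
residuals = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
checkpoint_preds = [6.0, 8.5, 10.0]
scores = [cpd_score(residuals, p, y_true=10.0) for p in checkpoint_preds]
mean_score = float(np.mean(scores))
label = characterize(mean_score)
```

Averaging the score over checkpoints is one simple way to summarize a sample's training dynamics; the paper's actual aggregation and grouping rules may differ.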

Comments:	Presented at NeurIPS 2023
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2310.18970 [cs.LG]
	(or arXiv:2310.18970v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.18970

Submission history

From: Nabeel Seedat
[v1] Sun, 29 Oct 2023 10:31:59 UTC (9,154 KB)
