CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]

Li, Peng; Rao, Xi; Blase, Jennifer; Zhang, Yue; Chu, Xu; Zhang, Ce

arXiv:1904.09483v2 (cs)

[Submitted on 20 Apr 2019 (v1), revised 26 Apr 2019 (this version, v2), latest version 5 Apr 2021 (v3)]

Abstract: It is widely recognized that data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there is no rigorous study of exactly how cleaning affects ML. The ML community usually focuses on the effects of specific types of noise with certain distributions (e.g., mislabels) on certain ML models, while the database (DB) community has mostly studied data cleaning in isolation, without considering how the data is consumed by downstream analytics.
We propose the CleanML benchmark that systematically investigates the impact of data cleaning on downstream ML models. The CleanML benchmark currently includes 13 real-world datasets with real errors, five common error types, and seven different ML models. To ensure that our findings are statistically significant, CleanML carefully controls the randomness in ML experiments using statistical hypothesis testing, and also uses the Benjamini-Yekutieli (BY) procedure to control potential false discoveries due to many hypotheses in the benchmark. We obtain many interesting and non-trivial insights, and identify multiple open research directions. We also release the benchmark and hope to invite future studies on the important problems of joint data cleaning and ML.
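
To make the false-discovery control mentioned above concrete, the following is a minimal sketch of the standard Benjamini-Yekutieli step-up procedure in plain Python. It is not taken from the CleanML code: the function name `benjamini_yekutieli`, the significance level, and the p-values in the usage lines are all illustrative assumptions.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Hypothetical helper implementing the textbook BY step-up rule;
    not the benchmark's own implementation.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m, which makes
    # the procedure valid under arbitrary dependence between tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= k * alpha / (m * c(m)).
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            max_k = rank

    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Illustrative p-values only; they are not results from the paper.
example_p_values = [0.9, 0.001, 0.039, 0.008, 0.041]
decisions = benjamini_yekutieli(example_p_values, alpha=0.05)
print(decisions)
```

With these made-up inputs, only the two smallest p-values (0.001 and 0.008) fall under their BY thresholds, so only those two hypotheses are rejected. The harmonic factor makes BY more conservative than the Benjamini-Hochberg procedure, which is the price of not assuming independence among the many tests.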

Comments:	12 pages
Subjects:	Databases (cs.DB); Machine Learning (cs.LG)
Cite as:	arXiv:1904.09483 [cs.DB]
	(or arXiv:1904.09483v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.1904.09483

Submission history

From: Peng Li
[v1] Sat, 20 Apr 2019 19:12:03 UTC (310 KB)
[v2] Fri, 26 Apr 2019 00:17:24 UTC (310 KB)
[v3] Mon, 5 Apr 2021 23:35:41 UTC (790 KB)
