Reliability-based cleaning of noisy training labels with inductive conformal prediction in multi-modal biomedical data mining

Zhan, Xianghao; Xu, Qinmei; Zheng, Yuanning; Lu, Guangming; Gevaert, Olivier

arXiv:2309.07332 (cs)

COVID-19 e-print

Important: e-prints posted on arXiv are not peer-reviewed by arXiv; they should not be relied upon without context to guide clinical practice or health-related behavior and should not be reported in news media as established information without consulting multiple experts in the field.

[Submitted on 13 Sep 2023]

Abstract: Accurately labeling biomedical data presents a challenge. Traditional semi-supervised learning methods often under-utilize available unlabeled data. To address this, we propose a novel reliability-based training data cleaning method employing inductive conformal prediction (ICP). This method capitalizes on a small set of accurately labeled training data and leverages ICP-calculated reliability metrics to rectify mislabeled data and outliers within vast quantities of noisy training data. The efficacy of the method is validated across three classification tasks in distinct modalities: filtering drug-induced liver injury (DILI) literature using titles and abstracts, predicting ICU admission of COVID-19 patients from CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels through label permutation. Results show significant enhancements in classification performance: accuracy improved in 86 out of 96 DILI experiments (by up to 11.4%), AUROC and AUPRC improved in all 48 COVID-19 experiments (by up to 23.8% and 69.8%, respectively), and accuracy and macro-average F1 score improved in 47 out of 48 RNA-sequencing experiments (by up to 74.6% and 89.0%, respectively). Our method offers the potential to substantially boost classification performance in multi-modal biomedical machine learning tasks. Importantly, it accomplishes this without requiring an excessive volume of meticulously curated training data.
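To make the general idea concrete, here is a minimal sketch of ICP-based label cleaning on toy 2-D data. It is not the authors' implementation: the abstract does not specify the nonconformity measure, the thresholds, or the decision rule, so the nearest-centroid score, the pooled calibration set, the `low` and `high` thresholds, and every function name below are assumptions chosen for illustration.

```python
import math

def centroids(X, y):
    """Mean feature vector per class, computed from the clean proper-training set."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def nonconformity(x, label, cents):
    """Assumed score: distance to the centroid of `label` (larger = less typical)."""
    return math.dist(x, cents[label])

def p_values(x, cents, cal_scores):
    """Standard ICP p-value of x for each candidate label, using pooled calibration scores."""
    n = len(cal_scores)
    out = {}
    for c in cents:
        a = nonconformity(x, c, cents)
        out[c] = (sum(s >= a for s in cal_scores) + 1) / (n + 1)
    return out

def clean_labels(X_noisy, y_noisy, cents, cal_scores, low=0.25, high=0.5):
    """Hypothetical rule: relabel when the given label is implausible and another
    label is plausible; return None (outlier) when no label is plausible."""
    cleaned = []
    for x, label in zip(X_noisy, y_noisy):
        p = p_values(x, cents, cal_scores)
        best = max(p, key=p.get)
        if p[best] < low:
            cleaned.append(None)       # conforms to no class: treat as outlier
        elif best != label and p[label] < low and p[best] > high:
            cleaned.append(best)       # confident correction
        else:
            cleaned.append(label)      # keep the original label
    return cleaned

# Small, accurately labeled set, split into proper-training and calibration parts.
X_train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y_train = [0, 0, 0, 1, 1, 1]
X_cal = [(0.5, 0.5), (1, 1), (5.5, 5.5), (5, 5)]
y_cal = [0, 0, 1, 1]

cents = centroids(X_train, y_train)
cal_scores = [nonconformity(x, y, cents) for x, y in zip(X_cal, y_cal)]

# Noisy set: first and third labels are flipped, the last point fits neither class.
X_noisy = [(0.2, 0.4), (5.2, 5.1), (5.4, 5.3), (2.5, 2.5)]
y_noisy = [1, 1, 0, 0]

cleaned = clean_labels(X_noisy, y_noisy, cents, cal_scores)
print(cleaned)  # [0, 1, 1, None]
```

With only four calibration points the smallest attainable p-value is 1/5, which is why the illustrative `low` threshold sits just above it; a realistic calibration set would allow much finer thresholds.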

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Applications (stat.AP); Machine Learning (stat.ML)
Cite as:	arXiv:2309.07332 [cs.LG]
	(or arXiv:2309.07332v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2309.07332

Submission history

From: Xianghao Zhan
[v1] Wed, 13 Sep 2023 22:04:50 UTC (8,607 KB)
