DOI: 10.1145/3626246.3654737

IDE: A System for Iterative Mislabel Detection

Published: 09 June 2024

Abstract

While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in the training data. Acquiring high-quality annotated data is costly and time-consuming in real-world scenarios, as it requires extensive human annotation and verification. Consequently, many industry models are trained on data containing substantial label noise, which significantly degrades their performance.
To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs their labels. Specifically, IDE combines the early-loss observation with influence-based verification to identify mislabeled instances iteratively. Once mislabeled instances are detected in an iteration, IDE repairs their labels, which improves detection accuracy in subsequent iterations. The framework automatically determines the termination point at which the early-loss signal is no longer effective. For the remaining uncertain instances, it generates pseudo labels to train a binary classification model and leverages that model's generalization ability to make the final decision. Using a real-life scenario, we demonstrate that IDE produces high-quality training data through effective mislabel detection and repair.
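The early-loss idea described above can be sketched in a few lines: instances that still incur high loss after only a few training epochs are likely mislabeled, so they are flagged, their labels flipped, and training restarted on the cleaner data. The following is a minimal illustrative sketch under that assumption; the model, function names, and thresholds are my own choices for demonstration, not IDE's actual implementation (which additionally performs influence-based verification and learns a stopping criterion).

```python
# Illustrative sketch of iterative early-loss mislabel detection and repair,
# in the spirit of IDE. Logistic regression stands in for the real model.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def early_losses(X, y, epochs=20, lr=0.5):
    """Briefly train logistic regression; return each instance's loss.

    Key observation: clean instances are fitted early in training, so
    instances that still have high loss after a few epochs are suspects.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)  # full-batch gradient step
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def iterative_repair(X, y, rounds=3, frac=0.03):
    """Each round: retrain, flag the highest early-loss instances, flip them."""
    y = y.copy()
    for _ in range(rounds):
        losses = early_losses(X, y)
        k = max(1, int(frac * len(y)))
        suspects = np.argsort(losses)[-k:]  # top-k early-loss instances
        y[suspects] = 1 - y[suspects]       # repair: flip the binary label
    return y

# Synthetic binary data with 10% injected label noise.
X = rng.normal(size=(400, 2))
clean = (X[:, 0] + X[:, 1] > 0).astype(int)
noisy = clean.copy()
flipped = rng.choice(len(noisy), size=40, replace=False)
noisy[flipped] = 1 - noisy[flipped]

repaired = iterative_repair(X, noisy)
```

On this toy data the repaired labels agree with the ground truth more often than the noisy ones do, since the injected flips sit far from the decision boundary and thus dominate the early-loss ranking; in IDE, influence-based verification additionally filters false positives among the flagged suspects.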


    Published In

    SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
    June 2024
    694 pages
    ISBN:9798400704222
    DOI:10.1145/3626246

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. data cleaning
    2. influence function
    3. mislabel detection

    Qualifiers

    • Short-paper

    Funding Sources

    • NSFC
    • DITDP
    • the Beijing Natural Science Foundation
    • the Research Funds of Renmin University of China
    • the National Key R&D Program of China
    • NSF (National Science Foundation)

    Conference

    SIGMOD/PODS '24

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
