DOI: 10.1145/3626246.3654737

IDE: A System for Iterative Mislabel Detection

Published: 09 June 2024

Abstract

While machine learning techniques, especially deep neural networks, have shown remarkable success in various applications, their performance is adversely affected by label errors in the training data. Acquiring high-quality annotated data is costly and time-consuming in real-world scenarios, as it requires extensive human annotation and verification. Consequently, many industry models are trained on data containing substantial label noise, which significantly degrades their performance.
To address this critical issue, we demonstrate IDE, a novel system that iteratively detects mislabeled instances and repairs their labels. Specifically, IDE combines the early-loss observation with influence-based verification to identify mislabeled instances iteratively. Once mislabeled instances are detected in an iteration, IDE repairs their labels, which improves detection accuracy in subsequent iterations. The framework automatically determines the termination point at which the early-loss signal is no longer effective. For the remaining uncertain instances, it generates pseudo labels to train a binary classification model and leverages that model's generalization ability to make the final decision. Using a real-life scenario, we demonstrate that IDE produces high-quality training data through effective mislabel detection and repair.
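The early-loss idea described above can be sketched in a few lines: instances that still incur high loss after only a few training epochs are likely mislabeled, so they are flagged, their labels flipped, and training restarted on the cleaner data. The following is a minimal illustrative sketch under that assumption; the model, function names, and thresholds are my own choices for demonstration, not IDE's actual implementation (which additionally performs influence-based verification and learns a stopping criterion).

```python
# Illustrative sketch of iterative early-loss mislabel detection and repair,
# in the spirit of IDE. Logistic regression stands in for the real model.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def early_losses(X, y, epochs=20, lr=0.5):
    """Briefly train logistic regression; return each instance's loss.

    Key observation: clean instances are fitted early in training, so
    instances that still have high loss after a few epochs are suspects.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)  # full-batch gradient step
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def iterative_repair(X, y, rounds=3, frac=0.03):
    """Each round: retrain, flag the highest early-loss instances, flip them."""
    y = y.copy()
    for _ in range(rounds):
        losses = early_losses(X, y)
        k = max(1, int(frac * len(y)))
        suspects = np.argsort(losses)[-k:]  # top-k early-loss instances
        y[suspects] = 1 - y[suspects]       # repair: flip the binary label
    return y

# Synthetic binary data with 10% injected label noise.
X = rng.normal(size=(400, 2))
clean = (X[:, 0] + X[:, 1] > 0).astype(int)
noisy = clean.copy()
flipped = rng.choice(len(noisy), size=40, replace=False)
noisy[flipped] = 1 - noisy[flipped]

repaired = iterative_repair(X, noisy)
```

On this toy data the repaired labels agree with the ground truth more often than the noisy ones do, since the injected flips sit far from the decision boundary and thus dominate the early-loss ranking; in IDE, influence-based verification additionally filters false positives among the flagged suspects.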


    Published In

    SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
    June 2024
    694 pages
    ISBN:9798400704222
    DOI:10.1145/3626246

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. data cleaning
    2. influence function
    3. mislabel detection

    Qualifiers

    • Short-paper

    Funding Sources

    • NSFC
    • DITDP
    • the Beijing Natural Science Foundation
    • the Research Funds of Renmin University of China
    • the National Key R&D Program of China
    • NSF (National Science Foundation)

    Conference

    SIGMOD/PODS '24

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
