DOI: 10.1145/2882903.2899409

ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning

Published: 26 June 2016

Abstract

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.
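To make the idea of updating a model while small batches are cleaned more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not reproduce the paper's algorithm. It assumes a least-squares model, a hypothetical `clean_fn` callback standing in for the analyst, and a sampling distribution that mixes uniform sampling with sampling proportional to the gradient norm on the still-dirty values. All function names, the toy data, and the hyperparameters are illustrative assumptions.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad(theta, x, y):
    # Gradient of the squared loss 0.5 * (theta . x - y)^2 for one record.
    err = dot(theta, x) - y
    return [err * xi for xi in x]


def norm(v):
    return sum(a * a for a in v) ** 0.5


def clean_while_training(dirty, clean_fn, dim, batch_size=10, steps=2000,
                         lr=0.2, seed=0):
    """Gradient descent that cleans a small batch of records per step.

    dirty    : list of (x, y) records, some of them corrupted
    clean_fn : hypothetical callback (index, record) -> cleaned record
    """
    rng = random.Random(seed)
    n = len(dirty)
    theta = [0.0] * dim
    cleaned = {}  # index -> cleaned record
    for _ in range(steps):
        # Exact gradient contribution of the records cleaned so far.
        g = [0.0] * dim
        for x, y in cleaned.values():
            for j, gj in enumerate(grad(theta, x, y)):
                g[j] += gj
        remaining = [i for i in range(n) if i not in cleaned]
        if remaining:
            m = len(remaining)
            # Assumption: favor records whose dirty values give a large
            # gradient, mixed with uniform sampling to keep weights bounded.
            norms = [norm(grad(theta, *dirty[i])) for i in remaining]
            total = sum(norms)
            probs = [0.5 / m + (0.5 * w / total if total > 0 else 0.5 / m)
                     for w in norms]
            picks = rng.choices(range(m), weights=probs, k=batch_size)
            for p in picks:
                i = remaining[p]
                if i not in cleaned:
                    cleaned[i] = clean_fn(i, dirty[i])
                x, y = cleaned[i]
                # Importance weight 1 / (p_i * k) makes the batch an unbiased
                # estimate of the summed gradient over the uncleaned records.
                weight = 1.0 / (probs[p] * batch_size)
                for j, gj in enumerate(grad(theta, x, y)):
                    g[j] += weight * gj
        theta = [t - lr * gj / n for t, gj in zip(theta, g)]
    return theta, len(cleaned)


def make_toy_data(n=40):
    # Noise-free line y = 1 + 2x; every fourth label is corrupted to 0.
    truth = [([1.0, i / n], 1.0 + 2.0 * i / n) for i in range(n)]
    dirty = [(x, 0.0) if i % 4 == 0 else (x, y)
             for i, (x, y) in enumerate(truth)]
    return truth, dirty


truth, dirty = make_toy_data()
theta, n_cleaned = clean_while_training(dirty, lambda i, rec: truth[i], dim=2)
print(theta, n_cleaned)
```

In this toy setup the cleaning callback simply looks up the ground truth. In an interactive setting it would be replaced by whatever the analyst does to repair a record.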




    Published In

    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN:9781450335317
    DOI:10.1145/2882903
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data cleaning
    2. machine learning

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    San Francisco, California, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
