DOI: 10.1145/3533028.3533311
research-article

GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines

Published: 12 June 2022

Abstract

Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
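
The idea of injecting errors at a chosen rate while keeping the ground truth can be illustrated with a minimal sketch. This is not GouDa's implementation or API: the function names `inject_missing_values` and `precision_recall`, the single "missing value" error type, and the toy table are all assumptions made for illustration.

```python
import random
from copy import deepcopy


def inject_missing_values(rows, column, error_rate, seed=0):
    """Hypothetical helper: blank out `column` in a fraction of the rows.

    Returns the dirty copy and the list of (row_index, column) cells that
    were changed. The untouched input `rows` serves as the ground truth.
    """
    rng = random.Random(seed)  # fixed seed keeps the generated errors reproducible
    dirty = deepcopy(rows)
    n_errors = round(error_rate * len(rows))
    error_indices = sorted(rng.sample(range(len(rows)), n_errors))
    for i in error_indices:
        dirty[i][column] = None
    return dirty, [(i, column) for i in error_indices]


def precision_recall(detected, ground_truth_errors):
    """Score a set of detected error cells against the known injected cells."""
    detected, truth = set(detected), set(ground_truth_errors)
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy clean table (assumed data, not taken from the paper).
clean = [{"id": i, "city": "Berlin"} for i in range(10)]
dirty, errors = inject_missing_values(clean, "city", error_rate=0.3)

# A trivial "detector" that flags every empty cell in the city column.
detected = [(i, "city") for i, row in enumerate(dirty) if row["city"] is None]
score = precision_recall(detected, errors)
```

Because the injected cells are recorded, any error detection or repair step can be scored against them, which is the kind of evaluation the abstract describes.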


Cited By

  • (2024) Towards an End-to-End Data Quality Optimizer. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 262-266. DOI: 10.1109/ICDEW61823.2024.00039. Online publication date: 13-May-2024
  • (2024) Gecko: A Python library for the generation and mutation of realistic personal identification data at scale. SoftwareX 27, 101846. DOI: 10.1016/j.softx.2024.101846. Online publication date: Sep-2024


    Published In

    DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
    June 2022
    63 pages
    ISBN: 9781450393751
    DOI: 10.1145/3533028

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data preparation pipelines
    2. data sets
    3. error generation
    4. evaluation

    Conference

    SIGMOD/PODS '22
    Acceptance Rates

    DEEM '22 Paper Acceptance Rate 9 of 13 submissions, 69%;
    Overall Acceptance Rate 44 of 67 submissions, 66%
