GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines

Authors:

Valerie Restat,

Gerrit Boerner,

Uta Störl

DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning

Article No.: 2, Pages 1 - 6

https://doi.org/10.1145/3533028.3533311

Published: 12 June 2022

Abstract

Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
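
The idea of generating errors at a chosen rate while retaining ground truth can be illustrated with a minimal sketch. This is not GouDa's actual interface: the function names `inject_errors`, `missing_value`, and `typo`, the cell-level error rate, and the sample table are all assumptions made for illustration.

```python
import copy
import random


def missing_value(value):
    """Hypothetical error type: replace the cell with None."""
    return None


def typo(value):
    """Hypothetical error type: swap the first two characters of the cell's text."""
    text = str(value)
    if len(text) < 2 or text[0] == text[1]:
        return text + "?"  # fallback so the cell always changes
    return text[1] + text[0] + text[2:]


def inject_errors(rows, error_funcs, error_rate, seed=0):
    """Corrupt a fraction of cells and return (dirty, ground_truth, error_log).

    error_rate is interpreted here as the fraction of all cells to corrupt
    (an assumption; other definitions, e.g. per row, are possible).
    """
    rng = random.Random(seed)
    ground_truth = copy.deepcopy(rows)  # clean copy kept for later evaluation
    dirty = copy.deepcopy(rows)
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    n_errors = round(error_rate * len(cells))
    error_log = []
    # Sample without replacement so each cell is corrupted at most once.
    for r, c in rng.sample(cells, n_errors):
        func = rng.choice(error_funcs)
        dirty[r][c] = func(dirty[r][c])
        error_log.append((r, c, func.__name__))
    return dirty, ground_truth, error_log


clean = [
    ["Alice", 34, "Berlin"],
    ["Bob", 27, "Hagen"],
    ["Carol", 45, "Munich"],
    ["Dave", 52, "Bonn"],
]
dirty, truth, log = inject_errors(clean, [missing_value, typo], error_rate=0.25)
```

A data preparation tool could then be run on `dirty` and its output compared cell by cell against `truth`, with `log` indicating which error type was introduced at each position.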

Cited By

Restat V, Klettke M, Störl U (2024). Towards an End-to-End Data Quality Optimizer. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 262-266. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDEW61823.2024.00039

Jugl M, Kirsten T (2024). Gecko: A Python library for the generation and mutation of realistic personal identification data at scale. SoftwareX 27, 101846. Online publication date: Sep-2024.
https://doi.org/10.1016/j.softx.2024.101846

Index Terms

GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning

Published In

DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning

June 2022

63 pages

ISBN: 9781450393751

DOI: 10.1145/3533028

Conference Chairs:
Matthias Boehm,
Paroma Varma,
Doris Xin

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2022

Qualifiers

Research-article

Conference

SIGMOD/PODS '22

Sponsor:

SIGMOD

SIGMOD/PODS '22: International Conference on Management of Data

June 12, 2022

Philadelphia, Pennsylvania

Acceptance Rates

DEEM '22 Paper Acceptance Rate 9 of 13 submissions, 69%;

Overall Acceptance Rate 44 of 67 submissions, 66%

Bibliometrics

Total Citations: 2

Total Downloads: 186

Downloads (Last 12 months): 40

Downloads (Last 6 weeks): 4

Reflects downloads up to 15 Oct 2024
