
BUNNI: Learning Repair Actions in Rule-driven Data Cleaning

Authors:

Giansalvatore Mecca,

Donatello Santoro,

Enzo Veltri

ACM Journal of Data and Information Quality, Volume 16, Issue 2

Article No.: 12, Pages 1 - 31

https://doi.org/10.1145/3665930

Published: 24 June 2024

Abstract

In this work, we address the challenging and open problem of involving non-expert users as first-class citizens in data repairing. Although a large number of proposals have been devoted to cleaning data from the point of view of expert users (IT staff and data scientists), there is a lack of studies from the perspective of non-expert ones. Given a set of available data quality rules, we exploit machine learning techniques to guide the user in identifying the dirty values for each violation and repairing them. We show that, with low user effort, it is possible to identify the values in tuples that can be trusted and the ones that are most likely errors. We show experimentally how this machine learning approach leads to a unique clean solution with high quality in scenarios where other approaches fail.


Published In


Journal of Data and Information Quality Volume 16, Issue 2

June 2024

135 pages

EISSN:1936-1963

DOI:10.1145/3613602

Editor:
Felix Naumann
Hasso Plattner Institute, Germany


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2024

Online AM: 25 May 2024

Accepted: 15 May 2024

Revised: 06 February 2024

Received: 04 May 2023

Published in JDIQ Volume 16, Issue 2
