research-article

HoloClean: holistic data repairs with probabilistic inference

Authors:

Theodoros Rekatsinas,

Christopher RéAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 10, Issue 11

Pages 1190 - 1201

https://doi.org/10.14778/3137628.3137631

Published: 01 August 2017 Publication History

Abstract

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean finds data repairs with an average precision of ∼ 90% and an average recall of above ∼ 76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2× against state-of-the-art methods.

References

[1]

https://data.cityofchicago.org.

[2]

https://alchemy.cs.washington.edu/.

[3]

https://data.medicare.gov/data/physician-compare.

[4]

Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow., 9(12):993--1004, Aug. 2016.

Digital Library

[5]

F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, pages 31--41, 2009.

Digital Library

[6]

S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-loss markov random fields and probabilistic soft logic. CoRR, abs/1505.04406, 2015.

[7]

L. E. Bertossi, S. Kolahi, and L. V. S. Lakshmanan. Data cleaning and query answering with matching dependencies and matching functions. In ICDT, pages 268--279, 2011.

Digital Library

[8]

G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010.

Digital Library

[9]

G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541--552, 2013.

Digital Library

[10]

P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143--154. ACM, 2005.

Digital Library

[11]

P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.

[12]

V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. PNAS, 110(13):1181--1190, 2013.

[13]

J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 197(1):90--121, 2005.

Digital Library

[14]

X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498--1509, 2013.

Digital Library

[15]

X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013.

Digital Library

[16]

X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261, 2015.

Digital Library

[17]

G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315--326. VLDB Endowment, 2007.

Digital Library

[18]

M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013.

Digital Library

[19]

K. Das and J. Schneider. Detecting anomalous records In categorical datasets. KDD, pages 220--229, 2007.

Digital Library

[20]

S. Ermon, C. P. Gomes, and B. Selman. Uniform solution sampling using a constraint solver as an oracle. In UAI, pages 255--264, 2012.

Digital Library

[21]

R. Fagin, B. Kimelfeld, and P. G. Kolaitis. Dichotomies in the complexity of preferred repairs. PODS, pages 3--15, 2015.

Digital Library

[22]

W. Fan. Dependencies revisited for improving data quality. In SIGMOD, pages 159--170, 2008.

Digital Library

[23]

W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012.

Digital Library

[24]

W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1):407--418, 2009.

Digital Library

[25]

W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1--2):173--184, 2010.

Digital Library

[26]

F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The llunatic data-cleaning framework. PVLDB, 6(9):625--636, 2013.

Digital Library

[27]

L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018--2019, 2012.

Digital Library

[28]

T. P. Hayes and A. Sinclair. A general lower bound for mixing of single-site dynamics on graphs. The Annals of Applied Probability, 17(3):931--952, 2007.

[29]

J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.

[30]

I. F. Ilyas. Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull., 39:38--46, 2016.

[31]

I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281--393, 2015.

Digital Library

[32]

M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.

[33]

S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, 2011.

Digital Library

[34]

S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009.

Digital Library

[35]

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.

Digital Library

[36]

N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, pages 1275--1278, 2009.

Digital Library

[37]

N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, pages 802--803, 2006.

Digital Library

[38]

X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, pages 97--108, 2013.

Digital Library

[39]

C. Mayfield, J. Neville, and S. Prabhakar. Eracer: A database approach for statistical inference and data cleaning. SIGMOD, pages 75--86, 2010.

Digital Library

[40]

F. Naumann and M. Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. 2010.

Digital Library

[41]

F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proc. VLDB Endow., 4(6):373--384, Mar. 2011.

Digital Library

[42]

E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.

[43]

A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In NIPS, pages 3567--3575, 2016.

Digital Library

[44]

T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. G. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017.

Digital Library

[45]

C. D. Sa, C. Zhang, K. Olukotun, and C. Ré. Rapidly mixing gibbs sampling for a class of factor graphs using hierarchy width. In NIPS, pages 3097--3105, 2015.

Digital Library

[46]

J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. Proc. VLDB Endow., 8(11):1310--1321, July 2015.

Digital Library

[47]

M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.

[48]

J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, pages 457--468. ACM, 2014.

Digital Library

[49]

M. Yakout, L. Berti-Équille, and A. K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, pages 553--564, 2013.

Digital Library

[50]

C. Zhang and C. Ré. Dimmwitted: A study of main-memory statistical analytics. PVLDB, pages 1283--1294, 2014.

Digital Library

Cited By

Mohr-Daurat HTheodorakis GPirk H(2024)Hardware-Efficient Data Imputation through DBMS ExtensibilityProceedings of the VLDB Endowment10.14778/3681954.368201617:11(3497-3510)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682016
Yuan YTang BZhou TZhang ZQin J(2024)nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic SystemsProceedings of the VLDB Endowment10.14778/3681954.368200017:11(3283-3289)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3682000
Yan MFan WWang YXie M(2024)Enriching Relations with Additional Attributes for ERProceedings of the VLDB Endowment10.14778/3681954.368198717:11(3109-3123)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3681987
Show More Cited By

Index Terms

HoloClean: holistic data repairs with probabilistic inference
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Recommendations

Hierarchical hidden conditional random fields for information extraction
LION'05: Proceedings of the 5th international conference on Learning and Intelligent Optimization

Hidden Markov Models (HMMs) are very popular generative models for time series data. Recent work, however, has shown that for many tasks Conditional Random Fields (CRFs), a type of discriminative model, perform better than HMMs. Information extraction ...
Endpoint detection of sio₂ plasma etching using expanded hidden markov model
ISNN'10: Proceedings of the 7th international conference on Advances in Neural Networks - Volume Part II

In this paper, extended Hidden Markov Model (eHMM) is employed to resolve transition detection problems in plasma etch processes using optical emission spectroscopy (OES) data The proposed eHMM framework is a one of various semi-Markov models: a ...
The infinite hidden Markov random field model

Hidden Markov random field (HMRF) models are widely used for image segmentation, as they appear naturally in problems where a spatially constrained clustering scheme is asked for. A major limitation of HMRF models concerns the automatic selection of the ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 10, Issue 11

August 2017

432 pages

ISSN:2150-8097

Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2017

Published in PVLDB Volume 10, Issue 11

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

150
Total Citations
View Citations
415
Total Downloads

Downloads (Last 12 months)67
Downloads (Last 6 weeks)6

Reflects downloads up to 30 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Mohr-Daurat HTheodorakis GPirk H(2024)Hardware-Efficient Data Imputation through DBMS ExtensibilityProceedings of the VLDB Endowment10.14778/3681954.368201617:11(3497-3510)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682016
Yuan YTang BZhou TZhang ZQin J(2024)nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic SystemsProceedings of the VLDB Endowment10.14778/3681954.368200017:11(3283-3289)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3682000
Yan MFan WWang YXie M(2024)Enriching Relations with Additional Attributes for ERProceedings of the VLDB Endowment10.14778/3681954.368198717:11(3109-3123)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3681987
Sun YZhu JXu XXu XSun YSong SLi XYuan X(2024)Win-Win: On Simultaneous Clustering and Imputing over Incomplete DataProceedings of the VLDB Endowment10.14778/3681954.368198217:11(3045-3057)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3681982
Ni WMiao XZhao XWu YLiang SYin J(2024)Automatic Data Repair: Are We Ready to Deploy?Proceedings of the VLDB Endowment10.14778/3675034.367505117:10(2617-2630)Online publication date: 1-Jun-2024
https://dl.acm.org/doi/10.14778/3675034.3675051
Liu ZDeep SFariha APsallidas FTiwari AFloratou A(2024)Rapidash: Efficient Detection of Constraint ViolationsProceedings of the VLDB Endowment10.14778/3659437.365945417:8(2009-2021)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.14778/3659437.3659454
Kuang SYang HTan ZMa S(2024)Efficient Differential Dependency DiscoveryProceedings of the VLDB Endowment10.14778/3654621.365462417:7(1552-1564)Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.14778/3654621.3654624
Shah VParashos TKumar A(2024)How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical AnalysesProceedings of the VLDB Endowment10.14778/3648160.364817817:6(1391-1404)Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.14778/3648160.3648178
Deng YChai CCao LTang NWang JFan JYuan YWang G(2024)MisDetect: Iterative Mislabel Detection using Early LossProceedings of the VLDB Endowment10.14778/3648160.364816117:6(1159-1172)Online publication date: 1-Feb-2024
https://dl.acm.org/doi/10.14778/3648160.3648161
Ni WZhang KMiao XZhao XWu YYin J(2024)IterClean: An Iterative Data Cleaning Framework with Large Language ModelsProceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674436(100-105)Online publication date: 5-Jul-2024
https://dl.acm.org/doi/10.1145/3674399.3674436
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents