Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

HoloClean: holistic data repairs with probabilistic inference

Published: 01 August 2017 Publication History

Abstract

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean finds data repairs with an average precision of ∼ 90% and an average recall of above ∼ 76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2× against state-of-the-art methods.

References

[1]
https://data.cityofchicago.org.
[2]
https://alchemy.cs.washington.edu/.
[3]
https://data.medicare.gov/data/physician-compare.
[4]
Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow., 9(12):993--1004, Aug. 2016.
[5]
F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, pages 31--41, 2009.
[6]
S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-loss markov random fields and probabilistic soft logic. CoRR, abs/1505.04406, 2015.
[7]
L. E. Bertossi, S. Kolahi, and L. V. S. Lakshmanan. Data cleaning and query answering with matching dependencies and matching functions. In ICDT, pages 268--279, 2011.
[8]
G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010.
[9]
G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541--552, 2013.
[10]
P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143--154. ACM, 2005.
[11]
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.
[12]
V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. PNAS, 110(13):1181--1190, 2013.
[13]
J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 197(1):90--121, 2005.
[14]
X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498--1509, 2013.
[15]
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013.
[16]
X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261, 2015.
[17]
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315--326. VLDB Endowment, 2007.
[18]
M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013.
[19]
K. Das and J. Schneider. Detecting anomalous records In categorical datasets. KDD, pages 220--229, 2007.
[20]
S. Ermon, C. P. Gomes, and B. Selman. Uniform solution sampling using a constraint solver as an oracle. In UAI, pages 255--264, 2012.
[21]
R. Fagin, B. Kimelfeld, and P. G. Kolaitis. Dichotomies in the complexity of preferred repairs. PODS, pages 3--15, 2015.
[22]
W. Fan. Dependencies revisited for improving data quality. In SIGMOD, pages 159--170, 2008.
[23]
W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012.
[24]
W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1):407--418, 2009.
[25]
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1--2):173--184, 2010.
[26]
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The llunatic data-cleaning framework. PVLDB, 6(9):625--636, 2013.
[27]
L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018--2019, 2012.
[28]
T. P. Hayes and A. Sinclair. A general lower bound for mixing of single-site dynamics on graphs. The Annals of Applied Probability, 17(3):931--952, 2007.
[29]
J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.
[30]
I. F. Ilyas. Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull., 39:38--46, 2016.
[31]
I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281--393, 2015.
[32]
M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.
[33]
S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, 2011.
[34]
S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009.
[35]
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
[36]
N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, pages 1275--1278, 2009.
[37]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, pages 802--803, 2006.
[38]
X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, pages 97--108, 2013.
[39]
C. Mayfield, J. Neville, and S. Prabhakar. Eracer: A database approach for statistical inference and data cleaning. SIGMOD, pages 75--86, 2010.
[40]
F. Naumann and M. Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. 2010.
[41]
F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proc. VLDB Endow., 4(6):373--384, Mar. 2011.
[42]
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.
[43]
A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In NIPS, pages 3567--3575, 2016.
[44]
T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. G. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017.
[45]
C. D. Sa, C. Zhang, K. Olukotun, and C. Ré. Rapidly mixing gibbs sampling for a class of factor graphs using hierarchy width. In NIPS, pages 3097--3105, 2015.
[46]
J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. Proc. VLDB Endow., 8(11):1310--1321, July 2015.
[47]
M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.
[48]
J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, pages 457--468. ACM, 2014.
[49]
M. Yakout, L. Berti-Équille, and A. K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, pages 553--564, 2013.
[50]
C. Zhang and C. Ré. Dimmwitted: A study of main-memory statistical analytics. PVLDB, pages 1283--1294, 2014.

Cited By

View all
  • (2024)Hardware-Efficient Data Imputation through DBMS ExtensibilityProceedings of the VLDB Endowment10.14778/3681954.368201617:11(3497-3510)Online publication date: 1-Jul-2024
  • (2024)nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic SystemsProceedings of the VLDB Endowment10.14778/3681954.368200017:11(3283-3289)Online publication date: 30-Aug-2024
  • (2024)Enriching Relations with Additional Attributes for ERProceedings of the VLDB Endowment10.14778/3681954.368198717:11(3109-3123)Online publication date: 30-Aug-2024
  • Show More Cited By

Index Terms

  1. HoloClean: holistic data repairs with probabilistic inference
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 10, Issue 11
    August 2017
    432 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 August 2017
    Published in PVLDB Volume 10, Issue 11

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)67
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 30 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Hardware-Efficient Data Imputation through DBMS ExtensibilityProceedings of the VLDB Endowment10.14778/3681954.368201617:11(3497-3510)Online publication date: 1-Jul-2024
    • (2024)nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic SystemsProceedings of the VLDB Endowment10.14778/3681954.368200017:11(3283-3289)Online publication date: 30-Aug-2024
    • (2024)Enriching Relations with Additional Attributes for ERProceedings of the VLDB Endowment10.14778/3681954.368198717:11(3109-3123)Online publication date: 30-Aug-2024
    • (2024)Win-Win: On Simultaneous Clustering and Imputing over Incomplete DataProceedings of the VLDB Endowment10.14778/3681954.368198217:11(3045-3057)Online publication date: 30-Aug-2024
    • (2024)Automatic Data Repair: Are We Ready to Deploy?Proceedings of the VLDB Endowment10.14778/3675034.367505117:10(2617-2630)Online publication date: 1-Jun-2024
    • (2024)Rapidash: Efficient Detection of Constraint ViolationsProceedings of the VLDB Endowment10.14778/3659437.365945417:8(2009-2021)Online publication date: 1-Apr-2024
    • (2024)Efficient Differential Dependency DiscoveryProceedings of the VLDB Endowment10.14778/3654621.365462417:7(1552-1564)Online publication date: 1-Mar-2024
    • (2024)How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical AnalysesProceedings of the VLDB Endowment10.14778/3648160.364817817:6(1391-1404)Online publication date: 1-Feb-2024
    • (2024)MisDetect: Iterative Mislabel Detection using Early LossProceedings of the VLDB Endowment10.14778/3648160.364816117:6(1159-1172)Online publication date: 1-Feb-2024
    • (2024)IterClean: An Iterative Data Cleaning Framework with Large Language ModelsProceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674436(100-105)Online publication date: 5-Jul-2024
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media