research-article

ReStore - Neural Data Completion for Relational Databases

Authors:

Benjamin Hilprecht,

Carsten Binnig

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 710 - 722

https://doi.org/10.1145/3448016.3457264

Published: 18 June 2021

Abstract

Classical approaches for OLAP assume that the data in all tables is complete. However, when tables are incomplete and tuples are missing, classical approaches fail because the result of a SQL aggregate query might differ significantly from the result computed on the full dataset. Today, the only way to deal with missing data is to complete the dataset manually, which not only takes considerable effort but also requires good statistical skills to determine when a dataset is actually complete. In this paper, we propose ReStore, an automated approach for relational data completion that uses a new class of (neural) schema-structured completion models able to synthesize data resembling the missing tuples. As we show in our evaluation, this efficiently helps to reduce the relative error of aggregate queries by up to 390% on real-world data compared to using the incomplete data directly for query answering.
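The effect of missing tuples on an aggregate can be illustrated with a small worked example. The following is a minimal sketch, not the ReStore method and not data from the paper: the table contents, the removal rule, and the helper names (`relative_error`, `sum_sales`, `avg_sales`) are all assumptions chosen for illustration. It shows how a biased loss of tuples shifts SUM and AVG results and how the relative error against the complete data could be computed.

```python
def relative_error(estimate: float, truth: float) -> float:
    """Relative error |estimate - truth| / |truth| of an aggregate result."""
    if truth == 0:
        raise ValueError("relative error is undefined when the true value is 0")
    return abs(estimate - truth) / abs(truth)


# Hypothetical complete table of (region, sales) tuples; illustrative only.
complete = [
    ("north", 100.0),
    ("north", 120.0),
    ("south", 300.0),
    ("south", 280.0),
]

# Assumed biased incompleteness: every tuple with sales above 250 is missing.
incomplete = [row for row in complete if row[1] <= 250.0]


def sum_sales(rows):
    """Equivalent of SELECT SUM(sales) over the given tuples."""
    return sum(sales for _, sales in rows)


def avg_sales(rows):
    """Equivalent of SELECT AVG(sales) over the given tuples."""
    return sum_sales(rows) / len(rows)


# Compare each aggregate on the incomplete table with the complete table.
sum_error = relative_error(sum_sales(incomplete), sum_sales(complete))
avg_error = relative_error(avg_sales(incomplete), avg_sales(complete))

print(f"SUM relative error: {sum_error:.3f}")
print(f"AVG relative error: {avg_error:.3f}")
```

In this toy setting both aggregates are off because the missing tuples are not a random sample of the table. A completion approach would aim to synthesize tuples so that the same queries, run over the completed table, land closer to the results on the full data.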

Supplementary Material

MP4 File (3448016.3457264.mp4)


Index Terms

ReStore - Neural Data Completion for Relational Databases
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Incomplete data
        Uncertainty
    2. Information integration
      1. Data cleaning

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

June 2021

2969 pages

ISBN:9781450383431

DOI:10.1145/3448016

General Chairs: Guoliang Li (Tsinghua University, China), Zhanhuai Li (Northwestern Polytechnical University, China)

Program Chairs: Stratos Idreos (Harvard University, USA), Divesh Srivastava (AT&T, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '21

Sponsor:

SIGMOD

SIGMOD/PODS '21: International Conference on Management of Data

June 20 - 25, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
