research-article

ReStore - Neural Data Completion for Relational Databases

Published: 18 June 2021

Abstract

Classical approaches for OLAP assume that the data in all tables is complete. When tables are incomplete and tuples are missing, these approaches fail because the result of a SQL aggregate query can differ significantly from the result computed on the full dataset. Today, the only way to deal with missing data is to complete the dataset manually, which not only requires considerable effort but also demands good statistical skills to determine when a dataset is actually complete. In this paper, we propose ReStore, an automated approach for relational data completion that uses a new class of (neural) schema-structured completion models able to synthesize data resembling the missing tuples. As we show in our evaluation, this helps reduce the relative error of aggregate queries by up to 390% on real-world data compared to answering queries directly on the incomplete data.
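To make the problem statement concrete, the following minimal sketch shows how missing tuples distort a SQL aggregate. It is an illustration only and not ReStore itself: the `apartment` table, its values, and the helper names are hypothetical, and the relative error is assumed to be the common definition |estimate - truth| / |truth|, which may differ from the exact metric used in the paper's evaluation.

```python
import sqlite3

ALLOWED_AGGREGATES = {"SUM", "AVG"}


def relative_error(estimate: float, truth: float) -> float:
    """Common definition of relative error (an assumption, see lead-in)."""
    return abs(estimate - truth) / abs(truth)


def aggregate_price(rows, agg: str) -> float:
    """Load rows into an in-memory table and run SELECT <agg>(price)."""
    if agg not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {agg}")
    con = sqlite3.connect(":memory:")
    try:
        # Hypothetical toy schema; not taken from the paper.
        con.execute("CREATE TABLE apartment (id INTEGER, price REAL)")
        con.executemany("INSERT INTO apartment VALUES (?, ?)", rows)
        (value,) = con.execute(f"SELECT {agg}(price) FROM apartment").fetchone()
        return value
    finally:
        con.close()


# Full dataset versus a biased incomplete one: the expensive tuples are missing.
complete = [(1, 100.0), (2, 200.0), (3, 300.0), (4, 400.0)]
incomplete = complete[:2]

true_sum = aggregate_price(complete, "SUM")
observed_sum = aggregate_price(incomplete, "SUM")
sum_error = relative_error(observed_sum, true_sum)

true_avg = aggregate_price(complete, "AVG")
observed_avg = aggregate_price(incomplete, "AVG")
avg_error = relative_error(observed_avg, true_avg)

print(f"SUM: true={true_sum}, observed={observed_sum}, relative error={sum_error:.2f}")
print(f"AVG: true={true_avg}, observed={observed_avg}, relative error={avg_error:.2f}")
```

In this toy case both aggregates are off because the missing tuples are not a random sample of the table. A completion approach would aim to synthesize tuples that resemble the missing ones, so that the same queries run on the completed table land closer to the true values.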

Supplementary Material

MP4 File (3448016.3457264.mp4)


Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data completion
  2. data-driven learning
  3. deep autoregressive models
  4. incomplete data
  5. relational data

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '21
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 25
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 09 Nov 2024

Cited By

  • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17:11 (3109-3123). DOI: 10.14778/3681954.3681987. Online publication date: 30-Aug-2024
  • (2024) In-Database Data Imputation. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639326. Online publication date: 26-Mar-2024
  • (2024) Controllable Tabular Data Synthesis Using Diffusion Models. Proceedings of the ACM on Management of Data 2:1 (1-29). DOI: 10.1145/3639283. Online publication date: 26-Mar-2024
  • (2024) A Neural Database for Answering Aggregate Queries on Incomplete Relational Data. IEEE Transactions on Knowledge and Data Engineering 36:7 (2790-2802). DOI: 10.1109/TKDE.2023.3310914. Online publication date: Jul-2024
  • (2024) A Neural Database for Answering Aggregate Queries on Incomplete Relational Data (Extended Abstract). 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5703-5704). DOI: 10.1109/ICDE60146.2024.00483. Online publication date: 13-May-2024
  • (2023) Splitting Tuples of Mismatched Entities. Proceedings of the ACM on Management of Data 1:4 (1-29). DOI: 10.1145/3626763. Online publication date: 12-Dec-2023
