research-article

FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds

Authors:

Oliver Kennedy,

Boris Glavic

Proceedings of the ACM on Management of Data, Volume 3, Issue 1

Article No.: 41, Pages 1 - 25

https://doi.org/10.1145/3709691

Published: 11 February 2025

Abstract

Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study the computation of expected multiplicities of query results over probabilistic databases under bag semantics, a problem with PTIME data complexity. But does this imply that bag probabilistic databases are practical? We address this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while in PTIME, can still incur significant overhead. To avoid this cost, we use approximate query processing techniques to sample monomials directly, without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.
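To make the idea of approximating an expected multiplicity by sampling monomials more concrete, the following is a minimal Python sketch. It is not FastPDB's implementation and does not use circuits: it assumes a tuple-independent bag PDB in which every input tuple is either present or absent with its own probability, and it represents a result tuple's lineage as an explicit list of monomials. The function names, the monomial encoding, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from math import prod

# ASSUMPTION (illustration only): a lineage polynomial is a list of monomials,
# each encoded as (coefficient, tuple_of_variable_names). Every variable is an
# independent Boolean random variable with marginal probability prob[name].

def monomial_expectation(variables, prob):
    # Boolean variables are idempotent (X * X = X), so a variable that occurs
    # several times in a monomial contributes its probability only once.
    return prod(prob[v] for v in set(variables))

def exact_expected_multiplicity(lineage, prob):
    # Linearity of expectation: sum the expectation of every monomial.
    return sum(c * monomial_expectation(vs, prob) for c, vs in lineage)

def sampled_expected_multiplicity(lineage, prob, n_samples, rng):
    # Draw monomials with probability proportional to their coefficient and
    # rescale the sample mean by the total coefficient mass. This estimator is
    # unbiased for the exact value computed above.
    weights = [c for c, _ in lineage]
    total = sum(weights)
    picks = rng.choices(lineage, weights=weights, k=n_samples)
    mean = sum(monomial_expectation(vs, prob) for _, vs in picks) / n_samples
    return total * mean

# Toy lineage a*b + 2*a*c + b*b over three hypothetical input tuples.
toy_lineage = [(1, ("a", "b")), (2, ("a", "c")), (1, ("b", "b"))]
toy_prob = {"a": 0.5, "b": 0.4, "c": 0.1}

if __name__ == "__main__":
    print(exact_expected_multiplicity(toy_lineage, toy_prob))
    print(sampled_expected_multiplicity(toy_lineage, toy_prob, 20_000, random.Random(0)))
```

In this sketch the exact computation touches every monomial, whereas the sampled estimate can be stopped after any number of draws, which is the sense in which such an approximation is anytime. The explicit monomial list used here is exactly the kind of materialized lineage that the abstract says the authors avoid building upfront.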

Index Terms

FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Incomplete data
        Uncertainty
    2. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Design and analysis of algorithms
    1. Approximation algorithms analysis

Published In

Proceedings of the ACM on Management of Data Volume 3, Issue 1

SIGMOD

February 2025

2261 pages

EISSN:2836-6573

DOI:10.1145/3717614

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2025 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Science Foundation
