research-article

FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds

Published: 11 February 2025

Abstract

Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study computing expected multiplicities of query results over probabilistic databases under bag semantics, a problem that has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless incur significant overhead. To avoid this cost, we utilize approximate query processing techniques to sample monomials directly, without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.
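To make the idea of approximating an expected multiplicity by sampling monomials concrete, the following is a minimal sketch and not the FastPDB implementation. It assumes a tuple-independent database, Boolean tuple variables (so repeated variables in a monomial collapse), non-negative integer coefficients, and a lineage polynomial that is already expanded into a flat list of monomials rather than stored as a circuit. The names `exact_expectation`, `sampled_expectation`, `lineage`, and `prob` are hypothetical and chosen only for illustration.

```python
import random
from math import prod

def exact_expectation(monomials, prob):
    """Expected multiplicity of a lineage polynomial.

    monomials: list of (coefficient, variables) pairs.
    prob: dict mapping each variable to its marginal probability.
    Assumes independent Boolean variables, hence the set() below.
    """
    return sum(c * prod(prob[v] for v in set(vs)) for c, vs in monomials)

def sampled_expectation(monomials, prob, n_samples, seed=0):
    """Unbiased estimate: draw monomials with probability proportional
    to their coefficient, average the product of the variables'
    probabilities, and rescale by the sum of all coefficients."""
    rng = random.Random(seed)
    weights = [c for c, _ in monomials]
    total = sum(weights)
    picks = rng.choices(monomials, weights=weights, k=n_samples)
    mean = sum(prod(prob[v] for v in set(vs)) for _, vs in picks) / n_samples
    return total * mean

# Hypothetical lineage 2*x*y + y*z + 3*x with made-up tuple probabilities.
lineage = [(2, ("x", "y")), (1, ("y", "z")), (3, ("x",))]
prob = {"x": 0.5, "y": 0.4, "z": 0.9}

exact = exact_expectation(lineage, prob)
estimate = sampled_expectation(lineage, prob, n_samples=20000)
print(exact, estimate)
```

Because the estimator is a running average, it can be stopped at any point and refined with further samples, which is the sense in which such an approximation is "anytime". The sketch does not address the cost of obtaining the monomials in the first place, which the abstract identifies as the remaining overhead.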

Published In

Proceedings of the ACM on Management of Data, Volume 3, Issue 1
SIGMOD
February 2025
2261 pages
EISSN:2836-6573
DOI:10.1145/3717614
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximate query processing
  2. fine-grained complexity
  3. lineage polynomials
  4. parameterized complexity
  5. probabilistic data model

Qualifiers

  • Research-article
