Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data

Published: 01 February 2023

Abstract

Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is, to the best of our knowledge, the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors and often produces more accurate results.
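
To make the idea of under- and over-approximating a ranking result concrete, the following is a minimal sketch and not the paper's algorithm or its PostgreSQL operators. It assumes each row carries a sort key known only as an interval [lo, hi], that rows are ranked in ascending key order, and that ties may be broken either way. The names UncertainRow, position_bounds, and top_k are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple


class UncertainRow(NamedTuple):
    """A row whose sort key is only known to lie in [lo, hi] (assumed model)."""
    name: str
    lo: float
    hi: float


def position_bounds(rows):
    """Return {name: (min_pos, max_pos)} for 0-based ascending sort positions."""
    bounds = {}
    for r in rows:
        # Rows whose largest possible key is still below r's smallest key
        # precede r in every possible world.
        certainly_before = sum(1 for o in rows if o is not r and o.hi < r.lo)
        # Rows whose smallest possible key does not exceed r's largest key
        # may precede r in some possible world (ties broken either way).
        possibly_before = sum(1 for o in rows if o is not r and o.lo <= r.hi)
        bounds[r.name] = (certainly_before, possibly_before)
    return bounds


def top_k(rows, k):
    """Under-approximate certain and over-approximate possible top-k members."""
    bounds = position_bounds(rows)
    certain = {n for n, (_, max_pos) in bounds.items() if max_pos < k}
    possible = {n for n, (min_pos, _) in bounds.items() if min_pos < k}
    return certain, possible


rows = [
    UncertainRow("a", 1.0, 2.0),
    UncertainRow("b", 1.5, 4.0),
    UncertainRow("c", 3.0, 5.0),
    UncertainRow("d", 6.0, 7.0),
]
certain, possible = top_k(rows, 2)
print(sorted(certain), sorted(possible))
```

In this example, row a is in the top 2 under every assignment of keys, while b and c each make the top 2 only under some assignments, so they are reported as possible but not certain. The quadratic pairwise comparison is chosen for clarity, not efficiency.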

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 6
February 2023
393 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
