research-article

Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data

Authors:

Oliver Kennedy

Proceedings of the VLDB Endowment, Volume 16, Issue 6

Pages 1346 - 1358

https://doi.org/10.14778/3583140.3583151

Published: 01 February 2023

Abstract

Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is, to the best of our knowledge, the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors and often produces more accurate results.
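To make the idea of under- and over-approximating a top-k result concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes each row's sort attribute is known only as an interval (lower bound, upper bound), and that rows are sorted in ascending order. The names `position_bounds` and `top_k_bounds` and the sample data are hypothetical and chosen for illustration only.

```python
from typing import Dict, Set, Tuple

# Assumption: uncertainty is modeled as an interval (lower, upper) on the sort attribute.
Interval = Tuple[float, float]


def position_bounds(rows: Dict[str, Interval]) -> Dict[str, Tuple[int, int]]:
    """Bound each row's 0-based position in an ascending sort over all possible worlds."""
    bounds = {}
    for name, (lo, hi) in rows.items():
        # Rows whose largest possible value is below our smallest value always precede us.
        certainly_before = sum(
            1 for other, (_, o_hi) in rows.items() if other != name and o_hi < lo
        )
        # Rows whose smallest possible value does not exceed our largest value may
        # precede us (ties are counted conservatively).
        possibly_before = sum(
            1 for other, (o_lo, _) in rows.items() if other != name and o_lo <= hi
        )
        bounds[name] = (certainly_before, possibly_before)
    return bounds


def top_k_bounds(rows: Dict[str, Interval], k: int) -> Tuple[Set[str], Set[str]]:
    """Return (under-approximation, over-approximation) of the top-k answer."""
    bounds = position_bounds(rows)
    # A row is certainly in the top-k if even its worst position is below k.
    certain = {name for name, (_, worst) in bounds.items() if worst < k}
    # A row is possibly in the top-k if its best position is below k.
    possible = {name for name, (best, _) in bounds.items() if best < k}
    return certain, possible


# Hypothetical sample data: four rows with interval-valued sort keys.
rows = {"a": (1, 2), "b": (3, 5), "c": (4, 6), "d": (7, 8)}
certain, possible = top_k_bounds(rows, k=2)
print(sorted(certain), sorted(possible))
```

In this sketch the under-approximation contains only rows that appear in the top-k in every possible world, while the over-approximation contains every row that appears in at least one. Both sets are computed by pairwise comparison of bounds, which is quadratic in the number of rows; an efficient operator would need a smarter strategy.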

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 6

February 2023

393 pages

ISSN: 2150-8097

Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)

Publisher

VLDB Endowment

Badges

Artifacts Available / v1.1
