research-article

Open access

Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds

Authors:

Oliver A. Kennedy

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 528 - 540

https://doi.org/10.1145/3448016.3452791

Published: 18 June 2021

Abstract

Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
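
The idea of annotating attribute values with bounds can be illustrated with a small sketch. This is not the paper's implementation: it assumes, for illustration only, that an uncertain value is stored as a triple of lower bound, best guess, and upper bound, and that bounds are propagated with ordinary interval arithmetic. The names `RangeValue` and `bounded_sum` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeValue:
    """Hypothetical uncertain attribute value: lower bound, best guess, upper bound."""
    lo: float
    guess: float
    hi: float

    def __post_init__(self):
        # The best guess must lie inside the bounds.
        if not (self.lo <= self.guess <= self.hi):
            raise ValueError("expected lo <= guess <= hi")

    def __add__(self, other):
        # Interval addition: bounds add component-wise.
        return RangeValue(self.lo + other.lo,
                          self.guess + other.guess,
                          self.hi + other.hi)

    def __mul__(self, other):
        # Interval multiplication: the extremes are among the corner products.
        corners = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return RangeValue(min(corners), self.guess * other.guess, max(corners))


def bounded_sum(values):
    """Sum aggregate whose result bounds every possible true sum (assumed semantics)."""
    total = RangeValue(0, 0, 0)
    for value in values:
        total = total + value
    return total


prices = [RangeValue(10, 12, 15), RangeValue(5, 5, 5)]
total = bounded_sum(prices)
product = RangeValue(-2, 1, 3) * RangeValue(4, 5, 6)
```

In this sketch a certain value is simply a triple whose three components coincide, so certain and uncertain data flow through the same operators. The bounds may be looser than the tightest possible ones, which is one way to picture the trade of accuracy for performance that the abstract mentions.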

Supplementary Material

Read me (3448016.3452791_readme.pdf), 65.18 KB

Source Code (3448016.3452791_source_code.zip), 174.68 MB

MP4 File (3448016.3452791.mp4), 295.61 MB

Cited By

Feng S, Glavic B, Kennedy O (2023). Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data. Proceedings of the VLDB Endowment 16:6 (1346-1358). DOI: 10.14778/3583140.3583151. Online publication date: 1-Feb-2023
https://dl.acm.org/doi/10.14778/3583140.3583151
Simon E, Amann B, Liu R, Gançarski S (2023). Controlling the Correctness of Aggregation Operations During Sessions of Interactive Analytic Queries. Journal of Data and Information Quality 15:2 (1-41). DOI: 10.1145/3575812. Online publication date: 23-Jun-2023
https://dl.acm.org/doi/10.1145/3575812
Abelló A, Cheney J (2023). Eris: efficiently measuring discord in multidimensional sources. The VLDB Journal: The International Journal on Very Large Data Bases 33:2 (399-423). DOI: 10.1007/s00778-023-00810-3. Online publication date: 20-Sep-2023
https://dl.acm.org/doi/10.1007/s00778-023-00810-3

Index Terms

Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

June 2021

2969 pages

ISBN: 9781450383431

DOI: 10.1145/3448016

General Chairs: Guoliang Li, Tsinghua University (China); Zhanhuai Li, Northwestern Polytechnical University (China)

Program Chairs: Stratos Idreos, Harvard University (USA); Divesh Srivastava, AT&T (USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF

Conference

SIGMOD/PODS '21

Sponsor:

SIGMOD

SIGMOD/PODS '21: International Conference on Management of Data

June 20 - 25, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
