DOI: 10.1145/3448016.3452791
research-article
Open access

Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds

Published: 18 June 2021

Abstract

Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
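
To make the idea of attribute-level bounds concrete, the following is a minimal sketch of how bounds on individual values could be propagated through arithmetic and a SUM aggregate. It is plain interval arithmetic and not the paper's evaluation semantics; the `Bounded` type and the `bounded_sum` helper are hypothetical names introduced only for illustration, and tuple-level annotations are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounded:
    """Hypothetical type: an attribute value known only to lie in [lo, hi]."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    def __add__(self, other: "Bounded") -> "Bounded":
        # The smallest possible sum adds the lower bounds, the largest the upper bounds.
        return Bounded(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Bounded") -> "Bounded":
        # With possibly negative values, the extremes occur at one of the four corners.
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Bounded(min(products), max(products))


def bounded_sum(values):
    """Hypothetical helper: SUM over bounded values, returning bounds on the result."""
    total = Bounded(0, 0)
    for v in values:
        total = total + v
    return total


# Three uncertain prices; the middle one is known exactly.
prices = [Bounded(10, 12), Bounded(5, 5), Bounded(-1, 3)]
print(bounded_sum(prices))  # Bounded(lo=14, hi=20)
```

The result is itself a bounded value, so it can feed into further operators. This mirrors, in a very simplified way, what closure under relational algebra with aggregation means: the output of a query stays in the same bounded representation as its input.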

Supplementary Material

Read me (3448016.3452791_readme.pdf)
Source Code (3448016.3452791_source_code.zip)
MP4 File (3448016.3452791.mp4)

Cited By

  • (2023) Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data. Proceedings of the VLDB Endowment 16:6 (1346-1358). DOI: 10.14778/3583140.3583151. Online publication date: 1-Feb-2023
  • (2023) Controlling the Correctness of Aggregation Operations During Sessions of Interactive Analytic Queries. Journal of Data and Information Quality 15:2 (1-41). DOI: 10.1145/3575812. Online publication date: 23-Jun-2023
  • (2023) Eris: efficiently measuring discord in multidimensional sources. The VLDB Journal — The International Journal on Very Large Data Bases 33:2 (399-423). DOI: 10.1007/s00778-023-00810-3. Online publication date: 20-Sep-2023

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. aggregation
  2. annotations
  3. incomplete databases
  4. uncertainty

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

SIGMOD/PODS '21
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 191
  • Downloads (Last 6 weeks): 24
Reflects downloads up to 03 Oct 2024
