research-article

An epistemic approach to model uncertainty in data-graphs

Published: 01 September 2023

Abstract

Graph databases are becoming widely successful as data models that make it possible to effectively represent and process complex relationships among various types of data. Data-graphs are a particular type of graph database whose representation allows data values, both on the paths and in the nodes, to be treated as first-class citizens by the query language. As with any other type of data repository, data-graphs may suffer from errors and discrepancies with respect to the real-world data they are intended to represent. In this work, we explore the notion of probabilistic unclean data-graphs, which captures the idea that the observed (unclean) data-graph is actually a noisy version of a clean one that correctly models the world but that we know only partially. Since the factors that lead to such a state of affairs may be many (e.g., all kinds of clerical errors or unintended transformations of the data) and depend heavily on the application domain, we assume an epistemic probabilistic model that describes the distribution over all possible ways in which the clean (uncertain) data-graph could have been polluted. Based on this model we define two computational problems, data cleaning and probabilistic query answering, and study the complexity of each when the polluting transformation of the data-graph may be caused by removing (subset), adding (superset), or modifying (update) nodes and edges. For data cleaning, we also explore restricted versions in which the transformation only involves updating data values on the nodes. Finally, we look at some implications of incorporating hard and soft constraints into our framework.

Cited By

  • (2023) Data Graphs with Incomplete Information (and a Way to Complete Them), Logics in Artificial Intelligence, 10.1007/978-3-031-43619-2_49, pp. 729-744. Online publication date: 20-Sep-2023

Published In

International Journal of Approximate Reasoning, Volume 160, Issue C
Sep 2023
406 pages

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. Data-graphs
  2. Consistent query answering
  3. Probabilistic query answering
  4. Constraints
  5. Inconsistent databases
  6. Repairing

Qualifiers

  • Research-article
