DOI: 10.1145/3448016.3452760
short-paper

INCA: Inconsistency-Aware Data Profiling and Querying

Published: 18 June 2021

Abstract

When exploring and querying inconsistent data, inconsistency measures based on constraint violations help the user quantify the quality of the underlying data and of query results. We showcase INCA, a system that lets the user perform data profiling and query answering tasks in an inconsistency-aware fashion. Using data instances annotated with novel inconsistency measures based on why-provenance and polynomial provenance, INCA visualizes the share of the data that is consistent or inconsistent with respect to one or more denial constraints. Furthermore, data exploration by constraint or by subset of constraints allows the user to inspect tuple violations according to multifaceted criteria. Finally, query profiling produces inconsistency-aware query results, returning the most consistent or most inconsistent answers to top-k and threshold queries. To the best of our knowledge, INCA is the first system to allow such an inconsistency-driven analysis of both data and query results. Such an analysis is especially useful for enabling selective constraint-based data cleaning and inconsistency-aware ranking of query results in data science pipelines, thus leading to more explainable outputs of those processes.
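
To make the idea of tuple-level inconsistency annotations concrete, the following is a minimal sketch and not INCA's actual implementation. It assumes a hypothetical toy relation and two hypothetical denial constraints (no two tuples may share a `zip` with different `city` values; no tuple may have a negative `age`). Each tuple is annotated with the set of constraints it violates (loosely in the spirit of why-provenance) and a plain violation count (a simplified stand-in for a polynomial-style measure). All names, including `annotate`, `top_k_consistent` and `within_threshold`, are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy relation; every tuple carries a unique "id".
rows = [
    {"id": 1, "zip": "69001", "city": "Lyon", "age": 34},
    {"id": 2, "zip": "69001", "city": "Paris", "age": 51},
    {"id": 3, "zip": "75001", "city": "Paris", "age": -4},
    {"id": 4, "zip": "63000", "city": "Clermont-Ferrand", "age": 28},
]

# Assumed denial constraints, written as predicates that return True
# exactly when the inspected tuples violate the constraint.
def violates_zip_city(t, u):
    return t["zip"] == u["zip"] and t["city"] != u["city"]

def violates_age(t):
    return t["age"] < 0

def annotate(rows):
    """Annotate each tuple id with the violated constraints and a violation count."""
    ann = {r["id"]: {"constraints": set(), "count": 0} for r in rows}
    for t in rows:  # single-tuple constraint
        if violates_age(t):
            ann[t["id"]]["constraints"].add("age")
            ann[t["id"]]["count"] += 1
    for t, u in combinations(rows, 2):  # two-tuple constraint
        if violates_zip_city(t, u):
            for r in (t, u):
                ann[r["id"]]["constraints"].add("zip_city")
                ann[r["id"]]["count"] += 1
    return ann

def consistent_share(ann):
    """Fraction of tuples that take part in no violation."""
    return sum(1 for a in ann.values() if a["count"] == 0) / len(ann)

def top_k_consistent(ann, k):
    """Ids of the k tuples with the fewest violations (ties broken by id)."""
    return sorted(ann, key=lambda i: (ann[i]["count"], i))[:k]

def within_threshold(ann, tau):
    """Ids of tuples whose violation count does not exceed tau."""
    return [i for i in sorted(ann) if ann[i]["count"] <= tau]

ann = annotate(rows)
print(consistent_share(ann))      # share of fully consistent tuples
print(top_k_consistent(ann, 2))   # two most consistent tuples
print(within_threshold(ann, 0))   # tuples with zero violations
```

In this sketch the annotations are computed once over the base data; a real system would additionally propagate them through query operators so that each query answer inherits an inconsistency degree from the tuples that produced it.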

Supplementary Material

MP4 File (3448016.3452760.mp4)

Cited By

  • (2024) On Measuring Inconsistency in Graph Databases with Regular Path Constraints. Artificial Intelligence (104197). DOI: 10.1016/j.artint.2024.104197. Online publication date: Aug-2024
  • (2024) Enhancing data preparation: insights from a time series case study. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-024-00867-8. Online publication date: 25-Jul-2024
  • (2023) Relative inconsistency measures for indefinite databases with denial constraints. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (3321-3329). DOI: 10.24963/ijcai.2023/370. Online publication date: 19-Aug-2023

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data inconsistency
  2. data quality
  3. semiring provenance
  4. top-k query processing
  5. why provenance

Qualifiers

  • Short-paper

Funding Sources

  • ANR (grant nr. 18-CE23-0002 QualiHealth)

Conference

SIGMOD/PODS '21
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 18
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 07 Nov 2024
