research-article

QOCO: a query oriented data cleaning system with oracles

Published: 01 August 2015

Abstract

As key decisions are often made based on information contained in a database, it is important for the database to be as complete and correct as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, data cleaning tools provide only best-effort results and usually cannot eradicate all errors that may exist in a database. Even more importantly, existing data cleaning tools do not typically address the problem of determining what information is missing from a database.
To tackle these problems, we present QOCO, a novel query oriented cleaning system that leverages materialized views that are defined by user queries as a trigger for identifying the remaining incorrect/missing information. Given a user query, QOCO interacts with domain experts (which we model as oracle crowds) to identify potentially wrong or missing answers in the result of the user query, as well as determine and correct the wrong data that is the cause for the error(s). We will demonstrate QOCO over a World Cup Games database, and illustrate the interaction between QOCO and the oracles. Our demo audience will play the role of oracles, and we show how QOCO's underlying operations and optimization mechanisms can effectively prune the search space and minimize the number of questions that need to be posed to accelerate the cleaning process.
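The interaction described above can be pictured with a small sketch. This is not QOCO's actual algorithm, which the abstract does not spell out; it is a minimal illustration under stated assumptions. The schema, the query `european_winners`, the function `clean`, and the two oracle callbacks `answer_ok` and `fact_ok` are all hypothetical names introduced here, and the oracle is assumed to always answer correctly. The sketch only removes wrong data; finding missing answers is not covered. It shows one way the number of questions might be reduced: a correct answer needs no further questions, and for a wrong answer produced by joining two facts, learning that one fact is correct implies the other is wrong without asking.

```python
from typing import Callable, Dict, List, Set, Tuple

Fact = Tuple[str, ...]        # ("game", year, winner) or ("team", name, continent)
Witness = Tuple[Fact, Fact]   # the pair of facts that joined to produce an answer


def european_winners(db: Set[Fact]) -> Dict[str, List[Witness]]:
    """Toy user query: teams that won a final and are recorded as European."""
    result: Dict[str, List[Witness]] = {}
    for game in sorted(db):
        if game[0] != "game":
            continue
        team = ("team", game[2], "Europe")
        if team in db:
            result.setdefault(game[2], []).append((game, team))
    return result


def clean(
    db: Set[Fact],
    answer_ok: Callable[[str], bool],   # hypothetical oracle: is this answer correct?
    fact_ok: Callable[[Fact], bool],    # hypothetical oracle: is this stored fact correct?
) -> Tuple[Set[Fact], int]:
    """Delete facts behind wrong answers; return the cleaned copy and questions asked."""
    cleaned = set(db)
    questions = 0
    for answer, witnesses in european_winners(db).items():
        questions += 1
        if answer_ok(answer):
            continue  # correct answer: its witnesses are not inspected
        for game, team in witnesses:
            if game not in cleaned or team not in cleaned:
                continue  # witness already broken by an earlier deletion
            questions += 1
            if fact_ok(game):
                # Answer is wrong but the game fact is right, so the team fact
                # must be wrong; no second question is needed.
                cleaned.discard(team)
            else:
                cleaned.discard(game)
    return cleaned, questions


# Illustrative data only: a small "true" database and a dirty copy with two errors.
TRUE_FACTS: Set[Fact] = {
    ("game", "1998", "France"), ("game", "2002", "Brazil"), ("game", "2006", "Italy"),
    ("team", "France", "Europe"), ("team", "Brazil", "South America"),
    ("team", "Italy", "Europe"), ("team", "Netherlands", "Europe"),
}
DIRTY = TRUE_FACTS | {("team", "Brazil", "Europe"), ("game", "2010", "Netherlands")}

cleaned, asked = clean(
    DIRTY,
    answer_ok=lambda a: a in european_winners(TRUE_FACTS),
    fact_ok=lambda f: f in TRUE_FACTS,
)
print(sorted(european_winners(cleaned)), asked)
```

In this toy run the oracle is simulated by looking answers up in `TRUE_FACTS`; in the demo described above, that role is played by the audience.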

Published In

Proceedings of the VLDB Endowment  Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097
Editors: Chen Li, Volker Markl

Publisher

VLDB Endowment

Cited By

  • (2022) Approximation and inapproximability results on computing optimal repairs. The VLDB Journal: The International Journal on Very Large Data Bases 32:1 (173-197). DOI: 10.1007/s00778-022-00738-0. Online publication date: 12-Apr-2022
  • (2020) The computation of optimal subset repairs. Proceedings of the VLDB Endowment 13:12 (2061-2074). DOI: 10.14778/3407790.3407809. Online publication date: 14-Sep-2020
  • (2020) Interrogating Data Science. Companion Publication of the 2020 Conference on Computer Supported Cooperative Work and Social Computing (467-473). DOI: 10.1145/3406865.3418584. Online publication date: 17-Oct-2020
  • (2020) Computing Optimal Repairs for Functional Dependencies. ACM Transactions on Database Systems 45:1 (1-46). DOI: 10.1145/3360904. Online publication date: 17-Feb-2020
  • (2019) Learning How to Correct a Knowledge Base from the Edit History. The World Wide Web Conference (1465-1475). DOI: 10.1145/3308558.3313584. Online publication date: 13-May-2019
  • (2018) Computing Optimal Repairs for Functional Dependencies. Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (225-237). DOI: 10.1145/3196959.3196980. Online publication date: 27-May-2018
  • (2016) Query-driven repairing of inconsistent DL-Lite knowledge bases. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (957-964). DOI: 10.5555/3060621.3060754. Online publication date: 9-Jul-2016
