research-article

Combining quantitative and logical data cleaning

Published: 01 December 2015

Abstract

Quantitative data cleaning relies on the use of statistical methods to identify and repair data quality problems while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has desired statistical properties. We propose a novel framework within which these two approaches can be used synergistically to combine their respective strengths.
We instantiate our framework using (i) metric functional dependencies, a type of dependency that generalizes functional dependencies (FDs) to identify inconsistencies in domains where only large differences in metric data are considered to be a data quality problem, and (ii) repairs that modify the inconsistent data so as to minimize statistical distortion, measured using the Earth Mover's Distance. We show that the problem of computing a statistical distortion minimal repair is NP-hard. Given this complexity, we present an efficient algorithm for finding a minimal repair that has a small statistical distortion using EMD computation over semantically related attributes. To identify semantically related attributes, we present a sound and complete axiomatization and an efficient algorithm for testing implication of metric FDs. While the complexity of inference for some other FD extensions is co-NP complete, we show that the inference problem for metric FDs remains linear, as in traditional FDs. We prove that every instance that can be generated by our repair algorithm is set-minimal (with no unnecessary changes). Our experimental evaluation demonstrates that our techniques obtain a considerably lower statistical distortion than existing repair techniques, while achieving similar levels of efficiency.
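
The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's repair algorithm. It assumes a single numeric right-hand-side attribute with absolute difference as the metric, and it computes the Earth Mover's Distance only for the simple case of two equal-size one-dimensional samples. The function names (`metric_fd_violations`, `emd_1d`), the threshold `delta`, and the sample rows are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def metric_fd_violations(rows, lhs, rhs, delta):
    """Group rows by the LHS attributes and report groups whose RHS values
    differ by more than delta (assumption: numeric RHS, metric = |a - b|)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row[rhs])
    # For a 1-D metric, the largest pairwise distance in a group is max - min.
    return {key: vals for key, vals in groups.items()
            if max(vals) - min(vals) > delta}


def emd_1d(xs, ys):
    """Earth Mover's Distance between two equal-size 1-D samples, each point
    carrying mass 1/n: the mean absolute difference of the sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up rows: durations for the same flight should agree within delta.
rows = [
    {"flight": "AA10", "duration": 180},
    {"flight": "AA10", "duration": 183},
    {"flight": "UA22", "duration": 95},
    {"flight": "UA22", "duration": 140},
]

# Metric FD flight -> duration with delta = 5 minutes.
violations = metric_fd_violations(rows, ["flight"], "duration", delta=5)

# One candidate repair (hypothetical): overwrite 140 with 95.
original = [row["duration"] for row in rows]
repaired = [180, 183, 95, 95]
distortion = emd_1d(original, repaired)

print(violations)   # groups that violate the metric FD
print(distortion)   # statistical distortion of the candidate repair
```

Under these assumptions, the small difference within `AA10` is tolerated while `UA22` is flagged, and `emd_1d` gives one way to compare candidate repairs by how far they shift the distribution of `duration`.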

Published In

Proceedings of the VLDB Endowment  Volume 9, Issue 4
December 2015
156 pages
ISSN:2150-8097
Editors: Surajit Chaudhuri, Jayant Haritsa

Publisher

VLDB Endowment

Published in PVLDB Volume 9, Issue 4
