Improving data quality by source analysis

Authors:

Heiko Müller,

Johann-Christoph Freytag,

Ulf Leser

Journal of Data and Information Quality (JDIQ), Volume 2, Issue 4

Article No.: 15, Pages 1 - 38

https://doi.org/10.1145/2107536.2107538

Published: 02 March 2012

Abstract

In many domains, data cleaning is hampered by our limited ability to specify a comprehensive set of integrity constraints to assist in the identification of erroneous data. An alternative approach to improving data quality is to exploit different data sources that contain information about the same set of objects. Such overlapping sources highlight hot-spots of poor data quality through conflicting data values and immediately provide alternative values for conflict resolution. In order to derive a dataset of high quality, we can merge the overlapping sources based on a quality assessment of the conflicting values. The quality of the resulting dataset, however, is highly dependent on our ability to assess the quality of conflicting values effectively.

The main objective of this article is to introduce methods that aid the developer of an integrated system over overlapping but contradicting sources in the task of improving the quality of data. Value conflicts between contradicting sources are often systematic, caused by some characteristic of the different sources. Our goal is to identify such systematic differences and outline data patterns that occur in conjunction with them. Evaluated by an expert user, the regularities discovered provide insights into possible conflict reasons and help to assess the quality of inconsistent values. The contributions of this article are two concepts of systematic conflicts: contradiction patterns and minimal update sequences. Contradiction patterns resemble a special form of association rules that summarize characteristic data properties for conflict occurrence. We adapt existing association rule mining algorithms for mining contradiction patterns. Contradiction patterns, however, view each class of conflicts in isolation, sometimes leading to largely overlapping patterns. Sequences of set-oriented update operations that transform one data source into the other are compact descriptions for all regular differences among the sources. We consider minimal update sequences as the most likely explanation for observed differences between overlapping data sources. Furthermore, the order of operations within the sequences points out potential dependencies between systematic differences. Finding minimal update sequences, however, is beyond reach in practice. We show that the problem is already NP-complete for a restricted set of operations. In light of this intractability result, we present heuristics that lead to convincing results for all examples we considered.
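
To make the idea of a contradiction pattern more concrete, the following is a minimal sketch, not the algorithm from the article. It assumes two overlapping sources represented as dictionaries keyed by object identifier, and it considers only single-condition patterns of the form "attribute = value", scored with the usual association-rule notions of support and confidence. The function names, thresholds, and sample records are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def find_conflicts(source_a, source_b, attribute):
    """Ids of objects in both sources whose values for `attribute` differ."""
    shared = source_a.keys() & source_b.keys()
    return {k for k in shared if source_a[k][attribute] != source_b[k][attribute]}


def contradiction_patterns(source_a, source_b, attribute,
                           min_support=2, min_confidence=0.8):
    """Single-condition patterns (attr, value) that co-occur with conflicts.

    Illustrative simplification: conditions are read from source_a only.
    support    = number of conflicting objects that satisfy the condition
    confidence = support / number of shared objects that satisfy the condition
    """
    shared = source_a.keys() & source_b.keys()
    conflicts = find_conflicts(source_a, source_b, attribute)
    covered = Counter()      # condition -> shared objects satisfying it
    conflicting = Counter()  # condition -> conflicting objects satisfying it
    for key in shared:
        for attr, value in source_a[key].items():
            if attr == attribute:
                continue  # the conflicting attribute itself is not a condition
            covered[(attr, value)] += 1
            if key in conflicts:
                conflicting[(attr, value)] += 1
    patterns = [
        (cond, support, support / covered[cond])
        for cond, support in conflicting.items()
        if support >= min_support and support / covered[cond] >= min_confidence
    ]
    # Highest support first, then highest confidence, then a stable tie-break.
    return sorted(patterns, key=lambda p: (-p[1], -p[2], repr(p[0])))


# Hypothetical sample data: the two sources disagree on `length`
# exactly for the records that originate from lab "X".
source_a = {
    1: {"lab": "X", "year": 2001, "length": 10},
    2: {"lab": "X", "year": 2002, "length": 20},
    3: {"lab": "Y", "year": 2001, "length": 30},
    4: {"lab": "Y", "year": 2002, "length": 40},
    5: {"lab": "X", "year": 2001, "length": 50},
}
source_b = {
    1: {"lab": "X", "year": 2001, "length": 11},
    2: {"lab": "X", "year": 2002, "length": 21},
    3: {"lab": "Y", "year": 2001, "length": 30},
    4: {"lab": "Y", "year": 2002, "length": 40},
    5: {"lab": "X", "year": 2001, "length": 51},
}

for condition, support, confidence in contradiction_patterns(source_a, source_b, "length"):
    print(condition, support, round(confidence, 2))
```

In this toy setting the only pattern that passes both thresholds is `lab = "X"`, which an expert could then inspect as a candidate cause of the systematic difference. Multi-condition patterns and update sequences, which the article also addresses, are outside the scope of this sketch.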
