research-article

Clustering Heterogeneous Data Values for Data Quality Analysis

Authors:

Viola Wenz,

Arno Kesper,

Gabriele TaentzerAuthors Info & Claims

ACM Journal of Data and Information Quality, Volume 15, Issue 3

Article No.: 28, Pages 1 - 33

https://doi.org/10.1145/3603710

Published: 22 August 2023 Publication History

Get Access

Abstract

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.

References

[1]

Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Peter Buneman and Sushil Jajodia (Eds.). ACM Press, 207–216.

Abstract

References

Cited By

Index Terms

Recommendations

Automated Detection and Monitoring of Advanced Data Quality Rules

Towards Data Quality into the Data Warehouse Development

Data quality through model checking techniques

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Get Access

Login options

Full Access

View options

PDF

eReader

Full Text

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations