research-article

Clustering Heterogeneous Data Values for Data Quality Analysis

Published: 22 August 2023

Abstract

Data is of high quality if it is fit for its intended purpose. Data heterogeneity can be a major quality problem, as quality aspects such as understandability and consistency can be compromised. Heterogeneity of data values is particularly common when data is manually entered by different people using inadequate control rules. In this case, syntactic and semantic heterogeneity often go hand in hand. Heterogeneity of data values may be a direct result of problems in the acquisition process, quality problems of the underlying data model, or possibly erroneous data transformations. For example, in the cultural heritage domain, it is common to analyze data fields by manually searching lists of data values sorted alphabetically or by number of occurrences. Additionally, search functions such as regular expression matching are used to detect specific patterns. However, this requires a priori knowledge and technical skills that domain experts often do not have. Since such datasets often contain thousands of values, the entire process is very time-consuming. Outliers or subtle differences between values that may be critical to data quality can be easily overlooked. To improve this process of analyzing the quality of data values, we propose a bottom-up human-in-the-loop approach that clusters values of a data field according to syntactic similarity. The clustering is intended to help domain experts explore the heterogeneity of values in a data field and can be configured by domain experts according to their domain knowledge. The overview of the syntactic diversity of the data values gives an impression of the rules and practices of data acquisition as well as their violations. From this, experts can infer potential quality issues with the data acquisition process and system, as well as the data model and data transformations. We outline a proof-of-concept implementation of the approach. 
Our evaluation found that clustering adds value to data quality analysis, especially for detecting quality problems in data models.
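
To make the idea of clustering values by syntactic similarity more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical abstraction step (`abstract_value`) that collapses runs of letters and digits into placeholder symbols, compares the resulting patterns with the Levenshtein edit distance, and groups patterns whose distance stays within a threshold that the user chooses (`max_distance`). The function names, the abstraction rules, and the single-linkage grouping are illustrative assumptions only.

```python
from itertools import combinations
import re


def abstract_value(value: str) -> str:
    """Map a raw value to a coarse syntactic pattern (hypothetical rules)."""
    pattern = re.sub(r"[A-Za-z]+", "a", value)  # run of ASCII letters -> "a"
    pattern = re.sub(r"[0-9]+", "9", pattern)   # run of digits -> "9"
    return pattern


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1]


def cluster_values(values, max_distance=1):
    """Group values whose abstracted patterns are linked by small edit distances.

    Assumption: simple single-linkage grouping via union-find; the threshold
    stands in for the configuration a domain expert would provide.
    """
    patterns = sorted({abstract_value(v) for v in values})
    parent = {p: p for p in patterns}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(patterns, 2):
        if levenshtein(p, q) <= max_distance:
            parent[find(p)] = find(q)

    clusters = {}
    for v in values:
        clusters.setdefault(find(abstract_value(v)), []).append(v)
    return list(clusters.values())


if __name__ == "__main__":
    dates = ["1850", "1851", "ca. 1850", "um 1850", "1850-1860", "1850 - 1860"]
    for group in cluster_values(dates, max_distance=1):
        print(group)
```

In this sketch, "ca. 1850" and "um 1850" end up in the same cluster at `max_distance=1` because their patterns `a. 9` and `a 9` differ by a single character, whereas `max_distance=0` only groups values with identical patterns. Raising the threshold merges more clusters, which is the kind of knob an expert might adjust while exploring a data field.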

Information

Published In

Journal of Data and Information Quality  Volume 15, Issue 3
September 2023
326 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3611329

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 August 2023
Online AM: 22 June 2023
Accepted: 22 April 2023
Revised: 05 March 2023
Received: 12 July 2022
Published in JDIQ Volume 15, Issue 3

Author Tags

  1. Data quality
  2. clustering
  3. data heterogeneity
  4. data analysis
  5. value abstraction
  6. semi-structured data

Qualifiers

  • Research-article

Cited By

  • (2024) Integration Approaches for Heterogeneous Big Data: A Survey. Cybernetics and Information Technologies 24:1 (3-20). DOI: 10.2478/cait-2024-0001. Online publication date: 23-Mar-2024
  • (2024) AI-Powered Data Governance: A Cutting-Edge Method for Ensuring Data Quality for Machine Learning Applications. 2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE) (1-6). DOI: 10.1109/ic-ETITE58242.2024.10493601. Online publication date: 22-Feb-2024
  • (2024) Current Challenges of Big Data Quality Management in Big Data Governance: A Literature Review. Advances in Intelligent Computing Techniques and Applications (160-172). DOI: 10.1007/978-3-031-59711-4_15. Online publication date: 30-Jun-2024
  • (2024) Putting Sense into Incomplete Heterogeneous Data with Hypergraph Clustering Analysis. Advances in Intelligent Data Analysis XXII (119-130). DOI: 10.1007/978-3-031-58553-1_10. Online publication date: 16-Apr-2024
