DOI: 10.1145/3661167.3661213
Research-article | Open access
Data Quality Assessment in the Wild: Findings from GitHub

Published: 18 June 2024

Abstract

Data quality is critical for making data-driven business decisions reliable, accurate, and complete. As a result, data quality assessment has become a prevalent topic among researchers and practitioners. Previous studies have surveyed data quality tools and applied them in industrial projects, but practitioners' adoption of these tools remains largely unexplored. In this study, we systematically selected five widely used data quality tools and analyzed 498 GitHub repositories that use them. Our findings show that practitioners increasingly use data quality tools to assess and improve the quality of their data. The most common use case is software development (49.2%), followed by learning (25.5%), teaching (9.2%), and research (0.4%). However, most repositories (69%) showed little activity and collaboration. Development projects, especially those with industry-based owners, were the most active. The tools were predominantly used to assess the completeness of the data, followed by uniqueness, validity, and referential integrity.
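The abstract does not name the five tools, but the four quality dimensions it mentions (completeness, uniqueness, validity, referential integrity) can be sketched in plain Python. This is an illustrative example with hypothetical data and hand-rolled checks, not the study's tooling:

```python
import re

# Toy customer records with deliberate quality issues (hypothetical data,
# not taken from the study).
rows = [
    {"customer_id": 1, "email": "a@x.com",      "country_code": "US"},
    {"customer_id": 2, "email": None,           "country_code": "DE"},
    {"customer_id": 2, "email": "c@x.com",      "country_code": "US"},  # duplicate id
    {"customer_id": 4, "email": "not-an-email", "country_code": "ZZ"},  # invalid values
]
valid_countries = {"US", "DE", "FR"}  # reference set for integrity checks

# Completeness: share of non-null email values.
completeness = sum(r["email"] is not None for r in rows) / len(rows)

# Uniqueness: the key column must not contain duplicates.
ids = [r["customer_id"] for r in rows]
unique_key = len(ids) == len(set(ids))

# Validity: emails must match a simple pattern (nulls count as invalid here).
email_re = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
invalid_emails = sum(
    r["email"] is None or not email_re.fullmatch(r["email"]) for r in rows
)

# Referential integrity: every country code must exist in the reference set.
broken_refs = sum(r["country_code"] not in valid_countries for r in rows)

print(f"completeness(email) = {completeness:.2f}")   # 0.75
print(f"customer_id unique  = {unique_key}")         # False
print(f"invalid emails      = {invalid_emails}")     # 2
print(f"broken references   = {broken_refs}")        # 1
```

Real data quality tools express such checks declaratively and report them across whole datasets, but the underlying dimensions are the same as in this sketch.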



Published In

EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
June 2024
728 pages
ISBN:9798400717017
DOI:10.1145/3661167
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Data Quality Assessment
  2. Data Quality Tools
  3. GitHub
  4. Mining Software Repositories

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EASE 2024

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%

