DOI: 10.1145/3661167.3661213
Research-article | Open access
Data Quality Assessment in the Wild: Findings from GitHub

Published: 18 June 2024

Abstract

Data quality is critical for making data-driven business decisions reliable, accurate, and complete. As a result, data quality assessment has become a prevalent topic among researchers and practitioners. Previous studies have surveyed data quality tools and applied them in industrial projects, but practitioners' adoption of these tools remains largely unexplored. In this study, we systematically selected five widely used data quality tools and analyzed 498 GitHub repositories that use them. Our findings show that practitioners increasingly use data quality tools to assess and improve the quality of their data. The most common use case is software development (49.2%), followed by learning (25.5%), teaching (9.2%), and research (0.4%). However, most repositories (69%) showed little activity and collaboration. Development projects, especially those with industry-based owners, were the most active. The tools were predominantly used to assess the completeness of the data, followed by uniqueness, validity, and referential integrity.
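The abstract does not name the five tools, but the four quality dimensions it mentions (completeness, uniqueness, validity, referential integrity) can be sketched in plain Python. This is an illustrative example with hypothetical data and hand-rolled checks, not the study's tooling:

```python
import re

# Toy customer records with deliberate quality issues (hypothetical data,
# not taken from the study).
rows = [
    {"customer_id": 1, "email": "a@x.com",      "country_code": "US"},
    {"customer_id": 2, "email": None,           "country_code": "DE"},
    {"customer_id": 2, "email": "c@x.com",      "country_code": "US"},  # duplicate id
    {"customer_id": 4, "email": "not-an-email", "country_code": "ZZ"},  # invalid values
]
valid_countries = {"US", "DE", "FR"}  # reference set for integrity checks

# Completeness: share of non-null email values.
completeness = sum(r["email"] is not None for r in rows) / len(rows)

# Uniqueness: the key column must not contain duplicates.
ids = [r["customer_id"] for r in rows]
unique_key = len(ids) == len(set(ids))

# Validity: emails must match a simple pattern (nulls count as invalid here).
email_re = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
invalid_emails = sum(
    r["email"] is None or not email_re.fullmatch(r["email"]) for r in rows
)

# Referential integrity: every country code must exist in the reference set.
broken_refs = sum(r["country_code"] not in valid_countries for r in rows)

print(f"completeness(email) = {completeness:.2f}")   # 0.75
print(f"customer_id unique  = {unique_key}")         # False
print(f"invalid emails      = {invalid_emails}")     # 2
print(f"broken references   = {broken_refs}")        # 1
```

Real data quality tools express such checks declaratively and report them across whole datasets, but the underlying dimensions are the same as in this sketch.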



Published In

EASE '24: Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering
June 2024
728 pages
ISBN:9798400717017
DOI:10.1145/3661167
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Data Quality Assessment
  2. Data Quality Tools
  3. GitHub
  4. Mining Software Repositories

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EASE 2024

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%

