Preliminary findings on the occurrence and causes of data smells in a real-world business travel data processing pipeline

Authors: Valentina Golendukhina, Harald Foidl, Michael Felderer, Rudolf Ramler

SEA4DQ 2022: Proceedings of the 2nd International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things

Pages 18 - 21

https://doi.org/10.1145/3549037.3561275

Published: 09 November 2022

Abstract

Detection of poor quality data is crucial for enhancing data-driven systems' quality. Although there is a lot of research on data validation, the topic of potential data quality issues is still underexplored. Such latent issues or data smells can often stay undetected and lead to the poor future performance of data-intensive systems. Detecting data smells is not trivial and requires knowledge about their causes. In this paper, we present the preliminary findings on the causes and severity of data smells based on a study of a real-world business travel data set and the data processing pipeline behind it. The results show that data smells exist in this data set and cause severe problems. Although many data smells already occur in raw data, some smells are created during the transformation and enrichment stages of the data processing pipeline. These findings indicate the importance of the data pipeline itself for future research on data smells. Thus, this article proposes potential future work in this area.
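
To make the idea of smells that arise inside the pipeline more concrete, the following is a minimal sketch and not the authors' method. It assumes that placeholder-like values (for example "N/A" or "unknown") count as one illustrative kind of suspicious value, and it compares how often they appear per field before and after a hypothetical enrichment step. The names `PLACEHOLDERS`, `count_smells`, and `smells_introduced`, as well as the sample records, are invented for illustration and do not come from the paper or its data set.

```python
from collections import Counter

# Assumption: these tokens stand in for "placeholder" values in this sketch.
PLACEHOLDERS = {"", "n/a", "na", "unknown", "none", "-"}


def count_smells(records):
    """Count placeholder-like values per field in a list of dict records."""
    counts = Counter()
    for record in records:
        for field, value in record.items():
            if value is None or str(value).strip().lower() in PLACEHOLDERS:
                counts[field] += 1
    return counts


def smells_introduced(raw, processed):
    """Return per-field increases in smell counts between two pipeline stages."""
    before, after = count_smells(raw), count_smells(processed)
    return {f: after[f] - before[f] for f in after if after[f] > before[f]}


# Illustrative records only; not taken from the paper's data set.
raw = [
    {"traveler": "A. Smith", "destination": "Vienna"},
    {"traveler": "B. Jones", "destination": "N/A"},
]
# Hypothetical enrichment step: adds a country field, one lookup fails.
processed = [
    {"traveler": "A. Smith", "destination": "Vienna", "country": "Austria"},
    {"traveler": "B. Jones", "destination": "N/A", "country": "unknown"},
]

introduced = smells_introduced(raw, processed)
```

In this toy example the `destination` placeholder is already present in the raw data, while the `country` placeholder only appears after enrichment. Comparing counts stage by stage is one simple way to separate the two cases.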

Cited By

Recupito G, Rapacciuolo R, Di Nucci D, Palomba F, Bosch J, Lewis G, Cleland-Huang J, Muccini H. (2024). Unmasking Data Secrets: An Empirical Investigation into Data Smells and Their Impact on Data Quality. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, 53-63. Online publication date: 14-Apr-2024.
https://dl.acm.org/doi/10.1145/3644815.3644960

Index Terms

Preliminary findings on the occurrence and causes of data smells in a real-world business travel data processing pipeline
1. Information systems
  1. Data management systems
    1. Information integration

Information

Published In

SEA4DQ 2022: Proceedings of the 2nd International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things

November 2022

25 pages

ISBN:9781450394598

DOI:10.1145/3549037

General Chair: Phu Nguyen (SINTEF, Norway)
Program Chairs: Sagar Sen (SINTEF, Norway), Maria Chiara Magnanini (Politecnico di Milano, Italy)

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Österreichische Forschungsförderungsgesellschaft
BMK & BMDW

Conference

SEA4DQ '22: 2nd International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things

November 17, 2022

Singapore, Singapore
