DOI: 10.1145/3549037.3561275
short-paper
Open access

Preliminary findings on the occurrence and causes of data smells in a real-world business travel data processing pipeline

Published: 09 November 2022

Abstract

Detecting poor-quality data is crucial for improving the quality of data-driven systems. Although data validation has been widely researched, potential data quality issues remain underexplored. Such latent issues, or data smells, can stay undetected and lead to poor future performance of data-intensive systems. Detecting data smells is not trivial and requires knowledge of their causes. In this paper, we present preliminary findings on the causes and severity of data smells, based on a study of a real-world business travel data set and the data processing pipeline behind it. The results show that data smells exist in this data set and cause severe problems. Although many data smells already occur in the raw data, some are created during the transformation and enrichment stages of the data processing pipeline. These findings indicate the importance of the data pipeline itself for future research on data smells. Thus, this article proposes potential future work in this area.

Cited By

  • (2024) Unmasking Data Secrets: An Empirical Investigation into Data Smells and Their Impact on Data Quality. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, 53-63. DOI: 10.1145/3644815.3644960. Online publication date: 14-Apr-2024

    Published In

    SEA4DQ 2022: Proceedings of the 2nd International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things
    November 2022
    25 pages
    ISBN: 9781450394598
    DOI: 10.1145/3549037
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Data smells
    2. data issues
    3. data pipeline

    Qualifiers

    • Short-paper

    Funding Sources

    • Österreichische Forschungsförderungsgesellschaft
    • BMK & BMDW

    Conference

    SEA4DQ '22