A Data-Driven Analysis of Behaviors in Data Curation Processes

Published: 07 February 2023

Abstract

Understanding how data workers interact with data, and with the various pieces of information involved in data preparation, is key to designing systems that better support them in exploring datasets. To date, however, there is a paucity of research on the strategies data workers adopt as they carry out data preparation activities. In this work, we investigate a specific data preparation activity, namely data quality discovery, and aim to (i) understand the behaviors of data workers in discovering data quality issues, (ii) explore what factors (e.g., prior experience) affect their behaviors, and (iii) understand how these behavioral observations relate to their performance. To this end, we collect a multi-modal dataset through a data-driven experiment that combines eye-tracking technology with a purpose-designed platform built on top of IPython Notebook. The experiment results reveal that: (i) ‘copy–paste–modify’ is a typical strategy for writing code to complete tasks; (ii) proficiency in writing code has a significant impact on the quality of task performance, while perceived difficulty and efficacy can influence task completion patterns; and (iii) searching in external resources is a prevalent action that can be leveraged to achieve better performance. Furthermore, our experiment indicates that providing sample code within the system helps data workers get started with their task, and that surfacing the underlying data is an effective way to support exploration. By investigating data worker behaviors prior to each search action, we also find that the most common triggers of external search are the need for assistance in writing or debugging code and the search for relevant code to reuse. Based on our experiment results, we showcase a systematic approach that selects the best-performing code snippets created by data workers and assembles them to outperform the best individual performer in the dataset.
Our findings thus not only provide insights into patterns of interaction with system components and information resources during data curation tasks, but also show how data workers’ collective intelligence can be harnessed to build effective and efficient data curation processes.
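The snippet-assembly idea in the abstract — combining the outputs of several workers' code to beat the best individual — can be sketched as a majority vote over the data-quality issues each worker's snippet flags. The abstract does not spell out the exact selection procedure, so the names and the voting rule below (`f1`, `assemble_top_k`, ranking workers by individual F1) are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: ensemble the top-k workers' flagged data-quality
# issues by majority vote. All names here are illustrative assumptions.
from collections import Counter

def f1(predicted, truth):
    """F1 score of a set of flagged cells against the ground truth."""
    if not predicted:
        return 0.0
    tp = len(predicted & truth)
    precision = tp / len(predicted)
    recall = tp / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def assemble_top_k(worker_flags, truth, k=3, min_votes=2):
    """Rank workers by individual F1, then keep every cell flagged by
    at least `min_votes` of the top-k workers."""
    ranked = sorted(worker_flags, key=lambda flags: f1(flags, truth), reverse=True)
    votes = Counter(cell for flags in ranked[:k] for cell in flags)
    return {cell for cell, n in votes.items() if n >= min_votes}

# Toy example: three workers flag (row, column) cells with quality issues.
truth = {(0, "age"), (3, "email"), (7, "age"), (9, "date")}
workers = [
    {(0, "age"), (3, "email"), (5, "name")},              # some false positives
    {(3, "email"), (7, "age"), (9, "date")},
    {(0, "age"), (7, "age"), (9, "date"), (2, "id")},
]
combined = assemble_top_k(workers, truth)
# The vote filters each worker's idiosyncratic false positives, so the
# ensemble scores at least as well as the best individual here.
assert f1(combined, truth) >= max(f1(w, truth) for w in workers)
```

The key design choice is that agreement across workers filters out individual false positives while keeping the issues that several independently written snippets detect, which is one plausible reading of how collective intelligence can outperform the best single performer.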


Cited By

  • (2024) On Eye Tracking in Software Engineering. SN Computer Science 5(6). DOI: 10.1007/s42979-024-03045-3. Online publication date: 26-Jul-2024.
  • (2023) Towards a Researcher-in-the-loop Driven Curation Approach for Quantitative and Qualitative Research Methods. New Trends in Database and Information Systems, 647–655. DOI: 10.1007/978-3-031-42941-5_58. Online publication date: 31-Aug-2023.


Published In

ACM Transactions on Information Systems  Volume 41, Issue 3
July 2023
890 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3582880

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 February 2023
Online AM: 07 October 2022
Accepted: 18 September 2022
Revised: 24 June 2022
Received: 16 December 2021
Published in TOIS Volume 41, Issue 3


Author Tags

  1. Interaction behavior
  2. search pattern
  3. data curation

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • ARC Discovery Project
  • ARC Training Centre for Information Resilience


